# Deployment Patterns

## Overview
Solti-Monitoring is designed for distributed small deployments: multiple independent monitoring servers, each serving 5-50 hosts, rather than one large centralized installation.

**Design Philosophy:**

- Prefer multiple small monitoring servers over a single large cluster
- Keep data close to where it is generated (regional/site-based)
- Use WireGuard for secure remote collection
- Use S3 object storage for cost-effective long-term retention (Loki only)
- Use NFS mounts for InfluxDB shared storage
## Primary Pattern: Hub-and-Spoke (Distributed Small Deployment)
**Architecture:**

- **Hub:** Central monitoring server (InfluxDB + Loki + Grafana)
- **Spokes:** Monitored hosts running Telegraf + Alloy collectors
- **Transport:** WireGuard VPN for secure data shipping
- **Storage:** S3 for Loki logs, NFS for InfluxDB metrics

**Characteristics:**

- Serves 5-50 monitored hosts per hub
- Regional deployment (e.g., one hub per data center, VPC, or geographic region)
- Independent operation (each hub can function without the others)
- Low operational complexity
**Reference Implementation:**

```text
Hub: monitor11.example.com (Proxmox VM, local infrastructure)
├── InfluxDB v2 OSS (NFS storage, 30-day retention)
├── Loki (S3 storage via MinIO)
├── Grafana (local visualization)
└── WireGuard endpoint (10.10.0.11)

Spoke: ispconfig3-server.example.com (Linode VPS, remote)
├── Telegraf → monitor11:8086 (via WireGuard)
├── Alloy → monitor11:3100 (via WireGuard)
└── Monitored services: Apache, ISPConfig, Fail2ban, Gitea, Mail, Bind9
```
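The same topology maps naturally onto an Ansible inventory. The sketch below is illustrative only: the host names, addresses, and endpoint variables come from the reference implementation and configuration examples in this document, while the group names are assumptions rather than conventions of the collection.

```yaml
# Hypothetical inventory sketch for the hub-and-spoke pattern above.
# Host names and addresses match the reference implementation;
# the group names are illustrative assumptions.
all:
  children:
    monitoring_hubs:
      hosts:
        monitor11.example.com:
          wireguard_address: 10.10.0.11/24
    monitoring_spokes:
      hosts:
        ispconfig3-server.example.com:
          wireguard_address: 10.10.0.20/24
      vars:
        telegraf_outputs_influxdb_endpoint: http://10.10.0.11:8086
        alloy_loki_endpoint: http://10.10.0.11:3100
```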
**Why This Pattern:**

- **Security:** WireGuard encrypts all monitoring traffic
- **Scalability:** Add more spokes without hub changes
- **Resilience:** A hub failure only affects visibility, not spoke functionality
- **Cost:** S3 storage is cheaper than local disk for long-term logs
- **Simplicity:** No cluster management, no distributed consensus
## Deployment Sizing

### Small Hub (5-10 Hosts)
**Hub Resources:**

- 2 CPU cores
- 4-8 GB RAM
- 50 GB local disk (system + InfluxDB index)
- 100 GB NFS mount (InfluxDB data)
- 500 GB S3 bucket (Loki logs, 30-day retention)

**Retention:**

- Metrics: 30 days
- Logs: 30 days
**Example:**

```yaml
# ansible-playbook vars for small hub
influxdb_retention_days: 30
loki_retention_days: 30
loki_s3_bucket_quota: 500GB
```
### Medium Hub (10-50 Hosts)
**Hub Resources:**

- 4 CPU cores
- 8-16 GB RAM
- 100 GB local disk
- 250 GB NFS mount (InfluxDB data)
- 1 TB S3 bucket (Loki logs, 60-day retention)

**Retention:**

- Metrics: 60 days
- Logs: 60 days

**Characteristics:**

- More frequent data ingestion
- More concurrent Grafana users
- More complex dashboard queries
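The small-hub example scales up directly; a medium hub mainly raises retention and quota values. The following sketch mirrors the small-hub variables, using the sizing figures above; adjust to your actual capacity planning.

```yaml
# ansible-playbook vars for a medium hub (sketch mirroring the
# small-hub example above; values taken from the sizing list)
influxdb_retention_days: 60
loki_retention_days: 60
loki_s3_bucket_quota: 1TB
```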
### Large Hub (50+ Hosts)

**Not Currently Recommended:** Instead of scaling to a large centralized hub, deploy additional regional hubs.

**Benefits:**

- Fault isolation (one hub failure doesn't affect all hosts)
- Geographic locality (faster queries, lower latency)
- Simpler operations (each hub is independently manageable)
- Cost efficiency (no expensive distributed storage)

**Future Consideration:**

- Optional central aggregation layer (query across all hubs)
- Currently NOT implemented in solti-monitoring
## Storage Patterns

### Pattern 1: NFS for InfluxDB + S3 for Loki (Current Production)
**InfluxDB v2 OSS:**

- Data directory: NFS mount (e.g., /mnt/nfs/influxdb)
- Enables sharing across multiple InfluxDB instances (if needed)
- Simple backup via NFS snapshots

**Loki:**

- Index: local disk (fast access)
- Chunks: S3 object storage (MinIO)
- Cost-effective long-term retention
**Example:**

```yaml
# InfluxDB with NFS
influxdb_data_path: /mnt/nfs/influxdb
influxdb_retention_days: 30

# Loki with S3
loki_s3_endpoint: s3-server.example.com:8010
loki_s3_bucket: loki11
loki_s3_access_key: "{{ vault_loki_s3_access_key }}"
loki_s3_secret_key: "{{ vault_loki_s3_secret_key }}"
```
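This pattern assumes the NFS export is already mounted on the hub before the InfluxDB role runs. A minimal mount sketch using `ansible.posix.mount`; the NFS server name and export path are placeholders, only the mount point matches `influxdb_data_path` above.

```yaml
# Hypothetical pre-task: mount the InfluxDB data share on the hub.
# nfs-server.example.com and /export/influxdb are assumptions.
- name: Mount NFS share for InfluxDB data
  ansible.posix.mount:
    path: /mnt/nfs/influxdb
    src: nfs-server.example.com:/export/influxdb
    fstype: nfs
    opts: rw,hard,noatime
    state: mounted
```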
### Pattern 2: All Local Storage (Development/Testing)

**Use Case:** Development, testing, isolated systems

**Configuration:**

```yaml
# InfluxDB local storage
influxdb_data_path: /var/lib/influxdb

# Loki local storage (no S3)
loki_storage_type: filesystem
loki_filesystem_path: /var/lib/loki
```
**Limitations:**

- No sharing across multiple instances
- Higher disk I/O on the hub server
- Backup requires rsync/tar (see the sketch below)
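A minimal backup sketch for this pattern, assuming a local `/backup` directory (a placeholder, not part of the collection) and using `community.general.archive` for the tar step.

```yaml
# Sketch only: archive the local data directories on the hub.
# /backup is an assumption; for a consistent snapshot you would
# normally stop or quiesce the services first.
- name: Back up local InfluxDB and Loki data
  hosts: monitor11.example.com
  become: true
  tasks:
    - name: Create a dated tar archive of the data directories
      community.general.archive:
        path:
          - /var/lib/influxdb
          - /var/lib/loki
        dest: "/backup/monitoring-{{ ansible_date_time.date }}.tar.gz"
        format: gz
```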
### Pattern 3: NFS for Both (Future Option)

**Use Case:** Unified storage backend, no S3 dependency

**Configuration:**

```yaml
# InfluxDB with NFS
influxdb_data_path: /mnt/nfs/influxdb

# Loki with NFS (no S3)
loki_storage_type: filesystem
loki_filesystem_path: /mnt/nfs/loki
```
**Considerations:**

- Requires a high-performance NFS server
- May be costlier than S3 for large log volumes
- Simpler architecture (one storage backend)
## Network Security Patterns

### Pattern 1: WireGuard VPN (Current Production)
**Architecture:**

- Hub runs the WireGuard endpoint
- Spokes connect to the hub via a WireGuard tunnel
- All monitoring traffic is encrypted and authenticated
**Configuration:**

```yaml
# Hub (monitor11)
wireguard_enabled: true
wireguard_listen_port: 51820
wireguard_address: 10.10.0.11/24

# Spoke (ispconfig3)
wireguard_enabled: true
wireguard_endpoint: monitor11.example.com:51820
wireguard_address: 10.10.0.20/24
wireguard_allowed_ips: 10.10.0.0/24

# Monitoring targets
telegraf_outputs_influxdb_endpoint: http://10.10.0.11:8086
alloy_loki_endpoint: http://10.10.0.11:3100
```
**Benefits:**

- Encrypted transport (ChaCha20-Poly1305)
- Mutual authentication (public/private keys)
- NAT traversal (works from behind firewalls)
- Low overhead (kernel-level performance)
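Before pointing Telegraf and Alloy at the tunnel addresses, it can help to confirm that the hub's endpoints are reachable from a spoke over WireGuard. A minimal check sketched with `ansible.builtin.wait_for`, using the addresses and ports from the configuration above:

```yaml
# Run from a spoke: confirm the hub's InfluxDB and Loki ports are
# reachable over the WireGuard tunnel before enabling the collectors.
- name: Verify hub endpoints over WireGuard
  hosts: ispconfig3-server.example.com
  tasks:
    - name: Wait for InfluxDB and Loki on the tunnel address
      ansible.builtin.wait_for:
        host: 10.10.0.11
        port: "{{ item }}"
        timeout: 10
      loop:
        - 8086
        - 3100
```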
### Pattern 2: Direct Access (Trusted Networks)

**Architecture:**

- Spokes connect directly to the hub without a VPN
- Suitable for trusted internal networks only
**Configuration:**

```yaml
# No WireGuard
telegraf_outputs_influxdb_endpoint: http://monitor11.example.com:8086
alloy_loki_endpoint: http://monitor11.example.com:3100
```
**Use Case:**

- Internal corporate network
- Data center with physical security
- Development/testing environments

**Security Note:** InfluxDB and Loki use token-based authentication, but traffic is NOT encrypted without TLS/VPN.
### Pattern 3: Reverse Proxy + TLS (Future Option)

**Architecture:**

- Hub behind a Traefik/nginx reverse proxy
- TLS termination at the proxy
- Certificate-based authentication
**Not Currently Implemented:**

- Would integrate with the solti-ensemble Traefik role
- Would enable public HTTPS endpoints for InfluxDB/Loki
- Would require certificate management (ACME/Let's Encrypt)
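If this pattern were implemented, the spoke-side change would likely be limited to the endpoint variables. The sketch below is purely hypothetical; the HTTPS hostnames are placeholders and no such proxy exists in the current deployment.

```yaml
# Hypothetical spoke vars if the hub were published behind a TLS proxy.
# The https hostnames are placeholders; this pattern is not implemented.
telegraf_outputs_influxdb_endpoint: https://influxdb.example.com
alloy_loki_endpoint: https://loki.example.com
```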
## Deployment Workflow

### Step 1: Deploy Hub
```bash
cd mylab

# Deploy InfluxDB + Loki on monitor11
ansible-playbook \
  --become-password-file ~/.secrets/lavender.pass \
  playbooks/svc-monitor11-metrics.yml   # InfluxDB + Telegraf

ansible-playbook \
  --become-password-file ~/.secrets/lavender.pass \
  playbooks/svc-monitor11-logs.yml      # Loki + Alloy

# Deploy Grafana (local orchestrator)
ansible-playbook \
  --become-password-file ~/.secrets/lavender.pass \
  playbooks/svc-grafana.yml
```
### Step 2: Configure WireGuard (if needed)

```bash
# Deploy WireGuard on the hub
ansible-playbook \
  --become-password-file ~/.secrets/lavender.pass \
  playbooks/svc-monitor11-wireguard.yml

# Deploy WireGuard on the spoke
ansible-playbook \
  --become-password-file ~/.secrets/lavender.pass \
  playbooks/ispconfig3/wireguard.yml
```
### Step 3: Deploy Spokes

```bash
# Deploy Telegraf + Alloy on ispconfig3
ansible-playbook \
  --become-password-file ~/.secrets/lavender.pass \
  playbooks/ispconfig3/22-ispconfig3-alloy.yml

ansible-playbook \
  --become-password-file ~/.secrets/lavender.pass \
  playbooks/ispconfig3/ispconfig3-monitor.yml
```
### Step 4: Verify Data Flow

```bash
# Check InfluxDB health
curl -s http://monitor11.example.com:8086/health

# Check Loki logs
curl -s -G "http://monitor11.example.com:3100/loki/api/v1/query" \
  --data-urlencode 'query={hostname="ispconfig3-server.example.com"}' \
  --data-urlencode 'limit=10'

# View in Grafana
open http://localhost:3000   # or https://grafana.example.com:8080
```
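The same checks can also be expressed as an Ansible play so verification fits the rest of the workflow. A sketch using `ansible.builtin.uri` against the health and readiness endpoints (Loki's `/ready` endpoint is standard Loki behavior, not specific to this collection):

```yaml
# Sketch: verify the hub endpoints from the Ansible controller.
- name: Verify monitoring data path
  hosts: localhost
  gather_facts: false
  tasks:
    - name: InfluxDB health check
      ansible.builtin.uri:
        url: http://monitor11.example.com:8086/health
        status_code: 200

    - name: Loki readiness check
      ansible.builtin.uri:
        url: http://monitor11.example.com:3100/ready
        status_code: 200
```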
### Step 5: Create Dashboards

Use programmatic dashboard creation:

```bash
# See CLAUDE.md "Creating Grafana Dashboards Programmatically"
./bin/create-fail2ban-dashboard.py
./bin/create-alloy-dashboard.py
./bin/create-docker-dashboard.py
```
## Self-Monitoring

Hubs monitor themselves:

```yaml
# monitor11 monitors itself
- hosts: monitor11.example.com
  roles:
    - jackaltx.solti_monitoring.influxdb
    - jackaltx.solti_monitoring.loki
    - jackaltx.solti_monitoring.telegraf   # Collects metrics from monitor11
    - jackaltx.solti_monitoring.alloy      # Collects logs from monitor11
  vars:
    telegraf_outputs_influxdb_endpoint: http://localhost:8086
    alloy_loki_endpoint: http://localhost:3100
```
**What Gets Monitored:**

- InfluxDB container metrics (CPU, memory, disk I/O)
- Loki container metrics
- System metrics (monitor11 host)
- Service logs (InfluxDB, Loki, WireGuard)
## High Availability Considerations

**Current Deployment:** Single hub (no HA)

**Future Options:**

1. **Active-Passive HA**
    - Two hubs with shared NFS storage for InfluxDB
    - Shared S3 storage for Loki
    - Failover via DNS or a load balancer
    - Not currently implemented

2. **Active-Active Regional Hubs**
    - Deploy multiple independent hubs (regional/site-based)
    - Each hub operates independently
    - Spokes configured with multiple endpoints for failover (see the sketch below)
    - Preferred approach for distributed deployments
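How a spoke would express multiple endpoints depends on the roles; the singular endpoint variables used elsewhere in this document suggest it is not supported today. A purely hypothetical sketch of what list-valued variables could look like:

```yaml
# Hypothetical only: the current roles take a single endpoint
# (telegraf_outputs_influxdb_endpoint / alloy_loki_endpoint).
# A list-valued variant like this would be needed for failover.
telegraf_outputs_influxdb_endpoints:
  - http://10.10.0.11:8086    # primary regional hub
  - http://10.20.0.11:8086    # secondary hub (illustrative address)
alloy_loki_endpoints:
  - http://10.10.0.11:3100
  - http://10.20.0.11:3100
```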
**Recommendation:** For small deployments, accept the single-hub risk and focus on fast recovery:

- Regular backups (NFS snapshots, S3 versioning)
- Hub rebuild playbooks tested regularly
- Monitoring data loss is acceptable (not business-critical)
## Scaling Strategy

**Vertical Scaling (Not Recommended):**

- Increase hub resources (CPU, RAM, storage)
- Works up to ~50 hosts
- Becomes expensive and risky (single point of failure)

**Horizontal Scaling (Recommended):**

- Deploy additional regional hubs
- Each hub serves 10-30 hosts
- Independent operation
- Simple operations
**Example:**

```text
Initial: 1 hub  (10 hosts)
Growth:  2 hubs (25 hosts each)
Growth:  3 hubs (30 hosts each)
Growth:  5 hubs (20 hosts each)
```
**Benefits:**

- Linear cost scaling
- Fault isolation
- Geographic distribution
- Simple operations (no cluster management)
## Summary

**Current Production Pattern:**

- **Hub:** monitor11 (InfluxDB + Loki + Grafana)
- **Spoke:** ispconfig3 (Telegraf + Alloy)
- **Transport:** WireGuard VPN
- **Storage:** NFS (InfluxDB), S3 (Loki)
- **Sizing:** Small hub (5-10 hosts)

**Design Principles:**

- Distributed small deployments over centralized large deployments
- WireGuard for security
- S3 for cost-effective log storage
- NFS for InfluxDB shared storage
- Independent hubs for resilience

**Next Steps:**

- Add more spokes to the existing hub (up to 30-50 hosts)
- Deploy additional regional hubs as needed
- Optional: implement a central aggregation layer (query across hubs)