# Deployment Patterns

## Overview
Solti-Monitoring is designed for distributed small deployments: multiple independent monitoring servers, each serving 5-50 hosts, rather than one large centralized installation.

**Design Philosophy:**

- Prefer multiple small monitoring servers over a single large cluster
- Keep data close to where it is generated (regional/site-based)
- Use WireGuard for secure remote collection
- Use S3 object storage for cost-effective long-term retention (Loki only)
- Use NFS mounts for InfluxDB shared storage
## Primary Pattern: Hub-and-Spoke (Distributed Small Deployment)
**Architecture:**

- **Hub:** Central monitoring server (InfluxDB + Loki + Grafana)
- **Spokes:** Monitored hosts running Telegraf + Alloy collectors
- **Transport:** WireGuard VPN for secure data shipping
- **Storage:** S3 for Loki logs, NFS for InfluxDB metrics

**Characteristics:**

- Serves 5-50 monitored hosts per hub
- Regional deployment (e.g., one hub per data center, VPC, or geographic region)
- Independent operation (each hub can function without the others)
- Low operational complexity
**Reference Implementation:**

```text
Hub: monitor11.example.com (Proxmox VM, local infrastructure)
├── InfluxDB v2 OSS (NFS storage, 30-day retention)
├── Loki (S3 storage via MinIO)
├── Grafana (local visualization)
└── WireGuard endpoint (10.10.0.11)

Spoke: ispconfig3-server.example.com (Linode VPS, remote)
├── Telegraf → monitor11:8086 (via WireGuard)
├── Alloy → monitor11:3100 (via WireGuard)
└── Monitored services: Apache, ISPConfig, Fail2ban, Gitea, Mail, Bind9
```
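The same topology maps naturally onto an Ansible inventory. The sketch below is illustrative only: the host names, addresses, and endpoint variables come from the reference implementation and configuration examples in this document, while the group names are assumptions rather than conventions of the collection.

```yaml
# Hypothetical inventory sketch for the hub-and-spoke pattern above.
# Host names and addresses match the reference implementation;
# the group names are illustrative assumptions.
all:
  children:
    monitoring_hubs:
      hosts:
        monitor11.example.com:
          wireguard_address: 10.10.0.11/24
    monitoring_spokes:
      hosts:
        ispconfig3-server.example.com:
          wireguard_address: 10.10.0.20/24
      vars:
        telegraf_outputs_influxdb_endpoint: http://10.10.0.11:8086
        alloy_loki_endpoint: http://10.10.0.11:3100
```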
**Why This Pattern:**

- **Security:** WireGuard encrypts all monitoring traffic
- **Scalability:** Add more spokes without hub changes
- **Resilience:** A hub failure only affects visibility, not spoke functionality
- **Cost:** S3 storage is cheaper than local disk for long-term logs
- **Simplicity:** No cluster management, no distributed consensus
## Deployment Sizing

### Small Hub (5-10 Hosts)
**Hub Resources:**

- 2 CPU cores
- 4-8 GB RAM
- 50 GB local disk (system + InfluxDB index)
- 100 GB NFS mount (InfluxDB data)
- 500 GB S3 bucket (Loki logs, 30-day retention)

**Retention:**

- Metrics: 30 days
- Logs: 30 days
**Example:**

```yaml
# ansible-playbook vars for small hub
influxdb_retention_days: 30
loki_retention_days: 30
loki_s3_bucket_quota: 500GB
```
### Medium Hub (10-50 Hosts)
**Hub Resources:**

- 4 CPU cores
- 8-16 GB RAM
- 100 GB local disk
- 250 GB NFS mount (InfluxDB data)
- 1 TB S3 bucket (Loki logs, 60-day retention)

**Retention:**

- Metrics: 60 days
- Logs: 60 days

**Characteristics:**

- More frequent data ingestion
- More concurrent Grafana users
- More complex dashboard queries
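The small-hub example scales up directly; a medium hub mainly raises retention and quota values. The following sketch mirrors the small-hub variables, using the sizing figures above; adjust to your actual capacity planning.

```yaml
# ansible-playbook vars for a medium hub (sketch mirroring the
# small-hub example above; values taken from the sizing list)
influxdb_retention_days: 60
loki_retention_days: 60
loki_s3_bucket_quota: 1TB
```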
### Large Hub (50+ Hosts)

**Not Currently Recommended:** Instead of scaling to a large centralized hub, deploy additional regional hubs.

**Benefits:**

- Fault isolation (one hub failure doesn't affect all hosts)
- Geographic locality (faster queries, lower latency)
- Simpler operations (each hub is independently manageable)
- Cost efficiency (no expensive distributed storage)

**Future Consideration:**

- Optional central aggregation layer (query across all hubs)
- Currently NOT implemented in solti-monitoring
## Storage Patterns

### Pattern 1: NFS for InfluxDB + S3 for Loki (Current Production)
**InfluxDB v2 OSS:**

- Data directory: NFS mount (e.g., /mnt/nfs/influxdb)
- Enables sharing across multiple InfluxDB instances (if needed)
- Simple backup via NFS snapshots

**Loki:**

- Index: local disk (fast access)
- Chunks: S3 object storage (MinIO)
- Cost-effective long-term retention
**Example:**

```yaml
# InfluxDB with NFS
influxdb_data_path: /mnt/nfs/influxdb
influxdb_retention_days: 30

# Loki with S3
loki_s3_endpoint: s3-server.example.com:8010
loki_s3_bucket: loki11
loki_s3_access_key: "{{ vault_loki_s3_access_key }}"
loki_s3_secret_key: "{{ vault_loki_s3_secret_key }}"
```
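This pattern assumes the NFS export is already mounted on the hub before the InfluxDB role runs. A minimal mount sketch using `ansible.posix.mount`; the NFS server name and export path are placeholders, only the mount point matches `influxdb_data_path` above.

```yaml
# Hypothetical pre-task: mount the InfluxDB data share on the hub.
# nfs-server.example.com and /export/influxdb are assumptions.
- name: Mount NFS share for InfluxDB data
  ansible.posix.mount:
    path: /mnt/nfs/influxdb
    src: nfs-server.example.com:/export/influxdb
    fstype: nfs
    opts: rw,hard,noatime
    state: mounted
```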
### Pattern 2: All Local Storage (Development/Testing)

**Use Case:** Development, testing, isolated systems

**Configuration:**

```yaml
# InfluxDB local storage
influxdb_data_path: /var/lib/influxdb

# Loki local storage (no S3)
loki_storage_type: filesystem
loki_filesystem_path: /var/lib/loki
```
**Limitations:**

- No sharing across multiple instances
- Higher disk I/O on the hub server
- Backup requires rsync/tar (see the sketch below)
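A minimal backup sketch for this pattern, assuming a local `/backup` directory (a placeholder, not part of the collection) and using `community.general.archive` for the tar step.

```yaml
# Sketch only: archive the local data directories on the hub.
# /backup is an assumption; for a consistent snapshot you would
# normally stop or quiesce the services first.
- name: Back up local InfluxDB and Loki data
  hosts: monitor11.example.com
  become: true
  tasks:
    - name: Create a dated tar archive of the data directories
      community.general.archive:
        path:
          - /var/lib/influxdb
          - /var/lib/loki
        dest: "/backup/monitoring-{{ ansible_date_time.date }}.tar.gz"
        format: gz
```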
### Pattern 3: NFS for Both (Future Option)

**Use Case:** Unified storage backend, no S3 dependency

**Configuration:**

```yaml
# InfluxDB with NFS
influxdb_data_path: /mnt/nfs/influxdb

# Loki with NFS (no S3)
loki_storage_type: filesystem
loki_filesystem_path: /mnt/nfs/loki
```
**Considerations:**

- Requires a high-performance NFS server
- May be costlier than S3 for large log volumes
- Simpler architecture (one storage backend)
## Network Security Patterns

### Pattern 1: WireGuard VPN (Current Production)
**Architecture:**

- Hub runs the WireGuard endpoint
- Spokes connect to the hub via a WireGuard tunnel
- All monitoring traffic is encrypted and authenticated
**Configuration:**

```yaml
# Hub (monitor11)
wireguard_enabled: true
wireguard_listen_port: 51820
wireguard_address: 10.10.0.11/24

# Spoke (ispconfig3)
wireguard_enabled: true
wireguard_endpoint: monitor11.example.com:51820
wireguard_address: 10.10.0.20/24
wireguard_allowed_ips: 10.10.0.0/24

# Monitoring targets
telegraf_outputs_influxdb_endpoint: http://10.10.0.11:8086
alloy_loki_endpoint: http://10.10.0.11:3100
```
**Benefits:**

- Encrypted transport (ChaCha20-Poly1305)
- Mutual authentication (public/private keys)
- NAT traversal (works from behind firewalls)
- Low overhead (kernel-level performance)
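Before pointing Telegraf and Alloy at the tunnel addresses, it can help to confirm that the hub's endpoints are reachable from a spoke over WireGuard. A minimal check sketched with `ansible.builtin.wait_for`, using the addresses and ports from the configuration above:

```yaml
# Run from a spoke: confirm the hub's InfluxDB and Loki ports are
# reachable over the WireGuard tunnel before enabling the collectors.
- name: Verify hub endpoints over WireGuard
  hosts: ispconfig3-server.example.com
  tasks:
    - name: Wait for InfluxDB and Loki on the tunnel address
      ansible.builtin.wait_for:
        host: 10.10.0.11
        port: "{{ item }}"
        timeout: 10
      loop:
        - 8086
        - 3100
```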
### Pattern 2: Direct Access (Trusted Networks)

**Architecture:**

- Spokes connect directly to the hub without a VPN
- Suitable for trusted internal networks only
**Configuration:**

```yaml
# No WireGuard
telegraf_outputs_influxdb_endpoint: http://monitor11.example.com:8086
alloy_loki_endpoint: http://monitor11.example.com:3100
```
**Use Case:**

- Internal corporate network
- Data center with physical security
- Development/testing environments

**Security Note:** InfluxDB and Loki use token-based authentication, but traffic is NOT encrypted without TLS/VPN.
### Pattern 3: Reverse Proxy + TLS (Future Option)

**Architecture:**

- Hub behind a Traefik/nginx reverse proxy
- TLS termination at the proxy
- Certificate-based authentication
**Not Currently Implemented:**

- Would integrate with the solti-ensemble Traefik role
- Would enable public HTTPS endpoints for InfluxDB/Loki
- Would require certificate management (ACME/Let's Encrypt)
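If this pattern were implemented, the spoke-side change would likely be limited to the endpoint variables. The sketch below is purely hypothetical; the HTTPS hostnames are placeholders and no such proxy exists in the current deployment.

```yaml
# Hypothetical spoke vars if the hub were published behind a TLS proxy.
# The https hostnames are placeholders; this pattern is not implemented.
telegraf_outputs_influxdb_endpoint: https://influxdb.example.com
alloy_loki_endpoint: https://loki.example.com
```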
## Deployment Workflow

### Step 1: Deploy Hub
```bash
cd mylab

# Deploy InfluxDB + Loki on monitor11
ansible-playbook \
  --become-password-file ~/.secrets/lavender.pass \
  playbooks/svc-monitor11-metrics.yml   # InfluxDB + Telegraf

ansible-playbook \
  --become-password-file ~/.secrets/lavender.pass \
  playbooks/svc-monitor11-logs.yml      # Loki + Alloy

# Deploy Grafana (local orchestrator)
ansible-playbook \
  --become-password-file ~/.secrets/lavender.pass \
  playbooks/svc-grafana.yml
```
### Step 2: Configure WireGuard (if needed)

```bash
# Deploy WireGuard on the hub
ansible-playbook \
  --become-password-file ~/.secrets/lavender.pass \
  playbooks/svc-monitor11-wireguard.yml

# Deploy WireGuard on the spoke
ansible-playbook \
  --become-password-file ~/.secrets/lavender.pass \
  playbooks/ispconfig3/wireguard.yml
```
### Step 3: Deploy Spokes

```bash
# Deploy Telegraf + Alloy on ispconfig3
ansible-playbook \
  --become-password-file ~/.secrets/lavender.pass \
  playbooks/ispconfig3/22-ispconfig3-alloy.yml

ansible-playbook \
  --become-password-file ~/.secrets/lavender.pass \
  playbooks/ispconfig3/ispconfig3-monitor.yml
```
### Step 4: Verify Data Flow

```bash
# Check InfluxDB health
curl -s http://monitor11.example.com:8086/health

# Check Loki logs
curl -s -G "http://monitor11.example.com:3100/loki/api/v1/query" \
  --data-urlencode 'query={hostname="ispconfig3-server.example.com"}' \
  --data-urlencode 'limit=10'

# View in Grafana
open http://localhost:3000   # or https://grafana.example.com:8080
```
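The same checks can also be expressed as an Ansible play so verification fits the rest of the workflow. A sketch using `ansible.builtin.uri` against the health and readiness endpoints (Loki's `/ready` endpoint is standard Loki behavior, not specific to this collection):

```yaml
# Sketch: verify the hub endpoints from the Ansible controller.
- name: Verify monitoring data path
  hosts: localhost
  gather_facts: false
  tasks:
    - name: InfluxDB health check
      ansible.builtin.uri:
        url: http://monitor11.example.com:8086/health
        status_code: 200

    - name: Loki readiness check
      ansible.builtin.uri:
        url: http://monitor11.example.com:3100/ready
        status_code: 200
```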
### Step 5: Create Dashboards

Use programmatic dashboard creation:

```bash
# See CLAUDE.md "Creating Grafana Dashboards Programmatically"
./bin/create-fail2ban-dashboard.py
./bin/create-alloy-dashboard.py
./bin/create-docker-dashboard.py
```
## Self-Monitoring

Hubs monitor themselves:

```yaml
# monitor11 monitors itself
- hosts: monitor11.example.com
  roles:
    - jackaltx.solti_monitoring.influxdb
    - jackaltx.solti_monitoring.loki
    - jackaltx.solti_monitoring.telegraf   # Collects metrics from monitor11
    - jackaltx.solti_monitoring.alloy      # Collects logs from monitor11
  vars:
    telegraf_outputs_influxdb_endpoint: http://localhost:8086
    alloy_loki_endpoint: http://localhost:3100
```
**What Gets Monitored:**

- InfluxDB container metrics (CPU, memory, disk I/O)
- Loki container metrics
- System metrics (monitor11 host)
- Service logs (InfluxDB, Loki, WireGuard)
## High Availability Considerations

**Current Deployment:** Single hub (no HA)

**Future Options:**

1. **Active-Passive HA**
    - Two hubs with shared NFS storage for InfluxDB
    - Shared S3 storage for Loki
    - Failover via DNS or a load balancer
    - Not currently implemented

2. **Active-Active Regional Hubs**
    - Deploy multiple independent hubs (regional/site-based)
    - Each hub operates independently
    - Spokes configured with multiple endpoints for failover (see the sketch below)
    - Preferred approach for distributed deployments
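How a spoke would express multiple endpoints depends on the roles; the singular endpoint variables used elsewhere in this document suggest it is not supported today. A purely hypothetical sketch of what list-valued variables could look like:

```yaml
# Hypothetical only: the current roles take a single endpoint
# (telegraf_outputs_influxdb_endpoint / alloy_loki_endpoint).
# A list-valued variant like this would be needed for failover.
telegraf_outputs_influxdb_endpoints:
  - http://10.10.0.11:8086    # primary regional hub
  - http://10.20.0.11:8086    # secondary hub (illustrative address)
alloy_loki_endpoints:
  - http://10.10.0.11:3100
  - http://10.20.0.11:3100
```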
**Recommendation:** For small deployments, accept the single-hub risk and focus on fast recovery:

- Regular backups (NFS snapshots, S3 versioning)
- Hub rebuild playbooks tested regularly
- Monitoring data loss is acceptable (not business-critical)
## Scaling Strategy

**Vertical Scaling (Not Recommended):**

- Increase hub resources (CPU, RAM, storage)
- Works up to ~50 hosts
- Becomes expensive and risky (single point of failure)

**Horizontal Scaling (Recommended):**

- Deploy additional regional hubs
- Each hub serves 10-30 hosts
- Independent operation
- Simple operations
**Example:**

```text
Initial: 1 hub  (10 hosts)
Growth:  2 hubs (25 hosts each)
Growth:  3 hubs (30 hosts each)
Growth:  5 hubs (20 hosts each)
```
**Benefits:**

- Linear cost scaling
- Fault isolation
- Geographic distribution
- Simple operations (no cluster management)
## Summary

**Current Production Pattern:**

- **Hub:** monitor11 (InfluxDB + Loki + Grafana)
- **Spoke:** ispconfig3 (Telegraf + Alloy)
- **Transport:** WireGuard VPN
- **Storage:** NFS (InfluxDB), S3 (Loki)
- **Sizing:** Small hub (5-10 hosts)

**Design Principles:**

- Distributed small deployments over centralized large deployments
- WireGuard for security
- S3 for cost-effective log storage
- NFS for InfluxDB shared storage
- Independent hubs for resilience

**Next Steps:**

- Add more spokes to the existing hub (up to 30-50 hosts)
- Deploy additional regional hubs as needed
- Optional: implement a central aggregation layer (query across hubs)