# System Architecture
Solti-Monitoring implements a distributed monitoring architecture built around two parallel pipelines: metrics (Telegraf → InfluxDB) and logs (Alloy → Loki).
## Monitoring Pipelines

### Metrics Pipeline (Telegraf → InfluxDB)
```mermaid
graph LR
    A[Monitored Host<br/>Telegraf Collector] -->|Metrics via HTTP/S| B[Monitoring Server<br/>InfluxDB Storage]
    B -->|Query| C[Grafana<br/>Visualization]
    style A fill:#e1f5ff
    style B fill:#fff4e1
    style C fill:#f0e1ff
```
**Flow:**

1. Telegraf collects metrics from local system/applications
2. Metrics sent to InfluxDB via HTTP(S)
3. InfluxDB stores metrics in time-series database
4. Grafana queries InfluxDB for visualization
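Step 4 can be reproduced outside Grafana with a plain HTTP call to the InfluxDB v2 query API. The sketch below is illustrative only: the host and port match this deployment, while the org name, bucket, and token are placeholders you would replace with your own values.

```python
import requests

# Assumed endpoint and credentials -- replace with your deployment's values.
INFLUX_URL = "https://monitor11.example.com:8086"
ORG = "solti"            # hypothetical org name
TOKEN = "<read-token>"   # token-based auth (see API details below)

# Flux query: mean user CPU usage over the last hour, in 5-minute windows.
flux = """
from(bucket: "telegraf")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_user")
  |> aggregateWindow(every: 5m, fn: mean)
"""

resp = requests.post(
    f"{INFLUX_URL}/api/v2/query",
    params={"org": ORG},
    headers={
        "Authorization": f"Token {TOKEN}",
        "Content-Type": "application/vnd.flux",
        "Accept": "application/csv",
    },
    data=flux,
)
resp.raise_for_status()
print(resp.text)  # annotated CSV, InfluxDB v2's default response format
```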
### Logging Pipeline (Alloy → Loki)

```mermaid
graph LR
    A[Monitored Host<br/>Alloy Pipeline<br/>Parse - Filter - Label] -->|Structured Logs via HTTP/S| B[Monitoring Server<br/>Loki Storage]
    B -->|LogQL Query| C[Grafana<br/>Visualization]
    style A fill:#e1f5ff
    style B fill:#fff4e1
    style C fill:#f0e1ff
```
**Flow:**

1. Alloy collects raw logs from journald, files, containers
2. Alloy processes logs: parses fields, adds labels, filters noise
3. Structured logs sent to Loki via HTTP(S) with normalized labels
4. Loki stores logs with label-based indexing (low cardinality)
5. Grafana/Claude query Loki using consistent label schema
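Step 5 can be exercised directly against Loki's HTTP API. A minimal sketch, assuming Loki is reachable at monitor11.example.com:3100 and that Alloy has applied the `service_type` and `hostname` labels described later in this document:

```python
import time
import requests

LOKI_URL = "http://monitor11.example.com:3100"   # assumed endpoint

# LogQL query relying on the normalized label schema produced by Alloy.
query = '{service_type="fail2ban"} |= "Ban"'

end = int(time.time() * 1e9)            # Loki expects nanosecond timestamps
start = end - 24 * 3600 * int(1e9)      # last 24 hours

resp = requests.get(
    f"{LOKI_URL}/loki/api/v1/query_range",
    params={"query": query, "start": start, "end": end, "limit": 100},
)
resp.raise_for_status()

# Each result stream carries its label set plus (timestamp, line) pairs.
for stream in resp.json()["data"]["result"]:
    labels = stream["stream"]
    for ts, line in stream["values"]:
        print(labels.get("hostname", "?"), line)
```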
### Combined Architecture

```mermaid
graph TB
    subgraph "Monitored Hosts"
        T[Telegraf<br/>Metrics Collector]
        A[Alloy<br/>Observability Pipeline<br/>Parse - Filter - Label]
    end
    subgraph "Monitoring Server"
        I[InfluxDB<br/>Metrics Storage]
        L[Loki<br/>Log Storage]
        G[Grafana<br/>Unified Visualization]
    end
    T -->|Metrics| I
    A -->|Logs| L
    I -->|Query| G
    L -->|Query| G
    style T fill:#e1f5ff
    style A fill:#e1f5ff
    style I fill:#fff4e1
    style L fill:#fff4e1
    style G fill:#f0e1ff
```
## Component Details

### InfluxDB (Metrics Storage)

**Purpose:** Time-series database optimized for metrics

**Version:** InfluxDB v2 OSS (Open Source)

**Key Features:**

- High-performance time-series storage
- Flux query language for analysis
- Retention policies for automatic data cleanup
- Built-in downsampling and aggregation

**Storage Options:**

- Local disk (default)
- NFS mounts (shared storage for data directory)

**Note:** InfluxDB v2 OSS does not support S3 object storage. For scalable/shared storage, mount an NFS volume to the InfluxDB data directory.

**API:**

- Port: 8086
- Protocol: HTTP/HTTPS
- Authentication: Token-based
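The write side of the same token-authenticated API accepts InfluxDB line protocol, which is also what Telegraf's InfluxDB v2 output ships. A minimal sketch of a manual write, handy for verifying token and bucket settings (org, bucket, token, and the measurement itself are placeholders):

```python
import time
import requests

INFLUX_URL = "https://monitor11.example.com:8086"  # assumed endpoint
ORG = "solti"              # hypothetical org
BUCKET = "telegraf"        # hypothetical bucket
TOKEN = "<write-token>"    # token-based auth

# One point in line protocol: measurement,tags fields timestamp
ts = int(time.time() * 1e9)   # nanosecond precision
line = f"backup_job,host=ispconfig3-server.example.com duration_seconds=42.7 {ts}"

resp = requests.post(
    f"{INFLUX_URL}/api/v2/write",
    params={"org": ORG, "bucket": BUCKET, "precision": "ns"},
    headers={"Authorization": f"Token {TOKEN}"},
    data=line,
)
resp.raise_for_status()   # 204 No Content on success
```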
### Loki (Log Storage)

**Purpose:** Log aggregation system with label-based indexing

**Key Features:**

- Label-based indexing (not full-text)
- Cost-effective storage
- LogQL query language
- Multi-tenancy support
- Horizontal scaling

**Storage Options:**

- Local filesystem
- NFS mounts
- S3-compatible object storage

**API:**

- Port: 3100
- Protocol: HTTP/HTTPS
- Authentication: Optional (recommended for production)
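Alloy delivers logs to Loki's push endpoint over this same API. A hand-rolled push is occasionally useful for smoke-testing connectivity and label handling before wiring up a collector; the labels below are placeholders, not the production schema:

```python
import json
import time
import requests

LOKI_URL = "http://monitor11.example.com:3100"   # assumed endpoint

# Loki push format: streams -> label set + [timestamp_ns, line] pairs
payload = {
    "streams": [
        {
            "stream": {"service_type": "smoke_test", "hostname": "monitor11.example.com"},
            "values": [[str(int(time.time() * 1e9)), "loki push smoke test"]],
        }
    ]
}

resp = requests.post(
    f"{LOKI_URL}/loki/api/v1/push",
    headers={"Content-Type": "application/json"},
    data=json.dumps(payload),
)
resp.raise_for_status()   # 204 No Content on success
```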
### Telegraf (Metrics Collector)

**Purpose:** Plugin-driven metrics collection agent

**Key Features:**

- 300+ input plugins
- Minimal resource footprint
- Configurable collection intervals
- Local buffering and retry logic
- Plugin-based architecture

**Common Inputs:**

- System metrics (CPU, memory, disk, network)
- Application metrics (Docker, databases, web servers)
- Custom scripts and commands
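Custom scripts are typically wired in through Telegraf's `exec` input, which runs a command and parses its stdout (with `data_format = "influx"` that means line protocol). A minimal sketch of such a script; the spool-directory metric is purely hypothetical:

```python
#!/usr/bin/env python3
"""Example Telegraf exec-input script: print metrics as InfluxDB line protocol."""
import os
import socket


def queue_depth(path="/var/spool/example"):
    """Hypothetical metric: number of files waiting in a spool directory."""
    try:
        return len(os.listdir(path))
    except FileNotFoundError:
        return 0


# Telegraf's exec input reads line protocol from stdout when data_format = "influx".
host = socket.gethostname()
print(f"spool_queue,host={host} depth={queue_depth()}i")
```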
### Alloy (Observability Pipeline)

**Purpose:** Programmable observability data processor from Grafana Labs

**Key Capabilities:**

- **Label Engineering:** Custom label extraction and normalization to control cardinality
- **Data Filtering:** Reduce signal-to-noise ratio by filtering irrelevant log entries
- **Structured Parsing:** Parse unstructured logs into queryable fields (journald, syslog, JSON)
- **Multi-Source Collection:** Unified collection from journald, files, containers, syslog
- **Dynamic Configuration:** River language enables conditional logic and transformations

**Use Cases in Solti-Monitoring:**

- Parse fail2ban journald logs to extract jail, action, and IP fields
- Filter verbose DNS queries to keep only security-relevant events
- Normalize mail service logs across Postfix/Dovecot for consistent querying
- Add contextual labels (service_type, hostname) for dashboard filtering
- Control Loki cardinality by limiting label dimensions
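The fail2ban case gives a feel for what this processing does. The real extraction happens in Alloy's pipeline stages, but the logic is equivalent to the Python sketch below; the sample log line, regex, and field names are illustrative rather than the exact production pipeline:

```python
import re

# Illustrative fail2ban journald line; real formats vary by version and jail.
line = "2025-01-15 10:32:11 fail2ban.actions [981]: NOTICE [sshd] Ban 203.0.113.45"

# Equivalent of a regex stage: extract jail, action, and IP as structured fields,
# then promote only a low-cardinality subset (jail) to labels.
pattern = re.compile(r"\[(?P<jail>[\w-]+)\]\s+(?P<action>Ban|Unban)\s+(?P<banned_ip>[\d.]+)")

match = pattern.search(line)
if match:
    fields = match.groupdict()   # {'jail': 'sshd', 'action': 'Ban', 'banned_ip': '203.0.113.45'}
    labels = {"service_type": "fail2ban", "jail": fields["jail"]}  # keep label dimensions small
    print(labels, fields)
```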
**Why Alloy vs Simple Forwarders:**

- Enables AI-assisted dashboard analysis (Claude queries with predictable labels)
- Reduces Loki storage costs by filtering noise before ingestion
- Creates consistent labeling schema across heterogeneous services
- Reference: See sprint reports on Alloy+Bind9 integration and dashboard development

**Log Sources:**

- Journald (systemd services)
- File tailing (application logs)
- Docker container logs
- Syslog
### Grafana (Visualization)

**Purpose:** Unified observability platform

**Key Features:**

- Multi-datasource dashboards
- Alerting and notifications
- User management and RBAC
- Templating and variables
- Plugin ecosystem

**Supported Datasources:**

- InfluxDB (metrics)
- Loki (logs)
- Prometheus, Elasticsearch, and 100+ others
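Grafana's HTTP API exposes the same datasources and dashboards as the UI, which is what makes the AI-assisted workflow described below possible. A minimal sketch that lists configured datasources; the port is Grafana's default (3000) and the service account token is a placeholder:

```python
import requests

GRAFANA_URL = "https://monitor11.example.com:3000"   # assumed; 3000 is Grafana's default port
TOKEN = "<service-account-token>"                    # hypothetical service account token

resp = requests.get(
    f"{GRAFANA_URL}/api/datasources",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()

# Expect entries for the InfluxDB (metrics) and Loki (logs) datasources.
for ds in resp.json():
    print(ds["type"], ds["name"], ds["url"])
```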
## Current Implementation

### Production Deployments

**monitor11.example.com (Proxmox VM):**

- InfluxDB + Telegraf (metrics)
- Loki + Alloy (logs)
- Grafana (visualization)
- WireGuard endpoint for remote collectors

**ispconfig3-server.example.com (Linode VPS):**

- Telegraf + Alloy collectors
- Ships to monitor11 via WireGuard
- Monitors: Apache, ISPConfig, Fail2ban, Gitea, Mail, Bind9
### Storage Backend

**InfluxDB v2 OSS:**

- Storage: NFS mount for data directory
- Retention: 30-day policy configured in bucket settings
- Note: InfluxDB v2 OSS does not support S3 object storage

**Loki:**

- Object Storage: s3-server.example.com:8010 (MinIO S3-compatible)
- Bucket: loki11
- Advantages: Cost-effective, scalable log storage
## AI-Assisted Observability Workflow

A key design goal: enable Claude Code to analyze dashboards and query data programmatically.

**How Alloy Enables This:**

- **Predictable Label Schema:** Alloy normalizes labels (e.g., `service_type`, `hostname`, `jail`) so Claude can construct queries without guessing
- **Parsed Fields:** Pre-extracted fields (IP addresses, actions, timestamps) enable structured querying
- **Reduced Noise:** Filtering at collection time means Claude queries return relevant results, not spam
**Example Workflow:**

```
User: "Show me fail2ban activity for the last 24 hours"
    ↓
Claude: Constructs Loki query using known labels
    → {service_type="fail2ban", hostname="ispconfig3-server.example.com"}
    → Parses results using known field names (jail, banned_ip, action)
    ↓
Claude: Generates dashboard panels programmatically via Grafana API
Claude: Analyzes patterns, suggests improvements
```
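The dashboard-generation step maps onto Grafana's dashboard API. A minimal sketch that posts a one-panel Loki dashboard; the datasource UID, panel layout, and token are assumptions, not values from this deployment:

```python
import requests

GRAFANA_URL = "https://monitor11.example.com:3000"   # assumed endpoint
TOKEN = "<service-account-token>"                    # hypothetical token with dashboard write access

dashboard = {
    "dashboard": {
        "id": None,   # None = create a new dashboard
        "title": "Fail2ban Activity (generated)",
        "panels": [
            {
                "type": "logs",
                "title": "Recent bans",
                "datasource": {"type": "loki", "uid": "loki"},   # assumed datasource UID
                "targets": [
                    {"expr": '{service_type="fail2ban"} |= "Ban"'}
                ],
                "gridPos": {"h": 10, "w": 24, "x": 0, "y": 0},
            }
        ],
    },
    "overwrite": True,
}

resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=dashboard,
)
resp.raise_for_status()
print(resp.json()["url"])   # path of the created dashboard
```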
**Reports:**

- Alloy+Bind9 Integration Sprint - Label design for DNS query logs
- Alloy Dashboard Creation - Programmatic dashboard generation
- Fail2ban Dashboard Debugging - Query troubleshooting workflow
## Next Steps

### Planned Enhancements

- **Alloy Config Validation**
    - Pre-deployment config testing (test playbook implemented)
    - Live config reload exploration
- **Multi-Site Deployment**
    - Expand beyond monitor11/ispconfig3
    - Standardize Alloy processing pipelines across hosts
- **Advanced Dashboards**
    - Service-specific dashboards (Fail2ban ✅, Mail, DNS)
    - SLI/SLO tracking with Alloy-derived metrics
    - Capacity planning views
- **Alerting**
    - Grafana alerting rules based on Alloy-parsed events
    - Multi-channel notifications (email, Mattermost)
    - Alert escalation policies