Architecture

System Architecture

Solti-Monitoring implements a distributed monitoring architecture built around two parallel pipelines: one for metrics and one for logs.

Monitoring Pipelines

Metrics Pipeline (Telegraf → InfluxDB)

graph LR
    A[Monitored Host<br/>Telegraf Collector] -->|Metrics via HTTP/S| B[Monitoring Server<br/>InfluxDB Storage]
    B -->|Query| C[Grafana<br/>Visualization]

    style A fill:#e1f5ff
    style B fill:#fff4e1
    style C fill:#f0e1ff

Flow:

1. Telegraf collects metrics from the local system and applications
2. Metrics are sent to InfluxDB via HTTP(S)
3. InfluxDB stores the metrics in its time-series database
4. Grafana queries InfluxDB for visualization
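The wire format in step 2 is InfluxDB line protocol. A minimal sketch of how one metric point is rendered (the measurement, tag, and field names are illustrative, and real escaping rules are not handled here):

```python
def line_protocol(measurement: str, tags: dict, fields: dict, ts_ns: int) -> str:
    """Render one metric point in InfluxDB line protocol:
    measurement,tag=val field=val timestamp-in-ns (no escaping handled)."""
    tag_str = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement}{tag_str} {field_str} {ts_ns}"

point = line_protocol("cpu", {"host": "web01"}, {"usage_idle": 87.5},
                      1700000000000000000)
# → cpu,host=web01 usage_idle=87.5 1700000000000000000
```

Telegraf emits exactly this format when posting to InfluxDB's write endpoint, which is why the pipeline needs no translation layer between collector and storage.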

Logging Pipeline (Alloy → Loki)

graph LR
    A[Monitored Host<br/>Alloy Pipeline<br/>Parse - Filter - Label] -->|Structured Logs via HTTP/S| B[Monitoring Server<br/>Loki Storage]
    B -->|LogQL Query| C[Grafana<br/>Visualization]

    style A fill:#e1f5ff
    style B fill:#fff4e1
    style C fill:#f0e1ff

Flow:

1. Alloy collects raw logs from journald, files, and containers
2. Alloy processes the logs: parses fields, adds labels, filters noise
3. Structured logs are sent to Loki via HTTP(S) with normalized labels
4. Loki stores logs with label-based indexing (low cardinality)
5. Grafana and Claude query Loki using the consistent label schema
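Step 3 uses Loki's push API (`POST /loki/api/v1/push`). A sketch of how a batch of labeled log lines is packaged for that endpoint (the label values are illustrative):

```python
import json

def loki_push_payload(labels: dict, entries: list) -> str:
    """Build the JSON body for Loki's POST /loki/api/v1/push:
    one stream per label set, values as [ns-timestamp, line] pairs."""
    return json.dumps({
        "streams": [{
            "stream": labels,  # low-cardinality labels index the stream
            "values": [[str(ts), line] for ts, line in entries],
        }]
    })

body = loki_push_payload(
    {"service_type": "fail2ban", "hostname": "ispconfig3-server.example.com"},
    [(1700000000000000000, "jail=sshd action=Ban banned_ip=203.0.113.7")],
)
```

Note that only the `stream` labels are indexed; the log line itself is stored but not indexed, which is what keeps Loki's cardinality (and cost) low.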

Combined Architecture

graph TB
    subgraph "Monitored Hosts"
        T[Telegraf<br/>Metrics Collector]
        A[Alloy<br/>Observability Pipeline<br/>Parse - Filter - Label]
    end

    subgraph "Monitoring Server"
        I[InfluxDB<br/>Metrics Storage]
        L[Loki<br/>Log Storage]
        G[Grafana<br/>Unified Visualization]
    end

    T -->|Metrics| I
    A -->|Logs| L
    I -->|Query| G
    L -->|Query| G

    style T fill:#e1f5ff
    style A fill:#e1f5ff
    style I fill:#fff4e1
    style L fill:#fff4e1
    style G fill:#f0e1ff

Component Details

InfluxDB (Metrics Storage)

Purpose: Time-series database optimized for metrics

Version: InfluxDB v2 OSS (Open Source)

Key Features:

- High-performance time-series storage
- Flux query language for analysis
- Retention policies for automatic data cleanup
- Built-in downsampling and aggregation

Storage Options:

- Local disk (default)
- NFS mounts (shared storage for the data directory)

Note: InfluxDB v2 OSS does not support S3 object storage. For scalable/shared storage, mount an NFS volume to the InfluxDB data directory.

API:

- Port: 8086
- Protocol: HTTP/HTTPS
- Authentication: Token-based
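As an example of the Flux query language mentioned above, a query like the following returns a downsampled CPU series (the bucket name `telegraf` is an assumption for this deployment):

```flux
from(bucket: "telegraf")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_idle")
  |> aggregateWindow(every: 5m, fn: mean)
```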

Loki (Log Storage)

Purpose: Log aggregation system with label-based indexing

Key Features:

- Label-based indexing (not full-text)
- Cost-effective storage
- LogQL query language
- Multi-tenancy support
- Horizontal scaling

Storage Options:

- Local filesystem
- NFS mounts
- S3-compatible object storage

API:

- Port: 3100
- Protocol: HTTP/HTTPS
- Authentication: Optional (recommended for production)
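A LogQL query against this API might look like the following sketch, which selects a stream by its labels, filters lines, and extracts a field (the labels and regex are illustrative):

```logql
{service_type="fail2ban", hostname="ispconfig3-server.example.com"}
  |= "Ban"
  | regexp `Ban (?P<banned_ip>\S+)`
```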

Telegraf (Metrics Collector)

Purpose: Plugin-driven metrics collection agent

Key Features:

- 300+ input plugins
- Minimal resource footprint
- Configurable collection intervals
- Local buffering and retry logic
- Plugin-based architecture

Common Inputs:

- System metrics (CPU, memory, disk, network)
- Application metrics (Docker, databases, web servers)
- Custom scripts and commands
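A minimal Telegraf configuration covering these pieces might look like this sketch; the URL, organization, and bucket values are assumptions for this deployment, not the deployed config:

```toml
[agent]
  interval = "10s"          # collection interval

# System metric inputs
[[inputs.cpu]]
[[inputs.mem]]
[[inputs.disk]]
[[inputs.net]]

# Ship to InfluxDB v2 on the monitoring server
[[outputs.influxdb_v2]]
  urls = ["https://monitor11.example.com:8086"]
  token = "${INFLUX_TOKEN}"   # placeholder, read from environment
  organization = "solti"      # assumed org name
  bucket = "telegraf"         # assumed bucket name
```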

Alloy (Observability Pipeline)

Purpose: Programmable observability data processor from Grafana Labs

Key Capabilities:

- Label Engineering: custom label extraction and normalization to control cardinality
- Data Filtering: improve the signal-to-noise ratio by filtering irrelevant log entries
- Structured Parsing: parse unstructured logs into queryable fields (journald, syslog, JSON)
- Multi-Source Collection: unified collection from journald, files, containers, and syslog
- Dynamic Configuration: the River language enables conditional logic and transformations

Use Cases in Solti-Monitoring:

- Parse fail2ban journald logs to extract jail, action, and IP fields
- Filter verbose DNS queries to keep only security-relevant events
- Normalize mail service logs across Postfix/Dovecot for consistent querying
- Add contextual labels (service_type, hostname) for dashboard filtering
- Control Loki cardinality by limiting label dimensions

Why Alloy vs Simple Forwarders:

- Enables AI-assisted dashboard analysis (Claude queries with predictable labels)
- Reduces Loki storage costs by filtering noise before ingestion
- Creates a consistent labeling schema across heterogeneous services
- Reference: see the sprint reports on Alloy+Bind9 integration and dashboard development

Log Sources:

- Journald (systemd services)
- File tailing (application logs)
- Docker container logs
- Syslog
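Putting these capabilities together, an Alloy pipeline for the fail2ban use case could be sketched as follows in River; the regex, unit match, and endpoint are illustrative, not the deployed config:

```river
// Collect only the fail2ban unit from journald
loki.source.journald "fail2ban" {
  matches    = "_SYSTEMD_UNIT=fail2ban.service"
  forward_to = [loki.process.fail2ban.receiver]
}

// Parse ban events into fields, promote low-cardinality ones to labels
loki.process "fail2ban" {
  stage.regex {
    expression = "\\[(?P<jail>\\S+)\\] (?P<action>Ban|Unban) (?P<banned_ip>\\S+)"
  }
  stage.labels {
    values = { jail = "", action = "" }  // banned_ip stays a parsed field, not a label
  }
  forward_to = [loki.write.monitor11.receiver]
}

// Ship to Loki on the monitoring server
loki.write "monitor11" {
  endpoint {
    url = "http://monitor11.example.com:3100/loki/api/v1/push"
  }
}
```

Keeping `banned_ip` out of the label set is the cardinality-control decision described above: IPs are unbounded, so they are queried as parsed fields instead of indexed labels.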

Grafana (Visualization)

Purpose: Unified observability platform

Key Features:

- Multi-datasource dashboards
- Alerting and notifications
- User management and RBAC
- Templating and variables
- Plugin ecosystem

Supported Datasources:

- InfluxDB (metrics)
- Loki (logs)
- Prometheus, Elasticsearch, and 100+ others

Current Implementation

Production Deployments

monitor11.example.com (Proxmox VM):

- InfluxDB + Telegraf (metrics)
- Loki + Alloy (logs)
- Grafana (visualization)
- WireGuard endpoint for remote collectors

ispconfig3-server.example.com (Linode VPS):

- Telegraf + Alloy collectors
- Ships to monitor11 via WireGuard
- Monitors: Apache, ISPConfig, Fail2ban, Gitea, Mail, Bind9

Storage Backend

InfluxDB v2 OSS:

- Storage: NFS mount for the data directory
- Retention: 30-day policy configured in bucket settings
- Note: InfluxDB v2 OSS does not support S3 object storage

Loki:

- Object Storage: s3-server.example.com:8010 (MinIO, S3-compatible)
- Bucket: loki11
- Advantages: cost-effective, scalable log storage
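The corresponding Loki storage settings would resemble this sketch; the credential values are placeholders and the deployed config may differ:

```yaml
storage_config:
  aws:                                   # Loki's S3-compatible backend
    endpoint: s3-server.example.com:8010
    bucketnames: loki11
    access_key_id: ${S3_ACCESS_KEY}      # placeholder
    secret_access_key: ${S3_SECRET_KEY}  # placeholder
    s3forcepathstyle: true               # required for MinIO-style addressing
```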

AI-Assisted Observability Workflow

A key design goal: enable Claude Code to analyze dashboards and query data programmatically.

How Alloy Enables This:

  1. Predictable Label Schema: Alloy normalizes labels (e.g., service_type, hostname, jail) so Claude can construct queries without guessing
  2. Parsed Fields: Pre-extracted fields (IP addresses, actions, timestamps) enable structured querying
  3. Reduced Noise: Filtering at collection time means Claude queries return relevant results, not spam

Example Workflow:

User: "Show me fail2ban activity for the last 24 hours"
Claude: Constructs Loki query using known labels
  → {service_type="fail2ban", hostname="ispconfig3-server.example.com"}
  → Parses results using known field names (jail, banned_ip, action)
Claude: Generates dashboard panels programmatically via Grafana API
Claude: Analyzes patterns, suggests improvements
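The query-construction step in this workflow can be sketched as a small helper that renders a LogQL stream selector from the known label schema (helper name and shape are illustrative):

```python
def logql_selector(labels: dict) -> str:
    """Render a LogQL stream selector, e.g. {a="1", b="2"},
    from a map of known labels (sorted for a stable output)."""
    pairs = ", ".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return "{" + pairs + "}"

query = logql_selector({
    "service_type": "fail2ban",
    "hostname": "ispconfig3-server.example.com",
})
# → {hostname="ispconfig3-server.example.com", service_type="fail2ban"}
```

Because Alloy guarantees these label names exist, the selector can be built mechanically instead of being guessed per host or per service.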

Reports:

- Alloy+Bind9 Integration Sprint - label design for DNS query logs
- Alloy Dashboard Creation - programmatic dashboard generation
- Fail2ban Dashboard Debugging - query troubleshooting workflow

Next Steps

Planned Enhancements

  1. Alloy Config Validation
     - Pre-deployment config testing (test playbook implemented)
     - Live config reload exploration

  2. Multi-Site Deployment
     - Expand beyond monitor11/ispconfig3
     - Standardize Alloy processing pipelines across hosts

  3. Advanced Dashboards
     - Service-specific dashboards (Fail2ban ✅, Mail, DNS)
     - SLI/SLO tracking with Alloy-derived metrics
     - Capacity planning views

  4. Alerting
     - Grafana alerting rules based on Alloy-parsed events
     - Multi-channel notifications (email, Mattermost)
     - Alert escalation policies