Integration Points

Overview

Solti-Monitoring provides monitoring stack roles (InfluxDB, Loki, Telegraf, Alloy, Grafana) and consumes container orchestration and shared services from other collections.

What Solti-Monitoring Provides

Monitoring Stack Components

InfluxDB v2 OSS
  • Time-series metrics storage
  • Flux query language
  • HTTP API on port 8086
  • Token-based authentication

Loki
  • Log aggregation storage
  • LogQL query language
  • HTTP API on port 3100
  • S3-compatible object storage support

Telegraf
  • Metrics collection agent
  • 300+ input plugins (system, Docker, databases, web servers)
  • Multiple output formats (InfluxDB, Prometheus, file)

Alloy
  • Observability data pipeline
  • Log parsing, filtering, and label engineering
  • Reduces Loki cardinality and storage costs
  • Enables predictable label schemas for programmatic queries

Grafana (Orchestrator Only)
  • Visualization dashboards
  • Multi-datasource support (InfluxDB, Loki, Prometheus)
  • Programmatic dashboard creation via HTTP API

Integration Points

All components integrate via HTTP APIs and standardized protocols:

Telegraf → InfluxDB (HTTP, port 8086)
Alloy → Loki (HTTP, port 3100)
Grafana → InfluxDB/Loki (HTTP queries)

What Solti-Monitoring Consumes

Container Orchestration (solti-containers)

Solti-Monitoring does NOT provide container orchestration.

We consume container infrastructure from solti-containers:

  • Podman: Rootless containers, quadlet systemd integration
  • Testing containers: Pre-configured test targets (MySQL, Redis, etc.)
  • Molecule scenarios: Isolated test environments

Example:

# solti-monitoring roles deploy INTO containers
# managed by solti-containers orchestration

- hosts: monitoring_servers
  roles:
    - role: jackaltx.solti_containers.podman  # Provides container runtime
    - role: jackaltx.solti_monitoring.influxdb  # Deploys as podman container

Shared Services (solti-ensemble)

Purpose: Infrastructure services shared across all SOLTI collections

Integration:
  • MariaDB: Database for application metadata (Grafana dashboards, user data)
  • HashiVault: Secrets management for InfluxDB tokens, Loki credentials
  • ACME/Traefik: TLS certificates and reverse proxy for monitoring APIs

Example:

# Use HashiVault for the InfluxDB admin token
# (requires the community.hashi_vault collection for the lookup)
- hosts: monitoring_servers
  roles:
    - role: jackaltx.solti_ensemble.hashivault
    - role: jackaltx.solti_monitoring.influxdb
      vars:
        influxdb_admin_token: "{{ lookup('community.hashi_vault.hashi_vault', 'secret=influxdb/admin_token') }}"

Documentation (solti-docs)

Purpose: SOLTI framework architecture and testing methodology

Integration:
  • Testing patterns (molecule scenarios)
  • Reporting standards (sprint reports in Reports/)
  • Architectural philosophy (SOLTI = Systems Oriented Laboratory Testing Integration)

External Integrations

Grafana Dashboard Development

Programmatic dashboard creation via HTTP API:

import os
import requests

# Read the admin password (open() does not expand ~, so expand it first)
with open(os.path.expanduser('~/.secrets/grafana.admin.pass'), 'r') as f:
    password = f.read().strip()

# Create (or overwrite) a dashboard via the HTTP API
dashboard = {
    "dashboard": {...},  # Dashboard JSON
    "message": "Created via API",
    "overwrite": True
}

response = requests.post(
    "http://localhost:3000/api/dashboards/db",
    auth=('admin', password),
    json=dashboard
)
response.raise_for_status()
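
To confirm the upload, the same credentials work against Grafana's search endpoint. A minimal sketch; the title string is hypothetical and should match whatever your dashboard JSON declared:

# List dashboards whose titles match a search string
resp = requests.get(
    "http://localhost:3000/api/search",
    auth=('admin', password),
    params={'query': 'Fail2ban'}  # hypothetical dashboard title
)
for hit in resp.json():
    print(hit['uid'], hit['title'])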

Reference: See CLAUDE.md "Grafana Dashboard Development Workflow" and "Creating Grafana Dashboards Programmatically"

Prometheus Federation (Optional)

Telegraf can expose metrics in Prometheus format:

telegraf_outputs_prometheus_enabled: true
telegraf_prometheus_listen: "0.0.0.0:9273"

Then scrape from Prometheus:

# prometheus.yml
scrape_configs:
  - job_name: 'telegraf'
    static_configs:
      - targets: ['monitor11:9273']
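
Before pointing Prometheus at it, the exposition endpoint can be sanity-checked directly. A quick sketch; hostname and port follow the config above, and /metrics is the default path for Telegraf's prometheus_client output:

import requests

# Telegraf serves plain-text Prometheus exposition format here
resp = requests.get("http://monitor11:9273/metrics", timeout=5)
resp.raise_for_status()
print(resp.text[:500])  # first few metric lines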

Note: We do NOT currently deploy Prometheus in production. This is documented for potential future use.

Elasticsearch (Not Currently Used)

Alloy can theoretically forward to Elasticsearch, but we do not currently use Elasticsearch in solti-monitoring.

If needed in the future, Alloy supports multiple outputs:

# NOT currently deployed
alloy_additional_outputs:
  - type: elasticsearch
    endpoint: "https://es.example.com:9200"

API Integration

InfluxDB API

Write metrics programmatically:

from influxdb_client import InfluxDBClient
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(
    url="http://monitor11.example.com:8086",
    token="YOUR_TOKEN",
    org="example-org"
)

# Synchronous mode flushes each write immediately; the default
# batching mode can silently drop points if the process exits early
write_api = client.write_api(write_options=SYNCHRONOUS)
write_api.write(
    bucket="telegraf",
    record="custom_metric,host=app1 value=42"
)

Query metrics:

query_api = client.query_api()
query = '''
from(bucket: "telegraf")
  |> range(start: -1h)
  |> filter(fn: (r) => r["service"] == "alloy")
'''
result = query_api.query(query=query)
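
Each element of result is a Flux table whose records expose time, field, and value accessors:

# Walk the Flux tables returned by the query above
for table in result:
    for record in table.records:
        print(record.get_time(), record.get_field(), record.get_value())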

Loki API

Query logs programmatically (used by Claude for dashboard development):

curl -s -G "http://monitor11.example.com:3100/loki/api/v1/query" \
  --data-urlencode 'query={service_type="fail2ban"}' \
  --data-urlencode 'limit=10'

Python example:

import requests, time

# Loki expects nanosecond-precision Unix timestamps
now_ns = int(time.time() * 1e9)
params = {
    'query': '{service_type="fail2ban"}',
    'time': now_ns
}

response = requests.get(
    "http://monitor11.example.com:3100/loki/api/v1/query",
    params=params
)
response.raise_for_status()
data = response.json()
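
Matching streams come back under data.result, each pairing a label set with (timestamp, line) entries. A short sketch of walking them:

# Each stream holds its labels plus [timestamp_ns, log_line] pairs
for stream in data['data']['result']:
    labels = stream['stream']
    for ts, line in stream['values']:
        print(labels.get('jail'), ts, line)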

Reference: See CLAUDE.md "Grafana Dashboard Development Workflow" for query testing examples

Data Export

InfluxDB Export

Via CLI:

# Export as annotated CSV (--raw emits the raw query results)
influx query 'from(bucket:"telegraf") |> range(start:-1h)' --raw > export.csv

# Backup entire database
influx backup /backup/influxdb/$(date +%Y%m%d)

Loki Export

Via API:

curl -G "http://monitor11.example.com:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={hostname="ispconfig3-server.example.com"}' \
  --data-urlencode 'start=1704067200000000000' \
  --data-urlencode 'end=1704153600000000000' \
  > export.json
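
The start and end values are nanosecond Unix timestamps (here, 2024-01-01 to 2024-01-02 UTC). A small standard-library helper keeps them readable:

from datetime import datetime, timezone

def to_ns(dt: datetime) -> int:
    """Convert an aware datetime to Loki's nanosecond epoch format."""
    return int(dt.timestamp() * 1e9)

start = to_ns(datetime(2024, 1, 1, tzinfo=timezone.utc))  # 1704067200000000000
end = to_ns(datetime(2024, 1, 2, tzinfo=timezone.utc))    # 1704153600000000000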

AI-Assisted Integration

Claude Code programmatic access:

  1. User request
  2. Claude constructs Loki/InfluxDB query
  3. Claude fetches data via HTTP API
  4. Claude generates Grafana dashboard JSON
  5. Claude deploys via Grafana API

How Alloy enables this:
  • Predictable label schemas (service_type, hostname, jail, etc.)
  • Pre-parsed fields (IP addresses, actions, timestamps)
  • Filtered noise (Claude queries return relevant results)

Example workflow:

# 1. Claude tests query against Loki
curl -s -G "http://monitor11.example.com:3100/loki/api/v1/query" \
  --data-urlencode 'query=sum by(jail) (count_over_time({service_type="fail2ban"} [24h]))'

# 2. Claude builds dashboard panel JSON
{
  "type": "stat",
  "targets": [{
    "expr": "sum by(jail) (count_over_time({service_type=\"fail2ban\"} [24h]))"
  }]
}

# 3. Claude deploys to Grafana
curl -X POST -u admin:$PASSWORD \
  -H "Content-Type: application/json" \
  -d @dashboard.json \
  http://localhost:3000/api/dashboards/db

Reference: See CLAUDE.md "AI-Assisted Observability Workflow"

Self-Monitoring

Monitoring servers monitor themselves:

- hosts: monitor11.example.com
  roles:
    - jackaltx.solti_monitoring.influxdb
    - jackaltx.solti_monitoring.loki
    - jackaltx.solti_monitoring.telegraf  # Monitors the server itself
    - jackaltx.solti_monitoring.alloy     # Collects server logs
  vars:
    telegraf_outputs_influxdb_endpoint: http://localhost:8086
    alloy_loki_endpoints:
      - endpoint: http://localhost:3100

Health checks:

# InfluxDB health
curl http://monitor11.example.com:8086/health

# Loki health
curl http://monitor11.example.com:3100/ready
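
The same checks script cleanly for cron or CI. A minimal sketch against the two endpoints above:

import requests

# Endpoint → service name; both should answer 200 when healthy
checks = {
    "http://monitor11.example.com:8086/health": "InfluxDB",
    "http://monitor11.example.com:3100/ready": "Loki",
}

for url, name in checks.items():
    try:
        ok = requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        ok = False
    print(f"{name}: {'healthy' if ok else 'UNHEALTHY'}")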

Backup Integration

Automated Backups

# InfluxDB backup (NFS mount)
- name: Backup InfluxDB data directory
  cron:
    name: "influxdb_backup"
    minute: "0"
    hour: "2"
    # Percent signs are special in crontab entries and must be escaped
    job: 'rsync -a /mnt/nfs/influxdb/ /backup/influxdb/$(date +\%Y\%m\%d)/'

# Loki S3 backend (already in object storage)
# No additional backup needed - MinIO handles replication

Disaster Recovery

InfluxDB:
  1. Restore NFS mount with data directory
  2. Start InfluxDB container
  3. Verify buckets and retention policies

Loki:
  1. Restore S3 bucket from MinIO backup
  2. Start Loki container
  3. Verify log queries return data
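
Both verification steps can be scripted against the HTTP APIs. A sketch; the token is an assumption, endpoints as elsewhere in this chapter:

import requests

# InfluxDB: bucket list should match the pre-disaster inventory
buckets = requests.get(
    "http://monitor11.example.com:8086/api/v2/buckets",
    headers={"Authorization": "Token YOUR_TOKEN"},
).json()
print([b["name"] for b in buckets.get("buckets", [])])

# Loki: label names confirm log data is queryable again
labels = requests.get(
    "http://monitor11.example.com:3100/loki/api/v1/labels"
).json()
print(labels.get("data", []))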

CI/CD Integration

Testing Pipeline

# .github/workflows/test.yml
name: Test Monitoring Roles

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Install collection
        run: ansible-galaxy collection install .

      - name: Run molecule tests
        run: |
          cd solti-monitoring
          molecule test --all

Testing matrix:
  • Platforms: Podman (local), Proxmox (remote), GitHub Actions (CI)
  • Distros: Rocky Linux 9, Debian 12, Ubuntu 24
  • Scenarios: Default (single role), combined (full stack)

Deployment Pipeline

See the orchestrator (mylab/) for automated deployment workflows:

# Test Alloy config before deploy
cd mylab
ansible-playbook playbooks/ispconfig3/91-ispconfig3-alloy-test.yml

# Deploy to production
ansible-playbook playbooks/ispconfig3/22-ispconfig3-alloy.yml

Reference: See CLAUDE.md "Alloy Test/Deploy Workflow"

Current Production Integrations

monitor11.example.com (Monitoring Server)

Components:
  • InfluxDB (metrics storage, NFS backend)
  • Loki (log storage, S3 backend via MinIO)
  • Grafana (visualization, local Podman)
  • WireGuard (VPN endpoint for remote collectors)

Consumers: ispconfig3-server.example.com (Telegraf + Alloy ship data via WireGuard)

ispconfig3-server.example.com (Monitored Host)

Components:
  • Telegraf (collects metrics)
  • Alloy (collects and processes logs)

Monitored services: Apache, ISPConfig, Fail2ban, Gitea, Postfix/Dovecot, Bind9

Shipping:
  • Metrics → monitor11:8086 (via WireGuard)
  • Logs → monitor11:3100 (via WireGuard)

Next Steps

  • Chapter 2: Quick Start - Deployment patterns
  • Chapter 4: Component Guides - Individual role documentation
  • Chapter 7: Dashboard Development - Programmatic dashboard creation
  • Chapter 8: Testing - Molecule scenarios and validation