Loki

Overview

Grafana Loki is a log aggregation system inspired by Prometheus. It indexes only label metadata rather than full log content, which keeps storage costs low, and it provides querying via LogQL.

Purpose

  • Store logs from Alloy collectors
  • Provide query API for Grafana dashboards
  • Enable log aggregation across multiple hosts
  • Support S3-compatible object storage backends

Installation

The loki role deploys Loki as a Podman container using systemd quadlets:

- role: jackaltx.solti_monitoring.loki
  vars:
    loki_version: "2.9"
    loki_port: 3100
    loki_retention: "30d"

Key Configuration Options

Basic Configuration

loki_version: "2.9"                  # Loki version
loki_port: 3100                      # HTTP API port
loki_retention: "30d"                # Log retention period
loki_max_chunk_age: "2h"             # Maximum chunk age before flush
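
These variables can also be set outside the play, for example in a group_vars file for the monitoring group. The variable names are the role's own; the file location shown here is just the usual Ansible convention:

# group_vars/monitoring_servers.yml (example location)
loki_version: "2.9"
loki_port: 3100
loki_retention: "30d"
loki_max_chunk_age: "2h"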

Storage Backends

Local filesystem (default):

loki_storage_type: "filesystem"
loki_data_path: "/var/lib/loki"

S3-compatible storage:

loki_storage_type: "s3"
loki_s3_endpoint: "storage.example.com:8010"
loki_s3_bucket: "loki11"
loki_s3_access_key: "{{ vault_s3_access }}"
loki_s3_secret_key: "{{ vault_s3_secret }}"
loki_s3_region: "us-east-1"          # Optional

Container Configuration

Deployed via Podman with systemd quadlets:

loki_container_name: "loki"
loki_image: "docker.io/grafana/loki:2.9"
loki_restart_policy: "always"
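
The exact unit file the role generates may differ, but a minimal Podman quadlet for these settings would look roughly like the sketch below (file path and volume mount are assumptions, not the role's actual template):

# /etc/containers/systemd/loki.container -- illustrative sketch
[Unit]
Description=Grafana Loki log aggregation

[Container]
Image=docker.io/grafana/loki:2.9
PublishPort=3100:3100
# Data path assumed from loki_data_path; the role also mounts its rendered config
Volume=/var/lib/loki:/loki:Z

[Service]
Restart=always

[Install]
WantedBy=multi-user.target

Quadlet generates a loki.service unit from this file, which is why the systemd commands later in this page address the service simply as loki.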

Deployment

Basic Deployment

Deploy Loki on a single host:

---
- name: Deploy Loki server
  hosts: monitoring_servers
  become: true

  roles:
    - role: jackaltx.solti_monitoring.loki
      vars:
        loki_retention: "30d"

Run deployment:

ansible-playbook -i inventory.yml deploy-loki.yml
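
A minimal inventory for the play above might look like this (the hostname is a placeholder taken from the reference deployment):

# inventory.yml (sketch)
all:
  children:
    monitoring_servers:
      hosts:
        monitor11.example.com: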

S3-Backed Deployment

Deploy with S3 storage backend:

- role: jackaltx.solti_monitoring.loki
  vars:
    loki_storage_type: "s3"
    loki_s3_endpoint: "storage.example.com:8010"
    loki_s3_bucket: "loki11"
    loki_s3_access_key: "{{ vault_s3_access }}"
    loki_s3_secret_key: "{{ vault_s3_secret }}"

Service Management

Systemd Quadlet

Loki runs as a Podman container managed by systemd:

# Check status
systemctl status loki

# Start/stop/restart
systemctl start loki
systemctl stop loki
systemctl restart loki

# View logs
journalctl -u loki -f

# Check container
podman ps | grep loki

Health Check

Verify Loki is running:

curl http://localhost:3100/ready

Expected response:

ready

API Access

Query API

Instant query (evaluated at a single point in time):

curl -G "http://localhost:3100/loki/api/v1/query" \
  --data-urlencode 'query={service_type="fail2ban"}' \
  --data-urlencode 'limit=10'

Range query (results over a time range):

curl -G "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={service_type="fail2ban"}' \
  --data-urlencode 'start=1735000000000000000' \
  --data-urlencode 'end=1735100000000000000' \
  --data-urlencode 'limit=100'
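
The start and end parameters are Unix epoch nanoseconds. With GNU date you can compute them on the fly, for example to query the last hour:

START=$(date -d '1 hour ago' +%s)000000000
END=$(date +%s)000000000

curl -G "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={service_type="fail2ban"}' \
  --data-urlencode "start=${START}" \
  --data-urlencode "end=${END}" \
  --data-urlencode 'limit=100'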

Push API

Alloy uses the push API to send logs:

POST http://localhost:3100/loki/api/v1/push
Content-Type: application/json

{
  "streams": [
    {
      "stream": {"service_type": "fail2ban", "hostname": "server1"},
      "values": [
        ["1735000000000000000", "Log message here"]
      ]
    }
  ]
}
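
For a quick manual test of the push endpoint, the same request can be sent with curl; the labels here are arbitrary test values, and the timestamp must be Unix epoch nanoseconds:

curl -s -X POST "http://localhost:3100/loki/api/v1/push" \
  -H "Content-Type: application/json" \
  -d '{
    "streams": [
      {
        "stream": {"service_type": "test", "hostname": "server1"},
        "values": [["'"$(date +%s)"'000000000", "manual test message"]]
      }
    ]
  }'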

Label Discovery

List all labels:

curl http://localhost:3100/loki/api/v1/labels

Get values for a label:

curl http://localhost:3100/loki/api/v1/label/service_type/values

LogQL Query Language

Basic Queries

Stream selector:

{service_type="fail2ban"}

Filter logs:

{service_type="fail2ban"} |= "Ban"
{service_type="apache"} != "200"

Regex filter:

{service_type="fail2ban"} |~ "Ban.*sshd"

Log Parsing

Extract fields with regex:

{service_type="fail2ban"}
| regexp `\[(?P<jail>[^\]]+)\]\s+(?P<action>Ban|Unban)\s+(?P<ip>\d+\.\d+\.\d+\.\d+)`
| jail="sshd"

JSON parsing:

{service_type="application"}
| json
| level="error"

Aggregations

Count logs over time:

count_over_time({service_type="fail2ban"} [24h])

Rate of logs:

rate({service_type="apache"} [5m])

Sum by label:

sum by(jail) (count_over_time({service_type="fail2ban"} [1h]))

Retention Configuration

Retention Period

Set retention in role variables:

loki_retention: "30d"      # Keep logs for 30 days

Compaction

Loki automatically compacts chunks. Configure compaction settings:

loki_compaction_interval: "10m"
loki_retention_delete_delay: "2h"
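
How these role variables are rendered depends on the role's template, but in Loki's own configuration retention and compaction live in the limits_config and compactor blocks, roughly like this (the working_directory path is an assumption):

limits_config:
  retention_period: 30d

compactor:
  working_directory: /var/lib/loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h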

Resource Requirements

Minimum Requirements

  • CPU: 2 cores
  • Memory: 1GB RAM
  • Disk: 50GB+ (depends on log volume and retention)
  • Network: 100 Mbps

Sizing Guidance

Small deployment (1-10 hosts, low log volume):

  • 2 CPU cores
  • 1GB RAM
  • 50GB storage

Medium deployment (10-50 hosts, moderate logs):

  • 4 CPU cores
  • 2GB RAM
  • 200GB storage

Large deployment (50+ hosts, high volume):

  • 8+ CPU cores
  • 4GB+ RAM
  • 1TB+ storage or S3 backend

Performance Tuning

Chunk Configuration

Optimize chunk sizes:

loki_chunk_idle_period: "30m"
loki_chunk_retain_period: "15s"
loki_max_chunk_age: "2h"
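
These map onto Loki's ingester block; the corresponding fragment of the generated configuration would look roughly like:

ingester:
  chunk_idle_period: 30m
  chunk_retain_period: 15s
  max_chunk_age: 2h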

Cache Settings

Tune cache for better performance:

loki_results_cache_ttl: "24h"
loki_chunk_cache_ttl: "24h"

Ingestion Rate Limits

Prevent resource exhaustion:

loki_ingestion_rate_mb: 4
loki_ingestion_burst_size_mb: 6
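
These correspond to Loki's per-tenant limits; in the rendered configuration they would appear roughly as:

limits_config:
  ingestion_rate_mb: 4
  ingestion_burst_size_mb: 6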

Backup and Recovery

Filesystem Backend

Manual backup:

# Stop Loki
systemctl stop loki

# Backup data directory
tar -czf loki-backup.tar.gz /var/lib/loki/

# Start Loki
systemctl start loki
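
To restore from that archive on a replacement host (assuming the same data path), reverse the steps:

# Stop Loki
systemctl stop loki

# Restore data directory (archive contains var/lib/loki/)
tar -xzf loki-backup.tar.gz -C /

# Start Loki
systemctl start loki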

S3 Backend

When using S3, data is automatically stored in object storage. For disaster recovery:

  1. Deploy a new Loki instance
  2. Configure the same S3 endpoint and bucket
  3. Existing data in the bucket becomes available automatically

Monitoring Loki

Metrics Endpoint

Loki exposes Prometheus metrics:

curl http://localhost:3100/metrics

Key Metrics to Monitor

  • loki_ingester_chunks_created_total - Chunks created
  • loki_request_duration_seconds - Query latency
  • loki_ingester_memory_chunks - Chunks in memory
  • loki_panic_total - Application panics
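
If you scrape these metrics with Prometheus, a minimal scrape job might look like this (the target hostname is an example):

scrape_configs:
  - job_name: loki
    static_configs:
      - targets: ["monitor11.example.com:3100"]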

Troubleshooting

Check Container Status

podman ps -a | grep loki
podman logs loki

Check Service Status

systemctl status loki
journalctl -u loki -n 100

Verify API Access

curl http://localhost:3100/ready
curl http://localhost:3100/metrics

Test Query

curl -G "http://localhost:3100/loki/api/v1/query" \
  --data-urlencode 'query={service_type="fail2ban"}' \
  --data-urlencode 'limit=5'

Common Issues

  1. Container won't start: Check logs with podman logs loki
  2. API not accessible: Verify port 3100 is open, check firewall
  3. No logs appearing: Check Alloy collectors are configured correctly
  4. Out of disk space: Reduce retention period or use S3 backend
  5. High memory usage: Reduce chunk cache sizes or add more RAM
  6. Query timeouts: Optimize queries, add time range filters

Security Considerations

  1. Network Access: Restrict port 3100 to the monitoring network (see the firewall example after this list)
  2. Authentication: Consider adding authentication via reverse proxy
  3. TLS/SSL: Use HTTPS in production (requires reverse proxy)
  4. S3 Credentials: Store S3 keys in Ansible Vault
  5. Log Sanitization: Avoid logging sensitive data
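
As an example of the network restriction in item 1, with firewalld you could allow port 3100 only from a monitoring subnet (the subnet here is a placeholder):

firewall-cmd --permanent \
  --add-rich-rule='rule family="ipv4" source address="10.0.10.0/24" port port="3100" protocol="tcp" accept'
firewall-cmd --reload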

Reference Deployment

See the Reference Deployments chapter for a real-world example:

  • monitor11.example.com - Loki with S3 backend