Loki

Overview

Grafana Loki is a log aggregation system inspired by Prometheus. It indexes only label metadata rather than full log content, which keeps storage costs low, and it provides querying via LogQL.

Purpose

  • Store logs from Alloy collectors
  • Provide query API for Grafana dashboards
  • Enable log aggregation across multiple hosts
  • Support S3-compatible object storage backends

Installation

The loki role deploys Loki as a Podman container using systemd quadlets:

- role: jackaltx.solti_monitoring.loki
  vars:
    loki_version: "2.9"
    loki_port: 3100
    loki_retention: "30d"

Key Configuration Options

Basic Configuration

loki_version: "2.9"                  # Loki version
loki_port: 3100                      # HTTP API port
loki_retention: "30d"                # Log retention period
loki_max_chunk_age: "2h"             # Maximum chunk age before flush
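
These variables can also be set outside the play, for example in a group_vars file for the monitoring group. The variable names are the role's own; the file location shown here is just the usual Ansible convention:

# group_vars/monitoring_servers.yml (example location)
loki_version: "2.9"
loki_port: 3100
loki_retention: "30d"
loki_max_chunk_age: "2h"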

Storage Backends

Local filesystem (default):

loki_storage_type: "filesystem"
loki_data_path: "/var/lib/loki"

S3-compatible storage:

loki_storage_type: "s3"
loki_s3_endpoint: "storage.example.com:8010"
loki_s3_bucket: "loki11"
loki_s3_access_key: "{{ vault_s3_access }}"
loki_s3_secret_key: "{{ vault_s3_secret }}"
loki_s3_region: "us-east-1"          # Optional

Container Configuration

Deployed via Podman with systemd quadlets:

loki_container_name: "loki"
loki_image: "docker.io/grafana/loki:2.9"
loki_restart_policy: "always"
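
The exact unit file the role generates may differ, but a minimal Podman quadlet for these settings would look roughly like the sketch below (file path and volume mount are assumptions, not the role's actual template):

# /etc/containers/systemd/loki.container -- illustrative sketch
[Unit]
Description=Grafana Loki log aggregation

[Container]
Image=docker.io/grafana/loki:2.9
PublishPort=3100:3100
# Data path assumed from loki_data_path; the role also mounts its rendered config
Volume=/var/lib/loki:/loki:Z

[Service]
Restart=always

[Install]
WantedBy=multi-user.target

Quadlet generates a loki.service unit from this file, which is why the systemd commands later in this page address the service simply as loki.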

Deployment

Basic Deployment

Deploy Loki on a single host:

---
- name: Deploy Loki server
  hosts: monitoring_servers
  become: true

  roles:
    - role: jackaltx.solti_monitoring.loki
      vars:
        loki_retention: "30d"

Run deployment:

ansible-playbook -i inventory.yml deploy-loki.yml
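
A minimal inventory for the play above might look like this (the hostname is a placeholder taken from the reference deployment):

# inventory.yml (sketch)
all:
  children:
    monitoring_servers:
      hosts:
        monitor11.example.com: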

S3-Backed Deployment

Deploy with S3 storage backend:

- role: jackaltx.solti_monitoring.loki
  vars:
    loki_storage_type: "s3"
    loki_s3_endpoint: "storage.example.com:8010"
    loki_s3_bucket: "loki11"
    loki_s3_access_key: "{{ vault_s3_access }}"
    loki_s3_secret_key: "{{ vault_s3_secret }}"

Service Management

Systemd Quadlet

Loki runs as a Podman container managed by systemd:

# Check status
systemctl status loki

# Start/stop/restart
systemctl start loki
systemctl stop loki
systemctl restart loki

# View logs
journalctl -u loki -f

# Check container
podman ps | grep loki

Health Check

Verify Loki is running:

curl http://localhost:3100/ready

Expected response:

ready

API Access

Query API

Instant query (evaluated at a single point in time):

curl -G "http://localhost:3100/loki/api/v1/query" \
  --data-urlencode 'query={service_type="fail2ban"}' \
  --data-urlencode 'limit=10'

Range query (results over a time range):

curl -G "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={service_type="fail2ban"}' \
  --data-urlencode 'start=1735000000000000000' \
  --data-urlencode 'end=1735100000000000000' \
  --data-urlencode 'limit=100'
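
The start and end parameters are Unix epoch nanoseconds. With GNU date you can compute them on the fly, for example to query the last hour:

START=$(date -d '1 hour ago' +%s)000000000
END=$(date +%s)000000000

curl -G "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={service_type="fail2ban"}' \
  --data-urlencode "start=${START}" \
  --data-urlencode "end=${END}" \
  --data-urlencode 'limit=100'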

Push API

Alloy uses the push API to send logs:

POST http://localhost:3100/loki/api/v1/push
Content-Type: application/json

{
  "streams": [
    {
      "stream": {"service_type": "fail2ban", "hostname": "server1"},
      "values": [
        ["1735000000000000000", "Log message here"]
      ]
    }
  ]
}
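
For a quick manual test of the push endpoint, the same request can be sent with curl; the labels here are arbitrary test values, and the timestamp must be Unix epoch nanoseconds:

curl -s -X POST "http://localhost:3100/loki/api/v1/push" \
  -H "Content-Type: application/json" \
  -d '{
    "streams": [
      {
        "stream": {"service_type": "test", "hostname": "server1"},
        "values": [["'"$(date +%s)"'000000000", "manual test message"]]
      }
    ]
  }'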

Label Discovery

List all labels:

curl http://localhost:3100/loki/api/v1/labels

Get values for a label:

curl http://localhost:3100/loki/api/v1/label/service_type/values

LogQL Query Language

Basic Queries

Stream selector:

{service_type="fail2ban"}

Filter logs:

{service_type="fail2ban"} |= "Ban"
{service_type="apache"} != "200"

Regex filter:

{service_type="fail2ban"} |~ "Ban.*sshd"

Log Parsing

Extract fields with regex:

{service_type="fail2ban"}
| regexp `\[(?P<jail>[^\]]+)\]\s+(?P<action>Ban|Unban)\s+(?P<ip>\d+\.\d+\.\d+\.\d+)`
| jail="sshd"

JSON parsing:

{service_type="application"}
| json
| level="error"

Aggregations

Count logs over time:

count_over_time({service_type="fail2ban"} [24h])

Rate of logs:

rate({service_type="apache"} [5m])

Sum by label:

sum by(jail) (count_over_time({service_type="fail2ban"} [1h]))

Retention Configuration

Retention Period

Set retention in role variables:

loki_retention: "30d"      # Keep logs for 30 days

Compaction

Loki automatically compacts chunks. Configure compaction settings:

loki_compaction_interval: "10m"
loki_retention_delete_delay: "2h"
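
How these role variables are rendered depends on the role's template, but in Loki's own configuration retention and compaction live in the limits_config and compactor blocks, roughly like this (the working_directory path is an assumption):

limits_config:
  retention_period: 30d

compactor:
  working_directory: /var/lib/loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h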

Resource Requirements

Minimum Requirements

  • CPU: 2 cores
  • Memory: 1GB RAM
  • Disk: 50GB+ (depends on log volume and retention)
  • Network: 100 Mbps

Sizing Guidance

Small deployment (1-10 hosts, low log volume):

  • 2 CPU cores
  • 1GB RAM
  • 50GB storage

Medium deployment (10-50 hosts, moderate logs):

  • 4 CPU cores
  • 2GB RAM
  • 200GB storage

Large deployment (50+ hosts, high volume):

  • 8+ CPU cores
  • 4GB+ RAM
  • 1TB+ storage or S3 backend

Performance Tuning

Chunk Configuration

Optimize chunk sizes:

loki_chunk_idle_period: "30m"
loki_chunk_retain_period: "15s"
loki_max_chunk_age: "2h"
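
These map onto Loki's ingester block; the corresponding fragment of the generated configuration would look roughly like:

ingester:
  chunk_idle_period: 30m
  chunk_retain_period: 15s
  max_chunk_age: 2h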

Cache Settings

Tune cache for better performance:

loki_results_cache_ttl: "24h"
loki_chunk_cache_ttl: "24h"

Ingestion Rate Limits

Prevent resource exhaustion:

loki_ingestion_rate_mb: 4
loki_ingestion_burst_size_mb: 6
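
These correspond to Loki's per-tenant limits; in the rendered configuration they would appear roughly as:

limits_config:
  ingestion_rate_mb: 4
  ingestion_burst_size_mb: 6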

Backup and Recovery

Filesystem Backend

Manual backup:

# Stop Loki
systemctl stop loki

# Backup data directory
tar -czf loki-backup.tar.gz /var/lib/loki/

# Start Loki
systemctl start loki
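
To restore from that archive on a replacement host (assuming the same data path), reverse the steps:

# Stop Loki
systemctl stop loki

# Restore data directory (archive contains var/lib/loki/)
tar -xzf loki-backup.tar.gz -C /

# Start Loki
systemctl start loki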

S3 Backend

When using S3, data is automatically stored in object storage. For disaster recovery:

  1. Deploy a new Loki instance
  2. Configure the same S3 endpoint and bucket
  3. Existing data in the bucket becomes available automatically

Monitoring Loki

Metrics Endpoint

Loki exposes Prometheus metrics:

curl http://localhost:3100/metrics

Key Metrics to Monitor

  • loki_ingester_chunks_created_total - Chunks created
  • loki_request_duration_seconds - Query latency
  • loki_ingester_memory_chunks - Chunks in memory
  • loki_panic_total - Application panics
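
If you scrape these metrics with Prometheus, a minimal scrape job might look like this (the target hostname is an example):

scrape_configs:
  - job_name: loki
    static_configs:
      - targets: ["monitor11.example.com:3100"]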

Troubleshooting

Check Container Status

podman ps -a | grep loki
podman logs loki

Check Service Status

systemctl status loki
journalctl -u loki -n 100

Verify API Access

curl http://localhost:3100/ready
curl http://localhost:3100/metrics

Test Query

curl -G "http://localhost:3100/loki/api/v1/query" \
  --data-urlencode 'query={service_type="fail2ban"}' \
  --data-urlencode 'limit=5'

Common Issues

  1. Container won't start: Check logs with podman logs loki
  2. API not accessible: Verify port 3100 is open, check firewall
  3. No logs appearing: Check Alloy collectors are configured correctly
  4. Out of disk space: Reduce retention period or use S3 backend
  5. High memory usage: Reduce chunk cache sizes or add more RAM
  6. Query timeouts: Optimize queries, add time range filters

Security Considerations

  1. Network Access: Restrict port 3100 to the monitoring network (see the firewall example after this list)
  2. Authentication: Consider adding authentication via reverse proxy
  3. TLS/SSL: Use HTTPS in production (requires reverse proxy)
  4. S3 Credentials: Store S3 keys in Ansible Vault
  5. Log Sanitization: Avoid logging sensitive data
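
As an example of the network restriction in item 1, with firewalld you could allow port 3100 only from a monitoring subnet (the subnet here is a placeholder):

firewall-cmd --permanent \
  --add-rich-rule='rule family="ipv4" source address="10.0.10.0/24" port port="3100" protocol="tcp" accept'
firewall-cmd --reload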

Reference Deployment

See the Reference Deployments chapter for a real-world example:

  • monitor11.example.com - Loki with S3 backend