Performance Tuning¶

Overview¶

Performance tuning ensures monitoring infrastructure scales efficiently and maintains low resource overhead. This page covers optimization techniques for each component.

InfluxDB Performance¶

Memory Optimization¶

Cache configuration:

influxdb_cache_max_memory: "1g"           # Total cache size
influxdb_cache_snapshot_memory: "25m"     # Snapshot cache
influxdb_cache_max_entry_size: "1m"       # Max single entry

Memory allocation: - Small deployment (1-10 hosts): 2GB RAM, 512MB cache - Medium deployment (10-50 hosts): 4GB RAM, 1GB cache - Large deployment (50+ hosts): 8GB+ RAM, 2GB+ cache

Container limits:

influxdb_container_memory_limit: "4g"
influxdb_container_memory_reservation: "2g"

Write Performance¶

Batch size:

# Telegraf
telegraf_flush_interval: "10s"      # Batch data for 10 seconds
telegraf_metric_batch_size: 1000    # Up to 1000 points per batch
telegraf_metric_buffer_limit: 10000 # Buffer up to 10000 points

Compaction:

influxdb_compaction_throughput: "48m"          # Bytes/sec during compaction
influxdb_max_concurrent_compactions: 4         # Parallel compactions
influxdb_compaction_full_write_cold_duration: "4h"  # Age before full compaction

Query Performance¶

Index optimization: - Use appropriate time ranges (avoid unbounded queries) - Filter by tags before fields - Use last() instead of sorting for latest value

Optimize common queries: Create continuous queries for frequently-accessed aggregates:

option task = {name: "downsample_cpu", every: 5m}

from(bucket: "telegraf")
  |> range(start: -10m)
  |> filter(fn: (r) => r["_measurement"] == "cpu")
  |> aggregateWindow(every: 5m, fn: mean)
  |> to(bucket: "telegraf_downsampled")

Storage Optimization¶

Use S3 backend: - Reduces local disk I/O - Enables unlimited capacity - Lower cost per GB

Retention policies:

influxdb_retention: "30d"  # Shorter retention = less storage

Shard duration:

influxdb_shard_duration: "7d"  # Balance between query performance and storage

Loki Performance¶

Chunk Configuration¶

Optimize chunk sizes:

loki_chunk_idle_period: "30m"      # Flush chunks after 30m idle
loki_chunk_retain_period: "15s"    # Keep in memory 15s after flush
loki_max_chunk_age: "2h"           # Force flush after 2h
loki_chunk_target_size: 1572864    # Target 1.5MB chunks

Larger chunks = fewer writes, but higher memory usage.

Ingestion Optimization¶

Rate limits:

loki_ingestion_rate_mb: 4          # 4MB/sec per stream
loki_ingestion_burst_size_mb: 6    # Burst to 6MB
loki_max_streams_per_user: 0       # Unlimited (or set limit)
loki_max_line_size: 256000         # 256KB max line size

Distributor configuration:

loki_distributor_ring_instances: 1  # Single instance for small deployments

Query Performance¶

Cache configuration:

loki_results_cache_ttl: "24h"      # Cache query results
loki_chunk_cache_ttl: "24h"        # Cache chunks

Query limits:

loki_max_query_length: "721h"      # Max 30 days
loki_max_query_series: 500         # Limit series returned
loki_max_entries_limit: 5000       # Max log lines per query

Query optimization techniques: 1. Always use time ranges 2. Use specific label selectors 3. Avoid regex when possible 4. Use line_format to reduce data transferred 5. Use instant queries for tables (not range queries)

Storage Optimization¶

Use S3 backend:

loki_storage_type: "s3"
loki_s3_endpoint: "storage.example.com:8010"

Compaction:

loki_compaction_interval: "10m"           # Compact every 10 minutes
loki_retention_delete_delay: "2h"         # Wait 2h before deletion
loki_retention_delete_worker_count: 150   # Parallel deletion workers

Telegraf Performance¶

Collection Interval¶

Balance freshness vs overhead:

telegraf_interval: "10s"           # Collect every 10 seconds (default)
telegraf_interval: "60s"           # Less overhead, less granular

Metric Filtering¶

Drop unwanted metrics:

telegraf_processors:
  - type: "drop"
    filter:
      measurement: ["http_response_time"]
      tags:
        url: ["http://localhost/health"]

Drop high-cardinality tags:

telegraf_processors:
  - type: "drop"
    tagdrop:
      path: ["/proc/*"]  # Drop per-process metrics

Batch Configuration¶

Optimize batching:

telegraf_flush_interval: "10s"         # Send every 10 seconds
telegraf_metric_batch_size: 1000       # Batch up to 1000 metrics
telegraf_metric_buffer_limit: 10000    # Buffer 10000 if output slow

Plugin Optimization¶

Disable unused plugins:

telegraf_enable_docker: false   # Don't collect if not needed
telegraf_enable_nginx: false

Configure collection intervals per-plugin:

[[inputs.cpu]]
  interval = "10s"

[[inputs.disk]]
  interval = "60s"  # Disk usage changes slowly

Alloy Performance¶

Log Source Optimization¶

Limit log volume:

alloy_config_sources:
  - fail2ban  # Only collect what you need
  - apache
  # Don't collect debug logs in production

File position tracking: Alloy tracks file positions to avoid re-reading logs. No tuning needed.

Batch Configuration¶

Optimize log shipping:

loki.write "default" {
  endpoint {
    url = "http://localhost:3100/loki/api/v1/push"

    batch_size = 1048576      // 1MB batches
    batch_wait = "1s"         // Wait 1s to fill batch

    queue {
      capacity = 10000        // Buffer 10000 log lines
      max_backoff = "1m"      // Max retry backoff
    }
  }
}

Processing Optimization¶

Minimize regex operations:

// Expensive (processes every line)
loki.process "expensive" {
  stage.regex {
    expression = "complex_regex_pattern"
  }
}

// Better (filter first, then process)
loki.process "efficient" {
  stage.match {
    selector = '{service_type="fail2ban"}'

    stage.regex {
      expression = "\[(?P<jail>[^\]]+)\]"
    }
  }
}

System-Level Optimization¶

Disk I/O¶

Use SSD/NVMe: - 10x faster than HDD - Critical for InfluxDB and Loki write performance

Filesystem mount options:

# /etc/fstab
/dev/sdb1  /var/lib/influxdb2  ext4  noatime,nodiratime  0  2
/dev/sdc1  /var/lib/loki       ext4  noatime,nodiratime  0  2

I/O scheduling:

# For SSD
echo "noop" > /sys/block/sda/queue/scheduler

# For HDD
echo "deadline" > /sys/block/sda/queue/scheduler

Network Optimization¶

MTU size:

# Increase MTU for local network (if supported)
ip link set dev eth0 mtu 9000

TCP tuning (for high-throughput):

# /etc/sysctl.conf
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864

Container Optimization¶

Resource limits:

# Prevent resource exhaustion
influxdb_container_cpu_limit: "2"
influxdb_container_memory_limit: "4g"

loki_container_cpu_limit: "2"
loki_container_memory_limit: "2g"

Restart policies:

influxdb_restart_policy: "always"
loki_restart_policy: "always"

Monitoring Performance¶

Check Resource Usage¶

CPU and memory:

# System overview
top
htop

# Container stats
podman stats influxdb loki telegraf alloy

Disk I/O:

iostat -x 1
iotop

Network:

iftop
nethogs

Performance Metrics¶

InfluxDB:

curl http://localhost:8086/metrics | grep -E "write_ok|query_req"

Loki:

curl http://localhost:3100/metrics | grep -E "ingester_chunks|request_duration"

Telegraf:

journalctl -u telegraf | grep "wrote batch"

Performance Benchmarks¶

Expected Performance¶

Small deployment (1-10 hosts): - InfluxDB: 10k writes/sec, < 100ms query latency - Loki: 1k lines/sec, < 500ms query latency - Telegraf: < 1% CPU, < 128MB RAM - Alloy: < 2% CPU, < 256MB RAM

Medium deployment (10-50 hosts): - InfluxDB: 100k writes/sec, < 200ms query latency - Loki: 10k lines/sec, < 1s query latency - Resource usage scales linearly

Troubleshooting Performance Issues¶

High CPU¶

InfluxDB: Check compaction, reduce query complexity Loki: Check regex parsing, reduce log volume Telegraf: Reduce collection frequency, disable plugins Alloy: Reduce regex operations, limit log sources

High Memory¶

InfluxDB: Reduce cache sizes Loki: Reduce chunk sizes, flush more frequently Telegraf: Reduce buffer limits Alloy: Reduce queue capacity

Slow Queries¶

InfluxDB: Add time range filters, optimize indexes Loki: Use specific label selectors, avoid unbounded queries

Best Practices Summary¶

Start small, scale up: Begin with conservative settings
Monitor metrics: Track performance over time
Tune iteratively: Change one thing at a time
Use S3 for storage: Offload disk I/O
Optimize queries: Don't scan unnecessary data
Set resource limits: Prevent runaway processes
Regular maintenance: Compact, clean up old data
Document changes: Keep notes on tuning decisions