Performance Tuning

Overview

Performance tuning ensures the monitoring infrastructure scales efficiently while keeping resource overhead low. This page covers optimization techniques for each component of the stack.

InfluxDB Performance

Memory Optimization

Cache configuration:

influxdb_cache_max_memory: "1g"           # Total cache size
influxdb_cache_snapshot_memory: "25m"     # Snapshot cache
influxdb_cache_max_entry_size: "1m"       # Max single entry

Memory allocation:

- Small deployment (1-10 hosts): 2GB RAM, 512MB cache
- Medium deployment (10-50 hosts): 4GB RAM, 1GB cache
- Large deployment (50+ hosts): 8GB+ RAM, 2GB+ cache

Container limits:

influxdb_container_memory_limit: "4g"
influxdb_container_memory_reservation: "2g"

Write Performance

Batch size:

# Telegraf
telegraf_flush_interval: "10s"      # Batch data for 10 seconds
telegraf_metric_batch_size: 1000    # Up to 1000 points per batch
telegraf_metric_buffer_limit: 10000 # Buffer up to 10000 points

Compaction:

influxdb_compaction_throughput: "48m"          # Bytes/sec during compaction
influxdb_max_concurrent_compactions: 4         # Parallel compactions
influxdb_compaction_full_write_cold_duration: "4h"  # Age before full compaction

Query Performance

Index optimization:

- Use appropriate time ranges (avoid unbounded queries)
- Filter by tags before fields
- Use last() instead of sorting for the latest value
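
For illustration, a minimal Flux sketch that applies these rules against the default telegraf bucket; the host value "web01" is a placeholder:

from(bucket: "telegraf")
  |> range(start: -1h)                              // bounded time range
  |> filter(fn: (r) => r["_measurement"] == "cpu")  // measurement and tag filters first
  |> filter(fn: (r) => r["host"] == "web01")
  |> filter(fn: (r) => r["_field"] == "usage_idle") // field filter last
  |> last()                                         // latest point without sorting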

Optimize common queries: Create downsampling tasks (InfluxDB 2.x's replacement for 1.x continuous queries) for frequently-accessed aggregates:

option task = {name: "downsample_cpu", every: 5m}

from(bucket: "telegraf")
  |> range(start: -10m)
  |> filter(fn: (r) => r["_measurement"] == "cpu")
  |> aggregateWindow(every: 5m, fn: mean)
  |> to(bucket: "telegraf_downsampled")

Storage Optimization

Use S3 backend:

- Reduces local disk I/O
- Provides effectively unlimited capacity
- Lower cost per GB

Retention policies:

influxdb_retention: "30d"  # Shorter retention = less storage

Shard duration:

influxdb_shard_duration: "7d"  # Balance between query performance and storage

Loki Performance

Chunk Configuration

Optimize chunk sizes:

loki_chunk_idle_period: "30m"      # Flush chunks after 30m idle
loki_chunk_retain_period: "15s"    # Keep in memory 15s after flush
loki_max_chunk_age: "2h"           # Force flush after 2h
loki_chunk_target_size: 1572864    # Target 1.5MB chunks

Larger chunks = fewer writes, but higher memory usage.
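
To judge whether these values need adjusting, Loki's ingester exports counters for why chunks get flushed (idle, full, or max age); a quick check, assuming Loki listens on localhost:3100 as elsewhere on this page:

curl -s http://localhost:3100/metrics | grep loki_ingester_chunks_flushed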

Ingestion Optimization

Rate limits:

loki_ingestion_rate_mb: 4          # 4MB/sec per tenant
loki_ingestion_burst_size_mb: 6    # Burst to 6MB
loki_max_streams_per_user: 0       # Unlimited (or set limit)
loki_max_line_size: 256000         # 256KB max line size

Distributor configuration:

loki_distributor_ring_instances: 1  # Single instance for small deployments

Query Performance

Cache configuration:

loki_results_cache_ttl: "24h"      # Cache query results
loki_chunk_cache_ttl: "24h"        # Cache chunks

Query limits:

loki_max_query_length: "721h"      # ~30 days (30d + 1h)
loki_max_query_series: 500         # Limit series returned
loki_max_entries_limit: 5000       # Max log lines per query

Query optimization techniques:

  1. Always use time ranges
  2. Use specific label selectors
  3. Avoid regex when possible
  4. Use line_format to reduce data transferred
  5. Use instant queries for tables (not range queries)
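
A hedged LogQL sketch applying these techniques: narrow with label selectors, filter lines before parsing, then reshape output with line_format. The service_type label matches the selector used in the Alloy example later on this page; the host value is a placeholder:

{service_type="fail2ban", host="web01"}
  |= "Ban"
  | regexp `\[(?P<jail>[^\]]+)\]`
  | line_format "jail={{ .jail }}"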

Storage Optimization

Use S3 backend:

loki_storage_type: "s3"
loki_s3_endpoint: "storage.example.com:8010"

Compaction:

loki_compaction_interval: "10m"           # Compact every 10 minutes
loki_retention_delete_delay: "2h"         # Wait 2h before deletion
loki_retention_delete_worker_count: 150   # Parallel deletion workers

Telegraf Performance

Collection Interval

Balance freshness vs overhead:

telegraf_interval: "10s"           # Collect every 10 seconds (default)
# telegraf_interval: "60s"         # Less overhead, less granular

Metric Filtering

Drop unwanted metrics:

telegraf_processors:
  - type: "drop"
    filter:
      measurement: ["http_response_time"]
      tags:
        url: ["http://localhost/health"]

Drop high-cardinality tags:

telegraf_processors:
  - type: "drop"
    tagdrop:
      path: ["/proc/*"]  # Drop per-process metrics

Batch Configuration

Optimize batching:

telegraf_flush_interval: "10s"         # Send every 10 seconds
telegraf_metric_batch_size: 1000       # Batch up to 1000 metrics
telegraf_metric_buffer_limit: 10000    # Buffer 10000 if output slow

Plugin Optimization

Disable unused plugins:

telegraf_enable_docker: false   # Don't collect if not needed
telegraf_enable_nginx: false

Configure collection intervals per-plugin:

[[inputs.cpu]]
  interval = "10s"

[[inputs.disk]]
  interval = "60s"  # Disk usage changes slowly

Alloy Performance

Log Source Optimization

Limit log volume:

alloy_config_sources:
  - fail2ban  # Only collect what you need
  - apache
  # Don't collect debug logs in production

File position tracking: Alloy tracks file positions to avoid re-reading logs. No tuning needed.

Batch Configuration

Optimize log shipping:

loki.write "default" {
  endpoint {
    url = "http://localhost:3100/loki/api/v1/push"

    batch_size = 1048576      // 1MB batches
    batch_wait = "1s"         // Wait 1s to fill batch

    queue {
      capacity = 10000        // Buffer 10000 log lines
      max_backoff = "1m"      // Max retry backoff
    }
  }
}

Processing Optimization

Minimize regex operations:

// Expensive (processes every line)
loki.process "expensive" {
  stage.regex {
    expression = "complex_regex_pattern"
  }
}

// Better (filter first, then process)
loki.process "efficient" {
  stage.match {
    selector = '{service_type="fail2ban"}'

    stage.regex {
      expression = "\\[(?P<jail>[^\\]]+)\\]"
    }
  }
}

System-Level Optimization

Disk I/O

Use SSD/NVMe:

- 10x faster than HDD
- Critical for InfluxDB and Loki write performance
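
A rough way to confirm whether the data disk is the bottleneck is a direct-I/O write test; this sketch writes a 1GB test file to the InfluxDB data directory and removes it afterwards:

dd if=/dev/zero of=/var/lib/influxdb2/ddtest bs=1M count=1024 oflag=direct conv=fsync
rm /var/lib/influxdb2/ddtest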

Filesystem mount options:

# /etc/fstab
/dev/sdb1  /var/lib/influxdb2  ext4  noatime,nodiratime  0  2
/dev/sdc1  /var/lib/loki       ext4  noatime,nodiratime  0  2

I/O scheduling:

# For SSD/NVMe ("noop" on legacy single-queue kernels)
echo "none" > /sys/block/sda/queue/scheduler

# For HDD ("deadline" on legacy single-queue kernels)
echo "mq-deadline" > /sys/block/sda/queue/scheduler

Network Optimization

MTU size:

# Increase MTU for local network (if supported)
ip link set dev eth0 mtu 9000
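
Jumbo frames only help if every hop supports them; a quick end-to-end check sends a non-fragmentable 9000-byte frame (8972 bytes of payload plus 28 bytes of IP/ICMP headers) to a peer of your choice:

ping -M do -s 8972 <remote-host>   # fails if any hop's MTU is below 9000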

TCP tuning (for high-throughput):

# /etc/sysctl.conf
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
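
These settings take effect after reloading sysctl; apply and spot-check with:

sysctl -p                    # reload /etc/sysctl.conf
sysctl net.core.rmem_max     # verify the value took effect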

Container Optimization

Resource limits:

# Prevent resource exhaustion
influxdb_container_cpu_limit: "2"
influxdb_container_memory_limit: "4g"

loki_container_cpu_limit: "2"
loki_container_memory_limit: "2g"

Restart policies:

influxdb_restart_policy: "always"
loki_restart_policy: "always"

Monitoring Performance

Check Resource Usage

CPU and memory:

# System overview
top
htop

# Container stats
podman stats influxdb loki telegraf alloy

Disk I/O:

iostat -x 1
iotop

Network:

iftop
nethogs

Performance Metrics

InfluxDB:

curl http://localhost:8086/metrics | grep -E "write_ok|query_req"

Loki:

curl http://localhost:3100/metrics | grep -E "ingester_chunks|request_duration"

Telegraf:

journalctl -u telegraf | grep -i "wrote batch"

Performance Benchmarks

Expected Performance

Small deployment (1-10 hosts):

- InfluxDB: 10k writes/sec, < 100ms query latency
- Loki: 1k lines/sec, < 500ms query latency
- Telegraf: < 1% CPU, < 128MB RAM
- Alloy: < 2% CPU, < 256MB RAM

Medium deployment (10-50 hosts):

- InfluxDB: 100k writes/sec, < 200ms query latency
- Loki: 10k lines/sec, < 1s query latency
- Resource usage scales linearly

Troubleshooting Performance Issues

High CPU

- InfluxDB: Check compaction activity, reduce query complexity
- Loki: Check regex parsing, reduce log volume
- Telegraf: Reduce collection frequency, disable unused plugins
- Alloy: Reduce regex operations, limit log sources

High Memory

- InfluxDB: Reduce cache sizes
- Loki: Reduce chunk sizes, flush more frequently
- Telegraf: Reduce buffer limits
- Alloy: Reduce queue capacity

Slow Queries

- InfluxDB: Add time range filters, optimize indexes
- Loki: Use specific label selectors, avoid unbounded queries

Best Practices Summary

  1. Start small, scale up: Begin with conservative settings
  2. Monitor metrics: Track performance over time
  3. Tune iteratively: Change one thing at a time
  4. Use S3 for storage: Offload disk I/O
  5. Optimize queries: Don't scan unnecessary data
  6. Set resource limits: Prevent runaway processes
  7. Regular maintenance: Compact, clean up old data
  8. Document changes: Keep notes on tuning decisions