Performance Tuning¶
Overview¶
Performance tuning ensures monitoring infrastructure scales efficiently and maintains low resource overhead. This page covers optimization techniques for each component.
InfluxDB Performance¶
Memory Optimization¶
Cache configuration:
influxdb_cache_max_memory: "1g" # Total cache size
influxdb_cache_snapshot_memory: "25m" # Snapshot cache
influxdb_cache_max_entry_size: "1m" # Max single entry
Memory allocation:

- Small deployment (1-10 hosts): 2GB RAM, 512MB cache
- Medium deployment (10-50 hosts): 4GB RAM, 1GB cache
- Large deployment (50+ hosts): 8GB+ RAM, 2GB+ cache
Container limits:
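As a sketch, the same container-limit variables shown under Container Optimization later on this page apply here; the values below are sized for a medium deployment:

```yaml
influxdb_container_cpu_limit: "2"      # Cap CPU so compactions cannot starve the host
influxdb_container_memory_limit: "4g"  # Must exceed cache size plus working memory
```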
Write Performance¶
Batch size:
# Telegraf
telegraf_flush_interval: "10s" # Batch data for 10 seconds
telegraf_metric_batch_size: 1000 # Up to 1000 points per batch
telegraf_metric_buffer_limit: 10000 # Buffer up to 10000 points
Compaction:
influxdb_compaction_throughput: "48m" # Max 48 MB/s during compaction
influxdb_max_concurrent_compactions: 4 # Parallel compactions
influxdb_compaction_full_write_cold_duration: "4h" # Age before full compaction
Query Performance¶
Index optimization:
- Use appropriate time ranges (avoid unbounded queries)
- Filter by tags before fields
- Use last() instead of sorting for latest value
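For example, a hedged Flux sketch (bucket and field names are illustrative) that fetches the latest reading with `last()` instead of sorting the full result:

```flux
from(bucket: "telegraf")
  |> range(start: -5m)
  |> filter(fn: (r) => r["_measurement"] == "cpu" and r["_field"] == "usage_idle")
  |> last()
```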
Optimize common queries: Create downsampling tasks (the InfluxDB 2.x successor to continuous queries) for frequently accessed aggregates:
option task = {name: "downsample_cpu", every: 5m}
from(bucket: "telegraf")
|> range(start: -10m)
|> filter(fn: (r) => r["_measurement"] == "cpu")
|> aggregateWindow(every: 5m, fn: mean)
|> to(bucket: "telegraf_downsampled")
Storage Optimization¶
Use S3 backend:

- Reduces local disk I/O
- Enables unlimited capacity
- Lower cost per GB
Retention policies:
Shard duration:
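The original values here are deployment-specific; as a hedged sketch (the variable names are assumptions following the role's `influxdb_*` convention), retention and shard duration should be tuned together, since shards can only be dropped whole:

```yaml
influxdb_retention: "30d"       # Assumed variable name: drop points older than 30 days
influxdb_shard_duration: "1d"   # Assumed variable name: 1-day shards suit 30-day retention
```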
Loki Performance¶
Chunk Configuration¶
Optimize chunk sizes:
loki_chunk_idle_period: "30m" # Flush chunks after 30m idle
loki_chunk_retain_period: "15s" # Keep in memory 15s after flush
loki_max_chunk_age: "2h" # Force flush after 2h
loki_chunk_target_size: 1572864 # Target 1.5MB chunks
Larger chunks = fewer writes, but higher memory usage.
Ingestion Optimization¶
Rate limits:
loki_ingestion_rate_mb: 4 # 4MB/sec per tenant
loki_ingestion_burst_size_mb: 6 # Burst to 6MB
loki_max_streams_per_user: 0 # Unlimited (or set limit)
loki_max_line_size: 256000 # 256KB max line size
Distributor configuration:
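As a sketch in raw Loki configuration (not role variables), a single-node deployment keeps the distributor ring in memory:

```yaml
distributor:
  ring:
    kvstore:
      store: inmemory   # No external KV store needed on a single node
```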
Query Performance¶
Cache configuration:
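One approach, sketched as raw Loki configuration using the embedded results cache (the size is illustrative):

```yaml
query_range:
  cache_results: true
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100   # In-memory cache for repeated query results
```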
Query limits:
loki_max_query_length: "721h" # 30 days plus 1h of headroom
loki_max_query_series: 500 # Limit series returned
loki_max_entries_limit: 5000 # Max log lines per query
Query optimization techniques:
1. Always use time ranges
2. Use specific label selectors
3. Avoid regex when possible
4. Use line_format to reduce data transferred
5. Use instant queries for tables (not range queries)
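Putting these rules together, a hedged LogQL example (the labels are illustrative): a narrow selector plus a plain substring filter runs far faster than a broad regex match:

```logql
{service_type="apache", host="web1"} |= "error"
```

The anti-pattern is the opposite shape, e.g. `{job=~".+"} |~ ".*error.*"`, which forces Loki to scan every stream.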
Storage Optimization¶
Use S3 backend:
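A hedged sketch of the corresponding raw Loki storage configuration (region and bucket are placeholders):

```yaml
storage_config:
  aws:
    region: us-east-1          # Placeholder
    bucketnames: loki-chunks   # Placeholder
    # Credentials typically come from the environment or an instance profile
```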
Compaction:
loki_compaction_interval: "10m" # Compact every 10 minutes
loki_retention_delete_delay: "2h" # Wait 2h before deletion
loki_retention_delete_worker_count: 150 # Parallel deletion workers
Telegraf Performance¶
Collection Interval¶
Balance freshness vs overhead:
telegraf_interval: "10s" # Collect every 10 seconds (default)
telegraf_interval: "60s" # Less overhead, less granular
Metric Filtering¶
Drop unwanted metrics:
telegraf_processors:
- type: "drop"
filter:
measurement: ["http_response_time"]
tags:
url: ["http://localhost/health"]
Drop high-cardinality tags:
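In raw Telegraf configuration this is handled per input with `tagexclude` (a sketch; the `cpu` input is just an example):

```toml
[[inputs.cpu]]
  percpu = false        # One aggregate series instead of one per core
  tagexclude = ["cpu"]  # Drop the per-CPU tag before metrics are emitted
```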
Batch Configuration¶
Optimize batching:
telegraf_flush_interval: "10s" # Send every 10 seconds
telegraf_metric_batch_size: 1000 # Batch up to 1000 metrics
telegraf_metric_buffer_limit: 10000 # Buffer 10000 if output slow
Plugin Optimization¶
Disable unused plugins:
Configure collection intervals per-plugin:
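In raw Telegraf configuration every input accepts its own `interval` override, so slow-moving data can be polled less often (a sketch; the plugin choice is illustrative):

```toml
[[inputs.disk]]
  interval = "5m"   # Disk usage changes slowly; no need to poll every 10s
```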
Alloy Performance¶
Log Source Optimization¶
Limit log volume:
alloy_config_sources:
- fail2ban # Only collect what you need
- apache
# Don't collect debug logs in production
File position tracking: Alloy tracks file positions to avoid re-reading logs. No tuning needed.
Batch Configuration¶
Optimize log shipping:
loki.write "default" {
endpoint {
url = "http://localhost:3100/loki/api/v1/push"
batch_size = 1048576 // 1MB batches
batch_wait = "1s" // Wait 1s to fill batch
queue {
capacity = 10000 // Buffer 10000 log lines
max_backoff = "1m" // Max retry backoff
}
}
}
Processing Optimization¶
Minimize regex operations:
// Expensive (processes every line)
loki.process "expensive" {
stage.regex {
expression = "complex_regex_pattern"
}
}
// Better (filter first, then process)
loki.process "efficient" {
stage.match {
selector = '{service_type="fail2ban"}'
stage.regex {
expression = "\\[(?P<jail>[^\\]]+)\\]"
}
}
}
System-Level Optimization¶
Disk I/O¶
Use SSD/NVMe:

- Roughly 10x faster than HDD for random I/O
- Critical for InfluxDB and Loki write performance
Filesystem mount options:
# /etc/fstab
/dev/sdb1 /var/lib/influxdb2 ext4 noatime,nodiratime 0 2
/dev/sdc1 /var/lib/loki ext4 noatime,nodiratime 0 2
I/O scheduling:
# For SSD/NVMe (modern multiqueue kernels; older kernels used "noop")
echo "none" > /sys/block/sda/queue/scheduler
# For HDD (older kernels used "deadline")
echo "mq-deadline" > /sys/block/sda/queue/scheduler
Network Optimization¶
MTU size:
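For example, on a network that supports jumbo frames end to end (the interface name is a placeholder; requires root):

```shell
ip link set dev eth0 mtu 9000
ip link show eth0   # Verify the new MTU took effect
```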
TCP tuning (for high-throughput):
# /etc/sysctl.conf
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
Container Optimization¶
Resource limits:
# Prevent resource exhaustion
influxdb_container_cpu_limit: "2"
influxdb_container_memory_limit: "4g"
loki_container_cpu_limit: "2"
loki_container_memory_limit: "2g"
Restart policies:
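A sketch, assuming the roles expose restart-policy variables parallel to the resource-limit variables above (the variable names are assumptions):

```yaml
influxdb_container_restart_policy: "unless-stopped"  # Assumed variable name
loki_container_restart_policy: "unless-stopped"      # Assumed variable name
```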
Monitoring Performance¶
Check Resource Usage¶
CPU and memory:
Disk I/O:
Network:
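Typical commands for each check (the container runtime and interface names are illustrative):

```shell
# CPU and memory per container
docker stats --no-stream

# Disk I/O, extended stats every 5 seconds (sysstat package)
iostat -x 5

# Per-interface network counters
ip -s link show eth0
```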
Performance Metrics¶
InfluxDB:
Loki:
Telegraf:
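All three components expose internal metrics that can be spot-checked (ports assume the default configuration):

```shell
curl -s http://localhost:8086/metrics | grep influxdb_   # InfluxDB Prometheus endpoint
curl -s http://localhost:3100/metrics | grep loki_       # Loki Prometheus endpoint
```

Telegraf reports on itself via its `internal` input plugin, writing its own throughput and buffer metrics into InfluxDB.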
Performance Benchmarks¶
Expected Performance¶
Small deployment (1-10 hosts):

- InfluxDB: 10k writes/sec, < 100ms query latency
- Loki: 1k lines/sec, < 500ms query latency
- Telegraf: < 1% CPU, < 128MB RAM
- Alloy: < 2% CPU, < 256MB RAM
Medium deployment (10-50 hosts):

- InfluxDB: 100k writes/sec, < 200ms query latency
- Loki: 10k lines/sec, < 1s query latency
- Resource usage scales linearly
Troubleshooting Performance Issues¶
High CPU¶
- InfluxDB: Check compaction activity, reduce query complexity
- Loki: Check regex parsing, reduce log volume
- Telegraf: Reduce collection frequency, disable unused plugins
- Alloy: Reduce regex operations, limit log sources
High Memory¶
- InfluxDB: Reduce cache sizes
- Loki: Reduce chunk sizes, flush more frequently
- Telegraf: Reduce buffer limits
- Alloy: Reduce queue capacity
Slow Queries¶
- InfluxDB: Add time range filters, optimize indexes
- Loki: Use specific label selectors, avoid unbounded queries
Best Practices Summary¶
- Start small, scale up: Begin with conservative settings
- Monitor metrics: Track performance over time
- Tune iteratively: Change one thing at a time
- Use S3 for storage: Offload disk I/O
- Optimize queries: Don't scan unnecessary data
- Set resource limits: Prevent runaway processes
- Regular maintenance: Compact, clean up old data
- Document changes: Keep notes on tuning decisions