# Troubleshooting Guide

## Overview

This guide covers common issues and their solutions. Follow the systematic approach: identify, diagnose, fix, verify.

## General Troubleshooting Process

- Identify: What is not working?
- Diagnose: Check logs, status, connectivity
- Fix: Apply the solution
- Verify: Confirm the fix worked
## Service Won't Start

### InfluxDB Won't Start

Symptoms:

- `systemctl status influxdb` shows `failed`
- Container not running

Diagnosis:
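A first pass, assuming a systemd unit and a Podman container both named `influxdb` (adjust names to your deployment):

```bash
# Service status and recent logs
systemctl status influxdb
journalctl -u influxdb -n 50 --no-pager

# Container logs, if InfluxDB runs in a container
podman logs --tail 50 influxdb

# Health endpoint once the process responds
curl -I http://localhost:8086/health
```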
Common causes:

- Port conflict: Port 8086 already in use
- Storage permission issues: Data directory not writable by the service user
- Invalid configuration: Syntax error or bad value in the InfluxDB configuration
- Out of disk space: No free space left on the data volume
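Quick checks for each cause; the data directory below is an assumption, adjust to your volume layout:

```bash
# Port conflict: is something already listening on 8086?
ss -tlnp | grep 8086

# Storage permissions: who owns the data directory? (path is an assumption)
ls -ld /var/lib/influxdb

# Disk space on the data volume
df -h /var/lib/influxdb
```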
### Loki Won't Start

Diagnosis:
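The same first pass as for InfluxDB, assuming a unit and container named `loki`:

```bash
systemctl status loki
journalctl -u loki -n 50 --no-pager
podman logs --tail 50 loki

# Loki only reports ready after it starts cleanly
curl http://localhost:3100/ready
```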
Common causes:

- Port conflict: Port 3100 already in use
- Configuration syntax error: Invalid YAML in the Loki configuration
- Storage issues: Same as InfluxDB (permissions, disk space)
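Port and configuration checks; the config path is an assumption, and the `-verify-config` flag may be missing on older Loki releases (run the same command inside the container if Loki is containerized):

```bash
# Port conflict
ss -tlnp | grep 3100

# Validate the configuration without starting ingestion
loki -config.file=/etc/loki/config.yml -verify-config
```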
### Telegraf Won't Start

Diagnosis:
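Telegraf can dry-run its own configuration, which surfaces most startup errors directly:

```bash
systemctl status telegraf
journalctl -u telegraf -n 50 --no-pager

# Parse the config and run inputs once without writing to outputs
telegraf --config /etc/telegraf/telegraf.conf --test
```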
Common causes:

- Configuration syntax error: Invalid TOML in `telegraf.conf`
- Plugin error: Invalid plugin configuration
- Permission issues: Can't read system metrics
### Alloy Won't Start

Diagnosis:
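Check the service log, and run `alloy fmt` against the configuration (path is an assumption; see also Configuration Issues below), since it fails with a parse error on syntax problems:

```bash
systemctl status alloy
journalctl -u alloy -n 50 --no-pager

# Reports syntax errors with line numbers
alloy fmt /etc/alloy/config.alloy
```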
Common causes:

- Configuration syntax error: Missing braces, quotes, or invalid component syntax
- Log file access: Can't read log files
- Port conflict: Port 12345 in use
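Quick checks; the log path and the `alloy` service user are assumptions about your setup:

```bash
# Port conflict
ss -tlnp | grep 12345

# Can the Alloy service user read the files it is configured to tail?
sudo -u alloy head -n 1 /var/log/syslog
```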
## No Data Appearing

### No Metrics in InfluxDB
Symptoms: InfluxDB running but no data from Telegraf
Diagnosis:
```bash
# Check Telegraf is sending
journalctl -u telegraf | grep -i "wrote batch"

# Test Telegraf output
telegraf --config /etc/telegraf/telegraf.conf --test

# Query InfluxDB
curl -X POST "http://localhost:8086/api/v2/query?org=myorg" \
  -H "Authorization: Token YOUR_TOKEN" \
  -H "Content-Type: application/vnd.flux" \
  --data-binary 'from(bucket: "telegraf") |> range(start: -5m) |> limit(n: 1)'
```
Common causes:

- Authentication failure: Invalid token
- Network connectivity: Can't reach InfluxDB
- Wrong bucket: Writing to a different bucket than the one being queried
- Clock skew: Time difference too large
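Checks for each cause; the org, bucket, and token placeholders follow the query example above:

```bash
# Authentication and bucket: list buckets with the same token Telegraf uses (requires the influx CLI)
influx bucket list --host http://localhost:8086 --org myorg --token YOUR_TOKEN

# Network connectivity from the Telegraf host
curl -I http://localhost:8086/health

# Clock skew: compare UTC time on the Telegraf host and the InfluxDB host
date -u
```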
### No Logs in Loki
Symptoms: Loki running but no logs from Alloy
Diagnosis:
```bash
# Check Alloy is sending
journalctl -u alloy | grep -i "batch"

# Query Loki
curl -G "http://localhost:3100/loki/api/v1/query" \
  --data-urlencode 'query={hostname="YOUR_HOST"}' \
  --data-urlencode 'limit=5'

# Check Loki labels
curl http://localhost:3100/loki/api/v1/labels
```
Common causes:

- Log sources not producing logs: Nothing new has been written since Alloy started
- Alloy can't read logs: Permission or path problems on the log files
- Label mismatch: Querying wrong labels
- Network connectivity: Can't reach Loki
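A useful isolation step is to push a test line straight to Loki's API; if it appears, the problem is on the Alloy side (host and labels are placeholders):

```bash
# Push a single test log line to Loki (returns 204 on success)
curl -X POST "http://localhost:3100/loki/api/v1/push" \
  -H "Content-Type: application/json" \
  --data-raw "{\"streams\": [{\"stream\": {\"job\": \"test\"}, \"values\": [[\"$(date +%s%N)\", \"manual test line\"]]}]}"

# Query it back
curl -G "http://localhost:3100/loki/api/v1/query" \
  --data-urlencode 'query={job="test"}'
```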
## High Resource Usage

### InfluxDB High Memory
Symptoms: InfluxDB using excessive RAM
Diagnosis:
```bash
# Check memory usage
podman stats influxdb

# Check cache size
curl http://localhost:8086/metrics | grep cache
```
Solutions:

1. Reduce cache sizes in configuration
2. Add more RAM
3. Reduce retention period
4. Use S3 backend to offload storage
### Loki High Memory
Symptoms: Loki using excessive RAM
Diagnosis:
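Mirroring the InfluxDB checks above, assuming a container named `loki`:

```bash
# Check memory usage
podman stats loki

# Loki exposes its own heap and chunk metrics
curl http://localhost:3100/metrics | grep -E "go_memstats_heap_inuse_bytes|loki_ingester_memory_chunks"
```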
Solutions:

1. Reduce chunk cache size
2. Reduce ingestion rate limits
3. Use S3 backend
4. Increase flush frequency (smaller chunks)
### High Disk I/O
Symptoms: Disk utilization at 100%
Diagnosis:
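To see which device and which process are responsible (`iostat` is part of sysstat; `iotop` is a separate package):

```bash
# Per-device utilization, refreshed every second
iostat -x 1

# Per-process I/O
sudo iotop -o

# Per-container view
podman stats
```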
Solutions:

1. Use faster storage (SSD/NVMe)
2. Use S3 backend for storage
3. Reduce collection frequency (Telegraf)
4. Reduce log volume (Alloy)
5. Adjust retention policies
## Network Issues

### Can't Connect to InfluxDB
Diagnosis:
```bash
# From client host
telnet INFLUXDB_HOST 8086
curl -I http://INFLUXDB_HOST:8086/health

# Check firewall
sudo iptables -L -n | grep 8086
sudo firewall-cmd --list-all
```
Solutions (commands for each step are shown below):

1. Open firewall port
2. Check WireGuard (if used)
3. Verify InfluxDB is listening
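Example commands for each step; the firewall commands assume firewalld:

```bash
# 1. Open the InfluxDB port
sudo firewall-cmd --permanent --add-port=8086/tcp
sudo firewall-cmd --reload

# 2. Check the WireGuard tunnel is up and has recent handshakes
sudo wg show

# 3. Verify InfluxDB is listening on the expected interface
ss -tlnp | grep 8086
```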
### Can't Connect to Loki
Same troubleshooting as InfluxDB, but for port 3100.
## Data Quality Issues

### Gaps in Metrics
Symptoms: Missing data points in InfluxDB
Diagnosis:
```bash
# Check Telegraf was running continuously
journalctl -u telegraf --since "1 hour ago" | grep -i "stopped\|started"

# Check for collection errors
journalctl -u telegraf | grep -i error
```
Causes:

1. Telegraf restarted
2. Network interruption
3. InfluxDB unavailable
4. System clock issues
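To locate the gaps, a Flux query can count points per window and keep only the empty windows; the bucket, measurement, and field names are assumptions based on a default Telegraf setup:

```bash
# Windows with _value == 0 are gaps in the cpu measurement
curl -X POST "http://localhost:8086/api/v2/query?org=myorg" \
  -H "Authorization: Token YOUR_TOKEN" \
  -H "Content-Type: application/vnd.flux" \
  --data-binary 'from(bucket: "telegraf")
    |> range(start: -6h)
    |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_idle")
    |> aggregateWindow(every: 1m, fn: count, createEmpty: true)
    |> filter(fn: (r) => r._value == 0)'
```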
### Duplicate Logs
Symptoms: Same log line appearing multiple times in Loki
Causes:

1. Multiple Alloy instances shipping same logs
2. Alloy restarted and re-read logs
3. Log rotation issues
Solutions:

1. Ensure only one Alloy per log source
2. Use unique position tracking
3. Check Alloy configuration for duplicates
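Two quick checks for the first two causes; the storage path is an assumption and depends on the `--storage.path` your Alloy unit uses:

```bash
# More than one Alloy process on the host usually means duplicated shipping
pgrep -a alloy

# Position tracking lives under Alloy's storage path; if it was wiped, files are re-read from the start
ls -l /var/lib/alloy/data
```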
## Performance Issues

### Slow Queries
InfluxDB:
```bash
# Check query performance
curl http://localhost:8086/metrics | grep query_duration

# Solutions:
# - Add time range filters
# - Reduce query complexity
# - Increase cache sizes
# - Use continuous queries/tasks for common queries
```
Loki:
```bash
# Check query performance
curl http://localhost:3100/metrics | grep request_duration

# Solutions:
# - Always use time ranges
# - Use specific label selectors
# - Avoid regex when possible
# - Use instant queries for tables
```
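For reference, a well-scoped Loki query keeps the selector specific and the time range bounded; the `unit` label here is an assumption about your relabeling:

```bash
# Bounded range query with a specific selector and a line filter
curl -G "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={hostname="YOUR_HOST", unit="nginx.service"} |= "error"' \
  --data-urlencode "start=$(date -d '1 hour ago' +%s)000000000" \
  --data-urlencode "end=$(date +%s)000000000" \
  --data-urlencode 'limit=100'
```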
### Slow Ingestion
Symptoms: Data delayed by minutes
Diagnosis:
```bash
# Check batch sizes and flush intervals
journalctl -u telegraf | grep "wrote batch"

# Check for resource constraints
top
iostat -x 1
```
Solutions:

1. Increase flush interval (batch more data)
2. Reduce collection frequency
3. Add more resources (CPU, memory, disk)
4. Use S3 backend to reduce disk I/O
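Solutions 1 and 2 map to the `[agent]` section of `telegraf.conf`; a quick way to review the current values (the keys in the comment are the ones to adjust, the values are only illustrative):

```bash
# Relevant [agent] keys (illustrative values):
#   interval = "30s"             collection frequency
#   flush_interval = "30s"       how often batches are written to outputs
#   metric_batch_size = 5000     points per write
#   metric_buffer_limit = 50000  points buffered if the output is slow
grep -A 10 '^\[agent\]' /etc/telegraf/telegraf.conf
```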
## Configuration Issues

### Invalid Alloy Configuration
Symptoms: Alloy won't start or reload
Diagnosis:
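As in the startup section, `alloy fmt` plus the service log usually identify the offending line (config path is an assumption):

```bash
# Reports syntax errors with line numbers
alloy fmt /etc/alloy/config.alloy

# The failing component is usually named in the log
journalctl -u alloy -n 50 --no-pager | grep -i error
```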
Common errors:

1. Syntax errors (missing braces, quotes)
2. Invalid component names
3. Circular dependencies
4. Invalid regex patterns
Fix:

1. Run `alloy fmt` to auto-fix formatting
2. Check the logs for the specific error line
3. Test the configuration before deploying
### Invalid Telegraf Configuration
Diagnosis:
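The built-in test mode used earlier in this guide is usually enough to pinpoint the problem:

```bash
# Fails with a descriptive error on TOML or plugin problems
telegraf --config /etc/telegraf/telegraf.conf --test
```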
Common errors:

1. TOML syntax errors
2. Invalid plugin names
3. Missing required fields
4. Type mismatches
## Getting Help

When troubleshooting fails:

- Collect diagnostic info:

  ```bash
  # Services status
  systemctl status influxdb loki telegraf alloy > /tmp/services.txt

  # Logs
  journalctl -u influxdb -n 200 > /tmp/influxdb.log
  journalctl -u loki -n 200 > /tmp/loki.log
  journalctl -u telegraf -n 200 > /tmp/telegraf.log
  journalctl -u alloy -n 200 > /tmp/alloy.log

  # Container info
  podman ps -a > /tmp/containers.txt
  podman inspect influxdb > /tmp/influxdb-inspect.json
  ```

- Check documentation: Reference specific component docs
- Search issues: Check GitHub issues for known problems
- Ask community: Provide diagnostic info when asking for help