# Troubleshooting Guide

## Overview

This guide covers common issues and their solutions. Follow the systematic approach: identify, diagnose, fix, verify.

## General Troubleshooting Process

- Identify: What is not working?
- Diagnose: Check logs, status, connectivity
- Fix: Apply the solution
- Verify: Confirm the fix worked
## Service Won't Start

### InfluxDB Won't Start

Symptoms:

- `systemctl status influxdb` shows `failed`
- Container not running

Diagnosis:
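A first pass, assuming a systemd unit and a Podman container both named `influxdb` (adjust names to your deployment):

```bash
# Service status and recent logs
systemctl status influxdb
journalctl -u influxdb -n 50 --no-pager

# Container logs, if InfluxDB runs in a container
podman logs --tail 50 influxdb

# Health endpoint once the process responds
curl -I http://localhost:8086/health
```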
Common causes:

- Port conflict: Port 8086 already in use
- Storage permission issues: Data directory not writable by the service user
- Invalid configuration: Syntax error or bad value in the InfluxDB configuration
- Out of disk space: No free space left on the data volume
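Quick checks for each cause; the data directory below is an assumption, adjust to your volume layout:

```bash
# Port conflict: is something already listening on 8086?
ss -tlnp | grep 8086

# Storage permissions: who owns the data directory? (path is an assumption)
ls -ld /var/lib/influxdb

# Disk space on the data volume
df -h /var/lib/influxdb
```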
### Loki Won't Start

Diagnosis:
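The same first pass as for InfluxDB, assuming a unit and container named `loki`:

```bash
systemctl status loki
journalctl -u loki -n 50 --no-pager
podman logs --tail 50 loki

# Loki only reports ready after it starts cleanly
curl http://localhost:3100/ready
```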
Common causes:

- Port conflict: Port 3100 already in use
- Configuration syntax error: Invalid YAML in the Loki configuration
- Storage issues: Same as InfluxDB (permissions, disk space)
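Port and configuration checks; the config path is an assumption, and the `-verify-config` flag may be missing on older Loki releases (run the same command inside the container if Loki is containerized):

```bash
# Port conflict
ss -tlnp | grep 3100

# Validate the configuration without starting ingestion
loki -config.file=/etc/loki/config.yml -verify-config
```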
### Telegraf Won't Start

Diagnosis:
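Telegraf can dry-run its own configuration, which surfaces most startup errors directly:

```bash
systemctl status telegraf
journalctl -u telegraf -n 50 --no-pager

# Parse the config and run inputs once without writing to outputs
telegraf --config /etc/telegraf/telegraf.conf --test
```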
Common causes:

- Configuration syntax error: Invalid TOML in `telegraf.conf`
- Plugin error: Invalid plugin configuration
- Permission issues: Can't read system metrics
### Alloy Won't Start

Diagnosis:
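Check the service log, and run `alloy fmt` against the configuration (path is an assumption; see also Configuration Issues below), since it fails with a parse error on syntax problems:

```bash
systemctl status alloy
journalctl -u alloy -n 50 --no-pager

# Reports syntax errors with line numbers
alloy fmt /etc/alloy/config.alloy
```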
Common causes:

- Configuration syntax error: Missing braces, quotes, or invalid component syntax
- Log file access: Can't read log files
- Port conflict: Port 12345 in use
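Quick checks; the log path and the `alloy` service user are assumptions about your setup:

```bash
# Port conflict
ss -tlnp | grep 12345

# Can the Alloy service user read the files it is configured to tail?
sudo -u alloy head -n 1 /var/log/syslog
```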
## No Data Appearing

### No Metrics in InfluxDB
Symptoms: InfluxDB running but no data from Telegraf
Diagnosis:
```bash
# Check Telegraf is sending
journalctl -u telegraf | grep -i "wrote batch"

# Test Telegraf output
telegraf --config /etc/telegraf/telegraf.conf --test

# Query InfluxDB
curl -X POST "http://localhost:8086/api/v2/query?org=myorg" \
  -H "Authorization: Token YOUR_TOKEN" \
  -H "Content-Type: application/vnd.flux" \
  --data-binary 'from(bucket: "telegraf") |> range(start: -5m) |> limit(n: 1)'
```
Common causes:

- Authentication failure: Invalid token
- Network connectivity: Can't reach InfluxDB
- Wrong bucket: Writing to a different bucket than the one being queried
- Clock skew: Time difference too large
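Checks for each cause; the org, bucket, and token placeholders follow the query example above:

```bash
# Authentication and bucket: list buckets with the same token Telegraf uses (requires the influx CLI)
influx bucket list --host http://localhost:8086 --org myorg --token YOUR_TOKEN

# Network connectivity from the Telegraf host
curl -I http://localhost:8086/health

# Clock skew: compare UTC time on the Telegraf host and the InfluxDB host
date -u
```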
### No Logs in Loki
Symptoms: Loki running but no logs from Alloy
Diagnosis:
```bash
# Check Alloy is sending
journalctl -u alloy | grep -i "batch"

# Query Loki
curl -G "http://localhost:3100/loki/api/v1/query" \
  --data-urlencode 'query={hostname="YOUR_HOST"}' \
  --data-urlencode 'limit=5'

# Check Loki labels
curl http://localhost:3100/loki/api/v1/labels
```
Common causes:

- Log sources not producing logs: Nothing new has been written since Alloy started
- Alloy can't read logs: Permission or path problems on the log files
- Label mismatch: Querying wrong labels
- Network connectivity: Can't reach Loki
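A useful isolation step is to push a test line straight to Loki's API; if it appears, the problem is on the Alloy side (host and labels are placeholders):

```bash
# Push a single test log line to Loki (returns 204 on success)
curl -X POST "http://localhost:3100/loki/api/v1/push" \
  -H "Content-Type: application/json" \
  --data-raw "{\"streams\": [{\"stream\": {\"job\": \"test\"}, \"values\": [[\"$(date +%s%N)\", \"manual test line\"]]}]}"

# Query it back
curl -G "http://localhost:3100/loki/api/v1/query" \
  --data-urlencode 'query={job="test"}'
```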
## High Resource Usage

### InfluxDB High Memory
Symptoms: InfluxDB using excessive RAM
Diagnosis:
```bash
# Check memory usage
podman stats influxdb

# Check cache size
curl http://localhost:8086/metrics | grep cache
```
Solutions:

1. Reduce cache sizes in configuration
2. Add more RAM
3. Reduce retention period
4. Use S3 backend to offload storage
### Loki High Memory
Symptoms: Loki using excessive RAM
Diagnosis:
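Mirroring the InfluxDB checks above, assuming a container named `loki`:

```bash
# Check memory usage
podman stats loki

# Loki exposes its own heap and chunk metrics
curl http://localhost:3100/metrics | grep -E "go_memstats_heap_inuse_bytes|loki_ingester_memory_chunks"
```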
Solutions:

1. Reduce chunk cache size
2. Reduce ingestion rate limits
3. Use S3 backend
4. Increase flush frequency (smaller chunks)
### High Disk I/O
Symptoms: Disk utilization at 100%
Diagnosis:
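To see which device and which process are responsible (`iostat` is part of sysstat; `iotop` is a separate package):

```bash
# Per-device utilization, refreshed every second
iostat -x 1

# Per-process I/O
sudo iotop -o

# Per-container view
podman stats
```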
Solutions:

1. Use faster storage (SSD/NVMe)
2. Use S3 backend for storage
3. Reduce collection frequency (Telegraf)
4. Reduce log volume (Alloy)
5. Adjust retention policies
## Network Issues

### Can't Connect to InfluxDB
Diagnosis:
```bash
# From client host
telnet INFLUXDB_HOST 8086
curl -I http://INFLUXDB_HOST:8086/health

# Check firewall
sudo iptables -L -n | grep 8086
sudo firewall-cmd --list-all
```
Solutions (commands for each step are shown below):

1. Open firewall port
2. Check WireGuard (if used)
3. Verify InfluxDB is listening
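Example commands for each step; the firewall commands assume firewalld:

```bash
# 1. Open the InfluxDB port
sudo firewall-cmd --permanent --add-port=8086/tcp
sudo firewall-cmd --reload

# 2. Check the WireGuard tunnel is up and has recent handshakes
sudo wg show

# 3. Verify InfluxDB is listening on the expected interface
ss -tlnp | grep 8086
```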
### Can't Connect to Loki
Same troubleshooting as InfluxDB, but for port 3100.
## Data Quality Issues

### Gaps in Metrics
Symptoms: Missing data points in InfluxDB
Diagnosis:
```bash
# Check Telegraf was running continuously
journalctl -u telegraf --since "1 hour ago" | grep -i "stopped\|started"

# Check for collection errors
journalctl -u telegraf | grep -i error
```
Causes:

1. Telegraf restarted
2. Network interruption
3. InfluxDB unavailable
4. System clock issues
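To locate the gaps, a Flux query can count points per window and keep only the empty windows; the bucket, measurement, and field names are assumptions based on a default Telegraf setup:

```bash
# Windows with _value == 0 are gaps in the cpu measurement
curl -X POST "http://localhost:8086/api/v2/query?org=myorg" \
  -H "Authorization: Token YOUR_TOKEN" \
  -H "Content-Type: application/vnd.flux" \
  --data-binary 'from(bucket: "telegraf")
    |> range(start: -6h)
    |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_idle")
    |> aggregateWindow(every: 1m, fn: count, createEmpty: true)
    |> filter(fn: (r) => r._value == 0)'
```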
### Duplicate Logs
Symptoms: Same log line appearing multiple times in Loki
Causes:

1. Multiple Alloy instances shipping same logs
2. Alloy restarted and re-read logs
3. Log rotation issues
Solutions:

1. Ensure only one Alloy per log source
2. Use unique position tracking
3. Check Alloy configuration for duplicates
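Two quick checks for the first two causes; the storage path is an assumption and depends on the `--storage.path` your Alloy unit uses:

```bash
# More than one Alloy process on the host usually means duplicated shipping
pgrep -a alloy

# Position tracking lives under Alloy's storage path; if it was wiped, files are re-read from the start
ls -l /var/lib/alloy/data
```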
## Performance Issues

### Slow Queries
InfluxDB:
```bash
# Check query performance
curl http://localhost:8086/metrics | grep query_duration

# Solutions:
# - Add time range filters
# - Reduce query complexity
# - Increase cache sizes
# - Use continuous queries/tasks for common queries
```
Loki:
```bash
# Check query performance
curl http://localhost:3100/metrics | grep request_duration

# Solutions:
# - Always use time ranges
# - Use specific label selectors
# - Avoid regex when possible
# - Use instant queries for tables
```
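For reference, a well-scoped Loki query keeps the selector specific and the time range bounded; the `unit` label here is an assumption about your relabeling:

```bash
# Bounded range query with a specific selector and a line filter
curl -G "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={hostname="YOUR_HOST", unit="nginx.service"} |= "error"' \
  --data-urlencode "start=$(date -d '1 hour ago' +%s)000000000" \
  --data-urlencode "end=$(date +%s)000000000" \
  --data-urlencode 'limit=100'
```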
### Slow Ingestion
Symptoms: Data delayed by minutes
Diagnosis:
```bash
# Check batch sizes and flush intervals
journalctl -u telegraf | grep "wrote batch"

# Check for resource constraints
top
iostat -x 1
```
Solutions:

1. Increase flush interval (batch more data)
2. Reduce collection frequency
3. Add more resources (CPU, memory, disk)
4. Use S3 backend to reduce disk I/O
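Solutions 1 and 2 map to the `[agent]` section of `telegraf.conf`; a quick way to review the current values (the keys in the comment are the ones to adjust, the values are only illustrative):

```bash
# Relevant [agent] keys (illustrative values):
#   interval = "30s"             collection frequency
#   flush_interval = "30s"       how often batches are written to outputs
#   metric_batch_size = 5000     points per write
#   metric_buffer_limit = 50000  points buffered if the output is slow
grep -A 10 '^\[agent\]' /etc/telegraf/telegraf.conf
```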
## Configuration Issues

### Invalid Alloy Configuration
Symptoms: Alloy won't start or reload
Diagnosis:
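As in the startup section, `alloy fmt` plus the service log usually identify the offending line (config path is an assumption):

```bash
# Reports syntax errors with line numbers
alloy fmt /etc/alloy/config.alloy

# The failing component is usually named in the log
journalctl -u alloy -n 50 --no-pager | grep -i error
```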
Common errors:

1. Syntax errors (missing braces, quotes)
2. Invalid component names
3. Circular dependencies
4. Invalid regex patterns
Fix:

1. Run `alloy fmt` to auto-fix formatting
2. Check the logs for the specific error line
3. Test the configuration before deploying
### Invalid Telegraf Configuration
Diagnosis:
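The built-in test mode used earlier in this guide is usually enough to pinpoint the problem:

```bash
# Fails with a descriptive error on TOML or plugin problems
telegraf --config /etc/telegraf/telegraf.conf --test
```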
Common errors:

1. TOML syntax errors
2. Invalid plugin names
3. Missing required fields
4. Type mismatches
## Getting Help

When troubleshooting fails:

- Collect diagnostic info:

  ```bash
  # Services status
  systemctl status influxdb loki telegraf alloy > /tmp/services.txt

  # Logs
  journalctl -u influxdb -n 200 > /tmp/influxdb.log
  journalctl -u loki -n 200 > /tmp/loki.log
  journalctl -u telegraf -n 200 > /tmp/telegraf.log
  journalctl -u alloy -n 200 > /tmp/alloy.log

  # Container info
  podman ps -a > /tmp/containers.txt
  podman inspect influxdb > /tmp/influxdb-inspect.json
  ```

- Check documentation: Reference specific component docs
- Search issues: Check GitHub issues for known problems
- Ask community: Provide diagnostic info when asking for help