Troubleshooting Guide

Overview

This guide covers common issues and their solutions. Follow the systematic approach: identify, diagnose, fix, verify.

General Troubleshooting Process

  1. Identify: What is not working?
  2. Diagnose: Check logs, status, connectivity
  3. Fix: Apply solution
  4. Verify: Confirm fix worked
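
The identify step can be partly scripted. The following is a minimal sketch, assuming the four systemd unit names used throughout this guide; failing_units is a hypothetical helper, not part of any tool.

```shell
# failing_units: read "unit state" pairs on stdin and print the units
# whose state is anything other than "active".
failing_units() {
  awk '$2 != "active" { print $1 }'
}

# Example (assumes systemd units with these names exist):
#   for u in influxdb loki telegraf alloy; do
#     echo "$u $(systemctl is-active "$u")"
#   done | failing_units
```

Any unit it prints is a candidate for the diagnose step in the sections below.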

Service Won't Start

InfluxDB Won't Start

Symptoms:

  - systemctl status influxdb shows failed
  - Container not running

Diagnosis:

journalctl -u influxdb -n 100
podman logs influxdb

Common causes:

  1. Port conflict: Port 8086 already in use

    sudo netstat -tulpn | grep 8086
    # Fix: Stop conflicting service
    

  2. Storage permission issues:

    ls -ld /var/lib/influxdb2
    # Fix: Ensure correct ownership
    sudo chown -R 1000:1000 /var/lib/influxdb2
    

  3. Invalid configuration:

    # Check for config errors in logs
    journalctl -u influxdb | grep -i error
    

  4. Out of disk space:

    df -h /var/lib/influxdb2
    # Fix: Free up space or expand storage
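
A plain grep for 8086 also matches unrelated lines, such as port 18086 or a PID that contains those digits. As a sketch of a stricter port-conflict check, the hypothetical helper below filters socket listings by exact port. It assumes ss (from iproute2) is available as an alternative to netstat.

```shell
# listeners_on_port PORT: filter `ss -tlnpH` output (stdin) down to
# sockets whose local port is exactly PORT.
listeners_on_port() {
  awk -v port="$1" '{
    n = split($4, parts, ":")   # column 4 is the local address, e.g. 0.0.0.0:8086 or [::]:8086
    if (parts[n] == port) print
  }'
}

# Example:
#   sudo ss -tlnpH | listeners_on_port 8086
```

The same helper works for the Loki (3100) and Alloy (12345) port checks later in this guide.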
    

Loki Won't Start

Diagnosis:

journalctl -u loki -n 100
podman logs loki

Common causes:

  1. Port conflict: Port 3100 already in use

    sudo netstat -tulpn | grep 3100
    

  2. Configuration syntax error:

    # Validate config (if you have loki binary)
    loki -config.file=/etc/loki/loki.yaml -verify-config
    

  3. Storage issues: Same as InfluxDB

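    ls -ld /var/lib/loki
    # Fix: Ownership must match the UID the Loki container runs as
    # (1000 here; the official Loki image defaults to 10001)
    sudo chown -R 1000:1000 /var/lib/loki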
    

Telegraf Won't Start

Diagnosis:

journalctl -u telegraf -n 100

Common causes:

  1. Configuration syntax error:

    telegraf --config /etc/telegraf/telegraf.conf --test
    

  2. Plugin error: Invalid plugin configuration

    # Check logs for specific plugin errors
    journalctl -u telegraf | grep -i "error"
    

  3. Permission issues: Can't read system metrics

    # Add telegraf user to required groups
    sudo usermod -aG docker telegraf  # For Docker metrics
    

Alloy Won't Start

Diagnosis:

journalctl -u alloy -n 100

Common causes:

  1. Configuration syntax error:

    alloy validate /etc/alloy/config.alloy
    alloy fmt /etc/alloy/config.alloy
    

  2. Log file access: Can't read log files

    # Check file permissions
    ls -l /var/log/apache2/access.log
    # Add alloy user to adm group
    sudo usermod -aG adm alloy
    

  3. Port conflict: Port 12345 in use

    sudo netstat -tulpn | grep 12345
    

No Data Appearing

No Metrics in InfluxDB

Symptoms: InfluxDB running but no data from Telegraf

Diagnosis:

# Check Telegraf is sending
journalctl -u telegraf | grep -i "wrote batch"

# Test Telegraf input collection (--test prints metrics to stdout and does not write to outputs)
telegraf --config /etc/telegraf/telegraf.conf --test

# Query InfluxDB
curl -X POST "http://localhost:8086/api/v2/query?org=myorg" \
  -H "Authorization: Token YOUR_TOKEN" \
  -H "Content-Type: application/vnd.flux" \
  --data-binary 'from(bucket: "telegraf") |> range(start: -5m) |> limit(n: 1)'

Common causes:

  1. Authentication failure: Invalid token

    # Check Telegraf logs for "unauthorized"
    journalctl -u telegraf | grep -i "401\|unauthorized"
    # Fix: Update Telegraf token
    

  2. Network connectivity: Can't reach InfluxDB

    # Test from Telegraf host
    curl http://INFLUXDB_HOST:8086/health
    # Fix: Check firewall, WireGuard
    

  3. Wrong bucket: Writing to wrong bucket

    # Verify bucket exists
    influx bucket list
    # Check Telegraf output configuration
    grep bucket /etc/telegraf/telegraf.conf
    

  4. Clock skew: Time difference too large

    # Check time sync
    timedatectl status
    # Fix: Enable NTP
    sudo timedatectl set-ntp true
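
For the connectivity check in cause 2, the /health response body can be reduced to a single word for use in scripts. This is a sketch only: health_status is a hypothetical helper, and it assumes the response is JSON with a top-level "status" field, which reads "pass" on a healthy InfluxDB 2.x instance.

```shell
# health_status: extract the value of the first "status" field from a
# JSON body on stdin (no jq required).
health_status() {
  grep -o '"status" *: *"[^"]*"' | head -n 1 | cut -d '"' -f 4
}

# Example (INFLUXDB_HOST is a placeholder, as in the commands above):
#   curl -s http://INFLUXDB_HOST:8086/health | health_status
```

An empty result means the request failed or returned something other than the expected JSON, which points back at the firewall or WireGuard checks.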
    

No Logs in Loki

Symptoms: Loki running but no logs from Alloy

Diagnosis:

# Check Alloy is sending
journalctl -u alloy | grep -i "batch"

# Query Loki (range query; defaults to the last hour)
curl -G "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={hostname="YOUR_HOST"}' \
  --data-urlencode 'limit=5'

# Check Loki labels
curl http://localhost:3100/loki/api/v1/labels

Common causes:

  1. Log sources not producing logs:

    # Check fail2ban is running
    systemctl status fail2ban
    # Check logs exist
    journalctl -u fail2ban -n 5
    

  2. Alloy can't read logs:

    # Check file permissions
    ls -l /var/log/apache2/access.log
    # Check Alloy logs for permission errors
    journalctl -u alloy | grep -i "permission denied"
    

  3. Label mismatch: Querying wrong labels

    # List all labels
    curl http://localhost:3100/loki/api/v1/labels
    # List label values
    curl http://localhost:3100/loki/api/v1/label/service_type/values
    

  4. Network connectivity: Can't reach Loki

    curl http://LOKI_HOST:3100/ready
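
The label-mismatch check in cause 3 can be automated. The sketch below assumes the labels endpoint returns single-line JSON of the form {"status":"success","data":[...]}. has_label is a hypothetical helper, and jq would be a more robust choice where it is installed.

```shell
# has_label NAME: succeed if NAME is listed in the "data" array of a
# Loki labels response read from stdin.
has_label() {
  sed -n 's/.*"data" *: *\[\([^]]*\)].*/\1/p' |
    tr ',' '\n' | tr -d '" ' | grep -Fxq -- "$1"
}

# Example:
#   curl -s http://localhost:3100/loki/api/v1/labels | has_label service_type \
#     || echo "label service_type not found"
```

Matching whole entries inside the data array avoids false positives from other fields in the response, such as "status".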
    

High Resource Usage

InfluxDB High Memory

Symptoms: InfluxDB using excessive RAM

Diagnosis:

# Check memory usage
podman stats influxdb

# Check cache size
curl http://localhost:8086/metrics | grep cache

Solutions:

  1. Reduce cache sizes in configuration
  2. Add more RAM
  3. Reduce retention period
  4. Use S3 backend to offload storage
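
Raw /metrics output mixes sample lines with "# HELP" and "# TYPE" comments, so a bare grep returns both. A small sketch that keeps only the samples follows; metric_samples is a hypothetical helper, and the metric name in the example is an assumption (it is commonly exposed by Go services but may differ per version).

```shell
# metric_samples PATTERN: print Prometheus sample lines matching PATTERN,
# skipping the "# HELP" / "# TYPE" comment lines.
metric_samples() {
  grep -v '^#' | grep -- "$1"
}

# Example:
#   curl -s http://localhost:8086/metrics | metric_samples go_memstats_alloc_bytes
```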

Loki High Memory

Symptoms: Loki using excessive RAM

Diagnosis:

podman stats loki
curl http://localhost:3100/metrics | grep memory

Solutions:

  1. Reduce chunk cache size
  2. Reduce ingestion rate limits
  3. Use S3 backend
  4. Increase flush frequency (smaller chunks)

High Disk I/O

Symptoms: Disk utilization at 100%

Diagnosis:

iostat -x 1
iotop

Solutions:

  1. Use faster storage (SSD/NVMe)
  2. Use S3 backend for storage
  3. Reduce collection frequency (Telegraf)
  4. Reduce log volume (Alloy)
  5. Adjust retention policies

Network Issues

Can't Connect to InfluxDB

Diagnosis:

# From client host
telnet INFLUXDB_HOST 8086
curl -I http://INFLUXDB_HOST:8086/health

# Check firewall
sudo iptables -L -n | grep 8086
sudo firewall-cmd --list-all

Solutions:

  1. Open firewall port:

    sudo firewall-cmd --add-port=8086/tcp --permanent
    sudo firewall-cmd --reload
    

  2. Check WireGuard (if used):

    sudo wg show
    ping 10.10.0.11
    

  3. Verify InfluxDB listening:

    sudo netstat -tulpn | grep 8086
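
Where telnet is not installed, the reachability test can be sketched with bash's built-in /dev/tcp pseudo-device. tcp_open is a hypothetical helper; it assumes bash and the coreutils timeout command are present on the client host.

```shell
# tcp_open HOST PORT: succeed if a TCP connection to HOST:PORT can be
# opened within 3 seconds. Host and port are passed as $0 and $1 to the
# inner bash so they are never interpolated into the script text.
tcp_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example (INFLUXDB_HOST is a placeholder):
#   tcp_open INFLUXDB_HOST 8086 && echo reachable || echo unreachable
```

A failure here with the service confirmed listening locally usually narrows the problem to the firewall or the WireGuard tunnel.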
    

Can't Connect to Loki

Follow the same steps as for InfluxDB above, substituting port 3100 and the /ready endpoint for /health.

Data Quality Issues

Gaps in Metrics

Symptoms: Missing data points in InfluxDB

Diagnosis:

# Check Telegraf was running continuously
journalctl -u telegraf --since "1 hour ago" | grep -i "stopped\|started"

# Check for collection errors
journalctl -u telegraf | grep -i error

Causes:

  1. Telegraf restarted
  2. Network interruption
  3. InfluxDB unavailable
  4. System clock issues
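
To locate gaps rather than eyeball them, a list of sample timestamps can be scanned for steps larger than the collection interval. This sketch assumes the timestamps have already been exported as ascending epoch seconds, one per line (how they are exported is left open). find_gaps is a hypothetical helper.

```shell
# find_gaps MAX: read ascending epoch-second timestamps on stdin and
# print "previous current delta" for every step larger than MAX seconds.
find_gaps() {
  awk -v max="$1" 'NR > 1 && $1 - prev > max { print prev, $1, $1 - prev } { prev = $1 }'
}

# Example: with a 10 s collection interval, flag anything over 15 s
#   find_gaps 15 < timestamps.txt
```

The gap boundaries can then be compared against the Telegraf restart times found with journalctl above.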

Duplicate Logs

Symptoms: Same log line appearing multiple times in Loki

Causes:

  1. Multiple Alloy instances shipping same logs
  2. Alloy restarted and re-read logs
  3. Log rotation issues

Solutions:

  1. Ensure only one Alloy per log source
  2. Use unique position tracking
  3. Check Alloy configuration for duplicates
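
To confirm duplication before changing configuration, exported log lines can be counted. A minimal sketch, assuming the lines from a Loki query have been saved to a file (the file name below is hypothetical); duplicate_lines is a hypothetical helper.

```shell
# duplicate_lines: read log lines on stdin and print "count<TAB>line"
# for every line that occurs more than once.
duplicate_lines() {
  awk '{ seen[$0]++ } END { for (line in seen) if (seen[line] > 1) print seen[line] "\t" line }'
}

# Example:
#   duplicate_lines < exported-lines.txt
```

Lines that legitimately repeat (identical messages with no timestamp in the line body) will also show up, so treat the output as a hint rather than proof.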

Performance Issues

Slow Queries

InfluxDB:

# Check query performance
curl http://localhost:8086/metrics | grep query_duration

# Solutions:
# - Add time range filters
# - Reduce query complexity
# - Increase cache sizes
# - Use tasks (the InfluxDB 2.x replacement for continuous queries) to precompute common queries

Loki:

# Check query performance
curl http://localhost:3100/metrics | grep request_duration

# Solutions:
# - Always use time ranges
# - Use specific label selectors
# - Avoid regex when possible
# - Use instant queries for tables

Slow Ingestion

Symptoms: Data delayed by minutes

Diagnosis:

# Check batch sizes and flush intervals
journalctl -u telegraf | grep "wrote batch"

# Check for resource constraints
top
iostat -x 1

Solutions:

  1. Increase flush interval (batch more data)
  2. Reduce collection frequency
  3. Add more resources (CPU, memory, disk)
  4. Use S3 backend to reduce disk I/O

Configuration Issues

Invalid Alloy Configuration

Symptoms: Alloy won't start or reload

Diagnosis:

alloy validate /etc/alloy/config.alloy
alloy fmt /etc/alloy/config.alloy

Common errors:

  1. Syntax errors (missing braces, quotes)
  2. Invalid component names
  3. Circular dependencies
  4. Invalid regex patterns

Fix:

  1. Run alloy fmt to normalize formatting (it prints the result to stdout unless --write is given)
  2. Check logs for the specific error line
  3. Test configuration before deploying
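
Testing before deploying can be enforced with a small guard that only reloads when validation passes. This is a sketch: deploy_if_valid is a hypothetical helper, and the restart command in the example is an assumption about how Alloy is managed on the host.

```shell
# deploy_if_valid VALIDATE_CMD RELOAD_CMD: run RELOAD_CMD only when
# VALIDATE_CMD exits with status 0.
deploy_if_valid() {
  if sh -c "$1"; then
    sh -c "$2"
  else
    echo "validation failed; not reloading" >&2
    return 1
  fi
}

# Example:
#   deploy_if_valid 'alloy validate /etc/alloy/config.alloy' \
#                   'sudo systemctl restart alloy'
```

The same guard applies to Telegraf by swapping in its --test invocation as the validation command.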

Invalid Telegraf Configuration

Diagnosis:

telegraf --config /etc/telegraf/telegraf.conf --test

Common errors:

  1. TOML syntax errors
  2. Invalid plugin names
  3. Missing required fields
  4. Type mismatches

Getting Help

When troubleshooting fails:

  1. Collect diagnostic info:

    # Services status
    systemctl status influxdb loki telegraf alloy > /tmp/services.txt
    
    # Logs
    journalctl -u influxdb -n 200 > /tmp/influxdb.log
    journalctl -u loki -n 200 > /tmp/loki.log
    journalctl -u telegraf -n 200 > /tmp/telegraf.log
    journalctl -u alloy -n 200 > /tmp/alloy.log
    
    # Container info
    podman ps -a > /tmp/containers.txt
    podman inspect influxdb > /tmp/influxdb-inspect.json
    

  2. Check documentation: Reference specific component docs

  3. Search issues: Check GitHub issues for known problems

  4. Ask community: Provide diagnostic info when asking for help
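
The files collected in step 1 are easier to share as a single archive. A minimal sketch, assuming they were written to a dedicated directory (the /tmp/diag path in the example is hypothetical) rather than directly to /tmp; bundle_diagnostics is a hypothetical helper.

```shell
# bundle_diagnostics DIR [ARCHIVE]: pack DIR into a gzip-compressed tarball
# and print the archive path. ARCHIVE defaults to DIR.tar.gz.
bundle_diagnostics() {
  src=$1
  out=${2:-"$src.tar.gz"}
  tar -czf "$out" -C "$(dirname "$src")" "$(basename "$src")" && printf '%s\n' "$out"
}

# Example:
#   bundle_diagnostics /tmp/diag
```

Review the archive before posting it publicly: podman inspect output includes the container's environment variables, which may contain tokens.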