Telegraf Role

Overview

Telegraf is a plugin-driven server agent for collecting and sending metrics. It supports a wide variety of input plugins for system metrics, application metrics, and custom data sources.

Purpose

  • Collect system metrics (CPU, memory, disk, network)
  • Collect application metrics (databases, web servers, containers)
  • Send metrics to InfluxDB for storage and analysis
  • Lightweight agent with minimal resource overhead

Installation

The telegraf role installs and configures Telegraf on target hosts:

- role: jackaltx.solti_monitoring.telegraf
  vars:
    telegraf_output_influxdb: true
    telegraf_output_url: "http://monitor.example.com:8086"
    telegraf_output_token: "{{ vault_telegraf_token }}"
    telegraf_output_org: "myorg"
    telegraf_output_bucket: "telegraf"

Key Configuration Options

Output Configuration

InfluxDB output (primary):

telegraf_output_influxdb: true
telegraf_output_url: "http://10.10.0.11:8086"  # InfluxDB API endpoint
telegraf_output_token: "{{ vault_influxdb_token }}"
telegraf_output_org: "myorg"
telegraf_output_bucket: "telegraf"

Input Plugins

System metrics (enabled by default):

  • cpu - CPU usage statistics
  • disk - Disk space usage
  • diskio - Disk I/O statistics
  • mem - Memory usage
  • net - Network interface statistics
  • system - System load and uptime
  • processes - Process count

Additional plugins (opt-in):

telegraf_enable_docker: true        # Docker container metrics
telegraf_enable_nginx: true         # Nginx web server metrics
telegraf_enable_postgresql: true    # PostgreSQL database metrics
telegraf_enable_redis: true         # Redis metrics

Global Tags

Add custom labels to all metrics:

telegraf_global_tags:
  environment: "production"
  datacenter: "dc1"
  region: "us-east"

Service Management

Systemd Service

Telegraf runs as a systemd service:

# Check status
systemctl status telegraf

# Start/stop/restart
systemctl start telegraf
systemctl stop telegraf
systemctl restart telegraf

# Enable at boot
systemctl enable telegraf

Configuration File

Primary configuration: /etc/telegraf/telegraf.conf

Additional configs: /etc/telegraf/telegraf.d/*.conf

Testing Configuration

Run the configured inputs once and print the gathered metrics to stdout, without starting the service or writing to any output:

telegraf --config /etc/telegraf/telegraf.conf --test
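The command above names only the primary file. To exercise the drop-in files under /etc/telegraf/telegraf.d as well, the directory can be passed explicitly with Telegraf's --config-directory flag. A minimal sketch; the function name telegraf_dry_run is illustrative and not part of the role:

```shell
# Hypothetical helper (not part of the role): run every configured input once
# and print the gathered metrics, loading the main file plus the drop-in directory.
# $1 = main config file (optional), $2 = drop-in directory (optional)
telegraf_dry_run() {
  telegraf \
    --config "${1:-/etc/telegraf/telegraf.conf}" \
    --config-directory "${2:-/etc/telegraf/telegraf.d}" \
    --test
}
```

Calling telegraf_dry_run with no arguments uses the paths listed under Configuration File; a non-zero exit status usually points at a parse error in one of the loaded files.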

Resource Requirements

Minimal footprint:

  • CPU: < 1% average
  • Memory: 128MB
  • Disk: 50MB for binary and configs
  • Network: Depends on metric collection frequency

Example Configurations

Basic System Monitoring

Monitor local system only:

telegraf_output_influxdb: true
telegraf_output_url: "http://localhost:8086"
telegraf_output_token: "{{ vault_token }}"
telegraf_output_org: "myorg"
telegraf_output_bucket: "telegraf"

# Only system metrics
telegraf_enable_docker: false
telegraf_enable_nginx: false

Remote Collection via WireGuard

Ship metrics to a central server over WireGuard:

telegraf_output_influxdb: true
telegraf_output_url: "http://10.10.0.11:8086"  # WireGuard IP
telegraf_output_token: "{{ vault_token }}"
telegraf_output_org: "myorg"
telegraf_output_bucket: "telegraf"

telegraf_global_tags:
  hostname: "{{ ansible_hostname }}"
  site: "remote"

Multi-Output Configuration

Send metrics to multiple destinations:

telegraf_outputs:
  - type: "influxdb_v2"
    url: "http://primary.example.com:8086"
    token: "{{ vault_primary_token }}"

  - type: "influxdb_v2"
    url: "http://backup.example.com:8086"
    token: "{{ vault_backup_token }}"

Troubleshooting

Check Logs

journalctl -u telegraf -f

Test Connection to InfluxDB

curl -I http://monitor.example.com:8086/health
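For use in scripts, the same check can be reduced to an exit status. The sketch below assumes an InfluxDB 2.x server, whose /health endpoint returns a JSON body containing "status":"pass" when it is ready; the function name influx_health is illustrative:

```shell
# Hypothetical helper: succeed only if the /health body reports status "pass".
# $1 = base URL of the InfluxDB server (optional)
influx_health() {
  # -f: treat HTTP errors as failures, -sS: quiet but still print errors,
  # --max-time: give up instead of hanging on an unreachable host
  curl -fsS --max-time 5 "${1:-http://monitor.example.com:8086}/health" \
    | grep -Eq '"status"[[:space:]]*:[[:space:]]*"pass"'
}
```

For example, `influx_health http://10.10.0.11:8086 && echo healthy || echo "unreachable or unhealthy"`.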

Verify Metrics Collection

Check that Telegraf is collecting metrics:

telegraf --config /etc/telegraf/telegraf.conf --test --input-filter cpu,mem,disk
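The --test output can be long. In test mode Telegraf prints each gathered metric in line protocol on a line starting with "> ", so a short filter can summarize how many metrics each measurement produced. A sketch, with the illustrative name summarize_test_output:

```shell
# Hypothetical helper: read `telegraf --test` output on stdin and print
# "<measurement> <count>" for every measurement seen, sorted by name.
summarize_test_output() {
  awk '/^> / {
         line = substr($0, 3)          # drop the leading "> "
         split(line, parts, /[, ]/)    # measurement ends at first comma or space
         count[parts[1]]++
       }
       END { for (m in count) printf "%s %d\n", m, count[m] }' | sort
}
```

Usage: `telegraf --config /etc/telegraf/telegraf.conf --test --input-filter cpu,mem,disk 2>/dev/null | summarize_test_output`. A measurement that is missing from the summary suggests the corresponding input is not enabled or is failing.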

Common Issues

  1. Connection refused: InfluxDB not reachable
       • Check network connectivity
       • Verify InfluxDB is running
       • Check firewall rules

  2. Authentication failed: Invalid token
       • Verify token has write permissions
       • Check token in InfluxDB UI

  3. No data in InfluxDB: Metrics not being sent
       • Check telegraf logs for errors
       • Verify output configuration
       • Run with the --test flag to confirm inputs produce metrics (--test does not write to outputs)
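To tell these cases apart from the Telegraf host, one option is to attempt a single write against the InfluxDB 2.x HTTP API and inspect the status code. This is a sketch under assumptions: the function name influx_write_probe and the measurement name telegraf_probe are illustrative, and a successful call stores one test point in the target bucket:

```shell
# Hypothetical helper: attempt one write and translate the HTTP status.
# NOTE: on success this stores a single test point in the target bucket.
# $1 = base URL, $2 = org, $3 = bucket, $4 = token
influx_write_probe() {
  local code
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
    -X POST "${1}/api/v2/write?org=${2}&bucket=${3}&precision=s" \
    -H "Authorization: Token ${4}" \
    --data-binary 'telegraf_probe value=1')" || true
  case "$code" in
    204)     echo "write accepted" ;;
    401|403) echo "authentication or permission problem (HTTP $code)" ;;
    404)     echo "organization or bucket not found (HTTP 404)" ;;
    000|"")  echo "no HTTP response: check network, service, firewall" ;;
    *)       echo "unexpected HTTP status $code" ;;
  esac
}
```

For example, `influx_write_probe http://10.10.0.11:8086 myorg telegraf "$TOKEN"`, where TOKEN holds the same token that Telegraf is configured with.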

Performance Tuning

Collection Interval

Adjust collection frequency:

telegraf_interval: "60s"       # Collect every 60 seconds
telegraf_flush_interval: "60s" # Flush to output every 60 seconds
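The effect of the interval on volume is simple arithmetic: each enabled input is gathered once per interval. A small sketch of the calculation, assuming the interval is given in whole seconds (the function name is illustrative):

```shell
# Number of collection rounds per host per day for an interval in seconds.
collections_per_day() {
  echo $(( 86400 / $1 ))
}

collections_per_day 10   # prints 8640
collections_per_day 60   # prints 1440
```

Moving from a 10-second to a 60-second interval therefore cuts the number of collection rounds, and roughly the number of points written, by a factor of six.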

Metric Filtering

Reduce metric volume:

telegraf_metric_filters:
  - measurement: "cpu"
    tags:
      cpu: ["cpu-total"]  # Only collect aggregate CPU stats

Security Considerations

  1. Token Security: Store tokens in Ansible Vault
  2. TLS/SSL: Use HTTPS endpoints when possible
  3. Network Security: Use WireGuard for remote collectors
  4. Least Privilege: Grant minimum required permissions to tokens
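For the first point, one way to produce a vaulted variable such as vault_telegraf_token is ansible-vault's encrypt_string subcommand, which can read the secret from stdin so that it does not end up in shell history. A minimal sketch; the wrapper name vault_encrypt_token is illustrative, and a vault password source is assumed to be configured or will be prompted for:

```shell
# Hypothetical helper: read a token from stdin and print an encrypted YAML
# variable that can be pasted into a vars file.
# $1 = variable name (optional)
vault_encrypt_token() {
  ansible-vault encrypt_string --stdin-name "${1:-vault_telegraf_token}"
}
```

Usage: `printf '%s' "$INFLUX_TOKEN" | vault_encrypt_token vault_telegraf_token`, where INFLUX_TOKEN is an environment variable holding the token.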

Reference Deployment

See the Reference Deployments chapter for real-world examples:

  • monitor11.example.com - Server with local Telegraf
  • ispconfig3.example.com - Client shipping metrics via WireGuard