System Metrics¶
Overview¶
System metrics provide visibility into host-level resource usage: CPU, memory, disk, network, and processes. These are collected by Telegraf and stored in InfluxDB.
Built-in System Metrics¶
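Telegraf collects these metrics by default (no additional configuration needed). Each metric below is listed as measurement_field: cpu_usage_user, for example, is the usage_user field of the cpu measurement, so Flux queries filter on _measurement and _field separately.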
CPU Metrics¶
Plugin: cpu
Metrics collected:
- cpu_usage_user - User space CPU usage percentage
- cpu_usage_system - System/kernel CPU usage percentage
- cpu_usage_idle - Idle CPU percentage
- cpu_usage_iowait - CPU waiting for I/O percentage
- cpu_usage_irq - Hardware interrupt CPU usage percentage
- cpu_usage_softirq - Software interrupt CPU usage percentage
Tags:
- cpu - CPU identifier (cpu0, cpu1, cpu-total)
- host - Hostname
Example query:
from(bucket: "telegraf")
  |> range(start: -1h)
  |> filter(fn: (r) => r["_measurement"] == "cpu")
  |> filter(fn: (r) => r["_field"] == "usage_user")
  |> filter(fn: (r) => r["cpu"] == "cpu-total")
Memory Metrics¶
Plugin: mem
Metrics collected:
- mem_total - Total memory in bytes
- mem_available - Available memory in bytes
- mem_used - Used memory in bytes
- mem_free - Free memory in bytes
- mem_cached - Cached memory in bytes
- mem_buffered - Buffered memory in bytes
- mem_used_percent - Memory usage percentage
Example query:
from(bucket: "telegraf")
  |> range(start: -1h)
  |> filter(fn: (r) => r["_measurement"] == "mem")
  |> filter(fn: (r) => r["_field"] == "used_percent")
Disk Metrics¶
Plugin: disk
Metrics collected:
- disk_total - Total disk space in bytes
- disk_used - Used disk space in bytes
- disk_free - Free disk space in bytes
- disk_used_percent - Disk usage percentage
- disk_inodes_total - Total inodes
- disk_inodes_used - Used inodes
- disk_inodes_free - Free inodes
Tags:
- path - Mount point (/, /home, /var)
- device - Device name (sda1, nvme0n1p1)
- fstype - Filesystem type (ext4, xfs)
Example query:
from(bucket: "telegraf")
  |> range(start: -1h)
  |> filter(fn: (r) => r["_measurement"] == "disk")
  |> filter(fn: (r) => r["_field"] == "used_percent")
  |> filter(fn: (r) => r["path"] == "/")
Disk I/O Metrics¶
Plugin: diskio
Metrics collected:
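- diskio_reads - Number of completed reads (cumulative counter)
- diskio_writes - Number of completed writes (cumulative counter)
- diskio_read_bytes - Bytes read (cumulative counter)
- diskio_write_bytes - Bytes written (cumulative counter)
- diskio_read_time - Time spent reading, in milliseconds
- diskio_write_time - Time spent writing, in milliseconds
- diskio_io_time - Total time spent on I/O, in milliseconds
- diskio_iops_in_progress - I/O operations currently in progress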
Tags:
- name - Device name (sda, nvme0n1)
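Example query (a sketch following the pattern of the other sections; the device name sda comes from the tag examples above and is an assumption about your host). Because read_bytes and write_bytes are cumulative counters, derivative() converts them to bytes per second:

```flux
from(bucket: "telegraf")
  |> range(start: -1h)
  |> filter(fn: (r) => r["_measurement"] == "diskio")
  |> filter(fn: (r) => r["_field"] == "read_bytes" or r["_field"] == "write_bytes")
  // "sda" is a placeholder; replace it with a device present on your host
  |> filter(fn: (r) => r["name"] == "sda")
  // Convert the cumulative counters to a per-second rate, ignoring counter resets
  |> derivative(unit: 1s, nonNegative: true)
```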
Network Metrics¶
Plugin: net
Metrics collected (all are cumulative counters, so apply derivative() to obtain rates):
- net_bytes_sent - Bytes transmitted
- net_bytes_recv - Bytes received
- net_packets_sent - Packets transmitted
- net_packets_recv - Packets received
- net_err_in - Inbound errors
- net_err_out - Outbound errors
- net_drop_in - Inbound packets dropped
- net_drop_out - Outbound packets dropped
Tags:
- interface - Network interface (eth0, ens3, wg0)
Example query:
from(bucket: "telegraf")
  |> range(start: -1h)
  |> filter(fn: (r) => r["_measurement"] == "net")
  |> filter(fn: (r) => r["_field"] == "bytes_sent")
  |> derivative(unit: 1s, nonNegative: true)
Process Metrics¶
Plugin: processes
Metrics collected:
- processes_running - Number of running processes
- processes_sleeping - Number of sleeping processes
- processes_stopped - Number of stopped processes
- processes_zombies - Number of zombie processes
- processes_total - Total process count
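Example query (a minimal sketch in the same style as the other sections; the one-hour range is an arbitrary choice). It returns the most recent zombie process count for each host:

```flux
from(bucket: "telegraf")
  |> range(start: -1h)
  |> filter(fn: (r) => r["_measurement"] == "processes")
  |> filter(fn: (r) => r["_field"] == "zombies")
  // Keep only the latest sample in each series
  |> last()
```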
System Load¶
Plugin: system
Metrics collected:
- system_load1 - 1-minute load average
- system_load5 - 5-minute load average
- system_load15 - 15-minute load average
- system_uptime - System uptime in seconds
- system_n_cpus - Number of CPUs
- system_n_users - Number of logged-in users
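Example query (a sketch only; the six-hour range and five-minute window are assumptions chosen for illustration). It plots the three load averages side by side:

```flux
from(bucket: "telegraf")
  |> range(start: -6h)
  |> filter(fn: (r) => r["_measurement"] == "system")
  |> filter(fn: (r) => r["_field"] == "load1" or r["_field"] == "load5" or r["_field"] == "load15")
  // Smooth each load series into 5-minute averages
  |> aggregateWindow(every: 5m, fn: mean)
```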
Configuration¶
System metrics are enabled by default; no additional configuration is needed:
- role: jackaltx.solti_monitoring.telegraf
  vars:
    telegraf_output_influxdb: true
    # System metrics automatically enabled
Filtering Metrics¶
To reduce metric volume, filter by tags:
telegraf_metric_filters:
  # Only collect aggregate CPU stats
  - measurement: "cpu"
    tags:
      cpu: ["cpu-total"]
  # Only monitor root filesystem
  - measurement: "disk"
    tags:
      path: ["/"]
Common Queries¶
CPU Usage Over Time¶
from(bucket: "telegraf")
  |> range(start: -24h)
  |> filter(fn: (r) => r["_measurement"] == "cpu")
  |> filter(fn: (r) => r["_field"] == "usage_user")
  |> filter(fn: (r) => r["cpu"] == "cpu-total")
  |> aggregateWindow(every: 5m, fn: mean)
Memory Percentage¶
from(bucket: "telegraf")
  |> range(start: -1h)
  |> filter(fn: (r) => r["_measurement"] == "mem")
  |> filter(fn: (r) => r["_field"] == "used_percent")
Disk Space Available¶
from(bucket: "telegraf")
  |> range(start: -1h)
  |> filter(fn: (r) => r["_measurement"] == "disk")
  |> filter(fn: (r) => r["_field"] == "free")
  |> filter(fn: (r) => r["path"] == "/")
  |> last()
Network Throughput¶
from(bucket: "telegraf")
  |> range(start: -1h)
  |> filter(fn: (r) => r["_measurement"] == "net")
  |> filter(fn: (r) => r["_field"] == "bytes_sent" or r["_field"] == "bytes_recv")
  |> derivative(unit: 1s, nonNegative: true)
  |> aggregateWindow(every: 1m, fn: mean)
Alerting Thresholds¶
Recommended alert thresholds:
- CPU usage: > 80% for 5 minutes
- Memory usage: > 90%
- Disk usage: > 85%
- I/O wait (cpu_usage_iowait): > 30%
- Load average: > (CPU count × 2)
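As a rough sketch, two of these thresholds can be expressed as Flux queries that return rows only when the threshold is exceeded. Several choices here are assumptions rather than part of the recommendations: CPU usage is taken as 100 minus usage_idle, "for 5 minutes" is read as the mean over the last five minutes, and the load check uses load5.

```flux
// CPU usage above 80%, averaged over the last 5 minutes
from(bucket: "telegraf")
  |> range(start: -5m)
  |> filter(fn: (r) => r["_measurement"] == "cpu")
  |> filter(fn: (r) => r["_field"] == "usage_idle")
  |> filter(fn: (r) => r["cpu"] == "cpu-total")
  |> mean()
  // Assumption: total CPU usage = 100 - idle percentage
  |> map(fn: (r) => ({r with _value: 100.0 - r._value}))
  |> filter(fn: (r) => r._value > 80.0)
```

```flux
// Load average above (CPU count x 2)
from(bucket: "telegraf")
  |> range(start: -5m)
  |> filter(fn: (r) => r["_measurement"] == "system")
  |> filter(fn: (r) => r["_field"] == "load5" or r["_field"] == "n_cpus")
  |> last()
  // Put load5 and n_cpus in the same row so they can be compared
  |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
  |> filter(fn: (r) => r.load5 > float(v: r.n_cpus) * 2.0)
```

For a stricter CPU check that requires every sample in the window to exceed the threshold, replace mean() with max() on usage_idle, so the query fires only when the highest idle value is still below 20%.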
Reference¶
For the complete list of system metrics, see the Telegraf documentation:
- https://github.com/influxdata/telegraf/tree/master/plugins/inputs/cpu
- https://github.com/influxdata/telegraf/tree/master/plugins/inputs/mem
- https://github.com/influxdata/telegraf/tree/master/plugins/inputs/disk