Retention Policies
Overview¶
Retention policies define how long data is kept before being deleted. Proper retention configuration balances storage costs, compliance requirements, and data availability.
Why Retention Matters¶
- Cost Control: Limit storage growth
- Compliance: Meet regulatory requirements
- Performance: Smaller datasets query faster
- Capacity Planning: Predictable storage needs
InfluxDB Retention¶
Bucket-Level Retention¶
Each InfluxDB bucket has its own retention policy:
# Short-term bucket for detailed metrics
influxdb_buckets:
  - name: "telegraf_hourly"
    retention: "7d"
    description: "High-resolution metrics"
  - name: "telegraf_daily"
    retention: "90d"
    description: "Daily aggregates"
  - name: "telegraf_monthly"
    retention: "365d"
    description: "Monthly summaries"
Setting Retention via Role¶
- role: jackaltx.solti_monitoring.influxdb
  vars:
    influxdb_bucket: "telegraf"
    influxdb_retention: "30d"
Updating Retention¶
Change retention for an existing bucket:
Via API:
# 7776000 seconds = 90 days
curl -X PATCH "http://localhost:8086/api/v2/buckets/BUCKET_ID" \
  -H "Authorization: Token YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"retentionRules": [{"type": "expire", "everySeconds": 7776000}]}'
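The `everySeconds` value in the request body is the retention period expressed in seconds. As a minimal sketch, the conversion from a duration string such as "90d" can be written as follows. The helper names `duration_to_seconds` and `retention_payload` are hypothetical, and the set of supported unit suffixes is an assumption rather than something the role or the API defines.

```python
import json
import re

# Seconds per unit suffix. The supported suffixes are an assumption
# made for this sketch.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def duration_to_seconds(duration: str) -> int:
    """Convert a duration string such as "90d" to a number of seconds."""
    match = re.fullmatch(r"(\d+)([smhdw])", duration.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


def retention_payload(duration: str) -> str:
    """Build a JSON body shaped like the one in the PATCH request above."""
    rule = {"type": "expire", "everySeconds": duration_to_seconds(duration)}
    return json.dumps({"retentionRules": [rule]})


if __name__ == "__main__":
    print(retention_payload("90d"))
```

Calling `duration_to_seconds("90d")` yields 7776000, the value used in the curl example.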
Infinite Retention¶
In InfluxDB 2.x, a retention period of 0 keeps data indefinitely.
Warning: Infinite retention leads to unbounded storage growth.
Loki Retention¶
Global Retention¶
Configure retention for all log streams:
Stream-Level Retention¶
Configure different retention per log type (via labels):
loki_retention_config:
  - selector: '{service_type="audit"}'
    retention: "365d" # Keep audit logs for 1 year
  - selector: '{service_type="debug"}'
    retention: "7d" # Keep debug logs for 1 week
  - selector: '{service_type="application"}'
    retention: "30d" # Keep application logs for 1 month
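To illustrate how a stream's labels resolve to a retention period, here is a simplified sketch that mirrors the `loki_retention_config` example above. It is not Loki's implementation: it handles only equality matchers, takes the first matching rule, and falls back to a hypothetical `DEFAULT_RETENTION`. Loki itself also supports other matcher types and rule priorities.

```python
import re

# Mirrors the loki_retention_config example above.
RETENTION_RULES = [
    {"selector": '{service_type="audit"}', "retention": "365d"},
    {"selector": '{service_type="debug"}', "retention": "7d"},
    {"selector": '{service_type="application"}', "retention": "30d"},
]

# Hypothetical global default used when no rule matches.
DEFAULT_RETENTION = "30d"


def parse_selector(selector: str) -> dict:
    """Extract label="value" pairs from a selector like '{a="x", b="y"}'."""
    return dict(re.findall(r'(\w+)="([^"]*)"', selector))


def retention_for(labels: dict) -> str:
    """Return the retention of the first rule whose labels all match."""
    for rule in RETENTION_RULES:
        wanted = parse_selector(rule["selector"])
        if all(labels.get(key) == value for key, value in wanted.items()):
            return rule["retention"]
    return DEFAULT_RETENTION


if __name__ == "__main__":
    print(retention_for({"service_type": "audit", "host": "web1"}))
    print(retention_for({"service_type": "metrics"}))
```

A stream labeled `service_type="audit"` resolves to "365d", while a stream that matches no selector falls back to the default.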
Compaction and Deletion¶
Loki automatically:
1. Marks chunks older than retention for deletion
2. Waits for retention_delete_delay (default: 2h)
3. Deletes chunks during compaction
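The steps above imply that a chunk is removed no sooner than its retention period plus the delete delay. A minimal sketch of that arithmetic follows. The function name `earliest_deletion` is hypothetical, and the actual deletion happens at whichever compaction run follows this point in time.

```python
from datetime import datetime, timedelta, timezone


def earliest_deletion(
    chunk_end: datetime,
    retention: timedelta,
    delete_delay: timedelta = timedelta(hours=2),
) -> datetime:
    """Earliest time a chunk can be deleted.

    The chunk is marked once it is older than the retention period,
    then the delete delay (2h by default) must pass before removal.
    """
    return chunk_end + retention + delete_delay


if __name__ == "__main__":
    end = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(earliest_deletion(end, timedelta(days=30)))
```

For a chunk ending on 2024-01-01 00:00 UTC with 30-day retention, the earliest deletion time is 2024-01-31 02:00 UTC.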
Retention Strategy by Data Type¶
Metrics (InfluxDB)¶
System metrics (CPU, memory, disk):
- Retention: 30-90 days
- Rationale: Sufficient for troubleshooting and capacity planning
Application metrics (app-specific):
- Retention: 30-90 days
- Rationale: Correlate with logs and events
Business metrics (KPIs, analytics):
- Retention: 365+ days
- Rationale: Year-over-year comparison, trend analysis
Logs (Loki)¶
Application logs:
- Retention: 30 days
- Rationale: Recent troubleshooting
Security logs (fail2ban, auth):
- Retention: 90-365 days
- Rationale: Compliance, security audits
Audit logs (admin actions):
- Retention: 365+ days
- Rationale: Compliance requirements
Debug logs:
- Retention: 7 days
- Rationale: Temporary troubleshooting
Access logs (web, API):
- Retention: 30 days
- Rationale: Traffic analysis, debugging
Multi-Tier Retention¶
Downsampling Strategy¶
Keep detailed metrics short-term, aggregated metrics long-term:
- Tier 1 (Raw data): 7 days, 10-second intervals
- Tier 2 (5-minute aggregates): 30 days
- Tier 3 (1-hour aggregates): 365 days
- Tier 4 (Daily summaries): Forever
influxdb_buckets:
  - name: "telegraf_raw"
    retention: "7d"
  - name: "telegraf_5m"
    retention: "30d"
  - name: "telegraf_1h"
    retention: "365d"
Create downsampling tasks:
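// Runs every 5 minutes and writes 5-minute means to the next tier
option task = {name: "downsample_5m", every: 5m}
// The 10-minute lookback re-aggregates the previous window as well,
// so late-arriving points are picked up on the next run
from(bucket: "telegraf_raw")
    |> range(start: -10m)
    |> aggregateWindow(every: 5m, fn: mean)
    |> to(bucket: "telegraf_5m")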
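To see why the tiers keep storage bounded, it helps to count how many points a single series holds in each bucket. The sketch below is a rough estimate: it assumes one point per interval with no gaps, and it pairs the bucket names from the configuration above with the intervals from the tier list.

```python
from datetime import timedelta

# (bucket, retention, sample interval), taken from the tier list above.
TIERS = [
    ("telegraf_raw", timedelta(days=7), timedelta(seconds=10)),
    ("telegraf_5m", timedelta(days=30), timedelta(minutes=5)),
    ("telegraf_1h", timedelta(days=365), timedelta(hours=1)),
]


def points_per_series(retention: timedelta, interval: timedelta) -> int:
    """Number of points one series holds when the bucket is full."""
    return retention // interval


if __name__ == "__main__":
    for bucket, retention, interval in TIERS:
        print(f"{bucket}: {points_per_series(retention, interval)} points per series")
```

Under these assumptions the raw tier holds 60480 points per series, while the 5-minute and 1-hour tiers hold 8640 and 8760 despite covering far longer periods.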
Compliance Considerations¶
Regulatory Requirements¶
- GDPR: Personal data retention limits
- HIPAA: Healthcare data retention (6 years)
- SOX: Financial data retention (7 years)
- PCI-DSS: Payment card data retention limits
Implementing Compliance¶
- Identify regulated data: Which logs/metrics contain sensitive info
- Set appropriate retention: Match regulatory requirements
- Document policy: Maintain retention policy documentation
- Audit regularly: Verify retention settings
- Secure deletion: Ensure deleted data is unrecoverable
Example Compliance Configuration¶
# HIPAA-compliant retention
loki_retention_config:
  - selector: '{data_class="phi"}'
    retention: "2190d" # 6 years
  # GDPR-compliant retention
  - selector: '{data_class="pii"}'
    retention: "90d" # Delete after purpose fulfilled
Monitoring Retention¶
Check Current Retention¶
InfluxDB:
Loki:
Storage Growth Tracking¶
Monitor storage growth to verify retention is working:
# InfluxDB storage
du -sh /var/lib/influxdb2
# Loki storage
du -sh /var/lib/loki
# S3 bucket size (if using S3)
aws s3 ls --summarize --recursive s3://influx11/
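A single `du` reading shows the current size but not the trend. As a minimal sketch, the script below sums file sizes under a data directory and derives a daily growth rate from two readings. The function names are hypothetical, and the total is the apparent file size, which can differ slightly from the disk usage that `du` reports.

```python
import os


def directory_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files under path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.islink(file_path):
                continue
            try:
                total += os.path.getsize(file_path)
            except OSError:
                # Files can vanish while a database compacts; skip them.
                continue
    return total


def daily_growth_bytes(previous_bytes: int, current_bytes: int, days_between: float) -> float:
    """Average growth per day between two size readings."""
    return (current_bytes - previous_bytes) / days_between


if __name__ == "__main__":
    for data_dir in ("/var/lib/influxdb2", "/var/lib/loki"):
        print(data_dir, directory_size_bytes(data_dir))
```

Once a bucket has been filling for longer than its retention period, the growth rate should level off. Sustained growth beyond that point suggests retention is not being applied.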
Alerts¶
Set up alerts for:
- Storage growth exceeding expected rate
- Retention deletion failures
- Storage approaching capacity
Adjusting Retention¶
When to Increase Retention¶
- Compliance requirements change
- Need longer historical data
- Storage costs decrease
- Business needs change
When to Decrease Retention¶
- Storage costs too high
- Running out of disk space
- Data rarely accessed
- Compliance allows shorter retention
Impact Assessment¶
Before changing retention:
- Query patterns: Check how old data is typically queried
- Storage impact: Calculate storage savings
- User impact: Notify users of retention changes
- Compliance: Verify changes meet requirements
Backup vs Retention¶
Retention: Automatic deletion of old data
Backup: Separate copy for disaster recovery
Best practice: Retention ≠ Backup
- Set retention based on operational needs
- Create backups for disaster recovery
- Archive old data separately if needed for compliance
Cost Optimization¶
Storage Cost by Retention¶
Example calculation:
Current: 30d retention = 100 GB = $50/month
Option 1: 7d retention = 23 GB = $12/month (76% savings)
Option 2: 90d retention = 300 GB = $150/month (200% increase)
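The figures above follow from scaling the 30-day baseline linearly. The sketch below reproduces that arithmetic under two assumptions: storage grows in proportion to retention, and the price is $0.50 per GB per month, as implied by 100 GB costing $50. Real usage also depends on compression and ingest rate.

```python
BASELINE_DAYS = 30
BASELINE_GB = 100.0
PRICE_PER_GB_MONTH = 0.50  # implied by $50/month for 100 GB


def estimate(retention_days: int) -> tuple[float, float]:
    """Return (size in GB, monthly cost in dollars) for a retention period."""
    size_gb = BASELINE_GB * retention_days / BASELINE_DAYS
    return size_gb, size_gb * PRICE_PER_GB_MONTH


def change_vs_baseline(retention_days: int) -> float:
    """Percentage cost change relative to the 30-day baseline."""
    _, cost = estimate(retention_days)
    baseline_cost = BASELINE_GB * PRICE_PER_GB_MONTH
    return (cost - baseline_cost) / baseline_cost * 100


if __name__ == "__main__":
    for days in (7, 30, 90):
        size_gb, cost = estimate(days)
        print(f"{days}d: {size_gb:.0f} GB, ${cost:.2f}/month, {change_vs_baseline(days):+.1f}%")
```

Without intermediate rounding, 7-day retention works out to roughly 23.3 GB and a saving of about 76.7%, in line with the rounded figures in the example.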
Retention Recommendations¶
Cost-optimized:
- System metrics: 7-14 days
- Application logs: 7-14 days
- Security logs: 30 days (minimum)
Balanced:
- System metrics: 30 days
- Application logs: 30 days
- Security logs: 90 days
Compliance-focused:
- System metrics: 90 days
- Application logs: 90 days
- Security logs: 365 days
Reference Deployment¶
See the Reference Deployments chapter for retention configuration in production:
- monitor11.example.com - 30-day retention with S3 backend