Audit Retention and Cleanup
This guide covers the automated retention and cleanup system for audit indices in PDaaS. The cleanup system ensures compliance with retention policies while managing storage costs by automatically deleting old audit indices.
Overview
The audit cleanup system provides:
- Automated Cleanup: Scheduled daily job deletes indices older than retention period
- Configurable Retention: Global, per-organization, and per-service retention policies
- Safety Guards: Built-in protections against accidental data deletion
- Manual Tools: CLI utilities for ad-hoc cleanup and analysis
- Audit Trail: All cleanup operations are themselves audited
- Monitoring: Comprehensive metrics and alerting for cleanup failures
Default Configuration
- Default Retention Period: 90 days
- Minimum Age Safety Guard: 7 days (indices younger than this are never deleted)
- Schedule: Daily at 1 AM UTC
- Dry-Run Mode: Available for testing without actual deletion
How It Works
Automated Daily Cleanup
The cleanup process runs automatically every day at 1 AM UTC:
- Scan Indices: Identifies all audit indices matching the pattern audit-*
- Parse Metadata: Extracts organization, account, service, and date from index names
- Apply Retention Policy: Determines retention period based on hierarchical rules
- Safety Check: Ensures indices are older than 7 days (minimum age guard)
- Age Verification: Compares index age against retention period
- Delete Indices: Removes indices exceeding retention period
- Track Metrics: Updates Prometheus metrics for monitoring
- Audit Operation: Creates audit event documenting cleanup results
Index Naming Pattern
Audit indices follow a standardized naming pattern:
audit-{organization_id}-{account_id}-{service}-{YYYY-MM-DD}
Examples:
audit-org123-acc456-api-2025-09-30
audit-org456-acc789-worker-2025-10-01
audit-system-system-cleanup-2025-10-02
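The naming pattern lends itself to a regular expression. The following is a minimal sketch, not the PDaaS implementation: `parse_index_name`, `IndexInfo`, and `INDEX_RE` are hypothetical names, and the pattern assumes that organization IDs, account IDs, and service names contain no hyphens (which holds for the examples above).

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumption: IDs and service names never contain "-", so each field is [^-]+.
INDEX_RE = re.compile(
    r"^audit-(?P<org>[^-]+)-(?P<account>[^-]+)-(?P<service>[^-]+)-"
    r"(?P<date>\d{4}-\d{2}-\d{2})$"
)


@dataclass(frozen=True)
class IndexInfo:
    organization_id: str
    account_id: str
    service: str
    index_date: date


def parse_index_name(name: str) -> Optional[IndexInfo]:
    """Return the parsed metadata, or None if the name does not match the pattern."""
    match = INDEX_RE.match(name)
    if match is None:
        return None
    try:
        index_date = date.fromisoformat(match["date"])
    except ValueError:
        # Digits were present but do not form a real calendar date.
        return None
    return IndexInfo(match["org"], match["account"], match["service"], index_date)
```

Returning `None` for names that do not match lets a caller skip unknown indices instead of guessing at their age.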
Retention Policy Hierarchy
Retention policies are applied in priority order:
- Service-Specific Override (highest priority)
  - Example: org_123 → api service → 365 days
- Organization Default
  - Example: org_123 → 180 days
- Global Default (lowest priority)
  - Example: All organizations → 90 days
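The priority order amounts to a most-specific-first lookup. The sketch below illustrates that idea with plain dictionaries; `resolve_retention_days`, `ORG_DEFAULTS`, and `SERVICE_OVERRIDES` are hypothetical names used only for illustration, and the real policy store may look different.

```python
from typing import Dict, Optional

GLOBAL_DEFAULT_DAYS = 90  # documented global default

# Hypothetical in-memory policy data mirroring the examples above.
ORG_DEFAULTS: Dict[str, int] = {"org_123": 180}
SERVICE_OVERRIDES: Dict[str, Dict[str, int]] = {"org_123": {"api": 365}}


def resolve_retention_days(organization_id: str, service: Optional[str] = None) -> int:
    """Pick the most specific policy: service override, then org default, then global."""
    if service is not None:
        override = SERVICE_OVERRIDES.get(organization_id, {}).get(service)
        if override is not None:
            return override
    return ORG_DEFAULTS.get(organization_id, GLOBAL_DEFAULT_DAYS)
```

With this data, the api service of org_123 resolves to 365 days, any other service of org_123 to 180 days, and every other organization to the 90-day global default.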
Configuration
Environment Variables
Configure cleanup behavior via environment variables:
# Enable/disable automatic cleanup
AUDIT_CLEANUP_ENABLED=true
# Global default retention period (days)
AUDIT_RETENTION_DAYS=90
# Cleanup schedule (cron format)
AUDIT_CLEANUP_SCHEDULE="0 1 * * *" # Daily at 1 AM UTC
# Dry-run mode (test without deleting)
AUDIT_CLEANUP_DRY_RUN=false
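For illustration, these variables could be read into a typed settings object as sketched below. `CleanupSettings` and `load_cleanup_settings` are hypothetical names, not part of the PDaaS codebase, and the fallback values used when a variable is unset are assumptions taken from the example above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class CleanupSettings:
    enabled: bool
    retention_days: int
    schedule: str
    dry_run: bool


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else counts as False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_cleanup_settings(env: Mapping[str, str] = os.environ) -> CleanupSettings:
    """Build settings from environment-style key/value pairs."""
    retention_days = int(env.get("AUDIT_RETENTION_DAYS", "90"))
    if retention_days < 1:
        raise ValueError("AUDIT_RETENTION_DAYS must be a positive integer")
    return CleanupSettings(
        # Assumed fallbacks: the values shown in the example configuration.
        enabled=_as_bool(env.get("AUDIT_CLEANUP_ENABLED", "true")),
        retention_days=retention_days,
        schedule=env.get("AUDIT_CLEANUP_SCHEDULE", "0 1 * * *"),
        dry_run=_as_bool(env.get("AUDIT_CLEANUP_DRY_RUN", "false")),
    )
```

Passing the mapping as a parameter keeps the function easy to test without touching the process environment.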
Retention Policy Configuration
Global Default
Set the global default retention period in your .env file:
AUDIT_RETENTION_DAYS=90
Per-Organization Policies
Set organization-specific retention policies via CLI:
# Set organization retention to 180 days
python -m backend.audit.cli.cleanup set-org-policy \
--organization-id org_123 \
--retention-days 180
Per-Service Overrides
Set service-specific retention within an organization:
# API logs retained for 365 days, other services use org default (180 days)
python -m backend.audit.cli.cleanup set-org-policy \
--organization-id org_123 \
--retention-days 180 \
--service api \
--service-retention 365
Note: Database persistence for organization policies is planned for future implementation. Currently, policies must be configured in code or via startup scripts.
CLI Usage
The cleanup CLI provides several commands for managing audit indices.
Run Cleanup Manually
Execute cleanup job on-demand:
# Dry-run (preview deletions without actually deleting)
python -m backend.audit.cli.cleanup run --dry-run --retention-days 90
# Execute cleanup
python -m backend.audit.cli.cleanup run --retention-days 90
# Custom retention period
python -m backend.audit.cli.cleanup run --retention-days 180
Output Example:
Running cleanup job (retention: 90 days, dry-run: False)
============================================================
============================================================
CLEANUP SUMMARY
============================================================
Indices scanned: 245
Indices deleted: 32
Storage freed: 1,234.56 MB
Duration: 12.34 seconds
✓ Cleanup completed successfully
List All Indices
View all audit indices with metadata:
# List all indices sorted by date
python -m backend.audit.cli.cleanup list-indices
# Sort by age (oldest first)
python -m backend.audit.cli.cleanup list-indices --sort-by age
# Sort by size (largest first) in reverse
python -m backend.audit.cli.cleanup list-indices --sort-by size --reverse
# Filter by organization
python -m backend.audit.cli.cleanup list-indices --organization-id org_123
Output Example:
+-------------------------------------------+------------+------------+------------+-------------+----------+-----------+
| Index Name | Date | Age (days) | Size (MB) | Documents | Org ID | Service |
+===========================================+============+============+============+=============+==========+===========+
| audit-org123-acc456-api-2024-07-01 | 2024-07-01 | 95 | 45.23 | 12,345 | org123 | api |
| audit-org123-acc456-api-2024-07-02 | 2024-07-02 | 94 | 43.12 | 11,987 | org123 | api |
| audit-org456-acc789-worker-2024-09-15 | 2024-09-15 | 20 | 12.45 | 3,456 | org456 | worker |
+-------------------------------------------+------------+------------+------------+-------------+----------+-----------+
Total indices: 245
Total storage: 5,678.90 MB
Total documents: 1,234,567
View Cleanup Statistics
Analyze what would be deleted without performing cleanup:
# Preview cleanup impact
python -m backend.audit.cli.cleanup stats --retention-days 90
Output Example:
Analyzing indices (retention: 90 days)...
============================================================
============================================================
CLEANUP STATISTICS
============================================================
Total indices: 245
Deletable indices: 32 (13.1%)
Total storage: 5,678.90 MB
Storage to be freed: 1,234.56 MB (21.7%)
Retention period: 90 days
Indices older than 90 days will be deleted (minimum age: 7 days)
Set Organization Retention Policy
Configure organization-specific retention:
# Basic organization policy
python -m backend.audit.cli.cleanup set-org-policy \
--organization-id org_123 \
--retention-days 180
# With service override
python -m backend.audit.cli.cleanup set-org-policy \
--organization-id org_123 \
--retention-days 180 \
--service api \
--service-retention 365
Output Example:
============================================================
RETENTION POLICY (Preview)
============================================================
Organization ID: org_123
Default retention: 180 days
Service overrides:
- api: 365 days
⚠ Note: Database persistence not yet implemented.
This policy is validated but not stored.
Future implementation will persist to organization settings table.
Safety Features
Minimum Age Guard
The cleanup system includes a 7-day minimum age safeguard:
- Indices less than 7 days old are never deleted, regardless of retention policy
- Prevents accidental deletion of recent data due to misconfiguration
- Hard-coded protection that cannot be disabled
Example:
# Even if retention is set to 1 day, only indices 7+ days old are deleted
if age_days < 7:
    logger.debug(f"Skipping {index_name}: too recent ({age_days} days)")
    return False  # Do not delete
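Combined with the retention check, the guard gives a two-step decision. The sketch below is one way to express it; `should_delete` is a hypothetical helper, and treating an index whose age exactly equals the retention period as retained is an assumption based on the wording "older than".

```python
from datetime import date

MIN_AGE_DAYS = 7  # documented minimum age guard


def should_delete(index_date: date, retention_days: int, today: date) -> bool:
    """Apply the minimum age guard first, then the retention period."""
    age_days = (today - index_date).days
    if age_days < MIN_AGE_DAYS:
        return False  # too recent, regardless of retention policy
    # Assumption: only indices strictly older than the retention period are deleted.
    return age_days > retention_days
```

Taking `today` as a parameter instead of calling `date.today()` inside the function makes the decision reproducible in tests and dry runs.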
Dry-Run Mode
Test cleanup operations without actually deleting data:
# Preview what would be deleted
python -m backend.audit.cli.cleanup run --dry-run --retention-days 90
Dry-run mode:
- Lists all indices that would be deleted
- Shows storage that would be freed
- Makes no changes to OpenSearch
- Safe for production testing
Error Handling
The cleanup system handles errors gracefully:
- Connection Errors: Retries with exponential backoff
- Permission Errors: Logs and continues with next index
- Timeout Errors: Tracks and alerts operations team
- Partial Failures: Continues cleanup even if some indices fail
- Fatal Errors: Stops execution and alerts immediately
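As an illustration of the retry behavior for connection errors, here is a minimal exponential backoff sketch. `retry_with_backoff` is a hypothetical helper, and the attempt count and base delay are illustrative values, not documented PDaaS settings.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying ConnectionError with delays of base, 2*base, 4*base, ..."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

Only `ConnectionError` is retried here; other exception types propagate immediately, so the caller decides whether to log them and continue with the next index or to stop.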
Monitoring and Alerting
Prometheus Metrics
The cleanup system exports comprehensive metrics:
# Cleanup execution metrics
audit_cleanup_indices_scanned # Total indices examined
audit_cleanup_indices_deleted_total # Total indices deleted
audit_cleanup_storage_freed_mb_total # Total storage freed (MB)
audit_cleanup_duration_seconds # Cleanup job duration
audit_cleanup_last_run_timestamp # Last successful run timestamp
# Error tracking
audit_cleanup_errors_total{error_type} # Errors by type
Alert Rules
Configure these alerts for production:
1. Cleanup Job Failed
alert: AuditCleanupJobFailed
expr: time() - audit_cleanup_last_run_timestamp > 90000  # 25 hours
for: 10m
labels:
  severity: warning
annotations:
  summary: "Audit cleanup job hasn't run in 25+ hours"
  description: "Daily cleanup job may have failed or is disabled"
2. High Deletion Failure Rate
alert: AuditCleanupHighFailureRate
expr: |
  rate(audit_cleanup_errors_total[5m]) /
  rate(audit_cleanup_indices_deleted_total[5m]) > 0.1
for: 15m
labels:
  severity: warning
annotations:
  summary: "More than 10% of index deletions failing"
  description: "Check OpenSearch connectivity and permissions"
3. Unexpected High Deletion
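alert: AuditCleanupUnexpectedlyHigh
# The metric is a cumulative counter, so compare its increase over a window
expr: increase(audit_cleanup_indices_deleted_total[24h]) > 1000
for: 1m
labels:
  severity: critical
annotations:
  summary: "Cleanup deleted more than 1000 indices in 24 hours"
  description: "Possible misconfiguration - review retention policies"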
Dashboards
Create Grafana dashboards to visualize cleanup operations:
Cleanup Overview Dashboard:
- Total indices over time
- Storage usage trend
- Cleanup job success rate
- Deletion rate by organization
- Error breakdown by type
Example Grafana Queries:
# Storage freed per day
increase(audit_cleanup_storage_freed_mb_total[24h])
# Deletion success rate
(
rate(audit_cleanup_indices_deleted_total[1h]) /
(rate(audit_cleanup_indices_deleted_total[1h]) + rate(audit_cleanup_errors_total[1h]))
) * 100
# Average cleanup duration
avg_over_time(audit_cleanup_duration_seconds[7d])
Audit Trail
All cleanup operations create audit events for compliance:
{
  "action": "audit.cleanup",
  "target": "audit-indices",
  "actor_type": "system",
  "actor_id": "audit_cleanup_worker",
  "occurred_at": "2025-10-03T01:00:00.000Z",
  "metadata": {
    "started_at": "2025-10-03T01:00:00.000Z",
    "completed_at": "2025-10-03T01:00:15.234Z",
    "duration_seconds": 15.234,
    "dry_run": false,
    "indices_scanned": 245,
    "indices_deleted": 32,
    "storage_freed_mb": 1234.56,
    "errors": []
  }
}
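An event of this shape could be assembled from the results of a run roughly as follows. This is a sketch only: `build_cleanup_event` is a hypothetical helper, and using the start time as `occurred_at` is inferred from the example event rather than documented behavior.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


def _iso_utc(ts: datetime) -> str:
    """Format an aware datetime as UTC with millisecond precision and a Z suffix."""
    text = ts.astimezone(timezone.utc).isoformat(timespec="milliseconds")
    return text.replace("+00:00", "Z")


def build_cleanup_event(
    started_at: datetime,
    completed_at: datetime,
    indices_scanned: int,
    indices_deleted: int,
    storage_freed_mb: float,
    dry_run: bool = False,
    errors: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Assemble an audit event dict with the fields shown in the example."""
    return {
        "action": "audit.cleanup",
        "target": "audit-indices",
        "actor_type": "system",
        "actor_id": "audit_cleanup_worker",
        "occurred_at": _iso_utc(started_at),  # assumption: event time = start time
        "metadata": {
            "started_at": _iso_utc(started_at),
            "completed_at": _iso_utc(completed_at),
            "duration_seconds": (completed_at - started_at).total_seconds(),
            "dry_run": dry_run,
            "indices_scanned": indices_scanned,
            "indices_deleted": indices_deleted,
            "storage_freed_mb": storage_freed_mb,
            "errors": list(errors or []),
        },
    }
```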
Query cleanup history:
GET audit-system-system-*/_search
{
  "query": {
    "term": { "action": "audit.cleanup" }
  },
  "sort": [
    { "occurred_at": "desc" }
  ]
}
Worker Integration
The cleanup job runs as a scheduled worker (see Workers Development Guide).
Worker Configuration
The AuditCleanupWorker is automatically registered and scheduled:
# backend/workers/audit_cleanup_worker.py
class AuditCleanupWorker(ScheduledWorker):
    def __init__(self):
        super().__init__(
            name="audit_cleanup",
            schedule="0 1 * * *",  # Daily at 1 AM UTC
            enabled=config.cleanup_enabled,
            timezone="UTC"
        )
Running the Worker
Start the cleanup worker:
# Run all workers (includes cleanup)
python -m backend.workers.cli run
# Run only cleanup worker
python -m backend.workers.cli run --worker audit_cleanup
# Run once (ignore schedule)
python -m backend.workers.cli run --worker audit_cleanup --once
Worker Health Checks
Monitor worker health:
# Check worker status
curl http://localhost:8001/health
# Worker metrics
curl http://localhost:8001/metrics | grep worker
Troubleshooting
Issue 1: Cleanup Not Running
Symptoms:
- No indices being deleted
- audit_cleanup_last_run_timestamp not updating
Diagnosis:
# Check if cleanup is enabled
echo $AUDIT_CLEANUP_ENABLED
# Check worker status
python -m backend.workers.cli status
# Check worker logs
docker logs api-container | grep -i "audit cleanup"
Solutions:
- Ensure AUDIT_CLEANUP_ENABLED=true
- Verify worker is running: python -m backend.workers.cli run --worker audit_cleanup
- Check OpenSearch connectivity
- Review worker logs for errors
Issue 2: Indices Not Being Deleted
Symptoms:
- Cleanup runs successfully
- No indices deleted despite being old
Diagnosis:
# Check retention configuration
python -m backend.audit.cli.cleanup stats --retention-days 90
# List old indices
python -m backend.audit.cli.cleanup list-indices --sort-by age --reverse
Possible Causes:
- Retention period too long: Check AUDIT_RETENTION_DAYS
- Minimum age guard: Indices must be 7+ days old
- Organization override: Org-specific retention may be longer
- Dry-run mode: Check AUDIT_CLEANUP_DRY_RUN=false
Issue 3: High Deletion Failure Rate
Symptoms:
- Alert: AuditCleanupHighFailureRate
- Many errors in audit_cleanup_errors_total metric
Diagnosis:
# Check recent errors
docker logs api-container | grep -i "failed to delete index" | tail -20
# Test OpenSearch connection
curl -u admin:password https://opensearch:9200/_cluster/health
# Check permissions
curl -u admin:password -X DELETE https://opensearch:9200/test-index
Solutions:
- Connection issues: Verify network connectivity to OpenSearch
- Permission issues: Ensure cleanup user has delete permissions
- Timeout issues: Increase OpenSearch timeout settings
- Index locks: Check for active operations on indices
Issue 4: Unexpected Storage Growth
Symptoms:
- Storage usage increasing despite cleanup running
- More indices present than the retention period should allow
Diagnosis:
# Analyze storage by organization
python -m backend.audit.cli.cleanup list-indices --organization-id org_123
# Check deletion statistics
python -m backend.audit.cli.cleanup stats --retention-days 90
Solutions:
- Review retention policies: May be too long for your needs
- Check for failed deletions: Review audit_cleanup_errors_total
- Verify index lifecycle: Ensure indices are rotated correctly
- Manual cleanup: Run cleanup run to force immediate cleanup
Best Practices
Retention Policy Design
- Compliance First: Set retention based on regulatory requirements
  - GDPR: 30-90 days typical
  - SOC2: 90-365 days typical
  - HIPAA: 2190 days (6 years) required
- Service-Specific Policies: Different services may need different retention
    # High-value API calls: 1 year
    # Background workers: 90 days
    # System operations: 30 days
- Storage Budget: Balance compliance with storage costs
  - Monitor storage usage trends
  - Adjust retention based on capacity
  - Consider cold storage for long-term retention
Operational Guidelines
- Test in Non-Production First
  - Always dry-run before executing: --dry-run
  - Validate retention policies in staging
  - Monitor metrics after policy changes
- Regular Audits
  - Review cleanup audit events monthly
  - Verify storage trends align with expectations
  - Check for failed deletions
- Monitoring Setup
  - Configure all recommended alerts
  - Create cleanup dashboard in Grafana
  - Set up alert notifications (Slack, PagerDuty)
- Backup Strategy
  - Consider snapshots before major policy changes
  - Document restore procedures
  - Test recovery from backups quarterly
Capacity Planning
Estimate storage needs based on:
Daily Storage = (Average Events/Day × Average Event Size)
Retention Storage = Daily Storage × Retention Days
Example Calculation:
- Average events: 100,000/day
- Average event size: 2 KB
- Retention: 90 days
Daily Storage = 100,000 × 2 KB = 200 MB/day
Retention Storage = 200 MB × 90 = 18 GB
Add 20-30% buffer for growth and index overhead.
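The same arithmetic can be wrapped in a small helper for trying out different inputs. This is a sketch: `estimate_retention_storage_gb` is a hypothetical name, and it uses decimal units (1 MB = 1000 KB, 1 GB = 1000 MB) to match the worked example.

```python
def estimate_retention_storage_gb(
    events_per_day: int,
    avg_event_kb: float,
    retention_days: int,
    buffer: float = 0.0,
) -> float:
    """Estimate storage in GB; buffer is a fraction, e.g. 0.25 for 25% headroom."""
    daily_mb = events_per_day * avg_event_kb / 1000  # KB -> MB (decimal)
    total_gb = daily_mb * retention_days / 1000      # MB -> GB (decimal)
    return total_gb * (1 + buffer)
```

For the example above, `estimate_retention_storage_gb(100_000, 2, 90)` gives 18 GB, and a 25% buffer raises that to 22.5 GB.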
Advanced Topics
Custom Retention Policies in Code
For programmatic policy management:
from backend.audit.cleanup.policy import RetentionPolicy
from backend.audit.config import AuditConfig
# Initialize policy manager
config = AuditConfig()
policy = RetentionPolicy(config)
# Set organization policy
policy.set_organization_policy(
    organization_id="org_123",
    retention_days=180,
    service_overrides={
        "api": 365,    # API logs: 1 year
        "worker": 90,  # Worker logs: 90 days
        "system": 30   # System logs: 30 days
    }
)
# Retrieve retention for specific context
api_retention = policy.get_retention_days("org_123", "api")        # 365
worker_retention = policy.get_retention_days("org_123", "worker")  # 90
default_retention = policy.get_retention_days("org_456")           # 90 (global)
Integration with Index Lifecycle Management (ILM)
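Future enhancement: hand index aging over to a lifecycle policy. Note that OpenSearch provides this capability as Index State Management (ISM), while ILM is the Elasticsearch feature; the policy below is an illustrative sketch of the intended phases, not a policy that can be applied as-is:
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {}
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "replica_count": { "number_of_replicas": 1 }
        }
      },
      "cold": {
        "min_age": "60d",
        "actions": {
          "readonly": {}
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}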
Selective Data Retention (GDPR Right to Erasure)
For GDPR compliance, implement user data anonymization:
# Anonymize user data (planned feature)
from backend.audit.cleanup import anonymize_user_data
await anonymize_user_data(
user_id="user_123",
organization_id="org_456",
replacement_id="deleted_user_abc123" # Hash-based anonymous ID
)
This replaces all occurrences of user_id with an anonymous identifier while preserving audit integrity.
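One way a hash-based anonymous ID could be derived is with a keyed hash, sketched below. `make_replacement_id` is a hypothetical helper, and the use of HMAC-SHA256 with a secret key is an assumption; the planned feature may generate identifiers differently.

```python
import hashlib
import hmac


def make_replacement_id(user_id: str, secret_key: bytes, length: int = 12) -> str:
    """Derive a stable anonymous ID from user_id using HMAC-SHA256.

    The same user_id and key always give the same ID, so events from one
    erased user stay linked to each other without exposing the original ID.
    """
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"deleted_user_{digest[:length]}"
```

A keyed hash is used instead of a plain hash so that someone who knows a user ID cannot recompute the anonymous ID without also holding the key.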
Support
For cleanup-related questions or issues:
- Documentation: Review this guide and Operations Runbook
- CLI Help: Run python -m backend.audit.cli.cleanup --help
- Worker Guide: See Workers Development Guide
- Monitoring: Check Grafana dashboards and Prometheus metrics
- Support: Contact your PDaaS administrator or SRE team