Audit Retention and Cleanup

This guide covers the automated retention and cleanup system for audit indices in PDaaS. The cleanup system ensures compliance with retention policies while managing storage costs by automatically deleting old audit indices.

Overview

The audit cleanup system provides:

  • Automated Cleanup: A scheduled daily job deletes indices older than the retention period
  • Configurable Retention: Global, per-organization, and per-service retention policies
  • Safety Guards: Built-in protections against accidental data deletion
  • Manual Tools: CLI utilities for ad-hoc cleanup and analysis
  • Audit Trail: All cleanup operations are themselves audited
  • Monitoring: Comprehensive metrics and alerting for cleanup failures

Default Configuration

  • Default Retention Period: 90 days
  • Minimum Age Safety Guard: 7 days (indices younger than this are never deleted)
  • Schedule: Daily at 1 AM UTC
  • Dry-Run Mode: Available for testing without actual deletion

How It Works

Automated Daily Cleanup

The cleanup process runs automatically every day at 1 AM UTC:

  1. Scan Indices: Identifies all audit indices matching pattern audit-*
  2. Parse Metadata: Extracts organization, account, service, and date from index names
  3. Apply Retention Policy: Determines retention period based on hierarchical rules
  4. Safety Check: Ensures indices are at least 7 days old (minimum age guard)
  5. Age Verification: Compares index age against retention period
  6. Delete Indices: Removes indices exceeding retention period
  7. Track Metrics: Updates Prometheus metrics for monitoring
  8. Audit Operation: Creates audit event documenting cleanup results
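
The selection logic in steps 1, 4, and 5 can be sketched as follows. This is a minimal illustration, not the PDaaS implementation: the function names are hypothetical, a single retention period stands in for the hierarchical policy lookup, and treating "older than" as a strict comparison is an assumption.

```python
from datetime import date
from typing import Iterable, List

MIN_AGE_DAYS = 7  # minimum age safety guard described in this guide


def index_age_days(index_name: str, today: date) -> int:
    """Age in days, taken from the trailing YYYY-MM-DD of the index name."""
    return (today - date.fromisoformat(index_name[-10:])).days


def select_deletable(
    index_names: Iterable[str], retention_days: int, today: date
) -> List[str]:
    """Return the indices that pass both the safety check and the age check."""
    deletable = []
    for name in index_names:
        if not name.startswith("audit-"):
            continue  # step 1: only audit indices are considered
        age = index_age_days(name, today)
        if age < MIN_AGE_DAYS:
            continue  # step 4: too recent, never deleted
        if age > retention_days:
            deletable.append(name)  # step 5: exceeds the retention period
    return deletable
```

Because the minimum age check runs first, a retention period shorter than 7 days cannot cause recent indices to be selected.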

Index Naming Pattern

Audit indices follow a standardized naming pattern:

audit-{organization_id}-{account_id}-{service}-{YYYY-MM-DD}

Examples:

  • audit-org123-acc456-api-2025-09-30
  • audit-org456-acc789-worker-2025-10-01
  • audit-system-system-cleanup-2025-10-02
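
A minimal sketch of how such names could be parsed, assuming organization, account, and service identifiers contain no hyphens (as in the examples above). `parse_index_name` and `IndexInfo` are hypothetical names used for illustration only.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumption: IDs and service names contain no hyphens, as in the examples.
INDEX_PATTERN = re.compile(
    r"^audit-(?P<org>[^-]+)-(?P<account>[^-]+)-(?P<service>[^-]+)"
    r"-(?P<date>\d{4}-\d{2}-\d{2})$"
)


@dataclass(frozen=True)
class IndexInfo:
    organization_id: str
    account_id: str
    service: str
    index_date: date


def parse_index_name(name: str) -> Optional[IndexInfo]:
    """Return parsed metadata, or None if the name does not fit the pattern."""
    match = INDEX_PATTERN.match(name)
    if match is None:
        return None
    try:
        index_date = date.fromisoformat(match["date"])
    except ValueError:
        # The regex accepts digit groups such as 2025-13-40 that are not dates.
        return None
    return IndexInfo(match["org"], match["account"], match["service"], index_date)
```

Returning None for names that do not match lets a caller skip unrelated or malformed indices instead of failing the whole run.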

Retention Policy Hierarchy

Retention policies are applied in priority order:

  1. Service-Specific Override (highest priority)

    • Example: org_123, api service → 365 days
  2. Organization Default

    • Example: org_123 → 180 days
  3. Global Default (lowest priority)

    • Example: All organizations → 90 days
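
The priority order above amounts to a most-specific-match lookup. The sketch below illustrates it with plain dictionaries; the function name and data layout are assumptions made for this example and do not describe how PDaaS stores policies.

```python
from typing import Dict, Optional, Tuple

GLOBAL_DEFAULT_DAYS = 90  # mirrors the documented AUDIT_RETENTION_DAYS default


def resolve_retention_days(
    organization_id: str,
    service: Optional[str],
    org_defaults: Dict[str, int],
    service_overrides: Dict[Tuple[str, str], int],
    global_default: int = GLOBAL_DEFAULT_DAYS,
) -> int:
    """Pick the most specific retention period that applies."""
    # 1. Service-specific override (highest priority)
    if service is not None and (organization_id, service) in service_overrides:
        return service_overrides[(organization_id, service)]
    # 2. Organization default
    if organization_id in org_defaults:
        return org_defaults[organization_id]
    # 3. Global default (lowest priority)
    return global_default


org_defaults = {"org_123": 180}
service_overrides = {("org_123", "api"): 365}
print(resolve_retention_days("org_123", "api", org_defaults, service_overrides))     # 365
print(resolve_retention_days("org_123", "worker", org_defaults, service_overrides))  # 180
print(resolve_retention_days("org_456", "api", org_defaults, service_overrides))     # 90
```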

Configuration

Environment Variables

Configure cleanup behavior via environment variables:

# Enable/disable automatic cleanup
AUDIT_CLEANUP_ENABLED=true

# Global default retention period (days)
AUDIT_RETENTION_DAYS=90

# Cleanup schedule (cron format)
AUDIT_CLEANUP_SCHEDULE="0 1 * * *" # Daily at 1 AM UTC

# Dry-run mode (test without deleting)
AUDIT_CLEANUP_DRY_RUN=false
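
For illustration, these variables could be read in Python roughly as follows. This is a sketch under stated assumptions: `CleanupSettings` and `load_cleanup_settings` are hypothetical names, the fallback values are taken from the example values above, and the set of strings accepted as "true" is a guess rather than documented behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class CleanupSettings:
    enabled: bool
    retention_days: int
    schedule: str
    dry_run: bool


def _as_bool(value: str) -> bool:
    # Assumption: these spellings count as true; everything else is false.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_cleanup_settings(env: Mapping[str, str] = os.environ) -> CleanupSettings:
    """Read the documented variables, falling back to the example values."""
    retention_days = int(env.get("AUDIT_RETENTION_DAYS", "90"))
    if retention_days < 1:
        raise ValueError("AUDIT_RETENTION_DAYS must be a positive integer")
    return CleanupSettings(
        enabled=_as_bool(env.get("AUDIT_CLEANUP_ENABLED", "true")),
        retention_days=retention_days,
        schedule=env.get("AUDIT_CLEANUP_SCHEDULE", "0 1 * * *"),
        dry_run=_as_bool(env.get("AUDIT_CLEANUP_DRY_RUN", "false")),
    )
```

Passing the environment as a mapping keeps the loader easy to test without modifying the real process environment.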

Retention Policy Configuration

Global Default

Set the global default retention period in your .env file:

AUDIT_RETENTION_DAYS=90

Per-Organization Policies

Set organization-specific retention policies via CLI:

# Set organization retention to 180 days
python -m backend.audit.cli.cleanup set-org-policy \
  --organization-id org_123 \
  --retention-days 180

Per-Service Overrides

Set service-specific retention within an organization:

# API logs retained for 365 days, other services use org default (180 days)
python -m backend.audit.cli.cleanup set-org-policy \
  --organization-id org_123 \
  --retention-days 180 \
  --service api \
  --service-retention 365

Note: Database persistence for organization policies is planned for future implementation. Currently, policies must be configured in code or via startup scripts.

CLI Usage

The cleanup CLI provides several commands for managing audit indices.

Run Cleanup Manually

Execute cleanup job on-demand:

# Dry-run (preview deletions without actually deleting)
python -m backend.audit.cli.cleanup run --dry-run --retention-days 90

# Execute cleanup
python -m backend.audit.cli.cleanup run --retention-days 90

# Custom retention period
python -m backend.audit.cli.cleanup run --retention-days 180

Output Example:

Running cleanup job (retention: 90 days, dry-run: False)
============================================================

============================================================
CLEANUP SUMMARY
============================================================
Indices scanned: 245
Indices deleted: 32
Storage freed: 1,234.56 MB
Duration: 12.34 seconds

✓ Cleanup completed successfully

List All Indices

View all audit indices with metadata:

# List all indices sorted by date
python -m backend.audit.cli.cleanup list-indices

# Sort by age (oldest first)
python -m backend.audit.cli.cleanup list-indices --sort-by age

# Sort by size (largest first) in reverse
python -m backend.audit.cli.cleanup list-indices --sort-by size --reverse

# Filter by organization
python -m backend.audit.cli.cleanup list-indices --organization-id org_123

Output Example:

+-------------------------------------------+------------+------------+------------+-------------+----------+-----------+
| Index Name                                | Date       | Age (days) | Size (MB)  | Documents   | Org ID   | Service   |
+===========================================+============+============+============+=============+==========+===========+
| audit-org123-acc456-api-2024-07-01        | 2024-07-01 | 95         | 45.23      | 12,345      | org123   | api       |
| audit-org123-acc456-api-2024-07-02        | 2024-07-02 | 94         | 43.12      | 11,987      | org123   | api       |
| audit-org456-acc789-worker-2024-09-15     | 2024-09-15 | 19         | 12.45      | 3,456       | org456   | worker    |
+-------------------------------------------+------------+------------+------------+-------------+----------+-----------+

Total indices: 245
Total storage: 5,678.90 MB
Total documents: 1,234,567

View Cleanup Statistics

Analyze what would be deleted without performing cleanup:

# Preview cleanup impact
python -m backend.audit.cli.cleanup stats --retention-days 90

Output Example:

Analyzing indices (retention: 90 days)...
============================================================

============================================================
CLEANUP STATISTICS
============================================================
Total indices: 245
Deletable indices: 32 (13.1%)
Total storage: 5,678.90 MB
Storage to be freed: 1,234.56 MB (21.7%)

Retention period: 90 days
Indices older than 90 days will be deleted (minimum age: 7 days)

Set Organization Retention Policy

Configure organization-specific retention:

# Basic organization policy
python -m backend.audit.cli.cleanup set-org-policy \
  --organization-id org_123 \
  --retention-days 180

# With service override
python -m backend.audit.cli.cleanup set-org-policy \
  --organization-id org_123 \
  --retention-days 180 \
  --service api \
  --service-retention 365

Output Example:

============================================================
RETENTION POLICY (Preview)
============================================================
Organization ID: org_123
Default retention: 180 days

Service overrides:
- api: 365 days

⚠ Note: Database persistence not yet implemented.
This policy is validated but not stored.
Future implementation will persist to organization settings table.

Safety Features

Minimum Age Guard

The cleanup system includes a 7-day minimum age safeguard:

  • Indices less than 7 days old are never deleted, regardless of retention policy
  • Prevents accidental deletion of recent data due to misconfiguration
  • Hard-coded protection that cannot be disabled

Example:

# Even if retention is set to 1 day, only indices 7+ days old are deleted
if age_days < 7:
    logger.debug(f"Skipping {index_name}: too recent ({age_days} days)")
    return False  # Do not delete

Dry-Run Mode

Test cleanup operations without actually deleting data:

# Preview what would be deleted
python -m backend.audit.cli.cleanup run --dry-run --retention-days 90

Dry-run mode:

  • Lists all indices that would be deleted
  • Shows storage that would be freed
  • Makes no changes to OpenSearch
  • Safe for production testing
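
The behavior listed above can be pictured as a flag that gates the delete call. The sketch below is illustrative only: `run_cleanup`, `CleanupResult`, and the injected `delete_index` callable are hypothetical and not taken from the PDaaS codebase.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class CleanupResult:
    dry_run: bool
    would_delete: List[str] = field(default_factory=list)
    deleted: List[str] = field(default_factory=list)


def run_cleanup(
    deletable: Iterable[str],
    delete_index: Callable[[str], None],
    dry_run: bool = True,
) -> CleanupResult:
    """Delete the given indices, or only report them when dry_run is set."""
    result = CleanupResult(dry_run=dry_run)
    for name in deletable:
        if dry_run:
            result.would_delete.append(name)  # report only; nothing is changed
        else:
            delete_index(name)
            result.deleted.append(name)
    return result
```

Defaulting `dry_run` to True in this sketch is a deliberate choice: a caller has to opt in explicitly before anything is deleted.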

Error Handling

The cleanup system handles errors gracefully:

  • Connection Errors: Retries with exponential backoff
  • Permission Errors: Logs and continues with next index
  • Timeout Errors: Tracks and alerts operations team
  • Partial Failures: Continues cleanup even if some indices fail
  • Fatal Errors: Stops execution and alerts immediately
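
As an illustration of the retry behavior for connection errors, the sketch below doubles the delay after each failed attempt up to a cap. The helper name, the default delays, and the use of Python's built-in `ConnectionError` are assumptions for this example; a real OpenSearch client raises its own exception types.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying on ConnectionError with doubling delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller treat it as fatal
            # Delays grow as base, 2*base, 4*base, ... capped at max_delay.
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
    raise AssertionError("unreachable: the loop always returns or raises")
```

The `sleep` parameter is injectable so the backoff schedule can be checked in tests without actually waiting.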

Monitoring and Alerting

Prometheus Metrics

The cleanup system exports comprehensive metrics:

# Cleanup execution metrics
audit_cleanup_indices_scanned # Total indices examined
audit_cleanup_indices_deleted_total # Total indices deleted
audit_cleanup_storage_freed_mb_total # Total storage freed (MB)
audit_cleanup_duration_seconds # Cleanup job duration
audit_cleanup_last_run_timestamp # Last successful run timestamp

# Error tracking
audit_cleanup_errors_total{error_type} # Errors by type

Alert Rules

Configure these alerts for production:

1. Cleanup Job Failed

alert: AuditCleanupJobFailed
expr: time() - audit_cleanup_last_run_timestamp > 90000 # 25 hours
for: 10m
labels:
  severity: warning
annotations:
  summary: "Audit cleanup job hasn't run in 25+ hours"
  description: "Daily cleanup job may have failed or is disabled"

2. High Deletion Failure Rate

alert: AuditCleanupHighFailureRate
expr: |
  sum(rate(audit_cleanup_errors_total[5m])) /
  (sum(rate(audit_cleanup_indices_deleted_total[5m])) + sum(rate(audit_cleanup_errors_total[5m]))) > 0.1
for: 15m
labels:
  severity: warning
annotations:
  summary: "More than 10% of index deletions failing"
  description: "Check OpenSearch connectivity and permissions"

3. Unexpected High Deletion

alert: AuditCleanupUnexpectedlyHigh
expr: increase(audit_cleanup_indices_deleted_total[24h]) > 1000
for: 1m
labels:
  severity: critical
annotations:
  summary: "Cleanup deleted more than 1000 indices in 24 hours"
  description: "Possible misconfiguration - review retention policies"

Dashboards

Create Grafana dashboards to visualize cleanup operations:

Cleanup Overview Dashboard:

  • Total indices over time
  • Storage usage trend
  • Cleanup job success rate
  • Deletion rate by organization
  • Error breakdown by type

Example Grafana Queries:

# Storage freed per day
increase(audit_cleanup_storage_freed_mb_total[24h])

# Deletion success rate
(
  sum(rate(audit_cleanup_indices_deleted_total[1h])) /
  (sum(rate(audit_cleanup_indices_deleted_total[1h])) + sum(rate(audit_cleanup_errors_total[1h])))
) * 100

# Average cleanup duration
avg_over_time(audit_cleanup_duration_seconds[7d])

Audit Trail

All cleanup operations create audit events for compliance:

{
  "action": "audit.cleanup",
  "target": "audit-indices",
  "actor_type": "system",
  "actor_id": "audit_cleanup_worker",
  "occurred_at": "2025-10-03T01:00:00.000Z",
  "metadata": {
    "started_at": "2025-10-03T01:00:00.000Z",
    "completed_at": "2025-10-03T01:00:15.234Z",
    "duration_seconds": 15.234,
    "dry_run": false,
    "indices_scanned": 245,
    "indices_deleted": 32,
    "storage_freed_mb": 1234.56,
    "errors": []
  }
}

Query cleanup history:

GET audit-system-system-*/_search
{
  "query": {
    "term": { "action": "audit.cleanup" }
  },
  "sort": [
    { "occurred_at": "desc" }
  ]
}

Worker Integration

The cleanup job runs as a scheduled worker (see Workers Development Guide).

Worker Configuration

The AuditCleanupWorker is automatically registered and scheduled:

# backend/workers/audit_cleanup_worker.py
class AuditCleanupWorker(ScheduledWorker):
    def __init__(self):
        super().__init__(
            name="audit_cleanup",
            schedule="0 1 * * *",  # Daily at 1 AM UTC
            enabled=config.cleanup_enabled,
            timezone="UTC",
        )

Running the Worker

Start the cleanup worker:

# Run all workers (includes cleanup)
python -m backend.workers.cli run

# Run only cleanup worker
python -m backend.workers.cli run --worker audit_cleanup

# Run once (ignore schedule)
python -m backend.workers.cli run --worker audit_cleanup --once

Worker Health Checks

Monitor worker health:

# Check worker status
curl http://localhost:8001/health

# Worker metrics
curl http://localhost:8001/metrics | grep worker

Troubleshooting

Issue 1: Cleanup Not Running

Symptoms:

  • No indices being deleted
  • audit_cleanup_last_run_timestamp not updating

Diagnosis:

# Check if cleanup is enabled
echo $AUDIT_CLEANUP_ENABLED

# Check worker status
python -m backend.workers.cli status

# Check worker logs
docker logs api-container 2>&1 | grep -i "audit cleanup"

Solutions:

  1. Ensure AUDIT_CLEANUP_ENABLED=true
  2. Verify worker is running: python -m backend.workers.cli run --worker audit_cleanup
  3. Check OpenSearch connectivity
  4. Review worker logs for errors

Issue 2: Indices Not Being Deleted

Symptoms:

  • Cleanup runs successfully
  • No indices deleted despite being old

Diagnosis:

# Check retention configuration
python -m backend.audit.cli.cleanup stats --retention-days 90

# List old indices
python -m backend.audit.cli.cleanup list-indices --sort-by age

Possible Causes:

  1. Retention period too long: Check AUDIT_RETENTION_DAYS
  2. Minimum age guard: Indices must be 7+ days old
  3. Organization override: Org-specific retention may be longer
  4. Dry-run mode: Ensure AUDIT_CLEANUP_DRY_RUN=false

Issue 3: High Deletion Failure Rate

Symptoms:

  • Alert: AuditCleanupHighFailureRate
  • Many errors in audit_cleanup_errors_total metric

Diagnosis:

# Check recent errors
docker logs api-container 2>&1 | grep -i "failed to delete index" | tail -20

# Test OpenSearch connection
curl -u admin:password https://opensearch:9200/_cluster/health

# Check permissions
curl -u admin:password -X DELETE https://opensearch:9200/test-index

Solutions:

  1. Connection issues: Verify network connectivity to OpenSearch
  2. Permission issues: Ensure cleanup user has delete permissions
  3. Timeout issues: Increase OpenSearch timeout settings
  4. Index locks: Check for active operations on indices

Issue 4: Unexpected Storage Growth

Symptoms:

  • Storage usage increasing despite cleanup running
  • More indices present than the retention period should allow

Diagnosis:

# Analyze storage by organization
python -m backend.audit.cli.cleanup list-indices --organization-id org_123

# Check deletion statistics
python -m backend.audit.cli.cleanup stats --retention-days 90

Solutions:

  1. Review retention policies: May be too long for your needs
  2. Check for failed deletions: Review audit_cleanup_errors_total
  3. Verify index lifecycle: Ensure indices are rotated correctly
  4. Manual cleanup: Run python -m backend.audit.cli.cleanup run to force immediate cleanup

Best Practices

Retention Policy Design

  1. Compliance First: Set retention based on regulatory requirements

    • GDPR: 30-90 days typical
    • SOC2: 90-365 days typical
    • HIPAA: 2190 days (6 years) required
  2. Service-Specific Policies: Different services may need different retention

    # High-value API calls: 1 year
    # Background workers: 90 days
    # System operations: 30 days
  3. Storage Budget: Balance compliance with storage costs

    • Monitor storage usage trends
    • Adjust retention based on capacity
    • Consider cold storage for long-term retention

Operational Guidelines

  1. Test in Non-Production First

    • Always dry-run before executing: --dry-run
    • Validate retention policies in staging
    • Monitor metrics after policy changes
  2. Regular Audits

    • Review cleanup audit events monthly
    • Verify storage trends align with expectations
    • Check for failed deletions
  3. Monitoring Setup

    • Configure all recommended alerts
    • Create cleanup dashboard in Grafana
    • Set up alert notifications (Slack, PagerDuty)
  4. Backup Strategy

    • Consider snapshots before major policy changes
    • Document restore procedures
    • Test recovery from backups quarterly

Capacity Planning

Estimate storage needs based on:

Daily Storage = (Average Events/Day × Average Event Size)
Retention Storage = Daily Storage × Retention Days

Example Calculation:

  • Average events: 100,000/day
  • Average event size: 2 KB
  • Retention: 90 days

Daily Storage = 100,000 × 2 KB = 200 MB/day
Retention Storage = 200 MB/day × 90 days = 18 GB

Add 20-30% buffer for growth and index overhead.
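
The same arithmetic as a small helper, shown as a sketch. The function name is hypothetical, the 25% default buffer is simply the midpoint of the suggested 20-30% range, and sizes use decimal units (1 GB = 1,000,000 KB) to match the calculation above.

```python
def estimate_retention_storage_gb(
    events_per_day: int,
    avg_event_size_kb: float,
    retention_days: int,
    buffer_fraction: float = 0.25,
) -> float:
    """Estimated storage in decimal GB, including the growth buffer."""
    daily_kb = events_per_day * avg_event_size_kb
    retention_kb = daily_kb * retention_days
    return retention_kb * (1 + buffer_fraction) / 1_000_000


print(estimate_retention_storage_gb(100_000, 2, 90, buffer_fraction=0.0))  # 18.0
print(estimate_retention_storage_gb(100_000, 2, 90))                       # 22.5
```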

Advanced Topics

Custom Retention Policies in Code

For programmatic policy management:

from backend.audit.cleanup.policy import RetentionPolicy
from backend.audit.config import AuditConfig

# Initialize policy manager
config = AuditConfig()
policy = RetentionPolicy(config)

# Set organization policy
policy.set_organization_policy(
    organization_id="org_123",
    retention_days=180,
    service_overrides={
        "api": 365,    # API logs: 1 year
        "worker": 90,  # Worker logs: 90 days
        "system": 30,  # System logs: 30 days
    },
)

# Retrieve retention for specific context
api_retention = policy.get_retention_days("org_123", "api") # 365
worker_retention = policy.get_retention_days("org_123", "worker") # 90
default_retention = policy.get_retention_days("org_456") # 90 (global)

Integration with Index Lifecycle Management (ILM)

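Future enhancement: integrate retention with index lifecycle policies. OpenSearch provides this capability as Index State Management (ISM); the policy below uses ILM-style phases to sketch the intended lifecycle and would need to be expressed as ISM states and transitions before use: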

{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {}
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "replica_count": { "number_of_replicas": 1 }
        }
      },
      "cold": {
        "min_age": "60d",
        "actions": {
          "readonly": {}
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

Selective Data Retention (GDPR Right to Erasure)

For GDPR compliance, implement user data anonymization:

# Anonymize user data (planned feature)
from backend.audit.cleanup import anonymize_user_data

await anonymize_user_data(
    user_id="user_123",
    organization_id="org_456",
    replacement_id="deleted_user_abc123",  # Hash-based anonymous ID
)

Once implemented, this would replace all occurrences of the user ID with an anonymous identifier while preserving audit integrity.

Support

For cleanup-related questions or issues:

  • Documentation: Review this guide and Operations Runbook
  • CLI Help: Run python -m backend.audit.cli.cleanup --help
  • Worker Guide: See Workers Development Guide
  • Monitoring: Check Grafana dashboards and Prometheus metrics
  • Support: Contact your PDaaS administrator or SRE team