Audit Retention and Cleanup

This guide covers the automated retention and cleanup system for audit indices in PDaaS. The cleanup system ensures compliance with retention policies while managing storage costs by automatically deleting old audit indices.

Overview

The audit cleanup system provides:

Automated Cleanup: A scheduled daily job deletes indices older than the retention period
Configurable Retention: Global, per-organization, and per-service retention policies
Safety Guards: Built-in protections against accidental data deletion
Manual Tools: CLI utilities for ad-hoc cleanup and analysis
Audit Trail: All cleanup operations are themselves audited
Monitoring: Comprehensive metrics and alerting for cleanup failures

Default Configuration

Default Retention Period: 90 days
Minimum Age Safety Guard: 7 days (indices younger than this are never deleted)
Schedule: Daily at 1 AM UTC
Dry-Run Mode: Available for testing without actual deletion

How It Works

Automated Daily Cleanup

The cleanup process runs automatically every day at 1 AM UTC:

1. Scan Indices: Identifies all audit indices matching pattern audit-*
2. Parse Metadata: Extracts organization, account, service, and date from index names
3. Apply Retention Policy: Determines retention period based on hierarchical rules
4. Safety Check: Ensures indices are older than 7 days (minimum age guard)
5. Age Verification: Compares index age against retention period
6. Delete Indices: Removes indices exceeding retention period
7. Track Metrics: Updates Prometheus metrics for monitoring
8. Audit Operation: Creates audit event documenting cleanup results
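
The selection part of this process (safety check, then age verification) can be sketched as a pure function. This is a minimal illustration, not the actual worker code: the name plan_cleanup and the (name, date) input shape are assumptions made for the example, and it presumes index dates have already been parsed.

```python
from datetime import date

MIN_AGE_DAYS = 7  # minimum age guard described in this guide


def plan_cleanup(indices, today, retention_days):
    """Return the names of indices that would be selected for deletion.

    `indices` is an iterable of (index_name, index_date) pairs.
    Hypothetical helper for illustration only.
    """
    to_delete = []
    for name, index_date in indices:
        age_days = (today - index_date).days
        if age_days < MIN_AGE_DAYS:
            continue  # safety check: too recent, never deleted
        if age_days > retention_days:
            to_delete.append(name)  # age verification: exceeds retention
    return to_delete


sample_indices = [
    ("audit-org123-acc456-api-2025-06-01", date(2025, 6, 1)),
    ("audit-org123-acc456-api-2025-09-01", date(2025, 9, 1)),
    ("audit-org123-acc456-api-2025-09-30", date(2025, 9, 30)),
]
planned = plan_cleanup(sample_indices, date(2025, 10, 3), retention_days=90)
```

Because the function only returns names and deletes nothing, the same logic can back both a dry run and a real run.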

Index Naming Pattern

Audit indices follow a standardized naming pattern:

audit-{organization_id}-{account_id}-{service}-{YYYY-MM-DD}

Examples:

audit-org123-acc456-api-2025-09-30
audit-org456-acc789-worker-2025-10-01
audit-system-system-cleanup-2025-10-02
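
As a sketch of how names in this pattern could be parsed, the example below uses a regular expression. It assumes that organization, account, and service identifiers never contain hyphens (true for the examples above, but not confirmed by this guide); parse_index_name and IndexInfo are hypothetical names.

```python
import re
from datetime import date, datetime
from typing import NamedTuple, Optional

# Assumption: the three identifier segments contain no hyphens.
INDEX_PATTERN = re.compile(
    r"^audit-(?P<org>[^-]+)-(?P<account>[^-]+)-(?P<service>[^-]+)"
    r"-(?P<date>\d{4}-\d{2}-\d{2})$"
)


class IndexInfo(NamedTuple):
    organization_id: str
    account_id: str
    service: str
    index_date: date


def parse_index_name(name: str) -> Optional[IndexInfo]:
    """Return parsed metadata, or None if the name does not match the pattern."""
    match = INDEX_PATTERN.match(name)
    if match is None:
        return None
    try:
        index_date = datetime.strptime(match["date"], "%Y-%m-%d").date()
    except ValueError:
        return None  # digits matched, but not a real calendar date
    return IndexInfo(match["org"], match["account"], match["service"], index_date)
```

Returning None for names that do not match lets a caller skip unrelated indices instead of failing the whole scan.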

Retention Policy Hierarchy

Retention policies are applied in priority order:

1. Service-Specific Override (highest priority)
   - Example: org_123 → api service → 365 days
2. Organization Default
   - Example: org_123 → 180 days
3. Global Default (lowest priority)
   - Example: All organizations → 90 days
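
The lookup order can be illustrated with a small resolver. This is a sketch only: the dictionary-based storage and the function name resolve_retention_days are assumptions for the example, not the system's actual data model.

```python
from typing import Dict, Optional, Tuple

GLOBAL_DEFAULT_DAYS = 90  # global default from this guide


def resolve_retention_days(
    organization_id: str,
    service: Optional[str],
    org_defaults: Dict[str, int],
    service_overrides: Dict[Tuple[str, str], int],
    global_default: int = GLOBAL_DEFAULT_DAYS,
) -> int:
    """Pick the most specific retention period that applies."""
    # 1. Service-specific override (highest priority)
    if service is not None and (organization_id, service) in service_overrides:
        return service_overrides[(organization_id, service)]
    # 2. Organization default
    if organization_id in org_defaults:
        return org_defaults[organization_id]
    # 3. Global default (lowest priority)
    return global_default


org_defaults = {"org_123": 180}
service_overrides = {("org_123", "api"): 365}
```

With these sample policies, org_123's api service resolves to 365 days, its other services to 180 days, and any other organization to the 90-day global default.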

Configuration

Environment Variables

Configure cleanup behavior via environment variables:

# Enable/disable automatic cleanup
AUDIT_CLEANUP_ENABLED=true

# Global default retention period (days)
AUDIT_RETENTION_DAYS=90

# Cleanup schedule (cron format)
AUDIT_CLEANUP_SCHEDULE="0 1 * * *"  # Daily at 1 AM UTC

# Dry-run mode (test without deleting)
AUDIT_CLEANUP_DRY_RUN=false
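
For reference, these variables could be read and validated along the following lines. The CleanupSettings class, the accepted boolean spellings, and the fallback values (taken from the examples above) are assumptions for this sketch; the real configuration loader may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class CleanupSettings:
    enabled: bool
    retention_days: int
    schedule: str
    dry_run: bool


def _as_bool(value: str) -> bool:
    # Assumption: common truthy spellings are accepted, case-insensitively.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_cleanup_settings(env: Mapping[str, str] = os.environ) -> CleanupSettings:
    """Build settings from environment variables, failing fast on bad values."""
    retention_days = int(env.get("AUDIT_RETENTION_DAYS", "90"))
    if retention_days < 1:
        raise ValueError("AUDIT_RETENTION_DAYS must be a positive integer")
    return CleanupSettings(
        enabled=_as_bool(env.get("AUDIT_CLEANUP_ENABLED", "true")),
        retention_days=retention_days,
        schedule=env.get("AUDIT_CLEANUP_SCHEDULE", "0 1 * * *"),
        dry_run=_as_bool(env.get("AUDIT_CLEANUP_DRY_RUN", "false")),
    )
```

Passing the environment as a mapping keeps the loader easy to exercise in tests without touching the real process environment.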

Retention Policy Configuration

Global Default

Set the global default retention period in your .env file:

AUDIT_RETENTION_DAYS=90

Per-Organization Policies

Set organization-specific retention policies via CLI:

# Set organization retention to 180 days
python -m backend.audit.cli.cleanup set-org-policy \
    --organization-id org_123 \
    --retention-days 180

Per-Service Overrides

Set service-specific retention within an organization:

# API logs retained for 365 days, other services use org default (180 days)
python -m backend.audit.cli.cleanup set-org-policy \
    --organization-id org_123 \
    --retention-days 180 \
    --service api \
    --service-retention 365

Note: The set-org-policy command currently validates and previews a policy but does not store it. Database persistence for organization policies is planned for a future release. Until then, policies must be configured in code or via startup scripts.

CLI Usage

The cleanup CLI provides several commands for managing audit indices.

Run Cleanup Manually

Execute cleanup job on-demand:

# Dry-run (preview deletions without actually deleting)
python -m backend.audit.cli.cleanup run --dry-run --retention-days 90

# Execute cleanup
python -m backend.audit.cli.cleanup run --retention-days 90

# Custom retention period
python -m backend.audit.cli.cleanup run --retention-days 180

Output Example:

Running cleanup job (retention: 90 days, dry-run: False)
============================================================

============================================================
CLEANUP SUMMARY
============================================================
Indices scanned: 245
Indices deleted: 32
Storage freed: 1,234.56 MB
Duration: 12.34 seconds

✓ Cleanup completed successfully

List All Indices

View all audit indices with metadata:

# List all indices sorted by date
python -m backend.audit.cli.cleanup list-indices

# Sort by age (oldest first)
python -m backend.audit.cli.cleanup list-indices --sort-by age

# Sort by size with --reverse (largest first)
python -m backend.audit.cli.cleanup list-indices --sort-by size --reverse

# Filter by organization
python -m backend.audit.cli.cleanup list-indices --organization-id org_123

Output Example:

+-------------------------------------------+------------+------------+------------+-------------+----------+-----------+
| Index Name                                | Date       | Age (days) | Size (MB)  | Documents   | Org ID   | Service   |
+===========================================+============+============+============+=============+==========+===========+
| audit-org123-acc456-api-2024-07-01       | 2024-07-01 | 95         | 45.23      | 12,345      | org123   | api       |
| audit-org123-acc456-api-2024-07-02       | 2024-07-02 | 94         | 43.12      | 11,987      | org123   | api       |
| audit-org456-acc789-worker-2024-09-15    | 2024-09-15 | 20         | 12.45      | 3,456       | org456   | worker    |
+-------------------------------------------+------------+------------+------------+-------------+----------+-----------+

Total indices: 245
Total storage: 5,678.90 MB
Total documents: 1,234,567

View Cleanup Statistics

Analyze what would be deleted without performing cleanup:

# Preview cleanup impact
python -m backend.audit.cli.cleanup stats --retention-days 90

Output Example:

Analyzing indices (retention: 90 days)...
============================================================

============================================================
CLEANUP STATISTICS
============================================================
Total indices: 245
Deletable indices: 32 (13.1%)
Total storage: 5,678.90 MB
Storage to be freed: 1,234.56 MB (21.7%)

Retention period: 90 days
Indices older than 90 days will be deleted (minimum age: 7 days)

Set Organization Retention Policy

Configure organization-specific retention:

# Basic organization policy
python -m backend.audit.cli.cleanup set-org-policy \
    --organization-id org_123 \
    --retention-days 180

# With service override
python -m backend.audit.cli.cleanup set-org-policy \
    --organization-id org_123 \
    --retention-days 180 \
    --service api \
    --service-retention 365

Output Example:

============================================================
RETENTION POLICY (Preview)
============================================================
Organization ID: org_123
Default retention: 180 days

Service overrides:
  - api: 365 days

⚠ Note: Database persistence not yet implemented.
This policy is validated but not stored.
Future implementation will persist to organization settings table.

Safety Features

Minimum Age Guard

The cleanup system includes a 7-day minimum age safeguard:

Indices less than 7 days old are never deleted, regardless of retention policy
Prevents accidental deletion of recent data due to misconfiguration
Hard-coded protection that cannot be disabled

Example:

# Even if retention is set to 1 day, only indices 7+ days old are deleted
if age_days < 7:
    logger.debug(f"Skipping {index_name}: too recent ({age_days} days)")
    return False  # Do not delete

Dry-Run Mode

Test cleanup operations without actually deleting data:

# Preview what would be deleted
python -m backend.audit.cli.cleanup run --dry-run --retention-days 90

Dry-run mode:

Lists all indices that would be deleted
Shows storage that would be freed
Makes no changes to OpenSearch
Safe for production testing

Error Handling

The cleanup system handles errors gracefully:

Connection Errors: Retries with exponential backoff
Permission Errors: Logs and continues with next index
Timeout Errors: Tracks and alerts operations team
Partial Failures: Continues cleanup even if some indices fail
Fatal Errors: Stops execution and alerts immediately
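
Retrying with exponential backoff can be sketched as follows. This is an illustrative helper, not the system's implementation: the function name, the attempt limit, the base delay, and the use of Python's built-in ConnectionError as a stand-in for client connection failures are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, doubling the wait after each connection failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise AssertionError("unreachable")
```

The sleep function is injectable so the delays can be checked in tests without actually waiting.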

Monitoring and Alerting

Prometheus Metrics

The cleanup system exports comprehensive metrics:

# Cleanup execution metrics
audit_cleanup_indices_scanned           # Total indices examined
audit_cleanup_indices_deleted_total     # Total indices deleted
audit_cleanup_storage_freed_mb_total    # Total storage freed (MB)
audit_cleanup_duration_seconds          # Cleanup job duration
audit_cleanup_last_run_timestamp        # Last successful run timestamp

# Error tracking
audit_cleanup_errors_total{error_type}  # Errors by type

Alert Rules

Configure these alerts for production:

1. Cleanup Job Failed

alert: AuditCleanupJobFailed
expr: time() - audit_cleanup_last_run_timestamp > 90000  # 25 hours
for: 10m
severity: warning
summary: "Audit cleanup job hasn't run in 25+ hours"
description: "Daily cleanup job may have failed or is disabled"

2. High Deletion Failure Rate

alert: AuditCleanupHighFailureRate
expr: |
  sum(rate(audit_cleanup_errors_total[5m])) /
  (sum(rate(audit_cleanup_indices_deleted_total[5m])) + sum(rate(audit_cleanup_errors_total[5m]))) > 0.1
for: 15m
severity: warning
summary: "More than 10% of index deletions failing"
description: "Check OpenSearch connectivity and permissions"

3. Unexpected High Deletion

alert: AuditCleanupUnexpectedlyHigh
expr: increase(audit_cleanup_indices_deleted_total[24h]) > 1000
for: 1m
severity: critical
summary: "Cleanup deleted more than 1000 indices in the last 24 hours"
description: "Possible misconfiguration - review retention policies"

Dashboards

Create Grafana dashboards to visualize cleanup operations:

Cleanup Overview Dashboard:

Total indices over time
Storage usage trend
Cleanup job success rate
Deletion rate by organization
Error breakdown by type

Example Grafana Queries:

# Storage freed per day
increase(audit_cleanup_storage_freed_mb_total[24h])

# Deletion success rate
(
  sum(rate(audit_cleanup_indices_deleted_total[1h])) /
  (sum(rate(audit_cleanup_indices_deleted_total[1h])) + sum(rate(audit_cleanup_errors_total[1h])))
) * 100

# Average cleanup duration
avg_over_time(audit_cleanup_duration_seconds[7d])

Audit Trail

All cleanup operations create audit events for compliance:

{
  "action": "audit.cleanup",
  "target": "audit-indices",
  "actor_type": "system",
  "actor_id": "audit_cleanup_worker",
  "occurred_at": "2025-10-03T01:00:00.000Z",
  "metadata": {
    "started_at": "2025-10-03T01:00:00.000Z",
    "completed_at": "2025-10-03T01:00:15.234Z",
    "duration_seconds": 15.234,
    "dry_run": false,
    "indices_scanned": 245,
    "indices_deleted": 32,
    "storage_freed_mb": 1234.56,
    "errors": []
  }
}

Query cleanup history:

GET audit-system-system-*/_search
{
  "query": {
    "term": { "action": "audit.cleanup" }
  },
  "sort": [
    { "occurred_at": "desc" }
  ]
}

Worker Integration

The cleanup job runs as a scheduled worker (see Workers Development Guide).

Worker Configuration

The AuditCleanupWorker is automatically registered and scheduled:

# backend/workers/audit_cleanup_worker.py
class AuditCleanupWorker(ScheduledWorker):
    def __init__(self):
        super().__init__(
            name="audit_cleanup",
            schedule="0 1 * * *",  # Daily at 1 AM UTC
            enabled=config.cleanup_enabled,
            timezone="UTC"
        )

Running the Worker

Start the cleanup worker:

# Run all workers (includes cleanup)
python -m backend.workers.cli run

# Run only cleanup worker
python -m backend.workers.cli run --worker audit_cleanup

# Run once (ignore schedule)
python -m backend.workers.cli run --worker audit_cleanup --once

Worker Health Checks

Monitor worker health:

# Check worker status
curl http://localhost:8001/health

# Worker metrics
curl http://localhost:8001/metrics | grep worker

Troubleshooting

Issue 1: Cleanup Not Running

Symptoms:

No indices being deleted
audit_cleanup_last_run_timestamp not updating

Diagnosis:

# Check if cleanup is enabled
echo $AUDIT_CLEANUP_ENABLED

# Check worker status
python -m backend.workers.cli status

# Check worker logs
docker logs api-container 2>&1 | grep -i "audit cleanup"

Solutions:

Ensure AUDIT_CLEANUP_ENABLED=true
Verify worker is running: python -m backend.workers.cli run --worker audit_cleanup
Check OpenSearch connectivity
Review worker logs for errors

Issue 2: Indices Not Being Deleted

Symptoms:

Cleanup runs successfully
No indices deleted despite being old

Diagnosis:

# Check retention configuration
python -m backend.audit.cli.cleanup stats --retention-days 90

# List indices, oldest first
python -m backend.audit.cli.cleanup list-indices --sort-by age

Possible Causes:

Retention period too long: Check AUDIT_RETENTION_DAYS
Minimum age guard: Indices must be 7+ days old
Organization override: Org-specific retention may be longer
Dry-run mode: Confirm AUDIT_CLEANUP_DRY_RUN is set to false

Issue 3: High Deletion Failure Rate

Symptoms:

Alert: AuditCleanupHighFailureRate
Many errors in audit_cleanup_errors_total metric

Diagnosis:

# Check recent errors
docker logs api-container 2>&1 | grep -i "failed to delete index" | tail -20

# Test OpenSearch connection
curl -u admin:password https://opensearch:9200/_cluster/health

# Check delete permissions against a throwaway index (never target a real audit index)
curl -u admin:password -X DELETE https://opensearch:9200/test-index

Solutions:

Connection issues: Verify network connectivity to OpenSearch
Permission issues: Ensure cleanup user has delete permissions
Timeout issues: Increase OpenSearch timeout settings
Index locks: Check for active operations on indices

Issue 4: Unexpected Storage Growth

Symptoms:

Storage usage increasing despite cleanup running
More indices present than the retention period should allow

Diagnosis:

# Analyze storage by organization
python -m backend.audit.cli.cleanup list-indices --organization-id org_123

# Check deletion statistics
python -m backend.audit.cli.cleanup stats --retention-days 90

Solutions:

Review retention policies: May be too long for your needs
Check for failed deletions: Review audit_cleanup_errors_total
Verify index lifecycle: Ensure indices are rotated correctly
Manual cleanup: Run python -m backend.audit.cli.cleanup run to force immediate cleanup

Best Practices

Retention Policy Design

Compliance First: Set retention based on regulatory requirements
- GDPR: no fixed period is mandated; keep data only as long as necessary (30-90 days typical)
- SOC2: 90-365 days typical
- HIPAA: 2190 days (6 years) required

Service-Specific Policies: Different services may need different retention

# High-value API calls: 1 year
# Background workers: 90 days
# System operations: 30 days

Storage Budget: Balance compliance with storage costs
- Monitor storage usage trends
- Adjust retention based on capacity
- Consider cold storage for long-term retention

Operational Guidelines

Test in Non-Production First
- Always dry-run before executing: --dry-run
- Validate retention policies in staging
- Monitor metrics after policy changes

Regular Audits
- Review cleanup audit events monthly
- Verify storage trends align with expectations
- Check for failed deletions

Monitoring Setup
- Configure all recommended alerts
- Create cleanup dashboard in Grafana
- Set up alert notifications (Slack, PagerDuty)

Backup Strategy
- Consider snapshots before major policy changes
- Document restore procedures
- Test recovery from backups quarterly

Capacity Planning

Estimate storage needs based on:

Daily Storage = (Average Events/Day × Average Event Size)
Retention Storage = Daily Storage × Retention Days

Example Calculation:

Average events: 100,000/day
Average event size: 2 KB
Retention: 90 days

Daily Storage = 100,000 × 2 KB = 200 MB/day
Retention Storage = 200 MB × 90 = 18 GB

Add 20-30% buffer for growth and index overhead.
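
The same arithmetic as a small helper, including the optional buffer. The function name is hypothetical, and it uses decimal units (1 MB = 1,000 KB, 1 GB = 1,000 MB) to match the example calculation above.

```python
def estimate_retention_storage_gb(
    events_per_day: int,
    avg_event_kb: float,
    retention_days: int,
    buffer: float = 0.0,
) -> float:
    """Estimate storage in GB for the retention window.

    `buffer` is a fraction, e.g. 0.25 for a 25% allowance.
    """
    daily_mb = events_per_day * avg_event_kb / 1000
    retention_gb = daily_mb * retention_days / 1000
    return retention_gb * (1 + buffer)


base_gb = estimate_retention_storage_gb(100_000, 2, 90)
buffered_gb = estimate_retention_storage_gb(100_000, 2, 90, buffer=0.25)
```

For the example figures this gives 18 GB without a buffer and 22.5 GB with a 25% buffer.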

Advanced Topics

Custom Retention Policies in Code

For programmatic policy management:

from backend.audit.cleanup.policy import RetentionPolicy
from backend.audit.config import AuditConfig

# Initialize policy manager
config = AuditConfig()
policy = RetentionPolicy(config)

# Set organization policy
policy.set_organization_policy(
    organization_id="org_123",
    retention_days=180,
    service_overrides={
        "api": 365,        # API logs: 1 year
        "worker": 90,      # Worker logs: 90 days
        "system": 30       # System logs: 30 days
    }
)

# Retrieve retention for specific context
api_retention = policy.get_retention_days("org_123", "api")      # 365
worker_retention = policy.get_retention_days("org_123", "worker")  # 90
default_retention = policy.get_retention_days("org_456")          # 90 (global)

Integration with Index Lifecycle Management (ILM)

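Future enhancement: integrate with OpenSearch index lifecycle policies. In OpenSearch this feature is called Index State Management (ISM); ILM is the Elasticsearch name for the equivalent feature. An illustrative ISM policy using 30/60/90-day transitions:

{
  "policy": {
    "description": "Audit index lifecycle (illustrative)",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [],
        "transitions": [
          { "state_name": "warm", "conditions": { "min_index_age": "30d" } }
        ]
      },
      {
        "name": "warm",
        "actions": [
          { "replica_count": { "number_of_replicas": 1 } }
        ],
        "transitions": [
          { "state_name": "cold", "conditions": { "min_index_age": "60d" } }
        ]
      },
      {
        "name": "cold",
        "actions": [
          { "read_only": {} }
        ],
        "transitions": [
          { "state_name": "delete", "conditions": { "min_index_age": "90d" } }
        ]
      },
      {
        "name": "delete",
        "actions": [
          { "delete": {} }
        ],
        "transitions": []
      }
    ]
  }
}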

Selective Data Retention (GDPR Right to Erasure)

For GDPR compliance, implement user data anonymization:

# Anonymize user data (planned feature)
from backend.audit.cleanup import anonymize_user_data

await anonymize_user_data(
    user_id="user_123",
    organization_id="org_456",
    replacement_id="deleted_user_abc123"  # Hash-based anonymous ID
)

This replaces all occurrences of the user_id with an anonymous identifier while preserving audit integrity.
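
One way a hash-based anonymous ID like the replacement_id above could be derived is with a keyed hash, so the same user always maps to the same identifier but the mapping cannot be reversed without the key. This is a sketch under assumptions: make_anonymous_id, the secret key, and the 12-character truncation are illustrative choices, not part of the planned feature.

```python
import hashlib
import hmac


def make_anonymous_id(user_id: str, secret: bytes, prefix: str = "deleted_user_") -> str:
    """Derive a stable, non-reversible identifier for an erased user.

    Hypothetical helper: `secret` is a server-side key that must be kept
    private, otherwise user IDs could be re-identified by brute force.
    """
    digest = hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return prefix + digest[:12]  # truncated for readability in audit records
```

Using a keyed hash rather than a plain hash of the user ID makes it harder to recover the original ID by hashing candidate values.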

Support

For cleanup-related questions or issues:

Documentation: Review this guide and Operations Runbook
CLI Help: Run python -m backend.audit.cli.cleanup --help
Worker Guide: See Workers Development Guide
Monitoring: Check Grafana dashboards and Prometheus metrics
Support: Contact your PDaaS administrator or SRE team
