Audit System Operations Runbook

This runbook provides operational procedures for managing and troubleshooting the PDaaS audit system. It is intended for DevOps engineers, SRE teams, and system administrators.

System Overview

Architecture

The PDaaS audit system consists of these core components:

┌─────────────────┐
│   FastAPI App   │
│  ┌───────────┐  │
│  │   Audit   │  │
│  │ Middleware│  │
│  └─────┬─────┘  │
└────────┼────────┘
         │ async
         ▼
┌─────────────────┐
│  Audit Module   │
│  ┌───────────┐  │
│  │  Emitter  │──┼──┐
│  └───────────┘  │  │
│  ┌───────────┐  │  │
│  │ Sanitizer │  │  │ batched
│  └───────────┘  │  │ events
│  ┌───────────┐  │  │
│  │   Sinks   │◄─┼──┘
│  └─────┬─────┘  │
└────────┼────────┘
         │
    ┌────┴────┐
    ▼         ▼
┌────────┐  ┌────────┐
│  Logs  │  │OpenSrch│
└────────┘  └────────┘

Component Responsibilities

Component            Responsibility                        Critical?
AuditMiddleware      Captures HTTP requests/responses      Yes
Emitter              Queues and batches events             Yes
Sanitizer            Removes sensitive data                Yes
OpenSearchAuditSink  Writes to OpenSearch                  Yes
LoggerAuditSink      Fallback to logs                      Yes
CircuitBreaker       Protects against OpenSearch failures  Yes

Data Flow

  1. Request arrives → Middleware captures start time
  2. Request processed → Handler executes business logic
  3. Response generated → Middleware captures response
  4. Event created → EnhancedAuditEvent with full context
  5. Event sanitized → Sensitive data removed
  6. Event queued → Added to batch buffer
  7. Batch flushed → On size (100) or time (5s) trigger
  8. Write to OpenSearch → Bulk write with retry
  9. Fallback if needed → Circuit breaker → logs
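
The batching behavior in steps 6-8 can be pictured with a short sketch. This is a simplified, hypothetical emitter loop (the real implementation lives in the audit module and may differ); it only illustrates the two flush triggers: batch size 100 or 5 seconds elapsed.

import asyncio
import time

class BatchingEmitter:
    """Hypothetical, simplified emitter loop illustrating the flush triggers."""

    def __init__(self, sink, batch_size: int = 100, flush_interval: float = 5.0):
        self.sink = sink                      # e.g. an OpenSearch sink with retry/fallback
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self.queue: asyncio.Queue = asyncio.Queue()

    async def emit(self, event: dict) -> None:
        # Called from the middleware; never blocks the request path.
        self.queue.put_nowait(event)

    async def run(self) -> None:
        buffer: list[dict] = []
        last_flush = time.monotonic()
        while True:
            remaining = self.flush_interval - (time.monotonic() - last_flush)
            try:
                event = await asyncio.wait_for(self.queue.get(), timeout=max(remaining, 0.05))
                buffer.append(event)
            except asyncio.TimeoutError:
                pass
            # Flush on size (100) or time (5s), matching steps 6-8 above.
            if len(buffer) >= self.batch_size or time.monotonic() - last_flush >= self.flush_interval:
                if buffer:
                    await self.sink.write_batch(buffer)  # bulk write with retry/fallback
                    buffer = []
                last_flush = time.monotonic()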

Key Metrics

Monitor these metrics for system health:

  • audit_events_total - Total events emitted
  • audit_events_latency_seconds - Event processing latency
  • audit_batch_flush_duration_seconds - Batch write time
  • audit_queue_size - Current queue depth
  • circuit_breaker_state - 0=closed, 1=open, 2=half-open
  • audit_errors_total - Total error count
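
These are standard Prometheus series. As a hedged sketch, they might be registered with prometheus_client roughly as follows (metric names taken from the list above; the label sets are assumptions, not the authoritative definitions):

from prometheus_client import Counter, Gauge, Histogram

# Assumed metric definitions; label names are illustrative only.
AUDIT_EVENTS_TOTAL = Counter(
    "audit_events_total", "Total events emitted", ["action", "result"]
)
AUDIT_EVENTS_LATENCY = Histogram(
    "audit_events_latency_seconds", "Event processing latency"
)
AUDIT_BATCH_FLUSH_DURATION = Histogram(
    "audit_batch_flush_duration_seconds", "Batch write time"
)
AUDIT_QUEUE_SIZE = Gauge("audit_queue_size", "Current queue depth")
CIRCUIT_BREAKER_STATE = Gauge(
    "circuit_breaker_state", "0=closed, 1=open, 2=half-open", ["sink_type"]
)
AUDIT_ERRORS_TOTAL = Counter("audit_errors_total", "Total error count", ["error_type"])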

Common Issues and Solutions

Issue 1: Circuit Breaker Open

Symptoms:

  • Alert: "CircuitBreakerOpen - OpenSearch unavailable for 10+ minutes"
  • Logs: WARNING: Circuit breaker open, using fallback sink
  • Metric: circuit_breaker_state{sink_type="opensearch"} == 1
  • Events going to logs only, not searchable in OpenSearch

Diagnosis:

# Check circuit breaker state
curl http://localhost:8000/metrics | grep circuit_breaker_state

# Check OpenSearch health
curl -u admin:password https://opensearch:9200/_cluster/health

# Check connectivity from app server
curl -v https://opensearch:9200/_cluster/health

# Check recent errors
docker logs api-container | grep -i "opensearch" | tail -50

Root Causes:

  1. OpenSearch cluster down or unavailable
  2. Network connectivity issues
  3. OpenSearch overloaded (too many requests)
  4. Authentication/SSL certificate issues
  5. OpenSearch disk full

Resolution:

If OpenSearch is down:

# Restart OpenSearch cluster
docker restart opensearch-node1

# Or on Kubernetes
kubectl rollout restart deployment opensearch

# Verify health
curl https://opensearch:9200/_cluster/health

If network issues:

# Test connectivity
ping opensearch.internal
telnet opensearch.internal 9200

# Check firewall rules
iptables -L | grep 9200

# Check DNS resolution
nslookup opensearch.internal

If authentication issues:

# Verify credentials
curl -u $AUDIT_OPENSEARCH_USERNAME:$AUDIT_OPENSEARCH_PASSWORD \
https://opensearch:9200/_cluster/health

# Check SSL cert
openssl s_client -connect opensearch:9200 -showcerts

If disk full:

# Check disk usage
curl https://opensearch:9200/_cat/allocation?v

# Delete old indices
curl -X DELETE https://opensearch:9200/audit-*-2025-01-*

Auto-Recovery:

  • Circuit breaker automatically tries to close after 60 seconds
  • Monitor circuit_breaker_transitions_total metric
  • Events buffered during outage will be written when recovered
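
The recovery behavior described above follows the standard circuit-breaker pattern. A minimal sketch (hypothetical, not the actual backend.audit implementation) of the state machine and the 60-second retry:

import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N failures, probe again after 60s."""

    CLOSED, OPEN, HALF_OPEN = 0, 1, 2  # matches the circuit_breaker_state metric values

    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = self.CLOSED
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        # After the recovery timeout, allow a single probe request (half-open).
        if self.state == self.OPEN and time.monotonic() - self.opened_at >= self.recovery_timeout:
            self.state = self.HALF_OPEN
        return self.state != self.OPEN

    def record_success(self) -> None:
        self.failures, self.state = 0, self.CLOSED

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold or self.state == self.HALF_OPEN:
            self.state, self.opened_at = self.OPEN, time.monotonic()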

Prevention:

  • Monitor OpenSearch cluster health proactively
  • Set up auto-scaling for OpenSearch
  • Configure index lifecycle management
  • Regular disk cleanup

Issue 2: High Queue Size

Symptoms:

  • Alert: "LargeAuditQueue - Queue size > 5,000 events"
  • Alert: "AuditQueueOverflow - Queue > 10,000 events" (critical)
  • Metric: audit_queue_size > 5000
  • Logs: WARNING: Audit queue size: 8432
  • Memory usage increasing on app servers

Diagnosis:

# Check queue size
curl http://localhost:8000/metrics | grep audit_queue_size

# Check OpenSearch write latency
curl http://localhost:8000/metrics | grep opensearch_write_latency

# Check batch flush duration
curl http://localhost:8000/metrics | grep audit_batch_flush_duration

# Check OpenSearch cluster performance
curl https://opensearch:9200/_cluster/stats

Root Causes:

  1. OpenSearch write throughput too low
  2. Batch flush taking too long
  3. Traffic spike overwhelming system
  4. OpenSearch cluster under-provisioned
  5. Network latency between app and OpenSearch

Resolution:

Immediate (stop the bleeding):

# Increase flush frequency (reduce interval)
# Update environment variable and restart
AUDIT_FLUSH_INTERVAL_SECONDS=1.0 # Down from 5.0

# Increase batch size (more efficient bulk writes)
AUDIT_BATCH_SIZE=500 # Up from 100

# Scale app horizontally (distribute load)
kubectl scale deployment api --replicas=5

Short-term (within hours):

# Scale OpenSearch cluster
# Add more data nodes for write capacity
kubectl scale statefulset opensearch --replicas=5

# Increase OpenSearch resources
# Edit deployment to increase CPU/memory

# Check index refresh interval
curl -X PUT https://opensearch:9200/audit-*/_settings \
  -H 'Content-Type: application/json' -d '{
  "index": {
    "refresh_interval": "30s"
  }
}'

Long-term (capacity planning):

  1. Review traffic patterns and plan capacity
  2. Implement auto-scaling based on queue size
  3. Consider dedicated OpenSearch cluster for audit
  4. Optimize index settings (shards, replicas)
  5. Implement index rollover for better performance

Monitoring:

# Track queue recovery
watch 'curl -s http://localhost:8000/metrics | grep audit_queue_size'

# Set up alert for sustained high queue
# Alert if queue > 5000 for 10+ minutes

Prevention:

  • Capacity plan for 2x expected peak load
  • Auto-scale OpenSearch based on CPU/memory
  • Monitor write latency and queue size proactively
  • Regular load testing

Issue 3: Index Creation Failures

Symptoms:

  • Alert: "IndexCreationFailures - Unable to create new indices"
  • Logs: ERROR: Failed to create index audit-org123-acc456-api-2025-10-03
  • HTTP 400/403 errors from OpenSearch
  • Events failing to write

Diagnosis:

# Check index creation errors
curl http://localhost:8000/metrics | grep audit_errors_total | grep index_creation

# List current indices
curl https://opensearch:9200/_cat/indices?v | grep audit

# Check cluster settings
curl https://opensearch:9200/_cluster/settings?include_defaults=true

# Check shard limits
curl https://opensearch:9200/_cluster/stats | jq '.indices.shards'

Root Causes:

  1. OpenSearch shard limit reached (default 1000 per node)
  2. Disk space exhausted
  3. Index template conflicts
  4. Permission issues (user cannot create indices)
  5. Too many indices (need cleanup)

Resolution:

If shard limit reached:

# Increase shard limit temporarily
curl -X PUT https://opensearch:9200/_cluster/settings \
  -H 'Content-Type: application/json' -d '{
  "persistent": {
    "cluster.max_shards_per_node": 2000
  }
}'

# Delete old indices to free shards
python -m backend.audit.cleanup --older-than 90

# Or delete manually
curl -X DELETE https://opensearch:9200/audit-*-2025-01-*

If disk full:

# Check disk usage
curl https://opensearch:9200/_cat/allocation?v

# Enable watermark enforcement (if disabled)
curl -X PUT https://opensearch:9200/_cluster/settings \
  -H 'Content-Type: application/json' -d '{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%"
  }
}'

# Force merge old indices to reclaim space
curl -X POST https://opensearch:9200/audit-*-2025-09-*/_forcemerge?max_num_segments=1

# Delete old indices
curl -X DELETE https://opensearch:9200/audit-*-2025-08-*

If permission issues:

# Check user permissions
curl -u admin:password https://opensearch:9200/_plugins/_security/api/roles/audit_writer

# Update role to allow index creation
curl -u admin:password -X PUT https://opensearch:9200/_plugins/_security/api/roles/audit_writer \
  -H 'Content-Type: application/json' -d '{
  "cluster_permissions": ["cluster:monitor/health"],
  "index_permissions": [{
    "index_patterns": ["audit-*"],
    "allowed_actions": ["write", "create_index", "indices:admin/create"]
  }]
}'

Prevention:

  • Automated retention policies (delete indices > 90 days)
  • Monitor shard count and disk usage
  • Use monthly rotation for low-volume organizations
  • Regular cleanup job
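
The cleanup job referenced above (python -m backend.audit.cleanup) could look roughly like this sketch using opensearch-py; the index-name parsing assumes the daily audit-{org}-{account}-{service}-{YYYY-MM-DD} pattern described later in this runbook:

from datetime import datetime, timedelta, timezone

from opensearchpy import OpenSearch

def delete_old_audit_indices(client: OpenSearch, older_than_days: int = 90) -> None:
    """Delete audit-* indices whose trailing date is past the retention window."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=older_than_days)
    for row in client.cat.indices(index="audit-*", format="json"):
        name = row["index"]
        try:
            # Index names end with a YYYY-MM-DD suffix (daily rotation).
            index_date = datetime.strptime("-".join(name.split("-")[-3:]), "%Y-%m-%d")
        except ValueError:
            continue  # skip monthly or otherwise unexpected patterns
        if index_date.replace(tzinfo=timezone.utc) < cutoff:
            client.indices.delete(index=name)
            print(f"Deleted {name}")

if __name__ == "__main__":
    client = OpenSearch(hosts=["https://opensearch:9200"], http_auth=("admin", "password"))
    delete_old_audit_indices(client, older_than_days=90)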

Issue 4: High Latency

Symptoms:

  • Alert: "HighAuditLatency - p95 latency > 100ms for 15 minutes"
  • Alert: "HighMiddlewareLatency - p95 > 5ms for 15 minutes"
  • Slow API response times
  • User complaints about performance

Diagnosis:

# Check audit latency
curl http://localhost:8000/metrics | grep audit_events_latency_seconds

# Check middleware overhead
curl http://localhost:8000/metrics | grep audit_middleware_latency_seconds

# Check OpenSearch write latency
curl http://localhost:8000/metrics | grep opensearch_write_latency_seconds

# Check batch flush duration
curl http://localhost:8000/metrics | grep audit_batch_flush_duration_seconds

Root Causes:

  1. Synchronous operations blocking request thread
  2. Large request/response bodies (serialization overhead)
  3. OpenSearch cluster slow or overloaded
  4. Network latency to OpenSearch
  5. Queue full causing backpressure

Resolution:

If middleware overhead high:

# Verify async processing is working
# Check logs for blocking operations
docker logs api-container | grep "audit" | grep -i "block"

# Increase excluded paths (reduce audit volume)
AUDIT_EXCLUDED_PATHS='["/health","/metrics","/internal/*","/debug/*"]'

# Reduce body size limit (less serialization)
AUDIT_MAX_BODY_SIZE=1024 # Down from 10240

If OpenSearch writes slow:

# Check OpenSearch performance
curl https://opensearch:9200/_nodes/stats/indices

# Reduce batch size (smaller bulk requests)
AUDIT_BATCH_SIZE=50 # Down from 100

# Increase OpenSearch resources
# Scale cluster or increase CPU/memory

# Check index settings
curl https://opensearch:9200/audit-*/_settings

If network latency:

# Measure network latency
ping opensearch.internal

# Use dedicated network for audit traffic
# Configure app to use internal VPC endpoint

# Increase batch size to amortize latency
AUDIT_BATCH_SIZE=500

Prevention:

  • Keep middleware logic minimal
  • Monitor latency continuously
  • Load test before production deployment
  • Use async operations everywhere

Issue 5: Audit Events Not Appearing

Symptoms:

  • Events missing from OpenSearch
  • Searches return no results
  • Dashboard shows no data
  • Users cannot find audit trail

Diagnosis:

# Check if auditing enabled
echo $AUDIT_ENABLED

# Check metrics for events emitted
curl http://localhost:8000/metrics | grep audit_events_total

# Check OpenSearch for indices
curl https://opensearch:9200/_cat/indices?v | grep audit

# Check recent events in index
curl "https://opensearch:9200/audit-*/_search?size=10&sort=occurred_at:desc"

# Check logs for errors
docker logs api-container | grep -i "audit" | grep -i "error"

Root Causes:

  1. Auditing disabled via config
  2. Path excluded from auditing
  3. Circuit breaker open (events in logs only)
  4. OpenSearch index pattern mismatch
  5. Tenant context missing (events written to wrong index)

Resolution:

If auditing disabled:

# Enable auditing
export AUDIT_ENABLED=true

# Restart application
docker restart api-container

If path excluded:

# Check excluded paths
python -c "
from backend.audit.config import get_audit_config
config = get_audit_config()
print(config.excluded_paths)
"

# Check if path is excluded
python -c "
from backend.audit.config import get_audit_config
config = get_audit_config()
print(config.is_path_excluded('/your/path'))
"

# Update excluded paths if needed
AUDIT_EXCLUDED_PATHS='["/health","/metrics"]'
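
Exclusion patterns such as /internal/* suggest simple wildcard matching. If you need to reason about a pattern outside the application, a hypothetical stand-in for config.is_path_excluded() (the real implementation lives in backend.audit.config and may behave differently):

from fnmatch import fnmatch

# Hypothetical equivalent of config.is_path_excluded(); for quick local checks only.
def is_path_excluded(path: str, excluded_patterns: list[str]) -> bool:
    return any(fnmatch(path, pattern) for pattern in excluded_patterns)

patterns = ["/health", "/metrics", "/internal/*"]
print(is_path_excluded("/internal/debug", patterns))  # True
print(is_path_excluded("/api/v1/users", patterns))    # False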

If circuit breaker open:

# Check circuit breaker state
curl http://localhost:8000/metrics | grep circuit_breaker_state

# Events in logs during outage
docker logs api-container | grep "AuditEvent"

# Fix OpenSearch and wait for recovery
# See "Issue 1: Circuit Breaker Open"

If index pattern mismatch:

# List all audit indices
curl https://opensearch:9200/_cat/indices?v | grep audit

# Check expected index name
# Pattern: audit-{org_id}-{account_id}-{service}-{date}
# Example: audit-org123-acc456-api-2025-10-03

# Create index pattern in OpenSearch Dashboards
# Pattern: audit-{your_org_id}-*

If tenant context missing:

# Check middleware order
# TenantMiddleware MUST run before AuditMiddleware

# Verify in app.py
grep -A 5 "add_middleware" backend/api/app.py

# Correct order:
# 1. CorrelationMiddleware
# 2. TenantMiddleware
# 3. AuditMiddleware
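
A hedged sketch of what the registration in backend/api/app.py might look like; the import path and class names are assumptions. Note that Starlette's add_middleware() wraps the app, so the middleware added last is the outermost and runs first:

from fastapi import FastAPI

# Hypothetical import path; check backend/api/app.py for the real registration.
from backend.api.middleware import AuditMiddleware, CorrelationMiddleware, TenantMiddleware

app = FastAPI()

# add_middleware() wraps the app: the LAST middleware added runs FIRST on each request.
app.add_middleware(AuditMiddleware)        # runs third (tenant context already set)
app.add_middleware(TenantMiddleware)       # runs second
app.add_middleware(CorrelationMiddleware)  # runs first (outermost)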

Prevention:

  • Monitor event ingestion rate
  • Test audit in all environments
  • Verify middleware order in CI/CD
  • Document excluded paths

Troubleshooting Guide

Debug Logging

Enable detailed debug logging:

# In code
import logging
logging.getLogger("backend.audit").setLevel(logging.DEBUG)

# Or via environment
export LOG_LEVEL=DEBUG

Debug output includes:

  • Event creation and sanitization
  • Batch queue operations
  • OpenSearch write attempts
  • Circuit breaker state changes
  • Error stack traces

Manual Event Testing

Test event emission manually:

from backend.audit import emit
from backend.utils.actor import ActorInfo

# Emit test event
await emit(
    action="test.manual",
    target="test:runbook",
    metadata={"test": True, "timestamp": "2025-10-03"},
    actor=ActorInfo(actor_type="system", actor_id="test"),
    organization_id="org_test",
    account_id="acc_test",
)

# Check OpenSearch for event
# Index: audit-org_test-acc_test-api-2025-10-03

Verify in OpenSearch:

curl https://opensearch:9200/audit-org_test-acc_test-api-2025-10-03/_search?q=action:test.manual

OpenSearch Query Examples

Find events by actor:

curl -X POST https://opensearch:9200/audit-*/_search \
  -H 'Content-Type: application/json' -d '{
  "query": {
    "term": {"actor_id": "user_123"}
  },
  "sort": [{"occurred_at": "desc"}],
  "size": 100
}'

Find failed operations:

curl -X POST https://opensearch:9200/audit-*/_search \
  -H 'Content-Type: application/json' -d '{
  "query": {
    "bool": {
      "must": [
        {"term": {"operation_result": "failure"}},
        {"range": {"response_status_code": {"gte": 400}}}
      ]
    }
  }
}'

Find recent errors:

curl -X POST https://opensearch:9200/audit-*/_search \
  -H 'Content-Type: application/json' -d '{
  "query": {
    "bool": {
      "must": [
        {"range": {"response_status_code": {"gte": 500}}},
        {"range": {"occurred_at": {"gte": "now-1h"}}}
      ]
    }
  },
  "sort": [{"occurred_at": "desc"}]
}'

Trace a request:

curl -X POST https://opensearch:9200/audit-*/_search \
  -H 'Content-Type: application/json' -d '{
  "query": {
    "term": {"trace_id": "abc123def456"}
  }
}'

Performance Profiling

Profile audit system performance:

# Add timing instrumentation
import time

start = time.time()
await emit(action="test", target="perf", request=request)
duration = time.time() - start
print(f"Emit took {duration*1000:.2f}ms")

Check metrics:

# Middleware overhead
curl -s http://localhost:8000/metrics | grep audit_middleware_latency_seconds

# Event processing
curl -s http://localhost:8000/metrics | grep audit_events_latency_seconds

# Batch flush
curl -s http://localhost:8000/metrics | grep audit_batch_flush_duration_seconds

Health Check Script

#!/bin/bash
# audit_health_check.sh

echo "=== Audit System Health Check ==="

# 1. Check if auditing is enabled
if [ "$AUDIT_ENABLED" = "true" ]; then
    echo "✓ Auditing enabled"
else
    echo "✗ Auditing disabled"
    exit 1
fi

# 2. Check OpenSearch connectivity
if curl -s -u "$AUDIT_OPENSEARCH_USERNAME:$AUDIT_OPENSEARCH_PASSWORD" \
    "https://$AUDIT_OPENSEARCH_HOST:$AUDIT_OPENSEARCH_PORT/_cluster/health" | grep -q green; then
    echo "✓ OpenSearch cluster healthy"
else
    echo "✗ OpenSearch cluster unhealthy"
fi

# 3. Check circuit breaker state
CB_STATE=$(curl -s http://localhost:8000/metrics | grep 'circuit_breaker_state{sink_type="opensearch"}' | awk '{print $2}')
CB_STATE=${CB_STATE%.*}  # strip decimal part from the Prometheus exposition format
if [ "$CB_STATE" = "0" ]; then
    echo "✓ Circuit breaker closed"
else
    echo "✗ Circuit breaker open/half-open"
fi

# 4. Check queue size
QUEUE_SIZE=$(curl -s http://localhost:8000/metrics | grep 'audit_queue_size{' | awk '{print $2}')
QUEUE_SIZE=${QUEUE_SIZE%.*}  # strip decimal part for the integer comparison
if [ "$QUEUE_SIZE" -lt 5000 ]; then
    echo "✓ Queue size normal ($QUEUE_SIZE)"
else
    echo "⚠ Queue size high ($QUEUE_SIZE)"
fi

# 5. Check event ingestion rate
EVENTS=$(curl -s http://localhost:8000/metrics | grep 'audit_events_total{' | head -1 | awk '{print $2}')
echo "✓ Events emitted: $EVENTS"

echo "=== Health Check Complete ==="

Recovery Procedures

Procedure 1: Restart Audit System

When to use: After configuration changes, or to recover from errors

# 1. Graceful shutdown (flush pending events)
curl -X POST http://localhost:8000/admin/audit/flush

# 2. Wait for flush to complete (check queue size)
curl -s http://localhost:8000/metrics | grep audit_queue_size

# 3. Restart application
docker restart api-container

# Or on Kubernetes
kubectl rollout restart deployment api

# 4. Verify startup
docker logs -f api-container | grep "audit"

# 5. Check health
curl http://localhost:8000/health/audit

Procedure 2: Flush Pending Events

When to use: Before shutdown, or to clear queue

# 1. Check current queue size
curl -s http://localhost:8000/metrics | grep audit_queue_size

# 2. Trigger manual flush
curl -X POST http://localhost:8000/admin/audit/flush

# 3. Wait for flush (monitor queue size)
watch 'curl -s http://localhost:8000/metrics | grep audit_queue_size'

# 4. Verify all events written
# Queue size should be 0

Procedure 3: Recover from OpenSearch Outage

When to use: After extended OpenSearch downtime

During outage:

  1. Events automatically go to logs (fallback)
  2. Circuit breaker opens to protect system
  3. Monitor for OpenSearch recovery

After OpenSearch recovers:

# 1. OpenSearch comes back online
curl https://opensearch:9200/_cluster/health

# 2. Circuit breaker auto-recovers (60s timeout)
# Monitor state transition
curl -s http://localhost:8000/metrics | grep circuit_breaker_state

# 3. New events resume writing to OpenSearch
# Check event count in indices
curl https://opensearch:9200/audit-*/_count

# 4. Review log events during outage
docker logs api-container --since 1h | grep "AuditEvent"

Lost events during outage:

  • Events written to structured logs
  • Can be extracted and replayed if needed
  • Use log aggregation tool (e.g., CloudWatch, Datadog)
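
If a replay is ever required, a rough sketch of the approach is below. It assumes fallback events were logged as JSON payloads on lines containing an "AuditEvent" marker and that indices follow the audit-{org}-{account}-{service}-{date} pattern; the real log format may differ, so adapt the parsing before use.

import json
import sys

from opensearchpy import OpenSearch, helpers

def replay_log_events(log_path: str, client: OpenSearch) -> None:
    """Re-index audit events captured in fallback logs during an OpenSearch outage."""
    actions = []
    with open(log_path) as fh:
        for line in fh:
            if "AuditEvent" not in line or "{" not in line:
                continue
            event = json.loads(line[line.index("{"):])  # assumes a JSON payload after the marker
            # Index naming assumed from the audit-{org}-{account}-{service}-{date} pattern.
            index = (f"audit-{event['organization_id']}-{event['account_id']}"
                     f"-api-{event['occurred_at'][:10]}")
            actions.append({"_index": index, "_source": event})
    helpers.bulk(client, actions)
    print(f"Replayed {len(actions)} events")

if __name__ == "__main__":
    client = OpenSearch(hosts=["https://opensearch:9200"], http_auth=("admin", "password"))
    replay_log_events(sys.argv[1], client)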

Procedure 4: Recover from Data Loss

When to use: Index accidentally deleted, corruption

# 1. Check if snapshots available
curl https://opensearch:9200/_snapshot/_all

# 2. List snapshots for date range
curl https://opensearch:9200/_snapshot/audit_backup/_all

# 3. Restore specific indices
curl -X POST https://opensearch:9200/_snapshot/audit_backup/snapshot_2025-10-03/_restore \
  -H 'Content-Type: application/json' -d '{
  "indices": "audit-org123-acc456-api-2025-10-03",
  "ignore_unavailable": true,
  "include_global_state": false
}'

# 4. Monitor restore progress
curl https://opensearch:9200/_recovery?active_only=true

# 5. Verify data restored
curl https://opensearch:9200/audit-org123-acc456-api-2025-10-03/_count

Prevention:

  • Daily automated snapshots
  • Immutable index settings (prevent deletion)
  • Backup retention: 30 days
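
The daily snapshots can be scripted. A sketch using opensearch-py, assuming an S3-backed repository named audit_backup (matching Procedure 4); the bucket name is an assumption and the S3 repository type requires the repository-s3 plugin:

from datetime import date

from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://opensearch:9200"], http_auth=("admin", "password"))

# One-time: register the snapshot repository (bucket name is illustrative).
client.snapshot.create_repository(
    repository="audit_backup",
    body={"type": "s3", "settings": {"bucket": "pdaas-audit-snapshots", "base_path": "audit"}},
)

# Daily (e.g. from cron): snapshot all audit indices.
client.snapshot.create(
    repository="audit_backup",
    snapshot=f"snapshot_{date.today().isoformat()}",
    body={"indices": "audit-*", "include_global_state": False},
    wait_for_completion=False,
)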

Procedure 5: Emergency Disable

When to use: Critical performance issue, security concern

# 1. Disable auditing immediately
export AUDIT_ENABLED=false

# 2. Restart application
docker restart api-container

# 3. Verify auditing stopped
curl -s http://localhost:8000/metrics | grep audit_events_total
# Count should stop increasing

# 4. Investigate issue while system is stable

# 5. Re-enable when resolved
export AUDIT_ENABLED=true
docker restart api-container

⚠️ WARNING: Disabling audit breaks compliance. Only use in emergencies.

Capacity Planning

Event Volume Estimation

Calculate expected events per day:

Events/day = (API requests/day) × (1 - excluded_ratio)

Example:
- 10M API requests/day
- 20% excluded (/health, /metrics)
- Events/day = 10M × 0.8 = 8M events/day

Storage estimation:

Storage/day = Events/day × Avg_event_size

Example:
- 8M events/day
- 5KB average event size (after sanitization)
- Storage/day = 8M × 5KB = 40GB/day
- Storage/month = 40GB × 30 = 1.2TB/month

OpenSearch Sizing

Cluster sizing guidelines:

Traffic Level  Events/sec  Storage/day  Recommended Cluster
Low            < 100       < 5GB        1 data node, 2 CPU, 4GB RAM
Medium         100-1000    5-50GB       3 data nodes, 4 CPU, 8GB RAM
High           1000-5000   50-250GB     5 data nodes, 8 CPU, 16GB RAM
Very High      > 5000      > 250GB      10+ data nodes, 16 CPU, 32GB RAM

Disk sizing:

Total disk = (Daily storage × Retention days) × Replication factor × Overhead

Example:
- 40GB/day
- 90 days retention
- 1 replica (2x)
- 20% overhead
- Total = 40 × 90 × 2 × 1.2 = 8.6TB
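
These formulas are easy to plug into a quick script when planning; for example, using the numbers above:

# Worked example of the capacity formulas above.
requests_per_day = 10_000_000
excluded_ratio = 0.20              # /health, /metrics, etc.
avg_event_size_kb = 5
retention_days = 90
replication_factor = 2             # 1 replica
overhead = 1.20                    # 20% headroom

events_per_day = requests_per_day * (1 - excluded_ratio)
storage_per_day_gb = events_per_day * avg_event_size_kb / 1_000_000
total_disk_tb = storage_per_day_gb * retention_days * replication_factor * overhead / 1_000

print(f"{events_per_day:,.0f} events/day")   # 8,000,000 events/day
print(f"{storage_per_day_gb:.0f} GB/day")    # 40 GB/day
print(f"{total_disk_tb:.1f} TB total disk")  # 8.6 TB total disk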

Scaling Guidelines

Horizontal scaling (add app instances):

  • Increases event throughput
  • Distributes queue load
  • Each instance has independent queue

Vertical scaling (increase app resources):

  • Larger queue capacity (more memory)
  • Faster event processing (more CPU)

OpenSearch scaling:

# Scale data nodes
kubectl scale statefulset opensearch --replicas=5

# Increase node resources
# Edit statefulset to increase CPU/memory limits

# Add dedicated master nodes (for large clusters)
# Prevents split-brain, improves stability

Auto-Scaling Configuration

Application auto-scaling (Kubernetes HPA):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: audit_queue_size
        target:
          type: AverageValue
          averageValue: 3000  # Scale up if avg queue > 3000
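
Note: scaling on a Pods metric such as audit_queue_size requires the metric to be exposed through the Kubernetes custom metrics API (for example via prometheus-adapter); a resource-only HPA cannot see application metrics.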

OpenSearch auto-scaling:

  • Use managed OpenSearch (AWS, Elastic Cloud)
  • Configure auto-scaling based on CPU/memory
  • Or monitor and scale manually

Cost Optimization

Reduce storage costs:

  1. Decrease retention period (90 → 60 days)
  2. Use compression (gzip indices)
  3. Reduce replica count (2 → 1)
  4. Implement ILM (hot → warm → cold → delete)
  5. Increase body truncation (10KB → 5KB)

Reduce compute costs:

  1. Increase batch size (more efficient writes)
  2. Exclude more paths (reduce volume)
  3. Use cheaper instance types for cold nodes
  4. Right-size cluster for actual load

Alerting Configuration

Critical Alerts (PagerDuty)

Alert: AuditSystemDown

- alert: AuditSystemDown
  expr: rate(audit_events_total[5m]) == 0
  for: 5m
  labels:
    severity: critical
    component: audit
  annotations:
    summary: "Audit system not writing events"
    description: "No audit events written in last 5 minutes"
    action: "Check if auditing is enabled, verify OpenSearch connectivity"

Alert: CircuitBreakerOpen

- alert: CircuitBreakerOpen
  expr: circuit_breaker_state{sink_type="opensearch"} == 1
  for: 10m
  labels:
    severity: critical
    component: audit
  annotations:
    summary: "OpenSearch circuit breaker open"
    description: "Audit events going to logs only for 10+ minutes"
    action: "Fix OpenSearch connectivity, check cluster health"

Alert: HighAuditErrorRate

- alert: HighAuditErrorRate
  expr: rate(audit_errors_total[5m]) / rate(audit_events_total[5m]) > 0.05
  for: 5m
  labels:
    severity: critical
    component: audit
  annotations:
    summary: "High audit error rate (>5%)"
    description: "More than 5% of audit writes failing"
    action: "Check OpenSearch cluster, review error logs"

Alert: AuditQueueOverflow

- alert: AuditQueueOverflow
  expr: audit_queue_size > 10000
  for: 5m
  labels:
    severity: critical
    component: audit
  annotations:
    summary: "Audit queue overflow (>10,000 events)"
    description: "Queue size exceeded threshold, risk of event loss"
    action: "Scale OpenSearch, increase flush frequency, scale app"

Warning Alerts (Slack)

Alert: HighAuditLatency

- alert: HighAuditLatency
  expr: histogram_quantile(0.95, rate(audit_events_latency_seconds_bucket[5m])) > 0.1
  for: 15m
  labels:
    severity: warning
    component: audit
  annotations:
    summary: "High audit latency (p95 > 100ms)"
    description: "Audit event processing taking too long"
    action: "Check OpenSearch performance, review batch settings"

Alert: LargeAuditQueue

- alert: LargeAuditQueue
  expr: audit_queue_size > 5000
  for: 10m
  labels:
    severity: warning
    component: audit
  annotations:
    summary: "Large audit queue (>5,000 events)"
    description: "Queue size elevated, monitor for growth"
    action: "Monitor OpenSearch write throughput, prepare to scale"

Include dashboard links in alerts:

annotations:
  dashboard: "https://grafana.internal/d/audit-system-health"
  runbook: "https://docs.internal/audit-operations-runbook"

Monitoring Best Practices

Key Dashboards

  1. Audit System Health (operations)

    • Event ingestion rate
    • Queue size and latency
    • Circuit breaker state
    • Error rate
  2. API Activity (business)

    • Request volume by endpoint
    • Error rate by endpoint
    • Response time trends
  3. Security & Compliance (security team)

    • Authentication events
    • Authorization decisions
    • Failed access attempts
    • Sensitive operations
  4. Multi-Tenant Activity (product)

    • Events by organization
    • Storage usage
    • Top active users

Metric Retention

  • High resolution (1m): 7 days
  • Medium resolution (5m): 30 days
  • Low resolution (1h): 365 days

Log Retention

  • Application logs: 30 days
  • Audit events (OpenSearch): 90 days (configurable)
  • Metrics: 365 days

Security Considerations

Access Control

OpenSearch access:

  • Use dedicated service account
  • Least privilege permissions (write only to audit-*)
  • Rotate credentials regularly (90 days)
  • Use SSL/TLS for all connections

Audit log access:

  • Restrict to security team and compliance
  • Use role-based access control (RBAC)
  • Audit access to audit logs (meta-audit)
  • Require MFA for audit log access

Data Protection

Encryption:

  • In transit: TLS 1.3 for all connections
  • At rest: OpenSearch encryption enabled
  • Credentials: Use secrets manager (AWS Secrets, Vault)

Immutability:

  • Indices configured as write-once
  • No delete/update permissions
  • Daily backups to immutable storage (S3)

Compliance

GDPR:

  • Data minimization (automatic sanitization)
  • Right to access (search by user)
  • Right to erasure (anonymization process)
  • Data retention (configurable per-org)

SOC2:

  • Complete audit trail of all operations
  • Immutable audit logs
  • Encryption at rest and in transit
  • Access control and monitoring

HIPAA:

  • PHI access logging (all events captured)
  • 6-year retention capability
  • Encryption (FIPS 140-2)
  • Audit log integrity checks

Useful Commands

Quick Reference

# Check audit system health
curl http://localhost:8000/health/audit

# View metrics
curl http://localhost:8000/metrics | grep audit

# Flush pending events
curl -X POST http://localhost:8000/admin/audit/flush

# Check OpenSearch health
curl https://opensearch:9200/_cluster/health

# List audit indices
curl https://opensearch:9200/_cat/indices?v | grep audit

# Count events in index
curl https://opensearch:9200/audit-org123-*/_count

# Delete old indices (cleanup)
curl -X DELETE https://opensearch:9200/audit-*-2025-01-*

# Search recent events
curl -X POST https://opensearch:9200/audit-*/_search \
  -H 'Content-Type: application/json' -d '{
  "query": {"match_all": {}},
  "sort": [{"occurred_at": "desc"}],
  "size": 10
}'

Support Contacts

Escalation Path

  1. L1 Support: DevOps on-call

    • Basic health checks
    • Restart services
    • Check metrics/logs
  2. L2 Support: SRE team

    • Troubleshoot issues
    • Scale infrastructure
    • Config changes
  3. L3 Support: Engineering team

    • Code changes
    • Architecture decisions
    • Feature requests

Documentation

  • User Guide: /docs/05-security-and-audit/01-audit-trails.md
  • Configuration: /docs/05-security-and-audit/02-audit-configuration.md
  • This Runbook: /docs/05-security-and-audit/03-audit-operations-runbook.md
  • Architecture: /features/audit-module/PRD.md

Communication Channels

  • Incidents: PagerDuty #audit-system-critical
  • Warnings: Slack #audit-system-alerts
  • Questions: Slack #platform-support
  • Changes: Slack #platform-changes

Appendix

Configuration Reference

See full configuration reference in Audit Configuration Guide

Architecture Diagrams

See detailed architecture in Audit Module PRD

Changelog

Version  Date        Changes
1.0      2025-10-03  Initial runbook creation

Feedback

This runbook is maintained by the Platform Engineering team. For corrections or improvements:

  • Open an issue in the internal docs repository
  • Contact the Platform Engineering team
  • Contribute via pull request