Audit System Operations Runbook

This runbook provides operational procedures for managing and troubleshooting the PDaaS audit system. It is intended for DevOps engineers, SRE teams, and system administrators.

System Overview

Architecture

The PDaaS audit system consists of these core components:

┌─────────────────┐
│   FastAPI App   │
│  ┌───────────┐  │
│  │   Audit   │  │
│  │ Middleware│  │
│  └─────┬─────┘  │
└────────┼────────┘
         │ async
         ▼
┌─────────────────┐
│  Audit Module   │
│  ┌───────────┐  │
│  │  Emitter  │──┼──┐
│  └───────────┘  │  │
│  ┌───────────┐  │  │
│  │ Sanitizer │  │  │ batched
│  └───────────┘  │  │ events
│  ┌───────────┐  │  │
│  │   Sinks   │◄─┼──┘
│  └─────┬─────┘  │
└────────┼────────┘
         │
    ┌────┴────┐
    ▼         ▼
┌────────┐  ┌────────┐
│  Logs  │  │OpenSrch│
└────────┘  └────────┘

Component Responsibilities

Component            Responsibility                        Critical?
AuditMiddleware      Captures HTTP requests/responses      Yes
Emitter              Queues and batches events             Yes
Sanitizer            Removes sensitive data                Yes
OpenSearchAuditSink  Writes to OpenSearch                  Yes
LoggerAuditSink      Fallback to logs                      Yes
CircuitBreaker       Protects against OpenSearch failures  Yes

Data Flow

  1. Request arrives → Middleware captures start time
  2. Request processed → Handler executes business logic
  3. Response generated → Middleware captures response
  4. Event created → EnhancedAuditEvent with full context
  5. Event sanitized → Sensitive data removed
  6. Event queued → Added to batch buffer
  7. Batch flushed → On size (100) or time (5s) trigger
  8. Write to OpenSearch → Bulk write with retry
  9. Fallback if needed → Circuit breaker → logs
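
The batching behavior in steps 6-8 can be pictured with a short sketch. This is a simplified, hypothetical emitter loop (the real implementation lives in the audit module and may differ); it only illustrates the two flush triggers: batch size 100 or 5 seconds elapsed.

import asyncio
import time

class BatchingEmitter:
    """Hypothetical, simplified emitter loop illustrating the flush triggers."""

    def __init__(self, sink, batch_size: int = 100, flush_interval: float = 5.0):
        self.sink = sink                      # e.g. an OpenSearch sink with retry/fallback
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self.queue: asyncio.Queue = asyncio.Queue()

    async def emit(self, event: dict) -> None:
        # Called from the middleware; never blocks the request path.
        self.queue.put_nowait(event)

    async def run(self) -> None:
        buffer: list[dict] = []
        last_flush = time.monotonic()
        while True:
            remaining = self.flush_interval - (time.monotonic() - last_flush)
            try:
                event = await asyncio.wait_for(self.queue.get(), timeout=max(remaining, 0.05))
                buffer.append(event)
            except asyncio.TimeoutError:
                pass
            # Flush on size (100) or time (5s), matching steps 6-8 above.
            if len(buffer) >= self.batch_size or time.monotonic() - last_flush >= self.flush_interval:
                if buffer:
                    await self.sink.write_batch(buffer)  # bulk write with retry/fallback
                    buffer = []
                last_flush = time.monotonic()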

Key Metrics

Monitor these metrics for system health:

  • audit_events_total - Total events emitted
  • audit_events_latency_seconds - Event processing latency
  • audit_batch_flush_duration_seconds - Batch write time
  • audit_queue_size - Current queue depth
  • circuit_breaker_state - 0=closed, 1=open, 2=half-open
  • audit_errors_total - Total error count
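
These are standard Prometheus series. As a hedged sketch, they might be registered with prometheus_client roughly as follows (metric names taken from the list above; the label sets are assumptions, not the authoritative definitions):

from prometheus_client import Counter, Gauge, Histogram

# Assumed metric definitions; label names are illustrative only.
AUDIT_EVENTS_TOTAL = Counter(
    "audit_events_total", "Total events emitted", ["action", "result"]
)
AUDIT_EVENTS_LATENCY = Histogram(
    "audit_events_latency_seconds", "Event processing latency"
)
AUDIT_BATCH_FLUSH_DURATION = Histogram(
    "audit_batch_flush_duration_seconds", "Batch write time"
)
AUDIT_QUEUE_SIZE = Gauge("audit_queue_size", "Current queue depth")
CIRCUIT_BREAKER_STATE = Gauge(
    "circuit_breaker_state", "0=closed, 1=open, 2=half-open", ["sink_type"]
)
AUDIT_ERRORS_TOTAL = Counter("audit_errors_total", "Total error count", ["error_type"])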

Common Issues and Solutions

Issue 1: Circuit Breaker Open

Symptoms:

  • Alert: "CircuitBreakerOpen - OpenSearch unavailable for 10+ minutes"
  • Logs: WARNING: Circuit breaker open, using fallback sink
  • Metric: circuit_breaker_state{sink_type="opensearch"} == 1
  • Events going to logs only, not searchable in OpenSearch

Diagnosis:

# Check circuit breaker state
curl http://localhost:8000/metrics | grep circuit_breaker_state

# Check OpenSearch health
curl -u admin:password https://opensearch:9200/_cluster/health

# Check connectivity from app server
curl -v https://opensearch:9200/_cluster/health

# Check recent errors
docker logs api-container | grep -i "opensearch" | tail -50

Root Causes:

  1. OpenSearch cluster down or unavailable
  2. Network connectivity issues
  3. OpenSearch overloaded (too many requests)
  4. Authentication/SSL certificate issues
  5. OpenSearch disk full

Resolution:

If OpenSearch is down:

# Restart OpenSearch cluster
docker restart opensearch-node1

# Or on Kubernetes
kubectl rollout restart deployment opensearch

# Verify health
curl https://opensearch:9200/_cluster/health

If network issues:

# Test connectivity
ping opensearch.internal
telnet opensearch.internal 9200

# Check firewall rules
iptables -L | grep 9200

# Check DNS resolution
nslookup opensearch.internal

If authentication issues:

# Verify credentials
curl -u $AUDIT_OPENSEARCH_USERNAME:$AUDIT_OPENSEARCH_PASSWORD \
https://opensearch:9200/_cluster/health

# Check SSL cert
openssl s_client -connect opensearch:9200 -showcerts

If disk full:

# Check disk usage
curl https://opensearch:9200/_cat/allocation?v

# Delete old indices
curl -X DELETE https://opensearch:9200/audit-*-2025-01-*

Auto-Recovery:

  • Circuit breaker automatically tries to close after 60 seconds
  • Monitor circuit_breaker_transitions_total metric
  • Events buffered during outage will be written when recovered
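
The recovery behavior described above follows the standard circuit-breaker pattern. A minimal sketch (hypothetical, not the actual backend.audit implementation) of the state machine and the 60-second retry:

import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N failures, probe again after 60s."""

    CLOSED, OPEN, HALF_OPEN = 0, 1, 2  # matches the circuit_breaker_state metric values

    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = self.CLOSED
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        # After the recovery timeout, allow a single probe request (half-open).
        if self.state == self.OPEN and time.monotonic() - self.opened_at >= self.recovery_timeout:
            self.state = self.HALF_OPEN
        return self.state != self.OPEN

    def record_success(self) -> None:
        self.failures, self.state = 0, self.CLOSED

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold or self.state == self.HALF_OPEN:
            self.state, self.opened_at = self.OPEN, time.monotonic()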

Prevention:

  • Monitor OpenSearch cluster health proactively
  • Set up auto-scaling for OpenSearch
  • Configure index lifecycle management
  • Regular disk cleanup

Issue 2: High Queue Size

Symptoms:

  • Alert: "LargeAuditQueue - Queue size > 5,000 events"
  • Alert: "AuditQueueOverflow - Queue > 10,000 events" (critical)
  • Metric: audit_queue_size > 5000
  • Logs: WARNING: Audit queue size: 8432
  • Memory usage increasing on app servers

Diagnosis:

# Check queue size
curl http://localhost:8000/metrics | grep audit_queue_size

# Check OpenSearch write latency
curl http://localhost:8000/metrics | grep opensearch_write_latency

# Check batch flush duration
curl http://localhost:8000/metrics | grep audit_batch_flush_duration

# Check OpenSearch cluster performance
curl https://opensearch:9200/_cluster/stats

Root Causes:

  1. OpenSearch write throughput too low
  2. Batch flush taking too long
  3. Traffic spike overwhelming system
  4. OpenSearch cluster under-provisioned
  5. Network latency between app and OpenSearch

Resolution:

Immediate (stop the bleeding):

# Increase flush frequency (reduce interval)
# Update environment variable and restart
AUDIT_FLUSH_INTERVAL_SECONDS=1.0 # Down from 5.0

# Increase batch size (more efficient bulk writes)
AUDIT_BATCH_SIZE=500 # Up from 100

# Scale app horizontally (distribute load)
kubectl scale deployment api --replicas=5

Short-term (within hours):

# Scale OpenSearch cluster
# Add more data nodes for write capacity
kubectl scale statefulset opensearch --replicas=5

# Increase OpenSearch resources
# Edit deployment to increase CPU/memory

# Check index refresh interval
curl -X PUT https://opensearch:9200/audit-*/_settings \
  -H 'Content-Type: application/json' -d '{
  "index": {
    "refresh_interval": "30s"
  }
}'

Long-term (capacity planning):

  1. Review traffic patterns and plan capacity
  2. Implement auto-scaling based on queue size
  3. Consider dedicated OpenSearch cluster for audit
  4. Optimize index settings (shards, replicas)
  5. Implement index rollover for better performance

Monitoring:

# Track queue recovery
watch 'curl -s http://localhost:8000/metrics | grep audit_queue_size'

# Set up alert for sustained high queue
# Alert if queue > 5000 for 10+ minutes

Prevention:

  • Capacity plan for 2x expected peak load
  • Auto-scale OpenSearch based on CPU/memory
  • Monitor write latency and queue size proactively
  • Regular load testing

Issue 3: Index Creation Failures

Symptoms:

  • Alert: "IndexCreationFailures - Unable to create new indices"
  • Logs: ERROR: Failed to create index audit-org123-acc456-api-2025-10-03
  • HTTP 400/403 errors from OpenSearch
  • Events failing to write

Diagnosis:

# Check index creation errors
curl http://localhost:8000/metrics | grep audit_errors_total | grep index_creation

# List current indices
curl https://opensearch:9200/_cat/indices?v | grep audit

# Check cluster settings
curl https://opensearch:9200/_cluster/settings?include_defaults=true

# Check shard limits
curl https://opensearch:9200/_cluster/stats | jq '.indices.shards'

Root Causes:

  1. OpenSearch shard limit reached (default 1000 per node)
  2. Disk space exhausted
  3. Index template conflicts
  4. Permission issues (user cannot create indices)
  5. Too many indices (need cleanup)

Resolution:

If shard limit reached:

# Increase shard limit temporarily
curl -X PUT https://opensearch:9200/_cluster/settings \
  -H 'Content-Type: application/json' -d '{
  "persistent": {
    "cluster.max_shards_per_node": 2000
  }
}'

# Delete old indices to free shards
python -m backend.audit.cleanup --older-than 90

# Or delete manually
curl -X DELETE https://opensearch:9200/audit-*-2025-01-*

If disk full:

# Check disk usage
curl https://opensearch:9200/_cat/allocation?v

# Enable watermark enforcement (if disabled)
curl -X PUT https://opensearch:9200/_cluster/settings \
  -H 'Content-Type: application/json' -d '{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%"
  }
}'

# Force merge old indices to reclaim space
curl -X POST https://opensearch:9200/audit-*-2025-09-*/_forcemerge?max_num_segments=1

# Delete old indices
curl -X DELETE https://opensearch:9200/audit-*-2025-08-*

If permission issues:

# Check user permissions
curl -u admin:password https://opensearch:9200/_plugins/_security/api/roles/audit_writer

# Update role to allow index creation
curl -u admin:password -X PUT https://opensearch:9200/_plugins/_security/api/roles/audit_writer \
  -H 'Content-Type: application/json' -d '{
  "cluster_permissions": ["cluster:monitor/health"],
  "index_permissions": [{
    "index_patterns": ["audit-*"],
    "allowed_actions": ["write", "create_index", "indices:admin/create"]
  }]
}'

Prevention:

  • Automated retention policies (delete indices > 90 days)
  • Monitor shard count and disk usage
  • Use monthly rotation for low-volume organizations
  • Regular cleanup job
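
The cleanup job referenced above (python -m backend.audit.cleanup) could look roughly like this sketch using opensearch-py; the index-name parsing assumes the daily audit-{org}-{account}-{service}-{YYYY-MM-DD} pattern described later in this runbook:

from datetime import datetime, timedelta, timezone

from opensearchpy import OpenSearch

def delete_old_audit_indices(client: OpenSearch, older_than_days: int = 90) -> None:
    """Delete audit-* indices whose trailing date is past the retention window."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=older_than_days)
    for row in client.cat.indices(index="audit-*", format="json"):
        name = row["index"]
        try:
            # Index names end with a YYYY-MM-DD suffix (daily rotation).
            index_date = datetime.strptime("-".join(name.split("-")[-3:]), "%Y-%m-%d")
        except ValueError:
            continue  # skip monthly or otherwise unexpected patterns
        if index_date.replace(tzinfo=timezone.utc) < cutoff:
            client.indices.delete(index=name)
            print(f"Deleted {name}")

if __name__ == "__main__":
    client = OpenSearch(hosts=["https://opensearch:9200"], http_auth=("admin", "password"))
    delete_old_audit_indices(client, older_than_days=90)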

Issue 4: High Latency

Symptoms:

  • Alert: "HighAuditLatency - p95 latency > 100ms for 15 minutes"
  • Alert: "HighMiddlewareLatency - p95 > 5ms for 15 minutes"
  • Slow API response times
  • User complaints about performance

Diagnosis:

# Check audit latency
curl http://localhost:8000/metrics | grep audit_events_latency_seconds

# Check middleware overhead
curl http://localhost:8000/metrics | grep audit_middleware_latency_seconds

# Check OpenSearch write latency
curl http://localhost:8000/metrics | grep opensearch_write_latency_seconds

# Check batch flush duration
curl http://localhost:8000/metrics | grep audit_batch_flush_duration_seconds

Root Causes:

  1. Synchronous operations blocking request thread
  2. Large request/response bodies (serialization overhead)
  3. OpenSearch cluster slow or overloaded
  4. Network latency to OpenSearch
  5. Queue full causing backpressure

Resolution:

If middleware overhead high:

# Verify async processing is working
# Check logs for blocking operations
docker logs api-container | grep "audit" | grep -i "block"

# Increase excluded paths (reduce audit volume)
AUDIT_EXCLUDED_PATHS='["/health","/metrics","/internal/*","/debug/*"]'

# Reduce body size limit (less serialization)
AUDIT_MAX_BODY_SIZE=1024 # Down from 10240

If OpenSearch writes slow:

# Check OpenSearch performance
curl https://opensearch:9200/_nodes/stats/indices

# Reduce batch size (smaller bulk requests)
AUDIT_BATCH_SIZE=50 # Down from 100

# Increase OpenSearch resources
# Scale cluster or increase CPU/memory

# Check index settings
curl https://opensearch:9200/audit-*/_settings

If network latency:

# Measure network latency
ping opensearch.internal

# Use dedicated network for audit traffic
# Configure app to use internal VPC endpoint

# Increase batch size to amortize latency
AUDIT_BATCH_SIZE=500

Prevention:

  • Keep middleware logic minimal
  • Monitor latency continuously
  • Load test before production deployment
  • Use async operations everywhere

Issue 5: Audit Events Not Appearing

Symptoms:

  • Events missing from OpenSearch
  • Searches return no results
  • Dashboard shows no data
  • Users cannot find audit trail

Diagnosis:

# Check if auditing enabled
echo $AUDIT_ENABLED

# Check metrics for events emitted
curl http://localhost:8000/metrics | grep audit_events_total

# Check OpenSearch for indices
curl https://opensearch:9200/_cat/indices?v | grep audit

# Check recent events in index
curl "https://opensearch:9200/audit-*/_search?size=10&sort=occurred_at:desc"

# Check logs for errors
docker logs api-container | grep -i "audit" | grep -i "error"

Root Causes:

  1. Auditing disabled via config
  2. Path excluded from auditing
  3. Circuit breaker open (events in logs only)
  4. OpenSearch index pattern mismatch
  5. Tenant context missing (events written to wrong index)

Resolution:

If auditing disabled:

# Enable auditing
export AUDIT_ENABLED=true

# Restart application
docker restart api-container

If path excluded:

# Check excluded paths
python -c "
from backend.audit.config import get_audit_config
config = get_audit_config()
print(config.excluded_paths)
"

# Check if path is excluded
python -c "
from backend.audit.config import get_audit_config
config = get_audit_config()
print(config.is_path_excluded('/your/path'))
"

# Update excluded paths if needed
AUDIT_EXCLUDED_PATHS='["/health","/metrics"]'
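
Exclusion patterns such as /internal/* suggest simple wildcard matching. If you need to reason about a pattern outside the application, a hypothetical stand-in for config.is_path_excluded() (the real implementation lives in backend.audit.config and may behave differently):

from fnmatch import fnmatch

# Hypothetical equivalent of config.is_path_excluded(); for quick local checks only.
def is_path_excluded(path: str, excluded_patterns: list[str]) -> bool:
    return any(fnmatch(path, pattern) for pattern in excluded_patterns)

patterns = ["/health", "/metrics", "/internal/*"]
print(is_path_excluded("/internal/debug", patterns))  # True
print(is_path_excluded("/api/v1/users", patterns))    # False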

If circuit breaker open:

# Check circuit breaker state
curl http://localhost:8000/metrics | grep circuit_breaker_state

# Events in logs during outage
docker logs api-container | grep "AuditEvent"

# Fix OpenSearch and wait for recovery
# See "Issue 1: Circuit Breaker Open"

If index pattern mismatch:

# List all audit indices
curl https://opensearch:9200/_cat/indices?v | grep audit

# Check expected index name
# Pattern: audit-{org_id}-{account_id}-{service}-{date}
# Example: audit-org123-acc456-api-2025-10-03

# Create index pattern in OpenSearch Dashboards
# Pattern: audit-{your_org_id}-*

If tenant context missing:

# Check middleware order
# TenantMiddleware MUST run before AuditMiddleware

# Verify in app.py
grep -A 5 "add_middleware" backend/api/app.py

# Correct order:
# 1. CorrelationMiddleware
# 2. TenantMiddleware
# 3. AuditMiddleware
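
A hedged sketch of what the registration in backend/api/app.py might look like; the import path and class names are assumptions. Note that Starlette's add_middleware() wraps the app, so the middleware added last is the outermost and runs first:

from fastapi import FastAPI

# Hypothetical import path; check backend/api/app.py for the real registration.
from backend.api.middleware import AuditMiddleware, CorrelationMiddleware, TenantMiddleware

app = FastAPI()

# add_middleware() wraps the app: the LAST middleware added runs FIRST on each request.
app.add_middleware(AuditMiddleware)        # runs third (tenant context already set)
app.add_middleware(TenantMiddleware)       # runs second
app.add_middleware(CorrelationMiddleware)  # runs first (outermost)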

Prevention:

  • Monitor event ingestion rate
  • Test audit in all environments
  • Verify middleware order in CI/CD
  • Document excluded paths

Troubleshooting Guide

Debug Logging

Enable detailed debug logging:

# In code
import logging
logging.getLogger("backend.audit").setLevel(logging.DEBUG)

# Or via environment
export LOG_LEVEL=DEBUG

Debug output includes:

  • Event creation and sanitization
  • Batch queue operations
  • OpenSearch write attempts
  • Circuit breaker state changes
  • Error stack traces

Manual Event Testing

Test event emission manually:

from backend.audit import emit
from backend.utils.actor import ActorInfo

# Emit test event
await emit(
    action="test.manual",
    target="test:runbook",
    metadata={"test": True, "timestamp": "2025-10-03"},
    actor=ActorInfo(actor_type="system", actor_id="test"),
    organization_id="org_test",
    account_id="acc_test",
)

# Check OpenSearch for event
# Index: audit-org_test-acc_test-api-2025-10-03

Verify in OpenSearch:

curl https://opensearch:9200/audit-org_test-acc_test-api-2025-10-03/_search?q=action:test.manual

OpenSearch Query Examples

Find events by actor:

curl -X POST https://opensearch:9200/audit-*/_search \
  -H 'Content-Type: application/json' -d '{
  "query": {
    "term": {"actor_id": "user_123"}
  },
  "sort": [{"occurred_at": "desc"}],
  "size": 100
}'

Find failed operations:

curl -X POST https://opensearch:9200/audit-*/_search \
  -H 'Content-Type: application/json' -d '{
  "query": {
    "bool": {
      "must": [
        {"term": {"operation_result": "failure"}},
        {"range": {"response_status_code": {"gte": 400}}}
      ]
    }
  }
}'

Find recent errors:

curl -X POST https://opensearch:9200/audit-*/_search \
  -H 'Content-Type: application/json' -d '{
  "query": {
    "bool": {
      "must": [
        {"range": {"response_status_code": {"gte": 500}}},
        {"range": {"occurred_at": {"gte": "now-1h"}}}
      ]
    }
  },
  "sort": [{"occurred_at": "desc"}]
}'

Trace a request:

curl -X POST https://opensearch:9200/audit-*/_search \
  -H 'Content-Type: application/json' -d '{
  "query": {
    "term": {"trace_id": "abc123def456"}
  }
}'

Performance Profiling

Profile audit system performance:

# Add timing instrumentation
import time

start = time.time()
await emit(action="test", target="perf", request=request)
duration = time.time() - start
print(f"Emit took {duration*1000:.2f}ms")

Check metrics:

# Middleware overhead
curl -s http://localhost:8000/metrics | grep audit_middleware_latency_seconds

# Event processing
curl -s http://localhost:8000/metrics | grep audit_events_latency_seconds

# Batch flush
curl -s http://localhost:8000/metrics | grep audit_batch_flush_duration_seconds

Health Check Script

#!/bin/bash
# audit_health_check.sh

echo "=== Audit System Health Check ==="

# 1. Check if auditing is enabled
if [ "$AUDIT_ENABLED" = "true" ]; then
    echo "✓ Auditing enabled"
else
    echo "✗ Auditing disabled"
    exit 1
fi

# 2. Check OpenSearch connectivity
if curl -s -u "$AUDIT_OPENSEARCH_USERNAME:$AUDIT_OPENSEARCH_PASSWORD" \
    "https://$AUDIT_OPENSEARCH_HOST:$AUDIT_OPENSEARCH_PORT/_cluster/health" | grep -q green; then
    echo "✓ OpenSearch cluster healthy"
else
    echo "✗ OpenSearch cluster unhealthy"
fi

# 3. Check circuit breaker state
CB_STATE=$(curl -s http://localhost:8000/metrics | grep 'circuit_breaker_state{sink_type="opensearch"}' | awk '{print $2}')
CB_STATE=${CB_STATE%.*}  # strip decimal part from the Prometheus exposition format
if [ "$CB_STATE" = "0" ]; then
    echo "✓ Circuit breaker closed"
else
    echo "✗ Circuit breaker open/half-open"
fi

# 4. Check queue size
QUEUE_SIZE=$(curl -s http://localhost:8000/metrics | grep 'audit_queue_size{' | awk '{print $2}')
QUEUE_SIZE=${QUEUE_SIZE%.*}  # strip decimal part for the integer comparison
if [ "$QUEUE_SIZE" -lt 5000 ]; then
    echo "✓ Queue size normal ($QUEUE_SIZE)"
else
    echo "⚠ Queue size high ($QUEUE_SIZE)"
fi

# 5. Check event ingestion rate
EVENTS=$(curl -s http://localhost:8000/metrics | grep 'audit_events_total{' | head -1 | awk '{print $2}')
echo "✓ Events emitted: $EVENTS"

echo "=== Health Check Complete ==="

Recovery Procedures

Procedure 1: Restart Audit System

When to use: After configuration changes, or to recover from errors

# 1. Graceful shutdown (flush pending events)
curl -X POST http://localhost:8000/admin/audit/flush

# 2. Wait for flush to complete (check queue size)
curl -s http://localhost:8000/metrics | grep audit_queue_size

# 3. Restart application
docker restart api-container

# Or on Kubernetes
kubectl rollout restart deployment api

# 4. Verify startup
docker logs -f api-container | grep "audit"

# 5. Check health
curl http://localhost:8000/health/audit

Procedure 2: Flush Pending Events

When to use: Before shutdown, or to clear queue

# 1. Check current queue size
curl -s http://localhost:8000/metrics | grep audit_queue_size

# 2. Trigger manual flush
curl -X POST http://localhost:8000/admin/audit/flush

# 3. Wait for flush (monitor queue size)
watch 'curl -s http://localhost:8000/metrics | grep audit_queue_size'

# 4. Verify all events written
# Queue size should be 0

Procedure 3: Recover from OpenSearch Outage

When to use: After extended OpenSearch downtime

During outage:

  1. Events automatically go to logs (fallback)
  2. Circuit breaker opens to protect system
  3. Monitor for OpenSearch recovery

After OpenSearch recovers:

# 1. OpenSearch comes back online
curl https://opensearch:9200/_cluster/health

# 2. Circuit breaker auto-recovers (60s timeout)
# Monitor state transition
curl -s http://localhost:8000/metrics | grep circuit_breaker_state

# 3. New events resume writing to OpenSearch
# Check event count in indices
curl https://opensearch:9200/audit-*/_count

# 4. Review log events during outage
docker logs api-container --since 1h | grep "AuditEvent"

Lost events during outage:

  • Events written to structured logs
  • Can be extracted and replayed if needed
  • Use log aggregation tool (e.g., CloudWatch, Datadog)
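
If a replay is ever required, a rough sketch of the approach is below. It assumes fallback events were logged as JSON payloads on lines containing an "AuditEvent" marker and that indices follow the audit-{org}-{account}-{service}-{date} pattern; the real log format may differ, so adapt the parsing before use.

import json
import sys

from opensearchpy import OpenSearch, helpers

def replay_log_events(log_path: str, client: OpenSearch) -> None:
    """Re-index audit events captured in fallback logs during an OpenSearch outage."""
    actions = []
    with open(log_path) as fh:
        for line in fh:
            if "AuditEvent" not in line or "{" not in line:
                continue
            event = json.loads(line[line.index("{"):])  # assumes a JSON payload after the marker
            # Index naming assumed from the audit-{org}-{account}-{service}-{date} pattern.
            index = (f"audit-{event['organization_id']}-{event['account_id']}"
                     f"-api-{event['occurred_at'][:10]}")
            actions.append({"_index": index, "_source": event})
    helpers.bulk(client, actions)
    print(f"Replayed {len(actions)} events")

if __name__ == "__main__":
    client = OpenSearch(hosts=["https://opensearch:9200"], http_auth=("admin", "password"))
    replay_log_events(sys.argv[1], client)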

Procedure 4: Recover from Data Loss

When to use: Index accidentally deleted, corruption

# 1. Check if snapshots available
curl https://opensearch:9200/_snapshot/_all

# 2. List snapshots for date range
curl https://opensearch:9200/_snapshot/audit_backup/_all

# 3. Restore specific indices
curl -X POST https://opensearch:9200/_snapshot/audit_backup/snapshot_2025-10-03/_restore \
  -H 'Content-Type: application/json' -d '{
  "indices": "audit-org123-acc456-api-2025-10-03",
  "ignore_unavailable": true,
  "include_global_state": false
}'

# 4. Monitor restore progress
curl https://opensearch:9200/_recovery?active_only=true

# 5. Verify data restored
curl https://opensearch:9200/audit-org123-acc456-api-2025-10-03/_count

Prevention:

  • Daily automated snapshots
  • Immutable index settings (prevent deletion)
  • Backup retention: 30 days
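
The daily snapshots can be scripted. A sketch using opensearch-py, assuming an S3-backed repository named audit_backup (matching Procedure 4); the bucket name is an assumption and the S3 repository type requires the repository-s3 plugin:

from datetime import date

from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://opensearch:9200"], http_auth=("admin", "password"))

# One-time: register the snapshot repository (bucket name is illustrative).
client.snapshot.create_repository(
    repository="audit_backup",
    body={"type": "s3", "settings": {"bucket": "pdaas-audit-snapshots", "base_path": "audit"}},
)

# Daily (e.g. from cron): snapshot all audit indices.
client.snapshot.create(
    repository="audit_backup",
    snapshot=f"snapshot_{date.today().isoformat()}",
    body={"indices": "audit-*", "include_global_state": False},
    wait_for_completion=False,
)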

Procedure 5: Emergency Disable

When to use: Critical performance issue, security concern

# 1. Disable auditing immediately
export AUDIT_ENABLED=false

# 2. Restart application
docker restart api-container

# 3. Verify auditing stopped
curl -s http://localhost:8000/metrics | grep audit_events_total
# Count should stop increasing

# 4. Investigate issue while system is stable

# 5. Re-enable when resolved
export AUDIT_ENABLED=true
docker restart api-container

⚠️ WARNING: Disabling audit breaks compliance. Only use in emergencies.

Capacity Planning

Event Volume Estimation

Calculate expected events per day:

Events/day = (API requests/day) × (1 - excluded_ratio)

Example:
- 10M API requests/day
- 20% excluded (/health, /metrics)
- Events/day = 10M × 0.8 = 8M events/day

Storage estimation:

Storage/day = Events/day × Avg_event_size

Example:
- 8M events/day
- 5KB average event size (after sanitization)
- Storage/day = 8M × 5KB = 40GB/day
- Storage/month = 40GB × 30 = 1.2TB/month

OpenSearch Sizing

Cluster sizing guidelines:

Traffic Level  Events/sec  Storage/day  Recommended Cluster
Low            < 100       < 5GB        1 data node, 2 CPU, 4GB RAM
Medium         100-1000    5-50GB       3 data nodes, 4 CPU, 8GB RAM
High           1000-5000   50-250GB     5 data nodes, 8 CPU, 16GB RAM
Very High      > 5000      > 250GB      10+ data nodes, 16 CPU, 32GB RAM

Disk sizing:

Total disk = (Daily storage × Retention days) × Replication factor × Overhead

Example:
- 40GB/day
- 90 days retention
- 1 replica (2x)
- 20% overhead
- Total = 40 × 90 × 2 × 1.2 = 8.6TB
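
These formulas are easy to plug into a quick script when planning; for example, using the numbers above:

# Worked example of the capacity formulas above.
requests_per_day = 10_000_000
excluded_ratio = 0.20              # /health, /metrics, etc.
avg_event_size_kb = 5
retention_days = 90
replication_factor = 2             # 1 replica
overhead = 1.20                    # 20% headroom

events_per_day = requests_per_day * (1 - excluded_ratio)
storage_per_day_gb = events_per_day * avg_event_size_kb / 1_000_000
total_disk_tb = storage_per_day_gb * retention_days * replication_factor * overhead / 1_000

print(f"{events_per_day:,.0f} events/day")   # 8,000,000 events/day
print(f"{storage_per_day_gb:.0f} GB/day")    # 40 GB/day
print(f"{total_disk_tb:.1f} TB total disk")  # 8.6 TB total disk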

Scaling Guidelines

Horizontal scaling (add app instances):

  • Increases event throughput
  • Distributes queue load
  • Each instance has independent queue

Vertical scaling (increase app resources):

  • Larger queue capacity (more memory)
  • Faster event processing (more CPU)

OpenSearch scaling:

# Scale data nodes
kubectl scale statefulset opensearch --replicas=5

# Increase node resources
# Edit statefulset to increase CPU/memory limits

# Add dedicated master nodes (for large clusters)
# Prevents split-brain, improves stability

Auto-Scaling Configuration

Application auto-scaling (Kubernetes HPA):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: audit_queue_size
        target:
          type: AverageValue
          averageValue: 3000  # Scale up if avg queue > 3000
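
Note: scaling on a Pods metric such as audit_queue_size requires the metric to be exposed through the Kubernetes custom metrics API (for example via prometheus-adapter); a resource-only HPA cannot see application metrics.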

OpenSearch auto-scaling:

  • Use managed OpenSearch (AWS, Elastic Cloud)
  • Configure auto-scaling based on CPU/memory
  • Or monitor and scale manually

Cost Optimization

Reduce storage costs:

  1. Decrease retention period (90 → 60 days)
  2. Use compression (gzip indices)
  3. Reduce replica count (2 → 1)
  4. Implement ILM (hot → warm → cold → delete)
  5. Increase body truncation (10KB → 5KB)

Reduce compute costs:

  1. Increase batch size (more efficient writes)
  2. Exclude more paths (reduce volume)
  3. Use cheaper instance types for cold nodes
  4. Right-size cluster for actual load

Alerting Configuration

Critical Alerts (PagerDuty)

Alert: AuditSystemDown

- alert: AuditSystemDown
  expr: rate(audit_events_total[5m]) == 0
  for: 5m
  labels:
    severity: critical
    component: audit
  annotations:
    summary: "Audit system not writing events"
    description: "No audit events written in last 5 minutes"
    action: "Check if auditing is enabled, verify OpenSearch connectivity"

Alert: CircuitBreakerOpen

- alert: CircuitBreakerOpen
  expr: circuit_breaker_state{sink_type="opensearch"} == 1
  for: 10m
  labels:
    severity: critical
    component: audit
  annotations:
    summary: "OpenSearch circuit breaker open"
    description: "Audit events going to logs only for 10+ minutes"
    action: "Fix OpenSearch connectivity, check cluster health"

Alert: HighAuditErrorRate

- alert: HighAuditErrorRate
  expr: rate(audit_errors_total[5m]) / rate(audit_events_total[5m]) > 0.05
  for: 5m
  labels:
    severity: critical
    component: audit
  annotations:
    summary: "High audit error rate (>5%)"
    description: "More than 5% of audit writes failing"
    action: "Check OpenSearch cluster, review error logs"

Alert: AuditQueueOverflow

- alert: AuditQueueOverflow
  expr: audit_queue_size > 10000
  for: 5m
  labels:
    severity: critical
    component: audit
  annotations:
    summary: "Audit queue overflow (>10,000 events)"
    description: "Queue size exceeded threshold, risk of event loss"
    action: "Scale OpenSearch, increase flush frequency, scale app"

Warning Alerts (Slack)

Alert: HighAuditLatency

- alert: HighAuditLatency
  expr: histogram_quantile(0.95, rate(audit_events_latency_seconds_bucket[5m])) > 0.1
  for: 15m
  labels:
    severity: warning
    component: audit
  annotations:
    summary: "High audit latency (p95 > 100ms)"
    description: "Audit event processing taking too long"
    action: "Check OpenSearch performance, review batch settings"

Alert: LargeAuditQueue

- alert: LargeAuditQueue
  expr: audit_queue_size > 5000
  for: 10m
  labels:
    severity: warning
    component: audit
  annotations:
    summary: "Large audit queue (>5,000 events)"
    description: "Queue size elevated, monitor for growth"
    action: "Monitor OpenSearch write throughput, prepare to scale"

Include dashboard links in alerts:

annotations:
  dashboard: "https://grafana.internal/d/audit-system-health"
  runbook: "https://docs.internal/audit-operations-runbook"

Monitoring Best Practices

Key Dashboards

  1. Audit System Health (operations)

    • Event ingestion rate
    • Queue size and latency
    • Circuit breaker state
    • Error rate
  2. API Activity (business)

    • Request volume by endpoint
    • Error rate by endpoint
    • Response time trends
  3. Security & Compliance (security team)

    • Authentication events
    • Authorization decisions
    • Failed access attempts
    • Sensitive operations
  4. Multi-Tenant Activity (product)

    • Events by organization
    • Storage usage
    • Top active users

Metric Retention

  • High resolution (1m): 7 days
  • Medium resolution (5m): 30 days
  • Low resolution (1h): 365 days

Log Retention

  • Application logs: 30 days
  • Audit events (OpenSearch): 90 days (configurable)
  • Metrics: 365 days

Security Considerations

Access Control

OpenSearch access:

  • Use dedicated service account
  • Least privilege permissions (write only to audit-*)
  • Rotate credentials regularly (90 days)
  • Use SSL/TLS for all connections

Audit log access:

  • Restrict to security team and compliance
  • Use role-based access control (RBAC)
  • Audit access to audit logs (meta-audit)
  • Require MFA for audit log access

Data Protection

Encryption:

  • In transit: TLS 1.3 for all connections
  • At rest: OpenSearch encryption enabled
  • Credentials: Use secrets manager (AWS Secrets, Vault)

Immutability:

  • Indices configured as write-once
  • No delete/update permissions
  • Daily backups to immutable storage (S3)

Compliance

GDPR:

  • Data minimization (automatic sanitization)
  • Right to access (search by user)
  • Right to erasure (anonymization process)
  • Data retention (configurable per-org)

SOC2:

  • Complete audit trail of all operations
  • Immutable audit logs
  • Encryption at rest and in transit
  • Access control and monitoring

HIPAA:

  • PHI access logging (all events captured)
  • 6-year retention capability
  • Encryption (FIPS 140-2)
  • Audit log integrity checks

Useful Commands

Quick Reference

# Check audit system health
curl http://localhost:8000/health/audit

# View metrics
curl http://localhost:8000/metrics | grep audit

# Flush pending events
curl -X POST http://localhost:8000/admin/audit/flush

# Check OpenSearch health
curl https://opensearch:9200/_cluster/health

# List audit indices
curl https://opensearch:9200/_cat/indices?v | grep audit

# Count events in index
curl https://opensearch:9200/audit-org123-*/_count

# Delete old indices (cleanup)
curl -X DELETE https://opensearch:9200/audit-*-2025-01-*

# Search recent events
curl -X POST https://opensearch:9200/audit-*/_search \
  -H 'Content-Type: application/json' -d '{
  "query": {"match_all": {}},
  "sort": [{"occurred_at": "desc"}],
  "size": 10
}'

Support Contacts

Escalation Path

  1. L1 Support: DevOps on-call

    • Basic health checks
    • Restart services
    • Check metrics/logs
  2. L2 Support: SRE team

    • Troubleshoot issues
    • Scale infrastructure
    • Config changes
  3. L3 Support: Engineering team

    • Code changes
    • Architecture decisions
    • Feature requests

Documentation

  • User Guide: /docs/05-security-and-audit/01-audit-trails.md
  • Configuration: /docs/05-security-and-audit/02-audit-configuration.md
  • This Runbook: /docs/05-security-and-audit/03-audit-operations-runbook.md
  • Architecture: /features/audit-module/PRD.md

Communication Channels

  • Incidents: PagerDuty #audit-system-critical
  • Warnings: Slack #audit-system-alerts
  • Questions: Slack #platform-support
  • Changes: Slack #platform-changes

Appendix

Configuration Reference

See full configuration reference in Audit Configuration Guide

Architecture Diagrams

See detailed architecture in Audit Module PRD

Changelog

Version  Date        Changes
1.0      2025-10-03  Initial runbook creation

Feedback

This runbook is maintained by the Platform Engineering team. For corrections or improvements:

  • Open an issue in the internal docs repository
  • Contact the Platform Engineering team
  • Contribute via pull request