Audit System Operations Runbook
This runbook provides operational procedures for managing and troubleshooting the PDaaS audit system. It is intended for DevOps engineers, SRE teams, and system administrators.
System Overview
Architecture
The PDaaS audit system consists of these core components:
┌─────────────────┐
│ FastAPI App │
│ ┌───────────┐ │
│ │ Audit │ │
│ │ Middleware│ │
│ └─────┬─────┘ │
└────────┼────────┘
│ async
▼
┌─────────────────┐
│ Audit Module │
│ ┌───────────┐ │
│ │ Emitter │ │──┐
│ └───────────┘ │ │
│ ┌───────────┐ │ │
│ │Sanitizer │ │ │ batched
│ └───────────┘ │ │ events
│ ┌───────────┐ │ │
│ │ Sinks │◄─┘ │
│ └─────┬─────┘ │
└────────┼────────┘
│
┌────┴────┐
▼ ▼
┌────────┐ ┌────────┐
│Logs │ │OpenSrch│
└────────┘ └────────┘
Component Responsibilities
| Component | Responsibility | Critical? |
|---|---|---|
| AuditMiddleware | Captures HTTP requests/responses | Yes |
| Emitter | Queues and batches events | Yes |
| Sanitizer | Removes sensitive data | Yes |
| OpenSearchAuditSink | Writes to OpenSearch | Yes |
| LoggerAuditSink | Fallback to logs | Yes |
| CircuitBreaker | Protects against OpenSearch failures | Yes |
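To illustrate what the Sanitizer does, here is a minimal sketch of recursive redaction; the key list and function name are illustrative assumptions, not the module's actual deny-list or API:

SENSITIVE_KEYS = {"password", "token", "authorization", "secret", "api_key"}  # assumed deny-list

def sanitize(value):
    # Recursively replace values stored under sensitive keys
    if isinstance(value, dict):
        return {k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else sanitize(v)
                for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize(item) for item in value]
    return value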
Data Flow
- Request arrives → Middleware captures start time
- Request processed → Handler executes business logic
- Response generated → Middleware captures response
- Event created → EnhancedAuditEvent with full context
- Event sanitized → Sensitive data removed
- Event queued → Added to batch buffer
- Batch flushed → On size (100) or time (5s) trigger
- Write to OpenSearch → Bulk write with retry
- Fallback if needed → Circuit breaker → logs
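The queue-and-flush behaviour above (flush when the batch reaches 100 events or every 5 seconds) can be sketched roughly as follows; the class and sink method names are illustrative, not the real Emitter API:

import asyncio

class BatchingEmitter:
    """Illustrative sketch: queue events, flush on size or time."""

    def __init__(self, sink, batch_size=100, flush_interval=5.0):
        self.sink = sink
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self.queue = []

    async def emit(self, event):
        # Size trigger: flush as soon as the batch is full
        self.queue.append(event)
        if len(self.queue) >= self.batch_size:
            await self.flush()

    async def run_flush_loop(self):
        # Time trigger: flush whatever is queued every flush_interval seconds
        while True:
            await asyncio.sleep(self.flush_interval)
            await self.flush()

    async def flush(self):
        if not self.queue:
            return
        batch, self.queue = self.queue, []
        await self.sink.write_batch(batch)  # bulk write; the real sink adds retry and fallback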
Key Metrics
Monitor these metrics for system health:
- audit_events_total - Total events emitted
- audit_events_latency_seconds - Event processing latency
- audit_batch_flush_duration_seconds - Batch write time
- audit_queue_size - Current queue depth
- circuit_breaker_state - 0=closed, 1=open, 2=half-open
- audit_errors_total - Total error count
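If you expose comparable metrics from your own tooling, a minimal prometheus_client sketch would look like this (metric names mirror the list above; the label sets are assumptions):

from prometheus_client import Counter, Gauge, Histogram

audit_events_total = Counter("audit_events_total", "Total events emitted", ["action"])
audit_events_latency_seconds = Histogram("audit_events_latency_seconds", "Event processing latency")
audit_batch_flush_duration_seconds = Histogram("audit_batch_flush_duration_seconds", "Batch write time")
audit_queue_size = Gauge("audit_queue_size", "Current queue depth")
circuit_breaker_state = Gauge("circuit_breaker_state", "0=closed, 1=open, 2=half-open", ["sink_type"])
audit_errors_total = Counter("audit_errors_total", "Total error count", ["error_type"])

# Example updates
audit_events_total.labels(action="user.login").inc()
audit_queue_size.set(42)
circuit_breaker_state.labels(sink_type="opensearch").set(0)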
Common Issues and Solutions
Issue 1: Circuit Breaker Open
Symptoms:
- Alert: "CircuitBreakerOpen - OpenSearch unavailable for 10+ minutes"
- Logs: WARNING: Circuit breaker open, using fallback sink
- Metric: circuit_breaker_state{sink_type="opensearch"} == 1
- Events going to logs only, not searchable in OpenSearch
Diagnosis:
# Check circuit breaker state
curl http://localhost:8000/metrics | grep circuit_breaker_state
# Check OpenSearch health
curl -u admin:password https://opensearch:9200/_cluster/health
# Check connectivity from app server
curl -v https://opensearch:9200/_cluster/health
# Check recent errors
docker logs api-container | grep -i "opensearch" | tail -50
Root Causes:
- OpenSearch cluster down or unavailable
- Network connectivity issues
- OpenSearch overloaded (too many requests)
- Authentication/SSL certificate issues
- OpenSearch disk full
Resolution:
If OpenSearch is down:
# Restart OpenSearch cluster
docker restart opensearch-node1
# Or on Kubernetes
kubectl rollout restart deployment opensearch
# Verify health
curl https://opensearch:9200/_cluster/health
If network issues:
# Test connectivity
ping opensearch.internal
telnet opensearch.internal 9200
# Check firewall rules
iptables -L | grep 9200
# Check DNS resolution
nslookup opensearch.internal
If authentication issues:
# Verify credentials
curl -u $AUDIT_OPENSEARCH_USERNAME:$AUDIT_OPENSEARCH_PASSWORD \
https://opensearch:9200/_cluster/health
# Check SSL cert
openssl s_client -connect opensearch:9200 -showcerts
If disk full:
# Check disk usage
curl https://opensearch:9200/_cat/allocation?v
# Delete old indices
curl -X DELETE https://opensearch:9200/audit-*-2025-01-*
Auto-Recovery:
- Circuit breaker automatically tries to close after 60 seconds
- Monitor the circuit_breaker_transitions_total metric
- Events buffered during the outage are written once OpenSearch recovers
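For reference, the open/half-open/close cycle can be sketched as below; the failure threshold and class shape are assumptions rather than the actual CircuitBreaker implementation:

import time

class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = 0, 1, 2

    def __init__(self, failure_threshold=5, recovery_timeout=60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = self.CLOSED
        self.opened_at = 0.0

    def allow_request(self):
        # After the recovery timeout, let one probe request through (half-open)
        if self.state == self.OPEN and time.monotonic() - self.opened_at >= self.recovery_timeout:
            self.state = self.HALF_OPEN
        return self.state != self.OPEN

    def record_success(self):
        self.failures = 0
        self.state = self.CLOSED

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold or self.state == self.HALF_OPEN:
            self.state = self.OPEN
            self.opened_at = time.monotonic()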
Prevention:
- Monitor OpenSearch cluster health proactively
- Set up auto-scaling for OpenSearch
- Configure index lifecycle management
- Regular disk cleanup
Issue 2: High Queue Size
Symptoms:
- Alert: "LargeAuditQueue - Queue size > 5,000 events"
- Alert: "AuditQueueOverflow - Queue > 10,000 events" (critical)
- Metric: audit_queue_size > 5000
- Logs: WARNING: Audit queue size: 8432
- Memory usage increasing on app servers
Diagnosis:
# Check queue size
curl http://localhost:8000/metrics | grep audit_queue_size
# Check OpenSearch write latency
curl http://localhost:8000/metrics | grep opensearch_write_latency
# Check batch flush duration
curl http://localhost:8000/metrics | grep audit_batch_flush_duration
# Check OpenSearch cluster performance
curl https://opensearch:9200/_cluster/stats
Root Causes:
- OpenSearch write throughput too low
- Batch flush taking too long
- Traffic spike overwhelming system
- OpenSearch cluster under-provisioned
- Network latency between app and OpenSearch
Resolution:
Immediate (stop the bleeding):
# Increase flush frequency (reduce interval)
# Update environment variable and restart
AUDIT_FLUSH_INTERVAL_SECONDS=1.0 # Down from 5.0
# Increase batch size (more efficient bulk writes)
AUDIT_BATCH_SIZE=500 # Up from 100
# Scale app horizontally (distribute load)
kubectl scale deployment api --replicas=5
Short-term (within hours):
# Scale OpenSearch cluster
# Add more data nodes for write capacity
kubectl scale statefulset opensearch --replicas=5
# Increase OpenSearch resources
# Edit deployment to increase CPU/memory
# Check index refresh interval
curl -X PUT https://opensearch:9200/audit-*/_settings \
  -H 'Content-Type: application/json' -d '{
  "index": {
    "refresh_interval": "30s"
  }
}'
Long-term (capacity planning):
- Review traffic patterns and plan capacity
- Implement auto-scaling based on queue size
- Consider dedicated OpenSearch cluster for audit
- Optimize index settings (shards, replicas)
- Implement index rollover for better performance
Monitoring:
# Track queue recovery
watch 'curl -s http://localhost:8000/metrics | grep audit_queue_size'
# Set up alert for sustained high queue
# Alert if queue > 5000 for 10+ minutes
Prevention:
- Capacity plan for 2x expected peak load
- Auto-scale OpenSearch based on CPU/memory
- Monitor write latency and queue size proactively
- Regular load testing
Issue 3: Index Creation Failures
Symptoms:
- Alert: "IndexCreationFailures - Unable to create new indices"
- Logs: ERROR: Failed to create index audit-org123-acc456-api-2025-10-03
- HTTP 400/403 errors from OpenSearch
- Events failing to write
Diagnosis:
# Check index creation errors
curl http://localhost:8000/metrics | grep audit_errors_total | grep index_creation
# List current indices
curl https://opensearch:9200/_cat/indices?v | grep audit
# Check cluster settings
curl https://opensearch:9200/_cluster/settings?include_defaults=true
# Check shard limits
curl https://opensearch:9200/_cluster/stats | jq '.indices.shards'
Root Causes:
- OpenSearch shard limit reached (default 1000 per node)
- Disk space exhausted
- Index template conflicts
- Permission issues (user cannot create indices)
- Too many indices (need cleanup)
Resolution:
If shard limit reached:
# Increase shard limit temporarily
curl -X PUT https://opensearch:9200/_cluster/settings \
  -H 'Content-Type: application/json' -d '{
  "persistent": {
    "cluster.max_shards_per_node": 2000
  }
}'
# Delete old indices to free shards
python -m backend.audit.cleanup --older-than 90
# Or delete manually
curl -X DELETE https://opensearch:9200/audit-*-2025-01-*
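If you need to script the cleanup rather than use the CLI, a rough opensearch-py sketch could look like the following; connection details are placeholders and this is not the actual backend.audit.cleanup implementation:

from datetime import datetime, timedelta, timezone
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://opensearch:9200"], http_auth=("admin", "password"))
cutoff = datetime.now(timezone.utc) - timedelta(days=90)

for row in client.cat.indices(index="audit-*", format="json"):
    name = row["index"]
    # Index names end in a date, e.g. audit-org123-acc456-api-2025-10-03
    try:
        index_date = datetime.strptime("-".join(name.split("-")[-3:]), "%Y-%m-%d").replace(tzinfo=timezone.utc)
    except ValueError:
        continue
    if index_date < cutoff:
        client.indices.delete(index=name)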
If disk full:
# Check disk usage
curl https://opensearch:9200/_cat/allocation?v
# Adjust disk watermarks temporarily if nodes are close to the limit
curl -X PUT https://opensearch:9200/_cluster/settings \
  -H 'Content-Type: application/json' -d '{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%"
  }
}'
# Force merge old indices to reclaim space
curl -X POST https://opensearch:9200/audit-*-2025-09-*/_forcemerge?max_num_segments=1
# Delete old indices
curl -X DELETE https://opensearch:9200/audit-*-2025-08-*
If permission issues:
# Check user permissions
curl -u admin:password https://opensearch:9200/_plugins/_security/api/roles/audit_writer
# Update role to allow index creation
curl -X PUT https://opensearch:9200/_plugins/_security/api/roles/audit_writer \
  -H 'Content-Type: application/json' -d '{
  "cluster_permissions": ["cluster:monitor/health"],
  "index_permissions": [{
    "index_patterns": ["audit-*"],
    "allowed_actions": ["write", "create_index", "indices:admin/create"]
  }]
}'
Prevention:
- Automated retention policies (delete indices > 90 days)
- Monitor shard count and disk usage
- Use monthly rotation for low-volume organizations
- Regular cleanup job
Issue 4: High Latency
Symptoms:
- Alert: "HighAuditLatency - p95 latency > 100ms for 15 minutes"
- Alert: "HighMiddlewareLatency - p95 > 5ms for 15 minutes"
- Slow API response times
- User complaints about performance
Diagnosis:
# Check audit latency
curl http://localhost:8000/metrics | grep audit_events_latency_seconds
# Check middleware overhead
curl http://localhost:8000/metrics | grep audit_middleware_latency_seconds
# Check OpenSearch write latency
curl http://localhost:8000/metrics | grep opensearch_write_latency_seconds
# Check batch flush duration
curl http://localhost:8000/metrics | grep audit_batch_flush_duration_seconds
Root Causes:
- Synchronous operations blocking request thread
- Large request/response bodies (serialization overhead)
- OpenSearch cluster slow or overloaded
- Network latency to OpenSearch
- Queue full causing backpressure
Resolution:
If middleware overhead high:
# Verify async processing is working
# Check logs for blocking operations
docker logs api-container | grep "audit" | grep -i "block"
# Increase excluded paths (reduce audit volume)
AUDIT_EXCLUDED_PATHS='["/health","/metrics","/internal/*","/debug/*"]'
# Reduce body size limit (less serialization)
AUDIT_MAX_BODY_SIZE=1024 # Down from 10240
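Reducing AUDIT_MAX_BODY_SIZE simply caps how much of each body is serialized into the event; conceptually it amounts to something like this sketch (not the module's actual truncation code):

def truncate_body(body: bytes, max_size: int = 1024) -> bytes:
    # Keep only the first max_size bytes and mark that a cut happened
    if len(body) <= max_size:
        return body
    return body[:max_size] + b"...[truncated]"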
If OpenSearch writes slow:
# Check OpenSearch performance
curl https://opensearch:9200/_nodes/stats/indices
# Reduce batch size (smaller bulk requests)
AUDIT_BATCH_SIZE=50 # Down from 100
# Increase OpenSearch resources
# Scale cluster or increase CPU/memory
# Check index settings
curl https://opensearch:9200/audit-*/_settings
If network latency:
# Measure network latency
ping opensearch.internal
# Use dedicated network for audit traffic
# Configure app to use internal VPC endpoint
# Increase batch size to amortize latency
AUDIT_BATCH_SIZE=500
Prevention:
- Keep middleware logic minimal
- Monitor latency continuously
- Load test before production deployment
- Use async operations everywhere
Issue 5: Audit Events Not Appearing
Symptoms:
- Events missing from OpenSearch
- Searches return no results
- Dashboard shows no data
- Users cannot find audit trail
Diagnosis:
# Check if auditing enabled
echo $AUDIT_ENABLED
# Check metrics for events emitted
curl http://localhost:8000/metrics | grep audit_events_total
# Check OpenSearch for indices
curl https://opensearch:9200/_cat/indices?v | grep audit
# Check recent events in index
curl "https://opensearch:9200/audit-*/_search?size=10&sort=occurred_at:desc"
# Check logs for errors
docker logs api-container | grep -i "audit" | grep -i "error"
Root Causes:
- Auditing disabled via config
- Path excluded from auditing
- Circuit breaker open (events in logs only)
- OpenSearch index pattern mismatch
- Tenant context missing (events written to wrong index)
Resolution:
If auditing disabled:
# Enable auditing
export AUDIT_ENABLED=true
# Restart application
docker restart api-container
If path excluded:
# Check excluded paths
python -c "
from backend.audit.config import get_audit_config
config = get_audit_config()
print(config.excluded_paths)
"
# Check if path is excluded
python -c "
from backend.audit.config import get_audit_config
config = get_audit_config()
print(config.is_path_excluded('/your/path'))
"
# Update excluded paths if needed
AUDIT_EXCLUDED_PATHS='["/health","/metrics"]'
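Wildcard entries such as "/internal/*" can be approximated with glob matching; this is a sketch of the idea, not the actual is_path_excluded implementation:

from fnmatch import fnmatch

def is_path_excluded(path, excluded_patterns):
    # Treat each entry as a glob-style pattern, e.g. "/internal/*"
    return any(fnmatch(path, pattern) for pattern in excluded_patterns)

excluded = ["/health", "/metrics", "/internal/*"]
print(is_path_excluded("/internal/debug", excluded))  # True
print(is_path_excluded("/api/v1/users", excluded))    # False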
If circuit breaker open:
# Check circuit breaker state
curl http://localhost:8000/metrics | grep circuit_breaker_state
# Events in logs during outage
docker logs api-container | grep "AuditEvent"
# Fix OpenSearch and wait for recovery
# See "Issue 1: Circuit Breaker Open"
If index pattern mismatch:
# List all audit indices
curl https://opensearch:9200/_cat/indices?v | grep audit
# Check expected index name
# Pattern: audit-{org_id}-{account_id}-{service}-{date}
# Example: audit-org123-acc456-api-2025-10-03
# Create index pattern in OpenSearch Dashboards
# Pattern: audit-{your_org_id}-*
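To sanity-check which index an event should land in, the naming pattern above can be reproduced with a small helper (illustrative only; the real module builds this name internally):

from datetime import date

def audit_index_name(org_id, account_id, service="api", day=None):
    # Pattern: audit-{org_id}-{account_id}-{service}-{date}
    day = day or date.today()
    return f"audit-{org_id}-{account_id}-{service}-{day.isoformat()}"

print(audit_index_name("org123", "acc456", day=date(2025, 10, 3)))
# audit-org123-acc456-api-2025-10-03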
If tenant context missing:
# Check middleware order
# TenantMiddleware MUST run before AuditMiddleware
# Verify in app.py
grep -A 5 "add_middleware" backend/api/app.py
# Correct order:
# 1. CorrelationMiddleware
# 2. TenantMiddleware
# 3. AuditMiddleware
Prevention:
- Monitor event ingestion rate
- Test audit in all environments
- Verify middleware order in CI/CD
- Document excluded paths
Troubleshooting Guide
Debug Logging
Enable detailed debug logging:
# In code
import logging
logging.getLogger("backend.audit").setLevel(logging.DEBUG)
# Or via environment
export LOG_LEVEL=DEBUG
Debug output includes:
- Event creation and sanitization
- Batch queue operations
- OpenSearch write attempts
- Circuit breaker state changes
- Error stack traces
Manual Event Testing
Test event emission manually:
import asyncio
from backend.audit import emit
from backend.utils.actor import ActorInfo

async def main():
    # Emit a test event
    await emit(
        action="test.manual",
        target="test:runbook",
        metadata={"test": True, "timestamp": "2025-10-03"},
        actor=ActorInfo(actor_type="system", actor_id="test"),
        organization_id="org_test",
        account_id="acc_test",
    )

asyncio.run(main())
# Check OpenSearch for event
# Index: audit-org_test-acc_test-api-2025-10-03
Verify in OpenSearch:
curl https://opensearch:9200/audit-org_test-acc_test-api-2025-10-03/_search?q=action:test.manual
OpenSearch Query Examples
Find events by actor:
curl -X POST https://opensearch:9200/audit-*/_search \
  -H 'Content-Type: application/json' -d '{
  "query": {
    "term": {"actor_id": "user_123"}
  },
  "sort": [{"occurred_at": "desc"}],
  "size": 100
}'
Find failed operations:
curl -X POST https://opensearch:9200/audit-*/_search \
  -H 'Content-Type: application/json' -d '{
  "query": {
    "bool": {
      "must": [
        {"term": {"operation_result": "failure"}},
        {"range": {"response_status_code": {"gte": 400}}}
      ]
    }
  }
}'
Find recent errors:
curl -X POST https://opensearch:9200/audit-*/_search \
  -H 'Content-Type: application/json' -d '{
  "query": {
    "bool": {
      "must": [
        {"range": {"response_status_code": {"gte": 500}}},
        {"range": {"occurred_at": {"gte": "now-1h"}}}
      ]
    }
  },
  "sort": [{"occurred_at": "desc"}]
}'
Trace a request:
curl -X POST https://opensearch:9200/audit-*/_search \
  -H 'Content-Type: application/json' -d '{
  "query": {
    "term": {"trace_id": "abc123def456"}
  }
}'
Performance Profiling
Profile audit system performance:
# Add timing instrumentation (inside an async handler where `request` is available)
import time

start = time.perf_counter()
await emit(action="test", target="perf", request=request)
duration = time.perf_counter() - start
print(f"Emit took {duration * 1000:.2f}ms")
Check metrics:
# Middleware overhead
curl -s http://localhost:8000/metrics | grep audit_middleware_latency_seconds
# Event processing
curl -s http://localhost:8000/metrics | grep audit_events_latency_seconds
# Batch flush
curl -s http://localhost:8000/metrics | grep audit_batch_flush_duration_seconds
Health Check Script
#!/bin/bash
# audit_health_check.sh
echo "=== Audit System Health Check ==="
# 1. Check if auditing is enabled
if [ "$AUDIT_ENABLED" = "true" ]; then
echo "✓ Auditing enabled"
else
echo "✗ Auditing disabled"
exit 1
fi
# 2. Check OpenSearch connectivity
if curl -s -u $AUDIT_OPENSEARCH_USERNAME:$AUDIT_OPENSEARCH_PASSWORD \
https://$AUDIT_OPENSEARCH_HOST:$AUDIT_OPENSEARCH_PORT/_cluster/health | grep -q green; then
echo "✓ OpenSearch cluster healthy"
else
echo "✗ OpenSearch cluster unhealthy"
fi
# 3. Check circuit breaker state
CB_STATE=$(curl -s http://localhost:8000/metrics | grep 'circuit_breaker_state{sink_type="opensearch"}' | awk '{print $2}')
if [ "$CB_STATE" = "0" ]; then
echo "✓ Circuit breaker closed"
else
echo "✗ Circuit breaker open/half-open"
fi
# 4. Check queue size
QUEUE_SIZE=$(curl -s http://localhost:8000/metrics | grep 'audit_queue_size{' | awk '{print $2}')
if [ "$QUEUE_SIZE" -lt 5000 ]; then
echo "✓ Queue size normal ($QUEUE_SIZE)"
else
echo "⚠ Queue size high ($QUEUE_SIZE)"
fi
# 5. Check event ingestion rate
EVENTS=$(curl -s http://localhost:8000/metrics | grep 'audit_events_total{' | head -1 | awk '{print $2}')
echo "✓ Events emitted: $EVENTS"
echo "=== Health Check Complete ==="
Recovery Procedures
Procedure 1: Restart Audit System
When to use: After configuration changes, or to recover from errors
# 1. Graceful shutdown (flush pending events)
curl -X POST http://localhost:8000/admin/audit/flush
# 2. Wait for flush to complete (check queue size)
curl -s http://localhost:8000/metrics | grep audit_queue_size
# 3. Restart application
docker restart api-container
# Or on Kubernetes
kubectl rollout restart deployment api
# 4. Verify startup
docker logs -f api-container | grep "audit"
# 5. Check health
curl http://localhost:8000/health/audit
Procedure 2: Flush Pending Events
When to use: Before shutdown, or to clear queue
# 1. Check current queue size
curl -s http://localhost:8000/metrics | grep audit_queue_size
# 2. Trigger manual flush
curl -X POST http://localhost:8000/admin/audit/flush
# 3. Wait for flush (monitor queue size)
watch 'curl -s http://localhost:8000/metrics | grep audit_queue_size'
# 4. Verify all events written
# Queue size should be 0
Procedure 3: Recover from OpenSearch Outage
When to use: After extended OpenSearch downtime
During outage:
- Events automatically go to logs (fallback)
- Circuit breaker opens to protect system
- Monitor for OpenSearch recovery
After OpenSearch recovers:
# 1. OpenSearch comes back online
curl https://opensearch:9200/_cluster/health
# 2. Circuit breaker auto-recovers (60s timeout)
# Monitor state transition
curl -s http://localhost:8000/metrics | grep circuit_breaker_state
# 3. New events resume writing to OpenSearch
# Check event count in indices
curl https://opensearch:9200/audit-*/_count
# 4. Review log events during outage
docker logs api-container --since 1h | grep "AuditEvent"
Lost events during outage:
- Events written to structured logs
- Can be extracted and replayed if needed
- Use log aggregation tool (e.g., CloudWatch, Datadog)
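A replay along those lines might look like this sketch, assuming the outage events were exported to a file with one JSON event per line; the file name, target index, and connection details are placeholders to adapt:

import json
from opensearchpy import OpenSearch, helpers

# Connection details and log format are assumptions; adjust to your environment.
client = OpenSearch(hosts=["https://opensearch:9200"], http_auth=("admin", "password"))

def replay(log_file, index):
    # Assumes each relevant line is a JSON object representing one audit event
    actions = []
    with open(log_file) as fh:
        for line in fh:
            line = line.strip()
            if not line.startswith("{"):
                continue
            event = json.loads(line)
            actions.append({"_index": index, "_source": event})
    helpers.bulk(client, actions)

replay("audit_outage.jsonl", "audit-org123-acc456-api-2025-10-03")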
Procedure 4: Recover from Data Loss
When to use: Index accidentally deleted, corruption
# 1. Check if snapshots available
curl https://opensearch:9200/_snapshot/_all
# 2. List snapshots for date range
curl https://opensearch:9200/_snapshot/audit_backup/_all
# 3. Restore specific indices
curl -X POST https://opensearch:9200/_snapshot/audit_backup/snapshot_2025-10-03/_restore \
  -H 'Content-Type: application/json' -d '{
  "indices": "audit-org123-acc456-api-2025-10-03",
  "ignore_unavailable": true,
  "include_global_state": false
}'
# 4. Monitor restore progress
curl https://opensearch:9200/_recovery?active_only=true
# 5. Verify data restored
curl https://opensearch:9200/audit-org123-acc456-api-2025-10-03/_count
Prevention:
- Daily automated snapshots
- Immutable index settings (prevent deletion)
- Backup retention: 30 days
Procedure 5: Emergency Disable
When to use: Critical performance issue, security concern
# 1. Disable auditing immediately
export AUDIT_ENABLED=false
# 2. Restart application
docker restart api-container
# 3. Verify auditing stopped
curl -s http://localhost:8000/metrics | grep audit_events_total
# Count should stop increasing
# 4. Investigate issue while system is stable
# 5. Re-enable when resolved
export AUDIT_ENABLED=true
docker restart api-container
⚠️ WARNING: Disabling audit breaks compliance. Only use in emergencies.
Capacity Planning
Event Volume Estimation
Calculate expected events per day:
Events/day = (API requests/day) × (1 - excluded_ratio)
Example:
- 10M API requests/day
- 20% excluded (/health, /metrics)
- Events/day = 10M × 0.8 = 8M events/day
Storage estimation:
Storage/day = Events/day × Avg_event_size
Example:
- 8M events/day
- 5KB average event size (after sanitization)
- Storage/day = 8M × 5KB = 40GB/day
- Storage/month = 40GB × 30 = 1.2TB/month
OpenSearch Sizing
Cluster sizing guidelines:
| Traffic Level | Events/sec | Storage/day | Recommended Cluster |
|---|---|---|---|
| Low | < 100 | < 5GB | 1 data node, 2 CPU, 4GB RAM |
| Medium | 100-1000 | 5-50GB | 3 data nodes, 4 CPU, 8GB RAM |
| High | 1000-5000 | 50-250GB | 5 data nodes, 8 CPU, 16GB RAM |
| Very High | > 5000 | > 250GB | 10+ data nodes, 16 CPU, 32GB RAM |
Disk sizing:
Total disk = (Daily storage × Retention days) × Replication factor × Overhead
Example:
- 40GB/day
- 90 days retention
- 1 replica (2x)
- 20% overhead
- Total = 40 × 90 × 2 × 1.2 = 8.6TB
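The same arithmetic as a small script, convenient for plugging in your own numbers:

def audit_capacity(requests_per_day, excluded_ratio, event_size_kb,
                   retention_days, replicas=1, overhead=0.2):
    # Events that will actually be audited per day
    events_per_day = requests_per_day * (1 - excluded_ratio)
    # 1 GB ~= 1,000,000 KB (decimal units, matching the examples above)
    storage_per_day_gb = events_per_day * event_size_kb / 1_000_000
    # Primary + replicas, plus filesystem/segment overhead
    total_disk_tb = storage_per_day_gb * retention_days * (1 + replicas) * (1 + overhead) / 1000
    return events_per_day, storage_per_day_gb, total_disk_tb

# Worked example from above: 10M requests/day, 20% excluded, 5KB events, 90 days, 1 replica
print(audit_capacity(10_000_000, 0.2, 5, 90))  # (8000000.0, 40.0, 8.64)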
Scaling Guidelines
Horizontal scaling (add app instances):
- Increases event throughput
- Distributes queue load
- Each instance has independent queue
Vertical scaling (increase app resources):
- Larger queue capacity (more memory)
- Faster event processing (more CPU)
OpenSearch scaling:
# Scale data nodes
kubectl scale statefulset opensearch --replicas=5
# Increase node resources
# Edit statefulset to increase CPU/memory limits
# Add dedicated master nodes (for large clusters)
# Prevents split-brain, improves stability
Auto-Scaling Configuration
Application auto-scaling (Kubernetes HPA):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: audit_queue_size
        target:
          type: AverageValue
          averageValue: "3000"  # Scale up if avg queue > 3000
OpenSearch auto-scaling:
- Use managed OpenSearch (AWS, Elastic Cloud)
- Configure auto-scaling based on CPU/memory
- Or monitor and scale manually
Cost Optimization
Reduce storage costs:
- Decrease retention period (90 → 60 days)
- Use compression (gzip indices)
- Reduce replica count (2 → 1)
- Implement ILM (hot → warm → cold → delete)
- Increase body truncation (10KB → 5KB)
Reduce compute costs:
- Increase batch size (more efficient writes)
- Exclude more paths (reduce volume)
- Use cheaper instance types for cold nodes
- Right-size cluster for actual load
Alerting Configuration
Critical Alerts (PagerDuty)
Alert: AuditSystemDown
- alert: AuditSystemDown
  expr: rate(audit_events_total[5m]) == 0
  for: 5m
  labels:
    severity: critical
    component: audit
  annotations:
    summary: "Audit system not writing events"
    description: "No audit events written in last 5 minutes"
    action: "Check if auditing is enabled, verify OpenSearch connectivity"
Alert: CircuitBreakerOpen
- alert: CircuitBreakerOpen
  expr: circuit_breaker_state{sink_type="opensearch"} == 1
  for: 10m
  labels:
    severity: critical
    component: audit
  annotations:
    summary: "OpenSearch circuit breaker open"
    description: "Audit events going to logs only for 10+ minutes"
    action: "Fix OpenSearch connectivity, check cluster health"
Alert: HighAuditErrorRate
- alert: HighAuditErrorRate
  expr: rate(audit_errors_total[5m]) / rate(audit_events_total[5m]) > 0.05
  for: 5m
  labels:
    severity: critical
    component: audit
  annotations:
    summary: "High audit error rate (>5%)"
    description: "More than 5% of audit writes failing"
    action: "Check OpenSearch cluster, review error logs"
Alert: AuditQueueOverflow
- alert: AuditQueueOverflow
  expr: audit_queue_size > 10000
  for: 5m
  labels:
    severity: critical
    component: audit
  annotations:
    summary: "Audit queue overflow (>10,000 events)"
    description: "Queue size exceeded threshold, risk of event loss"
    action: "Scale OpenSearch, increase flush frequency, scale app"
Warning Alerts (Slack)
Alert: HighAuditLatency
- alert: HighAuditLatency
  expr: histogram_quantile(0.95, rate(audit_events_latency_seconds_bucket[5m])) > 0.1
  for: 15m
  labels:
    severity: warning
    component: audit
  annotations:
    summary: "High audit latency (p95 > 100ms)"
    description: "Audit event processing taking too long"
    action: "Check OpenSearch performance, review batch settings"
Alert: LargeAuditQueue
- alert: LargeAuditQueue
  expr: audit_queue_size > 5000
  for: 10m
  labels:
    severity: warning
    component: audit
  annotations:
    summary: "Large audit queue (>5,000 events)"
    description: "Queue size elevated, monitor for growth"
    action: "Monitor OpenSearch write throughput, prepare to scale"
Dashboard Links
Include dashboard links in alerts:
annotations:
  dashboard: "https://grafana.internal/d/audit-system-health"
  runbook: "https://docs.internal/audit-operations-runbook"
Monitoring Best Practices
Key Dashboards
- Audit System Health (operations)
  - Event ingestion rate
  - Queue size and latency
  - Circuit breaker state
  - Error rate
- API Activity (business)
  - Request volume by endpoint
  - Error rate by endpoint
  - Response time trends
- Security & Compliance (security team)
  - Authentication events
  - Authorization decisions
  - Failed access attempts
  - Sensitive operations
- Multi-Tenant Activity (product)
  - Events by organization
  - Storage usage
  - Top active users
Metric Retention
- High resolution (1m): 7 days
- Medium resolution (5m): 30 days
- Low resolution (1h): 365 days
Log Retention
- Application logs: 30 days
- Audit events (OpenSearch): 90 days (configurable)
- Metrics: 365 days
Security Considerations
Access Control
OpenSearch access:
- Use dedicated service account
- Least privilege permissions (write only to audit-*)
- Rotate credentials regularly (90 days)
- Use SSL/TLS for all connections
Audit log access:
- Restrict to security team and compliance
- Use role-based access control (RBAC)
- Audit access to audit logs (meta-audit)
- Require MFA for audit log access
Data Protection
Encryption:
- In transit: TLS 1.3 for all connections
- At rest: OpenSearch encryption enabled
- Credentials: Use secrets manager (AWS Secrets, Vault)
Immutability:
- Indices configured as write-once
- No delete/update permissions
- Daily backups to immutable storage (S3)
Compliance
GDPR:
- Data minimization (automatic sanitization)
- Right to access (search by user)
- Right to erasure (anonymization process)
- Data retention (configurable per-org)
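As an illustration of the erasure/anonymization step, an update_by_query that overwrites a user's actor fields might look like the sketch below; the field names and painless script are assumptions, and because audit indices are normally write-once this must be coordinated with the security team before use:

from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://opensearch:9200"], http_auth=("admin", "password"))

client.update_by_query(
    index="audit-org123-*",
    body={
        "query": {"term": {"actor_id": "user_123"}},
        "script": {
            "lang": "painless",
            # Overwrite identifying fields while keeping the event itself
            "source": "ctx._source.actor_id = 'anonymized'; ctx._source.actor_email = 'anonymized'",
        },
    },
)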
SOC2:
- Complete audit trail of all operations
- Immutable audit logs
- Encryption at rest and in transit
- Access control and monitoring
HIPAA:
- PHI access logging (all events captured)
- 6-year retention capability
- Encryption (FIPS 140-2)
- Audit log integrity checks
Useful Commands
Quick Reference
# Check audit system health
curl http://localhost:8000/health/audit
# View metrics
curl http://localhost:8000/metrics | grep audit
# Flush pending events
curl -X POST http://localhost:8000/admin/audit/flush
# Check OpenSearch health
curl https://opensearch:9200/_cluster/health
# List audit indices
curl https://opensearch:9200/_cat/indices?v | grep audit
# Count events in index
curl https://opensearch:9200/audit-org123-*/_count
# Delete old indices (cleanup)
curl -X DELETE https://opensearch:9200/audit-*-2025-01-*
# Search recent events
curl -X POST https://opensearch:9200/audit-*/_search \
  -H 'Content-Type: application/json' -d '{
  "query": {"match_all": {}},
  "sort": [{"occurred_at": "desc"}],
  "size": 10
}'
Support Contacts
Escalation Path
- L1 Support: DevOps on-call
  - Basic health checks
  - Restart services
  - Check metrics/logs
- L2 Support: SRE team
  - Troubleshoot issues
  - Scale infrastructure
  - Config changes
- L3 Support: Engineering team
  - Code changes
  - Architecture decisions
  - Feature requests
Documentation
- User Guide: /docs/05-security-and-audit/01-audit-trails.md
- Configuration: /docs/05-security-and-audit/02-audit-configuration.md
- This Runbook: /docs/05-security-and-audit/03-audit-operations-runbook.md
- Architecture: /features/audit-module/PRD.md
Communication Channels
- Incidents: PagerDuty #audit-system-critical
- Warnings: Slack #audit-system-alerts
- Questions: Slack #platform-support
- Changes: Slack #platform-changes
Appendix
Configuration Reference
See full configuration reference in Audit Configuration Guide
Architecture Diagrams
See detailed architecture in Audit Module PRD
Changelog
| Version | Date | Changes |
|---|---|---|
| 1.0 | 2025-10-03 | Initial runbook creation |
Feedback
This runbook is maintained by the Platform Engineering team. For corrections or improvements:
- Open an issue in the internal docs repository
- Contact the Platform Engineering team
- Contribute via pull request