Audit Module Production Deployment Guide
This guide covers the complete production deployment process for the PDaaS Audit Module, including canary deployment, gradual rollout, and post-deployment monitoring.
Table of Contents
- Pre-Deployment Checklist
- Infrastructure Prerequisites
- Deployment Phases
- Canary Deployment
- Full Rollout
- Rollback Procedures
- Post-Deployment Monitoring
- Troubleshooting
Pre-Deployment Checklist
Testing and Quality Assurance
- All unit tests passing (488+ tests)
- All integration tests passing
- All E2E tests passing (if applicable)
- Load tests completed successfully
- Sustained load: 5,000 req/s for 1 hour
- Spike load: 20,000 req/s for 5 minutes
- Multi-tenant load: 100 orgs, 10,000 req/s
- Performance targets met
- Request overhead < 5ms (p95)
- OpenSearch write latency < 100ms (p95)
- Queue size < 1,000 events under normal load
- Error rate < 0.1%
Security Review
- Security review completed and approved
- No critical or high-severity vulnerabilities
- Sensitive data sanitization verified
- Multi-tenant isolation tested
- TLS/SSL configuration validated
- Authentication credentials secured
Infrastructure Readiness
- OpenSearch cluster provisioned and configured
- Cluster health: GREEN
- Sufficient storage capacity (3+ months)
- Replication configured (min 1 replica)
- Backup/snapshot configured
- Network connectivity validated
- Application can reach OpenSearch (port 9200)
- Firewall rules configured
- VPC/security groups configured
- Monitoring infrastructure ready
- Prometheus configured to scrape metrics
- AlertManager configured with alert rules
- OpenSearch Dashboards configured
- PagerDuty/Slack integration configured
Configuration and Documentation
- Production environment variables configured
- Excluded paths configuration reviewed
- Index lifecycle policies configured
- Operational runbook reviewed
- Team trained on runbook procedures
- Rollback plan documented and reviewed
Approvals
- Code review approved
- Security team approval
- Operations team approval
- Product owner approval
- Stakeholder notification sent
Infrastructure Prerequisites
OpenSearch Cluster Requirements
Cluster Sizing
Based on load testing results and expected traffic:
Production Cluster (Recommended):
- Nodes: 3 data nodes (minimum)
- Instance Type: r6g.2xlarge (8 vCPU, 64 GB RAM)
- Storage: 1 TB SSD per node (3 TB total)
- Replication: 1 replica (2 copies of data)
- Shards: 5 primary shards per index
Traffic Capacity:
- Events/second: 10,000-20,000 sustained
- Daily events: 500M-1B
- Daily storage: 50-100 GB
- Retention: 90 days
Cluster Configuration
# opensearch.yml
cluster.name: pdaas-audit-prod
node.name: ${HOSTNAME}
network.host: 0.0.0.0
# Security
plugins.security.ssl.http.enabled: true
plugins.security.ssl.transport.enabled: true
# Performance
indices.memory.index_buffer_size: 30%
thread_pool.write.queue_size: 1000
# Snapshots
path.repo: ["/mnt/snapshots"]
Index Templates
Create index templates before deployment:
# Create audit index template
curl -X PUT "https://opensearch.prod.internal:9200/_index_template/audit-template" \
-H 'Content-Type: application/json' \
-u "admin:password" \
-d '{
"index_patterns": ["audit-*"],
"template": {
"settings": {
"number_of_shards": 5,
"number_of_replicas": 1,
"refresh_interval": "30s",
"index.codec": "best_compression"
},
"mappings": {
"properties": {
"occurred_at": {"type": "date"},
"actor_id": {"type": "keyword"},
"actor_type": {"type": "keyword"},
"organization_id": {"type": "keyword"},
"account_id": {"type": "keyword"},
"action": {"type": "keyword"},
"target": {"type": "keyword"},
"service": {"type": "keyword"},
"request_method": {"type": "keyword"},
"request_path": {"type": "keyword"},
"response_status_code": {"type": "short"},
"response_duration_ms": {"type": "float"},
"trace_id": {"type": "keyword"}
}
}
}
}'
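Before moving on, it is worth confirming the template actually registered and carries the expected pattern. A minimal verification sketch in Python using requests (the host, admin credentials, and CA path are placeholders matching the examples above):
# verify_index_template.py - minimal sketch; host, credentials, and CA path are placeholders
import requests

OPENSEARCH = "https://opensearch.prod.internal:9200"
AUTH = ("admin", "password")  # pull from the secrets manager in practice
CA_CERT = "/etc/ssl/certs/opensearch-ca.crt"

resp = requests.get(
    f"{OPENSEARCH}/_index_template/audit-template",
    auth=AUTH, verify=CA_CERT, timeout=10,
)
resp.raise_for_status()

template = resp.json()["index_templates"][0]["index_template"]
assert template["index_patterns"] == ["audit-*"], "unexpected index pattern"
print("audit-template registered with settings:", template["template"]["settings"])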
Index Lifecycle Policy
Configure automated retention:
# Create ILM policy for 90-day retention
curl -X PUT "https://opensearch.prod.internal:9200/_plugins/_ism/policies/audit-lifecycle" \
-H 'Content-Type: application/json' \
-u "admin:password" \
-d '{
"policy": {
"description": "Audit log lifecycle: hot(30d) -> warm(30d) -> cold(30d) -> delete",
"default_state": "hot",
"states": [
{
"name": "hot",
"actions": [],
"transitions": [{
"state_name": "warm",
"conditions": {"min_index_age": "30d"}
}]
},
{
"name": "warm",
"actions": [
{"replica_count": {"number_of_replicas": 1}}
],
"transitions": [{
"state_name": "cold",
"conditions": {"min_index_age": "60d"}
}]
},
{
"name": "cold",
"actions": [
{"replica_count": {"number_of_replicas": 0}},
{"read_only": {}}
],
"transitions": [{
"state_name": "delete",
"conditions": {"min_index_age": "90d"}
}]
},
{
"name": "delete",
"actions": [{"delete": {}}],
"transitions": []
}
]
}
}'
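An ISM policy only acts on indices it is attached to (for example via an ism_template block in the policy or an explicit _plugins/_ism/add call), so verify attachment after the first daily index is created. A minimal sketch using the ISM explain API; the exact response fields vary by OpenSearch version, so treat this as illustrative:
# check_ism_attachment.py - minimal sketch; host and credentials are placeholders
import requests

OPENSEARCH = "https://opensearch.prod.internal:9200"
AUTH = ("admin", "password")
CA_CERT = "/etc/ssl/certs/opensearch-ca.crt"

# The ISM explain API reports which policy (if any) manages each matching index.
resp = requests.get(
    f"{OPENSEARCH}/_plugins/_ism/explain/audit-*",
    auth=AUTH, verify=CA_CERT, timeout=10,
)
resp.raise_for_status()

for index, info in resp.json().items():
    if not isinstance(info, dict):
        continue  # skip summary fields such as total_managed_indices
    policy = info.get("index.plugins.index_state_management.policy_id")
    state = (info.get("state") or {}).get("name")
    print(f"{index}: policy={policy}, state={state}")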
Application Configuration
Production Environment Variables
Create a production .env file:
# Environment
ENV=production
# Audit Module - Production Configuration
AUDIT_ENABLED=true
AUDIT_SERVICE_NAME=api
AUDIT_ENVIRONMENT=production
AUDIT_SERVICE_VERSION=1.0.0
# OpenSearch Connection
AUDIT_OPENSEARCH_HOST=opensearch.prod.internal
AUDIT_OPENSEARCH_PORT=9200
AUDIT_OPENSEARCH_USERNAME=audit_writer
AUDIT_OPENSEARCH_PASSWORD=${OPENSEARCH_AUDIT_PASSWORD} # From secrets manager
AUDIT_OPENSEARCH_USE_SSL=true
AUDIT_OPENSEARCH_VERIFY_CERTS=true
AUDIT_OPENSEARCH_CA_CERTS=/etc/ssl/certs/opensearch-ca.crt
AUDIT_OPENSEARCH_TIMEOUT=30
# Performance Tuning (Production)
AUDIT_BATCH_SIZE=500
AUDIT_FLUSH_INTERVAL_SECONDS=10.0
AUDIT_MAX_BODY_SIZE=20480
AUDIT_MAX_QUEUE_SIZE=50000
# Path Exclusion
AUDIT_EXCLUDED_PATHS=/health,/healthz,/metrics,/docs,/redoc,/openapi.json
# Index Configuration
AUDIT_INDEX_PREFIX=audit
AUDIT_INDEX_ROTATION=daily
# Retry Configuration
AUDIT_RETRY_ATTEMPTS=3
AUDIT_RETRY_BACKOFF_SECONDS=2.0
AUDIT_RETRY_BACKOFF_MULTIPLIER=2.0
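The retry settings above are commonly interpreted as exponential backoff. As a quick sanity check on what they imply, the sketch below assumes the delay before attempt n is backoff × multiplier^(n-1); confirm the exact formula against the module's retry implementation:
# Sketch of the retry delays implied by the settings above (the formula is an assumption).
attempts = 3        # AUDIT_RETRY_ATTEMPTS
backoff = 2.0       # AUDIT_RETRY_BACKOFF_SECONDS
multiplier = 2.0    # AUDIT_RETRY_BACKOFF_MULTIPLIER

delays = [backoff * multiplier ** (n - 1) for n in range(1, attempts + 1)]
print(delays)       # [2.0, 4.0, 8.0]
print(sum(delays))  # 14.0 seconds of worst-case delay before giving up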
OpenSearch User Setup
Create a dedicated audit writer user with minimal permissions:
# Create audit_writer role
curl -X PUT "https://opensearch.prod.internal:9200/_plugins/_security/api/roles/audit_writer" \
-H 'Content-Type: application/json' \
-u "admin:password" \
-d '{
"cluster_permissions": ["cluster:monitor/health"],
"index_permissions": [{
"index_patterns": ["audit-*"],
"allowed_actions": [
"indices:data/write/index",
"indices:data/write/bulk",
"indices:admin/create"
]
}]
}'
# Create audit_writer user
curl -X PUT "https://opensearch.prod.internal:9200/_plugins/_security/api/internalusers/audit_writer" \
-H 'Content-Type: application/json' \
-u "admin:password" \
-d '{
"password": "STRONG_PASSWORD_HERE",
"opendistro_security_roles": ["audit_writer"]
}'
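Before deployment, confirm the new user can authenticate and exercise the cluster-health permission granted above. A minimal sketch (the password and CA path are placeholders):
# verify_audit_writer.py - minimal sketch; password and CA path are placeholders
import requests

OPENSEARCH = "https://opensearch.prod.internal:9200"
CA_CERT = "/etc/ssl/certs/opensearch-ca.crt"

resp = requests.get(
    f"{OPENSEARCH}/_cluster/health",
    auth=("audit_writer", "STRONG_PASSWORD_HERE"),
    verify=CA_CERT, timeout=10,
)
resp.raise_for_status()  # a 401/403 here means the user or role is misconfigured
print("cluster status:", resp.json()["status"])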
Deployment Phases
Phase 1: Infrastructure Setup (Day 0)
Duration: 2-4 hours
1. Provision OpenSearch cluster
# Using AWS CloudFormation, Terraform, or manual setup
terraform apply -var-file=production.tfvars
2. Verify cluster health
curl -u "admin:password" "https://opensearch.prod.internal:9200/_cluster/health?pretty"
3. Create index templates and policies
./scripts/setup-opensearch-production.sh
4. Configure monitoring
# Apply Prometheus scrape config
kubectl apply -f k8s/prometheus-config.yaml
# Import OpenSearch Dashboards
python backend/audit/dashboards/import_dashboard.py \
  --url https://opensearch.prod.internal:9200 \
  --username admin \
  --all
5. Configure alerting
# Apply AlertManager rules
kubectl apply -f k8s/alertmanager-rules.yaml
Phase 2: Canary Deployment (Day 1-2)
Duration: 24-48 hours
Traffic: 10% of production
See detailed Canary Deployment section below.
Phase 3: Gradual Rollout (Day 3-5)
Duration: 48-72 hours
Traffic: 10% → 25% → 50% → 100%
See detailed Full Rollout section below.
Phase 4: Post-Deployment Monitoring (Day 6-36)
Duration: 30 days
Traffic: 100%
See detailed Post-Deployment Monitoring section below.
Canary Deployment
Overview
Canary deployment reduces production risk by:
- Deploying to a small subset of traffic (10%)
- Monitoring for issues without impacting all users
- Allowing quick rollback if problems are detected
- Building confidence before the full rollout
Canary Strategy
Approach: Route-based canary using excluded paths
Initial State (Before Canary):
# backend/api/app.py - init_audit()
config = get_audit_config()
# Audit disabled or fully excluded
config.enabled = False # OR
config.excluded_paths = ["/*"] # Exclude all paths
Canary State (10% traffic):
# backend/api/app.py - init_audit()
config = get_audit_config()
config.enabled = True
# Only audit 10% of endpoints (exclude 90%)
config.excluded_paths = [
"/health",
"/metrics",
"/docs",
"/redoc",
"/openapi.json",
# Exclude most endpoints (keep only critical for canary)
"/accounts/*",
"/memberships/*",
"/groups/*",
"/service-accounts/*",
# Keep: /authentication/*, /policies/*, /organizations/*
]
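As an illustration of how these wildcard exclusions are intended to behave, the sketch below uses fnmatch-style matching; the middleware's actual matcher may differ in details such as trailing-slash or query-string handling, so treat this as an approximation:
# Illustration only: fnmatch-style matching is an assumption about the middleware.
from fnmatch import fnmatch

excluded_paths = ["/health", "/metrics", "/accounts/*", "/memberships/*"]

def is_excluded(path: str) -> bool:
    return any(fnmatch(path, pattern) for pattern in excluded_paths)

assert is_excluded("/accounts/123")              # excluded during canary
assert not is_excluded("/authentication/login")  # still audited during canary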
Canary Deployment Steps
Step 1: Pre-Deployment Validation (30 min)
# 1. Verify all tests pass
python -m pytest backend/audit/tests/ -v
# Expected: 488+ tests passing
# 2. Verify OpenSearch cluster health
curl -u "audit_writer:password" \
"https://opensearch.prod.internal:9200/_cluster/health?pretty"
# Expected: {"status": "green"}
# 3. Verify application can connect
python -c "
from backend.audit.config import get_audit_config
from backend.audit.client import OpenSearchClientFactory
import asyncio
async def test():
    config = get_audit_config()
    client = await OpenSearchClientFactory.get_client(config)
    healthy = await client.health_check()
    print(f'OpenSearch healthy: {healthy}')
asyncio.run(test())
"
# Expected: OpenSearch healthy: True
Step 2: Deploy Canary (15 min)
# 1. Update environment variables
export AUDIT_ENABLED=true
export AUDIT_ENVIRONMENT=production
# ... other production vars
# 2. Deploy application with canary configuration
# Using Kubernetes rolling update
kubectl set env deployment/pdaas-api AUDIT_ENABLED=true
kubectl set env deployment/pdaas-api AUDIT_EXCLUDED_PATHS="/health,/metrics,/accounts/*,/memberships/*,/groups/*,/service-accounts/*"
# Or using Docker
docker-compose -f docker-compose.prod.yml up -d --no-deps pdaas-api
# 3. Verify deployment
kubectl rollout status deployment/pdaas-api
# 4. Verify audit middleware is active
curl https://api.prod.internal/health
# Check logs for audit initialization
kubectl logs -l app=pdaas-api | grep "Audit system initialized"
Step 3: Monitor Canary (24 hours)
Monitoring Checklist:
1. Check Prometheus metrics (every 15 minutes for first hour, then hourly)
# Event ingestion rate
rate(audit_events_total[5m])
# Event processing latency (should be < 5ms p95)
histogram_quantile(0.95, rate(audit_events_latency_seconds_bucket[5m]))
# Error rate (should be < 0.1%)
rate(audit_errors_total[5m]) / rate(audit_events_total[5m])
# Queue size (should be < 1000)
audit_queue_size
# Circuit breaker state (should be 0 = CLOSED)
circuit_breaker_state
2. Check OpenSearch Dashboards (every hour)
- Navigate to "Audit System Health" dashboard
- Verify events are being written
- Check for errors or anomalies
- Review latency distribution
3. Check application logs (every 30 minutes)
kubectl logs -l app=pdaas-api --tail=100 | grep -i "audit\|error"
4. Check alerts (continuous)
- No critical alerts should fire
- Warning alerts acceptable if transient
5. Verify data in OpenSearch (every 2 hours)
# Count events written
curl -u "audit_writer:password" \
  "https://opensearch.prod.internal:9200/audit-*/_count?pretty"
# Sample recent events
curl -u "audit_writer:password" \
  "https://opensearch.prod.internal:9200/audit-*/_search?pretty&size=10&sort=occurred_at:desc"
Success Criteria (24 hours):
- No critical alerts triggered
- Error rate < 0.1%
- Request latency impact < 5ms (p95)
- OpenSearch write latency < 100ms (p95)
- Queue size < 1,000 events
- Circuit breaker remained CLOSED
- No application errors related to audit
- Events successfully written to OpenSearch
- Multi-tenant isolation verified (spot checks)
Step 4: Canary Decision (15 min)
If success criteria met:
- ✅ Proceed to gradual rollout (Phase 3)
- Document any minor issues for follow-up
- Communicate success to stakeholders
If success criteria NOT met:
- ❌ Initiate rollback (see Rollback Procedures)
- Conduct incident review
- Fix issues before retry
Canary Rollback
If issues detected during canary:
# Quick rollback: Disable audit via environment variable
kubectl set env deployment/pdaas-api AUDIT_ENABLED=false
# Or: Exclude all paths (safer, keeps code path active)
kubectl set env deployment/pdaas-api AUDIT_EXCLUDED_PATHS="/*"
# Verify rollback
kubectl rollout status deployment/pdaas-api
# Verify audit disabled
kubectl logs -l app=pdaas-api | grep "Audit system disabled"
Full Rollout
Overview
After a successful canary (24-48 hours), gradually increase the share of audited traffic:
- Day 3: 25% (remove more exclusions)
- Day 4: 50% (remove more exclusions)
- Day 5: 100% (audit all endpoints except health checks)
Rollout Steps
Day 3: 25% Traffic
# Update excluded paths to audit 25% of endpoints
kubectl set env deployment/pdaas-api AUDIT_EXCLUDED_PATHS="/health,/metrics,/accounts/*,/memberships/*"
# Monitor for 24 hours using same checklist as canary
Success Criteria:
- Same as canary, but with 2.5x traffic
Day 4: 50% Traffic
# Update excluded paths to audit 50% of endpoints
kubectl set env deployment/pdaas-api AUDIT_EXCLUDED_PATHS="/health,/metrics,/accounts/*"
# Monitor for 24 hours
Success Criteria:
- Same as canary, but with 5x traffic
- OpenSearch cluster still healthy
- Storage usage within projections
Day 5: 100% Traffic (Full Production)
# Update excluded paths to audit all except health/metrics
kubectl set env deployment/pdaas-api AUDIT_EXCLUDED_PATHS="/health,/healthz,/metrics"
# Monitor for 24 hours, then 7 days, then 30 days
Success Criteria:
- Same as canary, but with 10x traffic
- Full compliance coverage achieved
- No performance degradation
Rollout Decision Gates
Before proceeding to next phase, verify:
- No critical alerts in previous phase
- Error rate < 0.1% consistently
- Performance targets met (< 5ms p95 latency)
- OpenSearch cluster healthy (GREEN status)
- Storage usage trending as expected
- Team confidence high (no major issues)
If any gate fails:
- Pause rollout
- Investigate and fix issues
- Re-run previous phase
- Only proceed when gates pass
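The OpenSearch-side gates (cluster status and storage headroom) can be checked in a single pass. A minimal sketch, reusing the placeholder admin credentials from earlier:
# gate_check.py - minimal sketch; host and credentials are placeholders
import requests

OPENSEARCH = "https://opensearch.prod.internal:9200"
AUTH = ("admin", "password")
CA_CERT = "/etc/ssl/certs/opensearch-ca.crt"

health = requests.get(f"{OPENSEARCH}/_cluster/health", auth=AUTH, verify=CA_CERT, timeout=10).json()
print("cluster status:", health["status"])  # gate: must be "green"

# Per-node disk usage; gate: stay well below the 80% rollback trigger.
allocation = requests.get(
    f"{OPENSEARCH}/_cat/allocation?format=json&bytes=gb",
    auth=AUTH, verify=CA_CERT, timeout=10,
).json()
for node in allocation:
    if node.get("node") == "UNASSIGNED":
        continue  # the unassigned-shards row has no disk figures
    print(f"{node['node']}: {node['disk.percent']}% disk used "
          f"({node['disk.used']} / {node['disk.total']} GB)")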
Rollback Procedures
Scenario 1: High Error Rate
Trigger: Error rate > 5% for 5+ minutes
Action:
# Immediate: Disable audit
kubectl set env deployment/pdaas-api AUDIT_ENABLED=false
# Or: Exclude all paths
kubectl set env deployment/pdaas-api AUDIT_EXCLUDED_PATHS="/*"
# Investigate errors
kubectl logs -l app=pdaas-api | grep -i "error" > error-log.txt
# Fix issues
# Re-deploy with fixes
# Retry from canary phase
Scenario 2: OpenSearch Unavailable
Trigger: Circuit breaker OPEN for 10+ minutes
Action:
# Verify circuit breaker is working (events going to logs)
kubectl logs -l app=pdaas-api | grep "Circuit breaker open"
# This is expected behavior - no rollback needed
# Events are being logged instead of lost
# Fix OpenSearch connectivity issue
# Circuit breaker will auto-close when OpenSearch recovers
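For reference, the pattern behind the circuit_breaker_state metric: after a run of consecutive write failures the breaker opens and events are diverted to logs, and after a recovery timeout it lets a trial write through before closing again. The sketch below illustrates the pattern only; the audit module's thresholds and implementation may differ.
# Illustrative circuit breaker - not the audit module's actual implementation.
import time
from enum import Enum

class State(Enum):
    CLOSED = 0      # normal operation, writes go to OpenSearch
    OPEN = 1        # OpenSearch unhealthy, events diverted to logs
    HALF_OPEN = 2   # probing whether OpenSearch has recovered

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = State.CLOSED
        self.opened_at = 0.0

    def record_success(self) -> None:
        self.failures = 0
        self.state = State.CLOSED

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = time.monotonic()

    def allow_request(self) -> bool:
        if self.state == State.OPEN and time.monotonic() - self.opened_at >= self.recovery_timeout:
            self.state = State.HALF_OPEN  # allow one trial write through
        return self.state != State.OPEN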
Scenario 3: Performance Degradation
Trigger: Request latency increase > 10ms (p95)
Action:
# Investigate cause
kubectl top pods -l app=pdaas-api
kubectl logs -l app=pdaas-api | grep "latency\|slow"
# If caused by audit:
# Option 1: Tune batching parameters
kubectl set env deployment/pdaas-api AUDIT_BATCH_SIZE=1000
kubectl set env deployment/pdaas-api AUDIT_FLUSH_INTERVAL_SECONDS=15.0
# Option 2: Exclude more paths
kubectl set env deployment/pdaas-api AUDIT_EXCLUDED_PATHS="/health,/metrics,/high-traffic-endpoint/*"
# Option 3: Full rollback (last resort)
kubectl set env deployment/pdaas-api AUDIT_ENABLED=false
Scenario 4: Storage Overflow
Trigger: OpenSearch disk usage > 80%
Action:
# Check storage usage
curl -u "admin:password" \
"https://opensearch.prod.internal:9200/_cat/indices/audit-*?v&h=index,store.size&s=store.size:desc"
# Option 1: Reduce body size to save space
kubectl set env deployment/pdaas-api AUDIT_MAX_BODY_SIZE=5120 # 5KB instead of 20KB
# Option 2: Delete old indices manually
curl -X DELETE -u "admin:password" \
"https://opensearch.prod.internal:9200/audit-*-2025-09-*"
# Option 3: Shorten the retention period
# (Update ILM policy to delete after 30 days instead of 90)
# Option 4: Add storage capacity
# (Resize OpenSearch cluster)
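If Option 2 is needed, deleting by an explicit date cutoff is safer than a broad wildcard. A minimal sketch, assuming daily indices carry a YYYY-MM-DD suffix as implied by AUDIT_INDEX_ROTATION=daily; verify your actual index naming before running anything destructive:
# prune_audit_indices.py - DESTRUCTIVE sketch; index naming and credentials are assumptions
import re
from datetime import datetime, timedelta, timezone

import requests

OPENSEARCH = "https://opensearch.prod.internal:9200"
AUTH = ("admin", "password")
CA_CERT = "/etc/ssl/certs/opensearch-ca.crt"
KEEP_DAYS = 60  # temporary emergency retention, tighter than the 90-day policy

cutoff = datetime.now(timezone.utc) - timedelta(days=KEEP_DAYS)

indices = requests.get(
    f"{OPENSEARCH}/_cat/indices/audit-*?format=json&h=index",
    auth=AUTH, verify=CA_CERT, timeout=10,
).json()

for entry in indices:
    name = entry["index"]
    match = re.search(r"(\d{4}-\d{2}-\d{2})$", name)  # trailing date suffix
    if not match:
        continue
    index_date = datetime.strptime(match.group(1), "%Y-%m-%d").replace(tzinfo=timezone.utc)
    if index_date < cutoff:
        print("deleting", name)
        requests.delete(f"{OPENSEARCH}/{name}", auth=AUTH, verify=CA_CERT, timeout=30).raise_for_status()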
Emergency Full Rollback
When: Critical production incident caused by audit module
Steps:
# 1. Disable audit immediately
kubectl set env deployment/pdaas-api AUDIT_ENABLED=false
# 2. Verify application recovery
curl https://api.prod.internal/health
# 3. Verify no audit activity
kubectl logs -l app=pdaas-api | grep "Audit system disabled"
# 4. Create incident report
# 5. Conduct post-mortem
# 6. Fix root cause
# 7. Re-test in staging
# 8. Retry production deployment from Phase 1
Post-Deployment Monitoring
First 24 Hours (Critical Period)
Monitoring Frequency: Every 15 minutes for first hour, then hourly
Key Metrics:
- Event ingestion rate
- Event processing latency
- Error rate
- Queue size
- Circuit breaker state
- OpenSearch cluster health
- Storage growth rate
Actions:
- Have on-call engineer monitoring
- Keep rollback procedure ready
- Log all anomalies for review
First 7 Days (Stabilization Period)
Monitoring Frequency: 3x daily (morning, afternoon, evening)
Key Activities:
- Review dashboards for trends
- Analyze error logs
- Check storage usage growth
- Verify multi-tenant isolation
- Collect feedback from team
- Tune configuration if needed
First 30 Days (Validation Period)
Monitoring Frequency: Daily
Success Metrics:
- Availability: 99.9% uptime
- Performance: < 5ms latency impact (p95)
- Reliability: < 0.1% error rate
- Compliance: 100% coverage of security operations
- Storage: Usage within 10% of projections
- Incidents: Zero critical incidents
End of 30 Days:
- Conduct deployment retrospective
- Document lessons learned
- Update runbook with production insights
- Mark feature as "Generally Available" (GA)
Troubleshooting
Issue: Events Not Appearing in OpenSearch
Symptoms:
- Prometheus shows events emitted (audit_events_total increasing)
- OpenSearch shows no events (audit-* indices empty or missing)
Diagnosis:
# 1. Check circuit breaker state
curl http://api.prod.internal/metrics | grep circuit_breaker_state
# 2. Check OpenSearch connectivity
kubectl exec -it pdaas-api-pod -- python -c "
from backend.audit.client import OpenSearchClientFactory
from backend.audit.config import get_audit_config
import asyncio
async def test():
    config = get_audit_config()
    client = await OpenSearchClientFactory.get_client(config)
    healthy = await client.health_check()
    print(f'Healthy: {healthy}')
asyncio.run(test())
"
# 3. Check application logs
kubectl logs -l app=pdaas-api | grep -i "opensearch\|circuit"
Resolution:
- If circuit breaker OPEN: Fix OpenSearch connectivity
- If authentication failed: Verify credentials
- If network issue: Check firewall/security groups
Issue: High Memory Usage
Symptoms:
- Application pod memory usage increasing
- OOMKilled errors in Kubernetes
Diagnosis:
# Check queue size
curl http://api.prod.internal/metrics | grep audit_queue_size
Resolution:
# Reduce queue size
kubectl set env deployment/pdaas-api AUDIT_MAX_QUEUE_SIZE=5000
# Increase flush frequency
kubectl set env deployment/pdaas-api AUDIT_FLUSH_INTERVAL_SECONDS=2.0
# Increase batch size (fewer flushes)
kubectl set env deployment/pdaas-api AUDIT_BATCH_SIZE=1000
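A rough back-of-envelope check helps decide which knob to turn: worst-case queue memory is roughly max queue size × average serialized event size. The average event size below is an assumption; measure yours from the _stats output in the next section:
# Rough worst-case queue memory estimate; the average event size is an assumption.
max_queue_size = 50_000   # AUDIT_MAX_QUEUE_SIZE (production default above)
avg_event_bytes = 4_096   # assumed ~4 KB per serialized event; measure via _stats

worst_case_mb = max_queue_size * avg_event_bytes / (1024 * 1024)
print(f"worst-case queue memory: ~{worst_case_mb:.0f} MB")  # ~195 MB

# Dropping AUDIT_MAX_QUEUE_SIZE to 5,000 caps this at roughly 20 MB.
print(f"with 5,000: ~{5_000 * avg_event_bytes / (1024 * 1024):.0f} MB")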
Issue: OpenSearch Indices Growing Too Fast
Symptoms:
- Disk usage exceeding projections
- Cluster disk watermark warnings
Diagnosis:
# Check index sizes
curl -u "admin:password" \
"https://opensearch.prod.internal:9200/_cat/indices/audit-*?v&h=index,docs.count,store.size&s=store.size:desc"
# Check average event size
curl -u "admin:password" \
"https://opensearch.prod.internal:9200/audit-*/_stats?pretty" | jq '.indices | to_entries | map({index: .key, docs: .value.total.docs.count, size: .value.total.store.size_in_bytes}) | map(.avg_size = (.size / .docs))'
Resolution:
# Reduce body size
kubectl set env deployment/pdaas-api AUDIT_MAX_BODY_SIZE=5120 # 5KB
# Exclude high-volume, low-value endpoints
kubectl set env deployment/pdaas-api AUDIT_EXCLUDED_PATHS="/health,/metrics,/high-volume-endpoint/*"
# Accelerate ILM policy (delete sooner)
# Update ILM policy to delete after 30 days instead of 90
Best Practices
Configuration Management
- Use secrets manager for sensitive values (passwords, keys)
- Version control environment configuration
- Document changes to production configuration
- Test configuration in staging before production
Monitoring and Alerting
- Set up alerts before deployment
- Test alerts in staging
- Document alert runbooks for on-call engineers
- Review alerts weekly for tuning
Team Readiness
- Train team on runbook procedures
- Conduct drills for rollback scenarios
- Assign clear roles (deployer, monitor, approver)
- Establish communication channels (Slack, email)
Documentation
- Keep runbook updated with production learnings
- Document all incidents for pattern analysis
- Share knowledge across team
- Maintain deployment log with timestamps and decisions
Success Criteria
Deployment is considered successful when:
- ✅ All 488+ tests passing
- ✅ Zero critical incidents in 30 days
- ✅ Performance targets met (< 5ms latency, < 0.1% errors)
- ✅ 99.9% availability achieved
- ✅ Storage usage within projections
- ✅ Team confident in operating system
- ✅ All documentation complete and accurate
- ✅ Stakeholders satisfied with compliance coverage
Next Steps
After successful production deployment:
- Monitor for 30 days - Validate stability
- Collect feedback - From security, compliance, and operations teams
- Optimize configuration - Based on production data
- Plan enhancements - Epic 06-08 features (retention, workers, compliance)
- Share success - Document case study and lessons learned
Support
For deployment assistance:
- Operational Runbook: Audit Operations Runbook
- Configuration Guide: Audit Configuration Guide
- Architecture Overview: Audit Trails Overview
- Team Contact: #audit-module-support on Slack