Audit Module Production Deployment Guide

This guide covers the complete production deployment process for the PDaaS Audit Module, including canary deployment, gradual rollout, and post-deployment monitoring.

Pre-Deployment Checklist

Testing and Quality Assurance

  • All unit tests passing (488+ tests)
  • All integration tests passing
  • All E2E tests passing (if applicable)
  • Load tests completed successfully
    • Sustained load: 5,000 req/s for 1 hour
    • Spike load: 20,000 req/s for 5 minutes
    • Multi-tenant load: 100 orgs, 10,000 req/s
  • Performance targets met
    • Request overhead < 5ms (p95)
    • OpenSearch write latency < 100ms (p95)
    • Queue size < 1,000 events under normal load
    • Error rate < 0.1%
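The p95 targets above can be verified against raw latency samples with a simple nearest-rank percentile helper. This is a sketch for spot checks — load-test tooling normally reports percentiles directly, and the sample values here are illustrative:

```python
import math

def percentile(values, q):
    """Nearest-rank percentile: the smallest sample with at least q% of values at or below it."""
    ordered = sorted(values)
    rank = math.ceil(q / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# Per-request audit overhead samples in milliseconds (illustrative values)
overhead_ms = [1.2, 0.8, 1.5, 2.1, 0.9, 1.1, 4.8, 1.3, 0.7, 3.9]
print(percentile(overhead_ms, 95))  # → 4.8
```

Here the p95 of 4.8 ms sits just under the 5 ms target; a single slow outlier above 5 ms would not fail the check, which is exactly why the targets are stated as p95 rather than max.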

Security Review

  • Security review completed and approved
  • No critical or high-severity vulnerabilities
  • Sensitive data sanitization verified
  • Multi-tenant isolation tested
  • TLS/SSL configuration validated
  • Authentication credentials secured

Infrastructure Readiness

  • OpenSearch cluster provisioned and configured
    • Cluster health: GREEN
    • Sufficient storage capacity (3+ months)
    • Replication configured (min 1 replica)
    • Backup/snapshot configured
  • Network connectivity validated
    • Application can reach OpenSearch (port 9200)
    • Firewall rules configured
    • VPC/security groups configured
  • Monitoring infrastructure ready
    • Prometheus configured to scrape metrics
    • AlertManager configured with alert rules
    • OpenSearch Dashboards configured
    • PagerDuty/Slack integration configured

Configuration and Documentation

  • Production environment variables configured
  • Excluded paths configuration reviewed
  • Index lifecycle policies configured
  • Operational runbook reviewed
  • Team trained on runbook procedures
  • Rollback plan documented and reviewed

Approvals

  • Code review approved
  • Security team approval
  • Operations team approval
  • Product owner approval
  • Stakeholder notification sent

Infrastructure Prerequisites

OpenSearch Cluster Requirements

Cluster Sizing

Based on load testing results and expected traffic:

Production Cluster (Recommended):

  • Nodes: 3 data nodes (minimum)
  • Instance Type: r6g.2xlarge (8 vCPU, 64 GB RAM)
  • Storage: 1 TB SSD per node (3 TB total)
  • Replication: 1 replica (2 copies of data)
  • Shards: 5 primary shards per index

Traffic Capacity:

  • Events/second: 10,000-20,000 sustained
  • Daily events: 500M-1B
  • Daily storage: 50-100 GB
  • Retention: 90 days

Cluster Configuration

# opensearch.yml
cluster.name: pdaas-audit-prod
node.name: ${HOSTNAME}
network.host: 0.0.0.0

# Security
plugins.security.ssl.http.enabled: true
plugins.security.ssl.transport.enabled: true

# Performance
indices.memory.index_buffer_size: 30%
thread_pool.write.queue_size: 1000

# Snapshots
path.repo: ["/mnt/snapshots"]

Index Templates

Create index templates before deployment:

# Create audit index template
curl -X PUT "https://opensearch.prod.internal:9200/_index_template/audit-template" \
  -H 'Content-Type: application/json' \
  -u "admin:password" \
  -d '{
    "index_patterns": ["audit-*"],
    "template": {
      "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 1,
        "refresh_interval": "30s",
        "index.codec": "best_compression"
      },
      "mappings": {
        "properties": {
          "occurred_at": {"type": "date"},
          "actor_id": {"type": "keyword"},
          "actor_type": {"type": "keyword"},
          "organization_id": {"type": "keyword"},
          "account_id": {"type": "keyword"},
          "action": {"type": "keyword"},
          "target": {"type": "keyword"},
          "service": {"type": "keyword"},
          "request_method": {"type": "keyword"},
          "request_path": {"type": "keyword"},
          "response_status_code": {"type": "short"},
          "response_duration_ms": {"type": "float"},
          "trace_id": {"type": "keyword"}
        }
      }
    }
  }'
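Before go-live it is worth sanity-checking that the events the application emits match the template's field types. The sketch below validates a sample document locally against a subset of the mapping above; the type rules are simplified (OpenSearch coerces some values on ingest), and the sample event values are illustrative:

```python
from datetime import datetime, timezone

def is_iso_date(value):
    """True if value is an ISO-8601 date string parseable by datetime.fromisoformat."""
    if not isinstance(value, str):
        return False
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False

# Simplified checks for the mapping types used in the template
TYPE_CHECKS = {
    "date": is_iso_date,
    "keyword": lambda v: isinstance(v, str),
    "short": lambda v: isinstance(v, int) and -32768 <= v <= 32767,
    "float": lambda v: isinstance(v, (int, float)),
}

# A subset of the template's fields (names and types copied from the mapping above)
MAPPING = {
    "occurred_at": "date",
    "actor_id": "keyword",
    "organization_id": "keyword",
    "action": "keyword",
    "response_status_code": "short",
    "response_duration_ms": "float",
}

def validate(event):
    """Return the names of fields whose values don't match their mapped type."""
    return [f for f, t in MAPPING.items() if f in event and not TYPE_CHECKS[t](event[f])]

event = {
    "occurred_at": datetime.now(timezone.utc).isoformat(),
    "actor_id": "user-123",
    "organization_id": "org-42",
    "action": "policy.update",
    "response_status_code": 200,
    "response_duration_ms": 12.5,
}
print(validate(event))  # → []
```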

Index Lifecycle Policy

Configure automated retention:

# Create ILM policy for 90-day retention
curl -X PUT "https://opensearch.prod.internal:9200/_plugins/_ism/policies/audit-lifecycle" \
  -H 'Content-Type: application/json' \
  -u "admin:password" \
  -d '{
    "policy": {
      "description": "Audit log lifecycle: hot(30d) -> warm(30d) -> cold(30d) -> delete",
      "default_state": "hot",
      "states": [
        {
          "name": "hot",
          "actions": [],
          "transitions": [{
            "state_name": "warm",
            "conditions": {"min_index_age": "30d"}
          }]
        },
        {
          "name": "warm",
          "actions": [
            {"replica_count": {"number_of_replicas": 1}}
          ],
          "transitions": [{
            "state_name": "cold",
            "conditions": {"min_index_age": "60d"}
          }]
        },
        {
          "name": "cold",
          "actions": [
            {"replica_count": {"number_of_replicas": 0}},
            {"read_only": {}}
          ],
          "transitions": [{
            "state_name": "delete",
            "conditions": {"min_index_age": "90d"}
          }]
        },
        {
          "name": "delete",
          "actions": [{"delete": {}}],
          "transitions": []
        }
      ]
    }
  }'

Application Configuration

Production Environment Variables

Create production .env file:

# Environment
ENV=production

# Audit Module - Production Configuration
AUDIT_ENABLED=true
AUDIT_SERVICE_NAME=api
AUDIT_ENVIRONMENT=production
AUDIT_SERVICE_VERSION=1.0.0

# OpenSearch Connection
AUDIT_OPENSEARCH_HOST=opensearch.prod.internal
AUDIT_OPENSEARCH_PORT=9200
AUDIT_OPENSEARCH_USERNAME=audit_writer
# From secrets manager
AUDIT_OPENSEARCH_PASSWORD=${OPENSEARCH_AUDIT_PASSWORD}
AUDIT_OPENSEARCH_USE_SSL=true
AUDIT_OPENSEARCH_VERIFY_CERTS=true
AUDIT_OPENSEARCH_CA_CERTS=/etc/ssl/certs/opensearch-ca.crt
AUDIT_OPENSEARCH_TIMEOUT=30

# Performance Tuning (Production)
AUDIT_BATCH_SIZE=500
AUDIT_FLUSH_INTERVAL_SECONDS=10.0
AUDIT_MAX_BODY_SIZE=20480
AUDIT_MAX_QUEUE_SIZE=50000

# Path Exclusion
AUDIT_EXCLUDED_PATHS=/health,/healthz,/metrics,/docs,/redoc,/openapi.json

# Index Configuration
AUDIT_INDEX_PREFIX=audit
AUDIT_INDEX_ROTATION=daily

# Retry Configuration
AUDIT_RETRY_ATTEMPTS=3
AUDIT_RETRY_BACKOFF_SECONDS=2.0
AUDIT_RETRY_BACKOFF_MULTIPLIER=2.0
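With these retry settings, failed writes back off exponentially. The schedule implied by the three values can be sketched as follows — this models the presumed semantics (initial delay times multiplier per attempt); the module's exact formula may differ:

```python
def retry_delays(attempts=3, backoff=2.0, multiplier=2.0):
    """Delay in seconds before each retry attempt: backoff * multiplier**n."""
    return [backoff * multiplier ** n for n in range(attempts)]

# AUDIT_RETRY_ATTEMPTS=3, AUDIT_RETRY_BACKOFF_SECONDS=2.0, AUDIT_RETRY_BACKOFF_MULTIPLIER=2.0
print(retry_delays())  # → [2.0, 4.0, 8.0]
```

Worst case, a batch is held for about 14 seconds of cumulative backoff before being handed to the circuit-breaker/fallback path, which is why AUDIT_MAX_QUEUE_SIZE needs headroom for events arriving during retries.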

OpenSearch User Setup

Create dedicated audit writer user with minimal permissions:

# Create audit_writer role
curl -X PUT "https://opensearch.prod.internal:9200/_plugins/_security/api/roles/audit_writer" \
  -H 'Content-Type: application/json' \
  -u "admin:password" \
  -d '{
    "cluster_permissions": ["cluster:monitor/health"],
    "index_permissions": [{
      "index_patterns": ["audit-*"],
      "allowed_actions": [
        "indices:data/write/index",
        "indices:data/write/bulk",
        "indices:admin/create"
      ]
    }]
  }'

# Create audit_writer user
curl -X PUT "https://opensearch.prod.internal:9200/_plugins/_security/api/internalusers/audit_writer" \
  -H 'Content-Type: application/json' \
  -u "admin:password" \
  -d '{
    "password": "STRONG_PASSWORD_HERE",
    "opendistro_security_roles": ["audit_writer"]
  }'

Deployment Phases

Phase 1: Infrastructure Setup (Day 0)

Duration: 2-4 hours

  1. Provision OpenSearch cluster

    # Using AWS CloudFormation, Terraform, or manual setup
    terraform apply -var-file=production.tfvars
  2. Verify cluster health

    curl -u "admin:password" "https://opensearch.prod.internal:9200/_cluster/health?pretty"
  3. Create index templates and policies

    ./scripts/setup-opensearch-production.sh
  4. Configure monitoring

    # Apply Prometheus scrape config
    kubectl apply -f k8s/prometheus-config.yaml

    # Import OpenSearch Dashboards
    python backend/audit/dashboards/import_dashboard.py \
    --url https://opensearch.prod.internal:9200 \
    --username admin \
    --all
  5. Configure alerting

    # Apply AlertManager rules
    kubectl apply -f k8s/alertmanager-rules.yaml

Phase 2: Canary Deployment (Day 1-2)

Duration: 24-48 hours
Traffic: 10% of production

See detailed Canary Deployment section below.

Phase 3: Gradual Rollout (Day 3-5)

Duration: 48-72 hours
Traffic: 10% → 25% → 50% → 100%

See detailed Full Rollout section below.

Phase 4: Post-Deployment Monitoring (Day 6-36)

Duration: 30 days
Traffic: 100%

See detailed Post-Deployment Monitoring section below.

Canary Deployment

Overview

Canary deployment reduces production risk by:

  • Deploying to a small subset of traffic (10%)
  • Monitoring for issues without impacting all users
  • Allowing quick rollback if problems are detected
  • Building confidence before the full rollout

Canary Strategy

Approach: Route-based canary using excluded paths

Initial State (Before Canary):

# backend/api/app.py - init_audit()
config = get_audit_config()

# Audit disabled or fully excluded
config.enabled = False # OR
config.excluded_paths = ["/*"] # Exclude all paths

Canary State (10% traffic):

# backend/api/app.py - init_audit()
config = get_audit_config()
config.enabled = True

# Only audit 10% of endpoints (exclude 90%)
config.excluded_paths = [
"/health",
"/metrics",
"/docs",
"/redoc",
"/openapi.json",
# Exclude most endpoints (keep only critical for canary)
"/accounts/*",
"/memberships/*",
"/groups/*",
"/service-accounts/*",
# Keep: /authentication/*, /policies/*, /organizations/*
]
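The exclusion lists above rely on glob-style path matching. A minimal model of how a request path is tested against them, assuming fnmatch-style wildcard semantics (the middleware's actual matcher may differ in edge cases):

```python
from fnmatch import fnmatch

# Canary exclusion list from the configuration above
CANARY_EXCLUDED = [
    "/health", "/metrics", "/docs", "/redoc", "/openapi.json",
    "/accounts/*", "/memberships/*", "/groups/*", "/service-accounts/*",
]

def is_audited(path, excluded=CANARY_EXCLUDED):
    """A path is audited only if it matches none of the excluded patterns."""
    return not any(fnmatch(path, pattern) for pattern in excluded)

print(is_audited("/accounts/123"))          # → False (excluded during canary)
print(is_audited("/authentication/login"))  # → True  (kept for canary coverage)
```

This makes the canary percentage a function of which route families you exclude, so estimate it from each family's share of production traffic rather than the count of excluded patterns.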

Canary Deployment Steps

Step 1: Pre-Deployment Validation (30 min)

# 1. Verify all tests pass
python -m pytest backend/audit/tests/ -v

# Expected: 488+ tests passing

# 2. Verify OpenSearch cluster health
curl -u "audit_writer:password" \
"https://opensearch.prod.internal:9200/_cluster/health?pretty"

# Expected: {"status": "green"}

# 3. Verify application can connect
python -c "
from backend.audit.config import get_audit_config
from backend.audit.client import OpenSearchClientFactory
import asyncio

async def test():
    config = get_audit_config()
    client = await OpenSearchClientFactory.get_client(config)
    healthy = await client.health_check()
    print(f'OpenSearch healthy: {healthy}')

asyncio.run(test())
"

# Expected: OpenSearch healthy: True

Step 2: Deploy Canary (15 min)

# 1. Update environment variables
export AUDIT_ENABLED=true
export AUDIT_ENVIRONMENT=production
# ... other production vars

# 2. Deploy application with canary configuration
# Using Kubernetes rolling update
kubectl set env deployment/pdaas-api AUDIT_ENABLED=true
kubectl set env deployment/pdaas-api AUDIT_EXCLUDED_PATHS="/health,/metrics,/accounts/*,/memberships/*,/groups/*,/service-accounts/*"

# Or using Docker
docker-compose -f docker-compose.prod.yml up -d --no-deps pdaas-api

# 3. Verify deployment
kubectl rollout status deployment/pdaas-api

# 4. Verify audit middleware is active
curl https://api.prod.internal/health

# Check logs for audit initialization
kubectl logs -l app=pdaas-api | grep "Audit system initialized"

Step 3: Monitor Canary (24 hours)

Monitoring Checklist:

  1. Check Prometheus metrics (every 15 minutes for first hour, then hourly)

    # Event ingestion rate
    rate(audit_events_total[5m])

    # Event processing latency (should be < 5ms p95)
    histogram_quantile(0.95, rate(audit_events_latency_seconds_bucket[5m]))

    # Error rate (should be < 0.1%)
    rate(audit_errors_total[5m]) / rate(audit_events_total[5m])

    # Queue size (should be < 1000)
    audit_queue_size

    # Circuit breaker state (should be 0 = CLOSED)
    circuit_breaker_state
  2. Check OpenSearch Dashboards (every hour)

    • Navigate to "Audit System Health" dashboard
    • Verify events are being written
    • Check for errors or anomalies
    • Review latency distribution
  3. Check application logs (every 30 minutes)

    kubectl logs -l app=pdaas-api --tail=100 | grep -i "audit\|error"
  4. Check alerts (continuous)

    • No critical alerts should fire
    • Warning alerts acceptable if transient
  5. Verify data in OpenSearch (every 2 hours)

    # Count events written
    curl -u "audit_writer:password" \
    "https://opensearch.prod.internal:9200/audit-*/_count?pretty"

    # Sample recent events
    curl -u "audit_writer:password" \
    "https://opensearch.prod.internal:9200/audit-*/_search?pretty&size=10&sort=occurred_at:desc"

Success Criteria (24 hours):

  • No critical alerts triggered
  • Error rate < 0.1%
  • Request latency impact < 5ms (p95)
  • OpenSearch write latency < 100ms (p95)
  • Queue size < 1,000 events
  • Circuit breaker remained CLOSED
  • No application errors related to audit
  • Events successfully written to OpenSearch
  • Multi-tenant isolation verified (spot checks)
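The measurable criteria above can be encoded as a single gate check fed by the Prometheus queries from the monitoring checklist. This is a sketch — thresholds mirror the stated criteria, metric collection is left out, and the sample inputs are illustrative:

```python
def canary_gate(error_rate, p95_overhead_ms, p95_write_ms, queue_size, breaker_state):
    """Return (passed, failures) for the measurable 24-hour canary criteria."""
    checks = {
        "error rate < 0.1%": error_rate < 0.001,
        "request overhead < 5ms (p95)": p95_overhead_ms < 5,
        "OpenSearch write latency < 100ms (p95)": p95_write_ms < 100,
        "queue size < 1,000": queue_size < 1000,
        "circuit breaker CLOSED": breaker_state == 0,
    }
    failures = [name for name, ok in checks.items() if not ok]
    return (not failures, failures)

# Illustrative readings taken at the end of the 24-hour canary window
passed, failures = canary_gate(0.0004, 3.2, 45.0, 120, 0)
print(passed, failures)  # → True []
```

The non-numeric criteria (no application errors, tenant-isolation spot checks) still need a human sign-off; the gate only automates the threshold comparisons.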

Step 4: Canary Decision (15 min)

If success criteria met:

  • ✅ Proceed to gradual rollout (Phase 3)
  • Document any minor issues for follow-up
  • Communicate success to stakeholders

If success criteria NOT met:

  • ❌ Initiate rollback (see Rollback Procedures)
  • Conduct incident review
  • Fix issues before retry

Canary Rollback

If issues detected during canary:

# Quick rollback: Disable audit via environment variable
kubectl set env deployment/pdaas-api AUDIT_ENABLED=false

# Or: Exclude all paths (safer, keeps code path active)
kubectl set env deployment/pdaas-api AUDIT_EXCLUDED_PATHS="/*"

# Verify rollback
kubectl rollout status deployment/pdaas-api

# Verify audit disabled
kubectl logs -l app=pdaas-api | grep "Audit system disabled"

Full Rollout

Overview

After successful canary (24-48 hours), gradually increase traffic:

  • Day 3: 25% (remove more exclusions)
  • Day 4: 50% (remove more exclusions)
  • Day 5: 100% (audit all endpoints except health checks)

Rollout Steps

Day 3: 25% Traffic

# Update excluded paths to audit 25% of endpoints
kubectl set env deployment/pdaas-api AUDIT_EXCLUDED_PATHS="/health,/metrics,/accounts/*,/memberships/*"

# Monitor for 24 hours using same checklist as canary

Success Criteria:

  • Same as canary, but with 2.5x traffic

Day 4: 50% Traffic

# Update excluded paths to audit 50% of endpoints
kubectl set env deployment/pdaas-api AUDIT_EXCLUDED_PATHS="/health,/metrics,/accounts/*"

# Monitor for 24 hours

Success Criteria:

  • Same as canary, but with 5x traffic
  • OpenSearch cluster still healthy
  • Storage usage within projections

Day 5: 100% Traffic (Full Production)

# Update excluded paths to audit all except health/metrics
kubectl set env deployment/pdaas-api AUDIT_EXCLUDED_PATHS="/health,/healthz,/metrics"

# Monitor for 24 hours, then 7 days, then 30 days

Success Criteria:

  • Same as canary, but with 10x traffic
  • Full compliance coverage achieved
  • No performance degradation

Rollout Decision Gates

Before proceeding to next phase, verify:

  1. No critical alerts in previous phase
  2. Error rate < 0.1% consistently
  3. Performance targets met (< 5ms p95 latency)
  4. OpenSearch cluster healthy (GREEN status)
  5. Storage usage trending as expected
  6. Team confidence high (no major issues)

If any gate fails:

  • Pause rollout
  • Investigate and fix issues
  • Re-run previous phase
  • Only proceed when gates pass

Rollback Procedures

Scenario 1: High Error Rate

Trigger: Error rate > 5% for 5+ minutes

Action:

# Immediate: Disable audit
kubectl set env deployment/pdaas-api AUDIT_ENABLED=false

# Or: Exclude all paths
kubectl set env deployment/pdaas-api AUDIT_EXCLUDED_PATHS="/*"

# Investigate errors
kubectl logs -l app=pdaas-api | grep -i "error" > error-log.txt

# Fix issues
# Re-deploy with fixes
# Retry from canary phase

Scenario 2: OpenSearch Unavailable

Trigger: Circuit breaker OPEN for 10+ minutes

Action:

# Verify circuit breaker is working (events going to logs)
kubectl logs -l app=pdaas-api | grep "Circuit breaker open"

# This is expected behavior - no rollback needed
# Events are being logged instead of lost

# Fix OpenSearch connectivity issue
# Circuit breaker will auto-close when OpenSearch recovers
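For context, the fail-open behavior described here follows the standard circuit-breaker pattern. A minimal sketch of the state machine — the audit module's actual thresholds, cooldown, and half-open handling are assumptions:

```python
import time

class CircuitBreaker:
    """CLOSED -> OPEN after `threshold` consecutive failures; a trial request
    is allowed once `cooldown` seconds have elapsed (half-open)."""

    def __init__(self, threshold=5, cooldown=60.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means CLOSED

    @property
    def state(self):
        return "OPEN" if self.opened_at is not None else "CLOSED"

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Half-open: allow a trial request after the cooldown elapses
        return time.monotonic() - self.opened_at >= self.cooldown

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # auto-close on recovery

breaker = CircuitBreaker(threshold=3)
for _ in range(3):
    breaker.record_failure()
print(breaker.state)  # → OPEN (writes now fall back to logging)
breaker.record_success()
print(breaker.state)  # → CLOSED
```

This is why Scenario 2 needs no rollback: while OPEN, writes are diverted to logs rather than retried, and the first successful trial write after OpenSearch recovers closes the breaker automatically.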

Scenario 3: Performance Degradation

Trigger: Request latency increase > 10ms (p95)

Action:

# Investigate cause
kubectl top pods -l app=pdaas-api
kubectl logs -l app=pdaas-api | grep "latency\|slow"

# If caused by audit:
# Option 1: Tune batching parameters
kubectl set env deployment/pdaas-api AUDIT_BATCH_SIZE=1000
kubectl set env deployment/pdaas-api AUDIT_FLUSH_INTERVAL_SECONDS=15.0

# Option 2: Exclude more paths
kubectl set env deployment/pdaas-api AUDIT_EXCLUDED_PATHS="/health,/metrics,/high-traffic-endpoint/*"

# Option 3: Full rollback (last resort)
kubectl set env deployment/pdaas-api AUDIT_ENABLED=false

Scenario 4: Storage Overflow

Trigger: OpenSearch disk usage > 80%

Action:

# Check storage usage
curl -u "admin:password" \
"https://opensearch.prod.internal:9200/_cat/indices/audit-*?v&h=index,store.size&s=store.size:desc"

# Option 1: Reduce body size to save space
kubectl set env deployment/pdaas-api AUDIT_MAX_BODY_SIZE=5120 # 5KB instead of 20KB

# Option 2: Delete old indices manually
curl -X DELETE -u "admin:password" \
"https://opensearch.prod.internal:9200/audit-*-2025-09-*"

# Option 3: Shorten the retention period
# (Update the ILM policy to delete after 30 days instead of 90)

# Option 4: Add storage capacity
# (Resize OpenSearch cluster)

Emergency Full Rollback

When: Critical production incident caused by audit module

Steps:

# 1. Disable audit immediately
kubectl set env deployment/pdaas-api AUDIT_ENABLED=false

# 2. Verify application recovery
curl https://api.prod.internal/health

# 3. Verify no audit activity
kubectl logs -l app=pdaas-api | grep "Audit system disabled"

# 4. Create incident report
# 5. Conduct post-mortem
# 6. Fix root cause
# 7. Re-test in staging
# 8. Retry production deployment from Phase 1

Post-Deployment Monitoring

First 24 Hours (Critical Period)

Monitoring Frequency: Every 15 minutes for first hour, then hourly

Key Metrics:

  • Event ingestion rate
  • Event processing latency
  • Error rate
  • Queue size
  • Circuit breaker state
  • OpenSearch cluster health
  • Storage growth rate

Actions:

  • Have on-call engineer monitoring
  • Keep rollback procedure ready
  • Log all anomalies for review

First 7 Days (Stabilization Period)

Monitoring Frequency: 3x daily (morning, afternoon, evening)

Key Activities:

  • Review dashboards for trends
  • Analyze error logs
  • Check storage usage growth
  • Verify multi-tenant isolation
  • Collect feedback from team
  • Tune configuration if needed

First 30 Days (Validation Period)

Monitoring Frequency: Daily

Success Metrics:

  • Availability: 99.9% uptime
  • Performance: < 5ms latency impact (p95)
  • Reliability: < 0.1% error rate
  • Compliance: 100% coverage of security operations
  • Storage: Usage within 10% of projections
  • Incidents: Zero critical incidents

End of 30 Days:

  • Conduct deployment retrospective
  • Document lessons learned
  • Update runbook with production insights
  • Mark feature as "Generally Available" (GA)

Troubleshooting

Issue: Events Not Appearing in OpenSearch

Symptoms:

  • Prometheus shows events emitted (audit_events_total increasing)
  • OpenSearch shows no events (audit-* indices empty or missing)

Diagnosis:

# 1. Check circuit breaker state
curl http://api.prod.internal/metrics | grep circuit_breaker_state

# 2. Check OpenSearch connectivity
kubectl exec -it pdaas-api-pod -- python -c "
from backend.audit.client import OpenSearchClientFactory
from backend.audit.config import get_audit_config
import asyncio

async def test():
    config = get_audit_config()
    client = await OpenSearchClientFactory.get_client(config)
    healthy = await client.health_check()
    print(f'Healthy: {healthy}')

asyncio.run(test())
"

# 3. Check application logs
kubectl logs -l app=pdaas-api | grep -i "opensearch\|circuit"

Resolution:

  • If circuit breaker OPEN: Fix OpenSearch connectivity
  • If authentication failed: Verify credentials
  • If network issue: Check firewall/security groups

Issue: High Memory Usage

Symptoms:

  • Application pod memory usage increasing
  • OOMKilled errors in Kubernetes

Diagnosis:

# Check queue size
curl http://api.prod.internal/metrics | grep audit_queue_size

Resolution:

# Reduce queue size
kubectl set env deployment/pdaas-api AUDIT_MAX_QUEUE_SIZE=5000

# Increase flush frequency
kubectl set env deployment/pdaas-api AUDIT_FLUSH_INTERVAL_SECONDS=2.0

# Increase batch size (fewer flushes)
kubectl set env deployment/pdaas-api AUDIT_BATCH_SIZE=1000
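A rough upper bound on queue memory helps choose AUDIT_MAX_QUEUE_SIZE before resorting to trial and error. The average event size used here is an assumption — bodies are capped at AUDIT_MAX_BODY_SIZE (20 KB) but most events are far smaller:

```python
def queue_memory_mb(max_queue_size, avg_event_kb=4.0):
    """Approximate worst-case in-memory footprint of a full audit queue, in MB."""
    return max_queue_size * avg_event_kb / 1024

# Default production setting vs. the reduced value suggested above
print(round(queue_memory_mb(50_000), 1))  # → 195.3
print(round(queue_memory_mb(5_000), 1))   # → 19.5
```

If pods are being OOMKilled, compare this estimate against the container's memory limit: a full 50,000-event queue can plausibly account for a couple hundred MB on its own.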

Issue: OpenSearch Indices Growing Too Fast

Symptoms:

  • Disk usage exceeding projections
  • Cluster disk watermark warnings

Diagnosis:

# Check index sizes
curl -u "admin:password" \
"https://opensearch.prod.internal:9200/_cat/indices/audit-*?v&h=index,docs.count,store.size&s=store.size:desc"

# Check average event size
curl -u "admin:password" \
"https://opensearch.prod.internal:9200/audit-*/_stats?pretty" | jq '.indices | to_entries | map({index: .key, docs: .value.total.docs.count, size: .value.total.store.size_in_bytes}) | map(.avg_size = (.size / .docs))'

Resolution:

# Reduce body size
kubectl set env deployment/pdaas-api AUDIT_MAX_BODY_SIZE=5120 # 5KB

# Exclude high-volume, low-value endpoints
kubectl set env deployment/pdaas-api AUDIT_EXCLUDED_PATHS="/health,/metrics,/high-volume-endpoint/*"

# Accelerate ILM policy (delete sooner)
# Update ILM policy to delete after 30 days instead of 90

Best Practices

Configuration Management

  • Use secrets manager for sensitive values (passwords, keys)
  • Version control environment configuration
  • Document changes to production configuration
  • Test configuration in staging before production

Monitoring and Alerting

  • Set up alerts before deployment
  • Test alerts in staging
  • Document alert runbooks for on-call engineers
  • Review alerts weekly for tuning

Team Readiness

  • Train team on runbook procedures
  • Conduct drills for rollback scenarios
  • Assign clear roles (deployer, monitor, approver)
  • Establish communication channels (Slack, email)

Documentation

  • Keep runbook updated with production learnings
  • Document all incidents for pattern analysis
  • Share knowledge across team
  • Maintain deployment log with timestamps and decisions

Success Criteria

Deployment is considered successful when:

  • ✅ All 488+ tests passing
  • ✅ Zero critical incidents in 30 days
  • ✅ Performance targets met (< 5ms latency, < 0.1% errors)
  • ✅ 99.9% availability achieved
  • ✅ Storage usage within projections
  • ✅ Team confident in operating system
  • ✅ All documentation complete and accurate
  • ✅ Stakeholders satisfied with compliance coverage

Next Steps

After successful production deployment:

  1. Monitor for 30 days - Validate stability
  2. Collect feedback - From security, compliance, and operations teams
  3. Optimize configuration - Based on production data
  4. Plan enhancements - Epic 06-08 features (retention, workers, compliance)
  5. Share success - Document case study and lessons learned

Support

For deployment assistance: