Incident Response Runbook for Self-Hosted Supabase

Build a structured incident response plan for your self-hosted Supabase with detection, triage, recovery procedures, and post-mortems.

When Supabase Cloud had a 3-hour outage in its US-East-2 region in February 2026, affected teams had one option: wait for Supabase engineers to fix it. If you're running self-hosted Supabase, you have full control during incidents, but that control means nothing without a plan.

This runbook provides a structured approach to incident response for self-hosted Supabase deployments. It's designed to be printed, bookmarked, and referenced at 2 AM when your pager goes off.

Understanding Your Supabase Service Dependencies

Before diving into response procedures, you need to understand how Supabase services depend on each other. When PostgreSQL goes down, everything fails. When Auth crashes, only authentication breaks. This hierarchy determines your triage priority.

Critical Path (affects everything):

PostgreSQL → Kong (API Gateway) → All APIs

Service Dependency Map:

PostgreSQL
├── PostgREST (REST API)
├── GoTrue (Auth)
│   └── Requires SMTP for magic links
├── Realtime
├── Storage
│   └── Requires S3/MinIO backend
├── pg_graphql
└── Analytics (Logflare)
    └── Requires BigQuery or local Postgres

If your REST API fails but Auth works, PostgREST is likely the issue. If nothing works, start with PostgreSQL and Kong. Understanding these relationships saves critical minutes during outages.
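
A quick way to apply this map under pressure is to walk the chain from the bottom up. The commands below assume the default docker-compose service names (db, kong, rest, auth); adjust them to match your stack:

# Walk the dependency chain from the bottom up
docker compose exec -T db pg_isready -U postgres   # 1. Is the database reachable?
docker compose logs kong --tail 20                 # 2. Is the gateway routing?
docker compose logs rest auth --tail 20            # 3. Then the service matching the symptom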

Phase 1: Detection and Initial Assessment

The first five minutes of an incident set the tone for recovery. Here's a structured approach to quickly understand what's happening.

Immediate Health Check Script

Save this script and run it first when alerts fire:

#!/bin/bash
# supabase-health-check.sh

echo "=== Supabase Health Check ==="
echo "Timestamp: $(date -u)"
echo ""

# Check all containers
echo "Container Status:"
docker compose ps --format "table {{.Name}}\t{{.Status}}\t{{.Ports}}"
echo ""

# Check container resource usage
echo "Resource Usage:"
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"
echo ""

# Test PostgreSQL connection
echo "PostgreSQL Connection:"
docker compose exec -T db pg_isready -U postgres && echo "OK" || echo "FAILED"
echo ""

# Test API Gateway (prints the HTTP status code; "unreachable" means Kong itself is down)
echo "Kong API Gateway:"
curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://localhost:8000/rest/v1/ && echo " (reachable)" || echo " (unreachable)"
echo ""

# Test Auth
echo "Auth Service:"
curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://localhost:8000/auth/v1/health && echo " (reachable)" || echo " (unreachable)"
echo ""

# Check disk space
echo "Disk Space:"
df -h / | tail -1
echo ""

# Check recent logs for errors
echo "Recent Error Logs (last 5 minutes):"
docker compose logs --since 5m 2>&1 | grep -i "error\|fatal\|panic" | tail -10

Run this script immediately when you suspect an issue. The output tells you which services are affected and where to focus.

Severity Classification

Not all incidents are equal. Classify severity to determine your response intensity:

| Severity | Criteria | Response Time | Example |
|----------|----------|---------------|---------|
| P1 - Critical | Complete service outage, data loss risk | Immediate, all hands | Database down, storage corruption |
| P2 - High | Major feature broken, many users affected | Within 15 minutes | Auth failures, API errors |
| P3 - Medium | Feature degraded, workaround exists | Within 1 hour | Slow queries, Realtime disconnects |
| P4 - Low | Minor issue, limited impact | Next business day | Dashboard glitch, non-critical logs |

For P1 and P2 incidents, proceed immediately to Phase 2.

Phase 2: Containment and Communication

Once you understand the scope, communicate clearly and prevent further damage.

Communication Template

Use this template for stakeholder updates. Adapt it for Slack, email, or your status page:

INCIDENT: [Brief description]
SEVERITY: P[1-4]
STATUS: [Investigating | Identified | Monitoring | Resolved]
STARTED: [UTC timestamp]
IMPACT: [What's broken and who's affected]

CURRENT ACTIONS:
- [What you're doing right now]

NEXT UPDATE: [Time for next update, typically 15-30 min for P1/P2]

Example:

INCIDENT: Database connection failures
SEVERITY: P1
STATUS: Investigating
STARTED: 2026-04-30 03:15 UTC
IMPACT: All API requests failing. ~500 active users affected.

CURRENT ACTIONS:
- Checking PostgreSQL container status
- Reviewing recent deployment changes

NEXT UPDATE: 03:30 UTC

Containment Decisions

Sometimes you need to deliberately limit damage while you investigate:

Enable maintenance mode (if your app supports it):

# Example: return 503 to all requests while you investigate.
# Supabase's Kong runs in DB-less mode, so add a request-termination plugin
# (status_code: 503) to the mounted kong.yml, then restart Kong to pick it up
docker compose restart kong

Isolate the database (if you suspect data corruption):

# Stop all services except the database
docker compose stop rest auth realtime storage
# Now investigate the database safely
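
Once the database checks out, bring the stopped services back in one step:

# Resume the services stopped above
docker compose start rest auth realtime storage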

Scale back connections (if connection exhaustion is suspected):

# Check current connections
docker compose exec db psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"
# Terminate idle connections
docker compose exec db psql -U postgres -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '5 minutes';"
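
If connections climb right back after you terminate them, grouping activity by user and application usually points at the client that's exhausting the pool:

# Who is holding the connections?
docker compose exec db psql -U postgres -c "SELECT usename, application_name, state, count(*) FROM pg_stat_activity GROUP BY 1, 2, 3 ORDER BY count(*) DESC LIMIT 10;"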

Phase 3: Service-Specific Recovery Procedures

Here are step-by-step recovery procedures for the most common failure scenarios.

PostgreSQL Recovery

PostgreSQL failures are the most critical since all other services depend on it.

Symptom: Database unreachable, container restarting

Step 1: Check container logs

docker compose logs db --tail 100

Step 2: Common causes and fixes

Out of disk space:

# Check disk usage
df -h /var/lib/docker
# If full, clean up Docker (-a also removes unused images, which will need re-pulling later)
docker system prune -af
# Check the size of the WAL directory
docker compose exec db du -sh /var/lib/postgresql/data/pg_wal
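
If WAL isn't the culprit, it helps to see which databases are taking the space:

# Largest databases first
docker compose exec db psql -U postgres -c "SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size FROM pg_database ORDER BY pg_database_size(datname) DESC;"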

Corrupted shared memory:

# Restart with clean state
docker compose down
docker compose up -d db
# Wait for recovery
docker compose logs -f db

Max connections reached:

# Check the limit, then raise max_connections in postgresql.conf if needed
docker compose exec db psql -U postgres -c "SHOW max_connections;"
# Or restart the Supavisor pooler to drop stale pooled connections
docker compose restart supavisor

Step 3: Verify recovery

docker compose exec db pg_isready -U postgres
docker compose exec db psql -U postgres -c "SELECT 1;"
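
It's also worth confirming in the logs that crash recovery completed, rather than the container simply restarting:

# The most recent startup should end with this line
docker compose logs db --tail 200 | grep "database system is ready to accept connections"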

Auth (GoTrue) Recovery

Symptom: Login failures, JWT validation errors, magic links not sending

Step 1: Check service health

docker compose logs auth --tail 50
curl http://localhost:8000/auth/v1/health

Step 2: Common causes

JWT secret mismatch:

# Verify JWT_SECRET matches across services
grep JWT_SECRET .env
# Compare with what GoTrue and PostgREST are actually running with
docker compose exec auth env | grep GOTRUE_JWT
docker compose exec rest env | grep PGRST_JWT

SMTP configuration issues (magic links failing):

# Trigger a recovery email to exercise SMTP (the address must exist in auth.users)
docker compose exec auth wget -qO- --post-data='{"email":"[email protected]"}' \
  --header='Content-Type: application/json' \
  http://localhost:9999/recover
# Check SMTP env vars
docker compose exec auth env | grep SMTP

Step 3: Restart sequence

# Auth depends on database, restart both if needed
docker compose restart db
sleep 10
docker compose restart auth

Storage Recovery

Symptom: File uploads failing, existing files inaccessible

Step 1: Check storage backend connectivity

# For S3-compatible backends (MinIO, R2, etc.)
docker compose logs storage --tail 50
# Test backend connectivity (MinIO's liveness endpoint; if curl isn't available
# in the storage image, run this from the host against MinIO's published port)
docker compose exec storage curl -I http://minio:9000/minio/health/live

Step 2: Common fixes

MinIO/S3 credentials expired or changed:

# Verify credentials match
grep -E "STORAGE_S3|AWS" .env
docker compose exec storage env | grep -E "S3|AWS"

Bucket permissions:

# Check the bucket exists and is accessible (assumes an mc alias named "local"
# has been set with: mc alias set local http://localhost:9000 <access-key> <secret-key>)
docker compose exec minio mc ls local/supabase

Step 3: Restart storage chain

docker compose restart storage imgproxy

Phase 4: Verification and Monitoring

Never declare victory too early. Verify the fix holds under load.

Post-Recovery Verification Checklist

# 1. All containers healthy
docker compose ps | grep -v "Up"  # Should show only the header row

# 2. Database accepts connections
docker compose exec db psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"

# 3. API responds correctly
curl -H "apikey: $ANON_KEY" http://localhost:8000/rest/v1/

# 4. Auth works
curl -X POST "http://localhost:8000/auth/v1/token?grant_type=password" \
  -H "apikey: $ANON_KEY" \
  -H "Content-Type: application/json" \
  -d '{"email":"[email protected]","password":"testpass"}'

# 5. Realtime connects
# Test with your application or wscat
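# e.g. with wscat installed (the query string follows Supabase's realtime websocket
#      route through Kong; a successful connect is enough, Ctrl+C to exit)
wscat -c "ws://localhost:8000/realtime/v1/websocket?apikey=$ANON_KEY&vsn=1.0.0"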

# 6. Storage uploads work
curl -X POST http://localhost:8000/storage/v1/object/test-bucket/test.txt \
  -H "apikey: $SERVICE_KEY" \
  -H "Authorization: Bearer $SERVICE_KEY" \
  -H "Content-Type: text/plain" \
  -d "test content"

Extended Monitoring Period

After any P1 or P2 incident, increase monitoring for 24 hours:

  • Set up temporary alerts with lower thresholds (a minimal cron sketch follows this list)
  • Check logs every few hours for recurring errors
  • Monitor resource usage trends
  • Verify backup jobs completed successfully after the incident
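
One low-effort way to cover the first point is a temporary cron entry that runs the Phase 1 health-check script and logs failures. The paths below are examples; swap in your own project directory and alerting hook:

# Add via crontab -e, and remove it once the monitoring window ends
*/5 * * * * cd /opt/supabase && ./supabase-health-check.sh >> /var/log/supabase-health.log 2>&1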

Phase 5: Post-Incident Review

Every incident is a learning opportunity. Within 48 hours, conduct a blameless post-mortem.

Post-Mortem Template

## Incident Post-Mortem: [Title]

### Summary
- **Duration**: [Start time] to [End time] ([X] hours)
- **Severity**: P[X]
- **Impact**: [Users affected, data lost, revenue impact]

### Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM | First alert triggered |
| HH:MM | Engineer acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service restored |

### Root Cause
[Detailed technical explanation]

### What Went Well
- [Detection was fast because...]
- [Recovery was smooth because...]

### What Could Be Improved
- [Detection could have been faster if...]
- [Recovery would have been easier if...]

### Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| [Specific improvement] | [Name] | [Date] |

### Lessons Learned
[Key takeaways for the team]

Common Action Items to Consider

Based on incident patterns, these improvements often emerge from post-mortems:

  1. Better monitoring: Set up alerting and notifications before the next incident
  2. Backup verification: Test your restore procedures regularly (see the restore-drill sketch after this list)
  3. Documentation updates: Update this runbook with lessons learned
  4. Capacity planning: If resource exhaustion caused the incident, review your capacity planning
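
For the second item, a restore drill can be as small as loading the newest dump into a scratch database and running one sanity query. This sketch assumes custom-format pg_dump archives in a backups directory mounted at /backups inside the db container; adjust names and paths to your setup:

# Restore the latest dump into a throwaway database and spot-check it
docker compose exec -T db createdb -U postgres restore_test
docker compose exec -T db pg_restore -U postgres -d restore_test /backups/latest.dump
docker compose exec -T db psql -U postgres -d restore_test -c "SELECT count(*) FROM auth.users;"
docker compose exec -T db dropdb -U postgres restore_test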

Simplifying Incident Response with Supascale

Building and maintaining incident response procedures is one of the hidden costs of self-hosting. Supascale helps reduce this operational burden with:

  • Health monitoring dashboard: See the status of all Supabase services at a glance
  • One-click restores: If something goes catastrophically wrong, restore from any backup point with a single click
  • Service management UI: Restart individual services without SSH access
  • Centralized logging: Review logs across all services in one place

When you're responding to an incident at 2 AM, having a management interface that shows exactly what's wrong—and lets you fix it quickly—makes the difference between a 15-minute recovery and a 3-hour nightmare.

Check our features to see how Supascale simplifies self-hosted Supabase operations, or review our pricing to see if it fits your needs.

Keep This Runbook Updated

A runbook is only useful if it reflects your actual environment. Review and update this document:

  • After every incident (add lessons learned)
  • When you change your infrastructure
  • Quarterly, even if no incidents occurred
  • When team members change

Print a copy. Bookmark it. Make sure everyone on your team knows where to find it.

