When Supabase Cloud had a three-hour outage in its US-East-2 region in February 2026, affected teams had one option: wait for Supabase engineers to fix it. If you're running self-hosted Supabase, you have full control during incidents, but that control means nothing without a plan.
This runbook provides a structured approach to incident response for self-hosted Supabase deployments. It's designed to be printed, bookmarked, and referenced at 2 AM when your pager goes off.
Understanding Your Supabase Service Dependencies
Before diving into response procedures, you need to understand how Supabase services depend on each other. When PostgreSQL goes down, everything fails. When Auth crashes, only authentication breaks. This hierarchy determines your triage priority.
Critical Path (affects everything):
PostgreSQL → Kong (API Gateway) → All APIs
Service Dependency Map:
PostgreSQL
├── PostgREST (REST API)
├── GoTrue (Auth)
│   └── Requires SMTP for magic links
├── Realtime
├── Storage
│   └── Requires S3/MinIO backend
├── pg_graphql
└── Analytics (Logflare)
    └── Requires BigQuery or local Postgres
If your REST API fails but Auth works, PostgREST is likely the issue. If nothing works, start with PostgreSQL and Kong. Understanding these relationships saves critical minutes during outages.
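A quick way to apply this hierarchy is to probe each layer in dependency order and stop digging at the first failure. A minimal sketch, assuming the default ports from the standard docker compose setup; any HTTP status back (even a 401) means the gateway itself is up, while a connection failure points lower in the stack:
# Probe in dependency order: database first, then the gateway, then the APIs behind it
docker compose exec -T db pg_isready -U postgres || echo "PostgreSQL down - start here"
curl -s -o /dev/null -w "Kong (REST route) HTTP %{http_code}\n" http://localhost:8000/rest/v1/ || echo "Kong unreachable"
curl -s -o /dev/null -w "Auth health HTTP %{http_code}\n" http://localhost:8000/auth/v1/health || echo "Auth unreachable"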
Phase 1: Detection and Initial Assessment
The first five minutes of an incident set the tone for recovery. Here's a structured approach to quickly understand what's happening.
Immediate Health Check Script
Save this script and run it first when alerts fire:
#!/bin/bash
# supabase-health-check.sh
echo "=== Supabase Health Check ==="
echo "Timestamp: $(date -u)"
echo ""
# Check all containers
echo "Container Status:"
docker compose ps --format "table {{.Name}}\t{{.Status}}\t{{.Ports}}"
echo ""
# Check container resource usage
echo "Resource Usage:"
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"
echo ""
# Test PostgreSQL connection
echo "PostgreSQL Connection:"
docker compose exec -T db pg_isready -U postgres && echo "OK" || echo "FAILED"
echo ""
# Test API Gateway
echo "Kong API Gateway:"
curl -s -o /dev/null -w "%{http_code}" http://localhost:8000/rest/v1/ && echo " OK" || echo " FAILED"
echo ""
# Test Auth
echo "Auth Service:"
curl -s -o /dev/null -w "%{http_code}" http://localhost:8000/auth/v1/health && echo " OK" || echo " FAILED"
echo ""
# Check disk space
echo "Disk Space:"
df -h / | tail -1
echo ""
# Check recent logs for errors
echo "Recent Error Logs (last 5 minutes):"
docker compose logs --since 5m 2>&1 | grep -i "error\|fatal\|panic" | tail -10
Run this script immediately when you suspect an issue. The output tells you which services are affected and where to focus.
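Keeping a timestamped copy of each run also gives you raw material for the post-mortem timeline later. A small usage sketch (the log file name is just an example):
chmod +x supabase-health-check.sh
./supabase-health-check.sh | tee "health-$(date -u +%Y%m%dT%H%M%S).log"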
Severity Classification
Not all incidents are equal. Classify severity to determine your response intensity:
| Severity | Criteria | Response Time | Example |
|---|---|---|---|
| P1 - Critical | Complete service outage, data loss risk | Immediate, all hands | Database down, storage corruption |
| P2 - High | Major feature broken, many users affected | Within 15 minutes | Auth failures, API errors |
| P3 - Medium | Feature degraded, workaround exists | Within 1 hour | Slow queries, Realtime disconnects |
| P4 - Low | Minor issue, limited impact | Next business day | Dashboard glitch, non-critical logs |
For P1 and P2 incidents, proceed immediately to Phase 2.
Phase 2: Containment and Communication
Once you understand the scope, communicate clearly and prevent further damage.
Communication Template
Use this template for stakeholder updates. Adapt it for Slack, email, or your status page:
INCIDENT: [Brief description]
SEVERITY: P[1-4]
STATUS: [Investigating | Identified | Monitoring | Resolved]
STARTED: [UTC timestamp]
IMPACT: [What's broken and who's affected]
CURRENT ACTIONS:
- [What you're doing right now]
NEXT UPDATE: [Time for next update, typically 15-30 min for P1/P2]
Example:
INCIDENT: Database connection failures
SEVERITY: P1
STATUS: Investigating
STARTED: 2026-04-30 03:15 UTC
IMPACT: All API requests failing. ~500 active users affected.
CURRENT ACTIONS:
- Checking PostgreSQL container status
- Reviewing recent deployment changes
NEXT UPDATE: 03:30 UTC
Containment Decisions
Sometimes you need to deliberately limit damage while you investigate:
Enable maintenance mode (if your app supports it):
# Example: return 503 to all requests while you investigate.
# Self-hosted Supabase runs Kong in DB-less mode, so add a request-termination
# rule to the declarative config it loads (e.g. volumes/api/kong.yml), then restart
docker compose restart kong
Isolate the database (if you suspect data corruption):
# Stop all services except the database
docker compose stop rest auth realtime storage
# Now investigate the database safely
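With the other services stopped, one low-risk way to surface corruption is a read-only pass over the data. A minimal sketch; it only reads and can take a while on large databases:
# Dump everything to /dev/null; errors here point at unreadable or corrupted relations
docker compose exec -T db pg_dumpall -U postgres > /dev/null && echo "read pass OK"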
Scale back connections (if connection exhaustion is suspected):
# Check current connections
docker compose exec db psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"
# Terminate idle connections
docker compose exec db psql -U postgres -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '5 minutes';"
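Before terminating anything, it helps to know where the connections are coming from. A quick read-only grouping query:
# Group connections by user, client, and state to spot the service that is leaking them
docker compose exec db psql -U postgres -c "
  SELECT usename, application_name, state, count(*)
  FROM pg_stat_activity
  GROUP BY usename, application_name, state
  ORDER BY count(*) DESC;"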
Phase 3: Service-Specific Recovery Procedures
Here are step-by-step recovery procedures for the most common failure scenarios.
PostgreSQL Recovery
PostgreSQL failures are the most critical since all other services depend on it.
Symptom: Database unreachable, container restarting
Step 1: Check container logs
docker compose logs db --tail 100
Step 2: Common causes and fixes
Out of disk space:
# Check disk usage
df -h /var/lib/docker
# If full, clean up Docker
docker system prune -af
# Check WAL files
docker compose exec db du -sh /var/lib/postgresql/data/pg_wal
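If the database volume itself is what filled the disk, a quick look at the biggest relations tells you whether the space went to tables, indexes, or WAL. A read-only sketch:
# Largest relations by total size (tables, indexes, materialized views)
docker compose exec db psql -U postgres -c "
  SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS total_size
  FROM pg_class
  WHERE relkind IN ('r', 'i', 'm')
  ORDER BY pg_total_relation_size(oid) DESC
  LIMIT 10;"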
Corrupted shared memory:
# Restart with clean state
docker compose down
docker compose up -d db
# Wait for recovery
docker compose logs -f db
Max connections reached:
# Increase max_connections in postgresql.conf or restart Supavisor
docker compose restart supavisor
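If restarting the pooler isn't enough and you need to raise the limit itself, a minimal sketch follows; the value 200 is only an example, so size it to your memory and workload:
# Check how close you are to the limit
docker compose exec db psql -U postgres -c "SHOW max_connections;"
docker compose exec db psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"
# Raise it persistently; this only takes effect after a restart
docker compose exec db psql -U postgres -c "ALTER SYSTEM SET max_connections = 200;"
docker compose restart db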
Step 3: Verify recovery
docker compose exec db pg_isready -U postgres
docker compose exec db psql -U postgres -c "SELECT 1;"
Auth (GoTrue) Recovery
Symptom: Login failures, JWT validation errors, magic links not sending
Step 1: Check service health
docker compose logs auth --tail 50
curl http://localhost:8000/auth/v1/health
Step 2: Common causes
JWT secret mismatch:
# Verify JWT_SECRET matches across services
grep JWT_SECRET .env
# Compare with what PostgREST is using
docker compose exec rest env | grep JWT
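To compare the value without printing the secret to your terminal, hashing it works. A sketch that assumes the containers expose the secret as a *_JWT_SECRET environment variable (GOTRUE_JWT_SECRET, PGRST_JWT_SECRET, and so on):
# Hash the secret value from each container; the hashes must match
for svc in auth rest; do
  echo -n "$svc: "
  docker compose exec -T "$svc" printenv | grep 'JWT_SECRET=' | cut -d= -f2- | sha256sum
done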
SMTP configuration issues (magic links failing):
# Trigger a test invite email (requires a service-role JWT in the Authorization header)
docker compose exec auth wget -qO- \
  --header='Authorization: Bearer <service-role-key>' \
  --header='Content-Type: application/json' \
  --post-data='{"email":"test@example.com"}' \
  http://localhost:9999/invite
# Check SMTP env vars
docker compose exec auth env | grep SMTP
Step 3: Restart sequence
# Auth depends on the database; restart both if needed
docker compose restart db
sleep 10
docker compose restart auth
Storage Recovery
Symptom: File uploads failing, existing files inaccessible
Step 1: Check storage backend connectivity
# For S3-compatible backends (MinIO, R2, etc.)
docker compose logs storage --tail 50
# Test backend connectivity (MinIO's liveness endpoint; adjust for other backends)
docker compose exec storage curl -I http://minio:9000/minio/health/live
Step 2: Common fixes
MinIO/S3 credentials expired or changed:
# Verify credentials match
grep -E "STORAGE_S3|AWS" .env
docker compose exec storage env | grep -E "S3|AWS"
Bucket permissions:
# Check bucket exists and is accessible
docker compose exec minio mc ls local/supabase
Step 3: Restart storage chain
docker compose restart storage imgproxy
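After the restart, an API-level check confirms the storage service can reach both the database and its backend. A sketch assuming your service role key is exported as SERVICE_KEY:
# List buckets through the Storage API; a JSON array back means the chain is healthy
curl -s http://localhost:8000/storage/v1/bucket \
  -H "apikey: $SERVICE_KEY" \
  -H "Authorization: Bearer $SERVICE_KEY"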
Phase 4: Verification and Monitoring
Never declare victory too early. Verify the fix holds under load.
Post-Recovery Verification Checklist
# 1. All containers healthy
docker compose ps | grep -v "Up" # Should return only the header line
# 2. Database accepts connections
docker compose exec db psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"
# 3. API responds correctly
curl -H "apikey: $ANON_KEY" http://localhost:8000/rest/v1/
# 4. Auth works
curl -X POST "http://localhost:8000/auth/v1/token?grant_type=password" \
  -H "apikey: $ANON_KEY" \
  -H "Content-Type: application/json" \
  -d '{"email":"test@example.com","password":"testpass"}'
# 5. Realtime connects
# Test with your application or wscat
# 6. Storage uploads work
curl -X POST http://localhost:8000/storage/v1/object/test-bucket/test.txt \
  -H "apikey: $SERVICE_KEY" \
  -H "Authorization: Bearer $SERVICE_KEY" \
  -H "Content-Type: text/plain" \
  -d "test content"
Extended Monitoring Period
After any P1 or P2 incident, increase monitoring for 24 hours:
- Set up temporary alerts with lower thresholds
- Check logs every few hours for recurring errors (a simple loop for this is sketched below)
- Monitor resource usage trends
- Verify backup jobs completed successfully after the incident
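One lightweight way to handle the recurring log check is a loop that samples error-level lines at an interval and appends them to a file you can review later. A minimal sketch; the interval, file name, and patterns are only examples:
# Sample error-level log lines every 15 minutes during the extended monitoring window
while true; do
  echo "=== $(date -u) ===" >> post-incident-watch.log
  docker compose logs --since 15m 2>&1 | grep -iE "error|fatal|panic" >> post-incident-watch.log
  sleep 900
done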
Phase 5: Post-Incident Review
Every incident is a learning opportunity. Within 48 hours, conduct a blameless post-mortem.
Post-Mortem Template
## Incident Post-Mortem: [Title]

### Summary
- **Duration**: [Start time] to [End time] ([X] hours)
- **Severity**: P[X]
- **Impact**: [Users affected, data lost, revenue impact]

### Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM | First alert triggered |
| HH:MM | Engineer acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service restored |

### Root Cause
[Detailed technical explanation]

### What Went Well
- [Detection was fast because...]
- [Recovery was smooth because...]

### What Could Be Improved
- [Detection could have been faster if...]
- [Recovery would have been easier if...]

### Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| [Specific improvement] | [Name] | [Date] |

### Lessons Learned
[Key takeaways for the team]
Common Action Items to Consider
Based on incident patterns, these improvements often emerge from post-mortems:
- Better monitoring: Set up alerting and notifications before the next incident
- Backup verification: Test your restore procedures regularly
- Documentation updates: Update this runbook with lessons learned
- Capacity planning: If resource exhaustion caused the incident, revisit sizing for CPU, memory, disk, and database connections
Simplifying Incident Response with Supascale
Building and maintaining incident response procedures is one of the hidden costs of self-hosting. Supascale helps reduce this operational burden with:
- Health monitoring dashboard: See the status of all Supabase services at a glance
- One-click restores: If something goes catastrophically wrong, restore from any backup point with a single click
- Service management UI: Restart individual services without SSH access
- Centralized logging: Review logs across all services in one place
When you're responding to an incident at 2 AM, having a management interface that shows exactly what's wrong—and lets you fix it quickly—makes the difference between a 15-minute recovery and a 3-hour nightmare.
Check our features to see how Supascale simplifies self-hosted Supabase operations, or review our pricing to see if it fits your needs.
Keep This Runbook Updated
A runbook is only useful if it reflects your actual environment. Review and update this document:
- After every incident (add lessons learned)
- When you change your infrastructure
- Quarterly, even if no incidents occurred
- When team members change
Print a copy. Bookmark it. Make sure everyone on your team knows where to find it.
Further Reading
- Troubleshooting Self-Hosted Supabase - Detailed fixes for specific issues
- Monitoring Self-Hosted Supabase - Set up observability before incidents happen
- High Availability for Self-Hosted Supabase - Prevent incidents through redundancy
- Supascale Documentation - Get started with simplified Supabase management
