Rollback & Incidents¶
When a deploy goes wrong, recovery is almost always "ship a known-good image tag again." This runbook covers how to roll back in each environment, what the deploy script already does on a failed health check, manual recovery over the tunnel, migration and disk-full caveats, and a short incident checklist.
Rollback = redeploy a previous tag
Because every image is published with an immutable tag (commit SHA for test, version tag for production), rolling back means pointing the server at an earlier tag and bringing the stack back up. There is no separate "undo" — you redeploy.
Rolling back¶
Production¶
Prefer re-running the pipeline on the previous version tag so all the gates and the standard deploy flow run:
- Go to Actions → Build and Deploy to Production, choose Run workflow, and select the previous good tag (e.g. v1.3.9); or
- Re-push the tag locally if it was deleted, then let the workflow run.
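If it is not obvious which tag preceded the broken release, the tag list can be sorted by version. The following is a minimal sketch, assuming tags follow the vMAJOR.MINOR.PATCH form shown above and that `sort -V` (the GNU coreutils version sort) is available. `previous_tag` is a hypothetical helper name, not something the pipeline provides.

```bash
# previous_tag <current-tag>
# Reads tag names on stdin and prints the version tag that sorts directly
# below <current-tag>. Prints nothing if <current-tag> is not in the list
# or has no predecessor. Lines that are not vX.Y.Z tags are ignored.
previous_tag() {
  grep -E '^v[0-9]+\.[0-9]+\.[0-9]+$' |
    sort -V |
    awk -v cur="$1" '$0 == cur { if (prev != "") print prev; exit } { prev = $0 }'
}
```

A typical call would be `git tag --list 'v*' | previous_tag v1.4.0`. The result is only the previous tag by version number; confirm it is the release that was actually running and healthy before treating it as known-good.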
If you need to recover immediately and the previous image is already in GHCR, do it on the server:
ssh ubuntu@<prod-host> # via the Cloudflare Tunnel SSH config
cd /opt/app-name-prod
IMAGE_TAG=v1.3.9 docker compose pull --policy always
IMAGE_TAG=v1.3.9 docker compose up -d --wait --wait-timeout 60
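Typing the tag by hand during an incident is error-prone, so the two commands above could be wrapped in a small function that refuses anything that does not look like a release tag or commit SHA. This is a sketch only: `is_valid_tag` and `rollback_to_tag` are hypothetical names, and the accepted tag patterns are an assumption based on the tag styles described earlier.

```bash
# Hypothetical helpers, not part of the repository's deploy tooling.

# Succeeds only for version tags such as v1.3.9 or lowercase hex commit
# SHAs of 7 to 40 characters (assumed tag formats).
is_valid_tag() {
  printf '%s\n' "$1" |
    grep -Eq '^(v[0-9]+\.[0-9]+\.[0-9]+|[0-9a-f]{7,40})$'
}

# rollback_to_tag <stack-dir> <tag>
# Pulls and starts the given tag from the stack directory. The body runs
# in a subshell, so the caller's working directory is left unchanged.
rollback_to_tag() (
  if ! is_valid_tag "$2"; then
    echo "refusing to deploy unexpected tag: $2" >&2
    exit 2
  fi
  cd "$1" || exit 1
  IMAGE_TAG="$2" docker compose pull --policy always &&
    IMAGE_TAG="$2" docker compose up -d --wait --wait-timeout 60
)
```

For example, `rollback_to_tag /opt/app-name-prod v1.3.9` runs the same pull and up sequence as above, while `rollback_to_tag /opt/app-name-prod latest` exits with status 2 before touching the stack.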
Test¶
Redeploy a previous commit SHA — either re-run the test workflow on the older commit, or on the server:
ssh ubuntu@<test-host>
cd /opt/app-name
IMAGE_TAG=<previous-sha> docker compose pull --policy always
IMAGE_TAG=<previous-sha> docker compose up -d --wait --wait-timeout 60
Manual server rollback skips the gates
Setting IMAGE_TAG by hand bypasses SonarCloud, the E2E gate, and Trivy. Use it only to stop the bleeding; follow up with a proper pipeline run once the incident is contained.
What the deploy script already does on failure¶
The remote deploy.sh waits for containers to become healthy and fails loudly if they do not:
if ! docker compose up -d --wait --wait-timeout 60; then
  echo "❌ Docker Compose Up failed or timed out! App logs:"
  docker compose logs app | tail -n 50
  exit 1
fi
So on a failed health check the pipeline step turns red, the last 50 lines of app logs are printed in the GitHub Actions log, and the job exits non-zero. The previous containers were already stopped (docker compose down) earlier in the script, so a failed deploy generally leaves the stack down, not silently serving stale traffic — that is why a prompt rollback matters.
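Because a failed deploy leaves the stack down, one option is to wrap the deploy step so that it falls back to the last known-good tag by itself. The sketch below is not what deploy.sh currently does; it assumes the compose file reads `IMAGE_TAG` as in the manual commands above, and `deploy_tag` and `deploy_or_rollback` are hypothetical names.

```bash
# deploy_tag <tag>: pull the image for <tag> and wait for healthy containers.
deploy_tag() {
  IMAGE_TAG="$1" docker compose pull --policy always &&
    IMAGE_TAG="$1" docker compose up -d --wait --wait-timeout 60
}

# deploy_or_rollback <new-tag> <previous-good-tag>
# Returns 0 if <new-tag> came up healthy, 1 if it failed but the previous
# tag was restored, and 2 if both attempts failed.
deploy_or_rollback() {
  local new_tag="$1" prev_tag="$2"
  if deploy_tag "$new_tag"; then
    echo "deployed $new_tag"
    return 0
  fi
  echo "deploy of $new_tag failed; last app logs:" >&2
  docker compose logs app | tail -n 50 >&2
  if deploy_tag "$prev_tag"; then
    echo "rolled back to $prev_tag"
    return 1  # the requested deploy still failed, so keep the job red
  fi
  echo "rollback to $prev_tag also failed" >&2
  return 2
}
```

Returning non-zero even after a successful fallback keeps the pipeline step red, so the failure is still visible in GitHub Actions while the site stays up.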
Manual recovery over the tunnel¶
SSH in using the Cloudflare Tunnel host alias (see Cloudflare Tunnel), then inspect and restart:
ssh ubuntu@<host>
cd /opt/app-name # or /opt/app-name-prod
docker compose ps # what is up / unhealthy?
docker compose logs app | tail -n 100 # recent app errors
docker compose logs nginx | tail -n 50 # proxy errors
# Restart a single service or the whole stack
docker compose restart app
docker compose up -d --wait --wait-timeout 60
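Restarting services discards the state you may want for the post-incident review, so it can help to snapshot the output of the inspection commands first. A minimal sketch, assuming the service names `app` and `nginx` used above; `capture_evidence` and the `incident-<timestamp>` directory name are hypothetical.

```bash
# capture_evidence [output-dir]
# Saves `docker compose ps` and the last 200 log lines of each service
# into a directory, then prints the directory path.
capture_evidence() {
  local out_dir="${1:-incident-$(date -u +%Y%m%dT%H%M%SZ)}"
  local svc
  mkdir -p "$out_dir" || return 1
  docker compose ps > "$out_dir/ps.txt" 2>&1
  for svc in app nginx; do
    docker compose logs "$svc" 2>&1 | tail -n 200 > "$out_dir/$svc.log"
  done
  echo "$out_dir"
}
```

Run it from the stack directory before `docker compose restart`, for example `capture_evidence /tmp/incident-prod`, and attach the resulting files to the incident record.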
Database migration caveats¶
Migrations are not automatically reversed
The deploy script runs docker compose run --rm app alembic upgrade head on every deploy, so the schema only ever moves forward. Rolling the image back to an older tag does not roll the schema back. If a migration is incompatible with the older code, the rolled-back app may fail to start.
Before rolling back across a migration boundary:
- Check whether the migration is backward-compatible with the previous image.
- If it is not, you may need to restore the database. See Migrations for the migration model and Dump & Load for restoring from a snapshot.
- Take a fresh dump before attempting any destructive recovery.
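One rough way to tell whether a rollback crosses a migration boundary is to compare the Alembic head revision shipped in each image. The sketch below rests on several assumptions: both images are already pulled, the compose file selects the image via `IMAGE_TAG`, `alembic heads` prints the revision ID as the first token on stdout, and the project has a single head. `image_head` and `crosses_migration_boundary` are hypothetical names.

```bash
# image_head <tag>: print the Alembic head revision contained in the image.
# --no-deps avoids starting the database, since `alembic heads` only reads
# the migration scripts inside the image.
image_head() {
  IMAGE_TAG="$1" docker compose run --rm --no-deps app alembic heads 2>/dev/null |
    awk 'NF { print $1; exit }'
}

# crosses_migration_boundary <current-tag> <rollback-tag>
# Returns 0 if the two images ship different head revisions, 1 if they
# match, and 2 if either head could not be read.
crosses_migration_boundary() {
  local new_head old_head
  new_head=$(image_head "$1")
  old_head=$(image_head "$2")
  if [ -z "$new_head" ] || [ -z "$old_head" ]; then
    echo "could not read alembic heads; check manually" >&2
    return 2
  fi
  [ "$new_head" != "$old_head" ]
}
```

A return of 0 from `crosses_migration_boundary v1.4.0 v1.3.9` only says that migrations were added between the two tags. Whether those migrations are backward-compatible still has to be judged by reading them.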
Disk-full recovery¶
The deploy script's 80% guard runs docker system prune -af --volumes automatically, but if a box is already wedged you can do it by hand:
df -h / # confirm the disk is the problem
docker system prune -af --volumes # reclaim all unused images + volumes
docker image prune -af # only needed if -a was omitted above; system prune -af already removes all unused images
df -h /
--volumes deletes unused volumes
This removes any volume that is not referenced by at least one container, running or stopped. Because docker compose down removes the containers, that can include database data if the DB container has been taken down. Verify your data is on an in-use or backed-up volume before running it. See Docker & GHCR.
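For reference, a disk guard of the kind mentioned above can be sketched in a few lines. This is an illustration under stated assumptions, not the deploy script's actual code: it reads the usage percentage from `df -P`, the function names are hypothetical, and it deliberately leaves out `--volumes` so that it cannot delete data.

```bash
# disk_usage_percent <path>: print the used percentage of the filesystem
# holding <path> as a bare integer (for example 85).
disk_usage_percent() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# prune_if_above <path> <threshold-percent>
# Prunes unused containers, networks, images, and build cache when usage
# is at or above the threshold. Volumes are never touched.
prune_if_above() {
  local used
  used=$(disk_usage_percent "$1")
  if [ "$used" -ge "$2" ]; then
    echo "disk at ${used}% (threshold $2%), pruning"
    docker system prune -af
  else
    echo "disk at ${used}%, below threshold $2%"
  fi
}
```

Calling `prune_if_above / 80` mirrors an 80% threshold. Adding `--volumes` should stay a manual decision taken after the data-safety check described above.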
Incident checklist¶
- Confirm scope: which environment, and is the site actually down? docker compose ps + a curl /api/health.
- Capture evidence: copy the failing GitHub Actions log and docker compose logs app | tail -n 200 before changing anything.
- Stabilise: roll back to the last known-good tag (pipeline re-run, or IMAGE_TAG=<good> docker compose up -d --wait).
- Check the database: did a migration run? If schema changed, decide whether a restore is needed before more deploys.
- Free disk if needed: docker system prune -af (and --volumes only after confirming data safety).
- Verify recovery: docker compose ps all healthy, health endpoint 200.
- Follow up: fix forward through the normal pipeline so the gates run; record what happened.
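The "verify recovery" step can be scripted as a small polling loop against the health endpoint. A sketch, assuming the `/api/health` path from the checklist; the host in the usage example, the retry defaults, and the name `wait_for_health` are assumptions.

```bash
# wait_for_health <url> [attempts] [delay-seconds]
# Polls <url> until curl gets a non-error HTTP response or the attempts
# run out. Returns 0 on success and 1 if the endpoint never recovered.
wait_for_health() {
  local url="$1" attempts="${2:-10}" delay="${3:-3}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP 4xx/5xx responses.
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "still unhealthy after $attempts attempts: $url" >&2
  return 1
}
```

For example, `wait_for_health http://localhost/api/health 20 3` waits up to about a minute. Note that `curl -f` accepts any response below 400, so it is slightly looser than a strict check for status 200.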
Related: Deploy to Test · Deploy to Production · Docker & GHCR.