Rollback & Incidents

When a deploy goes wrong, recovery is almost always "ship a known-good image tag again." This runbook covers how to roll back in each environment, what the deploy script already does on a failed health check, manual recovery over the tunnel, migration and disk-full caveats, and a short incident checklist.

Rollback = redeploy a previous tag

Because every image is published with an immutable tag (commit SHA for test, version tag for production), rolling back means pointing the server at an earlier tag and bringing the stack back up. There is no separate "undo" — you redeploy.

Rolling back

Production

Prefer re-running the pipeline on the previous version tag so all the gates and the standard deploy flow run:

  1. Go to Actions → Build and Deploy to Production, choose Run workflow, and select the previous good tag (e.g. v1.3.9); or
  2. If the tag was deleted from the remote, re-push it from your local clone, then let the workflow run.

If you need to recover immediately and the previous image is already in GHCR, do it on the server:

ssh ubuntu@<prod-host>            # via the Cloudflare Tunnel SSH config
cd /opt/app-name-prod
IMAGE_TAG=v1.3.9 docker compose pull --policy always
IMAGE_TAG=v1.3.9 docker compose up -d --wait --wait-timeout 60

Test

Redeploy a previous commit SHA — either re-run the test workflow on the older commit, or on the server:

ssh ubuntu@<test-host>
cd /opt/app-name
IMAGE_TAG=<previous-sha> docker compose pull --policy always
IMAGE_TAG=<previous-sha> docker compose up -d --wait --wait-timeout 60

Manual server rollback skips the gates

Setting IMAGE_TAG by hand bypasses SonarCloud, the E2E gate, and Trivy. Use it only to stop the bleeding; follow up with a proper pipeline run once the incident is contained.
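
If you do roll back by hand, the two commands above can be wrapped in a small helper that validates the tag before anything is pulled. This is a minimal sketch, not part of the deploy tooling: the function name rollback_to and the accepted tag formats (vX.Y.Z, or a 7 to 40 character hex commit SHA) are assumptions.

```shell
# Hypothetical helper: roll a compose stack back to a known-good image tag.
# Usage: rollback_to /opt/app-name-prod v1.3.9
rollback_to() {
  stack_dir=$1
  tag=$2

  # Assumed tag formats: version tags like v1.3.9, or a 7-40 character hex commit SHA.
  if ! printf '%s\n' "$tag" | grep -Eq '^(v[0-9]+\.[0-9]+\.[0-9]+|[0-9a-f]{7,40})$'; then
    echo "refusing to deploy unexpected tag: $tag" >&2
    return 2
  fi

  # Run in a subshell so the caller's working directory is left untouched.
  (
    cd "$stack_dir" || exit 1
    IMAGE_TAG=$tag docker compose pull --policy always || exit 1
    IMAGE_TAG=$tag docker compose up -d --wait --wait-timeout 60
  )
}
```

The helper returns non-zero if the tag looks wrong, the pull fails, or the containers do not become healthy within the timeout, so it can be chained with && in an ad hoc session.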

What the deploy script already does on failure

The remote deploy.sh waits for containers to become healthy and fails loudly if they do not:

if ! docker compose up -d --wait --wait-timeout 60; then
  echo "❌ Docker Compose Up failed or timed out! App logs:"
  docker compose logs app | tail -n 50
  exit 1
fi

So on a failed health check the pipeline step turns red, the last 50 lines of app logs are printed in the GitHub Actions log, and the job exits non-zero. The previous containers were already stopped (docker compose down) earlier in the script, so a failed deploy generally leaves the stack down, not silently serving stale traffic — that is why a prompt rollback matters.
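
To confirm on the server that a failed deploy really did leave the stack down, you can ask Compose for running containers only. A minimal sketch; the function name stack_is_down is hypothetical, and it assumes you run it from the stack directory.

```shell
# Hypothetical check: succeeds (exit 0) when no container in the current
# compose project is running, i.e. the stack is down.
# Exit 1 means something is still running; exit 2 means docker itself failed.
stack_is_down() {
  running=$(docker compose ps --status running -q) || return 2
  [ -z "$running" ]
}
```

For example, stack_is_down && echo "stack is down, roll back now" prints the message only when nothing is running.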

Manual recovery over the tunnel

SSH in using the Cloudflare Tunnel host alias (see Cloudflare Tunnel), then inspect and restart:

ssh ubuntu@<host>
cd /opt/app-name           # or /opt/app-name-prod

docker compose ps                       # what is up / unhealthy?
docker compose logs app | tail -n 100   # recent app errors
docker compose logs nginx | tail -n 50  # proxy errors

# Restart a single service or the whole stack
docker compose restart app
docker compose up -d --wait --wait-timeout 60

Database migration caveats

Migrations are not automatically reversed

On every deploy, the deploy script migrates the schema forward by running docker compose run --rm app alembic upgrade head. Rolling the image back to an older tag does not roll the schema back. If a migration is incompatible with the older code, the rolled-back app may fail to start.

Before rolling back across a migration boundary:

  • Check whether the migration is backward-compatible with the previous image.
  • If it is not, you may need to restore the database. See Migrations for the migration model and Dump & Load for restoring from a snapshot.
  • Take a fresh dump before attempting any destructive recovery.
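
One way to spot a migration boundary is to compare the revision stored in the database with the newest revision shipped in the older image. The sketch below assumes the older image can reach the database through the same compose file and that revision ids are plain hex strings (Alembic's default); the function names are hypothetical. A match only tells you the schema is at the revision the old code expects. A mismatch does not by itself tell you whether the newer schema is backward-compatible.

```shell
# Print the first revision id from alembic output such as "ae1027a6acf0 (head)".
# Lines that do not start with a hex revision id (for example log lines) are skipped.
first_revision() {
  awk '$1 ~ /^[0-9a-f]+$/ { print $1; exit }'
}

# Hypothetical check: is the database at the revision the older image expects?
# Usage: schema_matches_tag v1.3.9
schema_matches_tag() {
  old_tag=$1
  # Revision currently applied to the database. If the database is ahead of the
  # older image, alembic may fail to resolve it; that also counts as a mismatch.
  db_rev=$(IMAGE_TAG=$old_tag docker compose run --rm app alembic current 2>/dev/null | first_revision)
  # Newest revision shipped inside the older image.
  code_rev=$(IMAGE_TAG=$old_tag docker compose run --rm app alembic heads 2>/dev/null | first_revision)

  if [ -n "$db_rev" ] && [ "$db_rev" = "$code_rev" ]; then
    echo "schema matches $old_tag ($db_rev)"
  else
    echo "schema (${db_rev:-unknown}) differs from $old_tag head (${code_rev:-unknown})" >&2
    return 1
  fi
}
```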

Disk-full recovery

The deploy script's 80% guard runs docker system prune -af --volumes automatically, but if a box is already wedged you can do it by hand:

df -h /                                  # confirm the disk is the problem
docker system prune -af --volumes        # reclaim stopped containers, unused images, build cache + unused volumes
docker image prune -af                   # usually redundant after the line above; harmless if still tight
df -h /
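
For reference, an 80% guard of the kind the deploy script uses could look roughly like this. It is a sketch, not the script's actual code: the function names, the use of df -P, and the "at or above" comparison are assumptions.

```shell
# Print the used-space percentage of a mount point as a bare integer.
disk_usage_percent() {
  # df -P prints one POSIX-format line per filesystem; column 5 is "Capacity" (e.g. "83%").
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical guard: prune only when usage is at or above the threshold (default 80).
prune_if_full() {
  threshold=${1:-80}
  used=$(disk_usage_percent /)
  if [ "$used" -ge "$threshold" ]; then
    echo "disk at ${used}%, pruning"
    docker system prune -af --volumes
  else
    echo "disk at ${used}%, below ${threshold}%, nothing to do"
  fi
}
```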

--volumes deletes unused volumes

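This removes volumes that no container references. Because the same prune deletes stopped containers first, a volume whose only container is stopped counts as unused, so database data can be lost if the DB container is down. Docker Engine 23.0 and later limit this prune to anonymous volumes, but do not rely on that without checking the engine version on the box. Verify your data is on an in-use or backed-up volume before running it. See Docker & GHCR.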
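
A quick pre-check is to list the volumes Docker already considers unused and fail if your data volume is among them. A minimal sketch; the function name and the example volume name pgdata are hypothetical, so substitute the real volume name from your compose file.

```shell
# Hypothetical pre-check: fail if a volume whose name matches the pattern is unused.
# Usage: assert_volume_in_use pgdata
assert_volume_in_use() {
  pattern=$1
  # dangling=true lists volumes that no container (running or stopped) references.
  unused=$(docker volume ls -q --filter dangling=true) || return 2
  if printf '%s\n' "$unused" | grep -q -- "$pattern"; then
    echo "volume matching '$pattern' is unused and would be eligible for pruning" >&2
    return 1
  fi
}
```

This check does not see volumes held only by stopped containers, which the prune can also release, so run docker compose ps -a as well and confirm the DB container is actually running.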

Incident checklist

  1. Confirm scope: which environment, and is the site actually down? Run docker compose ps and curl the /api/health endpoint.
  2. Capture evidence: copy the failing GitHub Actions log and the output of docker compose logs app | tail -n 200 before changing anything.
  3. Stabilise: roll back to the last known-good tag (pipeline re-run, or IMAGE_TAG=<good> docker compose up -d --wait).
  4. Check the database: did a migration run? If the schema changed, decide whether a restore is needed before more deploys.
  5. Free disk if needed: docker system prune -af (add --volumes only after confirming data safety).
  6. Verify recovery: docker compose ps shows all services healthy and the health endpoint returns 200.
  7. Follow up: fix forward through the normal pipeline so the gates run; record what happened.
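
The evidence-capture step is easy to skip under pressure, so it can help to have it as a single command. A minimal sketch, assuming it runs from the stack directory; the function name and the output directory layout are hypothetical.

```shell
# Hypothetical helper: save compose state and recent app logs before changing anything.
# Usage: capture_evidence [output-dir]   (prints the directory it wrote to)
capture_evidence() {
  out_dir=${1:-"./incident-$(date -u +%Y%m%dT%H%M%SZ)"}
  mkdir -p "$out_dir" || return 1
  # Include stopped containers so a stack that is down still shows up.
  docker compose ps -a > "$out_dir/compose-ps.txt" 2>&1
  docker compose logs --no-color --tail 200 app > "$out_dir/app.log" 2>&1
  echo "$out_dir"
}
```

Copy the resulting directory off the server before pruning or restoring anything, since both can remove the containers the logs came from.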

Related: Deploy to Test · Deploy to Production · Docker & GHCR.