Docker Retention Cleanup

Cessy uses retention labels to make Docker cleanup safe on shared hosts. Broad Docker prune commands can delete runtime or workspace state, so automated cleanup only removes objects that explicitly opt in as disposable CI or test artifacts.

Safety model

Cleanup may delete an object only when all retention labels agree that it is disposable:

ces.owner=ci|test
ces.environment=ci
ces.safe_to_prune=true
ces.preserve=false
ces.ttl_hours=<positive number>
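
The "all labels agree" rule can be sketched as a single predicate. This is a minimal illustration, not the actual Cessy implementation: the `DisposableLabels` type and the `isDisposable` function are hypothetical names, and the sketch assumes label values arrive as plain strings, as Docker reports them.

```typescript
// Hypothetical label map: Docker labels are string key/value pairs.
type DisposableLabels = Record<string, string | undefined>;

// Returns true only when every retention label marks the object as disposable.
// A missing or malformed label fails the check, so objects are protected by default.
function isDisposable(labels: DisposableLabels): boolean {
  const owner = labels["ces.owner"];
  const ttlHours = Number(labels["ces.ttl_hours"]);
  return (
    (owner === "ci" || owner === "test") &&
    labels["ces.environment"] === "ci" &&
    labels["ces.safe_to_prune"] === "true" &&
    labels["ces.preserve"] === "false" &&
    Number.isFinite(ttlHours) &&
    ttlHours > 0
  );
}

// Example: a CI image that opts in with a 24-hour TTL.
const ciImageLabels: DisposableLabels = {
  "ces.owner": "ci",
  "ces.environment": "ci",
  "ces.safe_to_prune": "true",
  "ces.preserve": "false",
  "ces.ttl_hours": "24",
};
console.log(isDisposable(ciImageLabels)); // true
console.log(isDisposable({ "ces.owner": "ci" })); // false: other labels missing
```

Passing this check only makes an object eligible. Whether it is actually deleted still depends on its TTL having expired.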

Runtime and workspace objects should remain visible in reports but be protected by default, using labels such as:

ces.owner=workspace|production|staging|dev
ces.environment=production|staging|dev|ci
ces.app_id=<uuid>
ces.workspace_id=<id>
ces.tenant_id=<id>
ces.preserve=true
ces.safe_to_prune=false

Unlabeled ces-ws-* volumes are hard-blocked: automated cleanup never deletes them. Treat any volume with that name prefix as a possible workspace volume unless its creation path has been audited and the volume is labeled.
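
A hedged sketch of the hard block follows. It assumes that "unlabeled" means the volume lacks an explicit ces.safe_to_prune=true label, which is an interpretation based on the report reason name. The function name `isHardBlockedWorkspaceVolume` is hypothetical.

```typescript
// Hypothetical label map for a Docker volume.
type VolumeLabels = Record<string, string | undefined>;

// Assumption: a volume whose name starts with "ces-ws-" is hard-blocked
// unless it explicitly carries ces.safe_to_prune=true.
function isHardBlockedWorkspaceVolume(name: string, labels: VolumeLabels): boolean {
  const looksLikeWorkspaceVolume = name.startsWith("ces-ws-");
  const explicitlySafe = labels["ces.safe_to_prune"] === "true";
  return looksLikeWorkspaceVolume && !explicitlySafe;
}

console.log(isHardBlockedWorkspaceVolume("ces-ws-1234", {})); // true
console.log(isHardBlockedWorkspaceVolume("ci-cache-1234", {})); // false
```

Not being hard-blocked does not make a volume deletable. It only means the usual label and TTL rules decide.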

Cleanup workflow

Always run a dry run first when you inspect or operate the cleanup manually:

$ npm run docker:retention:dry-run -- --json

Apply mode deletes only expired objects that pass the label and TTL rules:

$ npm run docker:retention:apply -- --json

Reports group objects by owner, environment, app id, and workspace id. Each object includes a reason such as ttl_expired, ttl_not_expired, preserve_true, or ces_workspace_volume_without_explicit_safe_to_prune.
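
The sketch below shows one way the per-object reason and the report grouping could be derived. It is an illustration under stated assumptions, not the real report code. The `ReportObject` shape, the function names, the order of the checks, the "unknown" placeholder, and the `labels_not_disposable` reason are all hypothetical. Only the other four reason strings come from this document.

```typescript
type ReportLabels = Record<string, string | undefined>;

// Hypothetical shape for an inspected Docker object.
interface ReportObject {
  name: string;
  isVolume: boolean;
  createdAtMs: number; // creation time in epoch milliseconds
  labels: ReportLabels;
}

const HOUR_MS = 60 * 60 * 1000;

// Assumed check order: preserve wins, then the ces-ws-* hard block,
// then label opt-in, then TTL.
function retentionReason(obj: ReportObject, nowMs: number): string {
  const l = obj.labels;
  if (l["ces.preserve"] === "true") return "preserve_true";
  if (obj.isVolume && obj.name.startsWith("ces-ws-") && l["ces.safe_to_prune"] !== "true") {
    return "ces_workspace_volume_without_explicit_safe_to_prune";
  }
  const ttlHours = Number(l["ces.ttl_hours"]);
  const optedIn =
    (l["ces.owner"] === "ci" || l["ces.owner"] === "test") &&
    l["ces.environment"] === "ci" &&
    l["ces.safe_to_prune"] === "true" &&
    l["ces.preserve"] === "false" &&
    Number.isFinite(ttlHours) &&
    ttlHours > 0;
  if (!optedIn) return "labels_not_disposable"; // hypothetical reason name
  const ageMs = nowMs - obj.createdAtMs;
  return ageMs >= ttlHours * HOUR_MS ? "ttl_expired" : "ttl_not_expired";
}

// Builds the grouping key: owner, environment, app id, workspace id.
function reportGroupKey(labels: ReportLabels): string {
  return ["ces.owner", "ces.environment", "ces.app_id", "ces.workspace_id"]
    .map((key) => labels[key] ?? "unknown")
    .join("/");
}
```

In this sketch only `ttl_expired` would make an object eligible for deletion in apply mode. Every other reason leaves it in place.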

Scheduled cleanup

The Docker retention cleanup GitHub workflow runs daily. It produces a dry-run report, applies label-gated cleanup on scheduled runs, and checks Docker disk usage after cleanup.

DOCKER_DISK_USAGE_ALERT_PERCENT controls the alert threshold and defaults to 80. When Docker disk usage remains at or above the threshold after cleanup, the workflow fails, so the failed (red) run itself serves as the alert.
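
The threshold check can be sketched as follows. The function names are hypothetical, and falling back to 80 for values that are not a valid percentage is an assumption. This document only states that 80 is the default.

```typescript
const DEFAULT_ALERT_PERCENT = 80;

// Parses the raw DOCKER_DISK_USAGE_ALERT_PERCENT value.
// Assumption: unset, empty, or out-of-range values fall back to the default.
function resolveAlertThreshold(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return DEFAULT_ALERT_PERCENT;
  const parsed = Number(raw);
  const valid = Number.isFinite(parsed) && parsed > 0 && parsed <= 100;
  return valid ? parsed : DEFAULT_ALERT_PERCENT;
}

// The run fails when usage after cleanup is at or above the threshold.
function shouldFailRun(usagePercentAfterCleanup: number, thresholdPercent: number): boolean {
  return usagePercentAfterCleanup >= thresholdPercent;
}

console.log(shouldFailRun(80, resolveAlertThreshold(undefined))); // true
console.log(shouldFailRun(79.9, resolveAlertThreshold(undefined))); // false
```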

Runner pool isolation

CI workflows can use repository variables to separate low-Docker jobs from Docker-heavy jobs:

CI_LOW_DOCKER_RUNNER_LABELS=["self-hosted","hetzner","low-docker"]
CI_DOCKER_RUNNER_LABELS=["self-hosted","hetzner","docker"]

Low-Docker jobs include typecheck, lint, static checks, affected unit tests, and post-merge unit coverage. Docker-heavy jobs include image builds, integration tests, and retention cleanup. When these variables are not configured, workflows fall back to the shared self-hosted Hetzner runner labels.
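
The fallback behavior can be illustrated with a small resolver. This is a sketch of the logic only. The `resolveRunnerLabels` name is hypothetical, and the shared label set is assumed to be ["self-hosted","hetzner"], which this document does not spell out.

```typescript
// Assumption: the shared fallback pool uses these two labels.
const SHARED_RUNNER_LABELS: string[] = ["self-hosted", "hetzner"];

// Resolves a repository variable holding a JSON array of runner labels.
// Falls back to the shared labels when the variable is unset or malformed.
function resolveRunnerLabels(rawVariable: string | undefined): string[] {
  if (!rawVariable) return [...SHARED_RUNNER_LABELS];
  try {
    const parsed: unknown = JSON.parse(rawVariable);
    if (
      Array.isArray(parsed) &&
      parsed.length > 0 &&
      parsed.every((label) => typeof label === "string")
    ) {
      return parsed as string[];
    }
  } catch {
    // Malformed JSON: fall through to the shared labels.
  }
  return [...SHARED_RUNNER_LABELS];
}

console.log(resolveRunnerLabels('["self-hosted","hetzner","docker"]'));
console.log(resolveRunnerLabels(undefined));
```

Treating malformed values the same as unset ones keeps jobs schedulable on the shared pool instead of failing to find a runner.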

When the alert fires

  1. Open the failed workflow run and inspect the dry-run and apply reports.
  2. Confirm whether large blocked objects are preserved runtime objects or unlabeled ces-ws-* volumes.
  3. Do not delete unlabeled workspace volumes from the host shell.
  4. Fix labels at the object creation path when an object is disposable.
  5. Re-run dry-run, then apply only after the safe deletion set is limited to disposable CI or test objects.
  6. If build cache dominates usage, schedule a maintenance window instead of pruning while image builds are active.