HF Space Recovery

Use this skill to recover OpenEnv Hub deployments quickly with minimal blast radius.

Execute This Workflow

1) Confirm release tuple

Use a single release tuple across all commands:

Namespace: openenv
Version: vX.Y.Z
Space suffix: -vX-Y-Z

Default to a version suffix and treat unsuffixed Spaces as legacy.

2) Snapshot runtime status

Collect all versioned spaces and isolate non-running ones:

hf spaces ls --author openenv --limit 500 --expand=runtime \
  | jq -r '.[] | select(.id|test("-v[0-9]+-[0-9]+-[0-9]+$")) \
    | [.id, .runtime.stage, (.runtime.raw.errorMessage // "")] | @tsv' \
  | sort

Treat RUNNING and SLEEPING as healthy. Triage everything else.

3) Classify and extract signal

RUNTIME_ERROR: read traceback from .runtime.raw.errorMessage.
BUILD_ERROR: read build error text from runtime info, then patch Dockerfile/deps.
APP_STARTING longer than 10 minutes: inspect event stream and metrics before changing code.

hf spaces info openenv/<space-id> --expand=runtime
curl -sS -m 10 https://huggingface.co/api/spaces/openenv/<space-id>/events | sed -n '1,140p'
curl -sS -m 10 -i https://huggingface.co/api/spaces/openenv/<space-id>/metrics | sed -n '1,120p'

Read references/troubleshooting.md for symptom-to-fix mappings.

4) Apply minimal fix and targeted redeploy

Prefer targeted redeploys over full-fleet pushes:

scripts/prepare_hf_deployment.sh \
  --hf-namespace openenv \
  --env <env_name> \
  --skip-collection

Use openenv CLI as a supplement, not a replacement, for release triage:

Validate env layout quickly (uv run openenv validate ...) when applicable.
Keep release deploys on scripts/prepare_hf_deployment.sh to preserve suffix/pinning behavior.

5) Unstick runtime when code is already good

If Space remains in APP_STARTING with no actionable error:

uv run --with huggingface_hub python - <<'PY'
from huggingface_hub import HfApi
api = HfApi()
api.restart_space("openenv/<space-id>", factory_reboot=True)
PY

If still stuck, force recreation as last resort:

hf repo delete openenv/<space-id> --repo-type space
scripts/prepare_hf_deployment.sh --hf-namespace openenv --env <env_name> --skip-collection

6) Verify and close

Verify both runtime stage and health endpoint:

hf spaces info openenv/<space-id> --expand=runtime
curl -sS -m 10 https://<space-subdomain>.hf.space/health

Then verify fleet-wide:

hf spaces ls --author openenv --limit 500 --expand=runtime \
  | jq -r '.[] | select(.id|test("-v[0-9]+-[0-9]+-[0-9]+$")) \
    | select(.runtime.stage!="RUNNING" and .runtime.stage!="SLEEPING") \
    | [.id, .runtime.stage] | @tsv' | sort

7) Reconcile collection

When targeted deploys are done, update collection membership for the same version:

python3 scripts/manage_hf_collection.py \
  --version vX.Y.Z \
  --collection-namespace openenv \
  --space-id openenv/<space-id>

Add one --space-id per redeployed space.