Skip to main content
AI/MLjeremylongshore

vastai-deploy-integration

'Deploy ML training jobs and inference services on Vast.ai GPU cloud.

Stars
2,267
Source
jeremylongshore/claude-code-plugins-plus-skills
Updated
2026-05-31
Slug
jeremylongshore--claude-code-plugins-plus-skills--vastai-deploy-integration
View on GitHubRaw SKILL.md

// install — copy + paste into any project

mkdir -p .claude/skills && curl -fsSL https://raw.githubusercontent.com/jeremylongshore/claude-code-plugins-plus-skills/HEAD/plugins/saas-packs/vastai-pack/skills/vastai-deploy-integration/SKILL.md -o .claude/skills/vastai-deploy-integration.md

Drops the SKILL.md into .claude/skills/vastai-deploy-integration.md. Works with Claude Code, Cursor, and any agent that loads SKILL.md files from .claude/skills/.

Vast.ai Deploy Integration

Overview

Deploy ML training jobs and inference services on Vast.ai GPU cloud. Covers Docker image optimization, automated provisioning scripts, data transfer strategies, and deployment automation.

Prerequisites

  • Vast.ai CLI authenticated
  • Docker image published to a registry
  • Training/inference code tested locally

Instructions

Step 1: Optimized Docker Image

# Dockerfile.vastai — optimized for fast pulls on Vast.ai
FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime

# Install dependencies in a single layer
COPY requirements.txt /tmp/
RUN pip install --no-cache-dir -r /tmp/requirements.txt && rm /tmp/requirements.txt

# Copy application code
COPY src/ /workspace/src/
COPY scripts/ /workspace/scripts/

WORKDIR /workspace
CMD ["python", "src/train.py"]
# Build and push
docker build -t ghcr.io/yourorg/training:v1 -f Dockerfile.vastai .
docker push ghcr.io/yourorg/training:v1

Step 2: Automated Deployment Script

#!/usr/bin/env python3
"""deploy.py — Automated Vast.ai deployment with monitoring."""
import subprocess, json, time, argparse, sys

def deploy(args):
    # Search for matching offer
    query = (f"num_gpus={args.gpus} gpu_name={args.gpu} "
             f"reliability>{args.reliability} dph_total<={args.max_price} "
             f"disk_space>={args.disk} rentable=true")

    offers = json.loads(subprocess.run(
        ["vastai", "search", "offers", query, "--order", "dph_total",
         "--raw", "--limit", "5"],
        capture_output=True, text=True, check=True).stdout)

    if not offers:
        print(f"ERROR: No offers matching: {query}", file=sys.stderr)
        sys.exit(1)

    offer = offers[0]
    print(f"Selected: {offer['gpu_name']} ${offer['dph_total']:.3f}/hr "
          f"(ID: {offer['id']})")

    # Create instance
    cmd = ["vastai", "create", "instance", str(offer["id"]),
           "--image", args.image, "--disk", str(args.disk)]
    if args.onstart:
        cmd.extend(["--onstart-cmd", args.onstart])

    result = json.loads(subprocess.run(
        cmd, capture_output=True, text=True, check=True).stdout)
    instance_id = result["new_contract"]
    print(f"Instance {instance_id} provisioning...")

    # Wait for running
    for _ in range(30):
        info = json.loads(subprocess.run(
            ["vastai", "show", "instance", str(instance_id), "--raw"],
            capture_output=True, text=True).stdout)
        if info.get("actual_status") == "running":
            print(f"READY: ssh -p {info['ssh_port']} root@{info['ssh_host']}")
            return instance_id, info
        time.sleep(10)

    raise TimeoutError("Instance did not start")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--gpu", default="RTX_4090")
    parser.add_argument("--gpus", type=int, default=1)
    parser.add_argument("--image", required=True)
    parser.add_argument("--disk", type=int, default=50)
    parser.add_argument("--max-price", type=float, default=0.50)
    parser.add_argument("--reliability", type=float, default=0.95)
    parser.add_argument("--onstart", default="")
    deploy(parser.parse_args())

Step 3: Data Transfer Strategies

# Small datasets (<5GB): SCP directly
scp -P $PORT ./data.tar.gz root@$HOST:/workspace/

# Large datasets (>5GB): Use rsync with compression
rsync -avz --progress -e "ssh -p $PORT" ./data/ root@$HOST:/workspace/data/

# Very large datasets: Pre-stage on cloud storage
ssh -p $PORT root@$HOST "wget -q https://storage.example.com/dataset.tar.gz -O /workspace/data.tar.gz"

Step 4: Health Check After Deploy

ssh -p $PORT -o StrictHostKeyChecking=no root@$HOST << 'CHECK'
echo "=== Deploy Health Check ==="
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
df -h /workspace | tail -1
echo "=== Ready ==="
CHECK

Output

  • Optimized Docker image for fast Vast.ai pulls
  • Automated deployment script with GPU/price selection
  • Data transfer patterns (SCP, rsync, cloud storage)
  • Post-deploy health check verification

Error Handling

Error Cause Solution
Docker pull timeout Image too large (>10GB) Use multi-stage builds; minimize image layers
Disk space exhausted Insufficient disk allocation Increase --disk parameter
SSH timeout after deploy Instance still loading image Wait longer or use smaller base image
CUDA version mismatch Image CUDA > host CUDA Filter offers by cuda_max_good

Resources

Next Steps

For event-driven workflows, see vastai-webhooks-events.

Examples

One-command deploy: python deploy.py --gpu A100 --image ghcr.io/org/train:v1 --max-price 2.00 --disk 100

Multi-GPU deploy: Set --gpus 4 and --gpu H100_SXM for distributed training with torchrun.