coreweave-performance-tuning

Optimize CoreWeave GPU inference latency and throughput.

Stars: 2,267
Source: jeremylongshore/claude-code-plugins-plus-skills
Updated: 2026-05-31
Slug: jeremylongshore--claude-code-plugins-plus-skills--coreweave-performance-tuning

// install — copy + paste into any project

mkdir -p .claude/skills && curl -fsSL https://raw.githubusercontent.com/jeremylongshore/claude-code-plugins-plus-skills/HEAD/plugins/saas-packs/coreweave-pack/skills/coreweave-performance-tuning/SKILL.md -o .claude/skills/coreweave-performance-tuning.md

Drops the SKILL.md into .claude/skills/coreweave-performance-tuning.md. Works with Claude Code, Cursor, and any agent that loads SKILL.md files from .claude/skills/.

CoreWeave Performance Tuning

GPU Selection by Workload

| Workload | Recommended GPU | Why |
| --- | --- | --- |
| LLM inference (7-13B) | A100 80GB | Good balance of memory and cost |
| LLM inference (70B+) | 8xH100 | NVLink for tensor parallelism |
| Image generation | L40 | Good for diffusion models |
| Training (large models) | 8xH100 SXM5 | Fastest interconnect |
| Batch processing | A100 40GB | Cost-effective |
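
As a rough sanity check on the table above, the weights-only memory arithmetic can be sketched as follows. The helper names and the 0.90 usable-memory fraction are illustrative assumptions, not part of the skill. KV cache and activation memory are ignored, so the GPU count is a lower bound rather than a recommendation.

```python
import math


def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weights-only memory in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    bytes_per_param=2 corresponds to FP16/BF16 weights.
    """
    return params_billion * bytes_per_param


def min_gpus_for_weights(
    params_billion: float,
    gpu_memory_gb: float,
    usable_fraction: float = 0.90,  # assumption: share of GPU memory given to the server
    bytes_per_param: int = 2,
) -> int:
    """Lower bound on GPUs needed just to hold the weights."""
    usable_gb = gpu_memory_gb * usable_fraction
    return math.ceil(weight_memory_gb(params_billion, bytes_per_param) / usable_gb)


if __name__ == "__main__":
    for size in (8, 13, 70):
        print(size, "B params:", weight_memory_gb(size), "GB weights,",
              "at least", min_gpus_for_weights(size, 80), "x 80GB GPU(s)")
```

In practice the tensor-parallel degree is usually rounded up to a size that divides the model's attention heads evenly, and extra memory is left for KV cache, which is one reason 70B-class models are typically served on more GPUs than this lower bound suggests.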

Inference Optimization

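# Continuous batching with vLLM (container spec fragment for a Deployment)
containers:
  - name: vllm
    # Assumption: the upstream vllm/vllm-openai image; pin a tag you have tested.
    image: vllm/vllm-openai:latest
    args:
      - "--model=meta-llama/Llama-3.1-8B-Instruct"
      # Token budget per scheduler step; without chunked prefill it must be >= --max-model-len
      - "--max-num-batched-tokens=8192"
      # Maximum number of sequences batched concurrently
      - "--max-num-seqs=256"
      # Fraction of GPU memory vLLM may use for weights, activations, and KV cache
      - "--gpu-memory-utilization=0.90"
      # Reuse KV cache across requests that share a prompt prefix
      - "--enable-prefix-caching"
      - "--dtype=float16"
    resources:
      limits:
        nvidia.com/gpu: 1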
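
To see what `--gpu-memory-utilization=0.90` leaves for the KV cache, a back-of-the-envelope estimate might look like the sketch below. The architecture numbers (32 layers, 8 KV heads, head dimension 128) are taken from the published Llama-3.1-8B config. The 16 GB weight size, the 80 GB GPU, and the function names are assumptions for illustration, and real servers also reserve memory for activations and CUDA graphs, so the true budget is lower.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_token_budget(gpu_memory_gb: float, gpu_memory_utilization: float,
                          weights_gb: float, per_token_bytes: int,
                          overhead_gb: float = 0.0) -> int:
    """Approximate number of tokens whose KV cache fits after loading the weights."""
    budget_gb = gpu_memory_gb * gpu_memory_utilization - weights_gb - overhead_gb
    return max(0, int(budget_gb * 1e9 // per_token_bytes))


if __name__ == "__main__":
    per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
    tokens = kv_cache_token_budget(80, 0.90, weights_gb=16.0, per_token_bytes=per_token)
    print(f"{per_token} bytes per token, roughly {tokens} cacheable tokens")
```

Dividing the token budget by the expected context length per request gives a feel for how many of the 256 sequences allowed by `--max-num-seqs` can actually be resident at once.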

Autoscaling Tuning

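# HPA based on GPU utilization
# Requires a custom metrics adapter (for example prometheus-adapter) that exposes the
# DCGM exporter's DCGM_FI_DEV_GPU_UTIL as a per-pod metric via custom.metrics.k8s.io.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          # Target average GPU utilization (percent) across pods
          averageValue: "70"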
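
The scaling decision this HPA makes follows the formula in the Kubernetes documentation, desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sketch below reproduces it with the manifest's bounds. It is a simplified model: the function name is hypothetical, the 10% tolerance is the documented default, and stabilization windows and scaling policies are left out.

```python
import math


def desired_replicas(current_replicas: int, current_avg: float, target_avg: float = 70.0,
                     min_replicas: int = 2, max_replicas: int = 10,
                     tolerance: float = 0.10) -> int:
    """Simplified HPA replica calculation for an AverageValue target."""
    ratio = current_avg / target_avg
    # Within the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the minReplicas/maxReplicas bounds from the manifest.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    print(desired_replicas(4, current_avg=90))   # scale up
    print(desired_replicas(4, current_avg=72))   # inside tolerance, no change
    print(desired_replicas(8, current_avg=100))  # capped at maxReplicas
```

With a target of 70, four replicas averaging 90% GPU utilization scale to six, while an average of 72% stays inside the tolerance band and triggers nothing.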

Performance Benchmarks

| Metric | A100-80GB | H100-80GB |
| --- | --- | --- |
| Llama-8B tokens/sec | ~2,000 | ~4,500 |
| Llama-70B tokens/sec | ~200 (4x) | ~500 (4x) |
| Cold start (vLLM) | 30-60s | 20-40s |
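
The throughput figures above can be turned into a first-pass replica count. This is only a sketch: the function name, the 10,000 tokens/sec target, and the 0.70 headroom factor (chosen to mirror the HPA utilization target) are assumptions, and the benchmark numbers are approximate, so measure your own workload before sizing a deployment.

```python
import math


def replicas_for_throughput(target_tokens_per_sec: float,
                            per_replica_tokens_per_sec: float,
                            headroom: float = 0.70) -> int:
    """Replicas needed if each one is run at `headroom` of its benchmarked throughput."""
    usable = per_replica_tokens_per_sec * headroom
    return math.ceil(target_tokens_per_sec / usable)


if __name__ == "__main__":
    # Approximate Llama-8B figures from the table above.
    for gpu, tps in (("A100-80GB", 2000), ("H100-80GB", 4500)):
        print(gpu, replicas_for_throughput(10_000, tps))
```

Cold start time matters here as well: with vLLM taking tens of seconds to become ready, minReplicas should cover the baseline load rather than relying on scale-up to absorb it.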

Resources

Next Steps

For cost optimization, see coreweave-cost-tuning.