CoreWeave Performance Tuning

GPU Selection by Workload

Workload	Recommended GPU	Why
LLM inference (7-13B)	A100 80GB	Good balance of memory and cost
LLM inference (70B+)	8xH100	NVLink for tensor parallelism
Image generation	L40	Good for diffusion models
Training (large models)	8xH100 SXM5	Fastest interconnect
Batch processing	A100 40GB	Cost-effective

Inference Optimization

# Continuous batching with vLLM
containers:
  - name: vllm
    args:
      - "--model=meta-llama/Llama-3.1-8B-Instruct"
      - "--max-num-batched-tokens=8192"
      - "--max-num-seqs=256"
      - "--gpu-memory-utilization=0.90"
      - "--enable-prefix-caching"
      - "--dtype=float16"

Autoscaling Tuning

# HPA based on GPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: "70"

Performance Benchmarks

Metric	A100-80GB	H100-80GB
Llama-8B tokens/sec	~2,000	~4,500
Llama-70B tokens/sec	~200 (4x)	~500 (4x)
Cold start (vLLM)	30-60s	20-40s

Resources

Next Steps

For cost optimization, see coreweave-cost-tuning.