
coreweave-observability

Set up GPU monitoring and observability for CoreWeave workloads.

Stars
2,267
Source
jeremylongshore/claude-code-plugins-plus-skills
Updated
2026-05-31
Slug
jeremylongshore--claude-code-plugins-plus-skills--coreweave-observability

// install — copy + paste into any project

mkdir -p .claude/skills && curl -fsSL https://raw.githubusercontent.com/jeremylongshore/claude-code-plugins-plus-skills/HEAD/plugins/saas-packs/coreweave-pack/skills/coreweave-observability/SKILL.md -o .claude/skills/coreweave-observability.md

Drops the SKILL.md into .claude/skills/coreweave-observability.md. Works with Claude Code, Cursor, and any agent that loads SKILL.md files from .claude/skills/.

CoreWeave Observability

Overview

CoreWeave runs GPU-intensive workloads on Kubernetes where hardware failures, memory exhaustion, and underutilization directly impact cost and reliability. Observability must cover DCGM GPU metrics, Kubernetes pod health, inference latency, and job completion rates. Proactive monitoring prevents wasted spend on idle GPUs and catches OOM conditions before they cascade.

Key Metrics

Metric | Type | Target | Alert Threshold
GPU utilization | Gauge | > 60% | < 20% for 30m
GPU memory usage | Gauge | < 85% | > 95% for 5m
Inference latency p99 | Histogram | < 200 ms | > 500 ms
Job completion rate | Counter | > 99% | < 95% per hour
Pod restart count | Counter | 0 | > 3 in 15m
Node GPU temperature | Gauge | < 80 °C | > 85 °C for 10m
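
Thresholds such as "< 20% for 30m" only fire when the condition holds continuously for the whole window, so a single low reading should not trigger them. The sketch below shows one way to check that over timestamped samples. The `Sample` type and the `sustainedBreach` helper are hypothetical names introduced for illustration; they are not part of any CoreWeave or DCGM API.

```typescript
// Hypothetical types for illustration; not part of any CoreWeave or DCGM API.
interface Sample {
  timestampMs: number; // Unix epoch milliseconds
  value: number;
}

/**
 * Returns true when the most recent run of consecutive breaching samples
 * started at least `windowMs` before `nowMs`. Any non-breaching sample
 * resets the run, which mirrors how a "for 30m" style threshold behaves.
 */
function sustainedBreach(
  samples: Sample[],
  breaches: (value: number) => boolean,
  windowMs: number,
  nowMs: number,
): boolean {
  const sorted = samples.slice().sort((a, b) => a.timestampMs - b.timestampMs);
  let breachStartMs: number | null = null;
  for (const s of sorted) {
    if (breaches(s.value)) {
      // Start of a new breaching run, or continuation of the current one.
      if (breachStartMs === null) breachStartMs = s.timestampMs;
    } else {
      breachStartMs = null; // condition cleared, so the timer restarts
    }
  }
  return breachStartMs !== null && nowMs - breachStartMs >= windowMs;
}

// Usage: GPU utilization sampled every 5 minutes, checked against "< 20% for 30m".
const MINUTE_MS = 60000;
const idleSamples: Sample[] = [0, 5, 10, 15, 20, 25, 30, 35, 40].map((m) => ({
  timestampMs: m * MINUTE_MS,
  value: 10,
}));
const gpuIdle = sustainedBreach(idleSamples, (v) => v < 20, 30 * MINUTE_MS, 40 * MINUTE_MS);
console.log(gpuIdle ? "alert: GPUs idle for 30m" : "ok");
```

In a Prometheus-based setup the same behavior usually comes from the `for` clause of an alerting rule, so a helper like this is mainly useful for custom checks or tests.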

Instrumentation

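// `metrics` is assumed to be your metrics client (for example a StatsD or OpenTelemetry wrapper)
// exposing record(name, value, tags) and increment(name, tags).
async function trackInference<T>(model: string, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  try {
    const result = await fn();
    metrics.record('coreweave.inference.latency', Date.now() - start, { model, status: 'ok' });
    metrics.increment('coreweave.inference.completed', { model });
    return result;
  } catch (err) {
    // Record latency for failures as well, so slow errors are visible in the histogram.
    metrics.record('coreweave.inference.latency', Date.now() - start, { model, status: 'error' });
    // `err` is `unknown` under strict TypeScript, so read `code` defensively.
    const code = (err as { code?: string })?.code ?? 'unknown';
    metrics.increment('coreweave.inference.errors', { model, error: code });
    throw err;
  }
}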
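
The instrumentation above relies on a `metrics` object with `record` and `increment` methods, which the skill does not define. For local testing, a minimal in-memory stand-in could look like the sketch below. The `InMemoryMetrics` class, its `count` and `percentile` helpers, and the example model name are all assumptions made for illustration; a production setup would use a real client such as a StatsD or OpenTelemetry SDK.

```typescript
type Tags = Record<string, string>;

// Hypothetical in-memory metrics client, intended for tests and local runs only.
class InMemoryMetrics {
  private readonly values: Record<string, number[]> = {};
  private readonly counters: Record<string, number> = {};

  // Series key = metric name plus sorted tags, e.g. "latency{model=m,status=ok}".
  private key(name: string, tags: Tags = {}): string {
    const parts = Object.keys(tags).sort().map((k) => `${k}=${tags[k]}`);
    return `${name}{${parts.join(",")}}`;
  }

  record(name: string, value: number, tags?: Tags): void {
    const k = this.key(name, tags);
    (this.values[k] = this.values[k] ?? []).push(value);
  }

  increment(name: string, tags?: Tags): void {
    const k = this.key(name, tags);
    this.counters[k] = (this.counters[k] ?? 0) + 1;
  }

  count(name: string, tags?: Tags): number {
    return this.counters[this.key(name, tags)] ?? 0;
  }

  // Nearest-rank percentile over the recorded values; undefined when there is no data.
  percentile(name: string, p: number, tags?: Tags): number | undefined {
    const series = this.values[this.key(name, tags)];
    if (!series || series.length === 0) return undefined;
    const sorted = series.slice().sort((a, b) => a - b);
    const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
    return sorted[rank - 1];
  }
}

// Usage: record a few latencies (milliseconds) and read back the p99.
const metrics = new InMemoryMetrics();
const okTags = { model: "example-model", status: "ok" };
[120, 180, 150, 640].forEach((ms) => metrics.record("coreweave.inference.latency", ms, okTags));
metrics.increment("coreweave.inference.completed", { model: "example-model" });
console.log(metrics.percentile("coreweave.inference.latency", 99, okTags));
```

With only a handful of samples the p99 is simply the maximum, which is why latency percentiles are normally computed by the metrics backend over a much larger window.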

Health Check Dashboard

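// Assumes queryPrometheus(promql) resolves to a single number (the first sample of the result).
async function coreweaveHealth(): Promise<Record<string, string>> {
  // DCGM_FI_DEV_GPU_UTIL is reported as a percentage (0-100).
  const gpu = await queryPrometheus('avg(DCGM_FI_DEV_GPU_UTIL)');
  // Fraction of framebuffer memory in use (0-1).
  const mem = await queryPrometheus('avg(DCGM_FI_DEV_FB_USED/(DCGM_FI_DEV_FB_USED+DCGM_FI_DEV_FB_FREE))');
  // sum() collapses every deployment in the namespace into one value.
  const pods = await queryPrometheus('sum(kube_deployment_status_replicas_available{namespace="inference"})');
  return {
    gpu_utilization: gpu > 20 ? 'healthy' : 'underutilized',
    gpu_memory: mem < 0.9 ? 'healthy' : 'critical',
    inference_pods: pods > 0 ? 'healthy' : 'down',
  };
}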
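
The health check calls a `queryPrometheus` helper that the skill does not define. One possible shape, assuming a standard Prometheus server queried through its instant-query endpoint (`GET /api/v1/query`), is sketched below. The `makeQueryPrometheus` factory, the `firstSampleValue` parser, and the injected `getJson` function are hypothetical names; only the endpoint path and the response layout come from the Prometheus HTTP API.

```typescript
// Shape of a Prometheus instant-query response for a vector result.
interface PromInstantResponse {
  status: "success" | "error";
  error?: string;
  data?: {
    resultType: string;
    result: Array<{ metric: Record<string, string>; value: [number, string] }>;
  };
}

// Extracts the first sample of a vector result as a number.
// Prometheus encodes sample values as strings, so they must be converted.
function firstSampleValue(body: PromInstantResponse): number {
  if (body.status !== "success" || !body.data) {
    throw new Error(`Prometheus query failed: ${body.error ?? "unknown error"}`);
  }
  if (body.data.resultType !== "vector") {
    throw new Error(`Expected a vector result, got ${body.data.resultType}`);
  }
  const first = body.data.result[0];
  // An empty result means no series matched; NaN makes every comparison false.
  return first ? Number(first.value[1]) : NaN;
}

// The HTTP call is injected so the sketch does not depend on a particular client.
// In Node 18+ one option is: (url) => fetch(url).then((r) => r.json())
type GetJson = (url: string) => Promise<unknown>;

function makeQueryPrometheus(baseUrl: string, getJson: GetJson) {
  return async function queryPrometheus(promql: string): Promise<number> {
    const url = `${baseUrl}/api/v1/query?query=${encodeURIComponent(promql)}`;
    return firstSampleValue((await getJson(url)) as PromInstantResponse);
  };
}
```

Because an empty result becomes `NaN`, a check such as `pods > 0` evaluates to false and reports `down` when the metric is missing entirely, which is usually the safer default.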

Alerting Rules

// Only DCGM_FI_DEV_GPU_UTIL is a raw exporter metric; the other names are placeholders
// for derived series. gpu_memory_pct is a 0-1 fraction, matching the health check above.
const alerts = [
  { metric: 'DCGM_FI_DEV_GPU_UTIL', condition: 'avg < 20', window: '30m', severity: 'warning' },
  { metric: 'gpu_memory_pct', condition: '> 0.95', window: '5m', severity: 'critical' },
  { metric: 'inference_latency_p99', condition: '> 500ms', window: '10m', severity: 'warning' },
  { metric: 'pod_restart_count', condition: '> 3', window: '15m', severity: 'critical' },
];
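
If these alerts are evaluated by Prometheus, each entry corresponds to an alerting rule with an `expr`, a `for` duration, and a `severity` label. The sketch below shows one way to express that mapping. The rule names, the `AlertDef` type, and the helper functions are hypothetical, and the histogram name `inference_latency_seconds_bucket` is an assumption about how your service exports latency. `kube_pod_container_status_restarts_total` is the kube-state-metrics restart counter.

```typescript
interface AlertDef {
  name: string;
  expr: string;   // full PromQL expression, including the comparison
  window: string; // how long the expression must hold before firing
  severity: "warning" | "critical";
}

// Fields of a Prometheus alerting rule as they appear in a rules file.
interface PrometheusRule {
  alert: string;
  expr: string;
  for: string;
  labels: { severity: string };
}

const alertDefs: AlertDef[] = [
  { name: "GpuUnderutilized", expr: "avg(DCGM_FI_DEV_GPU_UTIL) < 20", window: "30m", severity: "warning" },
  {
    name: "GpuMemoryNearFull",
    expr: "avg(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) > 0.95",
    window: "5m",
    severity: "critical",
  },
  {
    // Assumed histogram name, measured in seconds, so 500 ms becomes 0.5.
    name: "InferenceLatencyHigh",
    expr: "histogram_quantile(0.99, sum by (le) (rate(inference_latency_seconds_bucket[5m]))) > 0.5",
    window: "10m",
    severity: "warning",
  },
  {
    // The 15m window lives in the range selector, so no extra `for` delay is used.
    name: "PodRestartLoop",
    expr: "increase(kube_pod_container_status_restarts_total[15m]) > 3",
    window: "0m",
    severity: "critical",
  },
];

function toRule(def: AlertDef): PrometheusRule {
  return { alert: def.name, expr: def.expr, for: def.window, labels: { severity: def.severity } };
}

// Builds the object layout of a rules file; serialize it with the YAML library of your choice.
function toRuleGroup(name: string, defs: AlertDef[]) {
  return { groups: [{ name, rules: defs.map(toRule) }] };
}

console.log(JSON.stringify(toRuleGroup("coreweave-gpu", alertDefs), null, 2));
```

Keeping the full PromQL expression in each definition avoids having to parse informal condition strings such as `'avg < 20'`.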

Structured Logging

function logGpuEvent(event: string, node: string, data: Record<string, any>) {
  console.log(JSON.stringify({
    service: 'coreweave', event, node,
    gpu_model: data.gpu_model, utilization: data.util,
    memory_pct: data.memPct, temperature: data.temp,
    timestamp: new Date().toISOString(),
  }));
}
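
Because `JSON.stringify` silently drops fields whose value is `undefined`, a loosely typed `data` argument can produce log lines with missing keys. A typed variant, sketched below, makes the expected fields explicit and takes the timestamp as a parameter so the output is easy to test. The `GpuEventData` interface, the `formatGpuEvent` name, the percentage units, and the sample event, node, and GPU model values are all assumptions for illustration.

```typescript
// Hypothetical payload type; units are assumptions (percentages 0-100, degrees Celsius).
interface GpuEventData {
  gpuModel: string;
  utilizationPct: number;
  memoryPct: number;
  temperatureC: number;
}

// Builds one JSON log line. `now` is injectable so tests can pin the timestamp.
function formatGpuEvent(
  event: string,
  node: string,
  data: GpuEventData,
  now: Date = new Date(),
): string {
  return JSON.stringify({
    service: "coreweave",
    event,
    node,
    gpu_model: data.gpuModel,
    utilization: data.utilizationPct,
    memory_pct: data.memoryPct,
    temperature: data.temperatureC,
    timestamp: now.toISOString(),
  });
}

// Usage: emit one JSON object per line so log shippers can parse it directly.
console.log(
  formatGpuEvent("gpu.thermal_warning", "example-node-1", {
    gpuModel: "H100",
    utilizationPct: 97,
    memoryPct: 88,
    temperatureC: 86,
  }),
);
```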

Error Handling

Signal | Meaning | Action
GPU util < 20% sustained | Idle GPUs burning cost | Scale down or reassign workload
GPU memory > 95% | OOM imminent | Reduce batch size or add nodes
Pod CrashLoopBackOff | Driver or config failure | Check DCGM logs, restart node
Inference latency spike | Contention or throttling | Review GPU temp and queue depth
Node NotReady | Hardware or network issue | Cordon node, migrate pods
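
The table above can be encoded as a small triage function that turns observed signals into the recommended actions. This is a minimal sketch: the `Signals` interface and the `triage` function are hypothetical, and detecting each signal (for example deciding that low utilization is "sustained") is assumed to happen elsewhere.

```typescript
// Hypothetical snapshot of the signals listed in the table.
interface Signals {
  gpuIdleSustained: boolean; // GPU util < 20% for the full alert window
  gpuMemoryFraction: number; // 0-1 fraction of GPU memory in use
  podCrashLooping: boolean;  // pod is in CrashLoopBackOff
  latencySpike: boolean;     // inference latency above its alert threshold
  nodeReady: boolean;        // Kubernetes node Ready condition
}

// Returns the actions from the table, in table order, for every signal that is active.
function triage(s: Signals): string[] {
  const actions: string[] = [];
  if (s.gpuIdleSustained) actions.push("Scale down or reassign workload");
  if (s.gpuMemoryFraction > 0.95) actions.push("Reduce batch size or add nodes");
  if (s.podCrashLooping) actions.push("Check DCGM logs, restart node");
  if (s.latencySpike) actions.push("Review GPU temp and queue depth");
  if (!s.nodeReady) actions.push("Cordon node, migrate pods");
  return actions;
}

// Usage: a node that is close to running out of GPU memory.
const actions = triage({
  gpuIdleSustained: false,
  gpuMemoryFraction: 0.97,
  podCrashLooping: false,
  latencySpike: false,
  nodeReady: true,
});
console.log(actions.join("; "));
```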

Next Steps

For incident response, see coreweave-incident-runbook.