ml-model-eval-benchmark

Compare model candidates using weighted metrics and deterministic ranking outputs. Use for benchmark leaderboards and model promotion decisions.

Stars
15
Source
dvcrn/openclaw-skills-marketplace
Updated
2026-05-29
Slug
dvcrn--openclaw-skills-marketplace--ml-model-eval-benchmark

// install — copy + paste into any project

mkdir -p .claude/skills && curl -fsSL https://raw.githubusercontent.com/dvcrn/openclaw-skills-marketplace/HEAD/plugins/0x-professor--ml-model-eval-benchmark/skills/ml-model-eval-benchmark/SKILL.md -o .claude/skills/ml-model-eval-benchmark.md

Drops the SKILL.md into .claude/skills/ml-model-eval-benchmark.md. Works with Claude Code, Cursor, and any agent that loads SKILL.md files from .claude/skills/.

ML Model Eval Benchmark

Overview

Produce consistent, reproducible model rankings by combining each candidate's evaluation metrics under explicit metric weights.

Workflow

  1. Define metric weights and accepted metric ranges.
  2. Ingest model metrics for each candidate.
  3. Compute weighted score and ranking.
  4. Export leaderboard and promotion recommendation.
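
As an illustration of steps 1 to 3, here is a minimal sketch of one way to compute a weighted score and a deterministic ranking. It is not the logic of scripts/benchmark_models.py, which is not shown here. The `MetricSpec` type, the min-max scaling over the accepted range, the `higher_is_better` flag, the rounding to six decimals, and the tie-break on model name are all assumptions made for this example, as are the metric and model names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSpec:
    """Hypothetical per-metric configuration (not taken from the skill)."""
    weight: float                  # relative importance; normalized below
    low: float                     # lower bound of the accepted range
    high: float                    # upper bound of the accepted range
    higher_is_better: bool = True  # False for metrics such as latency


def weighted_score(metrics, specs):
    """Return a score in [0, 1] as the weighted mean of range-scaled metrics."""
    total_weight = sum(spec.weight for spec in specs.values())
    score = 0.0
    for name, spec in specs.items():
        value = metrics[name]  # KeyError if a candidate lacks a metric
        if not spec.low <= value <= spec.high:
            raise ValueError(
                f"{name}={value} is outside the accepted range "
                f"[{spec.low}, {spec.high}]"
            )
        # Assumed scaling: min-max over the accepted range.
        scaled = (value - spec.low) / (spec.high - spec.low)
        if not spec.higher_is_better:
            scaled = 1.0 - scaled
        score += (spec.weight / total_weight) * scaled
    return score


def rank(candidates, specs):
    """Sort by score (descending), then by model name so ties are stable."""
    scored = [
        (name, round(weighted_score(metrics, specs), 6))
        for name, metrics in candidates.items()
    ]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


specs = {
    "accuracy": MetricSpec(weight=0.7, low=0.0, high=1.0),
    "latency_ms": MetricSpec(weight=0.3, low=0.0, high=200.0,
                             higher_is_better=False),
}
candidates = {
    "model_b": {"accuracy": 0.9, "latency_ms": 100.0},
    "model_a": {"accuracy": 0.9, "latency_ms": 100.0},
    "model_c": {"accuracy": 0.8, "latency_ms": 50.0},
}
leaderboard = rank(candidates, specs)
```

Rounding before sorting and breaking ties on the model name keeps the order identical across runs and input orderings; the actual tie-break rule should follow references/benchmarking-guide.md.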

Use Bundled Resources

  • Run scripts/benchmark_models.py to generate benchmark outputs.
  • Read references/benchmarking-guide.md for weighting and tie-break guidance.

Guardrails

  • Keep metric names and scales consistent across candidates.
  • Record weighting assumptions in the output.
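
The two guardrails can be enforced mechanically. The sketch below is an assumption-laden illustration, not the bundled script's behavior: `check_consistent_metrics` and `build_report` are hypothetical helper names, and the JSON layout is invented for this example.

```python
import json


def check_consistent_metrics(candidates, expected_metrics):
    """Return a dict of candidates whose metric names differ from the expected set."""
    expected = set(expected_metrics)
    problems = {}
    for model, metrics in candidates.items():
        missing = sorted(expected - set(metrics))
        unexpected = sorted(set(metrics) - expected)
        if missing or unexpected:
            problems[model] = {"missing": missing, "unexpected": unexpected}
    return problems


def build_report(leaderboard, weights, notes):
    """Serialize the leaderboard together with the weighting assumptions."""
    report = {
        # Hypothetical layout: weights and free-text notes travel with the ranking.
        "weighting_assumptions": {"weights": weights, "notes": notes},
        "leaderboard": [
            {"rank": position, "model": model, "score": score}
            for position, (model, score) in enumerate(leaderboard, start=1)
        ],
    }
    # sort_keys keeps the serialized output byte-for-byte stable across runs.
    return json.dumps(report, indent=2, sort_keys=True)


problems = check_consistent_metrics(
    {"model_a": {"accuracy": 0.9}, "model_b": {"acc": 0.8}},
    expected_metrics=["accuracy"],
)
report_text = build_report(
    leaderboard=[("model_a", 0.9)],
    weights={"accuracy": 1.0},
    notes="accuracy scaled over [0, 1]; single-metric example",
)
```

Here `problems` flags `model_b` because it reports `acc` instead of `accuracy`. This check only catches name mismatches; scale mismatches (for example, 0 to 1 versus 0 to 100) still need an accepted-range check.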