skill-eval

Test and iterate on a skill using parallel eval runs. Spawns with-skill and baseline runs, grades results, and helps improve the skill based on feedback. Use when testing skill quality or iterating on skill improvements.

Stars: 39
Source: majesticlabs-dev/majestic-marketplace
Updated: 2026-05-13
Slug: majesticlabs-dev--majestic-marketplace--skill-eval

// install — copy + paste into any project

mkdir -p .claude/skills && curl -fsSL https://raw.githubusercontent.com/majesticlabs-dev/majestic-marketplace/HEAD/plugins/majestic-tools/skills/skill-eval/SKILL.md -o .claude/skills/skill-eval.md

Drops the SKILL.md into .claude/skills/skill-eval.md. Works with Claude Code, Cursor, and any agent that loads SKILL.md files from .claude/skills/.

Skill Eval

Test a skill with parallel eval runs and iterate based on results.

Input

SKILL_PATH = $ARGUMENTS

If empty: AskUserQuestion("Which skill do you want to test? Provide the path.")

Workflow

Step 1: Validate Skill Exists

If not exists(SKILL_PATH/SKILL.md):
  ERROR "No SKILL.md found at SKILL_PATH"

SKILL_CONTENT = Read(SKILL_PATH/SKILL.md)
SKILL_NAME = extract name from frontmatter
SKILL_DESCRIPTION = extract description from frontmatter
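
The frontmatter extraction in Step 1 could be sketched as follows. This is a minimal sketch, assuming the frontmatter is a leading block delimited by `---` lines with simple single-line `key: value` pairs; the helper name `parse_frontmatter` and the sample text are hypothetical, and a real harness would more likely use a YAML parser.

```python
import re

def parse_frontmatter(skill_md: str) -> dict[str, str]:
    """Return top-level `key: value` pairs from a leading `---` block.

    Assumption: frontmatter is YAML-like and uses single-line scalars only.
    """
    match = re.match(r"\A---\s*\n(.*?)\n---\s*(?:\n|\Z)", skill_md, re.DOTALL)
    if match is None:
        raise ValueError("SKILL.md has no frontmatter block")
    fields = {}
    for line in match.group(1).splitlines():
        # Skip indented lines and comments; nested YAML is not handled here.
        if ":" in line and not line.startswith((" ", "\t", "#")):
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip().strip("\"'")
    return fields

# Hypothetical SKILL.md content, for illustration only.
example = """---
name: skill-eval
description: Test and iterate on a skill using parallel eval runs.
---
# Skill Eval
"""

meta = parse_frontmatter(example)
SKILL_NAME = meta["name"]
SKILL_DESCRIPTION = meta["description"]
```

Raising on a missing block, rather than returning an empty dict, keeps a malformed skill from reaching the eval runs.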

Step 2: Load or Create Evals

EVALS_PATH = SKILL_PATH/evals/evals.json

If exists(EVALS_PATH):
  EVALS = Read(EVALS_PATH)
  Present evals to user, ask: "Use these evals, modify, or create new ones?"
Else:
  Generate 2-3 realistic test prompts based on SKILL_CONTENT
  Each prompt should be what a real user would actually say
  Include enough detail: file paths, context, specific requests
  Present to user: "Here are test cases I'd like to try. Look right, or want changes?"

Wait for user confirmation before proceeding.

Eval format:

{
  "skill_name": "SKILL_NAME",
  "evals": [
    {
      "id": 1,
      "prompt": "User's realistic task prompt",
      "expected_output": "Description of expected result",
      "expectations": [
        "The output includes X",
        "The skill used approach Y"
      ]
    }
  ]
}

Save confirmed evals to EVALS_PATH.
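
Before saving, the eval file can be checked against the format above. The sketch below is illustrative only: the function name `validate_evals` is hypothetical, and the unique-`id` check is an assumption, made because later steps name directories `eval-EVAL.id`.

```python
def validate_evals(data: dict) -> list[str]:
    """Return a list of problems; an empty list means the structure looks well-formed."""
    problems = []
    if not isinstance(data.get("skill_name"), str):
        problems.append("skill_name must be a string")
    evals = data.get("evals")
    if not isinstance(evals, list) or not evals:
        return problems + ["evals must be a non-empty list"]
    seen_ids = set()
    for index, entry in enumerate(evals):
        if not isinstance(entry, dict):
            problems.append(f"evals[{index}] must be an object")
            continue
        for key in ("id", "prompt", "expected_output", "expectations"):
            if key not in entry:
                problems.append(f"evals[{index}] is missing '{key}'")
        if "id" in entry:
            # Assumption: ids must be unique because they name output directories.
            if entry["id"] in seen_ids:
                problems.append(f"evals[{index}] reuses id {entry['id']}")
            seen_ids.add(entry["id"])
        if not isinstance(entry.get("expectations", []), list):
            problems.append(f"evals[{index}].expectations must be a list")
    return problems

sample = {
    "skill_name": "SKILL_NAME",
    "evals": [
        {
            "id": 1,
            "prompt": "User's realistic task prompt",
            "expected_output": "Description of expected result",
            "expectations": ["The output includes X", "The skill used approach Y"],
        }
    ],
}
issues = validate_evals(sample)
```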

Step 3: Set Up Workspace

WORKSPACE = SKILL_PATH-workspace
ITERATION = find highest iteration-N in WORKSPACE + 1, or 1 if none
ITER_DIR = WORKSPACE/iteration-ITERATION

mkdir -p ITER_DIR
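
One way to compute the next iteration directory is sketched below. It assumes `SKILL_PATH-workspace` means a sibling directory whose name is the skill directory's name plus `-workspace`; the helper names `workspace_for` and `next_iteration_dir` are hypothetical.

```python
import re
import tempfile
from pathlib import Path

def workspace_for(skill_path: Path) -> Path:
    """Assumed reading of SKILL_PATH-workspace: a sibling directory of the skill."""
    return skill_path.with_name(skill_path.name + "-workspace")

def next_iteration_dir(workspace: Path) -> Path:
    """Create and return WORKSPACE/iteration-(highest N + 1), or iteration-1 if none exist."""
    pattern = re.compile(r"iteration-(\d+)")
    numbers = []
    if workspace.is_dir():
        for child in workspace.iterdir():
            found = pattern.fullmatch(child.name)
            if found and child.is_dir():
                numbers.append(int(found.group(1)))
    iter_dir = workspace / f"iteration-{max(numbers, default=0) + 1}"
    iter_dir.mkdir(parents=True, exist_ok=False)  # equivalent of mkdir -p, but refuses to reuse a dir
    return iter_dir

# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    ws = workspace_for(Path(tmp) / "my-skill")
    first = next_iteration_dir(ws).name
    second = next_iteration_dir(ws).name
```

Comparing the parsed integers rather than the directory names avoids ranking `iteration-9` above `iteration-10`.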

Step 4: Spawn All Runs

For each EVAL in EVALS, spawn TWO subagents in the SAME turn:

With-skill run:

Agent(prompt: """
Execute this task:
- Read the skill at: SKILL_PATH/SKILL.md
- Follow the skill's instructions to complete this task: EVAL.prompt
- Save all outputs to: ITER_DIR/eval-EVAL.id/with_skill/outputs/
""")

Baseline run (no skill):

Agent(prompt: """
Execute this task (no special instructions):
- Task: EVAL.prompt
- Save all outputs to: ITER_DIR/eval-EVAL.id/without_skill/outputs/
""")

Launch ALL runs (with-skill + baseline for every eval) in parallel.
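
The fan-out in this step, two runs per eval, can be planned up front as data. The sketch below only builds the prompts and output paths; `RunSpec` and `plan_runs` are hypothetical names, and the agent-spawning call itself is not shown because its API is not specified here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunSpec:
    eval_id: int
    config: str       # "with_skill" or "without_skill"
    prompt: str       # text handed to the subagent
    outputs_dir: str  # where that subagent is told to save its outputs

def plan_runs(evals: list[dict], skill_path: str, iter_dir: str) -> list[RunSpec]:
    """Return one with-skill and one baseline RunSpec per eval, in that order."""
    runs = []
    for ev in evals:
        base = f"{iter_dir}/eval-{ev['id']}"
        with_dir = f"{base}/with_skill/outputs/"
        without_dir = f"{base}/without_skill/outputs/"
        runs.append(RunSpec(
            ev["id"], "with_skill",
            "Execute this task:\n"
            f"- Read the skill at: {skill_path}/SKILL.md\n"
            f"- Follow the skill's instructions to complete this task: {ev['prompt']}\n"
            f"- Save all outputs to: {with_dir}\n",
            with_dir,
        ))
        runs.append(RunSpec(
            ev["id"], "without_skill",
            "Execute this task (no special instructions):\n"
            f"- Task: {ev['prompt']}\n"
            f"- Save all outputs to: {without_dir}\n",
            without_dir,
        ))
    return runs

# Hypothetical evals and paths, for illustration only.
runs = plan_runs(
    [{"id": 1, "prompt": "Task A"}, {"id": 2, "prompt": "Task B"}],
    skill_path="skills/demo",
    iter_dir="skills/demo-workspace/iteration-1",
)
```

Building the full list first makes it easy to confirm that every eval has both a with-skill and a baseline run before anything is launched.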

Step 5: Draft Assertions While Runs Execute

While runs are in progress, draft quantitative assertions for each eval:

  • Assertions must be objectively verifiable
  • Give each a descriptive name
  • Skip assertions for subjective outcomes (writing style, design quality)

Write eval_metadata.json for each eval directory:

{
  "eval_id": 1,
  "eval_name": "descriptive-name",
  "prompt": "The user's task prompt",
  "assertions": ["Output includes X", "File is valid JSON"]
}

Explain assertions to user while waiting.
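
Writing the metadata file could look like the sketch below. It assumes the file belongs in `ITER_DIR/eval-EVAL.id/`, matching the directory layout from Step 4; the function name `write_eval_metadata` is hypothetical.

```python
import json
import tempfile
from pathlib import Path

def write_eval_metadata(iter_dir: Path, eval_id: int, eval_name: str,
                        prompt: str, assertions: list[str]) -> Path:
    """Write eval_metadata.json into the eval's directory and return its path."""
    eval_dir = Path(iter_dir) / f"eval-{eval_id}"  # assumed location
    eval_dir.mkdir(parents=True, exist_ok=True)
    path = eval_dir / "eval_metadata.json"
    payload = {
        "eval_id": eval_id,
        "eval_name": eval_name,
        "prompt": prompt,
        "assertions": assertions,
    }
    path.write_text(json.dumps(payload, indent=2) + "\n", encoding="utf-8")
    return path

# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    written = write_eval_metadata(
        Path(tmp), 1, "descriptive-name", "The user's task prompt",
        ["Output includes X", "File is valid JSON"],
    )
    round_trip = json.loads(written.read_text(encoding="utf-8"))
```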

Step 6: Grade Results

Once all runs complete:

For each EVAL and each CONFIG in (with_skill, without_skill), spawn one grader:

Agent(prompt: """
Read agents/skill-grader.md and follow its instructions.
- expectations: EVAL.expectations
- transcript_path: ITER_DIR/eval-EVAL.id/CONFIG/transcript.md
- outputs_dir: ITER_DIR/eval-EVAL.id/CONFIG/outputs/
- eval_prompt: EVAL.prompt
""", subagent_type: "majestic-tools:agents:skill-grader")

Step 7: Present Results

For each eval, show:

  • Prompt
  • With-skill pass rate vs baseline pass rate
  • Key differences in outputs
  • Any eval feedback from grader

Summary table:

| Eval | With Skill | Baseline | Delta |
|------|-----------|----------|-------|
| 1    | 85%       | 35%      | +50%  |

Ask user: "How do these look? Any specific feedback on the outputs?"
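
The summary table can be derived from per-expectation pass/fail results. The sketch below assumes the grader's verdicts have already been reduced to a list of booleans per run; that data shape and the names `pass_rate` and `summary_table` are hypothetical. The delta is a difference in percentage points, printed with a `%` sign to mirror the table above.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of expectations that passed; an empty list counts as 0.0."""
    return sum(results) / len(results) if results else 0.0

def summary_table(graded: dict[int, dict[str, list[bool]]]) -> str:
    """Render the With Skill / Baseline / Delta table, one row per eval id."""
    lines = [
        "| Eval | With Skill | Baseline | Delta |",
        "|------|-----------|----------|-------|",
    ]
    for eval_id in sorted(graded):
        with_skill = pass_rate(graded[eval_id]["with_skill"])
        baseline = pass_rate(graded[eval_id]["without_skill"])
        delta = round((with_skill - baseline) * 100)  # percentage points
        lines.append(
            f"| {eval_id} | {with_skill:.0%} | {baseline:.0%} | {delta:+d}% |"
        )
    return "\n".join(lines)

# Made-up results, for illustration only.
graded = {
    1: {
        "with_skill": [True, True, True, False],
        "without_skill": [True, False, False, False],
    }
}
table = summary_table(graded)
```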

Step 8: Iterate (if needed)

If user has feedback:
  1. Analyze feedback — generalize, don't overfit to specific cases
  2. Improve the skill (Edit SKILL.md)
  3. Explain changes made and why
  4. Ask: "Want to rerun the evals with the updated skill?"
  5. If yes: goto Step 3 (new iteration)

If user is satisfied:
  "Skill looks good!"

Improvement Guidelines

When rewriting skills based on feedback:

  • Generalize — don't overfit to the 2-3 test cases; the skill will be used many times
  • Keep it lean — remove instructions that aren't pulling their weight
  • Explain the why — tell the model why things matter instead of rigid MUSTs
  • Look for repeated work — if all runs wrote similar helper scripts, bundle the script in the skill
  • Read transcripts — check if the skill wastes time on unproductive steps

Error Handling

| Condition | Action |
|-----------|--------|
| Subagent run times out | Note timeout, grade available outputs |
| No outputs produced | FAIL all expectations for that run |
| Skill has syntax errors | Fix before running evals |
| User wants to stop iterating | Accept and summarize current state |