Skill Eval

Test a skill with parallel eval runs and iterate based on results.

Input

SKILL_PATH = $ARGUMENTS

If empty: AskUserQuestion("Which skill do you want to test? Provide the path.")

Workflow

Step 1: Validate Skill Exists

If not exists(SKILL_PATH/SKILL.md):
  ERROR "No SKILL.md found at SKILL_PATH"

SKILL_CONTENT = Read(SKILL_PATH/SKILL.md)
SKILL_NAME = extract name from frontmatter
SKILL_DESCRIPTION = extract description from frontmatter

Step 2: Load or Create Evals

EVALS_PATH = SKILL_PATH/evals/evals.json

If exists(EVALS_PATH):
  EVALS = Read(EVALS_PATH)
  Present evals to user, ask: "Use these evals, modify, or create new ones?"
Else:
  Generate 2-3 realistic test prompts based on SKILL_CONTENT
  Each prompt should be what a real user would actually say
  Include enough detail: file paths, context, specific requests
  Present to user: "Here are test cases I'd like to try. Look right, or want changes?"

Wait for user confirmation before proceeding.

Eval format:

{
  "skill_name": "SKILL_NAME",
  "evals": [
    {
      "id": 1,
      "prompt": "User's realistic task prompt",
      "expected_output": "Description of expected result",
      "expectations": [
        "The output includes X",
        "The skill used approach Y"
      ]
    }
  ]
}

Save confirmed evals to EVALS_PATH.

Step 3: Set Up Workspace

WORKSPACE = SKILL_PATH-workspace
ITERATION = find highest iteration-N in WORKSPACE + 1, or 1 if none
ITER_DIR = WORKSPACE/iteration-ITERATION

mkdir -p ITER_DIR

Step 4: Spawn All Runs

For each EVAL in EVALS, spawn TWO subagents in the SAME turn:

With-skill run:

Agent(prompt: """
Execute this task:
- Read the skill at: SKILL_PATH/SKILL.md
- Follow the skill's instructions to complete this task: EVAL.prompt
- Save all outputs to: ITER_DIR/eval-EVAL.id/with_skill/outputs/
""")

Baseline run (no skill):

Agent(prompt: """
Execute this task (no special instructions):
- Task: EVAL.prompt
- Save all outputs to: ITER_DIR/eval-EVAL.id/without_skill/outputs/
""")

Launch ALL runs (with-skill + baseline for every eval) in parallel.

Step 5: Draft Assertions While Runs Execute

While runs are in progress, draft quantitative assertions for each eval:

Assertions must be objectively verifiable
Give each a descriptive name
Skip assertions for subjective outcomes (writing style, design quality)

Write eval_metadata.json for each eval directory:

{
  "eval_id": 1,
  "eval_name": "descriptive-name",
  "prompt": "The user's task prompt",
  "assertions": ["Output includes X", "File is valid JSON"]
}

Explain assertions to user while waiting.

Step 6: Grade Results

Once all runs complete:

For each run (with_skill and without_skill):

Agent(prompt: """
Read agents/skill-grader.md and follow its instructions.
- expectations: EVAL.expectations
- transcript_path: ITER_DIR/eval-EVAL.id/CONFIG/transcript.md
- outputs_dir: ITER_DIR/eval-EVAL.id/CONFIG/outputs/
- eval_prompt: EVAL.prompt
""", subagent_type: "majestic-tools:agents:skill-grader")

Step 7: Present Results

For each eval, show:

Prompt
With-skill pass rate vs baseline pass rate
Key differences in outputs
Any eval feedback from grader

Summary table:

| Eval | With Skill | Baseline | Delta |
|------|-----------|----------|-------|
| 1    | 85%       | 35%      | +50%  |

Ask user: "How do these look? Any specific feedback on the outputs?"

Step 8: Iterate (if needed)

If user has feedback:
  1. Analyze feedback — generalize, don't overfit to specific cases
  2. Improve the skill (Edit SKILL.md)
  3. Explain changes made and why
  4. Ask: "Want to rerun the evals with the updated skill?"
  5. If yes: goto Step 3 (new iteration)

If user is satisfied:
  "Skill looks good!"

Improvement Guidelines

When rewriting skills based on feedback:

Generalize — don't overfit to the 2-3 test cases; the skill will be used many times
Keep it lean — remove instructions that aren't pulling their weight
Explain the why — tell the model why things matter instead of rigid MUSTs
Look for repeated work — if all runs wrote similar helper scripts, bundle the script in the skill
Read transcripts — check if the skill wastes time on unproductive steps

Error Handling

Condition	Action
Subagent run times out	Note timeout, grade available outputs
No outputs produced	FAIL all expectations for that run
Skill has syntax errors	Fix before running evals
User wants to stop iterating	Accept and summarize current state