Skip to main content
AI/MLwshobson

llm-evaluation

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

Stars
36,167
Source
wshobson/agents
Updated
2026-05-29
Slug
wshobson--agents--llm-evaluation
View on GitHubRaw SKILL.md

// install — copy + paste into any project

mkdir -p .claude/skills && curl -fsSL https://raw.githubusercontent.com/wshobson/agents/HEAD/plugins/llm-application-dev/skills/llm-evaluation/SKILL.md -o .claude/skills/llm-evaluation.md

Drops the SKILL.md into .claude/skills/llm-evaluation.md. Works with Claude Code, Cursor, and any agent that loads SKILL.md files from .claude/skills/.

LLM Evaluation

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

When to Use This Skill

  • Measuring LLM application performance systematically
  • Comparing different models or prompts
  • Detecting performance regressions before deployment
  • Validating improvements from prompt changes
  • Building confidence in production systems
  • Establishing baselines and tracking progress over time
  • Debugging unexpected model behavior

Core Evaluation Types

1. Automated Metrics

Fast, repeatable, scalable evaluation using computed scores.

Text Generation:

  • BLEU: N-gram overlap (translation)
  • ROUGE: Recall-oriented (summarization)
  • METEOR: Semantic similarity
  • BERTScore: Embedding-based similarity
  • Perplexity: Language model confidence

Classification:

  • Accuracy: Percentage correct
  • Precision/Recall/F1: Class-specific performance
  • Confusion Matrix: Error patterns
  • AUC-ROC: Ranking quality

Retrieval (RAG):

  • MRR: Mean Reciprocal Rank
  • NDCG: Normalized Discounted Cumulative Gain
  • Precision@K: Relevant in top K
  • Recall@K: Coverage in top K

2. Human Evaluation

Manual assessment for quality aspects difficult to automate.

Dimensions:

  • Accuracy: Factual correctness
  • Coherence: Logical flow
  • Relevance: Answers the question
  • Fluency: Natural language quality
  • Safety: No harmful content
  • Helpfulness: Useful to the user

3. LLM-as-Judge

Use stronger LLMs to evaluate weaker model outputs.

Approaches:

  • Pointwise: Score individual responses
  • Pairwise: Compare two responses
  • Reference-based: Compare to gold standard
  • Reference-free: Judge without ground truth

Quick Start

from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class Metric:
    name: str
    fn: Callable

    @staticmethod
    def accuracy():
        return Metric("accuracy", calculate_accuracy)

    @staticmethod
    def bleu():
        return Metric("bleu", calculate_bleu)

    @staticmethod
    def bertscore():
        return Metric("bertscore", calculate_bertscore)

    @staticmethod
    def custom(name: str, fn: Callable):
        return Metric(name, fn)

class EvaluationSuite:
    def __init__(self, metrics: list[Metric]):
        self.metrics = metrics

    async def evaluate(self, model, test_cases: list[dict]) -> dict:
        results = {m.name: [] for m in self.metrics}

        for test in test_cases:
            prediction = await model.predict(test["input"])

            for metric in self.metrics:
                score = metric.fn(
                    prediction=prediction,
                    reference=test.get("expected"),
                    context=test.get("context")
                )
                results[metric.name].append(score)

        return {
            "metrics": {k: np.mean(v) for k, v in results.items()},
            "raw_scores": results
        }

# Usage
suite = EvaluationSuite([
    Metric.accuracy(),
    Metric.bleu(),
    Metric.bertscore(),
    Metric.custom("groundedness", check_groundedness)
])

test_cases = [
    {
        "input": "What is the capital of France?",
        "expected": "Paris",
        "context": "France is a country in Europe. Paris is its capital."
    },
]

results = await suite.evaluate(model=your_model, test_cases=test_cases)

Detailed patterns and worked examples

Detailed pattern documentation lives in references/details.md. Read that file when the navigation tier above is insufficient.