Performance Metrics
Overview
Schema and procedures for capturing performance data in metrics/. Metrics enable evidence-based evaluation of agent effectiveness, cost efficiency, and quality outcomes.
Constraints
- Never log credentials, API keys, or PII in metric entries
- Log entries are append-only; do not modify or delete existing JSONL records
- Log at task completion, not mid-task; mid-task state belongs in
memory/progress files - Use the defined JSONL schema; do not invent new top-level fields without updating the reference
Metric Categories
Efficiency Metrics
| Metric | Description | Target |
|---|---|---|
| Task completion time | Wall-clock time from request to delivery | Track trend, no fixed target |
| Token usage per task | Total input + output tokens consumed | Minimize for comparable quality |
| Agent loading overhead | Tokens spent on agent/skill file reads | < 5% of total task tokens |
| Context summarization frequency | How often summarization triggers per task | < 2 per standard task |
Quality Metrics
| Metric | Description | Target |
|---|---|---|
| First-pass acceptance rate | Tasks accepted without rework | > 80% |
| Rework count | Number of revision cycles per task | < 2 |
| Hallucination incidents | Outputs containing fabricated information | < 5% of tasks |
| Accuracy score | Correctness of structured data extraction | > 95% |
| Test coverage | Percentage of code covered by generated tests | Track per project |
Cost Metrics
| Metric | Description | Target |
|---|---|---|
| Cost per task | Total API cost (input + output tokens at rate) | Track trend |
| LLM routing ratio | Percentage of tasks routed to each LLM | Track distribution |
| Selective loading savings | Tokens saved vs. loading all agents | > 50% reduction |
Log Format
Metrics are stored in metrics/ as JSONL files (one JSON object per line).
File Naming
metrics/{date}-task-log.jsonl
Example: metrics/2026-02-20-task-log.jsonl
Task Completion Entry
Logged at the end of each task:
{
"timestamp": "2026-02-20T14:30:00Z",
"task_id": "unique-id",
"task_type": "implementation",
"task_description": "Build REST API for user authentication",
"agents_used": ["software-engineer", "architect"],
"skills_used": ["hexagonal-architecture"],
"tokens": {
"input": 12500,
"output": 3200,
"total": 15700
},
"cost_usd": 0.043,
"llm": "claude-opus-4-6",
"context_summarizations": 0,
"phases": 2,
"rework_cycles": 1,
"accepted": true,
"hallucination_detected": false,
"duration_seconds": 180
}
Field Reference
| Field | Type | Description |
|---|---|---|
timestamp |
string | ISO 8601 completion time |
task_id |
string | Unique identifier for this task |
task_type |
string | implementation, design, bugfix, testing, documentation, analysis |
task_description |
string | Brief description of the task |
agents_used |
string[] | Agent names that were loaded |
skills_used |
string[] | Skill names that were loaded |
tokens.input |
number | Input tokens from API usage field |
tokens.output |
number | Output tokens from API usage field |
tokens.total |
number | Sum of input + output |
cost_usd |
number | Estimated cost based on token rates |
llm |
string | Model ID used |
context_summarizations |
number | Times summarization was triggered |
phases |
number | Number of loading phases |
rework_cycles |
number | Number of revision cycles |
accepted |
boolean | Whether the user accepted the output |
hallucination_detected |
boolean | Whether a hallucination was flagged |
duration_seconds |
number | Wall-clock seconds from start to delivery |
When to Log
| Event | Action |
|---|---|
| Task completed | Log full task completion entry |
| Configuration change | Log in metrics/config-changelog.jsonl (see Feedback & Learning skill) |
| Hallucination detected | Flag in task entry + log separately if correction applied |
| Context summarization triggered | Increment counter in current task entry |
Output
JSONL log entries written to metrics/ and/or a summary report of metric trends. Be concise — report anomalies and trend signals; omit entries within normal range.
Reporting
Periodically review metrics to identify patterns:
- Weekly: Review task completion entries for rework trends and hallucination rate
- Monthly: Aggregate cost metrics and LLM routing distribution
- Per-project: Compare first-pass acceptance rate across task types
Summaries can be written to metrics/reports/ for historical reference.