Skip to main content
Generalhashgraph-online

reflect

On-demand human self-assessment of AI usage quality. Scores 5 dimensions from session data. Not agent performance review.

Stars
336
Source
hashgraph-online/awesome-codex-plugins
Updated
2026-05-27
Slug
hashgraph-online--awesome-codex-plugins--reflect
View on GitHubRaw SKILL.md

// install — copy + paste into any project

mkdir -p .claude/skills && curl -fsSL https://raw.githubusercontent.com/hashgraph-online/awesome-codex-plugins/HEAD/plugins/epicsagas/epic-harness/registry/skills/reflect/SKILL.md -o .claude/skills/reflect.md

Drops the SKILL.md into .claude/skills/reflect.md. Works with Claude Code, Cursor, and any agent that loads SKILL.md files from .claude/skills/.

Reflect — Human AI-Usage Self-Assessment

This skill is for you (the human) to reflect on how well you're leveraging AI as a thought amplifier — not a review of agent performance.

Data source: The reflect hook (session-end) automatically collects observations, analyzes patterns, and updates metrics.json. This skill consumes that hook-produced data to produce a human-readable self-assessment.

Hook (auto)                    Skill (on-demand /reflect)
─────────────                  ──────────────────────────
observe → obs/*.jsonl  ──→     epic-harness reflect --context 30
evolve → metrics.json  ──→     5-dimension scorecard
seed → evolved skills  ──→     Action items for the human
ingest → memory graph  ──→     Trend analysis

Iron Law

No score without evidence. Every rating must directly cite at least one of: obs stats, evolution patterns, memory nodes, or session summaries. Block self-serving bias: "doing well" conclusions require concrete metrics.

Process

Step 0 — Collect Context

# Uses Rust subcommand — works on all platforms (Linux, macOS, Windows)
epic-harness reflect --context 30 > /tmp/reflect_ctx.json

Fallback if subcommand fails:

echo "obs_files: $(ls "$HARNESS_DIR/obs/" | wc -l)"
python3 -c "import json; m=json.load(open('$HARNESS_DIR/metrics.json')); print('total_sessions:', m.get('total_sessions',0))"

Query harness-mem (if active):

mem_recall(hint="AI usage patterns decisions metacognition", limit=8)
mem_list(type="decision", limit=5)
mem_list(type="pattern", limit=5)

Step 1 — 5-Dimension Reflection

Score each dimension independently: 1–10 + evidence citation + one-line diagnosis.

Dimension 1: Thought Amplification

Question: Is AI a mere executor (code typist) or a genuine thought partner?

Metrics:

  • Agent tool call ratio (Agent / total_obs — higher = delegated thinking)
  • Skill invocation frequency (meta-layer usage)
  • council/discover/spec execution history
  • harness-mem decision node count

Scoring:

Score Signal
8–10 Agent delegation ≥ 5%, diverse skill usage, council execution recorded
5–7 Mostly Bash/Read/Edit, occasional Agent, complex decisions made solo
1–4 Bash+Edit 90%+, AI at autocomplete level

Dimension 2: Self-Improvement

Question: Learning from mistakes, or repeating the same patterns?

Metrics:

  • evolution_stats.pattern_frequency — recurring patterns?
  • evolution_stats.stagnation_count — stagnant sessions
  • evolution_stats.trend_last10 — improving/stable/declining distribution
  • Evolved skills count vs timespan

Scoring:

Score Signal
8–10 Trend improving ≥ 60%, no pattern recurrence, evolved skills growing
5–7 Mostly stable trend, some pattern repeats, evolved skills plateau
1–4 Trend declining or frequent stagnation, same mistakes repeated

Dimension 3: Metacognitive Expansion

Question: Are conversations with AI helping recognize and upgrade your own thinking?

Metrics:

  • harness-mem concept/pattern node count and recency
  • Decision/ADR recording frequency
  • Session snapshots mentioning "learnings"
  • /discover /spec execution history (problem-framing practice)

Scoring:

Score Signal
8–10 Regular decision nodes, /discover /spec used, ADRs exist
5–7 Intermittent recording, decision rationale only in code, no explicit notes
1–4 Almost no memory nodes, context breaks between sessions

Dimension 4: Prompt Engineering

Question: Is prompt quality improving over time?

Metrics:

  • output_quality dimension average trend (metrics score_history)
  • tool_success rate trend
  • Evolved skills with prompt improvement patterns
  • Session average score trajectory (early vs recent)

Scoring:

Score Signal
8–10 output_quality ≥ 0.80, score trend upward, evolved prompts increasing
5–7 output_quality 0.65–0.80, improvement plateau
1–4 output_quality < 0.65, downward trend, frequent re-edits

Dimension 5: Execution Efficiency

Question: Achieving the same goals faster and cheaper through AI?

Metrics:

  • execution_cost dimension average (1.0 = optimal)
  • Bash dominance (Bash > 50% suggests excessive low-level repetition)
  • Context compaction frequency (too frequent = context waste)
  • Agent sub-agent parallel usage

Scoring:

Score Signal
8–10 execution_cost ≥ 0.90, parallel Agent usage, compaction < 20% of sessions
5–7 Bash 40–60%, mostly single-agent serial execution
1–4 Bash 70%+, no sub-agent usage, trivial tasks delegated to AI

Step 2 — Summary Scorecard

## AI Thought-Amplifier Reflection Report
Generated: {ISO-8601}  |  Window: {N} days  |  Total sessions: {total_sessions}

| Dimension            | Score | Grade | Key Evidence |
|----------------------|-------|-------|--------------|
| Thought Amplification | X/10 | 🔴/🟡/🟢 | Agent {N}x ({P}%), Skill {M}x |
| Self-Improvement      | X/10 | 🔴/🟡/🟢 | trend={T}, stagnation={S}x |
| Metacognitive Expansion | X/10 | 🔴/🟡/🟢 | decisions {D}, session notes {M} |
| Prompt Engineering    | X/10 | 🔴/🟡/🟢 | output_quality={Q}, trend={Δ} |
| Execution Efficiency  | X/10 | 🔴/🟡/🟢 | execution_cost={C}, Bash%={B} |
| **Overall**           | **X/10** | | |

Grade: 🟢 8–10 (good)  🟡 5–7 (fair)  🔴 1–4 (needs improvement)

Step 3 — Cold Summary

3–5 sentences. Rules:

  • Open with the lowest-scoring dimension
  • No "doing well" claims without numeric evidence
  • State the single biggest bottleneck
  • Classify usage as "execution automation" or "thought amplification"

Step 4 — Action Items (minimum 3)

Format: [Priority] Title — Concrete action — Expected impact

### Next Reflection Actions

1. [HIGH] {title}
   - Action: {concrete steps}
   - Metric: {how to measure}
   - Deadline: {session count or date}

2. [MED] ...
3. [LOW] ...

Recommended action pool (select by low-scoring dimensions):

  • Thought Amplification low → council mode weekly, /spec writing habit
  • Self-Improvement low → /evolve history periodic review, manual pattern notes
  • Metacognition low → mem_add(type=decision) after every important decision
  • Prompt low → post low-quality session prompt review → seed improved evolved skill
  • Efficiency low → Agent parallel sub-agent patterns, script repetitive Bash calls

Step 5 — Save to Memory

Save reflection to harness-mem (if mem tools active):

mem_add(
  type="session",
  title="AI usage reflection {date}",
  tags=["reflection", "metacognition"],
  importance=0.8,
  body="Overall: {score}/10. Lowest: {lowest_dim}. Top action: {top_action}"
)

Anti-Rationalization

Excuse Rebuttal What to do instead
"Too few sessions for accurate reflection" Even 3 sessions reveal patterns. Insufficient data is not an excuse for a high score. Reflect on available data; add data collection improvement as an action item.
"Bash-heavy because it's a Rust project" Tool usage distribution doesn't only reflect task type. Check for Bash calls that could be delegated to Agent. Find ≥ 3 Bash calls that could be replaced by Agent delegation.
"Score 0.75 is good enough" 0.75 is not an absolute standard. Trend vs previous period matters more. Compare last 5-session average vs prior 5-session average in score_history.
"No memory needed — context is enough" Context dies at session end. Without cross-session learning continuity, you start from scratch every time. Start the habit of mem_add after every important decision — now.
"Code output is already good, so it's fine" Code output quality ≠ thought amplification. Good code can still mean AI is doing the thinking for you. Count how many decisions in the last 5 sessions you actually designed yourself.

Evidence Required

  • /tmp/reflect_ctx.json or manually collected data exists
  • Each of 5 dimensions cites ≥ 1 numeric evidence
  • Summary scorecard output complete
  • Cold summary states 1 specific bottleneck
  • Minimum 3 action items, each with measurement metric
  • Anti-Rationalization table applied to low-scoring dimensions (1–4)

Red Flags

  • All dimensions ≥ 7 → suspect positive bias. Re-verify each score against evidence.
  • Action items at "use more often" level → rewrite with concrete actions and metrics.
  • Summary under 200 chars → insufficient analysis. Cite more evidence.
  • Script failure stops reflection → fall back to manual collection and continue.
  • harness-mem save omitted → reflection won't carry into next session.