Skip to main content
GeneralSteveGJones

kb-validate-citations

Spot-check citations in library files for obvious hallucinations. Resolves DOIs against the DOI registry, validates arXiv IDs, checks the format of journal references. Soft-fails on transient lookup failures rather than blocking. Recommended after kb-ingest and as part of periodic library hygiene.

Stars
38
Source
SteveGJones/ai-first-sdlc-practices
Updated
2026-05-11
Slug
SteveGJones--ai-first-sdlc-practices--kb-validate-citations
View on GitHubRaw SKILL.md

// install — copy + paste into any project

mkdir -p .claude/skills && curl -fsSL https://raw.githubusercontent.com/SteveGJones/ai-first-sdlc-practices/HEAD/plugins/sdlc-knowledge-base/skills/kb-validate-citations/SKILL.md -o .claude/skills/kb-validate-citations.md

Drops the SKILL.md into .claude/skills/kb-validate-citations.md. Works with Claude Code, Cursor, and any agent that loads SKILL.md files from .claude/skills/.

Validate Knowledge Base Citations

Catch obvious citation hallucinations in library files. Deep research engines occasionally fabricate DOIs, attribute findings to the wrong author, or invent journal names. Manual spot-checking 10-20% of citations remains necessary; this skill automates the obvious failure cases so manual review can focus on subtler issues.

Argument

Optional file path or glob to scope validation. Default: all library files in library/**/*.md excluding the index files.

Examples:

  • library/dora-metrics.md — single file
  • library/dora-*.md — files matching a pattern
  • (no argument) — all library files

Preflight

  • Verify the knowledge base is configured.
  • Verify network access (DOI resolution requires HTTPS to https://doi.org).

Steps

1. Discover library files

Use Glob to find files matching the scope. Exclude:

  • library/_shelf-index.md
  • library/_index.md
  • library/log.md
  • library/raw/**

2. Extract citations from each file

For each library file, parse the ## Key References section and the inline citations in ## Core Findings. Identify three citation types:

DOI citations — strings matching the DOI regex: 10\.\d{4,9}/[-._;()/:A-Z0-9]+ (case-insensitive). DOIs are the strongest signal because they're machine-resolvable.

arXiv IDs — strings matching arXiv:\d{4}\.\d{4,5} or arxiv\.org/abs/\d{4}\.\d{4,5}.

Journal references — anything matching the pattern <authors>. (<year>). "<title>." <Journal Name>, <volume>(<issue>), <pages>. or similar bibliographic format. Best-effort extraction; not all citations follow a standard format.

3. Validate each citation type

DOI resolution

For each DOI, attempt to resolve it via:

curl -sLI "https://doi.org/<doi>" -o /dev/null -w "%{http_code}"

Expected: HTTP 302 or 200 (redirected to the publisher landing page). Failure: HTTP 404 (DOI does not exist) or 500 (DOI registry error).

Classify each result:

  • OK: 200/302
  • NOT FOUND: 404 — flag as likely hallucinated
  • TRANSIENT: 500/timeout/network failure — soft-fail with "could not verify"

Rate limit: pause 100ms between DOI lookups to avoid hammering doi.org. If rate-limited (429), back off and retry once with a longer pause.

arXiv ID validation

For each arXiv ID:

  1. Validate format: must be YYMM.NNNNN (4-5 digit suffix, 2007 or later)
  2. Optionally verify existence:
    curl -sI "https://arxiv.org/abs/<id>" -o /dev/null -w "%{http_code}"
    
  3. Expected: HTTP 200
  4. NOT FOUND: HTTP 404 — flag as likely hallucinated

Journal reference format

For each journal reference, perform structural checks only (do not attempt to verify the journal exists in any registry — too many legitimate journals to maintain a list):

  • Year is in the past or current year (no 2099 typos)
  • Volume and issue, when present, are integers
  • Page numbers, when present, follow <start>-<end> format
  • Author list looks plausible (not a single word, not all caps unless it's an organisation)

These are weak checks — they catch obvious typos but not real hallucinations. The strong signals are DOI and arXiv resolution.

4. Produce the report

Output format:

# Citation Validation Report

**Date:** YYYY-MM-DD
**Scope:** all files (or pattern)
**Files scanned:** N
**Citations checked:** M

## Summary

| Result | Count |
|---|---|
| OK (resolved) | A |
| NOT FOUND (likely hallucinated) | B |
| TRANSIENT (could not verify) | C |
| Format issues | D |

**Total issues:** B + D (transient failures are not counted as issues — they're informational)

---

## NOT FOUND citations

### library/dora-metrics.md

- DOI 10.1109/TSE.2010.99999 — DOI registry returned 404
  Context: "Madeyski (2010) reports defect reduction of..."
  Recommendation: verify the citation manually; consider whether the DOI was hallucinated

(repeat for each NOT FOUND)

## TRANSIENT failures (informational)

### library/dora-metrics.md

- DOI 10.1109/TSE.2010.5544067 — registry timeout, not validated
  Recommendation: re-run kb-validate-citations later; do not act on this

(repeat for each TRANSIENT)

## Format issues

### library/dora-metrics.md

- "Forsgren (3024)" — year appears to be a typo
  Recommendation: check the original source

(repeat)

---

## Recommended actions

1. Manually verify B citations marked NOT FOUND — these are the highest-confidence hallucination candidates
2. Re-run validation once C transient failures should have cleared
3. Fix D format issues

5. Append to log.md (if present)

## [YYYY-MM-DD] validate-citations | <scope>

Files scanned: N
Citations checked: M
NOT FOUND: B
TRANSIENT: C
Format issues: D

6. Exit code

  • Exit 0 if no NOT FOUND results AND no format issues (transient failures don't count)
  • Exit 1 if any hard failures (NOT FOUND or format issues)

This lets the skill be wired into pre-push validation as an opt-in gate.

What this skill does NOT do

  • It does not verify that the cited paper actually says what the library file claims it says (LLM hallucination of content, not metadata — beyond automated reach; manual spot-checking remains necessary)
  • It does not modify library files to fix issues
  • It does not block on transient failures — soft-fail and report
  • It does not validate non-academic citations (blog posts, GitHub repos, white papers without DOIs)

Errors

  • No knowledge base configured — run /sdlc-knowledge-base:kb-init first
  • No network access — fail with explanation; the skill cannot work offline
  • DOI registry persistently down — soft-fail all DOIs as TRANSIENT and recommend re-running later
  • Rate limited persistently — back off, then partial-fail with a "ran out of attempts" note for unresolved DOIs

Example

/sdlc-knowledge-base:kb-validate-citations

# Citation Validation Report
**Date:** 2026-04-08
**Scope:** all files
**Files scanned:** 22
**Citations checked:** 187

## Summary
| Result | Count |
|---|---|
| OK | 178 |
| NOT FOUND | 2 |
| TRANSIENT | 5 |
| Format issues | 2 |

Total issues: 4

(report continues...)