Confusable Unicode Audit
Detect Trojan Source and homoglyph risks in source files, dependency names, and release metadata. This audit enforces the no-confusable-unicode rule and maps curl Practice 8 into an AIWG control.
Detection Targets
- Bidirectional controls: U+202A through U+202E, U+2066 through U+2069.
- Zero-width and invisible formatting characters: U+200B through U+200F (this range includes the directional marks U+200E and U+200F), U+FEFF.
- Non-ASCII identifiers in source code.
- Mixed-script identifiers, especially Latin plus Cyrillic or Greek.
- Package/dependency names containing non-ASCII or confusable characters.
- Optional metadata scan: commit subjects, PR titles, release notes.
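As a rough illustration of the targets above, the following Python sketch classifies a single line of text. It is not the audit's actual implementation: the function names and finding labels are hypothetical, and script detection is approximated from Unicode character names because the standard library does not expose the Script property that a full TR39 check would use.

```python
import re
import unicodedata

# Code point sets taken from the detection targets listed above.
BIDI_CONTROLS = set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))
ZERO_WIDTH = set(range(0x200B, 0x2010)) | {0xFEFF}

# Rough identifier pattern (assumption): a non-digit word character followed
# by word characters. \w is Unicode-aware for str patterns in Python 3.
IDENTIFIER = re.compile(r"[^\W\d]\w*")


def scripts_in(identifier):
    """Approximate the scripts used in an identifier from character names."""
    scripts = set()
    for ch in identifier:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        for script in ("LATIN", "CYRILLIC", "GREEK"):
            if name.startswith(script):
                scripts.add(script)
    return scripts


def classify_line(line):
    """Return a sorted list of hypothetical finding labels for one line."""
    findings = set()
    for ch in line:
        cp = ord(ch)
        if cp in BIDI_CONTROLS:
            findings.add("bidi-control")
        elif cp in ZERO_WIDTH:
            findings.add("zero-width")
    for match in IDENTIFIER.finditer(line):
        ident = match.group(0)
        if not ident.isascii():
            findings.add("non-ascii-identifier")
        if len(scripts_in(ident)) > 1:
            findings.add("mixed-script-identifier")
    return sorted(findings)
```

The same function could be applied to dependency names or metadata strings, since it only inspects text. A production check would more likely rely on the confusable and script data published with TR39 than on name prefixes.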
Allowlist
Legitimate non-ASCII is declared in .aiwg/security/confusable-unicode-allowlist.yaml:
version: 1
allow:
  - path: "docs/i18n/**"
    reason: "localized documentation"
  - identifier: "naïve_bayes"
    codepoints: ["U+00EF"]
    reason: "historical exported API spelling"
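One way such an allowlist might be consulted is sketched below. The matching semantics are assumptions, not documented behavior: path entries are treated as glob patterns, and identifier entries permit only the listed code points inside that exact identifier. The allowlist is written as a Python dict so the sketch stays within the standard library, the identifier is spelled with U+00EF so that it matches the listed code point, and `is_allowed` is a hypothetical helper name.

```python
from fnmatch import fnmatchcase

# Mirrors the example allowlist as a plain dict; a real run would parse the
# YAML file with a YAML library instead.
ALLOWLIST = {
    "version": 1,
    "allow": [
        {"path": "docs/i18n/**", "reason": "localized documentation"},
        {
            "identifier": "na\u00efve_bayes",
            "codepoints": ["U+00EF"],
            "reason": "historical exported API spelling",
        },
    ],
}


def is_allowed(path, identifier, char, allowlist=ALLOWLIST):
    """Return True if some allowlist entry covers this character (assumed semantics)."""
    codepoint = f"U+{ord(char):04X}"
    for entry in allowlist["allow"]:
        # Path entries: treated here as glob patterns over repository paths.
        if "path" in entry and fnmatchcase(path, entry["path"]):
            return True
        # Identifier entries: exact identifier match, limited to listed code points.
        if (
            "identifier" in entry
            and entry["identifier"] == identifier
            and codepoint in entry.get("codepoints", [])
        ):
            return True
    return False
```

Restricting identifier entries to specific code points keeps an allowlisted name from silently admitting other non-ASCII characters later.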
Output
Reports show file, line, column, Unicode code point, character name, and remediation. Bidi and zero-width controls are always HIGH severity.
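A finding with the fields listed above could be modeled as follows. This is a minimal sketch under stated assumptions: the `Finding` class, the `scan_text` function, the "REVIEW" severity label, and the remediation strings are all hypothetical, and columns are counted in code points starting at 1. Only the HIGH rule for bidi and zero-width controls comes from the text.

```python
import unicodedata
from dataclasses import dataclass

# Ranges from the detection targets; both groups are always HIGH.
ALWAYS_HIGH = (
    set(range(0x202A, 0x202F))
    | set(range(0x2066, 0x206A))
    | set(range(0x200B, 0x2010))
    | {0xFEFF}
)


@dataclass
class Finding:
    file: str
    line: int
    column: int
    codepoint: str
    name: str
    severity: str
    remediation: str


def scan_text(path, text):
    """Yield a Finding for every non-ASCII character (1-based line and column)."""
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ch.isascii():
                continue
            high = ord(ch) in ALWAYS_HIGH
            yield Finding(
                file=path,
                line=line_no,
                column=col_no,
                codepoint=f"U+{ord(ch):04X}",
                name=unicodedata.name(ch, "<unnamed>"),
                # "REVIEW" is a placeholder; the text only fixes the HIGH cases.
                severity="HIGH" if high else "REVIEW",
                remediation=(
                    "Remove the invisible control character."
                    if high
                    else "Replace with ASCII or add an allowlist entry."
                ),
            )
```

For example, `list(scan_text("src/auth.py", "if admin\u202e:"))` yields one HIGH finding for U+202E at line 1, column 9.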
References
- agentic/code/frameworks/security-engineering/rules/no-confusable-unicode.md
- Unicode TR39
- Trojan Source / CVE-2021-42574