Confusable Unicode Audit

Detect Trojan Source and homoglyph risks in source files, dependency names, and release metadata. The audit enforces the no-confusable-unicode rule and maps curl Practice 8 into an AIWG control.

Detection Targets

Bidirectional controls: U+202A through U+202E, U+2066 through U+2069.
Zero-width and invisible formatting characters: U+200B through U+200F (this range includes the directional marks U+200E and U+200F), U+FEFF.
Non-ASCII identifiers in source code.
Mixed-script identifiers, especially Latin plus Cyrillic or Greek.
Package/dependency names containing non-ASCII or confusable characters.
Optional metadata scan: commit subjects, PR titles, release notes.
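
The character-level targets above can be sketched in a few lines of Python. This is a minimal illustration, not the audit's actual implementation: the function names scan_text, scripts_of, and is_mixed_script are hypothetical, and the script check is a heuristic that reads the first word of each character's Unicode name because the standard library does not expose the Script property. Columns count code points, not bytes.

```python
import unicodedata

# Ranges copied from the detection targets listed above.
BIDI_CONTROLS = set(range(0x202A, 0x202E + 1)) | set(range(0x2066, 0x2069 + 1))
ZERO_WIDTH = set(range(0x200B, 0x200F + 1)) | {0xFEFF}


def scan_text(text):
    """Return (line, column, code point, kind) for each suspicious character."""
    findings = []
    # Split on "\n" only, so Unicode line separators are scanned, not consumed.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            cp = ord(ch)
            if cp in BIDI_CONTROLS:
                kind = "bidi-control"
            elif cp in ZERO_WIDTH:
                kind = "zero-width"
            elif cp > 0x7F:
                kind = "non-ascii"
            else:
                continue
            findings.append((line_no, col, f"U+{cp:04X}", kind))
    return findings


def scripts_of(identifier):
    """Heuristic: treat the first word of a letter's Unicode name as its script."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in identifier
        if ch.isalpha()
    }


def is_mixed_script(identifier):
    """True when the letters of an identifier come from more than one script."""
    return len(scripts_of(identifier)) > 1
```

For example, is_mixed_script("p\u0430ypal") is true because U+0430 is a Cyrillic letter that renders like the Latin "a". A production scanner would more likely rely on the script and confusable data defined by Unicode TR39 than on character names.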

Allowlist

Legitimate non-ASCII is declared in .aiwg/security/confusable-unicode-allowlist.yaml:

version: 1
allow:
  - path: "docs/i18n/**"
    reason: "localized documentation"
  - identifier: "naïve_bayes"
    codepoints: ["U+00EF"]
    reason: "historical exported API spelling"
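
One way a scanner might consult the allowlist is sketched below. It assumes the YAML has already been parsed into a dict (for instance by a YAML library) and that path entries use shell-style globs in which `**` matches across directory separators; the source does not specify either point. The function name is_allowed is hypothetical.

```python
from fnmatch import fnmatchcase

# A dict shaped like the parsed allowlist file (assumed structure).
ALLOWLIST = {
    "version": 1,
    "allow": [
        {"path": "docs/i18n/**", "reason": "localized documentation"},
        {
            "identifier": "na\u00efve_bayes",  # contains U+00EF
            "codepoints": ["U+00EF"],
            "reason": "historical exported API spelling",
        },
    ],
}


def is_allowed(allowlist, path, identifier=None, codepoint=None):
    """Return True if a finding is covered by a path or identifier entry."""
    for entry in allowlist.get("allow", []):
        # Path entries exempt every finding in matching files.
        if "path" in entry and fnmatchcase(path, entry["path"]):
            return True
        # Identifier entries exempt only the listed code points.
        if (
            "identifier" in entry
            and entry["identifier"] == identifier
            and codepoint in entry.get("codepoints", [])
        ):
            return True
    return False
```

Scoping identifier entries to explicit code points keeps an exemption narrow: the same identifier carrying any other non-ASCII character would still be reported.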

Output

Reports show file, line, column, Unicode code point, character name, and remediation. Bidi and zero-width controls are always HIGH severity.
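
A report line carrying those fields could be assembled as follows. This is a sketch under stated assumptions: the single-line layout, the "REVIEW" label for findings outside the always-HIGH set, and the remediation wording are all hypothetical, since the source fixes only the fields and the HIGH severity for bidi and zero-width controls.

```python
import unicodedata

# Code points that are always HIGH severity per the rule above.
ALWAYS_HIGH = (
    set(range(0x202A, 0x202E + 1))
    | set(range(0x2066, 0x2069 + 1))
    | set(range(0x200B, 0x200F + 1))
    | {0xFEFF}
)


def format_finding(path, line, col, ch):
    """Render one finding as 'file:line:col: SEVERITY U+XXXX NAME; remediation'."""
    cp = ord(ch)
    # Some code points have no name; fall back to a placeholder.
    name = unicodedata.name(ch, "<unnamed>")
    if cp in ALWAYS_HIGH:
        severity = "HIGH"
        remediation = "remove the character"  # hypothetical wording
    else:
        severity = "REVIEW"  # hypothetical label for other findings
        remediation = "replace with ASCII or add an allowlist entry"
    return f"{path}:{line}:{col}: {severity} U+{cp:04X} {name}; {remediation}"
```

Printing the character name alongside the code point matters here because the offending character is, by definition, invisible or visually ambiguous in the source line itself.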

References

agentic/code/frameworks/security-engineering/rules/no-confusable-unicode.md
Unicode TR39 (UTS #39, Unicode Security Mechanisms)
Trojan Source / CVE-2021-42574