ADR-015: Multi-Language AST Analysis via Tree-sitter Parser Registry

Status

Accepted (proposed 2026-04-30; extends ADR-002)

Date

2026-04-30

Context

ADR-002 establishes a multi-layered repository analysis engine. In practice, src/utils/ast-analyzer.ts only fully exercises TypeScript and JavaScript via @typescript-eslint/typescript-estree, even though package.json declares tree-sitter-python, tree-sitter-go, tree-sitter-rust, tree-sitter-java, tree-sitter-ruby, tree-sitter-bash, and tree-sitter-yaml as runtime dependencies. CLAUDE.md explicitly flags this gap: "Tree-sitter initialization currently simplified (focuses on TypeScript/JavaScript)".

Drift detection on polyglot repos is therefore unreliable, and a key promise of the project (cross-language documentation drift) is not actually delivered for non-TS/JS code. Compounding this, ADR-014 removes the LLM-based semantic analyzer that previously masked some of this gap, so the AST layer now needs to carry the full weight of language coverage. Issue #112 tracks the implementation work.

Decision

Adopt a lazy-loaded parser registry pattern.

Define a ParserRegistry type in src/utils/ast-analyzer.ts mapping file extensions to factory functions returning a tree-sitter Parser instance:

const registry: ParserRegistry = {
  ".py": () => loadParser("tree-sitter-python"),
  ".go": () => loadParser("tree-sitter-go"),
  ".rs": () => loadParser("tree-sitter-rust"),
  ".java": () => loadParser("tree-sitter-java"),
  ".rb": () => loadParser("tree-sitter-ruby"),
  ".sh": () => loadParser("tree-sitter-bash"),
  ".yaml": () => loadParser("tree-sitter-yaml"),
  ".yml": () => loadParser("tree-sitter-yaml"),
};

Parsers are loaded on first use and cached for the process lifetime.
Each parser exposes a uniform extractSignatures(source) -> SignatureSet API, where SignatureSet has the same shape as today's TS/JS output (functions, classes, exports, imports, complexity).
Unknown extensions fall back to a structural-only path (file size, line count, no signature extraction).
The drift detector in src/utils/drift-detector.ts consumes SignatureSet uniformly, regardless of source language, so adding a language is a single-file PR with no downstream changes.
Test fixtures live in tests/fixtures/multi-lang/{python,go,rust,...} so each language has a representative example exercised by the suite.
A performance benchmark is added to ensure parsing a 500-file polyglot repo completes in under 30 seconds in CI.

Alternatives Considered

Eagerly load all parsers at startup

Pros: Simpler initialization; no first-use latency.
Cons: Adds non-trivial startup latency for users whose repos are single-language, which is the common case.
Decision: Rejected.

Use a single multi-language parser like Semgrep or Sourcegraph's Comby

Pros: One dependency instead of seven.
Cons: Each adds a heavy native dependency and mismatches the existing tree-sitter investment in package.json. Tree-sitter is also closer to language-native AST shape, which yields more accurate signatures.
Decision: Rejected.

Spawn an external parsing service per language

Pros: Process isolation; language-native parsers can be reused.
Cons: Complicates deployment, breaks the stateless server model, and adds IPC overhead disproportionate to the value.
Decision: Rejected.

Defer multi-language support indefinitely; document TS/JS-only as a feature gap

Pros: Lowest immediate effort.
Cons: Polyglot drift detection is core to the documcp value proposition, and ADR-014 has just removed the LLM fallback that was implicitly compensating.
Decision: Rejected.

Consequences

Positive

Drift detection works across all declared languages.
Adding a language becomes a small, well-defined contribution (good first issue territory).
Slight startup-time win because parsers are not loaded eagerly.
Signature extraction becomes language-agnostic at the call site.
AST coverage compensates for the removal of LLM-based semantic analysis (ADR-014).

Negative

Signature extraction quality varies across languages because tree-sitter grammars differ in maturity and completeness.
Some idioms (e.g., decorators in Python, generics in Rust) need language-specific handling that complicates the uniform API.
The SignatureSet type may need optional language-specific extensions over time.

Risks and Mitigations

Risk: tree-sitter native bindings may fail to install on uncommon platforms. Mitigation: Graceful fallback to structural-only mode and clear error messaging.
Risk: Parser version drift across the seven tree-sitter-* packages. Mitigation: Pinned versions in package.json and a Dependabot allowlist (see issue #117).
Risk: Memory growth from caching seven parsers in long-running servers. Mitigation: Lazy loading (parsers only cached after first use) and a documented optional release/reset hook for hosts that care.

Implementation Tracking

#112 Harden tree-sitter parser support beyond TS/JS — milestone v0.7.0 — Phase 3 Completion.

Evidence

package.json runtime dependencies declare tree-sitter-bash, tree-sitter-go, tree-sitter-java, tree-sitter-javascript, tree-sitter-python, tree-sitter-ruby, tree-sitter-rust, tree-sitter-typescript, tree-sitter-yaml, and web-tree-sitter; only TS/JS are exercised today.
CLAUDE.md, Phase 3 limitations: "Tree-sitter initialization currently simplified (focuses on TypeScript/JavaScript)".
src/utils/ast-analyzer.ts: imports @typescript-eslint/typescript-estree and uses it as the primary path; tree-sitter integration is stubbed.
src/utils/drift-detector.ts: consumes whatever ast-analyzer returns and would benefit directly from broader language coverage.
ADR-014: removes the LLM hybrid path that previously compensated for missing language coverage.
tree-sitter project documentation: reference for grammar maturity (TS/JS/Python/Go are stable; Rust/Java/Ruby/Bash are usable; YAML is auxiliary).

Status​

Date​

Context​

Decision​

Alternatives Considered​

Eagerly load all parsers at startup​

Use a single multi-language parser like Semgrep or Sourcegraph's Comby​

Spawn an external parsing service per language​

Defer multi-language support indefinitely; document TS/JS-only as a feature gap​

Consequences​

Positive​

Negative​

Risks and Mitigations​

Implementation Tracking​

Evidence​

Related Decisions​