Skip to content

fix(ocaml): use LANGUAGE_OCAML_INTERFACE grammar for .mli files#730

Merged
carlos-alm merged 3 commits intomainfrom
fix/ocaml-interface-grammar
Apr 1, 2026
Merged

fix(ocaml): use LANGUAGE_OCAML_INTERFACE grammar for .mli files#730
carlos-alm merged 3 commits intomainfrom
fix/ocaml-interface-grammar

Conversation

@carlos-alm
Copy link
Copy Markdown
Contributor

Summary

  • Separate .mli from .ml parsing: .mli files now use the dedicated LANGUAGE_OCAML_INTERFACE grammar instead of LANGUAGE_OCAML, which was producing error-polluted ASTs and missing interface-specific declarations
  • Both engines updated: Rust native engine gets OcamlInterface variant; WASM engine gets ocaml-interface registry entry with tree-sitter-ocaml_interface.wasm
  • Interface-specific extractors: Both TS and Rust extractors now handle value_specification (val), external, module_type_definition, and exception_definition nodes — shared with the existing OCaml extractor since most node types overlap
  • Smart val classification: val name : type declarations are classified as function when the type contains -> (function_type), otherwise variable

Test plan

  • All 6 existing .ml parser tests pass unchanged
  • 7 new .mli parser tests added and passing:
    • val specs as functions (arrow type) and variables (non-arrow)
    • external declarations as functions
    • module type definitions as interfaces
    • exception definitions as types
    • type definitions in interfaces
    • open statements in interfaces
  • CI: Rust compilation on Linux (Windows linker issue prevents local Rust build)
  • Verify WASM grammar builds in CI via build-wasm.ts

Closes #720

OCaml interface files (.mli) use a distinct grammar that lacks expression
syntax. Parsing them with LANGUAGE_OCAML produces error-polluted ASTs and
misses interface declarations (val, external, module type, exception).

Rust native engine:
- Add OcamlInterface variant to LanguageKind
- Map .mli to OcamlInterface, use LANGUAGE_OCAML_INTERFACE
- Route to shared OCaml extractor with interface-specific handlers

WASM engine:
- Build tree-sitter-ocaml_interface.wasm from grammars/interface
- Add ocaml-interface registry entry for .mli extension
- Extend TS extractor for value_specification, external,
  module_type_definition, exception_definition nodes

Closes #720
@claude
Copy link
Copy Markdown

claude bot commented Apr 1, 2026

Claude finished @carlos-alm's task —— View job


PR Review: OCaml Interface Grammar Support

Todo List:

  • Phase 1: Root cause analysis - understand the problem being solved
  • Phase 2: Configuration change detection
  • Phase 3: Test integrity review
  • Phase 4: Code quality assessment
  • Phase 5: Backlog compliance evaluation
  • Provide final structured assessment

Analysis Summary:

Problem Being Solved: .mli (OCaml interface) files were being parsed with the generic LANGUAGE_OCAML grammar, which is designed for implementation files (.ml). This was producing error-polluted ASTs and missing interface-specific declarations like val specifications, external declarations, module type definitions, and exception definitions.

Phase 1: Root Cause Analysis

  • Real Problem: This addresses a genuine parsing accuracy issue. OCaml interface files have different syntax constructs than implementation files.
  • Justified Approach: Using a dedicated LANGUAGE_OCAML_INTERFACE grammar is the correct solution.
  • Not a Workaround: This is a proper fix, not hiding symptoms.

Phase 2: Configuration Changes

  • No linting rules disabled or relaxed
  • No test timeouts increased or coverage thresholds lowered
  • No build configuration weakening detected
  • Clean addition of new language support

Phase 3: Test Integrity

  • All 6 existing .ml parser tests preserved unchanged
  • 7 new .mli interface tests added (not weakened)
  • Tests are comprehensive and properly validate the new functionality
  • No test assertions weakened or removed

Phase 4: Code Quality

  • Both engines updated consistently (Rust + WASM)
  • Smart val classification based on type signatures
  • Proper interface-specific node handlers
  • Good code reuse between .ml and .mli parsing

Root Cause Analysis

Problem Being Solved: OCaml interface files (.mli) contain different syntax constructs than implementation files (.ml). Using the wrong grammar produced error-polluted ASTs and missed interface-specific declarations.

Why This Approach: Using dedicated LANGUAGE_OCAML_INTERFACE grammar is the architecturally correct solution - aligns with tree-sitter's design for language-specific grammars.

Risk Assessment: Low risk. This is an additive change that doesn't affect existing .ml parsing and uses established patterns.

Backlog Compliance

  • Zero-dep: ✅ No new runtime dependencies - builds on existing tree-sitter infrastructure
  • Foundation-aligned: ✅ Fully complies with all principles - improves accuracy without changing core design
  • Problem-fit: 4/5 - Directly improves AI agent accuracy by providing clean ASTs and catching interface-specific symbols that were previously missed
  • Breaking: No - purely additive, existing .ml files unaffected
  • Tier: 1 (zero-dep ✅ + foundation-aligned ✅)

Critical Concerns

None. This is a well-implemented language accuracy improvement.

Minor Notes

  • hasDescendantType recursive traversal could be expensive for deeply nested types (acceptable for typical OCaml)
  • Would benefit from CI confirmation of Rust compilation (Windows linker issue prevents local verification)

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE WITH CONDITIONS
  • Reasoning: Solid implementation that directly improves parsing accuracy for OCaml interface files. Well-tested, follows established patterns, no architectural concerns. Minor deduction only because CI verification of Rust compilation is pending.

Conditions for approval: Verify CI build passes, confirming Rust compilation succeeds and WASM grammar builds correctly.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 1, 2026

Greptile Summary

This PR correctly fixes OCaml .mli interface file parsing by routing them to the dedicated LANGUAGE_OCAML_INTERFACE tree-sitter grammar instead of re-using LANGUAGE_OCAML, which was producing error-polluted ASTs. The fix is applied consistently across both the Rust native engine and the WASM engine. Previous review comments about comment accuracy and exception aliases are both addressed in this revision.

  • .mli extension split off into its own ocaml-interface registry entry in both parser.ts and parser_registry.rs, each pointing to the correct grammar
  • tree-sitter-ocaml_interface.wasm build entry added to build-wasm.ts using the grammars/interface sub-path from tree-sitter-ocaml
  • val name : type specs are classified as function/variable based on whether the type contains a function_type descendant — works correctly for top-level function types but can over-match for values whose types merely contain an arrow (e.g. val callbacks : (event -> unit) list)
  • 11 new tests cover all new node handlers across both .ml and .mli grammars, including the exception alias fallback path

Confidence Score: 5/5

Safe to merge — all changes are additive, well-tested, and the only remaining finding is a P2 edge-case misclassification in the function-type heuristic.

All P0/P1 concerns from prior rounds are resolved. The single remaining note is a P2 edge case where hasDescendantType may over-classify a value like val callbacks : (event -> unit) list as a function. This does not break parsing, cause data loss, or affect the primary use-case. Per the confidence guidelines, a P2-only finding does not reduce the score below 5.

Both src/extractors/ocaml.ts and crates/codegraph-core/src/extractors/ocaml.rs share the hasDescendantType/has_descendant_kind heuristic worth revisiting for higher classification precision.

Important Files Changed

Filename Overview
crates/codegraph-core/src/extractors/mod.rs Routes OcamlInterface to the same OcamlExtractor as Ocaml via an OR pattern — correct and minimal change.
crates/codegraph-core/src/extractors/ocaml.rs Adds four new node handlers plus has_descendant_kind; the descendant-search for function_type can over-classify values-of-function-type (e.g. callback lists) as functions.
crates/codegraph-core/src/parser_registry.rs Adds OcamlInterface variant, routes .mli extension to it, and wires LANGUAGE_OCAML_INTERFACE — correct and complete.
scripts/build-wasm.ts Adds tree-sitter-ocaml_interface.wasm build entry pointing to grammars/interface sub-path — name matches the reference in parser.ts.
src/domain/parser.ts Splits .mli from .ml into a separate ocaml-interface registry entry with the correct WASM file and reuses extractOCamlSymbols.
src/extractors/ocaml.ts Adds four TS handler functions mirroring the Rust side; hasDescendantType may over-classify values whose types merely contain an arrow as functions.
src/types.ts Adds ocaml-interface to the LanguageId union type — straightforward, no issues.
tests/parsers/ocaml.test.ts Adds 4 new .ml tests and 7 new .mli tests covering all new handlers including the exception alias fallback path.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[File input] --> B{Extension?}
    B -- ".ml" --> C[LanguageKind::Ocaml\nocaml registry entry\ntree-sitter-ocaml.wasm]
    B -- ".mli" --> D[LanguageKind::OcamlInterface\nocaml-interface registry entry\ntree-sitter-ocaml_interface.wasm]
    C --> E[OcamlExtractor / extractOCamlSymbols]
    D --> E
    E --> F{Node type?}
    F -- value_definition/let_binding --> G[function / variable]
    F -- type_definition --> H[type]
    F -- module_definition --> I[module]
    F -- open_module --> K[import]
    F -- value_specification --> L{Outer type is\nfunction_type?}
    L -- yes --> M[function]
    L -- no --> N[variable]
    F -- external --> O[function]
    F -- module_type_definition --> P[interface]
    F -- exception_definition --> Q[type]
Loading

Reviews (2): Last reviewed commit: "fix: add ocaml-interface to LanguageId t..." | Re-trigger Greptile

Comment on lines +53 to +65
// Interface-specific (.mli) node types
case 'value_specification':
handleOCamlValueSpec(node, ctx);
break;
case 'external':
handleOCamlExternal(node, ctx);
break;
case 'module_type_definition':
handleOCamlModuleTypeDef(node, ctx);
break;
case 'exception_definition':
handleOCamlExceptionDef(node, ctx);
break;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Misleading "interface-specific" comment — handlers apply to .ml too

The comment // Interface-specific (.mli) node types is inaccurate. All four node types (external, exception_definition, module_type_definition, value_specification) are also valid in regular .ml files:

  • external length : string -> int = "caml_ml_string_length" is extremely common in .ml files for C FFI
  • exception My_error of string appears in .ml implementations
  • module type S = sig ... end can be defined inside .ml files

Because both LanguageKind::Ocaml and LanguageKind::OcamlInterface route to the same extractor (see mod.rs), these handlers will fire for .ml files too — which is actually a beneficial side-effect, but the comment implies they are unreachable for .ml inputs. The same misleading comment appears in the Rust extractor at crates/codegraph-core/src/extractors/ocaml.rs:27.

Consider updating the comment to reflect that these node types are now handled for both grammars:

Suggested change
// Interface-specific (.mli) node types
case 'value_specification':
handleOCamlValueSpec(node, ctx);
break;
case 'external':
handleOCamlExternal(node, ctx);
break;
case 'module_type_definition':
handleOCamlModuleTypeDef(node, ctx);
break;
case 'exception_definition':
handleOCamlExceptionDef(node, ctx);
break;
// Shared node types present in both .ml and .mli files
case 'value_specification':
handleOCamlValueSpec(node, ctx);
break;
case 'external':
handleOCamlExternal(node, ctx);
break;
case 'module_type_definition':
handleOCamlModuleTypeDef(node, ctx);
break;
case 'exception_definition':
handleOCamlExceptionDef(node, ctx);
break;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in fd2a1fe. Updated the comment to "Shared node types present in both .ml and .mli files" in both the TS extractor (src/extractors/ocaml.ts) and the Rust extractor (crates/codegraph-core/src/extractors/ocaml.rs). Also added .ml test coverage for external, exception, and module type definitions to document this beneficial side-effect.

Comment on lines +297 to +309
function handleOCamlExceptionDef(node: TreeSitterNode, ctx: ExtractorOutput): void {
const ctorDecl = findChild(node, 'constructor_declaration');
if (!ctorDecl) return;
const nameNode = findChild(ctorDecl, 'constructor_name');
if (!nameNode) return;

ctx.definitions.push({
name: nameNode.text,
kind: 'type',
line: node.startPosition.row + 1,
endLine: nodeEndLine(node),
});
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Exception aliases silently skipped

For the exception aliasing syntax exception Foo = Bar, the tree-sitter-ocaml grammar produces an exception_definition node whose first payload is a constructor_path (pointing to Bar), not a constructor_declaration. The handler returns early if findChild(node, 'constructor_declaration') fails, so aliased exceptions are silently dropped.

This is an acceptable limitation for a first pass, but worth noting. A more robust approach would also fall back to extracting the name directly from a constructor_name child on the exception_definition node itself:

function handleOCamlExceptionDef(node: TreeSitterNode, ctx: ExtractorOutput): void {
  // Standard: `exception Foo of bar`
  const ctorDecl = findChild(node, 'constructor_declaration');
  const nameNode = ctorDecl
    ? findChild(ctorDecl, 'constructor_name')
    : findChild(node, 'constructor_name'); // fallback for `exception Foo = Bar`
  if (!nameNode) return;

  ctx.definitions.push({
    name: nameNode.text,
    kind: 'type',
    line: node.startPosition.row + 1,
    endLine: nodeEndLine(node),
  });
}

The same pattern applies to the Rust counterpart at crates/codegraph-core/src/extractors/ocaml.rs:278-295.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in fd2a1fe. Updated both the TS and Rust extractors to fall back to constructor_name directly on the exception_definition node when constructor_declaration is not found. This handles the exception Foo = Bar alias syntax. Added a test case for exception aliases in the .ml test suite as well.

- Add 'ocaml-interface' to LanguageId union in types.ts (fixes TS2322)
- Update misleading "Interface-specific" comments to "Shared node types
  present in both .ml and .mli files" in both TS and Rust extractors
- Handle exception alias syntax (exception Foo = Bar) by falling back
  to constructor_name on the exception_definition node directly
- Add .ml test coverage for external, exception, exception alias, and
  module type definitions that now fire for .ml files too
@carlos-alm
Copy link
Copy Markdown
Contributor Author

Addressed all Greptile review feedback in fd2a1fe:

  1. Misleading comment (P2): Updated "Interface-specific (.mli) node types" to "Shared node types present in both .ml and .mli files" in both TS and Rust extractors.

  2. Exception aliases silently skipped (P2): Added fallback to constructor_name on the exception_definition node for exception Foo = Bar alias syntax, in both TS and Rust extractors.

  3. Missing .ml test coverage (outside-diff P2): Added 4 new test cases to the OCaml parser describe block covering external, exception, exception alias, and module type definitions in .ml files.

  4. TS2322 type error (CI blocker): Added 'ocaml-interface' to the LanguageId union type in src/types.ts.

@carlos-alm
Copy link
Copy Markdown
Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit 5fa7e7e into main Apr 1, 2026
18 checks passed
@carlos-alm carlos-alm deleted the fix/ocaml-interface-grammar branch April 1, 2026 07:51
@github-actions github-actions bot locked and limited conversation to collaborators Apr 1, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(ocaml): use LANGUAGE_OCAML_INTERFACE grammar for .mli files

1 participant