Overview

Experimental support for Dart grammar for tree-sitter (code).

Overview

Unlike the existing tree-sitter-dart project, this one is fully automatically converted from the Dart spec grammar, defined using ANTLR (currenly based on v0.40, copied from commit).

This has both advantages:

Better coverage, in theory we should support as much Dart as the spec parser (but see Known Issues below)
Once the language evolves and the spec parser is updated, we could simply re-generate

and disadvatanges:

ANTLR is LL($\infty$), while tree-sitter is GLR, that means the grammar is structured in a very differnt way, and direct translation both stretches tree-sitter generation and results in a not very efficient parser. While efficience is not a huge problem for us at this point (and the generated parser is still way more efficient than the one generated by ANTLR), it seems that the "not-quite-LRish" converted grammar triggers a bug(?) in tree-sitter, causing a valid code to be not recognized. See Known Issues section.
The spec parser is not really intended to have any parse tree consumers, so the grammar is structured in a way producing deeply nested trees even for simple syntactic constructions. This makes the produced parse tree not very useful for editors, since there is too much noise. Fortunately tree-sitter provide two ways to hide the intermediate nodes. Unfortuntely I don't see a way to deduce what to hide from the spec grammar automatically. The current plan is to have a human maintained list of visible rules.
Some tree-sitter concepts, like node supertypes and node fields don't have counterparts in the spec grammar, so there is nothing to convert automatically, so if we want to add those, we would have to also supply this information to the coverter as an additional input.

Evaluation

As an evaluation, I've tried parsing all Dart files in the Dart SDK repo (commit).

SDK is good in the sense that there are lots of test files that specifically test various syntax pecularities.

Methodology

NOTE: I only compare the binary SUCCESS vs FAILED results. That's obviously not very precise: it can be the case that we parse some file incorrectly, but there is still no error. This won't be detected.

Getting results for the new parser

I've found all Dart files using

ag -g '.*\.dart$' dart-sdk-path

Then I ran

tree-sitter parse -q -s -t sdk-sources.lst

Baseline for the comparison

It's most natural to compare the new parser with the spec parser, that was used as a source.

Additionally I also test the new parser vs the actual Dart implementation, to be some absolute numbers.

Results

The new parser vs the spec parser


Total number of files	46007
Number of cases that failed with the converted parser, but passed with the spec parser	32
Number of cases that passed with the converted parser, but failed with the spec parser	570
Total number of cases where the converted parser and the spec parser disagree	602
False negatives in the converted parser vs the spec parser	0.070%
False positives in the converted parser vs the spec parser	1.239%
Precision the converted parser vs the spec parser	98.692%

The overall precision of 98.69% is not super exciting, I'd hope to get much closer to 100% using the automatic conversion. But fortunately for the most applications of tree-sitter false positives don't matter that much: it's probably ok to parse more than the actual implementation, as long as correct code is parsed correctly. And if we only take false negatives into account, we get exciting 99.93% precision.

I've analyzed most of false negatives and found 2 issues, explaining them. See Known Issues.

The new parser vs the actual implementation

NOTE: the spec parser used in the conversion does support augment keyword related syntax, but doesn't support macro. But in the actual implementation, both are gated by the macros language feature, so either both are enabled, or both are disabled. So we could either increase the false negatives number by enabling the feature or increase the false positives number by keeping the feature disabled. I've opted into keeping it disabled, as I'm more interested in false negative cases.


Total number of cases	46007
Number of cases that failed with the converted parser, but passed with the reference parser	231
Number of cases that passed with the converted parser, but failed with the reference parser	856
Total number of cases where the converted parser and the reference parser disagree	1087
False negatives in the converted parser vs the reference parser	0.502%
False positives in the converted parser vs the reference parser	1.861%
Precision the converted parser vs the reference parser	97.637%

Known Issues

Record literals vs types vs patterns

There are a number of cases there record literals are not parsed correctly. This issue cause nearly half of false negative cases.

Examples:

Record literals in expression statements:
```
f() {
  (x,y);
}
```
(x,y) is parsed as a type here, so and indentifier is expected.
Record literals as a function argument:
```
var v = f((x,y));
```
Here (x,y) is parsed as a record pattern, so the whole f((x,y)) becomes an object pattern.
Record literals inside list literals:
```
var v = [(x,y)];
```
Here again (x,y) is parsed as a record pattern, making the whole thing list pattern.

The reason is highly overloaded syntax for records: (x,y) can be a record type, a record literal or a record patter. Normally a GLR parser should resolve these conflicts, but apparently tree-sitter doesn't. This looks to be a bug in tree-sitter.

I've filed a bug for tree-sitter for this issue.

`//` inside multi-line comment

Single line comment // inside multi-line comment can hide /* or */. In Dart // has no special meaning inside a multi-line comment, but because comments are implemented using the tree-sitter's extras feature and extras are recursive in tree-sitter, it actually allows single line comments to appear inside mult-line ones, so

/* // /*
*/ */

is not parsed correctly (/* on the first line gets swallowed by the single line comment rule).

This issue seems to be causing the other half of false negatives.

I've filed a FR asking to make extras non-recursive.

Spurious identifiers

Dart is pretty picky wrt which identifier is allowed in which place. For example, as is fine as a variable name, but not as a type name. This leads to somewhat complicated grammar rules for identifiers and certainly contributes to the issue with record literals.

Technically in ANTLR this is achieved by defining an IDENTIFIER token to be very generic and relying on more specific keywords/reserved words/contextual keywords tokens to win over it. Such that a string "as" is always tokenized as AS and never as IDENTIFIER. On top of that, there are these complex identifier parser rules, that regulate which keyword tokens can appear where.

And the translator is currently translating this to tree-sitter syntax as is. The good thing is there are similar token priority rules in tree-sitter, so AS token also always wins against the IDENTIFIER. But tree-sitter does context-sensitive lexing (which is generally great, but bites us here), so in places where the grammar doesn't expect AS token, it is not even matched, so "as" is tokenized as IDENTIFIER. This leads, for example, to things like

class as {}

being parsed without any error.

This only causes false positives, so probably not a big deal. I even wonder if this should be a syntax error at all.

Comparison to the exisiting tree-sitter-dart

The existing tree-sitter-dart repo provides hand-written tree-sitter grammar for Dart. As a result, produced parse trees are much nicer, so for example for the editor use case, it would work better than the converted grammar (at least until we hide the intermediate rules). It also supports most of Dart (I can only see extension types, which are missing, but should be easy to add).

I did the same testing as in the Evaluation section, running it on all Dart files in the Dart SDK and comparing the results with the actual implementation.


Total number of cases	46007
Number of cases that failed with the existing parser, but passed with the reference parser	2368
Number of cases that passed with the existing parser, but failed with the reference parser	889
Total number of cases where the existing parser and the reference parser disagree	3257
False negatives in the existing parser vs the reference parser	5.147%
False positives in the existing parser vs the reference parser	1.932%
Precision the existing parser vs the reference parser	92.921%

Precision of almost 93% is actually quite good, but slightly worse than 97.6% we got with the converted parser. Also, somehow, here failures are shifted towards false negatives, unlike the converted parser.

And if we compare the false negatives rate (that's what we actually care about, since that's the case where the user gave us correct code and we return an error), we get a 10x difference: 0.5% in the converted parser vs 5.1% in the existing parser.

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
docs		docs
evaluation_results		evaluation_results
test/corpus		test/corpus
.gitignore		.gitignore
grammar.js		grammar.js

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Overview

Evaluation

Methodology

Getting results for the new parser

Baseline for the comparison

Results

The new parser vs the spec parser

The new parser vs the actual implementation

Known Issues

Record literals vs types vs patterns

`//` inside multi-line comment

Spurious identifiers

Comparison to the exisiting tree-sitter-dart

About

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Overview

Evaluation

Methodology

Getting results for the new parser

Baseline for the comparison

Results

The new parser vs the spec parser

The new parser vs the actual implementation

Known Issues

Record literals vs types vs patterns

// inside multi-line comment

Spurious identifiers

Comparison to the exisiting tree-sitter-dart

About

Resources

Uh oh!

Stars

Watchers

Forks

Uh oh!

Contributors

Uh oh!

Languages

`//` inside multi-line comment