Integrate purescript-cst#3608
Conversation

Note, I have not done anything with the old parser yet (other than removing the …

I have updated this PR with an overview.

Yes, there's actually a TODO in …

The writeup is fantastic. I think it would be good to put most of what you've written here into the relevant modules as comments - perhaps with the exception of comparisons to the current parser, since they won't be relevant for very much longer.

I have added several new tests and added module comments where appropriate.

I think it would be good to have this code in …

Sounds good to me.
Fixes #3555
I will work on a full write-up of all the user-facing changes and design considerations, but I figured I'd get this code out sooner rather than later.
Breaking Changes
- `\n`, `\r`, and ASCII space are the only allowed whitespace (though anything is allowed in raw strings).
- `\n`, `\r`, `\t`, `\'`, `\"`, `\\`, `\x[0-9a-fA-F]{1,6}` (unicode hex escape), and `\[\r\n ]+\` (gap escapes) are the only allowed string escapes.
- `\` is no longer a valid operator.
- `@` is no longer a valid operator.
- `forall` is no longer a valid expression-level identifier.
- `@`-binders take an atomic binder (`a@Foo b` must be `a@(Foo b)`).
- Kind annotations on type variable bindings must be parenthesized (`a :: Type -> Type b :: Type` must now be `(a :: Type -> Type) (b :: Type)`).
- `::` has lowest precedence, rather than sitting between operators and function application.

Language Enhancements
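To make the string-escape rule concrete, here is a small sketch in Haskell of what the lexer accepts after a backslash. This is purely illustrative — the function name and shape are hypothetical, not the PR's actual lexer code:

```haskell
import Data.Char (isHexDigit)

-- Hypothetical sketch (not the PR's lexer): does the input that follows
-- a backslash in a string literal begin a valid escape?
isValidEscape :: String -> Bool
isValidEscape (c : _)
  | c `elem` "nrt'\"\\" = True                 -- \n \r \t \' \" \\
isValidEscape ('x' : rest) =                   -- \x with 1 to 6 hex digits
  let n = length (take 6 (takeWhile isHexDigit rest))
  in n >= 1
isValidEscape (c : rest)
  | c `elem` " \r\n" =                         -- gap escape: \<whitespace>+\
      case dropWhile (`elem` " \r\n") rest of
        '\\' : _ -> True
        _        -> False
isValidEscape _ = False
```

Anything else after a backslash — for example `\q` — would be a lex error under this rule.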
I went ahead and added these, but we don't have to include them. I think they are useful (and fairly conservative).
- `where` is now allowed in `case` branches, and it behaves the same as in declarations. Note that this does not change where they are bound: `where` is still just sugar for `let`, which means that `where` is always bound to the right-hand side of a single `=` or case `->`.
- `_` is now allowed in numeric literals, and is an ignored character (as in Rust and Swift).

Overview
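As a rough illustration of the underscore rule, ignoring `_` amounts to filtering it out before reading the digits. The helper below is hypothetical, not the PR's code, and handles only decimal integers:

```haskell
import Data.Char (isDigit)

-- Hypothetical sketch: read a decimal integer literal in which '_' may
-- appear among the digits and is simply ignored.
readIntLiteral :: String -> Maybe Integer
readIntLiteral s
  | not (null digits)
  , all (\c -> isDigit c || c == '_') s = Just (read digits)
  | otherwise                           = Nothing
  where
    digits = filter (/= '_') s
```

So `1_000_000` reads as the same literal as `1000000`.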
The main purpose of this PR is to provide a parser and data structure for PureScript's exact surface syntax. In the process, I've aimed to clean up and fix many edge cases and bugs in both the grammar and the parser. This can hopefully serve as the basis for a grammar specification and for syntactic tooling.
The current parser and AST have several disadvantages. The AST itself is both very lossy and inadequate for any specific purpose. A lot of desugaring is done in the parser, and many comments are dropped, so the AST is not a good basis for syntactic tooling. Since comment annotations are incomplete, it's difficult to extend `docs` with additional forms. It contains many nodes that are internal to the typechecker, but then also has many nodes which the typechecker doesn't even handle.

I have not made any changes to the AST, or updated any of the tooling other than to route it through the new parser. However, my hope is that the current AST can be gradually cleaned up and pruned so that it's only a typechecking IR.
Language.PureScript.CST.Monad

This contains the barebones parser monad, which is essentially `StateT s (Except e) a`, but CPSed and combined. The primary parser state holds onto the token stream and accumulated failures. Currently the parser does not do recovery (earlier versions did, but it required some more finessing), but it will report multiple errors for trivial things like `=` in record constructors. Otherwise, this module is straightforward and not particularly interesting.

Language.PureScript.CST.Types

This contains data types for the entire PureScript surface language. Every token is represented in the tree, and every token is annotated with whitespace and comments (both leading and trailing). This means one can write an exact printer so that `print . parse = id`. Every constructor is laid out with tokens in left-to-right order. The core productions are given a slot for arbitrary annotations; however, this is not used by the parser. It takes a very long time to compile because of all the derived instances.

Language.PureScript.CST.Print

This is just a simple token printer. It's not a full-fledged formatter, but it is used by the layout golden tests. Printing each token in the tree with this printer will result in the exact input that was given to the lexer.
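The "CPSed and combined" shape described for `Language.PureScript.CST.Monad` can be sketched roughly like this. These definitions are illustrative only — a plausible fusion of `StateT s (Except e)` into success and failure continuations, not the PR's actual code:

```haskell
{-# LANGUAGE RankNTypes #-}

-- Hypothetical sketch: StateT s (Except e) a, fused into a single CPS'd
-- function taking the state plus failure and success continuations.
newtype ParserM e s a = ParserM
  { runParserM :: forall r. s -> (s -> e -> r) -> (s -> a -> r) -> r }

instance Functor (ParserM e s) where
  fmap f (ParserM k) = ParserM $ \s ke ks -> k s ke (\s' a -> ks s' (f a))

instance Applicative (ParserM e s) where
  pure a = ParserM $ \s _ ks -> ks s a
  ParserM kf <*> ParserM ka = ParserM $ \s ke ks ->
    kf s ke (\s' f -> ka s' ke (\s'' a -> ks s'' (f a)))

instance Monad (ParserM e s) where
  ParserM k >>= f = ParserM $ \s ke ks ->
    k s ke (\s' a -> runParserM (f a) s' ke ks)

-- Fail with an error, like throwError in Except.
throw :: e -> ParserM e s a
throw e = ParserM $ \s ke _ -> ke s e

-- Discharge the continuations to recover an Either result.
runParser :: s -> ParserM e s a -> Either e a
runParser s (ParserM k) = k s (\_ e -> Left e) (\_ a -> Right a)
```

CPSing like this avoids allocating the intermediate `Either`/tuple wrappers of the transformer stack on every bind.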
Language.PureScript.CST.Positions

This contains utilities for calculating positions and offsets. While tokens are annotated with ranges, CST nodes are not, but ranges can be derived dynamically with the functions in this module, which return the first and last tokens for a given node.
Language.PureScript.CST.Convert

This contains functions for converting the CST into the core AST. It is mostly boilerplate, and it does the job of resolving ranges for all the nodes and attaching comments.
Language.PureScript.CST.Errors

This contains a data type for all the different parse errors and a pretty printer for them.
Language.PureScript.CST.Lexer

This is the core lexer. It is handwritten, and significantly faster than the previous lexer. This could probably be written with Alex, but I was able to write the entire lexer in Haskell in the time it would've taken me to learn how to do everything I needed with Alex. FWIW, I've commented all the productions. There isn't anything particularly fancy other than that the lexer is lazy, which means we only lex what we actually parse. This means we can parse module headers and avoid the work of lexing the rest of the source. It also does the job of calling the layout algorithm.
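The laziness described above can be illustrated with a toy sketch. These are stand-in functions, not the real lexer: because the token stream is a lazy list, a consumer that stops at the module header never forces lexing of the rest of the file:

```haskell
-- Toy stand-in for the lexer: 'words' produces tokens lazily, one at a
-- time, as the consumer demands them.
lexAll :: String -> [String]
lexAll = words

-- A header consumer that only forces tokens up to the first "where".
-- Tokens after it are never lexed, because the list is never forced
-- past that point.
moduleHeaderTokens :: String -> [String]
moduleHeaderTokens = takeWhile (/= "where") . lexAll
```

With this sketch, `moduleHeaderTokens ("module Foo where " ++ undefined)` still returns `["module", "Foo"]` without touching the `undefined` tail, which is the same property that lets the compiler read dependency information from headers cheaply.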
Language.PureScript.CST.Layout

The parser itself is unaware of indentation, and instead only parses explicit delimiters which are inserted by the layout algorithm (much like Haskell). This is convenient because the actual grammar can be specified apart from the indentation rules. Haskell's layout algorithm is somewhat unfortunate, though, as it relies on a parse-error side condition. This is notoriously difficult to interpret and implement apart from the way GHC does it. Haskell is somewhat hindered by a few productions which make it impossible to implement a purely lexical algorithm. PureScript does not have these problematic productions (particularly the `foo, bar :: SomeType` syntax in declarations), but it does have a few gotchas of its own. The algorithm is "non-trivial" to say the least, but it is implemented as a purely lexical delimiter parser on a token-by-token basis, which is highly convenient, since it can be replicated in any language or toolchain. There is likely room to simplify it, but there are some seemingly innocuous things that complicate it, such as layout keywords appearing after `.` (also in foralls!) or as labels in record literals. If this algorithm had been written before that feature, I would have required layout keywords to be quoted in properties. I've tried to comment most of the cases, and I can further elaborate if it's deemed confusing. I've added golden tests for many of the weird edge cases, which show that this algorithm is very robust.
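As a heavily simplified picture of what "purely lexical delimiter parser" means, virtual `{`, `;`, and `}` tokens can be inserted from column information alone. This toy handles only a single block and is nothing like the PR's actual algorithm — it just shows the token-in, delimiter-out shape:

```haskell
-- Toy layout sketch (assumption: one block, no nesting): given the
-- reference column of a block and (column, token) pairs, emit the token
-- stream with virtual '{', ';', '}' inserted.
layoutBlock :: Int -> [(Int, String)] -> [String]
layoutBlock ref toks = "{" : go True toks
  where
    go _ [] = ["}"]
    go first ((col, tok) : rest)
      | col < ref             = "}" : map snd ((col, tok) : rest)  -- block ends
      | col == ref, not first = ";" : tok : go False rest          -- new item
      | otherwise             = tok : go False rest                -- continuation
```

For a `where` block starting at column 3 with two bindings, `layoutBlock 3 [(3,"x"),(5,"="),(7,"1"),(3,"y"),(5,"="),(7,"2")]` yields `["{","x","=","1",";","y","=","2","}"]` — delimiters the grammar can then parse with no knowledge of indentation.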
It does, however, break a few things that the current parser accepts, though such cases are quite rare. The current parser's offside rule is very fragile: it only checks exact columns, so it's possible to write things that disobey the spirit of the offside rule but still parse.
The old parser would accept this because `a` was not at exactly column 23, and indeed there's a fair amount of old code in core that uses this, but it clearly disobeys the offside rule. This extends to other weird cases like:

Or

Or

I'll leave that last one as an exercise for the reader.
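To make the exact-column fragility concrete, here is a hypothetical contrast of the two checks (illustrative predicates, not compiler code). Which lines count as continuations of the previous item?

```haskell
-- As described above, the old parser treats any line that is NOT at
-- exactly the reference column as a continuation of the previous item:
oldIsContinuation :: Int -> Int -> Bool
oldIsContinuation refCol col = col /= refCol

-- The spirit of the offside rule (assumption): a continuation must be
-- indented strictly past the reference column.
newIsContinuation :: Int -> Int -> Bool
newIsContinuation refCol col = col > refCol
```

With a reference column of 23, a token at column 22 is a continuation under the old check (so the code above parses), but not under the stricter one — it is outdented relative to the block, so the new layout algorithm closes the block instead.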
Overall, though, the vast majority of existing code will continue to parse fine.
Language.PureScript.CST.Parser

This is the parser for the entire PureScript surface language. I've chosen to implement it with Happy for a few reasons.

… `=` in a record. I have not turned on reporting of expected alternative tokens, since I didn't find it super useful, but we can explore this.

There are a handful of cases in the grammar that require unbounded lookahead, which could certainly be handled by a GLR parser, but I've opted to use backtracking in those cases. This is primarily because binder and expression reductions conflict. Haskell solves this by parsing all binders as expressions and then maintaining an ad-hoc expression-to-binder parser. I did this initially, but it's very difficult to maintain and audit. Instead, I use the monadic reduction rules and partial parsing provided by Happy to implement backtracking for those cases. In practice, I think the cost of reparsing those tokens is probably the same as doing a conversion, and the end result is much more maintainable.
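The backtracking idea can be illustrated with a toy parser. This uses `Text.ParserCombinators.ReadP` for brevity (the PR itself uses Happy's monadic reductions and partial parsing, not combinators); the grammar here is a hypothetical miniature of the binder-versus-expression conflict in do-statements:

```haskell
import Data.Char (isAsciiLower)
import Text.ParserCombinators.ReadP

-- A lowercase identifier.
ident :: ReadP String
ident = munch1 isAsciiLower

-- The "binder <- expression" form of a do-statement.
binderArrow :: ReadP (String, String)
binderArrow = do
  b <- ident
  _ <- string " <- "
  e <- ident
  pure (b, e)

-- Try the binder form first; on failure, backtrack and reparse the
-- same input as a bare expression. (<++) is ReadP's left-biased choice.
stmt :: ReadP (Either (String, String) String)
stmt = (Left <$> binderArrow) <++ (Right <$> ident)

parseStmt :: String -> Maybe (Either (String, String) String)
parseStmt s = case [r | (r, "") <- readP_to_S stmt s] of
  (r : _) -> Just r
  _       -> Nothing
```

Both alternatives begin with the same tokens, so the parser cannot commit until it sees (or fails to see) the `<-` — exactly the kind of unbounded-lookahead conflict described above, resolved by reparsing rather than by converting expressions into binders after the fact.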
Performance
In my benchmarking, the new lexer and parser yield a 6-7x improvement over the old ones. I've also written it so that module headers can be parsed lazily, and we can continue parsing while sharing the previous work, which helps when calculating the dependency tree. This means time to first compile in a clean build is much faster. However, in a project that has already been built, calling the compiler again still has overhead. We had previously thought that parsing was the primary overhead, but I don't think that's completely true in practice. Even when nothing has changed, the current compiler orchestrates everything in build order, when it could calculate changed modules fully in parallel. The parser improvements do help, just not as much as we might have hoped.