Integrate purescript-cst#3608
Conversation

Note, I have not done anything with the old parser yet (other than removing the …

I have updated this PR with an overview.

Yes, there's actually a TODO in …

The writeup is fantastic. I think it would be good to put most of what you've written here into the relevant modules as comments - perhaps with the exception of comparisons to the current parser, since they won't be relevant for very much longer.

I have added several new tests and added module comments where appropriate.

I think it would be good to have this code in …

Sounds good to me.
Fixes #3555
I will work on a full write-up of all the user-facing changes and design considerations, but I figured I'd get this code out sooner rather than later.
Breaking Changes
- `\n`, `\r`, and ASCII space are the only allowed whitespace (though anything is allowed in raw strings).
- `\n`, `\r`, `\t`, `\'`, `\"`, `\\`, `\x[0-9a-fA-F]{1,6}` (unicode hex escape), and `\[\r\n ]+\` (gap escapes) are the only allowed string escapes.
- `\` is no longer a valid operator.
- `@` is no longer a valid operator.
- `forall` is no longer a valid expression-level identifier.
- `@`-binders take an atomic binder (`a@Foo b` must be `a@(Foo b)`).
- Kind annotations on type variable bindings must be parenthesized (`a :: Type -> Type b :: Type` must now be `(a :: Type -> Type) (b :: Type)`).
- `::` has lowest precedence, rather than sitting between operators and function application.

Language Enhancements
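To make the string-escape rule concrete, here is a small sketch in Haskell of what the lexer accepts after a backslash. This is purely illustrative — the function name and shape are hypothetical, not the PR's actual lexer code:

```haskell
import Data.Char (isHexDigit)

-- Hypothetical sketch (not the PR's lexer): does the input that follows
-- a backslash in a string literal begin a valid escape?
isValidEscape :: String -> Bool
isValidEscape (c : _)
  | c `elem` "nrt'\"\\" = True                 -- \n \r \t \' \" \\
isValidEscape ('x' : rest) =                   -- \x with 1 to 6 hex digits
  let n = length (take 6 (takeWhile isHexDigit rest))
  in n >= 1
isValidEscape (c : rest)
  | c `elem` " \r\n" =                         -- gap escape: \<whitespace>+\
      case dropWhile (`elem` " \r\n") rest of
        '\\' : _ -> True
        _        -> False
isValidEscape _ = False
```

Anything else after a backslash — for example `\q` — would be a lex error under this rule.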
I went ahead and added these, but we don't have to include them. I think they are useful (and fairly conservative).
- `where` is now allowed in `case` branches, and it behaves the same as in declarations. Note that this does not change where they are bound: `where` is still just sugar for `let`, which means that `where` is always bound to the right-hand side of a single `=` or case `->`.
- `_` is now allowed in numeric literals, and is an ignored character (as in Rust and Swift).

Overview
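As a rough illustration of the underscore rule, ignoring `_` amounts to filtering it out before reading the digits. The helper below is hypothetical, not the PR's code, and handles only decimal integers:

```haskell
import Data.Char (isDigit)

-- Hypothetical sketch: read a decimal integer literal in which '_' may
-- appear among the digits and is simply ignored.
readIntLiteral :: String -> Maybe Integer
readIntLiteral s
  | not (null digits)
  , all (\c -> isDigit c || c == '_') s = Just (read digits)
  | otherwise                           = Nothing
  where
    digits = filter (/= '_') s
```

So `1_000_000` reads as the same literal as `1000000`.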
The main purpose of this PR is to provide a parser and data structure for PureScript's exact surface syntax. In the process, I've aimed to clean up and fix many edge cases and bugs in both the grammar and the parser. This can hopefully serve as the basis for a grammar specification and for syntactic tooling.
The current parser and AST have several disadvantages. The AST itself is both very lossy and inadequate for any specific purpose. A lot of desugaring is done in the parser, and many comments are dropped, so the AST is not a good basis for syntactic tooling. Since comment annotations are incomplete, it's difficult to extend `docs` with additional forms. It contains many nodes that are internal to the typechecker, but then also has many nodes which the typechecker doesn't even handle.

I have not made any changes to the AST, or updated any of the tooling other than to route it through the new parser. However, my hope is that the current AST can be gradually cleaned up and pruned so that it's only a typechecking IR.
Language.PureScript.CST.Monad

This contains the barebones parser monad, which is essentially `StateT s (Except e) a`, but CPSed and combined. The primary parser state holds onto the token stream and accumulated failures. Currently the parser does not do recovery (earlier versions did, but it required some more finessing), but it will report multiple errors for trivial things like `=` in record constructors. Otherwise, this module is straightforward and not particularly interesting.

Language.PureScript.CST.Types

This contains data types for the entire PureScript surface language. Every token is represented in the tree, and every token is annotated with whitespace and comments (both leading and trailing). This means one can write an exact printer so that `print . parse = id`. Every constructor is laid out with tokens in left-to-right order. The core productions are given a slot for arbitrary annotations; however, this is not used by the parser. It takes a very long time to compile because of all the derived instances.

Language.PureScript.CST.Print

This is just a simple token printer. It's not a full-fledged formatter, but it is used by the layout golden tests. Printing each token in the tree with this printer will result in the exact input that was given to the lexer.
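The "CPSed and combined" shape described for `Language.PureScript.CST.Monad` can be sketched roughly like this. These definitions are illustrative only — a plausible fusion of `StateT s (Except e)` into success and failure continuations, not the PR's actual code:

```haskell
{-# LANGUAGE RankNTypes #-}

-- Hypothetical sketch: StateT s (Except e) a, fused into a single CPS'd
-- function taking the state plus failure and success continuations.
newtype ParserM e s a = ParserM
  { runParserM :: forall r. s -> (s -> e -> r) -> (s -> a -> r) -> r }

instance Functor (ParserM e s) where
  fmap f (ParserM k) = ParserM $ \s ke ks -> k s ke (\s' a -> ks s' (f a))

instance Applicative (ParserM e s) where
  pure a = ParserM $ \s _ ks -> ks s a
  ParserM kf <*> ParserM ka = ParserM $ \s ke ks ->
    kf s ke (\s' f -> ka s' ke (\s'' a -> ks s'' (f a)))

instance Monad (ParserM e s) where
  ParserM k >>= f = ParserM $ \s ke ks ->
    k s ke (\s' a -> runParserM (f a) s' ke ks)

-- Fail with an error, like throwError in Except.
throw :: e -> ParserM e s a
throw e = ParserM $ \s ke _ -> ke s e

-- Discharge the continuations to recover an Either result.
runParser :: s -> ParserM e s a -> Either e a
runParser s (ParserM k) = k s (\_ e -> Left e) (\_ a -> Right a)
```

CPSing like this avoids allocating the intermediate `Either`/tuple wrappers of the transformer stack on every bind.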
Language.PureScript.CST.Positions

This contains utilities for calculating positions and offsets. While tokens are annotated with ranges, CST nodes are not, but ranges can be derived dynamically with the functions in this module, which return the first and last tokens for a given node.
Language.PureScript.CST.Convert

This contains functions for converting the CST into the core AST. It is mostly boilerplate, and it does the job of resolving ranges for all the nodes and attaching comments.
Language.PureScript.CST.Errors

This contains a data type for all the different parse errors and a pretty printer for them.
Language.PureScript.CST.Lexer

This is the core lexer. It is handwritten, and significantly faster than the previous lexer. This could probably be written with Alex, but I was able to write the entire lexer in Haskell in the time it would've taken me to learn how to do everything I needed with Alex. FWIW, I've commented all the productions. There isn't anything particularly fancy other than that the lexer is lazy, which means we only lex what we actually parse. This means we can parse module headers and avoid the work of lexing the rest of the source. It also does the job of calling the layout algorithm.
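The laziness described above can be illustrated with a toy sketch. These are stand-in functions, not the real lexer: because the token stream is a lazy list, a consumer that stops at the module header never forces lexing of the rest of the file:

```haskell
-- Toy stand-in for the lexer: 'words' produces tokens lazily, one at a
-- time, as the consumer demands them.
lexAll :: String -> [String]
lexAll = words

-- A header consumer that only forces tokens up to the first "where".
-- Tokens after it are never lexed, because the list is never forced
-- past that point.
moduleHeaderTokens :: String -> [String]
moduleHeaderTokens = takeWhile (/= "where") . lexAll
```

With this sketch, `moduleHeaderTokens ("module Foo where " ++ undefined)` still returns `["module", "Foo"]` without touching the `undefined` tail, which is the same property that lets the compiler read dependency information from headers cheaply.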
Language.PureScript.CST.Layout

The parser itself is unaware of indentation, and instead only parses explicit delimiters which are inserted by the layout algorithm (much like Haskell). This is convenient because the actual grammar can be specified apart from the indentation rules. Haskell's layout algorithm is somewhat unfortunate, though, as it relies on a parse-error side condition. This is notoriously difficult to interpret and implement apart from the way GHC does it. Haskell is somewhat hindered by a few productions which make it impossible to implement a purely lexical algorithm. PureScript does not have these problematic productions (particularly the `foo, bar :: SomeType` syntax in declarations), but it does have a few gotchas of its own. The algorithm is "non-trivial" to say the least, but it is implemented as a purely lexical delimiter parser on a token-by-token basis, which is highly convenient, since it can be replicated in any language or toolchain. There is likely room to simplify it, but there are some seemingly innocuous things that complicate it, such as layout keywords appearing after `.` (also in foralls!) or as labels in record literals. If this algorithm had been written before that feature, I would have required layout keywords to be quoted in properties. I've tried to comment most of the cases, and I can further elaborate if it's deemed confusing. I've added golden tests for many of the weird edge cases, which show that this algorithm is very robust.
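As a heavily simplified picture of what "purely lexical delimiter parser" means, virtual `{`, `;`, and `}` tokens can be inserted from column information alone. This toy handles only a single block and is nothing like the PR's actual algorithm — it just shows the token-in, delimiter-out shape:

```haskell
-- Toy layout sketch (assumption: one block, no nesting): given the
-- reference column of a block and (column, token) pairs, emit the token
-- stream with virtual '{', ';', '}' inserted.
layoutBlock :: Int -> [(Int, String)] -> [String]
layoutBlock ref toks = "{" : go True toks
  where
    go _ [] = ["}"]
    go first ((col, tok) : rest)
      | col < ref             = "}" : map snd ((col, tok) : rest)  -- block ends
      | col == ref, not first = ";" : tok : go False rest          -- new item
      | otherwise             = tok : go False rest                -- continuation
```

For a `where` block starting at column 3 with two bindings, `layoutBlock 3 [(3,"x"),(5,"="),(7,"1"),(3,"y"),(5,"="),(7,"2")]` yields `["{","x","=","1",";","y","=","2","}"]` — delimiters the grammar can then parse with no knowledge of indentation.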
It does, however, break a few things that the current parser accepts, though such cases are quite rare. The current parser's offside rule is very fragile: it only checks exact columns, so it's possible to write things that disobey the spirit of the offside rule but still parse.
The old parser would accept this because `a` was not at exactly column 23, and indeed there's a fair amount of old code in core that uses this, but it clearly disobeys the offside rule. This extends to other weird cases like:

Or

Or

I'll leave that last one as an exercise for the reader.
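To make the exact-column fragility concrete, here is a hypothetical contrast of the two checks (illustrative predicates, not compiler code). Which lines count as continuations of the previous item?

```haskell
-- As described above, the old parser treats any line that is NOT at
-- exactly the reference column as a continuation of the previous item:
oldIsContinuation :: Int -> Int -> Bool
oldIsContinuation refCol col = col /= refCol

-- The spirit of the offside rule (assumption): a continuation must be
-- indented strictly past the reference column.
newIsContinuation :: Int -> Int -> Bool
newIsContinuation refCol col = col > refCol
```

With a reference column of 23, a token at column 22 is a continuation under the old check (so the code above parses), but not under the stricter one — it is outdented relative to the block, so the new layout algorithm closes the block instead.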
Overall, though, the vast majority of existing code will continue to parse fine.
Language.PureScript.CST.Parser

This is the parser for the entire PureScript surface language. I've chosen to implement it with Happy for a few reasons.

… `=` in a record. I have not turned on reporting of expected alternative tokens, since I didn't find it super useful, but we can explore this.

There are a handful of cases in the grammar that require unbounded lookahead, which could certainly be handled by a GLR parser, but I've opted to use backtracking in those cases. This is primarily because binder and expression reductions conflict. Haskell solves this by parsing all binders as expressions and then maintaining an ad-hoc expression-to-binder parser. I did this initially, but it's very difficult to maintain and audit. Instead, I use the monadic reduction rules and partial parsing provided by Happy to implement backtracking for those cases. In practice, I think the cost of reparsing those tokens is probably the same as doing a conversion, and the end result is much more maintainable.
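The backtracking idea can be illustrated with a toy parser. This uses `Text.ParserCombinators.ReadP` for brevity (the PR itself uses Happy's monadic reductions and partial parsing, not combinators); the grammar here is a hypothetical miniature of the binder-versus-expression conflict in do-statements:

```haskell
import Data.Char (isAsciiLower)
import Text.ParserCombinators.ReadP

-- A lowercase identifier.
ident :: ReadP String
ident = munch1 isAsciiLower

-- The "binder <- expression" form of a do-statement.
binderArrow :: ReadP (String, String)
binderArrow = do
  b <- ident
  _ <- string " <- "
  e <- ident
  pure (b, e)

-- Try the binder form first; on failure, backtrack and reparse the
-- same input as a bare expression. (<++) is ReadP's left-biased choice.
stmt :: ReadP (Either (String, String) String)
stmt = (Left <$> binderArrow) <++ (Right <$> ident)

parseStmt :: String -> Maybe (Either (String, String) String)
parseStmt s = case [r | (r, "") <- readP_to_S stmt s] of
  (r : _) -> Just r
  _       -> Nothing
```

Both alternatives begin with the same tokens, so the parser cannot commit until it sees (or fails to see) the `<-` — exactly the kind of unbounded-lookahead conflict described above, resolved by reparsing rather than by converting expressions into binders after the fact.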
Performance
In my benchmarking, the new lexer and parser yield a 6-7x improvement over the old ones. I've also written it so that module headers can be parsed lazily, and we can continue parsing while sharing the previous work, which helps when calculating the dependency tree. This means time to first compile in a clean build is much faster. However, in a project that has already been built, calling the compiler again still has overhead. We had previously thought that parsing was the primary overhead, but I don't think that's completely true in practice. Even when nothing has changed, the current compiler orchestrates everything in build order, when it could calculate changed modules fully in parallel. The parser improvements do help, just not as much as we might have hoped.