Skip to content

Integrate purescript-cst#3608

Merged
natefaubion merged 13 commits intopurescript:masterfrom
natefaubion:cst
May 5, 2019
Merged

Integrate purescript-cst#3608
natefaubion merged 13 commits intopurescript:masterfrom
natefaubion:cst

Conversation

@natefaubion
Copy link
Contributor

@natefaubion natefaubion commented Apr 14, 2019

Fixes #3555

I will work on a full write-up of all the user facing changes and design considerations, but I figured I'd get this code out sooner rather than later.


Breaking Changes

  • \n, \r, and ASCII space are the only allowed whitespace (though anything is allowed in raw strings).
  • The only string escapes are \n\r\t\'\"\\, \x[0-9a-fA-F]{1,6} (unicode hex escape), and \[\r\n ]+\ (gap escapes).
  • \ is no longer a valid operator.
  • @ is no longer a valid operator.
  • forall is no longer a valid expression-level identifier.
  • Octal literals have been removed (only hex is supported).
  • Precedence of constructors with arguments in binders (a@Foo b must be a@(Foo b)).
  • Precedence of kind annotations (a :: Type -> Type b :: Type must now be (a :: Type -> Type) (b :: Type))
  • Precedence of type annotations (:: has lowest precedence, rather than sitting between operators and function application).
  • Various edge cases with indentation/layout.

Language Enhancements

I went ahead and added these, but we don't have to include them. I think they are useful (and fairly conservative).

  • where is now allowed in case branches, and it behaves the same as declarations. Note that this does not change where they are bound. where is still just sugar for let, which means that where is always bound to the right-hand-side of a single = or case ->.

    example = case ... of
      _ -> ...
        -- this is bound to the branch since it is indented past the case binder.
        where foo = ...
      -- this is bound to `example`.
      where baz = ...
  • _ is now allowed in numeric literals, and is an ignored character (as in Rust and Swift).

    1_000_000 == 1000000

Overview

The main purpose of this PR is to provide a parser and data structure for PureScript's exact surface syntax. In the process, I've aimed to cleanup and fix many edge cases and bugs with both the grammar and parser. This can hopefully serve as the basis for a grammar specification and for syntactic tooling .

The current parser and AST have several disadvantages. The AST itself is both very lossy, and inadequate for any specific purpose. A lot of desugaring is done in the parser, and many comments are dropped, so the AST is not a good basis for syntactic tooling. Since comment annotations are incomplete, it's difficult to extend docs with additional forms. It contains many nodes that are internal to the typechecker, but then also has many nodes which the typechecker doesn't even handle.

I have not made any changes to the AST, or updated any of the tooling other than to go through the new parser. However, my hope is that the current AST can be gradually cleaned up and pruned so it's only a typechecking IR.

Language.PureScript.CST.Monad

This contains the barebones parser Monad, which is essentially StateT s (Except e) a, but CPSed and combined. The primary parser state holds onto the token stream and accumulated failures. Currently the parser does not do recovery (earlier versions did, but it required some more finessing), but it will report multiple errors for trivial things like = in record constructors. Otherwise, this module is straightforward and not particularly interesting.

Language.PureScript.CST.Types

This contains data types for the entire PureScript surface language. Every token is represented in the tree, and every token is annotated with whitespace and comments (both leading and trailing). This means one can write an exact printer so that print . parse = id. Every constructor is laid out with tokens in left-to-right order. The core productions are given a slot for arbitrary annotations, however this is not used by the parser. It takes a very long time to compile because of all the derived instances.

Language.PureScript.CST.Print

This is just a simple token printer. It's not a full fledged formatter, but it is used by the layout golden tests. Printing each token in the tree with this printer will result in the exact input that was given to the lexer.

Language.PureScript.CST.Positions

This contains utilities for calculating positions and offsets. While tokens are annotated with ranges, CST nodes are not, but they can be dynamically derived with the functions in this module, which will return the first and last tokens for a given node.

Language.PureScript.CST.Convert

This contains functions for converting the CST into the core AST. It is mostly boilerplate, and does the job of resolving ranges for all the nodes and attaching comments.

Language.PureScript.CST.Errors

This contains a data type for all the different parse errors and a pretty printer.

Language.PureScript.CST.Lexer

This is the core lexer. It is handwritten, and significantly faster than the previous lexer. This could probably be written with Alex, but I was able to write the entire lexer in Haskell in the time it would've take me to learn how to do everything I needed with Alex. FWIW, I've commented all the productions. There isn't anything particularly fancy other than that the lexer is lazy, which means we are able to only lex what we actually parse. This means we can parse module headers, and avoid the work of lexing the rest of the source. It also does the job of calling the layout algorithm.

Language.PureScript.CST.Layout

The parser itself is unaware of indentation, and instead only parses explicit delimiters which are inserted by the layout algorithm (much like Haskell). This is convenient because the actual grammar can be specified apart from the indentation rules. Haskell's layout algorithm is somewhat unfortunate though as it relies on a parse error side condition. This is notoriously difficult to interpret and implement apart from the way GHC does it. Haskell is somewhat hindered by a few productions which make it impossible to implement a purely lexical algorithm. PureScript does not have these problematic productions (particularly foo, bar :: SomeType syntax in declarations), but it does have a few gotchas of it's own. The algorithm is "non-trivial" to say the least, but it is implemented as a purely lexical delimiter parser on a token-by-token basis, which is highly convenient, since it can be replicated in any language or toolchain. There is likely room to simplify it, but there are some seemingly innocuous things that complicate it.

  • "Naked" commas (case, patterns, guards, fundeps) are a constant source of complexity, and indeed too much of this is what prevents Haskell from having such an algorithm.
  • Unquoted properties for layout keywords introduce a domino effect of complexity since we have to mask and unmask any usage of . (also in foralls!) or labels in record literals. If this algorithm had been written before that feature, I would have required layout keywords to be quotes in properties.

I've tried to comment most of the cases, and I can further elaborate if it's deemed confusing. I've added golden tests for many of the weird edge cases, which show that this algorithm is very robust.

It does however break a few things that the current parser accepts, but it is quite rare. The current parser offside rule is very fragile. It only checks exact columns, so it's possible to write things that disobey the spirit of the offside rule, but still parse.

example = case foo of Foo a ->
  a

The old parser would accept this because a was not at exactly column 23, and indeed there's a fair amount of old code in core that uses this, but it clearly disobeys the offside rule. This extends to other weird cases like:

example = case foo of
            Foo a ->
        a
            Bar b -> b

Or

example = do \a ->
  a

Or

example = do
  foo <-
a b

I'll leave that last one as an exercise for the reader.

Overall though the vast majority of existing code will continue to parse fine.

Language.PureScript.CST.Parser

This is the parser for the entire PureScript surface language. I've chosen to implement with Happy for a few reasons.

  • Being a LALR(1) parser generator, I feel like it keeps the complexity honest. Most of the grammar (and I feel like it's apparent in the size of this file) is very straightforward when combined with the layout algorithm. There are a few tricky cases, but not too bad.
  • Performance is improved a lot (I don't have exact number though for just the parser).
  • Since it doesn't backtrack (mostly), error locations are precise. In some cases the previous parser wielded backtracking quite aggressively, and I'm sure all of us have had to binary-search-comment a file only to find out that we added an extra comma or used an = in a record. I have not turned on reporting of expected alternative tokens since I didn't find it super useful, but we can explore this.
  • It has some implementation details showing, but otherwise is very close to what a spec CFG would look like.

There are a handful of cases in the grammar that require unbounded lookahead, which could certainly be handled a GLR parser, but I've opted to use backtracking in those cases. This is primarily because binder and expression reductions conflict. Haskell solves this by parsing all binders as expressions and then maintaining an ad-hoc expression-to-binder parser. I did this initially, but it's very difficult to maintain and audit. I use the monadic reduction rules and partial parsing provided by Happy to implement backtracking for those cases. In practice, I think the cost of reparsing those tokens is probably the same as doing a conversion. The end result is much more maintainable.

Performance

In my benchmarking, the new lexer and parser yield a 6-7x improvement over the old one. I've also written it so that module headers can be parsed lazily, and we can continue parsing while sharing the previous work. This helps when calculating the dependency tree. This means time to first compile in a clean build is much faster. However, in a project that has already been built, calling the compiler again still has overhead. We had previously thought that parsing was the primary overhead, but I don't think that's completely true in practice. Even when nothing is changed, the current compiler orchestrates everything in build order, when it could calculate changed modules fully in parallel. The parser improvements do help, but just not as much as we might have hoped.

@natefaubion
Copy link
Contributor Author

Note, I have not done anything with the old parser yet (other than removing the parseModulesFromFiles entry points) since that is very likely to cause conflicts.

Copy link
Contributor

@hdgarrood hdgarrood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you like to merge #3576 first?

I think it would be good to add test cases for all of the issues listed in #3555 which this fixes, but we don't have to do that immediately.

@natefaubion
Copy link
Contributor Author

I have updated this PR with an overview.

@natefaubion
Copy link
Contributor Author

Would you like to merge #3576 first?

Yes, there's actually a TODO in Convert.hs for this.

@hdgarrood
Copy link
Contributor

The writeup is fantastic. I think it would be good to put most of what you’ve written here into the relevant modules as comments - perhaps with the exception of comparisons to the current parser, since they won’t be relevant for very much longer.

@natefaubion
Copy link
Contributor Author

I have added several new tests and added module comments where appropriate.

@hdgarrood
Copy link
Contributor

I think it would be good to have this code in master for at least a couple of weeks before releasing, so that we can test it out a bit better, so shall we merge it now? We can always revert it if we find that something breaks in a way that we can't see how to fix right away.

@garyb
Copy link
Member

garyb commented May 5, 2019

Sounds good to me.

@natefaubion natefaubion merged commit 317dce2 into purescript:master May 5, 2019
@joneshf joneshf mentioned this pull request Feb 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Integrate purescript-cst

3 participants