Support underscores in numeric literals #6180

Merged
aryairani merged 10 commits into unisonweb:trunk from tfausak:gh-2228-numeric-underscores
Mar 6, 2026

@tfausak
Contributor

@tfausak tfausak commented Mar 5, 2026

Overview

  • What does this change accomplish and why?
    • Adds support for underscores as visual separators in numeric literals, a common readability feature in modern languages (Rust, Python, Java, Kotlin, Swift).
  • How does it change the user experience?
    • Users can now write 1_000_000 instead of 1000000, 0xFF_FF instead of 0xFFFF, etc.
  • What was the old behavior/API and what is the new behavior/API?
    • Before: 1_000 produced a confusing parse error.
    • After: 1_000 is accepted and evaluates to 1000.
  • Before and after examples:

    Input          Before                 After
    1_000_000      parse error            1000000
    0xFF_FF        parse error            65535
    0b1010_0101    parse error            165
    1_000.5e1_0    parse error            1000.5e10
    1_             1 (silent)             parse error
    1__2           1 (silent)             parse error
    1_x            1, _x (two tokens)     parse error

  • Closes #2228 (Underscores in numeric literals)

Implementation approach and notes

This is a lexer-only change: underscores are stripped at lex time, so no downstream changes to the parser, typechecker, pretty-printer, or runtime are needed.

Two helpers are added to the numeric parser in Unison.Syntax.Lexer.Unison:

  • digitsWithUnderscores — parses digit groups separated by single underscores, committing after each _ so that malformed literals (trailing _, consecutive __, _ before non-digit) produce errors instead of silently accepting partial input.
  • digitsToInteger — converts a digit string to Integer for a given base, replacing megaparsec's LP.decimal/LP.hexadecimal/LP.octal/LP.binary.
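
The behavior of the two helpers can be modeled outside the lexer. The following is a minimal Python sketch, not the Haskell code from this PR: the function names mirror the helpers above, but the signatures, the `is_decimal_digit` predicate, and the use of `ValueError` to stand in for a committed parse failure are all assumptions made for illustration.

```python
# Illustrative Python model only; the real helpers are Haskell parsers
# in Unison.Syntax.Lexer.Unison and their signatures are not shown in this PR text.

def is_decimal_digit(ch):
    """Hypothetical digit predicate for base 10."""
    return ch in "0123456789"


def digits_with_underscores(text, is_digit):
    """Return (digits, rest): the leading digit groups of `text` with the
    single underscores between them stripped, plus the unconsumed remainder.

    Raises ValueError when an underscore has been consumed but no digit
    follows it, modeling the "commit after each _" behavior.
    """
    i = 0
    out = []
    # First group: at least one digit is required.
    while i < len(text) and is_digit(text[i]):
        out.append(text[i])
        i += 1
    if not out:
        raise ValueError("expected a digit")
    # Each further group is '_' followed by one or more digits.
    while i < len(text) and text[i] == "_":
        i += 1  # committed: there is no backtracking past this underscore
        start = i
        while i < len(text) and is_digit(text[i]):
            out.append(text[i])
            i += 1
        if i == start:
            raise ValueError(f"expected a digit after '_' at offset {start}")
    return "".join(out), text[i:]


def digits_to_integer(base, digits):
    """Fold a digit string into an int, most significant digit first.

    The caller is responsible for passing only digits valid in `base`.
    """
    value = 0
    for ch in digits:
        value = value * base + int(ch, 16)  # int(ch, 16) maps 0-9, a-f, A-F to 0..15
    return value


digits, rest = digits_with_underscores("1_000_000", is_decimal_digit)
print(digits_to_integer(10, digits), repr(rest))  # 1000000 ''
```

In this model, `1_`, `1__2`, and `1_x` all raise instead of returning the partial result `("1", ...)`, which is the difference between committing after the underscore and backtracking over it.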

The three prefixed-base parsers (octal, hex, binary) are factored into a shared baseWithPrefix helper. Decimal is intentionally excluded from this because it has no prefix, which changes the backtracking semantics.
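
To illustrate why a prefix makes the shared helper straightforward, here is a hedged Python sketch of the idea. The name `base_with_prefix` echoes the helper above, but everything else is an assumption for illustration: the factory shape, the use of `None` for "this parser does not apply", `ValueError` for a committed failure, the `0o` octal prefix, and treating the whole input string as one literal.

```python
import re

# Illustrative Python model only; not the Haskell baseWithPrefix from this PR.

def base_with_prefix(prefix, base, digit_class):
    """Build a parser for one prefixed base, e.g. hex literals such as 0xFF_FF."""
    # One or more digits, then any number of "_digits" groups.
    body = re.compile(rf"[{digit_class}]+(?:_[{digit_class}]+)*")

    def parse(text):
        if not text.startswith(prefix):
            return None  # prefix absent: another parser may still try this input
        rest = text[len(prefix):]
        # The prefix matched, so from here on any problem is an error,
        # not a reason to fall back to a different interpretation.
        if body.fullmatch(rest) is None:
            raise ValueError(f"malformed literal: {text!r}")
        return int(rest.replace("_", ""), base)

    return parse


parse_hex = base_with_prefix("0x", 16, "0-9a-fA-F")
parse_octal = base_with_prefix("0o", 8, "0-7")  # "0o" prefix is an assumption
parse_binary = base_with_prefix("0b", 2, "01")

print(parse_hex("0xFF_FF"))         # 65535
print(parse_binary("0b1010_0101"))  # 165
```

A decimal literal has no prefix to signal which parser applies, so the point at which a decimal parser may safely commit is different; that is the reason given above for keeping it out of the shared helper.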

Interesting/controversial decisions

  • 1_x is now an error, not two tokens (1, _x). Once the lexer sees <digit>_, it commits to the underscore-in-number interpretation. This seems like the right call — 1_x looks like a malformed numeric literal, and users can write 1 _x if they mean two tokens.
  • Underscores are not allowed directly after a base prefix (0x_FF is rejected) or adjacent to the decimal point, the exponent marker e, or the exponent sign. This matches Java's rules. Rust and Python are more permissive here.
  • Underscores are never emitted on output. The pretty-printer always renders 1000000, not 1_000_000. This is consistent with how Unison handles string literals (single-line vs multi-line formatting is regenerated by the pretty-printer, not preserved from source). A configurable pretty-printer could be a future enhancement.
  • Bytes literals (0xs...) don't support underscores. This is a separate piece of work since the bytes parser has different structure.
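
The placement rules for decimal and floating-point literals can be summarized as a small grammar. The Python sketch below is one way to write that grammar down as a regular expression; it is an approximation inferred from the examples in this PR, not the lexer's actual definition. The names `GROUPS`, `DECIMAL`, and `is_valid_decimal_literal` are hypothetical, and leading signs and base prefixes are deliberately left out.

```python
import re

# Illustrative model of the underscore placement rules; assumed grammar, not the real lexer.

# A run of digits, optionally followed by "_digits" groups: underscores may
# only appear between two digits, and never twice in a row.
GROUPS = r"[0-9]+(?:_[0-9]+)*"

# Integer part, optional fraction, optional exponent with optional sign.
# Every position next to '.', 'e', '+', or '-' must be a digit.
DECIMAL = re.compile(rf"{GROUPS}(?:\.{GROUPS})?(?:e[+-]?{GROUPS})?")


def is_valid_decimal_literal(text):
    """True when `text` as a whole follows the placement rules modeled above."""
    return DECIMAL.fullmatch(text) is not None


for literal in ["1_000_000", "1_000.5e1_0", "0_1", "1_x", "1_.2", "1e+_2"]:
    print(literal, is_valid_decimal_literal(literal))
```

Under this model the accepted forms are the first three and the rejected forms are the last three, which lines up with the cases listed in the commits for this PR (1_x, 1_.2, 1._2, 1e_2, 1_e2, 1e+_2, and 1e-_2 are all errors).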

Test coverage

  • Lexer unit tests: 38 new test cases covering all numeric forms with underscores (decimal, float, scientific, hex, octal, binary) plus error cases (trailing _, consecutive __, _ adjacent to prefix/period/exponent, digit_nondigit sequences, leading zeros).
  • Transcript integration test: numeric-underscore-literals.md exercises valid literals end-to-end (parse → evaluate → display) and verifies error messages for invalid literals.
  • Test coverage is adequate for this change.

Loose ends

  • Bytes literals (0xs...) could also support underscores — separate issue.
  • The pretty-printer could optionally insert underscores in large numeric literals for readability — separate feature.

Final checklist

  • PR title is descriptive of the change.
  • Transcripts included demonstrating the changed behavior.
  • No .cabal file changes.

tfausak and others added 8 commits March 5, 2026 01:16
Allow underscores as visual separators in all numeric literal forms
(decimal, float, scientific, hex, octal, binary). Underscores are
stripped at lex time so no downstream changes are needed.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

Remove P.try from digitsWithUnderscores so that once an underscore is
consumed, the parser commits to requiring digits after it. This rejects
malformed literals like 1_, 1__2, 0xFF_, etc. instead of silently
accepting them.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

Integration test covering valid literals (decimal, float, scientific,
hex, octal, binary with underscores) and error cases (trailing and
consecutive underscores).

Co-Authored-By: Claude Opus 4.6 <[email protected]>

Factor octal/hex/binary into shared otherBase helper. Extract
isBinDigit predicate. Use mconcat and toInteger for clarity.

Add tests confirming that 1_x and 1_e3 are rejected (previously
these silently parsed as two tokens).

Co-Authored-By: Claude Opus 4.6 <[email protected]>

Ensures 0x_FF is an error, not a valid hex literal.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

Ensures 1_.2, 1._2, 1e_2, and 1_e2 are all rejected.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

Ensures 1e+_2 and 1e-_2 are rejected (underscore after exponent sign).
Adds tests for leading zeros: 007 -> 7 and 0_1 -> 1.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

@aryairani
Contributor

I was wanting this recently too for some reason.

@aryairani
Contributor

@tfausak Looks great. Could you do a couple of things:

  1. Add yourself to CONTRIBUTORS.markdown, acknowledging Unison's MIT license
  2. Run scripts/check.sh and, assuming it succeeds, check in the "proof" files it generates.

@tfausak tfausak requested a review from a team as a code owner March 5, 2026 20:43
@aryairani aryairani self-requested a review March 6, 2026 02:16
@aryairani aryairani added this pull request to the merge queue Mar 6, 2026
Merged via the queue into unisonweb:trunk with commit 0d51c30 Mar 6, 2026
5 checks passed
tfausak added a commit to tfausak/unison that referenced this pull request Mar 6, 2026
Extend the underscore separator feature from numeric literals (unisonweb#6180)
to bytes literals. E.g. `0xs01_ef` now parses as `0xs01ef`. Uses the
existing `digitsWithUnderscores` helper and switches from `isAlphaNum`
to `isHexDigit` for stricter validation at lex time.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
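
As a rough illustration of the follow-up commit described above, the sketch below models stripping underscores from a bytes literal in Python. It is not the Haskell change itself: `BYTES_LITERAL` and `normalize_bytes_literal` are hypothetical names, the `ValueError` stands in for a lex error, and accepting a bare `0xs` with no digits is an assumption.

```python
import re

# Illustrative model only; the referenced commit reuses the Haskell
# digitsWithUnderscores helper rather than a regular expression.

# "0xs" followed by hex digit groups separated by single underscores.
# Only hex digits are allowed, modeling the switch from isAlphaNum to isHexDigit.
# Allowing zero digits after "0xs" is an assumption of this sketch.
BYTES_LITERAL = re.compile(r"0xs((?:[0-9a-fA-F]+(?:_[0-9a-fA-F]+)*)?)")


def normalize_bytes_literal(text):
    """Return the literal with its underscore separators removed."""
    match = BYTES_LITERAL.fullmatch(text)
    if match is None:
        raise ValueError(f"malformed bytes literal: {text!r}")
    return "0xs" + match.group(1).replace("_", "")


print(normalize_bytes_literal("0xs01_ef"))  # 0xs01ef
```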
@ChrisPenner
Member

Lovely, thanks @tfausak !
