
Add serialization/hashing for integer/natural #6163

Merged: runarorama merged 9 commits into trunk from runarorama/bigintserialize on Feb 11, 2026

@runarorama
Contributor

runarorama commented Feb 6, 2026

Overview

This change adds support for serializing and hashing arbitrary-precision Integer and Natural values in the Unison runtime.

Problem: When a Unison value containing an Integer or Natural was serialized via reflectValue, the runtime would fail with the error: reflectValue: cannot prepare value for serialization: foreign value. This caused operations like storing values in event logs or transferring them across the network to fail.

Solution: Added BigInt and BigNat constructors to the BLit (boxed literal) type in the ANF representation, with corresponding serialization, deserialization, and hashing support.

User experience: Code that previously hung or failed silently when serializing values containing Integer or Decimal types now works correctly.

Implementation approach and notes

  1. ANF.hs: Added BigInt Integer and BigNat Natural constructors to the BLit data type, alongside existing literals like Text, Bytes, Pos, Neg, etc.

  2. Tags.hs: Added BigIntT (tag 15) and BigNatT (tag 16) to the BLTag enum for wire format identification.

  3. ValueV5.hs & ANF/Serialize.hs: Added serialization functions that encode arbitrary-precision numbers as:

    • Natural: length-prefixed list of Word64 chunks (most significant first)
    • Integer: sign byte (0 = non-negative, 1 = negative) + Natural magnitude
  4. MurmurHash/Untyped.hs: Added hashing support for BigInt and BigNat by hashing the sign and Word64 chunks.

  5. Machine.hs: Updated reflectValue (goF) to convert WrapInteger/WrapNatural foreign values to ANF.BigInt/ANF.BigNat, and updated reifyValue (goL) to convert back.
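As a rough illustration of the chunk layout described in item 3, here is a minimal sketch using only `base`. The names `naturalToChunks`, `chunksToNatural`, `integerToChunks`, and `chunksToInteger` are hypothetical stand-ins, not the PR's actual helpers, and the sketch stops at the chunk list: it assumes the length prefix is simply the list length and does not model the byte-level writer.

```haskell
import Data.Bits (shiftL, shiftR, (.|.))
import Data.List (foldl')
import Data.Word (Word64, Word8)
import Numeric.Natural (Natural)

-- | Split a Natural into 64-bit chunks, most significant first.
-- The accumulator keeps the loop tail-recursive; zero becomes [].
-- (Hypothetical helper, not the PR's code.)
naturalToChunks :: Natural -> [Word64]
naturalToChunks = go []
  where
    go acc 0 = acc
    -- fromIntegral keeps only the low 64 bits of n
    go acc n = go (fromIntegral n : acc) (n `shiftR` 64)

-- | Rebuild a Natural from most-significant-first chunks.
chunksToNatural :: [Word64] -> Natural
chunksToNatural = foldl' (\acc w -> (acc `shiftL` 64) .|. fromIntegral w) 0

-- | Sign byte (0 for non-negative, 1 for negative) plus magnitude chunks.
integerToChunks :: Integer -> (Word8, [Word64])
integerToChunks i =
  (if i < 0 then 1 else 0, naturalToChunks (fromInteger (abs i)))

-- | Inverse of integerToChunks.
chunksToInteger :: (Word8, [Word64]) -> Integer
chunksToInteger (sign, ws) =
  let magnitude = toInteger (chunksToNatural ws)
   in if sign == 1 then negate magnitude else magnitude
```

Under these assumptions, `naturalToChunks (2 ^ 64)` evaluates to `[1,0]` and `integerToChunks (-5)` to `(1,[5])`.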

Interesting/controversial decisions

  • Chose BLit over alternatives: BLit already contains other serializable foreign values without literal syntax (Code, Quote, BArr, Arr), so adding BigInt/BigNat here is consistent with existing patterns.

  • Serialization format: Used a simple length-prefixed Word64 chunk format rather than a more compact variable-length encoding. Happy to change this if controversial.

  • Tail recursion: Made the naturalToWord64s and integerToWord64s helper functions tail-recursive and strict.

Test coverage

  • Added property-based tests using Hedgehog in Unison/Test/Runtime/ANF/Serialization.hs
  • Added genInteger generator that produces small integers (fit in Int64), large positive integers (2^64 to 2^256), and large negative integers
  • Added genNatural generator that produces small naturals (fit in Word64) and large naturals (2^64 to 2^256)
  • These generators are included in genBLit, which feeds into the existing valueRoundtrip property test
  • Added transcript to exercise the serialization

All 276 runtime tests pass.

Loose ends

Serialization format can be debated.

@dolio
Contributor

dolio commented Feb 6, 2026

I think the overall structure is fine. Adding BLit constructors is the right thing to do, I think.

As far as the format goes, my opinion is based on no actual usage data. However, my expectation is that most of these numbers are still going to be small. So maybe something that doesn't necessarily use 8 bytes would still be a good idea.

Just off the top of my head, since you're writing them big endian, maybe you can serialize just the first Word64 using the VarInt format. That way you automatically get that behavior for anything that fits in a single word, but there's also no chance of using more than 8 bytes per word for very large numbers.

However, it might also be good to just use VarInt for every Word64, because it's also unlikely that every one is going to have its high bits set even in a large number. So maybe the expected size is actually better for that. I'm just speculating, though. I haven't done/read any analysis.
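To make the size trade-off concrete, here is a small sketch of a base-128 (LEB128-style) varint for a single `Word64`. This is an assumption for illustration only: the runtime's actual VarInt format may differ, and `encodeVarWord64` / `decodeVarWord64` are hypothetical names.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word64, Word8)

-- | Encode a Word64 with 7 payload bits per byte, low bits first.
-- Every byte except the last has its high bit set.
-- (Illustrative only; not the runtime's VarInt implementation.)
encodeVarWord64 :: Word64 -> [Word8]
encodeVarWord64 w
  | w < 0x80 = [fromIntegral w]
  | otherwise =
      (fromIntegral (w .&. 0x7f) .|. 0x80) : encodeVarWord64 (w `shiftR` 7)

-- | Decode one varint, returning the value and the unread bytes.
-- Returns Nothing on truncated input or more than 10 bytes.
decodeVarWord64 :: [Word8] -> Maybe (Word64, [Word8])
decodeVarWord64 = go 0 0
  where
    go :: Int -> Word64 -> [Word8] -> Maybe (Word64, [Word8])
    go _ _ [] = Nothing
    go shift acc (b : rest)
      | shift > 63 = Nothing
      | b .&. 0x80 == 0 = Just (acc', rest)
      | otherwise = go (shift + 7) acc' rest
      where
        acc' = acc .|. (fromIntegral (b .&. 0x7f) `shiftL` shift)
```

In this particular encoding, values below 128 take one byte, while a word with its top bit set takes ten bytes rather than eight. That is the cost weighed above: applying a varint to every chunk pays off only if most chunks have their high bits clear.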


My other comment is that you only changed the version 5 value format. At this point, it's probably what everyone is using. But it's probably good for us to support this in the earlier formats as well. The earlier format might also be used if you try to hash (cryptographic, not murmur) the value.

That just involves the same changes in other Serialization files. And maybe factoring out a couple of your functions so that they're not duplicated.

Also, I just want to make sure: this Natural and Integer is a direct builtin, right? It isn't a behind-the-scenes replacement for the 'list of Nat' unison data type, and therefore should hash in the same way, correct?

@runarorama
Contributor Author

That's correct, Natural and Integer here are foreign conventions for the Haskell types GHC.Natural and GHC.Integer directly.

@runarorama
Contributor Author

OK @dolio I have changed to a varint encoding. LMK if this is better.

@dolio
Contributor

dolio commented Feb 9, 2026

The implementation looks good.

I think it'd be a good idea to make the tests use serialize.versioned with both versions 4 and 5 for more coverage. I think it's not actually testing 5 right now (since the legacy builtin is 4).

@aryairani
Contributor

Do we want to wait for the tests? My gut says yes, but someone let me know.

runarorama enabled auto-merge on February 11, 2026, 01:51
runarorama merged commit 3721508 into trunk on Feb 11, 2026
13 checks passed
runarorama deleted the runarorama/bigintserialize branch on February 11, 2026, 01:51
@pchiusano
Member

Cool. TIL about

[Screenshot: CleanShot 2026-02-11 at 09 49 53@2x]

4 participants