Skip to content

fix(swc_common): make eat_byte unsafe to prevent UTF-8 boundary violation#11731

Merged
kdy1 merged 3 commits intomainfrom
claude/issue-11719-20260320-1545
Mar 31, 2026
Merged

fix(swc_common): make eat_byte unsafe to prevent UTF-8 boundary violation#11731
kdy1 merged 3 commits intomainfrom
claude/issue-11719-20260320-1545

Conversation

@kdy1
Copy link
Copy Markdown
Member

@kdy1 kdy1 commented Mar 24, 2026

Make Input::eat_byte an unsafe fn to fix a soundness bug where calling it with a non-ASCII byte could split a multi-byte UTF-8 sequence, violating the &str invariant.

Closes #11719

Generated with Claude Code

…ation

`eat_byte` advances the input by exactly one byte, which can split a
multi-byte UTF-8 sequence if the byte is not ASCII. Mark the method as
`unsafe` and document that the caller must ensure `c <= 0x7F`.

Closes #11719

Co-authored-by: Donny/강동윤 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented Mar 24, 2026

🦋 Changeset detected

Latest commit: 56b6dc9

The changes in this PR will be included in the next version bump.

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Mar 24, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ kdy1
❌ github-actions[bot]
You have signed the CLA already but the status is still pending? Let us recheck it.

@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented Mar 24, 2026

Merging this PR will not alter performance

✅ 219 untouched benchmarks


Comparing claude/issue-11719-20260320-1545 (56b6dc9) with main (f21d336)

Open in CodSpeed

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Mar 24, 2026

Binary Sizes

File Size
swc.linux-x64-gnu.node 28M (29081800 bytes)

Commit: 411d6ad

@kdy1 kdy1 marked this pull request as ready for review March 31, 2026 07:41
@kdy1 kdy1 requested review from a team as code owners March 31, 2026 07:41
Copilot AI review requested due to automatic review settings March 31, 2026 07:41
@kdy1 kdy1 changed the title fix(es/common): make eat_byte unsafe to prevent UTF-8 boundary violation fix(swc_common): make eat_byte unsafe to prevent UTF-8 boundary violation Mar 31, 2026
@kdy1 kdy1 merged commit 669a659 into main Mar 31, 2026
22 of 23 checks passed
@kdy1 kdy1 deleted the claude/issue-11719-20260320-1545 branch March 31, 2026 07:41
@github-actions github-actions Bot added this to the Planned milestone Mar 31, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Mar 31, 2026

PR Review: make eat_byte unsafe to prevent UTF-8 boundary violation

Summary

This PR fixes a soundness bug (#11719) where Input::eat_byte could be called with a non-ASCII byte, splitting a multi-byte UTF-8 sequence and violating &str invariants (triggering UB under Miri). The fix makes eat_byte an unsafe fn and wraps all call sites in unsafe blocks with safety comments.

Feedback

Correctness: Looks good. The core fix is sound — all existing call sites pass ASCII byte literals (b'=', b'>', b'*', etc.), so the safety invariant holds at every call site. The safety documentation on the trait method clearly states the precondition (c <= 0x7F).

Consider adding a debug_assert!: While making the function unsafe documents the contract, adding debug_assert!(c <= 0x7F) inside eat_byte (both the trait default and the StringInput override) would catch misuse during development at zero cost in release builds. This is a common pattern for unsafe functions with numeric preconditions.

unsafe fn eat_byte(&mut self, c: u8) -> bool {
    debug_assert!(c <= 0x7F, "eat_byte called with non-ASCII byte: {:#x}", c);
    // ...
}

Alternative design consideration: Another approach would be to keep eat_byte safe but accept a newtype like AsciiByte that can only be constructed from ASCII values, making misuse a compile-time error rather than a safety contract. This would avoid spreading unsafe across ~30 call sites. However, the maintainer explicitly requested the unsafe approach in the issue, so this is just a note for future consideration.

Minor nit — comment alignment in crates/swc_ecma_parser/src/lexer/mod.rs:2122:

        self.bump(1); // first `?`
                      // Safety: b'?' and b'=' are ASCII.

The safety comment is oddly aligned as a continuation of the // first ? comment on the previous line. Should be on its own line above the if:

        self.bump(1); // first `?`
        // Safety: b'?' and b'=' are ASCII.
        if unsafe { self.input_mut().eat_byte(b'?') } {

Changeset: Marked as major for swc_common, which is correct since this is a breaking change to a public trait method signature.

Performance: CodSpeed shows a slight improvement (+2.07% on one benchmark), no regressions. The unsafe blocks don't change the generated code — the function was already doing the same unchecked operations internally.

Test coverage: The original PoC from the issue demonstrates the bug. The existing test suite passes. Since the fix is purely a signature change (moving the safety burden to callers), the existing lexer tests adequately cover the affected code paths. No new tests needed.

Verdict

LGTM with the minor suggestions above (debug_assert, comment alignment). The fix is correct, minimal, and well-documented.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a soundness issue in swc_common::input::Input by making eat_byte unsafe, since consuming a non-ASCII byte can advance into the middle of a UTF-8 sequence and break the &str invariant. It then updates lexer call sites to use unsafe { ... } when consuming known-ASCII bytes, and adds a changeset marking a breaking change.

Changes:

  • Make Input::eat_byte (and StringInput’s implementation) an unsafe fn with updated safety docs.
  • Update SWC lexers to wrap eat_byte calls in unsafe blocks (with some added “Safety: ASCII” comments).
  • Add a changeset marking swc_common as a major bump due to the API break.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
crates/swc_common/src/input.rs Makes eat_byte unsafe and documents the UTF-8 boundary safety requirement.
crates/swc_es_parser/src/lexer.rs Wraps eat_byte calls in unsafe blocks for ASCII token scanning.
crates/swc_ecma_parser/src/lexer/state.rs Wraps JSX terminal token eat_byte calls in unsafe blocks with an ASCII safety note.
crates/swc_ecma_parser/src/lexer/mod.rs Wraps eat_byte calls in unsafe blocks and updates a helper eat() wrapper.
crates/swc_ecma_lexer/src/lexer/mod.rs Wraps eat_byte calls in unsafe blocks with ASCII safety notes.
crates/swc_ecma_lexer/src/common/lexer/mod.rs Wraps eat_byte calls in unsafe blocks and updates a trait helper eat() wrapper.
.changeset/selfish-boats-argue.md Declares a major changeset for swc_common reflecting the breaking API change.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

#[inline(always)]
fn eat(&mut self, c: u8) -> bool {
self.input_mut().eat_byte(c)
// Safety: All callers pass ASCII bytes.
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eat is a safe method but now calls Input::eat_byte (unsafe) without enforcing its safety precondition (c must be ASCII). As written, passing a non-ASCII byte to eat could still violate the underlying &str UTF-8 invariant and cause UB. Consider guarding with if !c.is_ascii() { return false; } (or making eat itself unsafe) so the safe wrapper upholds the contract in all builds.

Suggested change
// Safety: All callers pass ASCII bytes.
if !c.is_ascii() {
return false;
}
// Safety: `c` is checked to be ASCII above, so this upholds the contract of `eat_byte`.

Copilot uses AI. Check for mistakes.
#[inline(always)]
fn eat(&mut self, c: u8) -> bool {
self.input_mut().eat_byte(c)
// Safety: All callers pass ASCII bytes.
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eat is safe but wraps an unsafe { input_mut().eat_byte(c) } call without ensuring c is ASCII. This creates a hidden precondition on a safe method (and would be UB if misused). Add an ASCII guard (e.g., early-return false when !c.is_ascii()) or otherwise constrain the API so the unsafe contract is always upheld.

Suggested change
// Safety: All callers pass ASCII bytes.
if !c.is_ascii() {
return false;
}
// Safety: We guard above that `c` is ASCII, satisfying `eat_byte`'s contract.

Copilot uses AI. Check for mistakes.
@github-actions github-actions Bot modified the milestones: Planned, 1.15.24 Apr 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

Soundness Bug in this crate

3 participants