fix(swc_common): make `eat_byte` unsafe to prevent UTF-8 boundary violation #11731
…ation `eat_byte` advances the input by exactly one byte, which can split a multi-byte UTF-8 sequence if the byte is not ASCII. Mark the method as `unsafe` and document that the caller must ensure `c <= 0x7F`. Closes #11719 Co-authored-by: Donny/강동윤 <[email protected]> Co-Authored-By: Claude Opus 4.6 <[email protected]>
🦋 Changeset detected. Latest commit: 56b6dc9. The changes in this PR will be included in the next version bump.
Binary Sizes
Commit: 411d6ad
Pull request overview
This PR addresses a soundness issue in `swc_common::input::Input` by making `eat_byte` an `unsafe fn`: consuming a non-ASCII byte can advance the cursor into the middle of a UTF-8 sequence and break the `&str` invariant. It then updates lexer call sites to wrap calls that consume known-ASCII bytes in `unsafe { ... }`, and adds a changeset marking the change as breaking.
Changes:
- Make `Input::eat_byte` (and `StringInput`'s implementation) an `unsafe fn` with updated safety docs.
- Update SWC lexers to wrap `eat_byte` calls in `unsafe` blocks (with some added "Safety: ASCII" comments).
- Add a changeset marking `swc_common` as a major bump due to the API break.
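The boundary problem described in the overview can be reproduced with safe standard-library calls alone. The sketch below is illustrative only: both helper names are hypothetical and are not part of `swc_common`. It assumes nothing beyond `str::is_char_boundary` and the checked `str::get`.

```rust
/// Hypothetical helper (not an swc API): reports whether skipping exactly one
/// byte from the start of `s` lands on a UTF-8 character boundary.
fn one_byte_advance_is_on_boundary(s: &str) -> bool {
    // `str::is_char_boundary` returns false for an index inside a multi-byte
    // sequence and for an index past the end of the string.
    s.is_char_boundary(1)
}

/// Hypothetical helper (not an swc API): a checked "advance by one byte".
/// Returns the remainder only when byte offset 1 is a valid boundary.
fn checked_advance_one_byte(s: &str) -> Option<&str> {
    // `str::get` refuses ranges that would split a character.
    s.get(1..)
}
```

For `"abc"` the one-byte advance is fine and yields `Some("bc")`. For `"éa"` the first character occupies two bytes (`0xC3 0xA9`), so the checked version returns `None`; an unchecked advance at that offset is what would break the `&str` invariant.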
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| crates/swc_common/src/input.rs | Makes eat_byte unsafe and documents the UTF-8 boundary safety requirement. |
| crates/swc_es_parser/src/lexer.rs | Wraps eat_byte calls in unsafe blocks for ASCII token scanning. |
| crates/swc_ecma_parser/src/lexer/state.rs | Wraps JSX terminal token eat_byte calls in unsafe blocks with an ASCII safety note. |
| crates/swc_ecma_parser/src/lexer/mod.rs | Wraps eat_byte calls in unsafe blocks and updates a helper eat() wrapper. |
| crates/swc_ecma_lexer/src/lexer/mod.rs | Wraps eat_byte calls in unsafe blocks with ASCII safety notes. |
| crates/swc_ecma_lexer/src/common/lexer/mod.rs | Wraps eat_byte calls in unsafe blocks and updates a trait helper eat() wrapper. |
| .changeset/selfish-boats-argue.md | Declares a major changeset for swc_common reflecting the breaking API change. |
    #[inline(always)]
    fn eat(&mut self, c: u8) -> bool {
        self.input_mut().eat_byte(c)
        // Safety: All callers pass ASCII bytes.
`eat` is a safe method but now calls `Input::eat_byte` (unsafe) without enforcing its safety precondition (`c` must be ASCII). As written, passing a non-ASCII byte to `eat` could still violate the underlying `&str` UTF-8 invariant and cause UB. Consider guarding with `if !c.is_ascii() { return false; }` (or making `eat` itself `unsafe`) so the safe wrapper upholds the contract in all builds.
    // Safety: All callers pass ASCII bytes.
    if !c.is_ascii() {
        return false;
    }
    // Safety: `c` is checked to be ASCII above, so this upholds the contract of `eat_byte`.
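A minimal, self-contained sketch of the guarded wrapper suggested above. `ToyInput` is a hypothetical stand-in written for this example; it is not swc's `StringInput`, and the real `eat_byte` implementation may differ. The point is only the shape: the unsafe method states its precondition, and the safe wrapper checks it before calling.

```rust
/// Toy cursor over a string; a hypothetical stand-in, NOT swc's `StringInput`.
struct ToyInput<'a> {
    rest: &'a str,
}

impl<'a> ToyInput<'a> {
    /// Consumes the next byte if it equals `c`.
    ///
    /// # Safety
    /// `c` must be ASCII (`c <= 0x7F`). Otherwise the cursor could stop in
    /// the middle of a multi-byte UTF-8 sequence.
    unsafe fn eat_byte(&mut self, c: u8) -> bool {
        if self.rest.as_bytes().first() == Some(&c) {
            // SAFETY: the caller guarantees `c` is ASCII and the first byte
            // equals `c`, so byte offset 1 is a character boundary.
            self.rest = unsafe { self.rest.get_unchecked(1..) };
            true
        } else {
            false
        }
    }

    /// Safe wrapper: refuses non-ASCII bytes instead of trusting callers.
    fn eat(&mut self, c: u8) -> bool {
        if !c.is_ascii() {
            return false;
        }
        // SAFETY: `c` was checked to be ASCII just above.
        unsafe { self.eat_byte(c) }
    }
}
```

With this guard, `eat(0xC3)` on an input starting with `é` returns `false` and leaves the cursor untouched, so no safe caller can trigger the boundary violation. The cost is one comparison per call, which the optimizer can typically remove when `c` is a constant.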
    #[inline(always)]
    fn eat(&mut self, c: u8) -> bool {
        self.input_mut().eat_byte(c)
        // Safety: All callers pass ASCII bytes.
`eat` is safe but wraps an `unsafe { input_mut().eat_byte(c) }` call without ensuring `c` is ASCII. This creates a hidden precondition on a safe method (and would be UB if misused). Add an ASCII guard (e.g., early-return `false` when `!c.is_ascii()`) or otherwise constrain the API so the unsafe contract is always upheld.
    // Safety: All callers pass ASCII bytes.
    if !c.is_ascii() {
        return false;
    }
    // Safety: We guard above that `c` is ASCII, satisfying `eat_byte`'s contract.
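One way to "otherwise constrain the API" is to move the ASCII check into a type, so the precondition is established once at construction rather than at every call. This is a sketch of that idea only: `AsciiByte` and `eat_ascii` are hypothetical names invented for this example and do not exist in swc.

```rust
/// Hypothetical newtype (not an swc API): a byte proven to be ASCII.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct AsciiByte(u8);

impl AsciiByte {
    /// Returns `None` for bytes above 0x7F, so a non-ASCII `AsciiByte`
    /// cannot be constructed through this function.
    const fn new(b: u8) -> Option<Self> {
        if b <= 0x7F {
            Some(AsciiByte(b))
        } else {
            None
        }
    }

    const fn get(self) -> u8 {
        self.0
    }
}

/// Hypothetical helper: returns the remainder of `rest` after consuming `c`,
/// or `None` when the next byte is not `c`.
fn eat_ascii(rest: &str, c: AsciiByte) -> Option<&str> {
    if rest.as_bytes().first() == Some(&c.get()) {
        // The matched byte is ASCII, so offset 1 is a boundary and the
        // checked `str::get` succeeds.
        rest.get(1..)
    } else {
        None
    }
}
```

Token bytes such as `b'='` can be validated once (for example in a `const`), after which callers have no way to pass `0xC3` by mistake. Whether this is worth the extra type compared with a plain `is_ascii()` guard is a design trade-off for the maintainers.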
Make `Input::eat_byte` an `unsafe fn` to fix a soundness bug where calling it with a non-ASCII byte could split a multi-byte UTF-8 sequence, violating the `&str` invariant.

Closes #11719
Generated with Claude Code