fix: Swapcase must handle multibyte expansions#7559
fix: Swapcase must handle multibyte expansions#7559youknowone merged 1 commit intoRustPython:mainfrom
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthrough
Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
📦 Library DependenciesThe following Lib/ modules were modified. Here are their dependencies: [ ] test: cpython/Lib/test/test_str.py (TODO: 16) dependencies: dependent tests: (no tests depend on str) Legend:
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
crates/vm/src/builtins/str.rs (1)
1044-1059: LGTM! Correct fix for Unicode multi-character case expansions.The change from
to_ascii_uppercase()/to_ascii_lowercase()toto_uppercase()/to_lowercase()withextend()properly handles Unicode case mappings that expand to multiple characters (e.g., "ß" → "SS", or "ı" → "I"). Theshrink_to_fit()call is appropriate since the initial capacity may over-allocate for potential expansions that don't always occur.Minor terminology nit: the comment at line 1048 says "multiple bytes" but technically
to_uppercase()returns an iterator because case changes may produce multiple characters (code points), not bytes. Each character is then encoded to its byte representation.📝 Optional comment clarification
- // to_uppercase returns an iterator because case changes may be multiple bytes + // to_uppercase returns an iterator because case changes may produce multiple characters,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/builtins/str.rs` around lines 1044 - 1059, Update the inline comment inside the swapcase method to accurately describe why to_uppercase()/to_lowercase() return an iterator: change "multiple bytes" to "multiple characters (code points)" or similar; the correction should be made in the comment immediately above the if-block that calls c.to_uppercase() and c.to_lowercase() in the swapcase function (referencing swapcase, c_orig, to_uppercase, to_lowercase).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@crates/vm/src/builtins/str.rs`:
- Around line 1044-1059: Update the inline comment inside the swapcase method to
accurately describe why to_uppercase()/to_lowercase() return an iterator: change
"multiple bytes" to "multiple characters (code points)" or similar; the
correction should be made in the comment immediately above the if-block that
calls c.to_uppercase() and c.to_lowercase() in the swapcase function
(referencing swapcase, c_orig, to_uppercase, to_lowercase).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: da2da0a5-1571-4e3b-97c0-44307ec105b2
⛔ Files ignored due to path filters (1)
Lib/test/test_str.pyis excluded by!Lib/**
📒 Files selected for processing (1)
crates/vm/src/builtins/str.rs
youknowone
left a comment
There was a problem hiding this comment.
Thank you for contributing! welcome to RustPython project
crates/vm/src/builtins/str.rs
Outdated
| swapped_str.push(c_orig); | ||
| } | ||
| } | ||
| swapped_str.shrink_to_fit(); |
There was a problem hiding this comment.
Is this worth to do? I'd like to listen what was the rationale of this line.
There was a problem hiding this comment.
Rust's vector doubles in size for amortized pushes. Here is a link to Rust's source code as a citation.
So, my logic is that extend() may cause a large over allocation because the buffer in Wtf8 would double in size. If the buffer was not reallocated, shrink_to_fit() does nothing so there is no hit there.
The worst case scenario (I THINK) would be three allocations:
with_capacity(n)-> first allocation; capacity is nextend()-> possible allocation for multiple chars; new capacity is n * 2shrink_to_fit()-> realloc if the initial n changed but saves memory
The best case scenario is if the buffer size doesn't change.
I may have overthought this. 😅 But, that's my reasoning. It felt wrong to waste all of that extra memory when the original code allocated an exact sized buffer. 😆
Let me know if I should remove that line though. I'll fix that and the test later today. 😁
Lib/test/test_str.py
Outdated
| self.assertEqual('\u03a3'.swapcase(), '\u03c3') | ||
| self.assertEqual('ß'.swapcase(), 'SS') | ||
| self.assertEqual('\u1fd2'.swapcase(), '\u0399\u0308\u0300') | ||
| self.assertEqual("Σίσυφος".swapcase(), "σΊΣΥΦΟΣ") |
There was a problem hiding this comment.
Please don't modify tests by CPython, but add them to extra_tests/snippets/builtin_str.py instead
There was a problem hiding this comment.
Ahh my bad. I didn't know those were tests from CPython. I'll fix it.
There was a problem hiding this comment.
If you are interested in, probably this is worth to add CPython test case actually. We usually don't add code under Lib/test directly, but also merge commits from CPython often
There was a problem hiding this comment.
I moved it over to builtin_str.py and everything passes. 😁
I'll remove shrink_to_fit() if you two think it's a better idea too.
There was a problem hiding this comment.
I am not 100% confident, but feel like shrink_to_fit will be overhead for normal inputs, while irregular input just takes double memory consumption. The other way is more important. Suppose a case when the capacity is 100k+1, but the result is 100k. It will reallocate and shrink only to reduce a single byte, which doesn't look worth to do.
There was a problem hiding this comment.
Fair enough. I removed shrink_to_fit() and force pushed.
d749d6f to
0fc94a8
Compare
`swapcase` used `to_ascii_lowercase` and uppercase to swap cases. This is fine for ASCII, but code points may expand into multiple bytes which leads to incorrect case swaps for some languages. The fix is to use `to_lowercase` and `to_uppercase` instead. Unfortunately, this leads to a realloc in `swapcase` when bytes are expanded. Part of RustPython#7526.
0fc94a8 to
2e1a0f4
Compare
swapcaseusedto_ascii_lowercaseand uppercase to swap cases. This is fine for ASCII, but code points may expand into multiple bytes which leads to incorrect case swaps for some languages. The fix is to useto_lowercasedirectly.Unfortunately, this leads to a realloc in
swapcasewhen bytes are expanded.Part of #7526.
Most of the disabled swapcase tests seem to pass now including the new assert based on #7526. However, there are other areas where
to_ascii_lowercaseandto_ascii_uppercaseare used which seem to be causing the other cases to fail.Summary by CodeRabbit