fix: Handle char expansion in islower, isupper#7583
fix: Handle char expansion in islower, isupper#7583youknowone merged 2 commits intoRustPython:mainfrom
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yml Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (4)
💤 Files with no reviewable changes (2)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthroughpy_islower and py_isupper were rewritten to decode UTF‑8 and consult ICU CodePointSetData ( Changes
Sequence Diagram(s)(omitted) Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Part of CI is failing because of two newly unused functions in the |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/vm/src/anystr.rs (1)
412-448:⚠️ Potential issue | 🔴 CriticalPreserve the “at least one cased character” requirement.
lower/upperis set totruefor every successfully compared chunk, so uncased strings now regress totruehere. For example,"123".islower()and"🙂".isupper()would both pass this implementation even though CPython returnsFalse. Please derive the flag from actual casedness (for example, by checking that the opposite-case mapping changes the chunk) instead of from chunk presence. A small negative regression test for an all-uncased string would also help.🐛 Suggested fix
fn py_islower(&self) -> bool { - let mut lower = false; + let mut cased = false; let mut lowercased = String::with_capacity(self.bytes_len()); + let mut uppercased = String::with_capacity(self.bytes_len()); let cm = CaseMapper::new(); for chunk in self.as_bytes().utf8_chunks().map(|c| c.valid()) { - let writer = cm.lowercase(chunk, &LanguageIdentifier::UNKNOWN); lowercased.clear(); - writer + cm.lowercase(chunk, &LanguageIdentifier::UNKNOWN) .write_to(&mut lowercased) .expect("Writing to a buffer is infallible"); if chunk != lowercased { return false; } - lower = true; + uppercased.clear(); + cm.uppercase(chunk, &LanguageIdentifier::UNKNOWN) + .write_to(&mut uppercased) + .expect("Writing to a buffer is infallible"); + cased |= chunk != uppercased; } - lower + cased } fn py_isupper(&self) -> bool { - let mut upper = false; + let mut cased = false; let mut uppercased = String::with_capacity(self.bytes_len()); + let mut lowercased = String::with_capacity(self.bytes_len()); let cm = CaseMapper::new(); for chunk in self.as_bytes().utf8_chunks().map(|c| c.valid()) { - let writer = cm.uppercase(chunk, &LanguageIdentifier::UNKNOWN); uppercased.clear(); - writer + cm.uppercase(chunk, &LanguageIdentifier::UNKNOWN) .write_to(&mut uppercased) .expect("Writing to a buffer is infallible"); if chunk != uppercased { return false; } - upper = true; + lowercased.clear(); + cm.lowercase(chunk, &LanguageIdentifier::UNKNOWN) + .write_to(&mut lowercased) + .expect("Writing to a buffer is infallible"); + cased |= chunk != lowercased; } - upper + cased }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/anystr.rs` around lines 412 - 448, The current py_islower and py_isupper set lower/upper true for every chunk processed, causing uncased-only strings to incorrectly return true; change the flag logic so you only set lower to true when the chunk is actually cased (e.g., when the uppercase mapping of the chunk differs from the chunk) and only set upper to true when the lowercase mapping differs, while preserving the existing equality checks that reject mixed-case chunks; update py_islower/py_isupper (and use CaseMapper, lowercased, uppercased, cm) accordingly and add a small negative regression test asserting that "123" and "🙂" return false for islower/isupper.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@crates/vm/src/anystr.rs`:
- Around line 412-448: The current py_islower and py_isupper set lower/upper
true for every chunk processed, causing uncased-only strings to incorrectly
return true; change the flag logic so you only set lower to true when the chunk
is actually cased (e.g., when the uppercase mapping of the chunk differs from
the chunk) and only set upper to true when the lowercase mapping differs, while
preserving the existing equality checks that reject mixed-case chunks; update
py_islower/py_isupper (and use CaseMapper, lowercased, uppercased, cm)
accordingly and add a small negative regression test asserting that "123" and
"🙂" return false for islower/isupper.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 043c5b7b-8009-434d-b53b-284d97f84ab6
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (4)
Cargo.tomlcrates/vm/Cargo.tomlcrates/vm/src/anystr.rsextra_tests/snippets/builtin_str.py
Closes: RustPython#7526 `py_islower` and `py_isupper` need to handle expansions for letter casing. Comparing chars directly can miss edge cases in certain languages. Unfortunately, like the last PR, this allocates to handle potential expansions. I also had to add `icu_casemap` as a dependency. RustPython is already using parts of icu4x so this doesn't add many transitive dependencies.
113ba92 to
1ae5efb
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
crates/vm/src/anystr.rs (1)
413-446: Consider consolidating the duplicate iteration over characters.The logic is correct and aligns with Python's case-checking semantics. However, each loop iterates over
chunk.chars()twice—once forcase_change.contains()and once foralphabetic.contains(). This could be consolidated into a single pass for better efficiency:♻️ Optional single-pass optimization
fn py_islower(&self) -> bool { let case_change = CodePointSetData::new::<ChangesWhenLowercased>(); let alphabetic = CodePointSetData::new::<Alphabetic>(); let mut lower = false; for chunk in self.as_bytes().utf8_chunks().map(|c| c.valid()) { - if chunk.chars().any(|c| case_change.contains(c)) { - return false; - } - - if !lower && chunk.chars().any(|c| alphabetic.contains(c)) { - lower = true; + for c in chunk.chars() { + if case_change.contains(c) { + return false; + } + if !lower && alphabetic.contains(c) { + lower = true; + } } } lower }Apply similarly to
py_isupper.Additionally, per coding guidelines, both
py_islowerandpy_isuppershare identical structure differing only in theCodePointSetDatatype parameter. You could extract a helper to reduce duplication, though this is optional given the current simplicity.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/vm/src/anystr.rs` around lines 413 - 446, The two methods py_islower and py_isupper currently iterate chunk.chars() twice per chunk; change each to a single pass over chunk.chars() where you check both CodePointSetData::new::<ChangesWhenLowercased>() (or ChangesWhenUppercased) and CodePointSetData::new::<Alphabetic>() in the same loop so you can early-return on a case_change match and set the lower/upper flag when you see an alphabetic char, avoiding the duplicate any() calls; optionally extract the common loop pattern into a small helper that accepts the specific ChangesWhen... type and returns the boolean to remove duplication between py_islower and py_isupper.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@extra_tests/snippets/builtin_str.py`:
- Around line 227-230: Remove the unnecessary trailing semicolons at the end of
the assert statements that use the cat emoji string checks (e.g., the four
asserts calling isupper()/islower() on "\U0001f431" and combined strings);
simply delete the ";" characters so each assert is a proper Python statement
(e.g., change "assert not \"\U0001f431\".isupper();" to "assert not
\"\U0001f431\".isupper()").
---
Nitpick comments:
In `@crates/vm/src/anystr.rs`:
- Around line 413-446: The two methods py_islower and py_isupper currently
iterate chunk.chars() twice per chunk; change each to a single pass over
chunk.chars() where you check both
CodePointSetData::new::<ChangesWhenLowercased>() (or ChangesWhenUppercased) and
CodePointSetData::new::<Alphabetic>() in the same loop so you can early-return
on a case_change match and set the lower/upper flag when you see an alphabetic
char, avoiding the duplicate any() calls; optionally extract the common loop
pattern into a small helper that accepts the specific ChangesWhen... type and
returns the boolean to remove duplication between py_islower and py_isupper.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 5a62116c-bef4-4f8e-be79-b71517b9acbb
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (2)
crates/vm/src/anystr.rsextra_tests/snippets/builtin_str.py
1ae5efb to
e52405f
Compare
This fixes a regression mentioned by CodeRabbit. I also figured out how to check a string's case without allocation using Unicode properties. Thus, this commit removes `icu_casemap` again. `icu_casemap` and my old solution is required for a robust case check, but it seems like the current code is fine for Python.
e52405f to
712bd75
Compare
ShaharNaveh
left a comment
There was a problem hiding this comment.
lgtm:)
my only concert is about the removal of those two methods from a public trait.
| assert "\U0001f431 CAT".isupper() | ||
| assert "\U0001f431 cat".islower() | ||
| assert "\u0295".islower() | ||
| assert "\u1c89".isupper() |
There was a problem hiding this comment.
If CPython doesn't have a test case that cover those edge cases, I think it will be worth trying to get it upstream
|
Should I re-add the methods and mark them with |
youknowone
left a comment
There was a problem hiding this comment.
Thank you so much! They were removed because upper/lower case is not a trait of char, but trait of string, right?
Closes: #7526
py_islowerandpy_isupperneed to handle expansions for letter casing. Comparing chars directly can miss edge cases in certain languages.Unfortunately, like the last PR, this allocates to handle potential expansions.I also had to addicu_casemapas a dependency.RustPython is already using parts of icu4x so this doesn't add many transitive dependencies.I figured out how to do it without allocation after working on CodeRabbit's suggestions. I also added more tests based on CodeRabbit which all pass. I cross checked the tests with CPython too. 🥳
The ICU crates use the Writeable trait internally which allows writing to a buffer. If you want to avoid the heap allocations, I can try wrapping a stack allocated buffer that falls back to heap allocations in my next patch.
icu_casemapalso handles title casing. It may be useful to dropunicode-casingin favor oficu_casemap.icu4xis the official Unicode consortium implementation, so it's robust and will be maintained indefinitely.^ I wrote this comment earlier when I added
icu_casemap. I still think it's relevant so I'll leave it. Currently, there are still title case bugs that may be easier to fix usingicu_casemapoverunicode-casing. I'll tackle that in another patch.Summary by CodeRabbit
Bug Fixes
Tests