fix: Handle char expansion in islower, isupper by joshuamegnauth54 · Pull Request #7583 · RustPython/RustPython

joshuamegnauth54 · 2026-04-12T19:29:56Z

Closes: #7526

py_islower and py_isupper need to handle expansions for letter casing. Comparing chars directly can miss edge cases in certain languages. ~~Unfortunately, like the last PR, this allocates to handle potential expansions.~~

~~I also had to add icu_casemap as a dependency.~~
~~RustPython is already using parts of icu4x so this doesn't add many transitive dependencies.~~

I figured out how to do it without allocation after working on CodeRabbit's suggestions. I also added more tests based on CodeRabbit which all pass. I cross checked the tests with CPython too. 🥳

The ICU crates use the Writeable trait internally which allows writing to a buffer. If you want to avoid the heap allocations, I can try wrapping a stack allocated buffer that falls back to heap allocations in my next patch. icu_casemap also handles title casing. It may be useful to drop unicode-casing in favor of icu_casemap. icu4x is the official Unicode consortium implementation, so it's robust and will be maintained indefinitely.

^ I wrote this comment earlier when I added icu_casemap. I still think it's relevant so I'll leave it. Currently, there are still title case bugs that may be easier to fix using icu_casemap over unicode-casing. I'll tackle that in another patch.

Summary by CodeRabbit

Bug Fixes
- Improved Unicode case detection and classification so string case checks (isupper/islower/istitle) behave correctly for a wider range of international characters and mixed emoji/text sequences.
Tests
- Expanded string tests to cover Unicode edge cases: modifier letters, title/swapcase behavior, emoji with text, and digit-only strings to ensure correct case-reporting.

coderabbitai · 2026-04-12T19:30:10Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 8eddd968-4df4-486e-a236-e44a5a95a4db

📥 Commits

Reviewing files that changed from the base of the PR and between e52405f and 712bd75.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (4)

crates/vm/src/anystr.rs
crates/vm/src/builtins/str.rs
crates/vm/src/bytes_inner.rs
extra_tests/snippets/builtin_str.py

💤 Files with no reviewable changes (2)

crates/vm/src/bytes_inner.rs
crates/vm/src/builtins/str.rs

✅ Files skipped from review due to trivial changes (1)

extra_tests/snippets/builtin_str.py

📝 Walkthrough

Walkthrough

py_islower and py_isupper were rewritten to decode UTF‑8 and consult ICU CodePointSetData (Alphabetic, ChangesWhenLowercased, ChangesWhenUppercased) per codepoint; AnyChar’s is_lowercase/is_uppercase methods were removed and tests for Unicode casing/swapcase were extended.

Changes

Cohort / File(s)	Summary
Unicode case logic `crates/vm/src/anystr.rs`	Rewrote `py_islower` / `py_isupper` to iterate decoded UTF‑8 chars (from `as_bytes().utf8_chunks().map(
AnyChar trait impls `crates/vm/src/builtins/str.rs`, `crates/vm/src/bytes_inner.rs`	Removed `is_lowercase` and `is_uppercase` implementations from `AnyChar` impls for `char`, `CodePoint`, `ascii::AsciiChar`, and `u8`. Kept `bytes_len` and other existing helpers.
Tests: Unicode casing `extra_tests/snippets/builtin_str.py`	Added and extended assertions for `islower()`/`isupper()`, `istitle()`, and `swapcase()` covering specific Unicode code points, emojis, and digit-only strings to exercise new ICU-based case behavior.

Sequence Diagram(s)

(omitted)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

PR #7555: Migrates Unicode property handling to icu_properties / CodePointSetData, closely related to this PR's switch to ICU sets for casing logic.

Suggested reviewers

youknowone
ShaharNaveh

Poem

🐰 Hopping through bytes and Unicode light,
ICU maps guide each lowercase fight,
I peek at each codepoint, swift and spry,
Rejecting odd changes, letting alphabets fly,
A tiny swapcase dance beneath the sky.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title accurately describes the main change: updating islower and isupper to handle character expansion in Unicode case mappings.
Linked Issues check	✅ Passed	The PR addresses the core requirement from `#7526` by implementing proper Unicode case mapping handling for islower and isupper using ICU CodePointSetData to handle character expansion.
Out of Scope Changes check	✅ Passed	All changes are directly related to fixing islower/isupper Unicode handling: updating implementations, removing unused AnyChar trait methods, and adding test coverage for the fixes.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

joshuamegnauth54 · 2026-04-12T19:36:37Z

Part of CI is failing because of two newly unused functions in the AnyChar trait. Should I just remove those functions?

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/vm/src/anystr.rs (1)

412-448: ⚠️ Potential issue | 🔴 Critical

Preserve the “at least one cased character” requirement.

lower / upper is set to true for every successfully compared chunk, so uncased strings now regress to true here. For example, "123".islower() and "🙂".isupper() would both pass this implementation even though CPython returns False. Please derive the flag from actual casedness (for example, by checking that the opposite-case mapping changes the chunk) instead of from chunk presence. A small negative regression test for an all-uncased string would also help.

🐛 Suggested fix

 fn py_islower(&self) -> bool {
-    let mut lower = false;
+    let mut cased = false;
     let mut lowercased = String::with_capacity(self.bytes_len());
+    let mut uppercased = String::with_capacity(self.bytes_len());
     let cm = CaseMapper::new();
     for chunk in self.as_bytes().utf8_chunks().map(|c| c.valid()) {
-        let writer = cm.lowercase(chunk, &LanguageIdentifier::UNKNOWN);
         lowercased.clear();
-        writer
+        cm.lowercase(chunk, &LanguageIdentifier::UNKNOWN)
             .write_to(&mut lowercased)
             .expect("Writing to a buffer is infallible");
         if chunk != lowercased {
             return false;
         }
-        lower = true;
+        uppercased.clear();
+        cm.uppercase(chunk, &LanguageIdentifier::UNKNOWN)
+            .write_to(&mut uppercased)
+            .expect("Writing to a buffer is infallible");
+        cased |= chunk != uppercased;
     }
-    lower
+    cased
 }
 
 fn py_isupper(&self) -> bool {
-    let mut upper = false;
+    let mut cased = false;
     let mut uppercased = String::with_capacity(self.bytes_len());
+    let mut lowercased = String::with_capacity(self.bytes_len());
     let cm = CaseMapper::new();
     for chunk in self.as_bytes().utf8_chunks().map(|c| c.valid()) {
-        let writer = cm.uppercase(chunk, &LanguageIdentifier::UNKNOWN);
         uppercased.clear();
-        writer
+        cm.uppercase(chunk, &LanguageIdentifier::UNKNOWN)
             .write_to(&mut uppercased)
             .expect("Writing to a buffer is infallible");
         if chunk != uppercased {
             return false;
         }
-        upper = true;
+        lowercased.clear();
+        cm.lowercase(chunk, &LanguageIdentifier::UNKNOWN)
+            .write_to(&mut lowercased)
+            .expect("Writing to a buffer is infallible");
+        cased |= chunk != lowercased;
     }
-    upper
+    cased
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/anystr.rs` around lines 412 - 448, The current py_islower and
py_isupper set lower/upper true for every chunk processed, causing uncased-only
strings to incorrectly return true; change the flag logic so you only set lower
to true when the chunk is actually cased (e.g., when the uppercase mapping of
the chunk differs from the chunk) and only set upper to true when the lowercase
mapping differs, while preserving the existing equality checks that reject
mixed-case chunks; update py_islower/py_isupper (and use CaseMapper, lowercased,
uppercased, cm) accordingly and add a small negative regression test asserting
that "123" and "🙂" return false for islower/isupper.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@crates/vm/src/anystr.rs`:
- Around line 412-448: The current py_islower and py_isupper set lower/upper
true for every chunk processed, causing uncased-only strings to incorrectly
return true; change the flag logic so you only set lower to true when the chunk
is actually cased (e.g., when the uppercase mapping of the chunk differs from
the chunk) and only set upper to true when the lowercase mapping differs, while
preserving the existing equality checks that reject mixed-case chunks; update
py_islower/py_isupper (and use CaseMapper, lowercased, uppercased, cm)
accordingly and add a small negative regression test asserting that "123" and
"🙂" return false for islower/isupper.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 043c5b7b-8009-434d-b53b-284d97f84ab6

📥 Commits

Reviewing files that changed from the base of the PR and between d201c48 and 113ba92.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (4)

Cargo.toml
crates/vm/Cargo.toml
crates/vm/src/anystr.rs
extra_tests/snippets/builtin_str.py

Closes: RustPython#7526 `py_islower` and `py_isupper` need to handle expansions for letter casing. Comparing chars directly can miss edge cases in certain languages. Unfortunately, like the last PR, this allocates to handle potential expansions. I also had to add `icu_casemap` as a dependency. RustPython is already using parts of icu4x so this doesn't add many transitive dependencies.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

crates/vm/src/anystr.rs (1)

413-446: Consider consolidating the duplicate iteration over characters.

The logic is correct and aligns with Python's case-checking semantics. However, each loop iterates over chunk.chars() twice—once for case_change.contains() and once for alphabetic.contains(). This could be consolidated into a single pass for better efficiency:

♻️ Optional single-pass optimization

 fn py_islower(&self) -> bool {
     let case_change = CodePointSetData::new::<ChangesWhenLowercased>();
     let alphabetic = CodePointSetData::new::<Alphabetic>();
     let mut lower = false;
     for chunk in self.as_bytes().utf8_chunks().map(|c| c.valid()) {
-        if chunk.chars().any(|c| case_change.contains(c)) {
-            return false;
-        }
-
-        if !lower && chunk.chars().any(|c| alphabetic.contains(c)) {
-            lower = true;
+        for c in chunk.chars() {
+            if case_change.contains(c) {
+                return false;
+            }
+            if !lower && alphabetic.contains(c) {
+                lower = true;
+            }
         }
     }
     lower
 }

Apply similarly to py_isupper.

Additionally, per coding guidelines, both py_islower and py_isupper share identical structure differing only in the CodePointSetData type parameter. You could extract a helper to reduce duplication, though this is optional given the current simplicity.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/anystr.rs` around lines 413 - 446, The two methods py_islower
and py_isupper currently iterate chunk.chars() twice per chunk; change each to a
single pass over chunk.chars() where you check both
CodePointSetData::new::<ChangesWhenLowercased>() (or ChangesWhenUppercased) and
CodePointSetData::new::<Alphabetic>() in the same loop so you can early-return
on a case_change match and set the lower/upper flag when you see an alphabetic
char, avoiding the duplicate any() calls; optionally extract the common loop
pattern into a small helper that accepts the specific ChangesWhen... type and
returns the boolean to remove duplication between py_islower and py_isupper.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@extra_tests/snippets/builtin_str.py`:
- Around line 227-230: Remove the unnecessary trailing semicolons at the end of
the assert statements that use the cat emoji string checks (e.g., the four
asserts calling isupper()/islower() on "\U0001f431" and combined strings);
simply delete the ";" characters so each assert is a proper Python statement
(e.g., change "assert not \"\U0001f431\".isupper();" to "assert not
\"\U0001f431\".isupper()").

---

Nitpick comments:
In `@crates/vm/src/anystr.rs`:
- Around line 413-446: The two methods py_islower and py_isupper currently
iterate chunk.chars() twice per chunk; change each to a single pass over
chunk.chars() where you check both
CodePointSetData::new::<ChangesWhenLowercased>() (or ChangesWhenUppercased) and
CodePointSetData::new::<Alphabetic>() in the same loop so you can early-return
on a case_change match and set the lower/upper flag when you see an alphabetic
char, avoiding the duplicate any() calls; optionally extract the common loop
pattern into a small helper that accepts the specific ChangesWhen... type and
returns the boolean to remove duplication between py_islower and py_isupper.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 5a62116c-bef4-4f8e-be79-b71517b9acbb

📥 Commits

Reviewing files that changed from the base of the PR and between 113ba92 and 1ae5efb.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (2)

crates/vm/src/anystr.rs
extra_tests/snippets/builtin_str.py

extra_tests/snippets/builtin_str.py

This fixes a regression mentioned by CodeRabbit. I also figured out how to check a string's case without allocation using Unicode properties. Thus, this commit removes `icu_casemap` again. `icu_casemap` and my old solution is required for a robust case check, but it seems like the current code is fine for Python.

ShaharNaveh

lgtm:)
my only concert is about the removal of those two methods from a public trait.

ShaharNaveh · 2026-04-13T21:56:04Z

extra_tests/snippets/builtin_str.py

+assert "\U0001f431 CAT".isupper()
+assert "\U0001f431 cat".islower()
+assert "\u0295".islower()
+assert "\u1c89".isupper()


If CPython doesn't have a test case that cover those edge cases, I think it will be worth trying to get it upstream

joshuamegnauth54 · 2026-04-13T23:52:53Z

Should I re-add the methods and mark them with #[allow(unused)]? CI failed so I wasn't sure what to do. 😆

youknowone

Thank you so much! They were removed because upper/lower case is not a trait of char, but trait of string, right?

coderabbitai bot reviewed Apr 12, 2026

View reviewed changes

joshuamegnauth54 force-pushed the more-case-fixes branch from 113ba92 to 1ae5efb Compare April 13, 2026 01:31

coderabbitai bot reviewed Apr 13, 2026

View reviewed changes

extra_tests/snippets/builtin_str.py Outdated Show resolved Hide resolved

joshuamegnauth54 force-pushed the more-case-fixes branch from 1ae5efb to e52405f Compare April 13, 2026 01:41

joshuamegnauth54 force-pushed the more-case-fixes branch from e52405f to 712bd75 Compare April 13, 2026 19:00

ShaharNaveh approved these changes Apr 13, 2026

View reviewed changes

ShaharNaveh requested a review from youknowone April 13, 2026 21:55

ShaharNaveh reviewed Apr 13, 2026

View reviewed changes

youknowone approved these changes Apr 14, 2026

View reviewed changes

youknowone merged commit 4bbabe5 into RustPython:main Apr 14, 2026
20 checks passed

Conversation

joshuamegnauth54 commented Apr 12, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

joshuamegnauth54 commented Apr 12, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ShaharNaveh left a comment

Choose a reason for hiding this comment

Uh oh!

ShaharNaveh Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

joshuamegnauth54 commented Apr 13, 2026

Uh oh!

youknowone left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

joshuamegnauth54 commented Apr 12, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Apr 12, 2026 •

edited

Loading

ShaharNaveh Apr 13, 2026 •

edited

Loading