Skip to content

fix: Handle char expansion in islower, isupper#7583

Merged
youknowone merged 2 commits intoRustPython:mainfrom
joshuamegnauth54:more-case-fixes
Apr 14, 2026
Merged

fix: Handle char expansion in islower, isupper#7583
youknowone merged 2 commits intoRustPython:mainfrom
joshuamegnauth54:more-case-fixes

Conversation

@joshuamegnauth54
Copy link
Copy Markdown
Contributor

@joshuamegnauth54 joshuamegnauth54 commented Apr 12, 2026

Closes: #7526

py_islower and py_isupper need to handle expansions for letter casing. Comparing chars directly can miss edge cases in certain languages. Unfortunately, like the last PR, this allocates to handle potential expansions.

I also had to add icu_casemap as a dependency.
RustPython is already using parts of icu4x so this doesn't add many transitive dependencies.

I figured out how to do it without allocation after working on CodeRabbit's suggestions. I also added more tests based on CodeRabbit which all pass. I cross checked the tests with CPython too. 🥳


The ICU crates use the Writeable trait internally which allows writing to a buffer. If you want to avoid the heap allocations, I can try wrapping a stack allocated buffer that falls back to heap allocations in my next patch. icu_casemap also handles title casing. It may be useful to drop unicode-casing in favor of icu_casemap. icu4x is the official Unicode consortium implementation, so it's robust and will be maintained indefinitely.

^ I wrote this comment earlier when I added icu_casemap. I still think it's relevant so I'll leave it. Currently, there are still title case bugs that may be easier to fix using icu_casemap over unicode-casing. I'll tackle that in another patch.

Summary by CodeRabbit

  • Bug Fixes

    • Improved Unicode case detection and classification so string case checks (isupper/islower/istitle) behave correctly for a wider range of international characters and mixed emoji/text sequences.
  • Tests

    • Expanded string tests to cover Unicode edge cases: modifier letters, title/swapcase behavior, emoji with text, and digit-only strings to ensure correct case-reporting.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 12, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 8eddd968-4df4-486e-a236-e44a5a95a4db

📥 Commits

Reviewing files that changed from the base of the PR and between e52405f and 712bd75.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • crates/vm/src/anystr.rs
  • crates/vm/src/builtins/str.rs
  • crates/vm/src/bytes_inner.rs
  • extra_tests/snippets/builtin_str.py
💤 Files with no reviewable changes (2)
  • crates/vm/src/bytes_inner.rs
  • crates/vm/src/builtins/str.rs
✅ Files skipped from review due to trivial changes (1)
  • extra_tests/snippets/builtin_str.py

📝 Walkthrough

Walkthrough

py_islower and py_isupper were rewritten to decode UTF‑8 and consult ICU CodePointSetData (Alphabetic, ChangesWhenLowercased, ChangesWhenUppercased) per codepoint; AnyChar’s is_lowercase/is_uppercase methods were removed and tests for Unicode casing/swapcase were extended.

Changes

Cohort / File(s) Summary
Unicode case logic
crates/vm/src/anystr.rs
Rewrote py_islower / py_isupper to iterate decoded UTF‑8 chars (from `as_bytes().utf8_chunks().map(
AnyChar trait impls
crates/vm/src/builtins/str.rs, crates/vm/src/bytes_inner.rs
Removed is_lowercase and is_uppercase implementations from AnyChar impls for char, CodePoint, ascii::AsciiChar, and u8. Kept bytes_len and other existing helpers.
Tests: Unicode casing
extra_tests/snippets/builtin_str.py
Added and extended assertions for islower()/isupper(), istitle(), and swapcase() covering specific Unicode code points, emojis, and digit-only strings to exercise new ICU-based case behavior.

Sequence Diagram(s)

(omitted)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • PR #7555: Migrates Unicode property handling to icu_properties / CodePointSetData, closely related to this PR's switch to ICU sets for casing logic.

Suggested reviewers

  • youknowone
  • ShaharNaveh

Poem

🐰 Hopping through bytes and Unicode light,
ICU maps guide each lowercase fight,
I peek at each codepoint, swift and spry,
Rejecting odd changes, letting alphabets fly,
A tiny swapcase dance beneath the sky.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately describes the main change: updating islower and isupper to handle character expansion in Unicode case mappings.
Linked Issues check ✅ Passed The PR addresses the core requirement from #7526 by implementing proper Unicode case mapping handling for islower and isupper using ICU CodePointSetData to handle character expansion.
Out of Scope Changes check ✅ Passed All changes are directly related to fixing islower/isupper Unicode handling: updating implementations, removing unused AnyChar trait methods, and adding test coverage for the fixes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@joshuamegnauth54
Copy link
Copy Markdown
Contributor Author

Part of CI is failing because of two newly unused functions in the AnyChar trait. Should I just remove those functions?

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/vm/src/anystr.rs (1)

412-448: ⚠️ Potential issue | 🔴 Critical

Preserve the “at least one cased character” requirement.

lower / upper is set to true for every successfully compared chunk, so uncased strings now regress to true here. For example, "123".islower() and "🙂".isupper() would both pass this implementation even though CPython returns False. Please derive the flag from actual casedness (for example, by checking that the opposite-case mapping changes the chunk) instead of from chunk presence. A small negative regression test for an all-uncased string would also help.

🐛 Suggested fix
 fn py_islower(&self) -> bool {
-    let mut lower = false;
+    let mut cased = false;
     let mut lowercased = String::with_capacity(self.bytes_len());
+    let mut uppercased = String::with_capacity(self.bytes_len());
     let cm = CaseMapper::new();
     for chunk in self.as_bytes().utf8_chunks().map(|c| c.valid()) {
-        let writer = cm.lowercase(chunk, &LanguageIdentifier::UNKNOWN);
         lowercased.clear();
-        writer
+        cm.lowercase(chunk, &LanguageIdentifier::UNKNOWN)
             .write_to(&mut lowercased)
             .expect("Writing to a buffer is infallible");
         if chunk != lowercased {
             return false;
         }
-        lower = true;
+        uppercased.clear();
+        cm.uppercase(chunk, &LanguageIdentifier::UNKNOWN)
+            .write_to(&mut uppercased)
+            .expect("Writing to a buffer is infallible");
+        cased |= chunk != uppercased;
     }
-    lower
+    cased
 }
 
 fn py_isupper(&self) -> bool {
-    let mut upper = false;
+    let mut cased = false;
     let mut uppercased = String::with_capacity(self.bytes_len());
+    let mut lowercased = String::with_capacity(self.bytes_len());
     let cm = CaseMapper::new();
     for chunk in self.as_bytes().utf8_chunks().map(|c| c.valid()) {
-        let writer = cm.uppercase(chunk, &LanguageIdentifier::UNKNOWN);
         uppercased.clear();
-        writer
+        cm.uppercase(chunk, &LanguageIdentifier::UNKNOWN)
             .write_to(&mut uppercased)
             .expect("Writing to a buffer is infallible");
         if chunk != uppercased {
             return false;
         }
-        upper = true;
+        lowercased.clear();
+        cm.lowercase(chunk, &LanguageIdentifier::UNKNOWN)
+            .write_to(&mut lowercased)
+            .expect("Writing to a buffer is infallible");
+        cased |= chunk != lowercased;
     }
-    upper
+    cased
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/anystr.rs` around lines 412 - 448, The current py_islower and
py_isupper set lower/upper true for every chunk processed, causing uncased-only
strings to incorrectly return true; change the flag logic so you only set lower
to true when the chunk is actually cased (e.g., when the uppercase mapping of
the chunk differs from the chunk) and only set upper to true when the lowercase
mapping differs, while preserving the existing equality checks that reject
mixed-case chunks; update py_islower/py_isupper (and use CaseMapper, lowercased,
uppercased, cm) accordingly and add a small negative regression test asserting
that "123" and "🙂" return false for islower/isupper.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@crates/vm/src/anystr.rs`:
- Around line 412-448: The current py_islower and py_isupper set lower/upper
true for every chunk processed, causing uncased-only strings to incorrectly
return true; change the flag logic so you only set lower to true when the chunk
is actually cased (e.g., when the uppercase mapping of the chunk differs from
the chunk) and only set upper to true when the lowercase mapping differs, while
preserving the existing equality checks that reject mixed-case chunks; update
py_islower/py_isupper (and use CaseMapper, lowercased, uppercased, cm)
accordingly and add a small negative regression test asserting that "123" and
"🙂" return false for islower/isupper.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 043c5b7b-8009-434d-b53b-284d97f84ab6

📥 Commits

Reviewing files that changed from the base of the PR and between d201c48 and 113ba92.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • Cargo.toml
  • crates/vm/Cargo.toml
  • crates/vm/src/anystr.rs
  • extra_tests/snippets/builtin_str.py

Closes: RustPython#7526

`py_islower` and `py_isupper` need to handle expansions for letter
casing. Comparing chars directly can miss edge cases in certain
languages. Unfortunately, like the last PR, this allocates to handle
potential expansions.

I also had to add `icu_casemap` as a dependency.
RustPython is already using parts of icu4x so this doesn't add many
transitive dependencies.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
crates/vm/src/anystr.rs (1)

413-446: Consider consolidating the duplicate iteration over characters.

The logic is correct and aligns with Python's case-checking semantics. However, each loop iterates over chunk.chars() twice—once for case_change.contains() and once for alphabetic.contains(). This could be consolidated into a single pass for better efficiency:

♻️ Optional single-pass optimization
 fn py_islower(&self) -> bool {
     let case_change = CodePointSetData::new::<ChangesWhenLowercased>();
     let alphabetic = CodePointSetData::new::<Alphabetic>();
     let mut lower = false;
     for chunk in self.as_bytes().utf8_chunks().map(|c| c.valid()) {
-        if chunk.chars().any(|c| case_change.contains(c)) {
-            return false;
-        }
-
-        if !lower && chunk.chars().any(|c| alphabetic.contains(c)) {
-            lower = true;
+        for c in chunk.chars() {
+            if case_change.contains(c) {
+                return false;
+            }
+            if !lower && alphabetic.contains(c) {
+                lower = true;
+            }
         }
     }
     lower
 }

Apply similarly to py_isupper.

Additionally, per coding guidelines, both py_islower and py_isupper share identical structure differing only in the CodePointSetData type parameter. You could extract a helper to reduce duplication, though this is optional given the current simplicity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/anystr.rs` around lines 413 - 446, The two methods py_islower
and py_isupper currently iterate chunk.chars() twice per chunk; change each to a
single pass over chunk.chars() where you check both
CodePointSetData::new::<ChangesWhenLowercased>() (or ChangesWhenUppercased) and
CodePointSetData::new::<Alphabetic>() in the same loop so you can early-return
on a case_change match and set the lower/upper flag when you see an alphabetic
char, avoiding the duplicate any() calls; optionally extract the common loop
pattern into a small helper that accepts the specific ChangesWhen... type and
returns the boolean to remove duplication between py_islower and py_isupper.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@extra_tests/snippets/builtin_str.py`:
- Around line 227-230: Remove the unnecessary trailing semicolons at the end of
the assert statements that use the cat emoji string checks (e.g., the four
asserts calling isupper()/islower() on "\U0001f431" and combined strings);
simply delete the ";" characters so each assert is a proper Python statement
(e.g., change "assert not \"\U0001f431\".isupper();" to "assert not
\"\U0001f431\".isupper()").

---

Nitpick comments:
In `@crates/vm/src/anystr.rs`:
- Around line 413-446: The two methods py_islower and py_isupper currently
iterate chunk.chars() twice per chunk; change each to a single pass over
chunk.chars() where you check both
CodePointSetData::new::<ChangesWhenLowercased>() (or ChangesWhenUppercased) and
CodePointSetData::new::<Alphabetic>() in the same loop so you can early-return
on a case_change match and set the lower/upper flag when you see an alphabetic
char, avoiding the duplicate any() calls; optionally extract the common loop
pattern into a small helper that accepts the specific ChangesWhen... type and
returns the boolean to remove duplication between py_islower and py_isupper.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 5a62116c-bef4-4f8e-be79-b71517b9acbb

📥 Commits

Reviewing files that changed from the base of the PR and between 113ba92 and 1ae5efb.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • crates/vm/src/anystr.rs
  • extra_tests/snippets/builtin_str.py

This fixes a regression mentioned by CodeRabbit. I also figured out how
to check a string's case without allocation using Unicode properties.
Thus, this commit removes `icu_casemap` again. `icu_casemap` and my old
solution is required for a robust case check, but it seems like the
current code is fine for Python.
Copy link
Copy Markdown
Contributor

@ShaharNaveh ShaharNaveh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm:)
my only concert is about the removal of those two methods from a public trait.

@ShaharNaveh ShaharNaveh requested a review from youknowone April 13, 2026 21:55
assert "\U0001f431 CAT".isupper()
assert "\U0001f431 cat".islower()
assert "\u0295".islower()
assert "\u1c89".isupper()
Copy link
Copy Markdown
Contributor

@ShaharNaveh ShaharNaveh Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If CPython doesn't have a test case that cover those edge cases, I think it will be worth trying to get it upstream

@joshuamegnauth54
Copy link
Copy Markdown
Contributor Author

Should I re-add the methods and mark them with #[allow(unused)]? CI failed so I wasn't sure what to do. 😆

Copy link
Copy Markdown
Member

@youknowone youknowone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much! They were removed because upper/lower case is not a trait of char, but trait of string, right?

@youknowone youknowone merged commit 4bbabe5 into RustPython:main Apr 14, 2026
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Disagreement with CPython for str.islower/isupper/swapcase

3 participants