Skip to content

fix: Swapcase must handle multibyte expansions#7559

Merged
youknowone merged 1 commit intoRustPython:mainfrom
joshuamegnauth54:fix-swapcase-expansion
Apr 6, 2026
Merged

fix: Swapcase must handle multibyte expansions#7559
youknowone merged 1 commit intoRustPython:mainfrom
joshuamegnauth54:fix-swapcase-expansion

Conversation

@joshuamegnauth54
Copy link
Copy Markdown
Contributor

@joshuamegnauth54 joshuamegnauth54 commented Apr 4, 2026

swapcase used to_ascii_lowercase and uppercase to swap cases. This is fine for ASCII, but code points may expand into multiple bytes which leads to incorrect case swaps for some languages. The fix is to use to_lowercase directly.

Unfortunately, this leads to a realloc in swapcase when bytes are expanded.

Part of #7526.


Most of the disabled swapcase tests seem to pass now including the new assert based on #7526. However, there are other areas where to_ascii_lowercase and to_ascii_uppercase are used which seem to be causing the other cases to fail.

Summary by CodeRabbit

  • Bug Fixes
    • swapcase now correctly applies full Unicode case mappings, including characters that map to multiple code points.
  • Tests
    • Added a test validating Unicode swapcase behavior for non-ASCII text.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 27f14e23-f709-469f-a209-d3cb5579b4e2

📥 Commits

Reviewing files that changed from the base of the PR and between 0fc94a8 and 2e1a0f4.

📒 Files selected for processing (2)
  • crates/vm/src/builtins/str.rs
  • extra_tests/snippets/builtin_str.py
✅ Files skipped from review due to trivial changes (1)
  • extra_tests/snippets/builtin_str.py

📝 Walkthrough

Walkthrough

PyStr::swapcase now uses full Unicode case mappings (calls to_uppercase/to_lowercase and appends their results) instead of ASCII-only single-char conversions; a test was added asserting Unicode swapcase for "Σίσυφος".

Changes

Cohort / File(s) Summary
String builtin
crates/vm/src/builtins/str.rs
Replaced ASCII-only to_ascii_uppercase()/to_ascii_lowercase() + push_char with Unicode-aware to_uppercase()/to_lowercase() and extend to handle multi-code-point expansions in swapcase.
Tests — Unicode swapcase
extra_tests/snippets/builtin_str.py
Added assertion verifying "Σίσυφος".swapcase() returns "σΊΣΥΦΟΣ" to validate Unicode-aware behavior.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 I hopped through letters, soft and spry,

Swapped their shapes beneath the sky,
Sigma flipped with merry cheer,
Many bytes now singing clear,
A rabbit's nod — the change is nigh!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely describes the primary change: fixing swapcase to handle multibyte Unicode expansions rather than ASCII-only conversions.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 4, 2026

📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies:

[ ] test: cpython/Lib/test/test_str.py (TODO: 16)
[x] test: cpython/Lib/test/test_fstring.py (TODO: 19)
[x] test: cpython/Lib/test/test_string_literals.py (TODO: 4)

dependencies:

dependent tests: (no tests depend on str)

Legend:

  • [+] path exists in CPython
  • [x] up-to-date, [ ] outdated

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
crates/vm/src/builtins/str.rs (1)

1044-1059: LGTM! Correct fix for Unicode multi-character case expansions.

The change from to_ascii_uppercase()/to_ascii_lowercase() to to_uppercase()/to_lowercase() with extend() properly handles Unicode case mappings that expand to multiple characters (e.g., "ß" → "SS", or "ı" → "I"). The shrink_to_fit() call is appropriate since the initial capacity may over-allocate for potential expansions that don't always occur.

Minor terminology nit: the comment at line 1048 says "multiple bytes" but technically to_uppercase() returns an iterator because case changes may produce multiple characters (code points), not bytes. Each character is then encoded to its byte representation.

📝 Optional comment clarification
-            // to_uppercase returns an iterator because case changes may be multiple bytes
+            // to_uppercase returns an iterator because case changes may produce multiple characters

,

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/builtins/str.rs` around lines 1044 - 1059, Update the inline
comment inside the swapcase method to accurately describe why
to_uppercase()/to_lowercase() return an iterator: change "multiple bytes" to
"multiple characters (code points)" or similar; the correction should be made in
the comment immediately above the if-block that calls c.to_uppercase() and
c.to_lowercase() in the swapcase function (referencing swapcase, c_orig,
to_uppercase, to_lowercase).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@crates/vm/src/builtins/str.rs`:
- Around line 1044-1059: Update the inline comment inside the swapcase method to
accurately describe why to_uppercase()/to_lowercase() return an iterator: change
"multiple bytes" to "multiple characters (code points)" or similar; the
correction should be made in the comment immediately above the if-block that
calls c.to_uppercase() and c.to_lowercase() in the swapcase function
(referencing swapcase, c_orig, to_uppercase, to_lowercase).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: da2da0a5-1571-4e3b-97c0-44307ec105b2

📥 Commits

Reviewing files that changed from the base of the PR and between 87fc454 and d749d6f.

⛔ Files ignored due to path filters (1)
  • Lib/test/test_str.py is excluded by !Lib/**
📒 Files selected for processing (1)
  • crates/vm/src/builtins/str.rs

Copy link
Copy Markdown
Member

@youknowone youknowone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for contributing! welcome to RustPython project

swapped_str.push(c_orig);
}
}
swapped_str.shrink_to_fit();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this worth to do? I'd like to listen what was the rationale of this line.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rust's vector doubles in size for amortized pushes. Here is a link to Rust's source code as a citation.

So, my logic is that extend() may cause a large over allocation because the buffer in Wtf8 would double in size. If the buffer was not reallocated, shrink_to_fit() does nothing so there is no hit there.

The worst case scenario (I THINK) would be three allocations:

  • with_capacity(n) -> first allocation; capacity is n
  • extend() -> possible allocation for multiple chars; new capacity is n * 2
  • shrink_to_fit() -> realloc if the initial n changed but saves memory

The best case scenario is if the buffer size doesn't change.

I may have overthought this. 😅 But, that's my reasoning. It felt wrong to waste all of that extra memory when the original code allocated an exact sized buffer. 😆

Let me know if I should remove that line though. I'll fix that and the test later today. 😁

self.assertEqual('\u03a3'.swapcase(), '\u03c3')
self.assertEqual('ß'.swapcase(), 'SS')
self.assertEqual('\u1fd2'.swapcase(), '\u0399\u0308\u0300')
self.assertEqual("Σίσυφος".swapcase(), "σΊΣΥΦΟΣ")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't modify tests by CPython, but add them to extra_tests/snippets/builtin_str.py instead

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh my bad. I didn't know those were tests from CPython. I'll fix it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are interested in, probably this is worth to add CPython test case actually. We usually don't add code under Lib/test directly, but also merge commits from CPython often

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved it over to builtin_str.py and everything passes. 😁

I'll remove shrink_to_fit() if you two think it's a better idea too.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not 100% confident, but feel like shrink_to_fit will be overhead for normal inputs, while irregular input just takes double memory consumption. The other way is more important. Suppose a case when the capacity is 100k+1, but the result is 100k. It will reallocate and shrink only to reduce a single byte, which doesn't look worth to do.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. I removed shrink_to_fit() and force pushed.

@joshuamegnauth54 joshuamegnauth54 force-pushed the fix-swapcase-expansion branch from d749d6f to 0fc94a8 Compare April 5, 2026 19:01
`swapcase` used `to_ascii_lowercase` and uppercase to swap cases. This
is fine for ASCII, but code points may expand into multiple bytes which
leads to incorrect case swaps for some languages. The fix is to use
`to_lowercase` and `to_uppercase` instead.

Unfortunately, this leads to a realloc in `swapcase` when bytes are
expanded.

Part of RustPython#7526.
@joshuamegnauth54 joshuamegnauth54 force-pushed the fix-swapcase-expansion branch from 0fc94a8 to 2e1a0f4 Compare April 6, 2026 02:12
Copy link
Copy Markdown
Contributor

@ShaharNaveh ShaharNaveh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!
tysm:)

Copy link
Copy Markdown
Member

@youknowone youknowone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@youknowone youknowone merged commit e009cc0 into RustPython:main Apr 6, 2026
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants