Fix `isalpha`/`isalnum` and regex `\w` to use Unicode General Category instead of Alphabetic derived property by Copilot · Pull Request #7520 · RustPython/RustPython

Copilot · 2026-03-27T16:36:32Z

Rust's char::is_alphabetic()/char::is_alphanumeric() use the Unicode Alphabetic derived property, which is a superset of the Unicode letter categories and includes some non-spacing marks (Mn). CPython's str.isalpha() and str.isalnum() are defined strictly by General Category membership (letters: Lu/Ll/Lt/Lm/Lo; numbers: Nd/Nl/No), causing divergence for characters like U+0345 (COMBINING GREEK YPOGEGRAMMENI, category Mn).

import re

"\u0345".isalnum()   # RustPython: True  /  CPython: False
re.match(r"\w", "\u0345")  # RustPython: match  /  CPython: None
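
To make the General Category rule concrete, the following is a minimal Python sketch of the classification the description refers to, built on the standard `unicodedata` module. The helper names `is_alpha_by_category` and `is_alnum_by_category` are hypothetical and exist only for illustration; this is not the code in the PR, which is written in Rust.

```python
import unicodedata

# General Category groups named in the PR description.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}
NUMBER_CATEGORIES = {"Nd", "Nl", "No"}


def is_alpha_by_category(ch: str) -> bool:
    """Hypothetical helper: True if ch belongs to a letter category (L*)."""
    return unicodedata.category(ch) in LETTER_CATEGORIES


def is_alnum_by_category(ch: str) -> bool:
    """Hypothetical helper: True if ch belongs to a letter or number category."""
    return unicodedata.category(ch) in (LETTER_CATEGORIES | NUMBER_CATEGORIES)


ypogegrammeni = "\u0345"
print(unicodedata.name(ypogegrammeni))      # COMBINING GREEK YPOGEGRAMMENI
print(unicodedata.category(ypogegrammeni))  # Mn
print(is_alpha_by_category(ypogegrammeni))  # False
print(is_alnum_by_category(ypogegrammeni))  # False
```

Because U+0345 is in category Mn, it falls outside both sets, which is the result the description attributes to CPython.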

Changes

crates/vm/src/builtins/str.rs: Replace char::is_alphabetic() / char::is_alphanumeric() in isalpha() and isalnum() with explicit GeneralCategory checks (L* for isalpha; L* + Nd/Nl/No for isalnum).
crates/sre_engine/src/string.rs: Apply the same fix to is_uni_alnum() (used for \w in regex); remove the pre-existing // TODO: check with cpython note.
crates/sre_engine/Cargo.toml: Add unic-ucd-category workspace dependency.
extra_tests/snippets/builtin_str_unicode.py: Add regression assertions for U+0345 across isalpha, isalnum, and re.match(r"\w", ...).
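
One way to look for further divergences of this kind is to sweep a range of code points and compare `str.isalpha()` with a category check. The sketch below assumes that "letter" means any General Category starting with `L`, as the description states; the function name `alpha_mismatches` is hypothetical and is not part of the PR.

```python
import unicodedata


def alpha_mismatches(code_points):
    """Return code points where str.isalpha() disagrees with the L* category check."""
    mismatches = []
    for cp in code_points:
        ch = chr(cp)
        by_category = unicodedata.category(ch).startswith("L")
        if ch.isalpha() != by_category:
            mismatches.append(cp)
    return mismatches


# Sweep the Combining Diacritical Marks block, which contains U+0345.
# An interpreter that follows the category rule should report nothing here.
found = alpha_mismatches(range(0x0300, 0x0370))
print([f"U+{cp:04X}" for cp in found])
```

Running the same sweep under both interpreters and comparing the printed lists would show whether any code point in the chosen range is still classified differently.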

Original prompt

Problem Details

The regex module in RustPython is treating \w as matching more characters than CPython does. The disagreement stems from the isalnum() function, leading to incorrect matching behavior by \w. For example:

import re

assert not re.match(r"\w", "\u0345"), r"\w should not match U+0345 (category Mn)"

Expected Behavior

The regex pattern \w should match only characters for which isalnum() returns True, plus the underscore, as described in the Python documentation. Once RustPython matches CPython here, the assertion in the example passes without raising an error.

In CPython:

"\u0345".isalnum() returns False

Thus, \w in regex does not match \u0345
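
The relationship between \w and isalnum() can be checked directly. The sketch below compares the regex against a reference predicate for a handful of sample characters; `is_word_char` is a hypothetical helper written for this illustration, and the sample list is arbitrary.

```python
import re

WORD = re.compile(r"\w")


def is_word_char(ch: str) -> bool:
    """Hypothetical reference predicate: alphanumeric, or the underscore."""
    return ch.isalnum() or ch == "_"


# ASCII letter, digit, underscore, space, a Latin letter with acute, and U+0345.
samples = ["a", "5", "_", " ", "\u00e9", "\u0345"]
for ch in samples:
    matched = WORD.fullmatch(ch) is not None
    print(f"U+{ord(ch):04X} regex={matched} predicate={is_word_char(ch)}")
```

If the two columns agree for every sample, \w and isalnum() are consistent for those characters; a disagreement on U+0345 would point at the regex engine rather than the string method.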

Current Behavior

In RustPython, "\u0345".isalnum() returns True. Consequently, the regex pattern \w incorrectly matches \u0345 in violation of CPython standards.

Solution

Fix the implementation of isalnum() to align RustPython with CPython's behavior for Unicode character classification, ensuring regex patterns behave accordingly.

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: Create a pr to fix it

Assistant: A pull request has been initiated to address the issue with inspect.getsource returning incomplete source code. Let me know if further assistance is required!

User: Accepted Confirmation: Are you sure?

Assistant: The pull request to fix the issue with inspect.getsource has been initiated. The Copilot coding agent is now working on this. You can track the progress of this task in the repository. Let me know if you have further instructions!

User: Create a pr to fix this

Assistant: The pull request to resolve the Unicode character discrepancy in the regex module has been initiated, and the Copilot coding agent is actively working on it. Let me know if you need further updates!

This pull request was created from Copilot chat.

github-actions · 2026-03-27T18:53:48Z

extra_tests/snippets/builtin_str_unicode.py

+assert not "\u0345".isalnum(), "isalnum should not match Mn category characters"
+
+import re
+assert not re.match(r"\w", "\u0345"), r"\w should not match U+0345 (category Mn)"


[reviewdog-suggester] reported by reviewdog 🐶

Suggested change

assert not re.match(r"\w", "\u0345"), r"\w should not match U+0345 (category Mn)"

assert not re.match(r"\w", "\u0345"), r"\w should not match U+0345 (category Mn)"

Initial plan

2722bc0

Copilot AI assigned Copilot and youknowone Mar 27, 2026

Copilot started work on behalf of youknowone March 27, 2026 16:36 View session

Fix isalnum/isalpha to use Unicode general category checks; fix regex…

5dd88ee

… \\w for Mn characters Agent-Logs-Url: https://github.com/RustPython/RustPython/sessions/6f573a91-8811-486c-933d-7ba9a9067643 Co-authored-by: youknowone <[email protected]>

Copilot AI changed the title ~~[WIP] Fix isalnum function to align with CPython standards~~ Fix isalpha/isalnum and regex \w to use Unicode General Category instead of Alphabetic derived property Mar 27, 2026

Copilot finished work on behalf of youknowone March 27, 2026 16:59

Copilot AI requested a review from youknowone March 27, 2026 16:59

github-actions bot reviewed Mar 27, 2026

Copilot wants to merge 2 commits into main from copilot/fix-isalnum-function-behavior
