Fix isalpha/isalnum and regex \w to use Unicode General Category instead of Alphabetic derived property#7520

Draft
Copilot wants to merge 2 commits into main from copilot/fix-isalnum-function-behavior


Copilot AI commented Mar 27, 2026

Rust's char::is_alphabetic() and char::is_alphanumeric() use the Unicode Alphabetic derived property, which is a superset of the Unicode letter categories and includes some non-spacing marks (Mn). CPython's str.isalpha() accepts only General Category letters (Lu/Ll/Lt/Lm/Lo), and str.isalnum() adds only numeric characters (treated in this PR as Nd/Nl/No). The two implementations therefore diverge for characters like U+0345 (COMBINING GREEK YPOGEGRAMMENI, category Mn).

"\u0345".isalnum()   # RustPython: True  /  CPython: False
re.match(r"\w", "\u0345")  # RustPython: match  /  CPython: None
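
The category rule described above can be sketched in Python with the standard unicodedata module. This is only an illustration of the classification, not the Rust code in this PR; the helper names is_alpha_by_category and is_alnum_by_category are hypothetical.

```python
import unicodedata

# General Category groups named in the PR description.
LETTER_CATEGORIES = frozenset({"Lu", "Ll", "Lt", "Lm", "Lo"})
NUMBER_CATEGORIES = frozenset({"Nd", "Nl", "No"})


def is_alpha_by_category(ch: str) -> bool:
    """Hypothetical helper: True if ch is a General Category letter."""
    return unicodedata.category(ch) in LETTER_CATEGORIES


def is_alnum_by_category(ch: str) -> bool:
    """Hypothetical helper: True if ch is a letter or a number by category."""
    return unicodedata.category(ch) in (LETTER_CATEGORIES | NUMBER_CATEGORIES)


# Print the category and the category-based verdict for a few characters.
for ch in ("a", "5", "\u0345"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_alnum_by_category(ch))
```

U+0345 is reported as Mn, so neither helper accepts it, which is the result the PR aims for.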

Changes

  • crates/vm/src/builtins/str.rs: Replace char::is_alphabetic() / char::is_alphanumeric() in isalpha() and isalnum() with explicit GeneralCategory checks (L* for isalpha; L* + Nd/Nl/No for isalnum).
  • crates/sre_engine/src/string.rs: Apply the same fix to is_uni_alnum() (used for \w in regex); remove the pre-existing // TODO: check with cpython note.
  • crates/sre_engine/Cargo.toml: Add unic-ucd-category workspace dependency.
  • extra_tests/snippets/builtin_str_unicode.py: Add regression assertions for U+0345 across isalpha, isalnum, and re.match(r"\w", ...).
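
Because the regex change routes through the same alphanumeric check, one quick sanity test is that \w and str.isalnum() agree with each other. The sketch below assumes the documented CPython rule for str patterns (alphanumeric characters plus the underscore); is_word_char, SAMPLES, and find_word_mismatches are hypothetical names, not part of the PR.

```python
import re


def is_word_char(ch: str) -> bool:
    """Hypothetical helper: word-character rule for str patterns (alnum or underscore)."""
    return ch == "_" or ch.isalnum()


# A few hand-picked characters, including the one from this PR.
SAMPLES = ["a", "Z", "5", "_", " ", "-", "\u0345"]


def find_word_mismatches(samples):
    """Return the samples where the regex engine and is_word_char disagree."""
    return [ch for ch in samples if is_word_char(ch) != bool(re.fullmatch(r"\w", ch))]


print(find_word_mismatches(SAMPLES))
```

Note that this checks only internal consistency between the two code paths. Whether each path is correct for U+0345 is what the regression assertions in builtin_str_unicode.py cover.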
Original prompt

Problem Details

The regex module in RustPython is treating \w as matching more characters than CPython does. The disagreement stems from the isalnum() function, leading to incorrect matching behavior by \w. For example:

import re

assert not re.match(r"\w", "\u0345"), r"\w should not match U+0345 (category Mn)"

Expected Behavior

The regex pattern \w should match only the underscore and characters for which isalnum() returns True, as described in the Python documentation. The assertion in the example should pass without raising an AssertionError when the behavior matches CPython.

In CPython:

  • "\u0345".isalnum() returns False
  • Thus, \w in regex does not match \u0345

Current Behavior

In RustPython, "\u0345".isalnum() returns True. Consequently, the regex pattern \w incorrectly matches \u0345 in violation of CPython standards.

Solution

Fix the implementation of isalnum() to align RustPython with CPython's behavior for Unicode character classification, ensuring regex patterns behave accordingly.
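
To look for other characters affected by the same discrepancy, one option is to sweep code points and compare the interpreter's str.isalnum() against a category-based rule. This is a rough sketch under the assumption that letters and numbers by General Category are the intended target; alnum_divergences is a hypothetical helper. CPython documents isalnum() in terms of isalpha(), isdecimal(), isdigit(), and isnumeric(), so any differences reported there would reflect that definition rather than a bug.

```python
import sys
import unicodedata

# Letter and number categories used by the category-based rule.
ALNUM_CATEGORIES = frozenset({"Lu", "Ll", "Lt", "Lm", "Lo", "Nd", "Nl", "No"})


def alnum_divergences(limit: int = sys.maxunicode + 1):
    """Yield (code point, category) where str.isalnum() disagrees with the category rule."""
    for cp in range(limit):
        ch = chr(cp)
        category = unicodedata.category(ch)
        if ch.isalnum() != (category in ALNUM_CATEGORIES):
            yield cp, category


if __name__ == "__main__":
    # Full sweep; pass a smaller limit to inspect only a prefix of the code space.
    for cp, category in alnum_divergences():
        print(f"U+{cp:04X} {category}")
```

Running the same script under both interpreters and diffing the output would show where they still disagree.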

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: Create a pr to fix it

Assistant: A pull request has been initiated to address the issue with inspect.getsource returning incomplete source code. Let me know if further assistance is required!

User: Accepted Confirmation: Are you sure?

Assistant: The pull request to fix the issue with inspect.getsource has been initiated. The Copilot coding agent is now working on this. You can track the progress of this task in the repository. Let me know if you have further instructions!

User: Create a pr to fix this

Assistant: The pull request to resolve the Unicode character discrepancy in the regex module has been initiated, and the Copilot coding agent is actively working on it. Let me know if you need further updates!

This pull request was created from Copilot chat.



Copilot AI changed the title from "[WIP] Fix isalnum function to align with CPython standards" to "Fix isalpha/isalnum and regex \w to use Unicode General Category instead of Alphabetic derived property" on Mar 27, 2026
Copilot AI requested a review from youknowone March 27, 2026 16:59
assert not "\u0345".isalnum(), "isalnum should not match Mn category characters"

import re
assert not re.match(r"\w", "\u0345"), r"\w should not match U+0345 (category Mn)"

[reviewdog-suggester] reported by reviewdog 🐶

Suggested change
assert not re.match(r"\w", "\u0345"), r"\w should not match U+0345 (category Mn)"
assert not re.match(r"\w", "\u0345"), r"\w should not match U+0345 (category Mn)"

