Skip to content

Extract shared rustpython-unicode crate and route core Unicode semantics through it#7561

Draft
Copilot wants to merge 7 commits intomainfrom
copilot/add-shared-unicode-crate
Draft

Extract shared rustpython-unicode crate and route core Unicode semantics through it#7561
Copilot wants to merge 7 commits intomainfrom
copilot/add-shared-unicode-crate

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 5, 2026

This change introduces a shared rustpython-unicode crate to own CPython-aligned Unicode semantics and data access, instead of keeping that logic spread across str, regex, literal escaping, and unicodedata. The goal is to make Unicode behavior reusable across RustPython components while preserving higher-level string behavior in existing runtime crates.

  • New shared Unicode crate

    • Adds crates/unicode with low-level, reusable APIs for:
      • character classification
      • identifier predicates
      • regex Unicode predicates
      • case mapping / casefold helpers
      • normalization
      • unicodedata-style database lookups
    • Keeps the API codepoint-oriented where needed and compatible with no_std consumers.
  • Refactor core runtime call sites

    • rustpython-vm now uses rustpython-unicode for:
      • str.isdecimal()
      • str.isspace()
      • str.isprintable()
      • str.isidentifier()
      • WTF-8 lowercase / uppercase / casefold helpers
    • rustpython-literal now uses the shared printable classification for repr/escape behavior.
    • rustpython-sre_engine now routes its Unicode character-class and case helpers through the shared crate.
  • Thin unicodedata integration

    • crates/stdlib/src/unicodedata.rs now delegates reusable Unicode database and normalization operations to rustpython-unicode.
    • This keeps module behavior in place while moving the underlying data/semantics into the shared layer.
  • Focused coverage

    • Adds unit coverage in rustpython-unicode for classification, identifiers, regex predicates, case mapping, normalization, and Unicode database access.
    • Adds a RustPython snippet covering integration across str, re, and unicodedata.

Example of the new shared API shape:

use rustpython_unicode::{classify, identifier, normalize::NormalizeForm, normalize};

assert!(classify::is_decimal('५' as u32));
assert!(identifier::is_python_identifier("유니코드"));

let text = rustpython_wtf8::Wtf8Buf::from("e\u{301}");
let normalized = normalize::normalize(NormalizeForm::Nfc, &text);
assert_eq!(normalized, rustpython_wtf8::Wtf8Buf::from("é"));

Copilot AI linked an issue Apr 5, 2026 that may be closed by this pull request
Copilot AI changed the title [WIP] Add rustpython-unicode shared crate for CPython compatibility Extract shared rustpython-unicode crate and route core Unicode semantics through it Apr 5, 2026
Copilot AI requested a review from youknowone April 5, 2026 07:50
Copy link
Copy Markdown
Member

@youknowone youknowone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot check CI failures, comments and also missing refactoring about rustpython_common::encodings and rustpython_codegen::string_parser

Nfkd,
}

impl From<NormalizeForm> for rustpython_unicode::NormalizeForm {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

local NormalizeForm is expected to be replaced by rustpython_unicode::NormalizeForm

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced the remaining local wrapper path in unicodedata and now parse directly into rustpython_unicode::NormalizeForm in 2934897.

[features]
default = ["std", "casefold"]
std = []
casefold = ["std", "dep:caseless"]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this feature? when will the feature be disabled?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kept the std feature because rustpython-unicode is consumed by no_std crates, while casefold depends on caseless for std-enabled consumers. The feature wiring was clarified in 2934897.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I expect is the entire Ucd type is placed in the new unicode crate, without Python runtime.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the non-runtime Ucd implementation into rustpython-unicode and left crates/stdlib/src/unicodedata.rs as a thin Python wrapper in 2934897.

Copy link
Copy Markdown
Contributor Author

Copilot AI commented Apr 5, 2026

@copilot check CI failures, comments and also missing refactoring about rustpython_common::encodings and rustpython_codegen::string_parser

Checked the PR CI/check runs and the earlier branch run showed a stale lint failure while the newer run was otherwise succeeding/in progress. I addressed the review items and the missing rustpython_common::encodings / rustpython_codegen::string_parser refactors in 2934897.

Copilot AI requested a review from youknowone April 5, 2026 08:55
Copy link
Copy Markdown
Member

@youknowone youknowone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot check CI failures and reviews

@@ -10,17 +8,5 @@ use icu_properties::props::{EnumeratedProperty, GeneralCategory};
/// * Zp Separator, Paragraph ('\u2029', PARAGRAPH SEPARATOR)
/// * Zs (Separator, Space) other than ASCII space('\x20').
pub fn is_printable(c: char) -> bool {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still need this function? why not directly calling is_repr_printable?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the redundant wrapper and now call rustpython_unicode::classify::is_repr_printable directly from escape.rs in 4efa5da.

}

#[inline]
pub(crate) fn is_word(ch: u32) -> bool {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

importing rustpython_unicode::regex as unicode_regex will remove needs for all this helpers

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched the SRE engine over to a direct rustpython_unicode::regex import and removed the pass-through helpers in 4efa5da.

Copy link
Copy Markdown
Contributor Author

Copilot AI commented Apr 6, 2026

@copilot check CI failures and reviews

Checked the current CI and reviews. I addressed the redundant literal/sre_engine wrappers and the cargo shear dependency failure in 4efa5da. After that cleanup, cargo shear, cargo check, clippy, and parallel_validation pass; the focused CPython failures still reproduce as test_re and test_pkgutil.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • docs.rs
    • Triggering command: /home/REDACTED/work/_temp/ghcca-node/node/bin/node /home/REDACTED/work/_temp/ghcca-node/node/bin/node --enable-source-maps /home/REDACTED/work/_temp/copilot-developer-action-main/dist/index.js (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Copilot AI requested a review from youknowone April 6, 2026 02:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

rustpython-unicode

2 participants