Remove spaces from newlines between CJK characters by isuffix · Pull Request #7350 · typst/typst

isuffix · 2025-11-10T22:47:04Z

This PR makes spaces due to newlines between CJK characters collapse to avoid creating a space when rendered. I've done so by splitting the Space syntax kind into two kinds to determine whether a space had a newline, and then by modifying the space-collapsing algorithm during Typst's realization step.

This is the behavior I mentioned in #792 (comment) (I have since changed my username from wrzian to isuffix).

Closes #792

Besides the below tradeoffs, I also have a few TODO comments that I would like input on.

Tradeoffs

This is a robust solution, but it makes multiple choices with tradeoffs that I'm not sure are desireable. I do not speak any CJK languages, so I'd appreciate feedback from the community about what is/isn't desired :)
CC @peng1999, @YDX-2147483647, @account-login

The main alternative design would be to solely resolve this in the parser or the AST, like YDX mentions in #792 (comment). That may improve or harm any of the tradeoffs below based on your opinion.

Each of the tradeoffs is exemplified in one or more new test cases:

Tradeoff: Collapsing happens during realization

Because collapsing happens after realization, a single newline space in the document may or may not collapse if its neighbors evaluate to CJK/non-CJK text dynamically.

Additionally, since a space element can itself be stored in a variable, static CJK characters can have different spacing due to stored space variables.

This is obviously one of the more contentious tradeoffs, but I think it's mostly fine. The first case is reasonable, and I doubt the second case is likely to affect real documents. But I am ok with changing either.

Test and output: space-collapsing-cjk-dynamic

--- space-collapsing-cjk-dynamic ---
// Test cjk space collapsing with dynamic variables.
#let foo = [水果] // collapses
#foo
#foo

#let foo = [fruit] // doesn't collapse
#foo
#foo

#let one-newline = [
]
#let no-newline = [ ]
啊#one-newline;啊 // collapses

啊#no-newline;啊 // doesn't collapse

Tradeoff: Normal spaces collapse only if they are adjacent to a newline space

Spaces that are not from newlines are kept, as in 空格 (see the test case dropdown), but when adjacent to a newline space they will collapse.

While the basic behavior of collapsing a newline space or a newline space followed by a comment are straightforward, it's less clear whether a normal space followed by a comment and a newline space should combine as one space and collapse together, or if they should act separately. In addition, it's unclear if spaces with different styling should be able to combine and collapse together.

Currently, all three of the cases in this codeblock will combine and collapse their spaces. For me, I think the first is good, and the second is probably desired, but I'm less sure about the third case. (these are rendered in the dropdown below)

注//
释

空格 //
注释

*空格 *
换行

Tests and output: issue-792-space-collapsing-cjk and space-collapsing-cjk-strong

--- issue-792-space-collapsing-cjk ---
// Test how spaces with/without newlines collapse in CJK text.

// No space from just a newline/comments
换
行

注//
释

多行/*
*/注释

// Should have a space from a space character
空 格

// With both a space and a newline it still collapses
空格 //
注释

--- space-collapsing-cjk-strong ---
// Test cjk space collapsing with strong emphasis.
*空 *格

*换*
行

// This space still collapses because it is followed by a newline space.
*空格 *
换行

Tradeoff: Treating space values as equal

This is the one I'm least certain about, and would be improved by ignoring spaces in the parser/AST.

There are a few other test cases (list-indent-trivia-nesting, list-indent-bracket-nesting) that implicitly expect space elements to be equal to each other regardless of newlines. I feel like I also generally expect this behavior (I wrote those tetsts), so breaking them feels odd. I added a custom PartialEq implementation to SpaceElem that always returns true to make this work. However, I'm not sure if this is sound with the way Comemo caches data.

I'm totally ok with removing this if we stay with space-collapsing during realization, but it will require modifying the other test cases if we do, and we would probably also want to modify the repr of SpaceElem.

Test: space-eq-newline

--- space-eq-newline ---
// Test whether spaces with/without newlines compare equal.
#let parbreak = [

]
#let one-newline = [
]
#let no-newline = [ ]
// parbreak is not equal
#assert.ne(one-newline, parbreak)
// spaces are equal despite newlines
// TODO: Would this break comemo?
#assert.eq(one-newline, no-newline)

YDX-2147483647 · 2025-11-11T01:08:56Z

Thank you for implementing this!

As for space-collapsing-cjk-strong (*空格 *), I don't think it's very important, because I guess no one would write it like this.
As for other aspects, I can take a closer look when I have time.

Besides, the cjk-unbreak package has accumulated a few test cases in https://github.com/KZNS/cjk-unbreak/blob/2ea9b0ce3654ab537116499f63aa4077165192bc/test.typ. They might be relevant.

Update on 2025-12-17: The cjk-spacer package published recently is also relevant.
The two packages are developed by Chinese and Japanese people respectively. I guess there might be subtle differences, although I haven't compared them carefully yet.

YDX-2147483647 · 2025-11-11T01:26:44Z

crates/typst-syntax/src/lexer.rs

+pub fn is_cjk(c: char) -> bool {
+    matches!(
+        c.script(),
+        Script::Han | Script::Hiragana | Script::Katakana | Script::Hangul


I suggest we also include punctuation marks, for the following use case.

Typst source:

我就站住，豫备她来讨钱。 “你回来了？” 她先这样问。

Result:

我就站住，豫备她来讨钱。“你回来了？”她先这样问。

The list of CJK punctuation marks can be found in clreq and jlreq. (I don't know if K should be included here.)

This might be more complicated than it seems to be. For example, the following three characters are widely used in Chinese documents, but their categories are different. See regex.pdf for further comparisons.

U+3001 、 顿号 (secondary comma) matches \p{Script_Extensions=Han}.

U+FF0C ， 逗号 (regular full-width comma) matches \p{Script_Extensions=Common}.

U+201C “ 上双引号 (left double quotation mark) matches \p{Script_Extensions=Common}, and it is also used in Latin documents.

They all match \p{General_Category=Punctuation} and \p{Script=Common}.

For Korean, they should be included, along with its half-width counter parts of the Latin punctuations where possible---both of which are used.

Please don't include Hangul here. Modern Korean uses spaces to separate words, so it's not relavant here. The function is better named as is_cj.

I used 'CJK' in the title of #792, that was a mistake. This issue only affects Chinese and Japanese.

Thank you for the links, they're very helpful!

I can change to just CJ removing Hangul. I'll plan to move the function out of the lexer and leave the lexer behavior untouched (unless that should change too?)

However, I'd appreciate more thoughts on what to do for punctuation. It seems we have two categories of codepoints: non-ambiguous CJ punctuation and ambiguous CJ punctuation, such as left/right quotes. For non-ambiguous, I guess we can just treat them like all other CJ codepoints for collapsing, but I'm less sure what to do for ambiguous punctuation.

I presume it would be a good behavior for ambiguous punctuation followed by a CJ character (or vice-versa) to collapse a newline space, but what about an ambiguous punctuation next to another ambiguous punctuation? Are there any characters that would be likely to be split across lines? Should we look at the text language to determine this?

Also, some of the non-ambiguous CJ punctuation overlap in usage with Hangul. Should we be taking that into account?

This is also relevant for #5858, which has some related discussion around quotation marks.

Another approach would be to just set the space collapsing behavior based on the text language, or an explicit property of say, par() (similar to the request in #710). This is more coarse-grained, but would simplify many of these considerations. Another tradeoff 😮‍💨

@YDX-2147483647 According to https://www.unicode.org/Public/17.0.0/ucd/EastAsianWidth.txt, The EAW of U+17A4 is N, not W or F. Did you confuse it with something different?

The Script property of U+115F is Hangul and EAW of it is W.

https://www.unicode.org/Public/17.0.0/ucd/Scripts.txt

The EAW of U+17A4 is N, not W or F. Did you confuse it with something different?

Hi! I didn't make it clear. What I mean is as follows.

c.width() uses a complex rule to determine the width, and EAW is one of the factors.

The full rule is documented on https://docs.rs/unicode-width, and it says that U+17A4 and U+115F will give width 2. (For these two specific characters, EAW does not contribute to c.width().)

Therefore, I don't think c.width() == Some(2) is a good criterion.

I agree with your opinion that unicode_with is not suitable for determining whether the character is CJ(K).
We should use the EAW property directly. Also, U+FF61 HALFWIDTH IDEOGRAPHIC FULL STOP is a (legacy) Japanese character but whose Script is not Katakana but Common and whose EAW is H. All non-Hangul/Korean characters whose EAW is H must be treated as Chinese/Japanese.

If we have issues with determining whether characters are Korean, we should test their test cases in FIrefox and report them in https://bugzilla.mozilla.org/ (its bug tracker):

<div lang=ja> Test Case Here </div>

Removing a newline around a punctuation is enabled only in Chinese and Japanese.

tats-u · 2025-11-28T05:10:31Z

biomejs/biome#7304 (comment)

Note: I personally prefer the Emoji_Presentation property to the Emoji property

isuffix · 2025-12-01T02:20:20Z

@tats-u thanks! The link to the CSS WG Draft is also helpful.

tats-u · 2025-12-01T12:17:57Z

The link to the w3c/csswg-drafts#5086 is also helpful.

It should be just a link to tracker. No concrete specification is stipulated in CSS WG Draft (https://drafts.csswg.org/css-text-4/#line-break-transform) now.

Prettier's issue (fixed):

Prose wrap options for Korean prettier/prettier#6516

"K" should be excluded from the title, too.

FYI, the following JS expression returns [], which means that no "Korean-dedicated punctuation chracters (whose Script is Hangul and General Category is Punctuation)" exists:

Iterator.from((function*() {for (let i = 0; i <= 0x10ffff; i++) yield i;})()).filter(cp => /[\p{P}&&\p{sc=Hang}]/v.test(String.fromCodePoint(cp))).toArray()

isuffix · 2025-12-20T18:30:33Z

I rebased off main and have refactored the space collapsing algorithm quite a bit. It now both depends on #7609 and reifies the kinds of actions that the algorithm takes in a new enum. This shouldn't change any of this PR's high-level behavior (we still discard newline spaces if either side is a space-discarding character), but the new organization really helps me keep the full algorithm in my head.

Additionally, I've updated the space-discarding character set to include common-script characters only if their East Asian Width is F/W/H but they are not emoji (although I still anticipate that this can be improved). This is aided by the new test newline-space-discarding-edge-cases which lets us check characters at a very fine grained level.

To determine the East Asian Width, I've moved to using the icu_properties crate which seems to be the highest quality source of the property in Rust right now. I've also moved the Script and Emoji checks to use icu_properties for consistency. However, we are still inexplicably using version 1.4 while icu is on 2.0 since May (now 2.1), so I'd appreciate suggestions as to whether it should be updated (it does have breaking changes). If we did want to stick to crates from unicode-rs, I believe we could ask the maintainer to add the EAW to the unicode-properties crate. However, it was very annoying that I could not find any real guidance online as to whether we should prefer icu vs. unicode-rs.

isuffix · 2025-12-20T18:36:13Z

crates/typst-realize/src/spaces.rs

+/// Whether a character is part of the space-discarding set for Typst. These
+/// characters discard adjacent spaces caused by newlines and allow Chinese and
+/// Japanese text to be broken across lines in markup without producing spaces.
+///
+/// Currently this checks if the character is in either the Chinese or Japanese
+/// scripts, or it is Common script (mainly punctuation) and has a defined East
+/// Asian Width property of H/F/W and is not an Emoji.
+pub(crate) fn is_space_discarding(c: char) -> bool {
+    // TODO: Load ICU sets/maps from typst-assets or use data from a different
+    // crate altogether. I assume there are still more changes to make, so
+    // leaving as-is for now.
+    const SCRIPT_DATA: CodePointMapDataBorrowed<'static, Script> =
+        icu_properties::maps::script();
+    const EAW_DATA: CodePointMapDataBorrowed<'static, EastAsianWidth> =
+        icu_properties::maps::east_asian_width();
+    const EMOJI_DATA: CodePointSetDataBorrowed<'static> = icu_properties::sets::emoji();
+
+    match SCRIPT_DATA.get(c) {
+        Script::Han | Script::Hiragana | Script::Katakana => true,
+        Script::Common => {
+            matches!(
+                EAW_DATA.get(c),
+                EastAsianWidth::Halfwidth
+                    | EastAsianWidth::Fullwidth
+                    | EastAsianWidth::Wide
+            ) && !EMOJI_DATA.contains(c)
+        }
+        _ => false,
+    }
+}


Here is the new space-discarding character check.

There seems to be some other East Asian scripts that prefers without-space-style e.g. Yi.
Also you should add // Especially Hangul above _ => false,.

There seems to be some other East Asian scripts that prefers without-space-style e.g. Yi.

So imo the feature is more about scripts that does not use spaces in writing than CJK characters only. This reminds me of Tangut -- don't know if they are Script::Han, though. Also, do we need to take those languages that do not use space within sentence level into account? e.g. Tibetan only uses spaces after a punctuation mark like ། .

The advantage of relying only on EAW for non-Hangul scripts recognition is that it eliminates the need to worry about about which scripts must be covered, including minor scripts like Yi or Tangut.

't know if they are Script::Han, though.

A dedicated Script Property Value Tangut is assigned to Tangut since Unicode 9 (2016).

https://www.unicode.org/standard/supported.html

isuffix · 2025-12-20T18:36:33Z

tests/suite/text/space.typ

+--- newline-space-discarding-edge-cases paged ---
+// Test newline space discarding for edge case characters.
+// Characters inspired by clreq and jlreq:
+// https://www.w3.org/TR/clreq
+// https://www.w3.org/TR/jlreq
+
+// Whether each string should discard an adjacent newline space.
+#let should-discard = (
+  // Basic characters in different languages
+  ("A", false),
+  ("漢", true),


Let me know what characters I should add to this test and whether any should change.

ｱ U+FF71 true ←EAW is H

㊙ U+3299 ? (false in my opinion) ←This is a default text presentation character like ©. Without a succeeding U+FE0F, it should not be displayed as emoji if a proper Japanese font is assigned. However, the current Firefox treats both 2 symbols as emoji because it treats every emoji character (a character that has an Emoji property) as emoji. There is Emoji_Presentation property.

한 U+D55C false ←Prose wrap options for Korean prettier/prettier#6516

₩ U+20A9 ? (false in my opinion) ←EAW is H but Korea-dedicated. Since the EAW of ¥ U+00A5 for Japan and PRC is Na, ₩ should not be treated as Chinese-or-Japanese, either. Shamefully && c != '\u{20A9}' is needed.

It is recommended to also check the character two positions away as necessary to check emojiness of the adjacent grapheme more accurately.

laurmaedje · 2026-01-08T14:08:28Z

However, we are still inexplicably using version 1.4 while icu is on 2.0 since May (now 2.1), so I'd appreciate suggestions as to whether it should be updated (it does have breaking changes).

The reason is just that nobody got to it. But there is #7412. There's still some unresolved questions though.

However, it was very annoying that I could not find any real guidance online as to whether we should prefer icu vs. unicode-rs.

Definitely ICU.

isuffix · 2026-01-09T20:59:34Z

The first push rebases off the updated #7609, the second/third just add some of the mentioned edge case characters, although I haven't updated their implementation yet.

YDX-2147483647 reviewed Nov 11, 2025

View reviewed changes

YDX-2147483647 mentioned this pull request Nov 15, 2025

Ignore linebreaks between Chinese/Japanese characters in source code #792

Open

laurmaedje added syntax About syntax, parsing, etc. cjk Chinese, Japanese, Korean typography. interface PRs that add to or change Typst's user-facing interface as opposed to internals or docs changes. labels Nov 17, 2025

isuffix force-pushed the cjk-space-collapse branch 2 times, most recently from dd58d2b to a70304f Compare December 1, 2025 02:14

isuffix mentioned this pull request Dec 20, 2025

Small refactor for space collapsing #7609

Merged

isuffix force-pushed the cjk-space-collapse branch 2 times, most recently from 9457d5c to f2e43ab Compare December 20, 2025 17:44

isuffix commented Dec 20, 2025

View reviewed changes

isuffix mentioned this pull request Dec 24, 2025

No sane way to do linebreak in source file in CJK context #7624

Closed

1 task

laurmaedje added waiting-on-review This PR is waiting to be reviewed. i18n About language- or script-specific features. May need attention from native speakers. labels Jan 5, 2026

isuffix added 3 commits January 9, 2026 15:52

Add tests for text show rules with styled spaces

efceaf9

Refactor space collapsing and add Invisible element state

12ad2b6

Split Space syntax kind into SpaceWithNewline and SpaceNoNewline

d9d2571

isuffix force-pushed the cjk-space-collapse branch 3 times, most recently from 4c7e973 to e639bba Compare January 9, 2026 20:57

Discard newline spaces adjacent to Chinese/Japanese text

b8f33e9

isuffix force-pushed the cjk-space-collapse branch from e639bba to b8f33e9 Compare January 9, 2026 21:01

laurmaedje added waiting-on-author Pull request waits on author and removed waiting-on-review This PR is waiting to be reviewed. labels Jan 22, 2026

laurmaedje added the blocked This PR is blocked by something. label Jan 29, 2026

Uh oh!

Conversation

isuffix commented Nov 10, 2025

Tradeoffs

Tradeoff: Collapsing happens during realization

Tradeoff: Normal spaces collapse only if they are adjacent to a newline space

Tradeoff: Treating space values as equal

Uh oh!

YDX-2147483647 commented Nov 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

YDX-2147483647 Nov 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ponte-vecchio Nov 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tats-u Dec 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tats-u Dec 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tats-u commented Nov 28, 2025

Uh oh!

isuffix commented Dec 1, 2025

Uh oh!

tats-u commented Dec 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

isuffix commented Dec 20, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tats-u Dec 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

laurmaedje commented Jan 8, 2026

Uh oh!

isuffix commented Jan 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

YDX-2147483647 commented Nov 11, 2025 •

edited

Loading

YDX-2147483647 Nov 11, 2025 •

edited

Loading

ponte-vecchio Nov 11, 2025 •

edited

Loading

tats-u Dec 1, 2025 •

edited

Loading

tats-u Dec 1, 2025 •

edited

Loading

tats-u commented Dec 1, 2025 •

edited

Loading

tats-u Dec 21, 2025 •

edited

Loading