
Fix token highlighting for UTF-8 and Claude normalization #12

Merged
BedirT merged 4 commits into BedirT:main from shiquda:main
Apr 18, 2026

Conversation

@shiquda
Contributor

@shiquda shiquda commented Apr 14, 2026

Summary

  • switch token highlight range reconstruction from string matching to UTF-8 byte boundary tracking
  • add Claude normalization offset handling for full-width punctuation such as ()
  • add regression tests for UTF-8 boundaries and normalization offset mapping

Repro

Before this change, token highlighting could break on multibyte characters.
A simple repro was:

A ❌ B

With token highlighting enabled on GPT or Claude, the highlight ranges around the multibyte character could become misaligned.
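The root cause: tokenizers report token boundaries in UTF-8 bytes, while VS Code ranges use UTF-16 code-unit offsets, and ❌ (U+274C) occupies three UTF-8 bytes but one UTF-16 unit. A minimal sketch of the byte-boundary walk this PR switches to (the function name and token shape here are illustrative, not the extension's actual code):

```javascript
// Map each token's UTF-8 byte span back to UTF-16 offsets by walking
// code points and accumulating their encoded byte lengths.
function byteSpansToUtf16(text, tokenByteLengths) {
  const encoder = new TextEncoder();
  const ranges = [];
  let byteCursor = 0;
  let utf16Cursor = 0;
  for (const byteLen of tokenByteLengths) {
    const startUtf16 = utf16Cursor;
    const targetBytes = byteCursor + byteLen;
    while (byteCursor < targetBytes && utf16Cursor < text.length) {
      const codePoint = text.codePointAt(utf16Cursor);
      const ch = String.fromCodePoint(codePoint);
      byteCursor += encoder.encode(ch).length; // 1-4 bytes per code point
      utf16Cursor += ch.length;                // 1-2 UTF-16 units
    }
    ranges.push([startUtf16, utf16Cursor]);
  }
  return ranges;
}
```

For the repro above, a hypothetical tokenization `['A ', '❌', ' B']` has byte lengths `[2, 3, 2]`, and the walk yields UTF-16 ranges `[0,2]`, `[2,3]`, `[3,5]`; naive byte offsets would have put the last token at `[5,7]`, past the end of the string.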

Testing

  • npm test
  • manual verification in VS Code with GPT and Claude highlighting


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dd2d67c3c6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/extension.js
Comment on lines +487 to +489
const mappedStart = normalizationOffsetMap
? resolveOriginalOffsetFromNormalized(normalizationOffsetMap, startUtf16, 'backward')
: startUtf16;


P1 Badge Guard null normalization maps before using normalized offsets

buildNormalizationOffsetMap can return null when NFKC composition crosses grapheme boundaries (for example, half-width Hangul Jamo such as \uFFB5\uFFCC normalizing to one syllable). In that path, this fallback uses UTF-16 offsets from the normalized string directly, so ranges are applied to incorrect positions in the original document and Claude token highlighting becomes shifted/truncated. If normalization changed but no offset map is available, highlighting should be skipped or cleared rather than using normalized offsets as-is.

Useful? React with 👍 / 👎.
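A hedged sketch of the guard this review asks for, factored as a pure helper so the decision is testable (the names and the map shape follow the diff hunk above; the `textWasNormalized` flag and return shape are assumptions):

```javascript
// Resolve a normalized-string offset back to the original document, or
// signal that highlighting must be skipped when no offset map exists.
function resolveStart(normalizationOffsetMap, startUtf16, textWasNormalized, resolveFn) {
  if (normalizationOffsetMap) {
    return { ok: true, offset: resolveFn(normalizationOffsetMap, startUtf16, 'backward') };
  }
  if (textWasNormalized) {
    // Normalization changed the text but no map could be built: offsets in
    // the normalized string are meaningless in the original document, so
    // the caller should clear highlights rather than paint shifted ranges.
    return { ok: false, offset: null };
  }
  // Text unchanged by normalization: the offset is valid as-is.
  return { ok: true, offset: startUtf16 };
}
```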

Owner

@BedirT BedirT left a comment


Thanks for contributing and tackling this! I traced both bugs through the new code, and both the diagnosis and the solution look right.

I added two things inline before I merge; one's a code concern, one's a question about intent. Once we tackle those I'll pull this in!

Comment thread src/extension.js Outdated
Comment on lines +175 to +183
const segments = [];
let index = 0;
while (index < text.length) {
const codePoint = text.codePointAt(index);
const segment = String.fromCodePoint(codePoint);
segments.push({ segment, index });
index += segment.length;
}
return segments;
Owner


This fallback walks code points, not graphemes; if it ever took this branch, combining chars and ZWJ emoji would get split across segments, and the check on line 222 would then reject the exact inputs your tests cover (e\u0301, etc.).

I dug around a bit and saw that it can't actually trigger: VS Code 1.78+ always has Intl.Segmenter. But I'd rather drop it than leave a silently-wrong fallback lying around. Can we throw if Segmenter isn't available instead?
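The throw-instead-of-fallback version could look like this sketch (assuming the segments keep the `{ segment, index }` shape the fallback above produced):

```javascript
// Grapheme-safe segmentation; refuse to run rather than silently splitting
// combining marks and ZWJ emoji sequences across segments.
function segmentGraphemes(text) {
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
    throw new Error('Intl.Segmenter is required for token highlighting (VS Code 1.78+)');
  }
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  // Each yielded item carries the grapheme and its UTF-16 start index.
  return Array.from(segmenter.segment(text), ({ segment, index }) => ({ segment, index }));
}
```

Unlike the code-point walk, this keeps `e\u0301` together as one two-unit segment instead of splitting the combining acute accent into its own segment.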

Comment thread src/extension.js
Comment on lines +521 to +530
if (byteCursor !== sourceBytes.length) {
hasMismatch = true;
}

editor.setDecorations(tokenDecorations.even, evenRanges);
editor.setDecorations(tokenDecorations.odd, oddRanges);

if (hasMismatch) {
console.warn('[gpt-token-counter-live] Partial token highlight render due to unsupported UTF-8 token boundaries.');
}
Owner


Quick flag: this is a behavior change I don't think we want. Before, any mismatch in the loop (decode failure, indexOf miss) cleared all highlights and bailed. All-or-nothing, which I think makes sense in an "error" situation.

Now when hasMismatch flips mid-loop we break, but then still call setDecorations with whatever prefix of ranges got built. The console.warn is invisible to end users so they'd just see half their doc highlighted and the other half not, with no way to tell what happened.

Can we move the setDecorations calls inside an if (!hasMismatch) guard, and call clearTokenHighlights(editor) in the else branch? Keeps the new byte-level matching but restores predictable all-or-nothing behavior. The warning can stay for debugging.
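The guarded render could be factored like this sketch (the `editor` and `tokenDecorations` shapes follow the diff; the helper name and inline clearing are assumptions about how `clearTokenHighlights` behaves):

```javascript
// All-or-nothing rendering: paint both decoration sets only on a clean
// pass, and clear them on any mismatch instead of leaving a partial
// prefix of ranges on screen.
function renderTokenHighlights(editor, tokenDecorations, evenRanges, oddRanges, hasMismatch) {
  if (hasMismatch) {
    // Stand-in for clearTokenHighlights(editor): empty ranges clear a set.
    editor.setDecorations(tokenDecorations.even, []);
    editor.setDecorations(tokenDecorations.odd, []);
    console.warn('[gpt-token-counter-live] Skipped token highlight render due to unsupported UTF-8 token boundaries.');
    return;
  }
  editor.setDecorations(tokenDecorations.even, evenRanges);
  editor.setDecorations(tokenDecorations.odd, oddRanges);
}
```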

Comment thread src/extension.js
Comment on lines +221 to +224
const rebuiltNormalizedText = normalizedChunks.join('');
if (rebuiltNormalizedText !== normalizedText) {
return null;
}
Owner


One more little thing; can you add a one-line comment explaining what this check is guarding against? NFKC can merge across grapheme boundaries (your \uFFB5\uFFCC test is the example), which would make per-grapheme NFKC not reassemble into whole-text NFKC. Not obvious from the code.
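For reference, the failure mode that check guards against can be demonstrated directly in Node, independent of the extension code (per-code-point splitting here matches per-grapheme splitting, since half-width jamo are separate grapheme clusters):

```javascript
// NFKC can merge across grapheme boundaries: half-width Hangul jamo
// normalize to conjoining jamo, which compose into a single precomposed
// syllable only when the whole sequence is normalized together.
const text = '\uFFB5\uFFCC'; // half-width consonant + half-width vowel
const whole = text.normalize('NFKC');
const perGrapheme = Array.from(text, (ch) => ch.normalize('NFKC')).join('');
// whole is one precomposed syllable; perGrapheme is two uncomposed jamo,
// so rebuilding normalized text chunk-by-chunk does not reassemble it.
console.log(whole === perGrapheme); // false
```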

@shiquda
Contributor Author

shiquda commented Apr 18, 2026

Thanks, updated. I removed the non-grapheme fallback and now throw when Intl.Segmenter is unavailable, restored all-or-nothing highlight rendering on mismatches, and added a comment explaining the cross-grapheme NFKC reassembly check. I also added a regression test for the Segmenter requirement and reran npm test.

Resolve conflicts from PR BedirT#11 landing on main:
- src/extension.js imports: keep minimatch (from BedirT#11), drop TextDecoder/utf8Decoder (BedirT#12 replaced decoded-string matching with raw-byte matching)
- src/extension.js module.exports: combine __internal (BedirT#12) and _test (BedirT#11) keys
- test/suite/extension.test.js: trivial blank-line formatting
@BedirT BedirT merged commit 5251e60 into BedirT:main Apr 18, 2026
BedirT added a commit that referenced this pull request Apr 18, 2026
Adds a `huggingface` model family backed by @huggingface/tokenizers.
Loads tokenizer.json from the HuggingFace Hub (fetched once, cached in
globalStorage) or from a local absolute path. Token counts are precise;
highlighting reconstructs character offsets via decode-based walks that
group byte-level-BPE continuation tokens so CJK and emoji stay aligned.

Core pieces:
- New settings: huggingfaceModelId, huggingfaceTokenizerPath (machine scope
  so a workspace cannot trigger remote loads or direct the extension at
  arbitrary local files)
- Refresh command: gpt-token-counter-live.refreshHuggingfaceCache lets
  users force a re-fetch when they suspect upstream tokenizer drift
- Load-time probe: supportsHighlight is set from a roundtrip check so
  SentencePiece-style tokenizers report highlight unavailable instead of
  silently clearing on every edit
- Throttled auto-retry after load failures (30s cooldown) so the counter
  recovers from transient network failures without user intervention
- Atomic cache writes with self-heal on corrupt reads; tmp files cleaned
  up on write/fsync failure
- Collision-free safeId keys (SHA-256 suffix) so distinct model IDs
  cannot share a cache file
- AbortSignal.timeout on every fetch so a stalled proxy never pins
  hfStatus at loading indefinitely
- Skip offset reconstruction in the hot path when highlights are off
- State-transition gating on error notifications so sustained failures
  do not repop a toast every 30 seconds
- Separate visible-token counter for highlight band alternation so CJK
  groups do not share colors with adjacent tokens

Bumps: engines.vscode to ^1.82.0 (fetch requirement), package version to
1.5.0, README badges, CHANGELOG and README release notes under 1.5.0
covering HuggingFace, enabledFilePatterns from #11, and the UTF-8/NFKC
highlighting fixes from #12.

38 unit tests including real-tokenizer roundtrip coverage against
Xenova/gpt-4.