
Fix token highlighting for UTF-8 and Claude normalization #12

Merged
BedirT merged 4 commits into BedirT:main from shiquda:main
Apr 18, 2026

Conversation

@shiquda
Contributor

@shiquda shiquda commented Apr 14, 2026

Summary

  • switch token highlight range reconstruction from string matching to UTF-8 byte boundary tracking
  • add Claude normalization offset handling for full-width punctuation such as ()
  • add regression tests for UTF-8 boundaries and normalization offset mapping

Repro

Before this change, token highlighting could break on multibyte characters.
A simple repro was:

A ❌ B

With token highlighting enabled on GPT or Claude, the highlight ranges around the multibyte character could become misaligned.
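The root cause: tokenizers report token boundaries in UTF-8 bytes, while VS Code ranges use UTF-16 code-unit offsets, and ❌ (U+274C) occupies three UTF-8 bytes but one UTF-16 unit. A minimal sketch of the byte-boundary walk this PR switches to (the function name and token shape here are illustrative, not the extension's actual code):

```javascript
// Map each token's UTF-8 byte span back to UTF-16 offsets by walking
// code points and accumulating their encoded byte lengths.
function byteSpansToUtf16(text, tokenByteLengths) {
  const encoder = new TextEncoder();
  const ranges = [];
  let byteCursor = 0;
  let utf16Cursor = 0;
  for (const byteLen of tokenByteLengths) {
    const startUtf16 = utf16Cursor;
    const targetBytes = byteCursor + byteLen;
    while (byteCursor < targetBytes && utf16Cursor < text.length) {
      const codePoint = text.codePointAt(utf16Cursor);
      const ch = String.fromCodePoint(codePoint);
      byteCursor += encoder.encode(ch).length; // 1-4 bytes per code point
      utf16Cursor += ch.length;                // 1-2 UTF-16 units
    }
    ranges.push([startUtf16, utf16Cursor]);
  }
  return ranges;
}
```

For the repro above, a hypothetical tokenization `['A ', '❌', ' B']` has byte lengths `[2, 3, 2]`, and the walk yields UTF-16 ranges `[0,2]`, `[2,3]`, `[3,5]`; naive byte offsets would have put the last token at `[5,7]`, past the end of the string.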

Testing

  • npm test
  • manual verification in VS Code with GPT and Claude highlighting


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dd2d67c3c6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/extension.js
Comment on lines +487 to +489
const mappedStart = normalizationOffsetMap
? resolveOriginalOffsetFromNormalized(normalizationOffsetMap, startUtf16, 'backward')
: startUtf16;


P1 Badge Guard null normalization maps before using normalized offsets

buildNormalizationOffsetMap can return null when NFKC composition crosses grapheme boundaries (for example, half-width Hangul Jamo such as \uFFB5\uFFCC normalizing to one syllable). In that path, this fallback uses UTF-16 offsets from the normalized string directly, so ranges are applied to incorrect positions in the original document and Claude token highlighting becomes shifted/truncated. If normalization changed but no offset map is available, highlighting should be skipped or cleared rather than using normalized offsets as-is.

Useful? React with 👍 / 👎.
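A hedged sketch of the guard this review asks for, factored as a pure helper so the decision is testable (the names and the map shape follow the diff hunk above; the `textWasNormalized` flag and return shape are assumptions):

```javascript
// Resolve a normalized-string offset back to the original document, or
// signal that highlighting must be skipped when no offset map exists.
function resolveStart(normalizationOffsetMap, startUtf16, textWasNormalized, resolveFn) {
  if (normalizationOffsetMap) {
    return { ok: true, offset: resolveFn(normalizationOffsetMap, startUtf16, 'backward') };
  }
  if (textWasNormalized) {
    // Normalization changed the text but no map could be built: offsets in
    // the normalized string are meaningless in the original document, so
    // the caller should clear highlights rather than paint shifted ranges.
    return { ok: false, offset: null };
  }
  // Text unchanged by normalization: the offset is valid as-is.
  return { ok: true, offset: startUtf16 };
}
```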

Owner

@BedirT BedirT left a comment


Thanks for contributing and tackling this! I traced both bugs through the new code, and both the diagnosis and the solution look right.

I added two things inline before I merge; one's a code concern, one's a question about intent. Once we tackle those I'll pull this in!

Comment thread src/extension.js Outdated
Comment on lines +175 to +183
const segments = [];
let index = 0;
while (index < text.length) {
const codePoint = text.codePointAt(index);
const segment = String.fromCodePoint(codePoint);
segments.push({ segment, index });
index += segment.length;
}
return segments;
Owner


This fallback walks code points, not graphemes; if it ever took this branch, combining chars and ZWJ emoji would get split across segments, and the check on line 222 would then reject the exact inputs your tests cover (e\u0301, etc.).

I dug around a bit and saw that it can't actually trigger: VS Code 1.78+ always has Intl.Segmenter. But I'd rather drop it than leave a silently-wrong fallback lying around. Can we throw if Segmenter isn't available instead?
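The throw-instead-of-fallback version could look like this sketch (assuming the segments keep the `{ segment, index }` shape the fallback above produced):

```javascript
// Grapheme-safe segmentation; refuse to run rather than silently splitting
// combining marks and ZWJ emoji sequences across segments.
function segmentGraphemes(text) {
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
    throw new Error('Intl.Segmenter is required for token highlighting (VS Code 1.78+)');
  }
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  // Each yielded item carries the grapheme and its UTF-16 start index.
  return Array.from(segmenter.segment(text), ({ segment, index }) => ({ segment, index }));
}
```

Unlike the code-point walk, this keeps `e\u0301` together as one two-unit segment instead of splitting the combining acute accent into its own segment.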

Comment thread src/extension.js
Comment on lines +521 to +530
if (byteCursor !== sourceBytes.length) {
hasMismatch = true;
}

editor.setDecorations(tokenDecorations.even, evenRanges);
editor.setDecorations(tokenDecorations.odd, oddRanges);

if (hasMismatch) {
console.warn('[gpt-token-counter-live] Partial token highlight render due to unsupported UTF-8 token boundaries.');
}
Owner


Quick flag: this is a behavior change I don't think we want. Before, any mismatch in the loop (decode failure, indexOf miss) cleared all highlights and bailed. All-or-nothing, which I think makes sense in an "error" situation.

Now when hasMismatch flips mid-loop we break, but then still call setDecorations with whatever prefix of ranges got built. The console.warn is invisible to end users so they'd just see half their doc highlighted and the other half not, with no way to tell what happened.

Can we move the setDecorations calls inside an if (!hasMismatch) guard, and call clearTokenHighlights(editor) in the else branch? Keeps the new byte-level matching but restores predictable all-or-nothing behavior. The warning can stay for debugging.
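The guarded render could be factored like this sketch (the `editor` and `tokenDecorations` shapes follow the diff; the helper name and inline clearing are assumptions about how `clearTokenHighlights` behaves):

```javascript
// All-or-nothing rendering: paint both decoration sets only on a clean
// pass, and clear them on any mismatch instead of leaving a partial
// prefix of ranges on screen.
function renderTokenHighlights(editor, tokenDecorations, evenRanges, oddRanges, hasMismatch) {
  if (hasMismatch) {
    // Stand-in for clearTokenHighlights(editor): empty ranges clear a set.
    editor.setDecorations(tokenDecorations.even, []);
    editor.setDecorations(tokenDecorations.odd, []);
    console.warn('[gpt-token-counter-live] Skipped token highlight render due to unsupported UTF-8 token boundaries.');
    return;
  }
  editor.setDecorations(tokenDecorations.even, evenRanges);
  editor.setDecorations(tokenDecorations.odd, oddRanges);
}
```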

Comment thread src/extension.js
Comment on lines +221 to +224
const rebuiltNormalizedText = normalizedChunks.join('');
if (rebuiltNormalizedText !== normalizedText) {
return null;
}
Owner


One more little thing; can you add a one-line comment explaining what this check is guarding against? NFKC can merge across grapheme boundaries (your \uFFB5\uFFCC test is the example), which would make per-grapheme NFKC not reassemble into whole-text NFKC. Not obvious from the code.
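For reference, the failure mode that check guards against can be demonstrated directly in Node, independent of the extension code (per-code-point splitting here matches per-grapheme splitting, since half-width jamo are separate grapheme clusters):

```javascript
// NFKC can merge across grapheme boundaries: half-width Hangul jamo
// normalize to conjoining jamo, which compose into a single precomposed
// syllable only when the whole sequence is normalized together.
const text = '\uFFB5\uFFCC'; // half-width consonant + half-width vowel
const whole = text.normalize('NFKC');
const perGrapheme = Array.from(text, (ch) => ch.normalize('NFKC')).join('');
// whole is one precomposed syllable; perGrapheme is two uncomposed jamo,
// so rebuilding normalized text chunk-by-chunk does not reassemble it.
console.log(whole === perGrapheme); // false
```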

@shiquda
Contributor Author

shiquda commented Apr 18, 2026

Thanks, updated. I removed the non-grapheme fallback and now throw when Intl.Segmenter is unavailable, restored all-or-nothing highlight rendering on mismatches, and added a comment explaining the cross-grapheme NFKC reassembly check. I also added a regression test for the Segmenter requirement and reran npm test.

Resolve conflicts from PR BedirT#11 landing on main:
- src/extension.js imports: keep minimatch (from BedirT#11), drop TextDecoder/utf8Decoder (BedirT#12 replaced decoded-string matching with raw-byte matching)
- src/extension.js module.exports: combine __internal (BedirT#12) and _test (BedirT#11) keys
- test/suite/extension.test.js: trivial blank-line formatting
@BedirT BedirT merged commit 5251e60 into BedirT:main Apr 18, 2026
BedirT added a commit that referenced this pull request Apr 18, 2026
Adds a `huggingface` model family backed by @huggingface/tokenizers.
Loads tokenizer.json from the HuggingFace Hub (fetched once, cached in
globalStorage) or from a local absolute path. Token counts are precise;
highlighting reconstructs character offsets via decode-based walks that
group byte-level-BPE continuation tokens so CJK and emoji stay aligned.

Core pieces:
- New settings: huggingfaceModelId, huggingfaceTokenizerPath (machine scope
  so a workspace cannot trigger remote loads or direct the extension at
  arbitrary local files)
- Refresh command: gpt-token-counter-live.refreshHuggingfaceCache lets
  users force a re-fetch when they suspect upstream tokenizer drift
- Load-time probe: supportsHighlight is set from a roundtrip check so
  SentencePiece-style tokenizers report highlight unavailable instead of
  silently clearing on every edit
- Throttled auto-retry after load failures (30s cooldown) so the counter
  recovers from transient network failures without user intervention
- Atomic cache writes with self-heal on corrupt reads; tmp files cleaned
  up on write/fsync failure
- Collision-free safeId keys (SHA-256 suffix) so distinct model IDs
  cannot share a cache file
- AbortSignal.timeout on every fetch so a stalled proxy never pins
  hfStatus at loading indefinitely
- Skip offset reconstruction in the hot path when highlights are off
- State-transition gating on error notifications so sustained failures
  do not repop a toast every 30 seconds
- Separate visible-token counter for highlight band alternation so CJK
  groups do not share colors with adjacent tokens

Bumps: engines.vscode to ^1.82.0 (fetch requirement), package version to
1.5.0, README badges, CHANGELOG and README release notes under 1.5.0
covering HuggingFace, enabledFilePatterns from #11, and the UTF-8/NFKC
highlighting fixes from #12.

38 unit tests including real-tokenizer roundtrip coverage against
Xenova/gpt-4.