Fix token highlighting for UTF-8 and Claude normalization #12
BedirT merged 4 commits into BedirT:main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dd2d67c3c6
    const mappedStart = normalizationOffsetMap
      ? resolveOriginalOffsetFromNormalized(normalizationOffsetMap, startUtf16, 'backward')
      : startUtf16;
Guard null normalization maps before using normalized offsets
buildNormalizationOffsetMap can return null when NFKC composition crosses grapheme boundaries (for example, half-width Hangul Jamo such as \uFFB5\uFFCC normalizing to one syllable). In that path, this fallback uses UTF-16 offsets from the normalized string directly, so ranges are applied to incorrect positions in the original document and Claude token highlighting becomes shifted/truncated. If normalization changed but no offset map is available, highlighting should be skipped or cleared rather than using normalized offsets as-is.
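To see the suggested guard in isolation, here is a minimal sketch. The helper name `mapStartOffset` and the map shape (a sparse normalized-offset to original-offset table) are assumptions modeled on the snippet above, not the extension's actual code; `resolveOriginalOffsetFromNormalized`'s 'backward' mode is approximated by walking to the nearest mapped offset at or before the requested one.

```javascript
// Hypothetical helper sketching the suggested guard. A null return tells the
// caller to skip or clear highlighting rather than apply a shifted range.
function mapStartOffset(normalizationOffsetMap, startUtf16, textWasNormalized) {
  if (normalizationOffsetMap) {
    // 'backward' resolution: nearest mapped offset at or before startUtf16.
    for (let i = startUtf16; i >= 0; i--) {
      if (normalizationOffsetMap[i] !== undefined) {
        return normalizationOffsetMap[i];
      }
    }
    return null;
  }
  // No map: a normalized-string offset is only trustworthy when normalization
  // was a no-op. If the text actually changed, bail out instead of guessing.
  return textWasNormalized ? null : startUtf16;
}
```

The key difference from the reviewed snippet is the last line: with no map and a changed string, the fallback returns null instead of the normalized offset.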
BedirT left a comment
Thanks for contributing and tackling this! I traced both bugs through the new code, and both the diagnosis and the fix look right.
I added two comments inline before I merge; one's a code concern, one's a question about intent. Once we tackle those I'll pull this in!
    const segments = [];
    let index = 0;
    while (index < text.length) {
      const codePoint = text.codePointAt(index);
      const segment = String.fromCodePoint(codePoint);
      segments.push({ segment, index });
      index += segment.length;
    }
    return segments;
This fallback walks code points, not graphemes; if it ever took this branch, combining characters and ZWJ emoji would get split across segments, and the check on line 222 would then reject the exact inputs your tests cover (e\u0301, ffi, etc.).
I dug around a bit and saw that it can't actually trigger: VS Code 1.78+ always has Intl.Segmenter. Still, I'd rather drop it than leave a silently-wrong fallback lying around. Can we throw if Segmenter isn't available instead?
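To make the concern concrete, the difference is easy to demonstrate standalone (Node 16+ and VS Code's runtime both ship `Intl.Segmenter`): a code-point walk splits a combining sequence that the grapheme segmenter keeps whole.

```javascript
// 'é' written as base letter + combining acute accent:
// one grapheme cluster, two code points.
const text = 'e\u0301';

// Code-point walk (what the fallback does): two segments, 'e' and '\u0301'.
const byCodePoint = Array.from(text); // Array.from iterates by code point

// Grapheme walk (what Intl.Segmenter does): a single segment, 'e\u0301'.
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
const byGrapheme = [...segmenter.segment(text)].map((s) => s.segment);

console.log(byCodePoint.length, byGrapheme.length); // 2 1
```

Any downstream check that compares per-segment lengths against grapheme counts would therefore reject exactly these inputs if the fallback ran.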
    if (byteCursor !== sourceBytes.length) {
      hasMismatch = true;
    }

    editor.setDecorations(tokenDecorations.even, evenRanges);
    editor.setDecorations(tokenDecorations.odd, oddRanges);

    if (hasMismatch) {
      console.warn('[gpt-token-counter-live] Partial token highlight render due to unsupported UTF-8 token boundaries.');
    }
Quick flag: this is a behavior change I don't think we want. Before, any mismatch in the loop (decode failure, indexOf miss) cleared all highlights and bailed: all-or-nothing, which I think makes sense in an error situation.
Now when hasMismatch flips mid-loop we break, but then still call setDecorations with whatever prefix of ranges got built. The console.warn is invisible to end users, so they'd just see half their doc highlighted and the other half not, with no way to tell what happened.
Can we move the setDecorations calls inside an if (!hasMismatch) guard, and call clearTokenHighlights(editor) in the else branch? Keeps the new byte-level matching but restores predictable all-or-nothing behavior. The warning can stay for debugging.
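As a sketch of that shape, with the render step pulled out into a standalone helper so the all-or-nothing behavior is visible in isolation (the helper name and the exact editor/decoration wiring are illustrative, not the extension's actual structure):

```javascript
// Hypothetical helper: apply both decoration bands only when the byte-level
// walk completed cleanly; otherwise clear everything and keep the warning
// for debugging. Returns whether highlights were applied.
function applyTokenHighlights(editor, tokenDecorations, evenRanges, oddRanges,
                              hasMismatch, clearTokenHighlights) {
  if (!hasMismatch) {
    editor.setDecorations(tokenDecorations.even, evenRanges);
    editor.setDecorations(tokenDecorations.odd, oddRanges);
    return true;
  }
  // All-or-nothing: never leave a partial prefix of highlights on screen.
  clearTokenHighlights(editor);
  console.warn('[gpt-token-counter-live] Skipped token highlight render due to unsupported UTF-8 token boundaries.');
  return false;
}
```

The point is simply that the mismatch branch clears rather than renders; the byte-level matching in the loop itself is untouched.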
    const rebuiltNormalizedText = normalizedChunks.join('');
    if (rebuiltNormalizedText !== normalizedText) {
      return null;
    }
One more little thing: can you add a one-line comment explaining what this check is guarding against? NFKC can merge across grapheme boundaries (your \uFFB5\uFFCC test is the example), which would make per-grapheme NFKC fail to reassemble into whole-text NFKC. That's not obvious from the code.
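For reference, the merge is easy to demonstrate standalone: NFKC over the whole string composes the half-width Jamo pair into one precomposed syllable, while NFKC applied per grapheme cannot, because composition never sees the two Jamo side by side.

```javascript
// Half-width Hangul Jamo pair from the test case: two grapheme clusters.
const text = '\uFFB5\uFFCC';

// Whole-string NFKC: the Jamo compose into a single Hangul syllable.
const wholeNFKC = text.normalize('NFKC');

// Per-grapheme NFKC, then reassembled: two conjoining Jamo that were never
// normalized together, so they remain uncomposed.
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
const perGraphemeNFKC = [...segmenter.segment(text)]
  .map((s) => s.segment.normalize('NFKC'))
  .join('');

console.log(wholeNFKC === perGraphemeNFKC); // false: chunks don't reassemble
```

That inequality is exactly what the rebuilt-text check above detects, causing the offset-map builder to return null.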
Thanks, updated. I removed the non-grapheme fallback and now throw when Intl.Segmenter isn't available.
Resolve conflicts from PR BedirT#11 landing on main:
- src/extension.js imports: keep minimatch (from BedirT#11), drop TextDecoder/utf8Decoder (BedirT#12 replaced decoded-string matching with raw-byte matching)
- src/extension.js module.exports: combine __internal (BedirT#12) and _test (BedirT#11) keys
- test/suite/extension.test.js: trivial blank-line formatting
Adds a `huggingface` model family backed by @huggingface/tokenizers. Loads tokenizer.json from the HuggingFace Hub (fetched once, cached in globalStorage) or from a local absolute path. Token counts are precise; highlighting reconstructs character offsets via decode-based walks that group byte-level-BPE continuation tokens so CJK and emoji stay aligned.

Core pieces:
- New settings: huggingfaceModelId, huggingfaceTokenizerPath (machine scope so a workspace cannot trigger remote loads or direct the extension at arbitrary local files)
- Refresh command: gpt-token-counter-live.refreshHuggingfaceCache lets users force a re-fetch when they suspect upstream tokenizer drift
- Load-time probe: supportsHighlight is set from a roundtrip check so SentencePiece-style tokenizers report highlight unavailable instead of silently clearing on every edit
- Throttled auto-retry after load failures (30s cooldown) so the counter recovers from transient network failures without user intervention
- Atomic cache writes with self-heal on corrupt reads; tmp files cleaned up on write/fsync failure
- Collision-free safeId keys (SHA-256 suffix) so distinct model IDs cannot share a cache file
- AbortSignal.timeout on every fetch so a stalled proxy never pins hfStatus at loading indefinitely
- Skip offset reconstruction in the hot path when highlights are off
- State-transition gating on error notifications so sustained failures do not repop a toast every 30 seconds
- Separate visible-token counter for highlight band alternation so CJK groups do not share colors with adjacent tokens

Bumps: engines.vscode to ^1.82.0 (fetch requirement), package version to 1.5.0, README badges, CHANGELOG and README release notes under 1.5.0 covering HuggingFace, enabledFilePatterns from #11, and the UTF-8/NFKC highlighting fixes from #12. 38 unit tests including real-tokenizer roundtrip coverage against Xenova/gpt-4.
Summary
Before this change, token highlighting could break on multibyte characters.

Repro
With token highlighting enabled on GPT or Claude, the highlight ranges around ❌ could become misaligned.

Testing
npm test