Description
Bug
Creating a task fails with the following error when the task prompt contains multi-byte UTF-8 characters (e.g. em dash —, curly quotes, etc.):
Internal error creating task.
pq: invalid byte sequence for encoding "UTF8": 0x97
Root Cause
Go's json.Decoder yields only valid UTF-8 strings (invalid bytes are replaced with U+FFFD during decoding), so the prompt input from the HTTP request is always valid. The issue is in strutil.Truncate (coderd/util/strings/strings.go), which operates on byte indices rather than runes:
if len(s) <= n { // len(s) is byte length, not rune count
    return s
}
...
_, _ = sb.WriteString(s[:maxLen]) // byte slice, not rune-safe

When a multi-byte UTF-8 character (e.g. em dash — = 3 bytes E2 80 94) straddles the truncation boundary, it gets split, producing an invalid byte sequence that PostgreSQL rejects on insert.
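The split described above can be reproduced in isolation. The following is a minimal sketch: byteTruncate is a hypothetical stand-in that mimics the byte-index slicing, not the actual strutil.Truncate code, and the prompt string is invented for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteTruncate is a hypothetical helper that mimics byte-index slicing.
// It is NOT the real strutil.Truncate, only an illustration of the failure mode.
func byteTruncate(s string, n int) string {
	if len(s) <= n { // byte length, not rune count
		return s
	}
	return s[:n] // may cut through a multi-byte character
}

func main() {
	// "fix login " is 10 bytes; U+2014 (em dash) follows as E2 80 94.
	prompt := "fix login \u2014 urgent"

	cut := byteTruncate(prompt, 11) // keeps only the lead byte E2 of the em dash

	fmt.Printf("last byte: % x\n", cut[len(cut)-1:])
	fmt.Println("valid UTF-8:", utf8.ValidString(cut))
}
```

Running utf8.ValidString on the truncated value before the insert is one cheap way to confirm whether a given prompt triggers the problem.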
This is called from generateFromPrompt() in coderd/taskname/taskname.go:
- Task name generation: strutil.Truncate(prompt, 27, strutil.TruncateWithFullWords)
- Display name generation: strutil.Truncate(prompt, 64, strutil.TruncateWithFullWords, strutil.TruncateWithEllipsis)
There's also a similar byte-slicing issue in generateFallback(): name[:min(len(name), 27)]
Additionally, strings.LastIndexFunc(s[:maxLen], unicode.IsSpace) in Truncate byte-slices the string before searching for a word boundary.
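One way to keep a byte-based limit while avoiding the split is to back the cut point up to the nearest rune start before searching for a word boundary. The sketch below assumes the limit remains a byte budget. safePrefix and truncateFullWords are hypothetical names, and the word-trimming rule here is a simplification, not necessarily what TruncateWithFullWords does.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// safePrefix (hypothetical) returns the longest prefix of s that is at most
// maxLen bytes and does not end in the middle of a multi-byte character.
func safePrefix(s string, maxLen int) string {
	if maxLen <= 0 {
		return ""
	}
	if len(s) <= maxLen {
		return s
	}
	end := maxLen
	// Back up while s[end] is a UTF-8 continuation byte (10xxxxxx).
	for end > 0 && !utf8.RuneStart(s[end]) {
		end--
	}
	return s[:end]
}

// truncateFullWords (hypothetical) cuts on a rune boundary first, then trims
// back to the last whitespace found in that prefix, if there is one.
func truncateFullWords(s string, maxLen int) string {
	prefix := safePrefix(s, maxLen)
	if len(prefix) == len(s) {
		return s // nothing was cut
	}
	if i := strings.LastIndexFunc(prefix, unicode.IsSpace); i > 0 {
		return prefix[:i]
	}
	return prefix
}

func main() {
	prompt := "fix login \u2014 urgent" // em dash occupies bytes 10..12

	fmt.Printf("%q\n", safePrefix(prompt, 11))        // cut backs up to byte 10
	fmt.Printf("%q\n", truncateFullWords(prompt, 12)) // then trims to the last space
	fmt.Println(utf8.ValidString(safePrefix(prompt, 12)))
}
```

Because the word-boundary search runs on a prefix that already ends on a rune boundary, the result stays valid UTF-8 even when no whitespace is found.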
Expected Behavior
Truncation should be rune-aware and never produce invalid UTF-8.
Suggested Fix
Make Truncate operate on rune boundaries instead of byte boundaries. For example, use utf8.RuneCountInString for length checks and iterate runes instead of byte-slicing. Also fix generateFallback to avoid raw byte slicing.
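A minimal sketch of the rune-counting approach follows, assuming the limit n is reinterpreted as a number of runes rather than bytes. truncateRunes is a hypothetical helper, and it deliberately ignores the TruncateWithFullWords and TruncateWithEllipsis options.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateRunes (hypothetical) returns at most n runes of s and always cuts
// on a rune boundary, so valid input yields valid output.
func truncateRunes(s string, n int) string {
	if n <= 0 {
		return ""
	}
	if utf8.RuneCountInString(s) <= n { // rune count, not byte length
		return s
	}
	count := 0
	for i := range s { // i is the byte offset where each rune starts
		if count == n {
			return s[:i]
		}
		count++
	}
	return s
}

func main() {
	prompt := "fix login \u2014 urgent"

	out := truncateRunes(prompt, 11) // 10 ASCII runes plus the em dash
	fmt.Printf("%q (%d bytes)\n", out, len(out))
	fmt.Println("valid UTF-8:", utf8.ValidString(out))
}
```

Note that counting runes means the result can exceed n bytes, so any downstream limit that is expressed in bytes (for example a column width) would still need its own check.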
Existing tests only use ASCII strings, so adding multi-byte character test cases would prevent regressions.
Created on behalf of @bjornrobertsson