Simplify loading, remove duplicate indices calc, fix prefill path
- Remove clustering_config.json validation from _get_centroids (rely on
  safetensors contents directly)
- Auto-detect n_clusters from centroids tensor shape instead of requiring
  it as a parameter
- Infer vocab_size/hidden_size from weight shape instead of config metadata
- Return indices from _get_cluster_logits to avoid recomputing them in
  get_next_token (removes duplicate index_select + flatten + unique)
- Fix prefill regression: only use FlashHead for single-token decode
  (shape[0] == 1); let vLLM handle prefill natively via compiled path
- Fix sampling softmax to slice [:, -1, :] before temperature scaling
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
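
A minimal sketch of three of the changes listed above: inferring sizes from tensor shapes, gating on single-token decode, and slicing the last position before temperature scaling. It is plain Python over shape tuples and nested lists, not the project's code. The helper names (`infer_dims`, `is_single_token_decode`, `last_token_probs`) are hypothetical, and the layouts are assumptions: centroids as [n_clusters, hidden_size], the head weight as [vocab_size, hidden_size], and logits as [batch, seq, vocab].

```python
import math


def infer_dims(centroids_shape, weight_shape):
    """Derive (n_clusters, vocab_size, hidden_size) from tensor shapes alone.

    Assumed layouts: centroids [n_clusters, hidden_size],
    weight [vocab_size, hidden_size].
    """
    n_clusters, centroid_dim = centroids_shape
    vocab_size, hidden_size = weight_shape
    if centroid_dim != hidden_size:
        raise ValueError(
            f"centroid dim {centroid_dim} != hidden size {hidden_size}"
        )
    return n_clusters, vocab_size, hidden_size


def is_single_token_decode(hidden_shape):
    """True only when exactly one token is being decoded (shape[0] == 1).

    Anything else (for example a multi-token prefill) would be left to
    the default path.
    """
    return hidden_shape[0] == 1


def last_token_probs(logits, temperature=1.0):
    """Softmax over the last position only, after temperature scaling.

    logits is a nested list shaped [batch][seq][vocab]; taking
    sequence[-1] is the list equivalent of slicing [:, -1, :].
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    probs = []
    for sequence in logits:
        last = sequence[-1]                      # slice first
        scaled = [x / temperature for x in last]  # then scale
        peak = max(scaled)                        # subtract max for stability
        exps = [math.exp(x - peak) for x in scaled]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs
```

Slicing before scaling means the division and the softmax touch only one row per sequence, and the result has shape [batch, vocab] regardless of sequence length.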
Bump version to 0.1.2, fix README images for PyPI
Use absolute URLs for images so they render on PyPI.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
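
A minimal sketch of the README fix, assuming the README is Markdown and uses `![alt](path)` image syntax. The base URL and the `absolutize_images` helper are hypothetical placeholders, not taken from the repository. Relative image paths are resolved against an absolute base so they still load when the README is rendered somewhere other than the repository page.

```python
import re
from urllib.parse import urljoin

# Hypothetical base URL; substitute the repository's real raw-content URL.
BASE_URL = "https://example.com/raw/main/"

# Matches the opening of a Markdown image, "![alt](", and its target path.
_IMAGE = re.compile(r"(!\[[^\]]*\]\()([^)\s]+)")


def absolutize_images(markdown, base_url=BASE_URL):
    """Rewrite relative Markdown image targets as absolute URLs.

    Targets that are already absolute are returned unchanged by urljoin.
    """
    def fix(match):
        return match.group(1) + urljoin(base_url, match.group(2))

    return _IMAGE.sub(fix, markdown)
```

Ordinary links such as `[text](docs/page.md)` are left alone because the pattern requires the leading `!`.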