TurboQuant
We introduce a set of advanced theoretically grounded quantization algorithms that enable massive compression for large language models and vector search engines.
From TurboQuant: Redefining AI efficiency with extreme compression by Google Research.
The practical bit is the combination: PolarQuant handles most of the compression, then QJL spends a single residual bit to correct bias. Google claims lossless-or-near-lossless KV-cache compression on long-context benchmarks, at least 6x memory reduction, and up to 8x attention-logit speedup on H100s.
It seems that graphics cards and memory prices can finally drop.