[Paper] [MLSys] [Artifact Evaluation]
This repository contains the official code for "IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference" (MLSys 2026).
The artifact evaluation workflow for reproducing the paper results is provided in ArtifactEvaluation/.
IntAttention is a fully integer attention pipeline designed for efficient edge inference. Instead of falling back to a floating-point softmax and value mixing after integer QK accumulation, IntAttention keeps the whole attention path in low precision:
- S8 x S8 -> S32 for query-key accumulation
- S32 -> U8 IndexSoftmax for probability generation
- U8 x S8 -> S32 for probability-value mixing
Compared with conventional INT8 attention pipelines that dequantize to floating point around softmax, IntAttention preserves an integer computation path throughout attention, reducing memory traffic and improving CPU efficiency while maintaining accuracy.
