IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference

[Paper] [MLSys] [Artifact Evaluation]

This repository contains the official code for "IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference" (MLSys 2026).

The artifact evaluation workflow for reproducing the paper results is provided in ArtifactEvaluation/.

Overview

IntAttention is a fully integer attention pipeline designed for efficient edge inference. Instead of falling back to floating-point softmax and value mixing after integer QK accumulation, IntAttention keeps the whole attention path in low precision:

  • S8 x S8 -> S32 for query-key accumulation
  • S32 -> U8 IndexSoftmax for probability generation
  • U8 x S8 -> S32 for probability-value mixing

Compared with conventional INT8 attention pipelines that dequantize to floating point around softmax, IntAttention preserves an integer computation path throughout attention, reducing memory traffic and improving CPU efficiency while maintaining accuracy.
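The three-stage integer path above can be sketched in NumPy. This is an illustrative approximation only: the function name, shapes, and in particular the power-of-two stand-in for the paper's IndexSoftmax are assumptions for demonstration, not the repository's actual kernels.

```python
import numpy as np

def int_attention_sketch(q_s8, k_s8, v_s8):
    """Illustrative fully-integer attention path (not the paper's kernels).

    q_s8, k_s8, v_s8: int8 arrays of shape (seq, dim).
    Returns int32 attention output of shape (seq, dim).
    """
    # Stage 1: S8 x S8 -> S32 query-key accumulation.
    scores_s32 = q_s8.astype(np.int32) @ k_s8.astype(np.int32).T

    # Stage 2: S32 -> U8 probability generation. As a placeholder for
    # IndexSoftmax (whose details are not reproduced here), approximate
    # exp() with a power of two over the shifted scores, then normalize
    # onto an 8-bit scale -- all in integer arithmetic.
    shifted = scores_s32 - scores_s32.max(axis=-1, keepdims=True)  # <= 0
    exps = np.int64(1) << np.clip(16 + (shifted >> 4), 0, 16)
    probs_u8 = ((exps * 255) // exps.sum(axis=-1, keepdims=True)).astype(np.uint8)

    # Stage 3: U8 x S8 -> S32 probability-value mixing.
    return probs_u8.astype(np.int32) @ v_s8.astype(np.int32)
```

Because every stage stays in 8- or 32-bit integers, the whole path avoids the int8 -> fp32 -> int8 round trip that conventional pipelines take around softmax.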

About

Official codebase for the MLSys 2026 paper "IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference". It enables high-fidelity and high-speed LLM/ViT deployment on ARM CPUs.
