IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference

[Paper] [MLSys] [Artifact Evaluation]

This repository contains the official code for "IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference" (MLSys 2026).

The artifact evaluation workflow for reproducing the paper results is provided in ArtifactEvaluation/.

Overview

IntAttention is a fully integer attention pipeline designed for efficient edge inference. Instead of falling back to floating-point softmax and value mixing after integer QK accumulation, IntAttention keeps the whole attention path in low precision:

  • S8 x S8 -> S32 for query-key accumulation
  • S32 -> U8 IndexSoftmax for probability generation
  • U8 x S8 -> S32 for probability-value mixing

Compared with conventional INT8 attention pipelines that dequantize to floating point around softmax, IntAttention preserves an integer computation path throughout attention, reducing memory traffic and improving CPU efficiency while maintaining accuracy.
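The three-stage integer path above can be sketched in NumPy. This is an illustrative approximation only: the function name, shapes, and in particular the power-of-two stand-in for the paper's IndexSoftmax are assumptions for demonstration, not the repository's actual kernels.

```python
import numpy as np

def int_attention_sketch(q_s8, k_s8, v_s8):
    """Illustrative fully-integer attention path (not the paper's kernels).

    q_s8, k_s8, v_s8: int8 arrays of shape (seq, dim).
    Returns int32 attention output of shape (seq, dim).
    """
    # Stage 1: S8 x S8 -> S32 query-key accumulation.
    scores_s32 = q_s8.astype(np.int32) @ k_s8.astype(np.int32).T

    # Stage 2: S32 -> U8 probability generation. As a placeholder for
    # IndexSoftmax (whose details are not reproduced here), approximate
    # exp() with a power of two over the shifted scores, then normalize
    # onto an 8-bit scale -- all in integer arithmetic.
    shifted = scores_s32 - scores_s32.max(axis=-1, keepdims=True)  # <= 0
    exps = np.int64(1) << np.clip(16 + (shifted >> 4), 0, 16)
    probs_u8 = ((exps * 255) // exps.sum(axis=-1, keepdims=True)).astype(np.uint8)

    # Stage 3: U8 x S8 -> S32 probability-value mixing.
    return probs_u8.astype(np.int32) @ v_s8.astype(np.int32)
```

Because every stage stays in 8- or 32-bit integers, the whole path avoids the int8 -> fp32 -> int8 round trip that conventional pipelines take around softmax.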

About

Official codebase for the MLSys 2026 paper "IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference". It enables high-fidelity and high-speed LLM/ViT deployment on ARM CPUs.
