Run 70B+ LLMs on a single 4GB GPU — no quantization required.
RabbitLLM is a fork of AirLLM. It enables inference on large language models (70B+ parameters) on consumer GPUs with as little as 4GB VRAM by streaming model layers one at a time through GPU memory. No quantization, distillation, or pruning needed — full model quality.
- Tested and supported: only the Qwen2 and Qwen3 families are currently compatible. Use these for reliable results.
- Other architectures (Llama, Mistral, Mixtral, etc.) are present in the codebase but not yet compatible — use at your own risk.
- Apple (macOS / Apple Silicon) is not supported; run on Linux or Windows with a CUDA-capable GPU (or CPU fallback on x86/ARM Linux).
Instead of loading the entire model into GPU memory, RabbitLLM:
- Splits the HuggingFace checkpoint into per-layer safetensors files (once, on first use).
- Streams each layer individually: load to GPU → forward pass → free GPU memory.
- Prefetches the next layer in a background thread while the current layer is computing.
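The three steps above can be sketched in pure Python. Note that `load_layer` and `forward` below are illustrative stand-ins for the real safetensors loading and GPU forward pass, not RabbitLLM's actual API:

```python
import threading

def run_layered_inference(layer_paths, hidden, load_layer, forward):
    """Stream layers one at a time (load -> forward -> free), prefetching
    the next layer in a background thread while the current one computes."""
    prefetched = {}

    def prefetch(i):
        prefetched[i] = load_layer(layer_paths[i])

    thread = None
    for i in range(len(layer_paths)):
        # Wait for the background prefetch of this layer, if one was started.
        if thread is not None:
            thread.join()
        layer = prefetched.pop(i, None)
        if layer is None:
            layer = load_layer(layer_paths[i])
        # Kick off loading of the next layer before computing on this one.
        if i + 1 < len(layer_paths):
            thread = threading.Thread(target=prefetch, args=(i + 1,))
            thread.start()
        hidden = forward(layer, hidden)
        del layer  # in the real pipeline this is where GPU memory is freed
    return hidden
```

Because the prefetch thread overlaps I/O with compute, per-layer load time is largely hidden as long as a layer's forward pass takes at least as long as loading the next layer.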
Optional 4-bit/8-bit block-wise compression (via bitsandbytes) can reduce layer size further for up to 3× speed-up with minimal accuracy loss.
```bash
pip install rabbitllm
```

Optional — Flash Attention 2 (faster on Ampere+ GPUs, e.g. RTX 30xx/40xx):

```bash
pip install rabbitllm[flash]
```

If the prebuilt wheel is unavailable for your setup, install it from flashattn.dev. Without Flash Attention, SDPA is used automatically.
Build and run with GPU support (requires NVIDIA Container Toolkit on the host):
```bash
docker build -t rabbitllm .
docker run --gpus all -it rabbitllm python scripts/inference_example.py --model Qwen/Qwen2.5-0.5B-Instruct --max-new-tokens 20
```

Optional env vars: HF_TOKEN for gated models, HF_HOME for the model cache directory. Example:
```bash
docker run --gpus all -e HF_TOKEN=hf_... -it rabbitllm python scripts/inference_example.py --model Qwen/Qwen2.5-7B-Instruct
```

```python
import warnings

import torch

from rabbitllm import AutoModel

# Use GPU if available, otherwise CPU; suppress a spurious CUDA warning
# that torch.cuda.is_available() can emit on some systems.
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", message=".*CUDA.*unknown error.*", category=UserWarning)
    device = "cuda:0" if torch.cuda.is_available() else "cpu"

# compression: "4bit" (recommended), "8bit", or None (bfloat16)
model = AutoModel.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    device=device,
    compression="4bit",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
input_text = model.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
tokens = model.tokenizer(
    [input_text], return_tensors="pt", truncation=True, max_length=512
)
input_ids = tokens["input_ids"].to(device)
attention_mask = tokens.get("attention_mask")
if attention_mask is None:
    attention_mask = torch.ones_like(input_ids, dtype=torch.long, device=device)
else:
    attention_mask = attention_mask.to(device)

output = model.generate(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=200,
    use_cache=True,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    return_dict_in_generate=True,
)

# Decode only the newly generated tokens
input_len = tokens["input_ids"].shape[1]
print(model.tokenizer.decode(output.sequences[0][input_len:], skip_special_tokens=True))
```

AutoModel automatically detects the model architecture from the HuggingFace config — no need to pick the right class manually.
Only Qwen2 and Qwen3 are tested and supported. The table below lists the architectures present in the codebase; all others are not yet compatible.

| Family | Architectures | Class | Status |
|---|---|---|---|
| Qwen2 / Qwen2.5 / Qwen3 | `Qwen2ForCausalLM`, `Qwen3ForCausalLM` | `RabbitLLMQWen2` | Tested, supported |
| Llama 2 / 3 / 3.1 / 3.2 | `LlamaForCausalLM` | `RabbitLLMLlama2` | Not yet compatible |
| Qwen v1 | `QWenLMHeadModel` | `RabbitLLMQWen` | Not yet compatible |
| Mistral | `MistralForCausalLM` | `RabbitLLMMistral` | Not yet compatible |
| Mixtral | `MixtralForCausalLM` | `RabbitLLMMixtral` | Not yet compatible |
| InternLM | `InternLMForCausalLM` | `RabbitLLMInternLM` | Not yet compatible |
| ChatGLM | `ChatGLMModel` | `RabbitLLMChatGLM` | Not yet compatible |
| Baichuan | `BaichuanForCausalLM` | `RabbitLLMBaichuan` | Not yet compatible |
| Gemma 2 / 3 | `Gemma2ForCausalLM`, `Gemma3ForCausalLM` | `RabbitLLMLlama2` | Not yet compatible |
| DeepSeek V2 / V3 | `DeepseekV2ForCausalLM`, `DeepseekV3ForCausalLM` | `RabbitLLMLlama2` | Not yet compatible |
| Phi 2 / 3 / 4 | `Phi3ForCausalLM`, `Phi4ForCausalLM` | `RabbitLLMLlama2` | Not yet compatible |

Unknown architectures fall back to the Llama-based implementation with a warning.
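To illustrate how this kind of dispatch works, here is a simplified sketch: the class names match the table above, but the dictionary and `pick_class` helper are stand-ins, not RabbitLLM's internal lookup:

```python
import warnings

# Illustrative mapping from config.json `architectures` entries to
# implementation classes (abridged from the compatibility table).
ARCH_TO_CLASS = {
    "Qwen2ForCausalLM": "RabbitLLMQWen2",
    "Qwen3ForCausalLM": "RabbitLLMQWen2",
    "LlamaForCausalLM": "RabbitLLMLlama2",
    "MistralForCausalLM": "RabbitLLMMistral",
}

def pick_class(architectures):
    """Resolve an implementation class from the HuggingFace config's
    `architectures` list, falling back to the Llama-based one."""
    for arch in architectures:
        if arch in ARCH_TO_CLASS:
            return ARCH_TO_CLASS[arch]
    warnings.warn(f"Unknown architectures {architectures}; "
                  "falling back to the Llama-based implementation")
    return "RabbitLLMLlama2"
```

This is why the fallback is only best-effort: an unknown architecture is assumed to be Llama-shaped, which works for some models and fails for others.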
```python
model = AutoModel.from_pretrained(
    "Qwen/Qwen2.5-72B-Instruct",
    compression="4bit",           # "4bit" | "8bit" | None (default)
    attn_implementation="auto",   # "auto" | "flash_attention_2" | "sdpa" | "eager"
    max_seq_len=512,              # maximum sequence length
    prefetching=True,             # overlap layer loading with compute
    prefetch_pin_memory=True,     # faster CPU→GPU for small/medium models
    use_gds=True,                 # GPU Direct Storage (kvikio) when available
    kv_cache_dir=None,            # path to offload KV cache for long context (50k+ tokens)
    token="hf_...",               # HuggingFace token for gated repos
    layer_shards_saving_path="/path/to/cache",  # custom split cache directory
    profiling_mode=False,         # print per-layer timing
    delete_original=False,        # delete original shards after splitting
)
```

Block-wise quantization reduces on-disk and in-memory layer size:
- 4-bit (NF4): ~28% of original size, up to 3× faster loading, minimal quality loss.
- 8-bit: ~50% of original size.
```python
model = AutoModel.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", compression="4bit")
```

Requires bitsandbytes: `pip install bitsandbytes`.
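As a back-of-envelope check of what those ratios mean, the illustrative helper below applies the quoted percentages to a bf16 baseline of 2 bytes per parameter (it is not part of RabbitLLM):

```python
def approx_footprint_gb(n_params_billion, compression=None):
    """Rough model footprint in GB: bf16 baseline is 2 bytes/param,
    4-bit keeps ~28% of that and 8-bit ~50% (ratios quoted above)."""
    base_gb = n_params_billion * 2  # billions of params * 2 bytes -> GB
    ratio = {"4bit": 0.28, "8bit": 0.50, None: 1.0}[compression]
    return base_gb * ratio

# A 72B model: ~144 GB in bf16, ~40 GB at 4-bit, ~72 GB at 8-bit.
```

Since layers stream through the GPU one at a time, the number that matters per step is a single layer's share of this total, which is why even 72B models fit through 4GB of VRAM.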
For CUDA without compression, install kvikio-cu12 to load layers directly from disk to GPU,
bypassing CPU and pin_memory (can significantly speed up 70B+ models):
```bash
pip install rabbitllm[gds]
# or: pip install kvikio-cu12
```

Set use_gds=False to disable.
For 50k+ token contexts, pass kv_cache_dir to offload the KV cache to SSD:

```python
model = AutoModel.from_pretrained("Qwen/Qwen2.5-72B-Instruct", kv_cache_dir="./kv_cache")
```

Scripts to measure throughput and compare configurations:
| Script | What it measures |
|---|---|
| `scripts/benchmark_improvements.py` | GDS (GPU Direct Storage) and long-context DiskKVCache improvements |
| `scripts/benchmark_cpu_vs_cuda.py` | CPU vs CUDA inference with layer-streaming (same model and prompt) |
| `scripts/check_attention_and_benchmark.py --benchmark` | Throughput comparison: auto vs SDPA vs eager attention |
GDS and DiskKVCache:
```bash
# Local: make install pulls in kvikio (--extra gds)
make install
uv run python scripts/benchmark_improvements.py --mode gds
uv run python scripts/benchmark_improvements.py --mode long_context

# Docker (make bash): install with GDS first
pip install -e ".[gds]"
python scripts/benchmark_improvements.py --mode gds
```

Quick CPU vs CUDA comparison:
```bash
uv run python scripts/benchmark_cpu_vs_cuda.py
uv run python scripts/benchmark_cpu_vs_cuda.py --model Qwen/Qwen2.5-1.5B-Instruct --runs 3
```

Attention implementation (auto vs SDPA vs eager):

```bash
uv run python scripts/check_attention_and_benchmark.py --benchmark
```

Detailed results and a per-step breakdown for Qwen2.5-72B (e.g. pin_memory, async, 4-bit) are in docs/BENCHMARK_HISTORY.md.
Pass a HuggingFace token for repos that require access approval:
```python
model = AutoModel.from_pretrained("Qwen/Qwen2.5-7B-Instruct", token="hf_YOUR_TOKEN")
```

Or set the HF_TOKEN environment variable.
To keep model downloads local and out of git, set HF_HOME before running:
```bash
export HF_HOME="$(pwd)/models"
```

The models/ directory is in .gitignore. RabbitLLM stores the split layers alongside the HuggingFace cache.
- docs/ARCHITECTURE.md — Design decisions: layer-streaming, KV cache, tied weights, attention implementations.
- docs/COMPATIBILITY.md — Transformers version, model matrix, Flash Attention, Qwen2 notes.
- docs/TROUBLESHOOTING.md — Common issues and how to debug them.
- CONTRIBUTING.md — How to set up the dev environment and add new models.
```bash
# Install with dev dependencies
pip install uv
uv sync --extra dev
# or: make install

# Run tests
make test

# Lint and format
make lint
make format

# Type check
make typecheck
```

MetadataIncompleteBuffer on first run
The model splitting process is disk-intensive. Check available space — you need roughly the model size free in the split output directory.
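A quick way to pre-check free space with the standard library; `enough_space_for_split` is an illustrative helper, not part of RabbitLLM:

```python
import shutil

def enough_space_for_split(split_dir, model_size_bytes):
    """Return True if the split output directory has at least roughly
    one model's worth of free space (the rule of thumb above)."""
    free_bytes = shutil.disk_usage(split_dir).free
    return free_bytes >= model_size_bytes

# e.g. enough_space_for_split("./models", 145 * 1024**3) before splitting
# a ~145 GB checkpoint
```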
ValueError: max() arg is an empty sequence
You are likely loading a Qwen or ChatGLM model with the wrong class. Use AutoModel:
```python
from rabbitllm import AutoModel

model = AutoModel.from_pretrained("Qwen/Qwen-7B")
```

ValueError: Asking to pad but the tokenizer does not have a padding token
Turn off padding:
```python
input_tokens = model.tokenizer(text, padding=False, truncation=True, max_length=128, return_tensors="pt")
```

License: MIT
