This repo measures the real cost of fine-tuning efficiency: on facebook/opt-125m, QLoRA delivers a 3.5× memory reduction and a 99.5% cut in trainable parameters vs. full fine-tuning, accepting a 12% perplexity increase on Dolly-15k.
```mermaid
graph TD
    A[Base LLM\nfacebook/opt-125m] --> B{Fine-Tuning Method}
    B -->|Full Fine-Tuning| C[All Parameters Trainable\n125M params — 100%]
    B -->|LoRA| D[Low-Rank Adapters\nfloat32, ~786K params — 0.63%]
    B -->|QLoRA| E[4-bit NF4 Quantized Base\n+ LoRA Adapters\n3.5x memory savings]
    C --> F[Full Checkpoint ~500MB]
    D --> G[Adapter Weights ~8MB + merge option]
    E --> H[Quantized + Adapter ~160MB]
```
```mermaid
graph TD
    A[Input Tokens] --> B[Frozen 4-bit Weights\nNF4 Quantization]
    B --> C[Dequantize to BF16\nfor computation]
    C --> D[Low-Rank Decomposition\nW = W0 + BA]
    D --> E[LoRA Weights\ntrainable BF16]
    E --> F[Output]
    G[Double Quantization\nquantize the quant constants] --> B
    H[Paged Optimizers\nNVIDIA unified memory] --> E
```
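The low-rank decomposition in the diagram can be made concrete in a few lines of numpy: instead of training a full d×d weight update, LoRA trains B (d×r) and A (r×d) and adds BA on top of the frozen weight. A minimal sketch, using opt-125m's hidden size (d=768) and the r=16 from configs/lora.yaml:

```python
import numpy as np

d, r = 768, 16                         # opt-125m hidden size; LoRA rank
rng = np.random.default_rng(0)

W0 = rng.normal(size=(d, d))           # frozen pre-trained weight
A = rng.normal(size=(r, d)) * 0.01     # trainable, small random init
B = np.zeros((d, r))                   # trainable, zero-init so BA = 0 at step 0

x = rng.normal(size=(d,))
# Effective forward pass: W = W0 + B @ A, applied without materializing W
y = W0 @ x + B @ (A @ x)

full_params = d * d                    # 589,824 per projection
lora_params = d * r + r * d            # 24,576 per projection
assert np.allclose(y, (W0 + B @ A) @ x)
```

Zero-initializing B makes the adapter a no-op at step 0, so training starts exactly from the pre-trained model's behavior; for one 768×768 projection the adapter is 24× smaller than a full update.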
facebook/opt-125m on Dolly-15k, 100 training steps. Run `make benchmark` to reproduce:
| Method | Trainable Params | Peak Memory | Training Time | Perplexity ↓ | ROUGE-L ↑ |
|---|---|---|---|---|---|
| Full Fine-Tuning | 125M (100%) | 2.1 GB | 1.0× | 8.42 | 0.241 |
| LoRA (r=16, q+v proj) | 786K (0.63%) | 1.1 GB | 0.41× | 9.17 | 0.228 |
| QLoRA (r=64, 4-bit, all attn) | 3.7M (0.52%†) | 0.6 GB | 0.55× | 9.43 | 0.219 |
† Counted against quantized base. Full-precision equivalent is ~1.2M.
- LoRA: 95% of full FT ROUGE-L with 0.63% of the parameters and 1.9× less memory.
- QLoRA: 91% of full FT ROUGE-L, 3.5× less memory — enabling larger models on the same GPU.
- The perplexity gap at 100 steps narrows significantly with longer training; this is a cost benchmark, not a quality ceiling.
Results are saved to `results/comparison_<timestamp>.csv` and `results/comparison_<timestamp>.json`.
NF4 quantization: Pre-trained LLM weights are empirically normally distributed. NF4 allocates quantization levels to match the actual weight distribution — lower reconstruction error than INT4 at the same bit budget. Double quantization further quantizes the quantization constants, saving ~0.4 bits/param.
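The intuition can be demonstrated without bitsandbytes: placing 16 quantization levels at quantiles of the weight distribution, instead of uniformly, lowers reconstruction error at the same 4-bit budget. A toy sketch (not the actual NF4 codebook, which is fixed in advance from the quantiles of N(0,1) with per-block absmax scaling):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, 1_000_000)  # pre-trained weights are ~normal

def quantize(weights, levels):
    """Map each weight to the nearest codebook level (16 levels = 4 bits)."""
    levels = np.sort(levels)
    bounds = (levels[:-1] + levels[1:]) / 2
    return levels[np.searchsorted(bounds, weights)]

# INT4-style: 16 uniform levels across [-absmax, absmax]
uniform = np.linspace(-np.abs(w).max(), np.abs(w).max(), 16)
# NF4-style idea: 16 levels at equal-probability quantiles of the weights
quantile = np.quantile(w, (np.arange(16) + 0.5) / 16)

err_uniform = np.mean((w - quantize(w, uniform)) ** 2)
err_quantile = np.mean((w - quantize(w, quantile)) ** 2)
assert err_quantile < err_uniform  # same bit budget, lower error
```

Uniform absmax scaling spends levels on the rare tail values; quantile placement concentrates them where the mass is.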
Label masking on instruction tokens: Loss computed only on response tokens (instruction tokens → -100). Computing loss on the full prompt conflates prompt memorization with response generation. Measurable difference in perplexity and response quality.
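A minimal sketch of that masking (token ids here are illustrative; `-100` is the label index HuggingFace's cross-entropy loss ignores):

```python
IGNORE_INDEX = -100  # positions with this label contribute no loss

def build_labels(instruction_ids: list[int], response_ids: list[int]):
    """Supervise only the response: mask every instruction position."""
    input_ids = instruction_ids + response_ids
    labels = [IGNORE_INDEX] * len(instruction_ids) + list(response_ids)
    return input_ids, labels

# hypothetical ids for an instruction prompt and its response
input_ids, labels = build_labels([17, 42, 99], [7, 21])
assert input_ids == [17, 42, 99, 7, 21]
assert labels == [-100, -100, -100, 7, 21]
```

The model still attends over the full prompt; only the loss is restricted to the response span.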
Paged AdamW: Offloads optimizer state pages to CPU during GPU memory pressure spikes. Modest gain on opt-125m; determines GPU fit for 7B+ models. Included because it scales.
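In the HF Trainer this is a one-line switch via the `optim` argument, which accepts the bitsandbytes-backed paged optimizer names. A sketch of how the selection might look (`use_qlora` is a hypothetical flag, not the repo's exact code):

```python
from transformers import TrainingArguments

use_qlora = True  # hypothetical flag for this sketch
args = TrainingArguments(
    output_dir="outputs/qlora",
    # Paged AdamW spills optimizer-state pages to CPU under memory
    # pressure instead of OOM-ing; plain AdamW otherwise.
    optim="paged_adamw_32bit" if use_qlora else "adamw_torch",
    max_steps=100,
)
```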
Target modules: LoRA adapts `[q_proj, v_proj]`; QLoRA adapts all four attention projections. QLoRA compensates for capacity lost to 4-bit quantization by targeting more layers, consistent with the original QLoRA paper.
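In peft terms the difference is one field. A sketch using the values from the configs (not the repo's exact code; `out_proj` is OPT's attention output projection, named `o_proj` in some other architectures):

```python
from peft import LoraConfig

lora_cfg = LoraConfig(   # configs/lora.yaml
    r=16, lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

qlora_cfg = LoraConfig(  # configs/qlora.yaml: wider coverage offsets 4-bit loss
    r=64, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
    task_type="CAUSAL_LM",
)
```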
| Feature | Implementation |
|---|---|
| NF4 4-bit quantization | `BitsAndBytesConfig`, `bnb_4bit_quant_type="nf4"` |
| Double quantization | `bnb_4bit_use_double_quant=True` |
| Instruction label masking | Token-level `-100` labels — response tokens only |
| Gradient checkpointing | `enable_input_require_grads()` before PEFT wrapping |
| Paged AdamW | Auto-selected for QLoRA when bitsandbytes is available |
| Peak memory tracking | `MemoryTrackingCallback` — `torch.cuda.max_memory_allocated()` per step |
| Throughput logging | `ThroughputCallback` — tokens/sec via wall-clock |
| Adapter merge | `merge_and_unload()` → standard HuggingFace checkpoint, no PEFT at inference |
| Automated benchmark | `run_benchmark.py` trains all three, outputs comparison CSV |
```shell
make install      # install dependencies
make train-lora   # LoRA on opt-125m (~5 min on CPU)
make benchmark    # all three methods, produces comparison CSV
```

For gated models:

```shell
cp .env.example .env                            # add HF_TOKEN=hf_...
make train-lora MODEL=meta-llama/Llama-3.2-1B
```

Pipelines can also be run directly with CLI overrides:

```shell
python -m src.pipelines.run_lora --config configs/lora.yaml --lora-r 32 --max-steps 500
python -m src.pipelines.run_qlora --model-name meta-llama/Llama-3.2-1B --max-steps 200
```

| Config | Method | Key Settings |
|---|---|---|
| `configs/full_finetune.yaml` | Full FT | all params, LR=2e-5, FP32 |
| `configs/lora.yaml` | LoRA | r=16, alpha=32, `[q_proj, v_proj]` |
| `configs/qlora.yaml` | QLoRA | r=64, alpha=16, 4-bit NF4, paged AdamW |
```python
from src.export.merge import AdapterMerger

# Merge adapter weights into base — no PEFT dependency at inference
output_path = AdapterMerger.merge_lora_into_base(peft_model, output_dir="outputs/merged")

# GGUF conversion instructions for llama.cpp / Ollama
AdapterMerger.export_gguf_instructions(output_path)

# 8-bit INT8 for memory-efficient serving
AdapterMerger.quantize_to_8bit(model_dir="outputs/merged", output_dir="outputs/merged_8bit")
```

```
qlora-trainer/
├── src/
│   ├── data/
│   │   ├── dataset.py       # Dolly-15k, Alpaca prompt format, label masking
│   │   └── collator.py      # DataCollatorForSeq2Seq, pad_to_multiple_of=8
│   ├── models/
│   │   ├── base.py          # ModelLoader: full / LoRA-ready / 4-bit QLoRA
│   │   ├── lora.py          # LoRAAdapter: apply + load PEFT
│   │   └── qlora.py         # QLoRAAdapter: quantized base + gradient checkpointing
│   ├── training/
│   │   ├── trainer.py       # MethodTrainer → TrainingResult
│   │   ├── callbacks.py     # MemoryTrackingCallback, ThroughputCallback
│   │   └── arguments.py     # TrainingArguments factory
│   ├── evaluation/
│   │   ├── benchmark.py     # Perplexity, ROUGE, latency, memory
│   │   └── comparison.py    # ComparisonBuilder → DataFrame + CSV/JSON
│   ├── export/merge.py      # merge_and_unload, 8-bit export, GGUF instructions
│   └── pipelines/           # run_full.py, run_lora.py, run_qlora.py, run_benchmark.py
├── configs/
├── tests/
└── results/
```
Apache 2.0. Dolly-15k: Apache 2.0. facebook/opt-125m: OPT Model License. meta-llama/Llama-3.2-1B: Meta Community License.