sarihammad/TuneIt

Parameter-Efficient Fine-Tuning Benchmark


Measures the real cost of fine-tuning efficiency: on facebook/opt-125m, QLoRA delivers a 3.5× memory reduction and a 99.5% trainable-parameter reduction vs. full fine-tuning, at the price of a 12% perplexity increase on Dolly-15k.


Architecture

Fine-Tuning Methods

```mermaid
graph TD
    A[Base LLM\nfacebook/opt-125m] --> B{Fine-Tuning Method}
    B -->|Full Fine-Tuning| C[All Parameters Trainable\n125M params — 100%]
    B -->|LoRA| D[Low-Rank Adapters\nfloat32, ~786K params — 0.63%]
    B -->|QLoRA| E[4-bit NF4 Quantized Base\n+ LoRA Adapters\n3.5x memory savings]
    C --> F[Full Checkpoint ~500MB]
    D --> G[Adapter Weights ~8MB + merge option]
    E --> H[Quantized + Adapter ~160MB]
```

QLoRA Architecture

```mermaid
graph TD
    A[Input Tokens] --> B[Frozen 4-bit Weights\nNF4 Quantization]
    B --> C[Dequantize to BF16\nfor computation]
    C --> D[Low-Rank Decomposition\nW = W0 + BA]
    D --> E[LoRA Weights\ntrainable BF16]
    E --> F[Output]
    G[Double Quantization\nquantize the quant constants] --> B
    H[Paged Optimizers\nNVIDIA unified memory] --> E
```

Benchmark Results

facebook/opt-125m on Dolly-15k, 100 training steps. Run make benchmark to reproduce:

| Method | Trainable Params | Peak Memory | Training Time | Perplexity ↓ | ROUGE-L ↑ |
|---|---|---|---|---|---|
| Full Fine-Tuning | 125M (100%) | 2.1 GB | 1.0× | 8.42 | 0.241 |
| LoRA (r=16, q+v proj) | 786K (0.63%) | 1.1 GB | 0.41× | 9.17 | 0.228 |
| QLoRA (r=64, 4-bit, all attn) | 3.7M (0.52%†) | 0.6 GB | 0.55× | 9.43 | 0.219 |

† Counted against quantized base. Full-precision equivalent is ~1.2M.

  • LoRA: 95% of full FT ROUGE-L, 0.63% of parameters, 2.3× less memory.
  • QLoRA: 91% of full FT ROUGE-L, 3.5× less memory — enabling larger models on the same GPU.
  • The perplexity gap measured at 100 steps narrows significantly with longer training; this is a cost benchmark, not a quality ceiling.

Results saved to results/comparison_<timestamp>.csv and results/comparison_<timestamp>.json.


Key Design Decisions

NF4 quantization: Pre-trained LLM weights are empirically normally distributed. NF4 allocates quantization levels to match the actual weight distribution — lower reconstruction error than INT4 at the same bit budget. Double quantization further quantizes the quantization constants, saving ~0.4 bits/param.
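
The quantile idea can be sketched in pure Python. This is an illustration of the principle only, not bitsandbytes' exact NF4 codebook (which is constructed somewhat differently and includes an exact zero level): levels placed at quantiles of N(0, 1) sit densest where normally distributed weights cluster, so reconstruction error beats a uniform 4-bit grid.

```python
# Illustrative sketch (not the bitsandbytes kernel): place the 16 quantization
# levels at quantiles of the standard normal, then compare reconstruction
# error against a uniform INT4-style grid on Gaussian "weights".
import random
from statistics import NormalDist

def nf4_like_levels(n=16):
    """Evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    nd = NormalDist()
    qs = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]  # avoid infinite edges
    m = max(abs(q) for q in qs)
    return [q / m for q in qs]

def quantize(w, levels):
    """Absmax-scale the tensor, snap each value to the nearest level."""
    scale = max(abs(x) for x in w) or 1.0
    return [min(levels, key=lambda l: abs(x / scale - l)) * scale for x in w]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
weights = [random.gauss(0, 0.02) for _ in range(4096)]  # typical LLM weight scale

quantile_q = quantize(weights, nf4_like_levels())
uniform_q = quantize(weights, [i / 7.5 - 1 for i in range(16)])  # uniform grid

print(mse(weights, quantile_q) < mse(weights, uniform_q))  # quantile grid wins
```

The uniform grid wastes levels in the tails where almost no weights live; the quantile grid spends them near zero where the mass is.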

Label masking on instruction tokens: Loss is computed only on response tokens (instruction tokens are set to -100). Computing loss on the full prompt conflates prompt memorization with response generation, and the difference is measurable in both perplexity and response quality.
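
The masking itself is a one-liner once the prompt length is known. A minimal sketch with toy token IDs (the repo's dataset.py derives the split point from the tokenized prompt length):

```python
IGNORE_INDEX = -100  # HuggingFace CrossEntropyLoss skips labels with this value

def mask_instruction_labels(input_ids, prompt_len):
    """Copy input_ids to labels, masking everything before the response."""
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels

# 5 instruction tokens followed by 3 response tokens (toy IDs)
ids = [101, 2023, 2003, 1037, 3160, 2019, 3437, 102]
labels = mask_instruction_labels(ids, prompt_len=5)
print(labels)  # [-100, -100, -100, -100, -100, 2019, 3437, 102]
```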

Paged AdamW: Offloads optimizer state pages to CPU during GPU memory pressure spikes. Modest gain on opt-125m; determines GPU fit for 7B+ models. Included because it scales.
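
The auto-selection mentioned in the features table can be sketched as below. The function name and structure are hypothetical; `"paged_adamw_8bit"` and `"adamw_torch"` are real `optim` values accepted by transformers' `TrainingArguments`:

```python
def select_optim(method: str, bnb_available: bool) -> str:
    """Prefer paged 8-bit AdamW for QLoRA when bitsandbytes is installed."""
    if method == "qlora" and bnb_available:
        return "paged_adamw_8bit"  # optimizer state pages via unified memory
    return "adamw_torch"           # standard PyTorch AdamW otherwise

print(select_optim("qlora", True))   # paged_adamw_8bit
print(select_optim("lora", True))    # adamw_torch
```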

Target modules: LoRA adapts [q_proj, v_proj]; QLoRA adapts all four attention projections. QLoRA compensates for the capacity lost to 4-bit quantization by targeting more layers, consistent with the original QLoRA paper.
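
A back-of-envelope trainable-parameter count: each adapted d_out × d_in projection adds B (d_out × r) and A (r × d_in). The numbers below assume opt-125m's published shape (hidden size 768, 12 decoder layers); reported totals such as the 786K in the benchmark table can differ depending on exactly which modules PEFT wraps.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA pair: B is d_out x r, A is r x d_in."""
    return r * (d_in + d_out)

hidden, layers, r = 768, 12, 16        # opt-125m shape, LoRA rank from configs/lora.yaml
per_proj = lora_params(hidden, hidden, r)
print(per_proj)                        # 24576 per adapted projection
print(layers * 2 * per_proj)           # q_proj + v_proj in every layer -> 589824
```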


ML Engineering Features

| Feature | Implementation |
|---|---|
| NF4 4-bit quantization | `BitsAndBytesConfig`, `bnb_4bit_quant_type="nf4"` |
| Double quantization | `bnb_4bit_use_double_quant=True` |
| Instruction label masking | Token-level `-100` labels — response tokens only |
| Gradient checkpointing | `enable_input_require_grads()` before PEFT wrapping |
| Paged AdamW | Auto-selected for QLoRA when bitsandbytes is available |
| Peak memory tracking | `MemoryTrackingCallback` — `torch.cuda.max_memory_allocated()` per step |
| Throughput logging | `ThroughputCallback` — tokens/sec via wall-clock |
| Adapter merge | `merge_and_unload()` → standard HuggingFace checkpoint, no PEFT at inference |
| Automated benchmark | `run_benchmark.py` trains all three, outputs comparison CSV |
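
Conceptually, the adapter merge folds the low-rank update into the frozen base weight, W' = W0 + (alpha / r) · B·A, which is why no PEFT dependency survives at inference. A dependency-free sketch with toy matrices (the real `merge_and_unload()` operates on torch tensors inside PEFT):

```python
def matmul(X, Y):
    """Plain-Python matrix multiply for the sketch."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def merge(W0, A, B, alpha, r):
    """Fold the LoRA update into the base weight: W' = W0 + (alpha / r) * B @ A."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W0, BA)]

W0 = [[1.0, 0.0], [0.0, 1.0]]   # 2x2 frozen base weight
B = [[1.0], [0.0]]              # 2x1 down-projection output
A = [[0.0, 2.0]]                # 1x2, rank r = 1
print(merge(W0, A, B, alpha=2, r=1))  # [[1.0, 4.0], [0.0, 1.0]]
```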

Quickstart

```bash
make install           # install dependencies
make train-lora        # LoRA on opt-125m (~5 min on CPU)
make benchmark         # all three methods, produces comparison CSV
```

For gated models:

```bash
cp .env.example .env   # add HF_TOKEN=hf_...
make train-lora MODEL=meta-llama/Llama-3.2-1B
```

Configuration

```bash
python -m src.pipelines.run_lora --config configs/lora.yaml --lora-r 32 --max-steps 500
python -m src.pipelines.run_qlora --model-name meta-llama/Llama-3.2-1B --max-steps 200
```

| Config | Method | Key Settings |
|---|---|---|
| `configs/full_finetune.yaml` | Full FT | all params, LR=2e-5, FP32 |
| `configs/lora.yaml` | LoRA | r=16, alpha=32, `[q_proj, v_proj]` |
| `configs/qlora.yaml` | QLoRA | r=64, alpha=16, 4-bit NF4, paged AdamW |

Model Export

```python
from src.export.merge import AdapterMerger

# Merge adapter weights into base — no PEFT dependency at inference
output_path = AdapterMerger.merge_lora_into_base(peft_model, output_dir="outputs/merged")

# GGUF conversion instructions for llama.cpp / Ollama
AdapterMerger.export_gguf_instructions(output_path)

# 8-bit INT8 for memory-efficient serving
AdapterMerger.quantize_to_8bit(model_dir="outputs/merged", output_dir="outputs/merged_8bit")
```

Project Structure

```text
qlora-trainer/
├── src/
│   ├── data/
│   │   ├── dataset.py      # Dolly-15k, Alpaca prompt format, label masking
│   │   └── collator.py     # DataCollatorForSeq2Seq, pad_to_multiple_of=8
│   ├── models/
│   │   ├── base.py         # ModelLoader: full / LoRA-ready / 4-bit QLoRA
│   │   ├── lora.py         # LoRAAdapter: apply + load PEFT
│   │   └── qlora.py        # QLoRAAdapter: quantized base + gradient checkpointing
│   ├── training/
│   │   ├── trainer.py      # MethodTrainer → TrainingResult
│   │   ├── callbacks.py    # MemoryTrackingCallback, ThroughputCallback
│   │   └── arguments.py    # TrainingArguments factory
│   ├── evaluation/
│   │   ├── benchmark.py    # Perplexity, ROUGE, latency, memory
│   │   └── comparison.py   # ComparisonBuilder → DataFrame + CSV/JSON
│   ├── export/merge.py     # merge_and_unload, 8-bit export, GGUF instructions
│   └── pipelines/          # run_full.py, run_lora.py, run_qlora.py, run_benchmark.py
├── configs/
├── tests/
└── results/
```

License

Apache 2.0. Dolly-15k: Apache 2.0. facebook/opt-125m: OPT Model License. meta-llama/Llama-3.2-1B: Meta Community License.
