Notes on LLM Memory requirements

1) Model Weights:

If model weights loaded in 16-bit precision (FP16/BF16),
Formula: approx (Number of billion parameters * sizeof(fp16)) GB
Eg: Llama 3.2 1B: 1 * 2 bytes = 2 GB memory
So model weights need 2GB memory for the Llama 3.2 1B model

2) K/V Cache:

Size of K/V Cache per token in bytes = 2 * (num_layers) * (num_heads * dim_head) * precision_in_bytes

we know: sizeof(fp16) = 2 bytes

In my project, for Llama 3.2 1B (from model args in model.py):
16 decoder layers
num_kv_heads = 8
head_dim = d_model / n_heads = 2048 / 32 = 64
For Llama 3.2 1B, as GQA, it is '(num_kv_heads * head_dim)', and not '(num_heads * head_dim)'

Now, plugging everything in, this makes:
Size of K/V Cache per token in bytes = 2 * 16 * (8 * 64) * 2 bytes
= 32,768 B
= 33 KB
= 0.033 MB

So? 1 token tensor in the K/V Cache needs 0.033 MB

Total size of KV cache in bytes = (batch_size) * (sequence_length) * 2 * (num_layers) * (hidden_size) * precision_in_bytes

Eg: in my project, for Llama 3.2 1B, the method for K/V Cache storage is: K/V Cache pre-allocation
precision_in_bytes for FP16 = 2 bytes
from model args in the project code:

max_batch_size = 4
max_seq_len = 256
num_kv_heads * head_dim = 8 * 64 = 512
num_layers = 16

All of these tensor dimensions were pre-allocated for the K/V Cache tensor

Now, plugging all these values in,
Total size of K/V Cache in bytes = 4 * 256 * 2 * 16 * 512 * 2 bytes
= 33,554,432 bytes
= 33.5 MB
= 0.0335 GB
So? the total size of the K/V Cache tensor = 0.0335 GB for the Llama 3.2 1B model

Finally,
Total Memory = Model Weights + Total size of KV Cache = 2 GB + 0.0335 GB = 2.0335 GB
Llama 3.2 1B (with all assumptions according to my project) consumes: 2.0335 GB

Reasoning for GPU selection:

Whether it fits:
- P100 has 16 GB memory
- Looking at Total Memory (about 2 GB), the model fits on P100
P100 is a relatively smaller GPU, maybe more available compared to larger GPUs, like A40 (48 GB) or A100 (80 GB), as USC Research Labs must be using the larger ones

Sources I used in calculations:

USC EE508 Notes
Lei Gao, Kevin Yang, Chaoyi Jiang, USC EE508 Final Project: Efficient Processing of LLMs
Sebastian Raschka, Understanding and Coding the KV Cache in LLMs from Scratch: https://magazine.sebastianraschka.com/p/coding-the-kv-cache-in-llms
Nvidia Developer Technical Blog, Mastering LLM Techniques: Inference Optimization: https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
llama		llama
.gitignore		.gitignore
README.md		README.md
alpaca_data_200.json		alpaca_data_200.json
benchmark_inference.py		benchmark_inference.py
finetuning.py		finetuning.py
inference.py		inference.py
model_args.json		model_args.json
phase_1.pdf		phase_1.pdf
phase_2.pdf		phase_2.pdf
phase_3.pdf		phase_3.pdf
post-finetuning_inference.py		post-finetuning_inference.py
requirements.txt		requirements.txt
setup.sh		setup.sh
system_perf.pdf		system_perf.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Notes on LLM Memory requirements

1) Model Weights:

2) K/V Cache:

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Notes on LLM Memory requirements

1) Model Weights:

2) K/V Cache:

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages