Tags: olliewalsh/llama.cpp
ggml-cuda: ds_read_b128 for q4_0 and q4_1 mmq kernels (ggml-org#21168)

* ds_read_b128 for q4_0 and q4_1 mmq kernels

  The current for loop generates ds_read_b32 instructions with the HIP
  compiler; the new solution generates ds_read_b128 instructions for the
  same operation, saving some LDS bandwidth. Tested on MI50 and RX 6800 XT,
  it's faster on both.

* Vectorized LDS load update: used the ggml_cuda_get_max_cpy_bytes and
  ggml_cuda_memcpy_1 functions for a generic implementation
* Explicit for loop in mmq, renamed vec to tmp
* Fixed max_cpy usage in the loading loop
* Fixed typo in q4_1 kernel
* Update ggml/src/ggml-cuda/mmq.cuh

  Co-authored-by: Johannes Gäßler <[email protected]>

* Update ggml/src/ggml-cuda/mmq.cuh

  Co-authored-by: Johannes Gäßler <[email protected]>

* Update ggml/src/ggml-cuda/mmq.cuh

  Co-authored-by: Johannes Gäßler <[email protected]>

* Removed trailing white line 500
* Update mmq.cuh: removed other white lines
* Remove trailing whitespaces

---------

Co-authored-by: iacopPBK <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
Co-authored-by: iacopPBK <[email protected]>
CUDA: Do not mutate cgraph for fused ADDs (ggml-org#19566)

* Do not mutate cgraph for fused ADDs

  1. We should try to minimize in-place changes to the incoming
     ggml_cgraph where possible (those should happen in graph_optimize).
  2. Modifying in place leads to an additional, unnecessary graph capture
     step, as we store the properties before modifying the graph in place
     in the CUDA backend.

* Assert ggml_tensor is trivially copyable
* Update ggml/src/ggml-cuda/ggml-cuda.cu

  Co-authored-by: Aman Gupta <[email protected]>

---------

Co-authored-by: Aman Gupta <[email protected]>
scripts : add pr2wt.sh (ggml-org#18644)

* scripts : add pr2wt.sh
* script : shebang

  Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>