Tags: olliewalsh/llama.cpp
ggml-cuda: ds_read_b128 for q4_0 and q4_1 mmq kernels (ggml-org#21168)

* ds_read_b128 for q4_0 and q4_1 mmq kernels

  The current for loop generates ds_read_b32 instructions with the HIP
  compiler; the new solution generates ds_read_b128 instructions for the
  same operation, saving some LDS bandwidth. Tested on MI50 and RX 6800 XT,
  it's faster on both.

* Vectorized LDS load update: used the ggml_cuda_get_max_cpy_bytes and
  ggml_cuda_memcpy_1 functions for a generic implementation
* Explicit for loop in mmq, renamed vec to tmp
* Fixed max_cpy usage in the loading loop
* Fixed typo in q4_1 kernel
* Update ggml/src/ggml-cuda/mmq.cuh

  Co-authored-by: Johannes Gäßler <[email protected]>

* Update ggml/src/ggml-cuda/mmq.cuh

  Co-authored-by: Johannes Gäßler <[email protected]>

* Update ggml/src/ggml-cuda/mmq.cuh

  Co-authored-by: Johannes Gäßler <[email protected]>

* Removed trailing white line 500
* Update mmq.cuh: removed other white lines
* Remove trailing whitespaces

---------

Co-authored-by: iacopPBK <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
Co-authored-by: iacopPBK <[email protected]>
CUDA: Do not mutate cgraph for fused ADDs (ggml-org#19566)

* Do not mutate cgraph for fused ADDs

  1. We should try to minimize in-place changes to the incoming
     ggml_cgraph where possible (those should happen in graph_optimize).
  2. Modifying in place leads to an additional, unnecessary graph capture
     step, as we store the properties before modifying the graph in place
     in the CUDA backend.

* Assert ggml_tensor is trivially copyable
* Update ggml/src/ggml-cuda/ggml-cuda.cu

  Co-authored-by: Aman Gupta <[email protected]>

---------

Co-authored-by: Aman Gupta <[email protected]>
scripts : add pr2wt.sh (ggml-org#18644)

* scripts : add pr2wt.sh
* script : shebang

  Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>