Tags: tmc/llama.cpp
Tags
english : use `typos` to fix comments and logs (ggml-org#4354)
build : target Windows 8 for standard mingw-w64 (ggml-org#4405) * build : target Windows 8 for standard mingw-w64 * make : fix missing console.o deps This was causing a link error with `make all` on Windows.
llama : document logits_all deprecation (ggml-org#4418) llama_context_params.logits_all is a parameter for controlling llama_eval. This documents that logits_all should not be used with llama_decode and llama_batch.
grammar : revert the replacement of llama_token_to_piece with id_to_t… …oken (ggml-org#4396)
sync : ggml (new ops, tests, backend, etc.) (ggml-org#4359) * sync : ggml (part 1) * sync : ggml (part 2, CUDA) * sync : ggml (part 3, Metal) * ggml : build fixes ggml-ci * cuda : restore lost changes * cuda : restore lost changes (StableLM rope) * cmake : enable separable compilation for CUDA ggml-ci * ggml-cuda : remove device side dequantize * Revert "cmake : enable separable compilation for CUDA" This reverts commit 09e35d0. * cuda : remove assert for rope * tests : add test-backend-ops * ggml : fix bug in ggml_concat * ggml : restore `ggml_get_n_tasks()` logic in `ggml_graph_plan()` * ci : try to fix macOS * ggml-backend : remove backend self-registration * ci : disable Metal for macOS cmake build ggml-ci * metal : fix "supports family" call * metal : fix assert * metal : print resource path ggml-ci --------- Co-authored-by: slaren <[email protected]>
llama : per-layer KV cache + quantum K cache (ggml-org#4309) * per-layer KV * remove unnecessary copies * less code duplication, offload k and v separately * llama : offload KV cache per-layer * llama : offload K shift tensors * llama : offload for rest of the model arches * llama : enable offload debug temporarily * llama : keep the KV related layers on the device * llama : remove mirrors, perform Device -> Host when partial offload * common : add command-line arg to disable KV cache offloading * llama : update session save/load * llama : support quantum K cache (ggml-org#4312) * llama : support quantum K cache (wip) * metal : add F32 -> Q8_0 copy kernel * cuda : add F32 -> Q8_0 copy kernel ggml-ci * cuda : use mmv kernel for quantum cache ops * llama : pass KV cache type through API * llama : fix build ggml-ci * metal : add F32 -> Q4_0 copy kernel * metal : add F32 -> Q4_1 copy kernel * cuda : wip * cuda : add F32 -> Q4_0 and F32 -> Q4_1 copy kernels * llama-bench : support type_k/type_v * metal : use mm kernel only for quantum KV cache * cuda : add comment * llama : remove memory_f16 and kv_f16 flags --------- Co-authored-by: slaren <[email protected]> * readme : add API change notice --------- Co-authored-by: slaren <[email protected]>
train : fix ggml-org#4227 (double free in examples/train-text-from-sc… …ratch/train-text-from-scratch.cpp) (ggml-org#4351) On commit b1108 (44c117f) xaedes added ggml_allocr * alloc = NULL; ... (many lines in between) if (alloc) { ggml_allocr_free(alloc); } Which is correct, but it's easy to lose context after many lines in between. On commit b1287 (0e76a89) xaedes made a big change. From here on, alloc is freed eagerly. alloc = ggml_allocr_new(...) ... (short lines of code) ggml_allocr_free(alloc) This happens a few times, but alloc is never set to NULL, and many lines below, we still have if (alloc) { ggml_allocr_free(alloc); } which causes a double-free.
server : recognize cache_prompt parameter in OAI API (ggml-org#4347)
PreviousNext