Tags: ochafik/llama.cpp

b8157

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
support permuted, remove check s0/s10 (ggml-org#19889)

Co-authored-by: Neo Zhang Jianyu <[email protected]>

b8022

hexagon: fix typo in vtcm_needs_release (ggml-org#19545)

b7587

docker : add CUDA 13.1 image build (ggml-org#18441)

* add updated cuda-new.Dockerfile for Ubuntu 24.04 compatibility

* add cuda13 build

b7540

ggml-cuda: fix regex for arch list (ggml-org#18371)

* ggml-cuda: fix regex for arch list

* make regex exact

b7482

llama : Changing off_t to size_t for Windows (ggml-org#18204)

b7404

preset: handle negated arg, reverse the meaning if needed (ggml-org#18041)

b7274

server: strip content-length header on proxy (ggml-org#17734)

b6925

server : support unified cache across slots (ggml-org#16736)

* server : support unified context across slots

* cont : fix speculative decoding initialization

* context : fix n_ctx_per_seq computation

* server : purge slots one by one

* tests : add unified cache server tests

* llama : update per-seq context computation

* test-thread-safety : handle tiny training context of the input model

* server : fix server_tokens clear()

* server : use 4 slots + unified KV by default

* llama : add note about context size queries

* cont : update todos [no ci]

* context : do not cap the size of the context

* tests : adjust parameters to be CI friendlier

* context : add warning

b6710

ggml webgpu: profiling, CI updates, reworking of command submission (ggml-org#16452)

* Add profiling

* More detailed profiling

* Rework command submission to avoid global locks

* Update wait handling

* try new method of waiting on futures

* Add serializing of command submission in some cases

* Add new pool for timestamp queries and clean up logging

* Serialize command submission in CI and leave a TODO note

* Update webgpu CI

* Add myself as WebGPU codeowner

* Deadlock avoidance

* Leave WebGPU/Vulkan CI serialized

* Fix divide by 0

* Fix logic in division by inflight_threads

* Update CODEOWNERS and remove serialize submit option

b6250

test-opt: allow slight imprecision (ggml-org#15503)