
Tags: struct/llama.cpp


b8453


Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
vulkan: change gated_delta_net to shard a column across a subgroup (ggml-org#20662)

* vulkan: change gated_delta_net to shard a column across a subgroup

This is based on ggml-org#20391. I used an LLM to port the CUDA code to Vulkan,
and guided it to make the various fixes needed for Vulkan (e.g. handling different
subgroup sizes, the unknown mapping of subgroup to invocation id, optionally using
subgroupAdd, etc.).

This fixes a perf regression caused by transposing the values in memory (!20443).

* vulkan: Spread columns across fewer lanes to reduce the number of workgroups
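The sharding scheme described above can be sketched on the CPU side. In this hypothetical C++ model (the names are illustrative, not the actual shader code), each lane of a subgroup accumulates a strided slice of one column, and the per-lane partial sums are then combined in a final step, which is the role subgroupAdd plays in the shader:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: each "lane" of a subgroup accumulates a strided
// slice of one column, then the partial sums are combined, mimicking
// what subgroupAdd does on the GPU.
float column_sum_sharded(const std::vector<float>& column, std::size_t subgroup_size) {
    std::vector<float> partial(subgroup_size, 0.0f);
    // Each lane walks the column with stride == subgroup_size.
    for (std::size_t lane = 0; lane < subgroup_size; ++lane) {
        for (std::size_t i = lane; i < column.size(); i += subgroup_size) {
            partial[lane] += column[i];
        }
    }
    // Cross-lane reduction (the subgroupAdd step).
    float sum = 0.0f;
    for (float p : partial) sum += p;
    return sum;
}
```

Because each lane touches a disjoint strided slice, the result does not depend on the subgroup width, which is one reason the shader can tolerate different subgroup sizes.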

b8344

add op gated_delta_net (ggml-org#20455)

b8182

vendors : update miniaudio library to 0.11.24 (ggml-org#19914)

b8121

Improve CUDA graph capture (ggml-org#19754)

* Improve CUDA graph capture

Currently, CUDA graphs are eagerly enabled on the first call to ggml_backend_cuda_graph_compute. If the graph properties keep changing (4+ consecutive updates), the graph is permanently disabled. This is suboptimal because:

- The first call always incurs CUDA graph capture overhead even if the graph is unstable
- Once permanently disabled, CUDA graphs never re-enable even after the graph stabilizes (e.g., switching from prompt processing to decode)

The new approach delays CUDA graph activation until warmup completes: the same cgraph must be called at least twice with matching properties before CUDA graph capture begins. This avoids wasted capture overhead on volatile graphs and allows graphs to become eligible once they stabilize.
This also fixes issues such as ggml-org#19708.
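The warmup scheme described above can be sketched as a small state machine. This is a hypothetical C++ model of the heuristic, not the actual ggml-cuda code: a graph "key" must be seen on at least two consecutive calls before capture is enabled, and a mismatch restarts the warmup rather than disabling graphs permanently.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Illustrative model of the delayed-activation heuristic (names are
// hypothetical): capture is gated behind a two-call warmup per key.
struct GraphGate {
    uint64_t last_key    = 0;
    int      match_count = 0;

    // Returns true when CUDA graph capture should be active.
    bool should_capture(uint64_t key) {
        if (key == last_key && match_count > 0) {
            if (match_count < 2) match_count++;
        } else {
            last_key    = key; // new graph shape: restart warmup
            match_count = 1;
        }
        return match_count >= 2;
    }
};

// Demo helper: feed a sequence of graph keys, report the final state.
bool capture_after(std::initializer_list<uint64_t> keys) {
    GraphGate gate;
    bool on = false;
    for (uint64_t k : keys) on = gate.should_capture(k);
    return on;
}
```

Unlike the old eager scheme, a volatile stretch (e.g. prompt processing) only delays activation; once the same key repeats, capture becomes eligible again.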

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Johannes Gäßler <[email protected]>

* Remove EM dashes

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Aman Gupta <[email protected]>

---------

Co-authored-by: Johannes Gäßler <[email protected]>
Co-authored-by: Aman Gupta <[email protected]>

b8075

common : inline functions (ggml-org#18639)

b8055

convert : ensure all models handle new experts count (ggml-org#19621)

* ensure all models handle new experts count

* revert removal for PhiMoeModel, does not inherit from base

b7898

ggml-hexagon: flash-attention and reduce-sum optimizations (ggml-org#19141)

* wip

* ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation

* ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations

* wip

* ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance

* ggml-hexagon: refactor dot product functions to use a common loading function for improved readability

* optimize vector dot product functions to use unified reduction for improved performance

* hexagon: optimize reduce-sum for v75+

* hexagon: always keep row_sums in sf/fp32

* ggml-hexagon: enhance directory checks for HEXAGON_SDK_ROOT and HEXAGON_TOOLS_ROOT

* fix compile error after rebase
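The "unified reduction" idea behind several of the commits above can be illustrated on the CPU. This C++ sketch uses hypothetical names and a plain float array in place of HVX vector registers: partial sums are kept per lane inside the loop and reduced across lanes only once, at the end.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rough CPU-side sketch (not the actual HVX intrinsics): accumulate
// per-lane partial sums in the hot loop and do a single reduce-sum at
// the end, instead of reducing across lanes for every element.
float dot_unified_reduce(const std::vector<float>& a, const std::vector<float>& b) {
    constexpr std::size_t LANES = 4; // stands in for the HVX vector width
    float partial[LANES] = {0.0f};
    const std::size_t n = a.size();
    std::size_t i = 0;
    for (; i + LANES <= n; i += LANES) {
        for (std::size_t l = 0; l < LANES; ++l) {
            partial[l] += a[i + l] * b[i + l]; // per-lane multiply-accumulate
        }
    }
    for (; i < n; ++i) partial[0] += a[i] * b[i]; // scalar tail
    // Single reduce-sum at the end (the "unified reduction").
    float sum = 0.0f;
    for (std::size_t l = 0; l < LANES; ++l) sum += partial[l];
    return sum;
}
```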

---------

Co-authored-by: Max Krasnyansky <[email protected]>

b7695

scripts : follow api redirects in pr2wt.sh (ggml-org#18739)

b7677

vulkan: fix push constant size for quantize_q8_1 (ggml-org#18687)

I added an assert to catch further mismatches, and it found several.
Those are fixed here too.
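The kind of mismatch that assert guards against can be illustrated with a hypothetical C++ analogue (these structs and names are illustrative; the real check lives in the Vulkan backend): the size of the push-constant struct the host pushes must equal the size the pipeline was declared with.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical host-side push-constant struct for a quantize shader.
struct quantize_q8_1_push_constants {
    unsigned ne;        // element count (illustrative field)
    unsigned ne_padded; // padded count  (illustrative field)
};

// At pipeline-creation time, compare the declared push-constant size
// against sizeof() of the struct the host actually pushes.
constexpr bool push_constant_size_ok(std::size_t declared_size) {
    return declared_size == sizeof(quantize_q8_1_push_constants);
}

// A compile-time form of the same check.
static_assert(push_constant_size_ok(sizeof(quantize_q8_1_push_constants)),
              "push constant size mismatch");
```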

b7597

sync : ggml