Tags: struct/llama.cpp
Tags
vulkan: change gated_delta_net to shard a column across a subgroup (g… …gml-org#20662) * vulkan: change gated_delta_net to shard a column across a subgroup This is based on ggml-org#20391, I used an LLM to port the CUDA code to Vulkan, and guided to it to make various fixes to work with Vulkan (e.g. handling different subgroup sizes, unknown mapping of subgroup to invocation id, using subgroupAdd optionally, etc.). This fixes a perf regression from the transposing of the values in memory (!20443). * vulkan: Spread columns across fewer lanes to reduce the number of workgroups
vendors : update miniaudio library to 0.11.24 (ggml-org#19914)
Improve CUDA graph capture (ggml-org#19754) * Improve CUDA graph capture Currently, CUDA graphs are eagerly enabled on the first call to ggml_backend_cuda_graph_compute. If the graph properties keep changing (4+ consecutive updates), the graph is permanently disabled. This is suboptimal because: - The first call always incurs CUDA graph capture overhead even if the graph is unstable - Once permanently disabled, CUDA graphs never re-enable even after the graph stabilizes (e.g., switching from prompt processing to decode) The new approach delays CUDA graph activation until warmup completes: the same cgraph must be called at least twice with matching properties before CUDA graph capture begins. This avoids wasted capture overhead on volatile graphs and allows graphs to become eligible once they stabilize. This also fixes issues such as ggml-org#19708 * Update ggml/src/ggml-cuda/ggml-cuda.cu Co-authored-by: Johannes Gäßler <[email protected]> * Remove EM dashes * Update ggml/src/ggml-cuda/ggml-cuda.cu Co-authored-by: Aman Gupta <[email protected]> --------- Co-authored-by: Johannes Gäßler <[email protected]> Co-authored-by: Aman Gupta <[email protected]>
convert : ensure all models handle new experts count (ggml-org#19621) * ensure all models handle new experts count * revert removal for PhiMoeModel, does not inherit from base
ggml-hexagon: flash-attention and reduce-sum optimizations (ggml-org#… …19141) * wip * ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation * ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations * wip * ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance * ggml-hexagon: refactor dot product functions to use a common loading function for improved readability * optimize vector dot product functions to use unified reduction for improved performance * wip * ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation * ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations * wip * ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance * ggml-hexagon: refactor dot product functions to use a common loading function for improved readability * optimize vector dot product functions to use unified reduction for improved performance * hexagon: optimize reduce-sum for v75+ * hexagon: always keep row_sums in sf/fp32 * ggml-hexagon: enhance directory checks for HEXAGON_SDK_ROOT and HEXAGON_TOOLS_ROOT * fix compiling error after rebase --------- Co-authored-by: Max Krasnyansky <[email protected]>
vulkan: fix push constant size for quantize_q8_1 (ggml-org#18687) I added an assert to catch further mismatches, and it found several. Fix those, too.
PreviousNext