Tags: qades/llama.cpp
Add missing GEMMA4V projector type support in mtmd/clip.cpp
server: fix Host header (ggml-org#20843)

The Host header should include the port when it is not the default.
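The rule in this fix matches the usual HTTP convention: the Host header carries the port only when it differs from the scheme's default. A minimal sketch of that rule, not the server's actual code; `host_header` and `DEFAULT_PORTS` are hypothetical names chosen for illustration:

```python
# Illustrative only: hypothetical helper, not taken from llama.cpp.
DEFAULT_PORTS = {"http": 80, "https": 443}


def host_header(scheme: str, host: str, port: int) -> str:
    """Return the Host header value for a request to scheme://host:port."""
    if DEFAULT_PORTS.get(scheme) == port:
        # Default port for this scheme: the port is omitted.
        return host
    # Non-default port: it must be part of the header value.
    return f"{host}:{port}"


print(host_header("http", "localhost", 80))
print(host_header("http", "localhost", 8080))
```

With these inputs the first call yields `localhost` and the second yields `localhost:8080`.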
[SYCL] enhance UPSCALE to support all UT cases (ggml-org#20637)

* [SYCL] enhance UPSCALE to support more cases
* rm test case result of SYCL1
metal : add FA specialization for HSK = 320, HSV = 256 (ggml-org#20549)
ci : move self-hosted workflows to separate files (ggml-org#20540)
ci : try to optimize some jobs (ggml-org#20521)

* force arm version to test
* run on either x86 or arm if we can help it, this only works for runs without ccache
* readd other jobs
* remove ccache
hexagon: Q4_0 and MXFP4 repack fixes (ggml-org#20527)

* hexagon: fix tail corruption with row sizes not multiple of 256
* hexagon: use different stride for repacking partial blocks
* hex-mm: update repack and kernels to avoid shuffles for full 256-element blocks

  The previous commit changed the repacking to use even:odd (0:1,2:3,...) packing instead of the original (0:128,1:129,...) packing in order to fix tail corruption. Since the mm kernels already deal with partial tails, we can use even:odd packing only for the last block. This avoids the performance penalty of having to shuffle to zip the elements in the common case.

* hex-mm: update rmpy x8 for better optimizations
* hex-mm: tighten supported MUL_MAT checks to avoid spurious failures
* hex-mm: use vzero to init accumulators
* hex-mm: properly call partial rmpy_x8
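The two packing orders named in this commit can be compared by listing which element indices are paired. The sketch below is an illustration of the index patterns only, not the Hexagon repack code; it assumes each pair shares one byte as two 4-bit values, and the helper names (`split_pairs`, `even_odd_pairs`, `bytes_touched`) are hypothetical:

```python
BLOCK = 256  # elements per full block, as in the commit message


def split_pairs(n: int = BLOCK) -> list[tuple[int, int]]:
    """Original layout (0:128, 1:129, ...): element i pairs with i + n/2."""
    half = n // 2
    return [(i, i + half) for i in range(half)]


def even_odd_pairs(n: int = BLOCK) -> list[tuple[int, int]]:
    """Even:odd layout (0:1, 2:3, ...): element 2i pairs with 2i + 1."""
    return [(2 * i, 2 * i + 1) for i in range(n // 2)]


def bytes_touched(pairs: list[tuple[int, int]], valid: int) -> int:
    """Count pairs (bytes) holding at least one of the first `valid` elements."""
    return sum(1 for a, b in pairs if a < valid or b < valid)


print(split_pairs()[:2], even_odd_pairs()[:2])
print(bytes_touched(split_pairs(), 64), bytes_touched(even_odd_pairs(), 64))
```

For a partial block with 64 valid elements, the split layout spreads them across 64 half-filled bytes, while the even:odd layout keeps them in 32 contiguous, fully used bytes.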
ggml : add native AVX512-FP16 support for F16 operations (ggml-org#20529)

The overall benchmark speed remains almost the same because the CPU is now calculating faster than the RAM can deliver the data (see the perf stat results below, showing 2.7 billion fewer instructions).

Also note that this path will only be enabled for native builds or with custom flags.

now:
```
 Performance counter stats for 'build/bin/llama-bench -m Qwen3-0.6B-f16.gguf -p 512 -n 128':

        189,073.52 msec task-clock                #   14.658 CPUs utilized
               404      context-switches          #    2.137 /sec
                19      cpu-migrations            #    0.100 /sec
           372,390      page-faults               #    1.970 K/sec
   310,877,195,595      instructions              #    0.54  insn per cycle
   581,071,530,602      cycles                    #    3.073 GHz
    19,352,107,994      branches                  #  102.352 M/sec
        48,304,438      branch-misses             #    0.25% of all branches
    84,998,431,152      L1-dcache-loads           #  449.552 M/sec
    12,186,410,279      L1-dcache-load-misses     #   14.34% of all L1-dcache accesses

      12.899358742 seconds time elapsed

     187.823044000 seconds user
       1.253416000 seconds sys
```

before:
```
 Performance counter stats for 'build/bin/llama-bench -m Qwen3-0.6B-f16.gguf -p 512 -n 128':

        190,594.56 msec task-clock                #   14.652 CPUs utilized
               436      context-switches          #    2.288 /sec
                22      cpu-migrations            #    0.115 /sec
           372,782      page-faults               #    1.956 K/sec
   313,574,921,966      instructions              #    0.54  insn per cycle
   586,064,970,425      cycles                    #    3.075 GHz
    19,585,778,563      branches                  #  102.761 M/sec
        48,437,488      branch-misses             #    0.25% of all branches
    86,219,336,628      L1-dcache-loads           #  452.370 M/sec
    12,232,085,771      L1-dcache-load-misses     #   14.19% of all L1-dcache accesses

      13.007923164 seconds time elapsed

     189.395316000 seconds user
       1.202612000 seconds sys
```

Signed-off-by: Adrien Gallouët <[email protected]>
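The "2.7 billion fewer instructions" figure can be checked directly from the two perf stat runs above. The sketch below copies the counter values from the commit message; the `delta` helper is a hypothetical name used only for this check:

```python
# Counter values copied from the two perf stat runs in the commit message.
before = {
    "instructions": 313_574_921_966,
    "cycles": 586_064_970_425,
    "elapsed_s": 13.007923164,
}
now = {
    "instructions": 310_877_195_595,
    "cycles": 581_071_530_602,
    "elapsed_s": 12.899358742,
}


def delta(key: str) -> tuple[float, float]:
    """Return (absolute saving, saving as a percentage of the 'before' value)."""
    saved = before[key] - now[key]
    return saved, 100.0 * saved / before[key]


for key in before:
    saved, pct = delta(key)
    print(f"{key}: saved {saved:,} ({pct:.2f}%)")
```

The instruction count drops by 2,697,726,371, which is under 1% of the total, and elapsed time drops by about 0.11 seconds. Both are consistent with the commit's remark that overall benchmark speed is almost unchanged.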