Tags · qades/llama.cpp

yuan3_0-b8690-a68fa82

Add missing GEMMA4V projector type support in mtmd/clip.cpp

Apr 6, 2026
a68fa82
zip
tar.gz

yuan3_0-b8551-9f40a3b

Merge branch 'master' into yuan3_0

Mar 26, 2026
9f40a3b
zip
tar.gz

yuan3_0-b8504-5760080

Merge branch 'master' into yuan3_0

Mar 23, 2026
5760080
zip
tar.gz

b8472

server: fix Host header (ggml-org#20843)

The Host header should include the port when it is not the default one.
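
The rule described above can be sketched as follows. This is an illustration only, not the server's actual code: the function name `host_header` and the `DEFAULT_PORTS` table are hypothetical, and the default ports (80 for HTTP, 443 for HTTPS) are the standard ones rather than anything stated in the commit.

```python
from typing import Optional

# Standard default ports per scheme (assumption: only http/https matter here).
DEFAULT_PORTS = {"http": 80, "https": 443}


def host_header(scheme: str, host: str, port: Optional[int] = None) -> str:
    """Build a Host header value, appending the port only when it is non-default.

    Hypothetical helper for illustration; IPv6 literals would additionally
    need brackets, which this sketch does not handle.
    """
    if port is None or port == DEFAULT_PORTS.get(scheme):
        return host
    return f"{host}:{port}"


if __name__ == "__main__":
    print(host_header("http", "localhost", 8080))
    print(host_header("https", "example.com", 443))
```

With these assumptions, a server on a non-default port such as 8080 gets `host:port`, while a server on the scheme's default port gets the bare host name.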

Mar 22, 2026
81bc4d3
zip
tar.gz

b8390

[SYCL] enhance UPSCALE to support all UT cases (ggml-org#20637)

* [SYCL] enhance UPSCALE to support more cases

* rm test case result of SYCL1

Mar 17, 2026
b6c83aa
zip
tar.gz

b8351

metal : add FA specialization for HSK = 320, HSV = 256 (ggml-org#20549)

Mar 14, 2026
b30a5fd
zip
tar.gz

b8350

ci : move self-hosted workflows to separate files (ggml-org#20540)

Mar 14, 2026
b476895
zip
tar.gz

b8348

ci : try to optimize some jobs (ggml-org#20521)

* force arm version to test

* run on either x86 or arm if we can help it; this only works for runs without ccache

* readd other jobs

* remove ccache

Mar 14, 2026
3a6f059
zip
tar.gz

b8347

hexagon: Q4_0 and MXFP4 repack fixes (ggml-org#20527)

* hexagon: fix tail corruption with row sizes that are not a multiple of 256

* hexagon: use different stride for repacking partial blocks

* hex-mm: update repack and kernels to avoid shuffles for full 256-element blocks

The previous commit changed the repacking to use even:odd (0:1,2:3,...) packing
instead of the original (0:128,1:129,...) packing in order to fix tail corruption.
Since the mm kernels already deal with partial tails, we can use even:odd
packing only for the last block.
This avoids the performance penalty of having to shuffle to zip the elements
in the common case.

* hex-mm: update rmpy x8 for better optimizations

* hex-mm: tighten supported MUL_MAT checks to avoid spurious failures

* hex-mm: use vzero to init accumulators

* hex-mm: properly call partial rmpy_x8
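
The two pairing orders mentioned in the repack notes above can be sketched as index pairs. This is a simplified illustration, not the Hexagon kernel code: the helper names are hypothetical, a tail is assumed to have an even number of elements, and the plan assumes even:odd pairing is applied only when the last block is partial.

```python
BLOCK = 256  # elements per full block, as stated in the commit message


def split_pairs(n=BLOCK):
    """Pairs (i, i + n/2): the '0:128, 1:129, ...' ordering for a full block."""
    half = n // 2
    return [(i, i + half) for i in range(half)]


def even_odd_pairs(n):
    """Pairs (2i, 2i + 1): the '0:1, 2:3, ...' ordering (n assumed even)."""
    return [(i, i + 1) for i in range(0, n, 2)]


def repack_plan(row_len):
    """Choose a pairing per block: split for full blocks, even:odd for a partial tail."""
    full, tail = divmod(row_len, BLOCK)
    plan = [("split", BLOCK)] * full
    if tail:
        plan.append(("even_odd", tail))
    return plan


if __name__ == "__main__":
    tail = 88  # example partial block
    # Split pairing reaches indices up to 255, beyond the 88 valid elements.
    print(max(b for _, b in split_pairs()) >= tail)
    # Even:odd pairing stays inside the valid range.
    print(max(b for _, b in even_odd_pairs(tail)) < tail)
    print(repack_plan(600))
```

Under these assumptions, split pairing on a partial block pairs valid elements with positions past the end of the row, which is one plausible reading of the tail corruption the commit describes. Even:odd pairing touches only valid elements.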

Mar 14, 2026
609ea50
zip
tar.gz

b8340

ggml : add native AVX512-FP16 support for F16 operations (ggml-org#20529)

The overall benchmark speed remains almost the same because the CPU is
now calculating faster than the RAM can deliver the data. (See perf stat
results below showing 2.7 billion fewer instructions).

Also note that this path will only be enabled for native builds or with
custom flags.

now:
```
 Performance counter stats for 'build/bin/llama-bench -m Qwen3-0.6B-f16.gguf -p 512 -n 128':

        189,073.52 msec task-clock                       #   14.658 CPUs utilized
               404      context-switches                 #    2.137 /sec
                19      cpu-migrations                   #    0.100 /sec
           372,390      page-faults                      #    1.970 K/sec
   310,877,195,595      instructions                     #    0.54  insn per cycle
   581,071,530,602      cycles                           #    3.073 GHz
    19,352,107,994      branches                         #  102.352 M/sec
        48,304,438      branch-misses                    #    0.25% of all branches
    84,998,431,152      L1-dcache-loads                  #  449.552 M/sec
    12,186,410,279      L1-dcache-load-misses            #   14.34% of all L1-dcache accesses

      12.899358742 seconds time elapsed

     187.823044000 seconds user
       1.253416000 seconds sys
```

before:
```
 Performance counter stats for 'build/bin/llama-bench -m Qwen3-0.6B-f16.gguf -p 512 -n 128':

        190,594.56 msec task-clock                       #   14.652 CPUs utilized
               436      context-switches                 #    2.288 /sec
                22      cpu-migrations                   #    0.115 /sec
           372,782      page-faults                      #    1.956 K/sec
   313,574,921,966      instructions                     #    0.54  insn per cycle
   586,064,970,425      cycles                           #    3.075 GHz
    19,585,778,563      branches                         #  102.761 M/sec
        48,437,488      branch-misses                    #    0.25% of all branches
    86,219,336,628      L1-dcache-loads                  #  452.370 M/sec
    12,232,085,771      L1-dcache-load-misses            #   14.19% of all L1-dcache accesses

      13.007923164 seconds time elapsed

     189.395316000 seconds user
       1.202612000 seconds sys
```
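
As a quick check of the "2.7 billion fewer instructions" figure, the deltas between the two runs can be computed directly from the counters above. The numbers are copied from the two `perf stat` outputs. The dictionary layout and the `delta` helper are only an illustrative sketch.

```python
# Counters copied from the "before" and "now" perf stat outputs.
before = {
    "instructions": 313_574_921_966,
    "cycles": 586_064_970_425,
    "elapsed_s": 13.007923164,
}
now = {
    "instructions": 310_877_195_595,
    "cycles": 581_071_530_602,
    "elapsed_s": 12.899358742,
}


def delta(key):
    """Return (absolute saving, saving as a percentage of the 'before' value)."""
    saved = before[key] - now[key]
    return saved, 100.0 * saved / before[key]


if __name__ == "__main__":
    for key in before:
        saved, pct = delta(key)
        print(f"{key}: saved {saved:,} ({pct:.2f}%)")
```

The instruction count drops by 2,697,726,371, which is under 1% of the total. That is consistent with the commit's remark that overall benchmark speed stays almost the same.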

Signed-off-by: Adrien Gallouët <[email protected]>

Mar 14, 2026
d0b79aa
zip
tar.gz
