
Tags: lavluda/llama.cpp


b8184


Verified: This commit was created on GitHub.com and signed with GitHub’s verified signature.
vulkan: improve partial offloading performance on AMD (ggml-org#19976)

* vulkan: fix and enable cpy_tensor_async function

* use transfer_queue for async transfers on AMD, synchronize with timeline semaphore

* update offload_op logic

* fix missing transfer submission

* disable async transfer queue on AMD GCN

* revert op batch size change

* fix cpy_tensor_async checks
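The bullets above describe synchronizing async transfers with a timeline semaphore. A minimal Python model of the timeline-semaphore semantics — a monotonically increasing 64-bit counter that waiters block on until it reaches a target — follows; this is a conceptual sketch, not the Vulkan API, and the class name is hypothetical:

```python
import threading

class TimelineSemaphore:
    """Model of a Vulkan-style timeline semaphore: a monotonically
    increasing counter; waiters block until it reaches their target."""
    def __init__(self, initial=0):
        self._value = initial
        self._cond = threading.Condition()

    def signal(self, value):
        # Timeline values must only increase.
        with self._cond:
            if value < self._value:
                raise ValueError("timeline value may not decrease")
            self._value = value
            self._cond.notify_all()

    def wait(self, target, timeout=None):
        # Block until the counter reaches `target`; False on timeout.
        with self._cond:
            return self._cond.wait_for(lambda: self._value >= target, timeout)

# The transfer queue signals value N when the N-th async copy completes;
# the compute queue waits on that value before reading the tensor.
sem = TimelineSemaphore()
t = threading.Thread(target=lambda: sem.signal(1))
t.start()
ok = sem.wait(1, timeout=5.0)
t.join()
print(ok)  # True once the transfer has signalled
```

Unlike binary semaphores, a timeline value can be waited on by any number of consumers and signalled ahead of the wait, which is what makes it convenient for ordering a dedicated transfer queue against compute.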

b8183

cuda: cap grid.y at 65535 in non-contiguous dequantize/convert kernels (ggml-org#19999)
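The cap exists because CUDA limits `gridDim.y` (and `gridDim.z`) to 65535, so a kernel indexing more rows than that in y must clamp the launch and have each block stride-loop over the remainder. A sketch of that arithmetic (a Python model of the launch logic, not the actual kernel; function names are illustrative):

```python
MAX_GRID_Y = 65535  # CUDA hardware limit on gridDim.y (and gridDim.z)

def capped_grid_y(nrows):
    """Grid.y for a kernel processing `nrows` rows: clamped to the
    hardware limit, with blocks stride-looping over what remains."""
    return min(nrows, MAX_GRID_Y)

def rows_for_block(block_y, nrows, grid_y):
    """Rows visited by block `block_y` under a grid-stride loop in y."""
    return list(range(block_y, nrows, grid_y))

# With 100_000 rows the grid is capped and block 0 handles two rows:
grid = capped_grid_y(100_000)
print(grid)                              # 65535
print(rows_for_block(0, 100_000, grid))  # [0, 65535]
```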

b8182

vendors : update miniaudio library to 0.11.24 (ggml-org#19914)

b8181

vendor : update cpp-httplib to 0.35.0 (ggml-org#19969)

Signed-off-by: Adrien Gallouët <[email protected]>

b8180

tests : model metadata loading from huggingface (ggml-org#19796)

* Add model metadata loading from huggingface for use with other tests

* Add incremental chunking instead of full redownload, fix caching issue and add warning when it fails

* Add support for split models, load metadata from each individual split file, also avoid mmproj

* Code cleanup, revert incremental downloading

* Only compile when cpp-httplib has SSL support

* Fix formatting
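Reading metadata without a full model download is possible because GGUF places all metadata at the front of the file. A sketch of parsing the fixed GGUF header from a small prefix, per the published GGUF layout (the synthetic version and counts below are made up for illustration):

```python
import struct

def parse_gguf_header(buf):
    """Parse the fixed 24-byte GGUF header: 4-byte magic, u32 version,
    u64 tensor count, u64 metadata key/value count (little-endian)."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", buf, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "n_tensors": n_tensors, "n_kv": n_kv}

# In a test harness, only a prefix of the remote file needs fetching
# (e.g. via an HTTP Range request) to read this header and the
# key/value records that follow it.
hdr = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)  # synthetic example
print(parse_gguf_header(hdr))
```

For split models, each split file carries its own header, which matches the commit's note about loading metadata from each individual split.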

gguf-v0.18.0

Version 0.18.0 release

b8179

CUDA: add CDNA3 MFMA support for flash attention MMA kernel (ggml-org#19806)

* CUDA: add CDNA3 MFMA support for flash attention MMA kernel

Add MI300X (gfx942) MFMA tensor core flash attention using
v_mfma_f32_16x16x16_f16 (FP16 in, FP32 accumulate).

- Add FATTN_WARP_SIZE=64 for CDNA wavefront64
- Add CDNA config for head sizes 64, 80, 96, 112, 128
- Add FP16 MFMA intrinsic path in mma.cuh
- Add manual V transpose load for MFMA register layout
- Route CDNA to MMA for prompt processing, VEC for token generation
- Fix Q loading and combine stride granularity for non-power-of-2 heads

Benchmarks (Qwen2.5-1.5B Q4_K_M, MI300X):
  pp512  +7%,  pp1024 +13%,  pp2048 +23%,  pp4096 +39%
  tg128  -10% (FA overhead, VEC used for both)

All 2480 flash attention tests pass.

Ref: ggml-org#17917

* address review: replace FATTN_WARP_SIZE with constexpr, improve dispatch

- Replace #define FATTN_WARP_SIZE with constexpr int warp_size =
  ggml_cuda_get_physical_warp_size() in each device function
- Use ne[1]*gqa_ratio threshold for MMA vs tile dispatch. Benchmarked
  crossover on MI300X @ d32768 with power-of-2 GQA models:
    hsk=64  (Llama 1B, gqa=4): MMA wins at eff >= 128 (+11%)
    hsk=128 (Llama 3B, gqa=4): MMA wins at eff >= 128 (+4%)
  Unified threshold: eff_nq >= 128 for all head sizes.
- Remove VEC fallback; small batches fall through to tile kernel

* Update ggml/src/ggml-cuda/fattn.cu

* use ggml_cuda_info().devices warp_size instead of hardcoded check

---------

Co-authored-by: Johannes Gäßler <[email protected]>
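The dispatch rule benchmarked above — route to the MMA kernel when the effective batch `ne[1] * gqa_ratio` reaches the crossover of 128, otherwise fall through to the tile kernel — can be sketched as follows (illustrative function name and string labels; the real dispatch lives in `fattn.cu`):

```python
def choose_fattn_kernel(ne1, gqa_ratio, threshold=128):
    """Sketch of the dispatch: use the MFMA/MMA tensor-core kernel once
    the effective batch reaches the benchmarked crossover; below it,
    the tile kernel wins (the VEC fallback was removed)."""
    eff = ne1 * gqa_ratio
    return "mma" if eff >= threshold else "tile"

print(choose_fattn_kernel(512, 4))  # 'mma'  (prompt processing)
print(choose_fattn_kernel(1, 4))    # 'tile' (token generation)
```

Folding the GQA ratio into the threshold means a gqa=4 model switches to MMA at ne[1] >= 32, consistent with MMA handling prompt processing while small token-generation batches stay on the tile path.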

b8178

server: Add pragma once to server-context.h (ggml-org#19944)

b8177

server: Mirroring /v1/responses to /responses to match /v1/chat/completions pattern (ggml-org#19873)
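The mirroring pattern amounts to registering one handler under both the versioned and unversioned path, as the server already does for /v1/chat/completions. An illustrative stdlib sketch of that pattern (not the actual cpp-httplib route registration; names are hypothetical):

```python
# Hypothetical route table: one handler object, two paths.
routes = {}

def register_mirrored(path, handler):
    """Register `handler` at `path`, and mirror /v1/X to /X."""
    routes[path] = handler
    if path.startswith("/v1/"):
        routes[path[len("/v1"):]] = handler  # /v1/responses -> /responses

def handle_responses(body):
    return {"object": "response", "echo": body}

register_mirrored("/v1/responses", handle_responses)
print(sorted(routes))  # ['/responses', '/v1/responses']
print(routes["/responses"] is routes["/v1/responses"])  # True
```

Sharing the handler object (rather than duplicating it) keeps the two endpoints behaviorally identical by construction.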

b8175

ggml-cpu: add repack for mxfp4 (ggml-org#19738)
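Repacking rearranges quantized blocks for SIMD-friendly access without changing their numerical meaning. For MXFP4 — per the OCP Microscaling (MX) spec, blocks of 32 FP4 (E2M1) elements sharing one E8M0 power-of-two scale — the interpretation that any repack must preserve can be sketched as (a decoding model, not the ggml-cpu repack code; all names are illustrative):

```python
# Magnitudes for E2M1 codes 0..7; bit 3 of the nibble is the sign.
E2M1_TABLE = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_fp4(nibble):
    """Decode one 4-bit E2M1 value."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_TABLE[nibble & 0x7]

def decode_mxfp4_block(scale_e8m0, nibbles):
    """Decode a 32-element block: the shared E8M0 scale is 2**(e - 127)."""
    scale = 2.0 ** (scale_e8m0 - 127)
    return [scale * decode_fp4(n) for n in nibbles]

block = decode_mxfp4_block(127, [0x1, 0x9, 0x7] + [0x0] * 29)  # scale = 2**0
print(block[:3])  # [0.5, -0.5, 6.0]
```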