
Tags: gelim/llama.cpp

b8583

llama-model-loader: print warning when using overrides with mmap (ggml-org#20978)

* llama-model-loader: use pinned memory for tensor overrides

* change to warning
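
A minimal sketch of the behavior the title implies, assuming a hypothetical warn_if_overrides_with_mmap() helper (this is not the actual llama-model-loader code):

```cpp
// Hypothetical illustration only: warn, rather than fail, when tensor
// overrides are combined with mmap.
#include <cstdio>

static void warn_if_overrides_with_mmap(bool use_mmap, bool has_overrides) {
    if (use_mmap && has_overrides) {
        // hypothetical message text, not the real warning string
        fprintf(stderr, "warning: tensor overrides used together with mmap\n");
    }
}

int main() {
    warn_if_overrides_with_mmap(true, true);
}
```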

b8581

server: wrap headers for mcp proxy (ggml-org#21072)

* server: wrap headers for mcp proxy

* Update tools/server/server-cors-proxy.h

Co-authored-by: Georgi Gerganov <[email protected]>

* fix build

* chore: update webui build output

* chore: update webui build output

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Aleksander Grygier <[email protected]>

b8580

add missing ROPE_FACTORS_LONG/SHORT for MiniCPM (ggml-org#21150)

b8579

Optimize MOE GEMV kernel for BS > 1. (ggml-org#20905)

* Optimize MOE GEMV kernel for BS > 1.

The previous MOE kernel for BS > 1 launched too many thread blocks (nrows_x, nchannels_dst, ncols_dst), with very little work per block: a block of (32, 4) threads computed the inner dot product for a single row.

The new mul_mat_vec_q_moe kernel is dedicated to the MoE multi-token case, with grid (ceil(nrows_x/rpb), nchannels_dst) and block (warp_size, ncols_dst). Each warp handles two rows independently, using warp-level reduction only (no shared-memory sync); the launch geometry is sketched after this commit message.

This change doesn't increase compilation time, as only a single template instance is needed per type. It also simplifies the original GEMV kernel and removes the `is_multi_token_id` specialization.

* Remove em-dashes

* Cherry-pick changes from @am17an PR ggml-org#20885 to enable small_k optimization only for cases where it benefits

Increase max batch size for MMVQ kernels for MUL_MAT_ID to 8

* Make the max batch size for MOE GEMV kernel configurable based on GPU arch and datatype

---------

Co-authored-by: Aman Gupta <[email protected]>
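
A hedged sketch of the launch-geometry change described in this commit, in plain host-side C++ (illustrative arithmetic only; the shapes and the rpb value below are assumptions, not taken from the actual CUDA code):

```cpp
// Compare old vs. new block counts for the MoE BS > 1 path.
#include <cstdio>

int main() {
    // example shapes, not from the commit
    const long nrows_x       = 4096;
    const long nchannels_dst = 8;
    const long ncols_dst     = 4;   // batch size (BS > 1 path)
    const long rpb           = 2;   // rows per block: each warp handles 2 rows

    // old geometry: one block per (row, channel, column) triple
    const long old_blocks = nrows_x * nchannels_dst * ncols_dst;

    // new geometry: grid (ceil(nrows_x / rpb), nchannels_dst),
    //               block (warp_size, ncols_dst)
    const long new_blocks = ((nrows_x + rpb - 1) / rpb) * nchannels_dst;

    printf("blocks: %ld -> %ld (%.0fx fewer, more work per block)\n",
           old_blocks, new_blocks, (double) old_blocks / (double) new_blocks);
}
```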

b8578

hexagon: dma optimizations (mostly fixing regressions) (ggml-org#21137)

* hex-fa: add simple dma cache for Mask

I noticed that we were refetching the mask rows over and over.
This simple cache avoids that (a sketch follows this commit message).

* hex-dma: unset in-order desc bit which caused significant perf regression

We don't rely on true in-order processing of the DMA descriptors anywhere.
It turns out this mode caused a significant regression of around 3-4 TPS during token generation.

* hex-rope: update comment to clarify that we don't need in-order DMA completions
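
A minimal sketch of the mask-row cache idea from the first bullet, assuming a single-row staging buffer and using memcpy as a stand-in for the real DMA fetch (names and sizes are hypothetical):

```cpp
#include <cstdint>
#include <cstring>

struct mask_row_cache {
    int32_t cached_row = -1;   // row index currently held, -1 = empty
    uint8_t buf[1024];         // staging buffer (size is illustrative)

    const uint8_t * get(int32_t row, const uint8_t * src, size_t row_bytes) {
        if (row != cached_row) {
            // the real code would queue a DMA descriptor here instead
            memcpy(buf, src + (size_t) row * row_bytes, row_bytes);
            cached_row = row;
        }
        return buf;            // hit: reuse the row without refetching
    }
};

int main() {
    static uint8_t mask[4 * 64] = {0};
    mask_row_cache cache;
    cache.get(1, mask, 64);
    cache.get(1, mask, 64);    // second call is a cache hit: no copy
}
```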

b8576

[SYCL] Enhance build script to use half cores to build, avoid OS hang (ggml-org#21093)

* use half of the cores to build, to avoid hanging the OS

* reduce the output text count to shorten test time

* avoid returning 0
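
The actual change lives in a build script, but the job-count logic can be sketched in C++ (hardware_concurrency() may report 0, which is why the floor of 1 from the last bullet matters):

```cpp
#include <algorithm>
#include <cstdio>
#include <thread>

int main() {
    // use half the hardware threads for parallel build jobs
    unsigned jobs = std::thread::hardware_concurrency() / 2;
    jobs = std::max(jobs, 1u); // never 0, even on single-core or unknown HW
    printf("building with -j %u\n", jobs);
}
```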

b8575

fix **/x glob matching (ggml-org#21129)

b8574

common/parser: fix handling of tool definition with missing properties key (ggml-org#21128)
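
A hedged sketch of the guard this title implies, assuming the nlohmann::json library that llama.cpp vendors (the snippet is illustrative, not the actual parser code):

```cpp
#include <nlohmann/json.hpp>
#include <cstdio>

using json = nlohmann::json;

int main() {
    // a tool definition whose "parameters" object legally omits "properties"
    json tool = json::parse(R"({"name":"ping","parameters":{"type":"object"}})");
    const json & params = tool["parameters"];

    // look the key up defensively instead of indexing it unconditionally
    const json props = params.contains("properties") ? params["properties"]
                                                     : json::object();
    printf("tool has %zu properties\n", props.size());
}
```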

b8573

common : add character class support to glob_match (ggml-org#21111)

* add character class support to glob_match (sketched below)

* remove pointless reference
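
A minimal sketch of glob character-class matching with the usual [abc], [a-z], and [!...] semantics (match_class() is a hypothetical helper, not the real glob_match implementation):

```cpp
#include <cstdio>

// Returns true if c matches the class starting at pat[*i] == '[',
// advancing *i past the closing ']'.
static bool match_class(const char * pat, size_t * i, char c) {
    size_t j = *i + 1;                     // skip '['
    bool negate = pat[j] == '!';
    if (negate) j++;
    bool matched = false;
    for (; pat[j] && pat[j] != ']'; j++) {
        if (pat[j + 1] == '-' && pat[j + 2] && pat[j + 2] != ']') {
            if (c >= pat[j] && c <= pat[j + 2]) matched = true;
            j += 2;                        // consume the "x-y" range
        } else if (pat[j] == c) {
            matched = true;                // literal member of the class
        }
    }
    if (pat[j] == ']') j++;                // skip ']'
    *i = j;
    return matched != negate;
}

int main() {
    size_t i = 0;
    printf("%d\n", match_class("[a-c]", &i, 'b')); // prints 1
}
```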

b8571

common/json-schema: fix: handle non-capturing groups (?:...) in JSON schema pattern converter (ggml-org#21124)

The regex-to-grammar converter in _visit_pattern() crashes with SIGSEGV
when a JSON schema "pattern" field contains a non-capturing group (?:...).

Root cause: when the parser sees '(' followed by '?', it pushes a warning
but does not advance past '?:'. The recursive transform() call then
interprets '?' as a quantifier and calls seq.back() on an empty vector,
causing undefined behavior.

This commonly occurs when serving OpenAI-compatible tool calls from
clients that include complex regex patterns in their JSON schemas (e.g.,
date validation patterns like ^(?:(?:\d\d[2468][048]|...)-02-29|...)$).

The fix:
- Skip '?:' after '(' to treat non-capturing groups as regular groups
- For unsupported syntax (?=, ?!, etc.), skip to matching ')' safely,
  handling escaped characters to avoid miscounting parenthesis depth
- Adjust the ')' unbalanced-parentheses check using direct char
  comparisons instead of substr
- Add test cases for non-capturing groups (C++ only, as the JS/Python
  implementations do not yet support this syntax)
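
A hedged sketch of the two skip behaviors listed above (skip_group() is illustrative, not the actual _visit_pattern() code): '?:' is consumed so the group body is parsed like a plain group, while unsupported lookarounds are skipped to their matching ')', honoring backslash escapes so escaped parentheses don't corrupt the depth count.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// i points just past '('. Returns true if the group body should be parsed
// normally (plain or non-capturing); false if the group was skipped entirely.
static bool skip_group(const std::string & pat, size_t & i) {
    if (i < pat.size() && pat[i] == '?') {
        if (i + 1 < pat.size() && pat[i + 1] == ':') {
            i += 2;                    // non-capturing: treat as a plain group
            return true;
        }
        // unsupported (?=, ?!, ?<=, ...): skip to the matching ')'
        int depth = 1;
        while (i < pat.size() && depth > 0) {
            if (pat[i] == '\\' && i + 1 < pat.size()) {
                i += 2;                // escaped char never counts as a paren
                continue;
            }
            if (pat[i] == '(') depth++;
            if (pat[i] == ')') depth--;
            i++;
        }
        return false;
    }
    return true;                       // ordinary capturing group
}

int main() {
    std::string pat = "(?:abc)";
    size_t i = 1;                      // just past '('
    printf("%d %zu\n", skip_group(pat, i), i); // prints "1 3": parse "abc"
}
```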