
Tags: gelim/llama.cpp

b8583

llama-model-loader: print warning when using overrides with mmap (ggml-org#20978)

* llama-model-loader: use pinned memory for tensor overrides

* change to warning
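
A minimal sketch of the behavior the title implies, assuming a hypothetical warn_if_overrides_with_mmap() helper (this is not the actual llama-model-loader code):

```cpp
// Hypothetical illustration only: warn, rather than fail, when tensor
// overrides are combined with mmap.
#include <cstdio>

static void warn_if_overrides_with_mmap(bool use_mmap, bool has_overrides) {
    if (use_mmap && has_overrides) {
        // hypothetical message text, not the real warning string
        fprintf(stderr, "warning: tensor overrides used together with mmap\n");
    }
}

int main() {
    warn_if_overrides_with_mmap(true, true);
}
```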

b8581

server: wrap headers for mcp proxy (ggml-org#21072)

* server: wrap headers for mcp proxy

* Update tools/server/server-cors-proxy.h

Co-authored-by: Georgi Gerganov <[email protected]>

* fix build

* chore: update webui build output

* chore: update webui build output

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Aleksander Grygier <[email protected]>

b8580

add missing ROPE_FACTORS_LONG/SHORT for MiniCPM (ggml-org#21150)

b8579

Optimize MOE GEMV kernel for BS > 1. (ggml-org#20905)

* Optimize MOE GEMV kernel for BS > 1.

The previous MOE kernel for BS > 1 launched too many thread blocks (nrows_x, nchannels_dst, ncols_dst), with very little work per block: a block of (32, 4) threads computed the inner dot product for a single row.

The new mul_mat_vec_q_moe kernel is dedicated to the MoE multi-token case, with grid (ceil(nrows_x/rpb), nchannels_dst) and block (warp_size, ncols_dst). Each warp handles two rows independently, using warp-level reduction only (no shared-memory sync); the launch geometry is sketched after this commit message.

This change doesn't increase compilation time, as only a single template instance is needed per type. It also simplifies the original GEMV kernel and removes the `is_multi_token_id` specialization.

* Remove em-dashes

* Cherry-pick changes from @am17an PR ggml-org#20885 to enable small_k optimization only for cases where it benefits

Increase max batch size for MMVQ kernels for MUL_MAT_ID to 8

* Make the max batch size for MOE GEMV kernel configurable based on GPU arch and datatype

---------

Co-authored-by: Aman Gupta <[email protected]>
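
A hedged sketch of the launch-geometry change described in this commit, in plain host-side C++ (illustrative arithmetic only; the shapes and the rpb value below are assumptions, not taken from the actual CUDA code):

```cpp
// Compare old vs. new block counts for the MoE BS > 1 path.
#include <cstdio>

int main() {
    // example shapes, not from the commit
    const long nrows_x       = 4096;
    const long nchannels_dst = 8;
    const long ncols_dst     = 4;   // batch size (BS > 1 path)
    const long rpb           = 2;   // rows per block: each warp handles 2 rows

    // old geometry: one block per (row, channel, column) triple
    const long old_blocks = nrows_x * nchannels_dst * ncols_dst;

    // new geometry: grid (ceil(nrows_x / rpb), nchannels_dst),
    //               block (warp_size, ncols_dst)
    const long new_blocks = ((nrows_x + rpb - 1) / rpb) * nchannels_dst;

    printf("blocks: %ld -> %ld (%.0fx fewer, more work per block)\n",
           old_blocks, new_blocks, (double) old_blocks / (double) new_blocks);
}
```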

b8578

hexagon: dma optimizations (mostly fixing regressions) (ggml-org#21137)

* hex-fa: add simple dma cache for Mask

I noticed that we were refetching the mask rows over and over.
This simple cache avoids that (a sketch follows this commit message).

* hex-dma: unset in-order desc bit which caused significant perf regression

We don't rely on true in-order processing of the DMA descriptors anywhere.
It turns out this mode caused a significant regression of around 3-4 TPS during token generation.

* hex-rope: update comment to clarify that we don't need in-order DMA completions
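
A minimal sketch of the mask-row cache idea from the first bullet, assuming a single-row staging buffer and using memcpy as a stand-in for the real DMA fetch (names and sizes are hypothetical):

```cpp
#include <cstdint>
#include <cstring>

struct mask_row_cache {
    int32_t cached_row = -1;   // row index currently held, -1 = empty
    uint8_t buf[1024];         // staging buffer (size is illustrative)

    const uint8_t * get(int32_t row, const uint8_t * src, size_t row_bytes) {
        if (row != cached_row) {
            // the real code would queue a DMA descriptor here instead
            memcpy(buf, src + (size_t) row * row_bytes, row_bytes);
            cached_row = row;
        }
        return buf;            // hit: reuse the row without refetching
    }
};

int main() {
    static uint8_t mask[4 * 64] = {0};
    mask_row_cache cache;
    cache.get(1, mask, 64);
    cache.get(1, mask, 64);    // second call is a cache hit: no copy
}
```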

b8576

[SYCL] Enhance build script to use half cores to build, avoid OS hang (ggml-org#21093)

* use half of the cores to build, to avoid hanging the OS

* reduce the output text count to shorten test time

* avoid returning 0
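
The actual change lives in a build script, but the job-count logic can be sketched in C++ (hardware_concurrency() may report 0, which is why the floor of 1 from the last bullet matters):

```cpp
#include <algorithm>
#include <cstdio>
#include <thread>

int main() {
    // use half the hardware threads for parallel build jobs
    unsigned jobs = std::thread::hardware_concurrency() / 2;
    jobs = std::max(jobs, 1u); // never 0, even on single-core or unknown HW
    printf("building with -j %u\n", jobs);
}
```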

b8575

fix **/x glob matching (ggml-org#21129)

b8574

common/parser: fix handling of tool definition with missing properties key (ggml-org#21128)
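
A hedged sketch of the guard this title implies, assuming the nlohmann::json library that llama.cpp vendors (the snippet is illustrative, not the actual parser code):

```cpp
#include <nlohmann/json.hpp>
#include <cstdio>

using json = nlohmann::json;

int main() {
    // a tool definition whose "parameters" object legally omits "properties"
    json tool = json::parse(R"({"name":"ping","parameters":{"type":"object"}})");
    const json & params = tool["parameters"];

    // look the key up defensively instead of indexing it unconditionally
    const json props = params.contains("properties") ? params["properties"]
                                                     : json::object();
    printf("tool has %zu properties\n", props.size());
}
```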

b8573

common : add character class support to glob_match (ggml-org#21111)

* add character class support to glob_match (sketched below)

* remove pointless reference
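
A minimal sketch of glob character-class matching with the usual [abc], [a-z], and [!...] semantics (match_class() is a hypothetical helper, not the real glob_match implementation):

```cpp
#include <cstdio>

// Returns true if c matches the class starting at pat[*i] == '[',
// advancing *i past the closing ']'.
static bool match_class(const char * pat, size_t * i, char c) {
    size_t j = *i + 1;                     // skip '['
    bool negate = pat[j] == '!';
    if (negate) j++;
    bool matched = false;
    for (; pat[j] && pat[j] != ']'; j++) {
        if (pat[j + 1] == '-' && pat[j + 2] && pat[j + 2] != ']') {
            if (c >= pat[j] && c <= pat[j + 2]) matched = true;
            j += 2;                        // consume the "x-y" range
        } else if (pat[j] == c) {
            matched = true;                // literal member of the class
        }
    }
    if (pat[j] == ']') j++;                // skip ']'
    *i = j;
    return matched != negate;
}

int main() {
    size_t i = 0;
    printf("%d\n", match_class("[a-c]", &i, 'b')); // prints 1
}
```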

b8571

common/json-schema: fix: handle non-capturing groups (?:...) in JSON schema pattern converter (ggml-org#21124)

The regex-to-grammar converter in _visit_pattern() crashes with SIGSEGV
when a JSON schema "pattern" field contains a non-capturing group (?:...).

Root cause: when the parser sees '(' followed by '?', it pushes a warning
but does not advance past '?:'. The recursive transform() call then
interprets '?' as a quantifier and calls seq.back() on an empty vector,
causing undefined behavior.

This commonly occurs when serving OpenAI-compatible tool calls from
clients that include complex regex patterns in their JSON schemas (e.g.,
date validation patterns like ^(?:(?:\d\d[2468][048]|...)-02-29|...)$).

The fix:
- Skip '?:' after '(' to treat non-capturing groups as regular groups
- For unsupported syntax (?=, ?!, etc.), skip to matching ')' safely,
  handling escaped characters to avoid miscounting parenthesis depth
- Adjust the ')' unbalanced-parentheses check using direct char
  comparisons instead of substr
- Add test cases for non-capturing groups (C++ only, as the JS/Python
  implementations do not yet support this syntax)
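
A hedged sketch of the two skip behaviors listed above (skip_group() is illustrative, not the actual _visit_pattern() code): '?:' is consumed so the group body is parsed like a plain group, while unsupported lookarounds are skipped to their matching ')', honoring backslash escapes so escaped parentheses don't corrupt the depth count.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// i points just past '('. Returns true if the group body should be parsed
// normally (plain or non-capturing); false if the group was skipped entirely.
static bool skip_group(const std::string & pat, size_t & i) {
    if (i < pat.size() && pat[i] == '?') {
        if (i + 1 < pat.size() && pat[i + 1] == ':') {
            i += 2;                    // non-capturing: treat as a plain group
            return true;
        }
        // unsupported (?=, ?!, ?<=, ...): skip to the matching ')'
        int depth = 1;
        while (i < pat.size() && depth > 0) {
            if (pat[i] == '\\' && i + 1 < pat.size()) {
                i += 2;                // escaped char never counts as a paren
                continue;
            }
            if (pat[i] == '(') depth++;
            if (pat[i] == ')') depth--;
            i++;
        }
        return false;
    }
    return true;                       // ordinary capturing group
}

int main() {
    std::string pat = "(?:abc)";
    size_t i = 1;                      // just past '('
    printf("%d %zu\n", skip_group(pat, i), i); // prints "1 3": parse "abc"
}
```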