Tags: danbev/llama.cpp

b7938

vendor : try to suppress boring ssl warnings

b7731

ggml-metal: do not copy headers for embedded, use current binary dir for embedded (ggml-org#18705)

b6946

sampling : add support for GPU sampling (wip)

This is a work in progress to add support for GPU sampling.

The motivation for this feature is to allow some or all of the
sampling to be performed directly on the GPU, as part of the
computation graph being executed.

For example, the GPU sampler chain might select/sample a token
directly, in which case only the sampled token needs to be transferred
from device memory to host memory.

It is also possible for the GPU samplers to perform filtering of the
logits, or to compute and filter the probability distribution, in which
case only the filtered logits or probabilities need to be transferred
back to system memory for further processing by CPU samplers.

Currently, GPU sampling works in a similar manner to pooling: it is a
function that is called by build_graph:
```c++
    // add GPU sampling layers (if any)
    llm->build_sampling(*this, params);
```

GPU samplers can be configured by creating sampler chains, where each
sampler chain is associated with a specific sequence id:
```c++
    struct llama_sampler_chain_params params = llama_sampler_chain_default_params();
    struct llama_sampler * gpu_sampler_chain = llama_sampler_chain_init(params);
    llama_sampler_chain_add(gpu_sampler_chain, llama_sampler_gpu_init_greedy());
    std::vector<llama_sampler_seq_config> sampler_configs = {
        { 0, gpu_sampler_chain }
    };
```
The struct is defined as:
```c++
    struct llama_sampler_seq_config {
        llama_seq_id           seq_id;
        struct llama_sampler * sampler;
    };
```

These sampler configs are then passed as context params:
```c++
        llama_context_params cparams = llama_context_default_params();
        cparams.samplers = sampler_configs.data();
        cparams.n_samplers = sampler_configs.size();
```

When the graph is built, each configured sampler's _apply function is
called, which allows the samplers to add operations/nodes to the
computation graph.
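
As a rough sketch of the kind of node an _apply step might add, a
greedy GPU sampler could reduce the logits to a token id with a single
argmax node. This is an illustration only, not the actual
implementation; ggml_argmax is an existing ggml op:
```c++
#include "ggml.h"

// Hypothetical sketch: appending an argmax node means only the
// resulting token id, not the full logits tensor, has to be copied
// from device memory back to host memory.
static struct ggml_tensor * greedy_sampler_apply(
        struct ggml_context * ctx,
        struct ggml_tensor  * logits) { // logits: [n_vocab, n_outputs]
    // ggml_argmax reduces along the vocab dimension, producing one
    // token id per output row
    return ggml_argmax(ctx, logits);
}
```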

This enables sampling to happen fully or partially on the GPU. The
samplers could sample a single token, in which case only that token is
transferred from device memory to host memory after llama_decode has
been called. The sampled token can then be retrieved using:
```c++
    llama_token id = llama_get_sampled_token_ith(ctx, index);
```
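
For context, the end-to-end flow would presumably look like the
following sketch. The batch and decode calls are the standard llama.cpp
API; llama_get_sampled_token_ith is the accessor introduced here, and
its index semantics are assumed to mirror llama_get_logits_ith:
```c++
    // tokens: the prompt, tokenized beforehand
    std::vector<llama_token> tokens = { /* ... */ };

    // llama_decode executes the graph, including any GPU sampler nodes
    llama_batch batch = llama_batch_get_one(tokens.data(), (int32_t) tokens.size());
    if (llama_decode(ctx, batch) != 0) {
        fprintf(stderr, "llama_decode failed\n");
    }

    // only the sampled token id crosses from device to host memory
    llama_token id = llama_get_sampled_token_ith(ctx, 0); // index of the output to read
```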

It is also possible to run a GPU sampler that only filters the logits;
then only the filtered logits are transferred back to the host, and
sampling can proceed on the CPU with the normal (CPU) sampler chain. In
this case the CPU samplers are configured as usual, but they will now
operate on already filtered logits.
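
A sketch of this hybrid setup might look as follows. The
llama_sampler_gpu_init_top_k constructor is hypothetical, named by
analogy with llama_sampler_gpu_init_greedy above (the test list below
suggests a top-k GPU sampler exists); the CPU-side calls are the
standard llama.cpp sampler API:
```c++
    // GPU chain: filter the logits down to the top-k candidates on-device
    // (hypothetical constructor, named by analogy with the greedy one)
    struct llama_sampler * gpu_chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(gpu_chain, llama_sampler_gpu_init_top_k(40));

    // CPU chain: configured as usual, but it now operates on logits
    // that have already been filtered on the GPU
    struct llama_sampler * cpu_chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(cpu_chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));
```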

Similar to the above handling of logits, it is possible for a GPU
sampler to compute the full probability distribution and transfer that
to the host. The CPU samplers can then operate on those probabilities.
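
Assuming the GPU-computed values land in the context's usual output
buffer, CPU-side sampling would then go through the standard path, for
example (reusing the cpu_chain from the sketch above):
```c++
    // llama_sampler_sample reads the (already filtered/normalized)
    // values for output 0 from the context and runs the CPU chain on top
    llama_token id = llama_sampler_sample(cpu_chain, ctx, 0);
```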

Building and running the tests:

Download a model for testing:
```console
$ cd models && wget https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories15M-q4_0.gguf
```
Building the test:
```console
$ cmake --build build --target test-gpu-sampling -j8
```
Running all tests:
```console
$ env LLAMACPP_TEST_MODELFILE=../models/stories15M-q4_0.gguf \
    ctest --test-dir build -R '^test-gpu-sampling$' -V
```

The following individual tests are available:
```console
$ ctest --test-dir build -N -R test-gpu-sampling-
  Test 35: test-gpu-sampling-greedy
  Test 36: test-gpu-sampling-temp
  Test 37: test-gpu-sampling-softmax
  Test 38: test-gpu-sampling-top_k
  Test 39: test-gpu-sampling-top_p
  Test 40: test-gpu-sampling-mul_seq

Total Tests: 6
```
These can be run individually, for example:
```console
$ env LLAMACPP_TEST_MODELFILE=../models/stories15M-q4_0.gguf \
    ctest --test-dir build -R 'test-gpu-sampling-temp' -V
```

TODO:

- [ ] Allow GPU samplers to pre-allocate state tensors
- [ ] Integrate GPU samplers with llama-server
- [ ] Implement true top-p sampler on GPU
- [ ] Add missing GPU samplers (e.g. typical, mirostat, etc.)

b6883

llama : use std::abs instead of abs (ggml-org#16853)

b6708

ci : add option to disable CPU repack in test workflow

b6688

ci : add job for testing AMX

b6686

ggml : check src[1] does not have more than 2 dimensions

b6674

ci : change macos-13 to macos-15-intel

This commit updates the macos-13 runners to macos-15-intel.

The motivation for this change is that the macos-13 runners are
scheduled to be retired on 2025-12-04.

Refs: https://github.blog/changelog/2025-09-19-github-actions-macos-13-runner-image-is-closing-down/

b6672

switch to use ubuntu-22.04-arm [no ci]

b6440

ci : add caching for ROCm installation in release workflow

This commit applies to the release workflow the same ROCm installation
caching that already exists for the main CI workflow, introduced in
commit ff02caf ("ci : cache ROCm installation in
windows-latest-cmake-hip (ggml-org#15887)").