Releases · FellowTraveler/llama.cpp

17 Sep 21:19

d304f45

b6502 Latest

GGML WebGPU: Support for ADD, MUL, RMS_NORM, GET_ROWS operators (#16018)

* Add parameter buffer pool, batching of submissions, refactor command building/submission

* Add header for linux builds

* Free staged parameter buffers at once

* Format with clang-format

* Fix thread-safe implementation

* Use device implicit synchronization

* Update workflow to use custom release

* Remove testing branch workflow

* some f32 tests passing

* Disable set_rows until it's implemented

* f32 add all tests passing

* Begin work on set_rows

* Work on set rows

* Add error buffers for reporting unsupported SET_ROWS indices

* Remove extra comments

* Add templated addition, clean up code

* Get addition and multiplication working

* Implement rms_norm

* Add get_rows implementation

* Add new get_rows files

* Refactor use of wg size entry

* Fix compilation

* Try manually unrolled q4_0 quant

* Revert "Try manually unrolled q4_0 quant"

This reverts commit 77f8b96515f7e640ae4b0e44f066321fbc4a6166.

* Move to constant max wg size

* Check for tensor size in supports_op

* Vectorize f32 and change default workgroup size

* Move f32 get_rows from < 4 to % 4 != 0

* fix linter errors

* Add in-place tests

---------

Co-authored-by: Neha Abbas
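
For readers unfamiliar with two of the operators named in this release, the following is a minimal CPU reference sketch of what RMS_NORM and GET_ROWS conventionally compute. It is not the WebGPU shader code from this change; the function names, the `eps` default, and the list-of-lists layout are assumptions chosen for illustration.

```python
import math


def rms_norm(row, eps=1e-6):
    """Scale a row so its root-mean-square is about 1 (eps is an assumed default)."""
    mean_sq = sum(v * v for v in row) / len(row)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms for v in row]


def get_rows(table, indices):
    """Gather the rows of `table` selected by `indices`, in the given order."""
    return [list(table[i]) for i in indices]


if __name__ == "__main__":
    print(rms_norm([3.0, 4.0]))
    print(get_rows([[1, 2], [3, 4], [5, 6]], [2, 0]))
```

A reference like this is mainly useful for checking a backend implementation against known-good values on small inputs.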

Assets 15

cudart-llama-bin-win-cuda-12.4-x64.zip
sha256:8c79a9b226de4b3cacfd1f83d24f962d0773be79f1e7b75c6af4ded7e32ae1d6
373 MB 2025-09-17T21:19:21Z

llama-b6502-bin-macos-arm64.zip
sha256:e07e76a6739873acae73d954f6c2ef7d4a10dfb09712d867abc1582d39944ae8
10.2 MB 2025-09-17T21:19:33Z

llama-b6502-bin-macos-x64.zip
sha256:3e77eeb44a89058c93752767eddb40dd360711524ac262c58ed5c63e4fc27ae3
28.3 MB 2025-09-17T21:19:34Z

llama-b6502-bin-ubuntu-vulkan-x64.zip
sha256:794565eac228f355ea106f197c86335962ef1513cba884fca6bd6b9131f3537f
25.1 MB 2025-09-17T21:19:36Z

llama-b6502-bin-ubuntu-x64.zip
sha256:9114d28e999339f9f0c07976544528d1bc79fd2b7d95e99c1a21641e7306523d
12.2 MB 2025-09-17T21:19:38Z

llama-b6502-bin-win-cpu-arm64.zip
sha256:50392f5c69b787cd1104c632a325b70b249901fdd0e15888a2ed522e16cc6e6a
10.4 MB 2025-09-17T21:19:39Z

llama-b6502-bin-win-cpu-x64.zip
sha256:e3d95c5e433d6930dab3bac48000c2fc9fd7caabbc279b28722d4e16624da8f9
13.3 MB 2025-09-17T21:19:41Z

llama-b6502-bin-win-cuda-12.4-x64.zip
sha256:3e9c2081a145ca5801f271fc30f64c1f1f29d5fc063df2e90fec4e88ee9716eb
146 MB 2025-09-17T21:19:42Z

llama-b6502-bin-win-hip-radeon-x64.zip
sha256:724849639db721572721c0ca6245dd0fd5559feecbb3c6da6b014b01489ce54a
318 MB 2025-09-17T21:19:49Z

llama-b6502-bin-win-opencl-adreno-arm64.zip
sha256:2fb47b25dcad15207c9d468051701231594c33efbbd3837027aaa7be0d21dbfc
10.8 MB 2025-09-17T21:20:02Z

Source code (zip)
2025-09-17T20:09:40Z

Source code (tar.gz)
2025-09-17T20:09:40Z
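
Each binary asset above is listed with a `sha256:` digest. One way to check a downloaded archive against that digest is sketched below, using only Python's standard library. The helper names `sha256_of_file` and `matches_listed_digest` are hypothetical, and the local file path is whatever you saved the download as.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_digest(path, listed):
    """Compare a file against a digest written as 'sha256:<hex>' (prefix optional)."""
    expected = listed.strip().lower()
    if expected.startswith("sha256:"):
        expected = expected[len("sha256:"):]
    return sha256_of_file(path) == expected
```

For example, `matches_listed_digest("llama-b6502-bin-ubuntu-x64.zip", "sha256:9114d28e999339f9f0c07976544528d1bc79fd2b7d95e99c1a21641e7306523d")` should return `True` only if the local file is byte-identical to the published asset.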

19 Jan 02:39

github-actions

b4508

a1649cc

b4508

Adding linenoise.cpp to llama-run (#11252)

This is a fork of linenoise that is C++17 compatible. I intend to
add it to llama-run so we can do things like traverse prompt
history via the up and down arrows:

https://github.com/ericcurtin/linenoise.cpp

Signed-off-by: Eric Curtin

Assets 23

03 Jan 10:39

github-actions

b4409

e7da954

b4409

metal : avoid uint (#11019)

Assets 23

28 Sep 07:03

github-actions

b3830

b5de3b7

b3830

readme : update hot topics

Assets 22

24 Sep 07:24

github-actions

b3816

0aa1501

b3816

server : add newline after chat example (#9616)

Assets 22

23 Sep 08:02

github-actions

b3805

e62e978

b3805

Revert "[SYCL] fallback mmvq (#9088)" (#9579)

This reverts commit 50addec9a532a6518146ab837a85504850627316.

Assets 22

13 Sep 17:54

github-actions

b3751

feff4aa

b3751

server : add loading html page while model is loading (#9468)

* Adding loading page for '/' server requests

* set content when model is loading

* removed loading html file

* updated cmakelist

* updated makefile

* cleaned up whitespace

* cleanup for PR removed error

* updated server test to handle 503 HTML

* updated server test to handle 503 HTML

* catch 503 before parsing json

* revert test

* account for both api and web browser requests

* precommit corrections

* eol fix

* revert changes to pre-commit

* removed print statement

* made loading message more descriptive

* also support .html files

---------

Co-authored-by: VJHack
Co-authored-by: Vinesh Janarthanan
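
The commit notes above mention catching a 503 before parsing JSON, since the reply sent while the model is loading may be an HTML page rather than JSON. A minimal client-side sketch of that ordering follows. The `ModelLoading` exception and `parse_reply` helper are hypothetical names for illustration, not part of the server or its test suite.

```python
import json


class ModelLoading(Exception):
    """Signals a 503 reply, assumed here to mean the model is still loading."""


def parse_reply(status, body):
    """Check the status code first, and only parse the body as JSON afterwards."""
    if status == 503:
        # The body may be an HTML loading page, so do not hand it to json.loads.
        raise ModelLoading("server replied 503; body was not parsed as JSON")
    return json.loads(body)
```

A caller could catch `ModelLoading` and retry after a delay, while letting any other error propagate.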

Assets 19

07 Sep 02:46

github-actions

b3678

9b2c24c

b3678

server : simplify state machine for slot (#9283)

* server : simplify state machine for slot

* add SLOT_STATE_DONE_PROMPT

* pop_deferred_task

* add missing notify_one

* fix passkey test

* metrics : add n_busy_slots_per_decode

* fix test step

* add test

* maybe fix AddressSanitizer?

* fix deque ?

* missing lock

* pop_deferred_task: also notify

* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov

---------

Co-authored-by: Georgi Gerganov

Assets 19

25 Aug 04:16

github-actions

b3620

e11bd85

b3620

CPU/CUDA: Gemma 2 FlashAttention support (#8542)

* CPU/CUDA: Gemma 2 FlashAttention support

* apply logit_softcap to scale in kernel

* disable logit softcapping tests on Metal

* remove metal check
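
The note about applying `logit_softcap` to the scale can be read as folding the division by the cap into the attention scale, which is algebraically equivalent to dividing afterwards. The sketch below illustrates that identity on scalars, assuming the usual soft-capping form `cap * tanh(x / cap)`. It is not the kernel code, and the function names are hypothetical.

```python
import math


def softcap_after_scale(qk, scale, softcap):
    """Scale the raw score, then apply soft-capping with an explicit division."""
    return softcap * math.tanh((qk * scale) / softcap)


def softcap_folded_into_scale(qk, scale, softcap):
    """Pre-divide the scale by the cap so no separate division of the score is needed."""
    folded_scale = scale / softcap
    return softcap * math.tanh(qk * folded_scale)


if __name__ == "__main__":
    print(softcap_after_scale(12.0, 0.125, 50.0))
    print(softcap_folded_into_scale(12.0, 0.125, 50.0))
```

Both functions agree up to floating-point rounding; the folded form saves one division per score.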

Assets 19

23 Aug 21:56

github-actions

b3618

3ba780e

b3618

lora : fix llama conversion script with ROPE_FREQS (#9117)

Assets 19
