
Tags: bdx0/llama.cpp


b1617

server : recognize cache_prompt parameter in OAI API (ggml-org#4347)
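
For context, a hedged usage sketch (the server address, the curl invocation, and the chat payload are illustrative assumptions, not part of the commit): `cache_prompt` asks the server to reuse KV-cache state from a previous request that shares a prompt prefix, and this change lets OAI-style requests carry it too.

```c++
// Minimal sketch, assuming a llama.cpp server on localhost:8080 and curl
// on PATH (both are assumptions for illustration). A real client would use
// an HTTP library rather than shelling out.
#include <cstdlib>

int main() {
    // cache_prompt:true asks the server to reuse KV-cache state from a
    // previous request sharing a common prompt prefix.
    return std::system(
        "curl -s http://localhost:8080/v1/chat/completions "
        "-H 'Content-Type: application/json' "
        "-d '{"
        "\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}],"
        "\"cache_prompt\":true"
        "}'");
}
```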

b1428

batched-bench : print params at start

b1427

log : disable pid in log filenames

b1426

server : add parameter -tb N, --threads-batch N (ggml-org#3584) (ggml-org#3768)

Co-authored-by: Michael Coppola <[email protected]>
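
A hedged launch sketch (the binary name and model path below are placeholders, not from the commit): `-t` sets the threads used during token generation, while the new `-tb` / `--threads-batch` sets the threads used for batch and prompt processing, which can usefully be higher on many-core machines.

```c++
// Minimal sketch, assuming a built ./server binary and a local GGUF model
// (both paths are placeholder assumptions).
#include <cstdlib>

int main() {
    return std::system(
        "./server -m models/model.gguf "
        "-t 8 "     // threads for token generation
        "-tb 16");  // threads for batch (prompt) processing
}
```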

b1425

server : do not block system prompt update (ggml-org#3767)

* server : do not block system prompt update

* server : update state machine logic to process system prompts

* server : minor

b1424

sync : ggml (conv ops + cuda MSVC fixes) (ggml-org#3765)

ggml-ci

b1423

cmake : add missed dependencies (ggml-org#3763)

b1422

cuda : add batched cuBLAS GEMM for faster attention (ggml-org#3749)

* cmake : add helper for faster CUDA builds

* batched : add NGL arg

* ggml : skip nops in compute_forward

* cuda : minor indentation

* cuda : batched cuBLAS GEMMs for src0 F16 and src1 F32 (attention ops)

* Apply suggestions from code review

These changes plus:

```c++
#define cublasGemmBatchedEx hipblasGemmBatchedEx
```

are needed to compile with ROCm. I haven't done performance testing, but it seems to work.

I couldn't figure out how to propose a change for lines outside what the pull changed. This is also my first time creating a multi-part review, so please forgive me if I mess something up.

* cuda : add ROCm / hipBLAS cublasGemmBatchedEx define

* cuda : add cublasGemmStridedBatchedEx for non-broadcasted cases

* cuda : reduce mallocs in cublasGemmBatchedEx branch

* cuda : add TODO for calling cublas from kernel + using mem pool

---------

Co-authored-by: Kerfuffle <[email protected]>
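
To illustrate the kind of call this change introduces, here is a hedged sketch of a strided-batched GEMM through cuBLAS (simplified to F32 on both sides, whereas the commit targets F16 src0 with F32 src1; the function name is illustrative and error handling is elided):

```c++
// Hedged sketch, not the commit's code: one strided-batched GEMM call
// covering all attention heads at once. Assumptions: CUDA 11+ toolkit,
// A/B/C already in device memory.
#include <cublas_v2.h>

void batched_matmul(cublasHandle_t handle,
                    const float *A, const float *B, float *C,
                    int m, int n, int k, int batch) {
    const float alpha = 1.0f, beta = 0.0f;
    // cuBLAS is column-major: A is m x k (lda = m), B is k x n (ldb = k),
    // C is m x n (ldc = m); each batch element is one contiguous matrix.
    cublasGemmStridedBatchedEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
        m, n, k, &alpha,
        A, CUDA_R_32F, m, (long long) m * k,
        B, CUDA_R_32F, k, (long long) k * n,
        &beta,
        C, CUDA_R_32F, m, (long long) m * n,
        batch, CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
}
```

The strided variant applies when batch elements sit at a fixed stride (the non-broadcasted case the commit mentions); otherwise cublasGemmBatchedEx takes arrays of per-matrix pointers, which is where the commit's malloc reduction comes in.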

b1421

Add more tokenizer tests (ggml-org#3742)

* Add more tokenizer tests

* Add starcoder

* Update test vocab files

* Restrict bpe tokenizer tests to unicode planes

* Update comment

* Comment cosmetics

* Remove bloom vocab/test

b1420

metal : handle ggml_scale for n%4 != 0 (close ggml-org#3754)

ggml-ci
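
The fix concerns buffers whose length is not a multiple of the kernel's vector width of 4. A hedged plain-C++ analogue of the pattern (the function name and the scalar fallback are illustrative, not the Metal source):

```c++
// Minimal sketch of the pattern behind this fix, in plain C++ rather than
// Metal: scale the bulk of the buffer four floats at a time and finish the
// n % 4 tail with a scalar loop instead of assuming n is a multiple of 4.
#include <cstddef>

void scale_f32(float *x, float s, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {   // vector-width body (4 floats per step)
        x[i + 0] *= s;
        x[i + 1] *= s;
        x[i + 2] *= s;
        x[i + 3] *= s;
    }
    for (; i < n; ++i) {           // scalar tail for n % 4 != 0
        x[i] *= s;
    }
}
```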