Tags: bdx0/llama.cpp
server : add parameter -tb N, --threads-batch N (ggml-org#3584) (ggml-org#3768)
Co-authored-by: Michael Coppola <[email protected]>
server : do not block system prompt update (ggml-org#3767)
* server : do not block system prompt update
* server : update state machine logic to process system prompts
* server : minor
sync : ggml (conv ops + cuda MSVC fixes) (ggml-org#3765) ggml-ci
cuda : add batched cuBLAS GEMM for faster attention (ggml-org#3749)
* cmake : add helper for faster CUDA builds
* batched : add NGL arg
* ggml : skip nops in compute_forward
* cuda : minor indentation
* cuda : batched cuBLAS GEMMs for src0 F16 and src1 F32 (attention ops)
* Apply suggestions from code review

  These changes plus:
  ```c++
  #define cublasGemmBatchedEx hipblasGemmBatchedEx
  ```
  are needed to compile with ROCm. I haven't done performance testing, but it seems to work. I couldn't figure out how to propose a change for lines outside what the pull changed; also, this is the first time I've tried to create a multi-part review, so please forgive me if I mess something up.
* cuda : add ROCm / hipBLAS cublasGemmBatchedEx define
* cuda : add cublasGemmStridedBatchedEx for non-broadcasted cases
* cuda : reduce mallocs in cublasGemmBatchedEx branch
* cuda : add TODO for calling cublas from kernel + using mem pool

Co-authored-by: Kerfuffle <[email protected]>
Add more tokenizer tests (ggml-org#3742)
* Add more tokenizer tests
* Add starcoder
* Update test vocab files
* Restrict bpe tokenizer tests to unicode planes
* Update comment
* Comment cosmetics
* Remove bloom vocab/test
metal : handle ggml_scale for n%4 != 0 (close ggml-org#3754) ggml-ci