GGUF by ggerganov · Pull Request #2398 · ggml-org/llama.cpp

ggerganov · 2023-07-26T08:19:08Z

ref:

This PR paves the way for integrating more models into llama.cpp. It changes the file format in which we convert the models by extending it with key-value pairs meta information. This meta data is flexible and allows to add specific information about the model being converted.

This is a breaking change, meaning that all existing ggml models will no longer be compatible after merging.
You should obtain the original model data (e.g. Meta PTH or Hugging Face, etc) and use the convert.py script to generate the new F16 or F32 .gguf models. From there, you can use all tools as usual: quantize, main, perplexity, etc.

Read more about the GGUF file format here: ggml-org/ggml#302

The PR also refactors some portions of llama.cpp and mainly extends ggml with a gguf API for loading and writing .gguf files. It incorporates LLaMA tokenizer fixes and adds (temporary?) BPE tokenizer support.

Huge thanks to all the contributors for writing the spec and helping with the implementation ❤️

Merge ETA: 21 Aug

Usage

# build the GGUF read/write example
make gguf

# write a dummy GGUF model to test.gguf
./gguf test.gguf w

# read the dummy GGUF model
./gguf test.gguf r

# LLaMA 1 PTH
python3 convert.py ../llama1/7B/ --outfile models/7B/ggml-model-f16.gguf

# LLaMA 2 PTH
python3 convert.py ../llama2/llama/llama-2-7b --outfile models/7B-v2/ggml-model-f16.gguf

# LLaMA 2 HF
python3 convert.py ~/development/huggingface/Llama-2-7b-hf/ --outfile models/7B-v2/ggml-model-f16.gguf

# vocab-only
python3 convert.py ../llama1/7B/ --outfile models/ggml-vocab-llama.gguf --vocab-only
python3 convert.py ~/development/huggingface/Llama-2-7b-hf/ --outfile models/ggml-vocab-llama.gguf --vocab-only

Convert GGML to GGUF (not guaranteed to work)

Implementation plan:

Contributions are welcome - just make the target branch this one.
Collaborators can push directly in the branch.

TODO

monatis · 2023-07-26T08:32:48Z

Great, better to have a single PR where every contributor can target.

I'll continue to work on this in the afternoon.

ggerganov · 2023-07-26T08:35:35Z

I'm working on the C implementation for reading. A python script for creating sample GGUF model files similar to the one you've started can be done in parallel. Also, directly modifying convert.py to output GGUF can also be done without conflict for now

ggerganov · 2023-07-26T19:45:49Z

Implemented an initial API for reading GGUF files. The new gguf example demonstrates writing simple model files. It also demonstrates different ways to read the data and load it into memory.

There were some minor deviations from the specification, but nothing is final yet. Will summarize in the end and update the spec respectively.

Tomorrow will continue with update of convert.py to write a GGUF file and try to use the new API to load the model in llama.cpp

Green-Sky · 2023-07-27T08:28:26Z

tip: port constants.py to C . Vulkan headers have done that, to drastically reduce the number of spelling mistakes in strings.

cztomsik · 2023-09-13T19:14:18Z

Sorry if this is not the right place to ask but why is whitespace getting escaped at all? The previous version was literally returning string slices right from the vocab and now we need to replace occurrences of ▁ (Lower One Eighth Block) in a newly allocated string.

Or it can write to a buffer, but still, every token which is generated, needs to be searched for substring, at every position, and then replaced.

Obviously it's not going to be a bottleneck but it seems wasteful and very odd. Can anybody explain what was the reasoning behind this decision?

ggerganov · 2023-09-13T19:20:31Z

@cztomsik Would be helpful to provide a specific example and what is the expectation.

cebtenzzre · 2023-09-13T19:26:58Z

why is whitespace getting escaped at all?

I don't have a solid understanding of sentencepiece, but I do know that the recent tokenizer changes started at #167 and made their way through a few PRs before landing here. These changes were made for more accurate tokenization compared to the official sentencepiece implementation.

cztomsik · 2023-09-13T20:11:12Z

@cztomsik Would be helpful to provide a specific example and what is the expectation.

Basically, I'm asking why can't we have a function which takes token and return a const string. Something like this:

(This actually works, and I'm doing the replace in JS, but I was curious why is it necessary to do the escaping in the first place, and why we can't have unescaped whitespace in the vocab)

One obvious fix would be to do the unescape when the vocab is loaded, and have two pointers for each entry (escaped + unescaped - if it's different), then you could avoid these replaceAll calls:

And not only it would be faster but the API would be easier to understand and to work with (now you need buffer, even if you're directly writing to a STDOUT for example.

BTW: I am not C/C++ dev so I might be missing something.

EDIT: My question is probably more related to this addition #2810

goerch · 2023-09-14T02:57:30Z

Obviously it's not going to be a bottleneck but it seems wasteful and very odd. Can anybody explain what was the reasoning behind this decision?

See #2310 and #3096. for example: we want to make the tokenizer compatible to LLaMa's sentencepiece tokenizer. I fear we are still missing Unicode normalization somehow. I'm sure you could optimize interfaces and implementations.

cztomsik · 2023-09-14T06:08:55Z

Ok, I see. But from my limited understanding, the only tricky part is encoding step, right? That's where different white-space combinations can yield different token ids.

When you get ids from the LLM and turn them into a string/byte chunks, every piece is fixed and known in advance, it can be invalid utf-8 but that should be so far the only problem, and therefore, token_to_s could be just look-up. At least that's what I assume from this code in the google sentencepiece library

Or, I mean, there's one another thing, a bit related to grammars, because if you are inside multi-byte you should not sample certain tokens at all, but that is related to sampling and not to id -> bytes conversion IMHO

BTW: I would be happy to help once you're done with your current changes and tests are passing.

ggerganov · 2023-09-14T17:35:00Z

@cztomsik

I just checked and the Python tokenizer works like this:

print(tokenizer.id_to_piece(15043))
_Hello

So I see your point - we should update llama_token_to_piece to match that behavior.
But we need to be careful to not break some existing logic.

Not sure when I'll get to this though - if you open a PR, I can help with review

goerch · 2023-09-14T17:46:16Z

So I see your point - we should update llama_token_to_piece to match that behavior.
But we need to be careful to not break some existing logic.

llama_token_to_piece is used in some of the examples instead of the newer llama_detokenize... AFAIU. Those should be adapted to use llama_detokenize... then? Do we have an model independent llama_detokenize yet?

ggerganov · 2023-09-14T18:13:42Z

Those should be adapted to use llama_detokenize... then?

Probably yes, but need a careful look

Do we have an model independent llama_detokenize yet?

No - it's a TODO:

https://github.com/ggerganov/llama.cpp/blob/e394084166baac09e8ee9a08a4686f907f7e5291/common/common.h#L148-L164

cztomsik · 2023-09-14T18:29:20Z

So I see your point - we should update llama_token_to_piece to match that behavior. But we need to be careful to not break some existing logic.

No, the only point was that there's a lot of strings getting allocated and destroyed if you're using the llama_token_to_piece from the C API.

In the end, I did it like it is shown below, it's only using one buffer, and it's still doing replacements for each generated token, but at least it's not creating and throwing away intermediate strings all the time.

I think I could even avoid that buffer entirely but I think it will be useful for rprompt (which I'm currently doing in the client code) so I'm fine with this and I don't need any changes, everything was possible with currently existing APIs.

So far my impl is only missing the stripping of the leading space, but I have that in my client code already since the pre-GGUF version.

goerch · 2023-09-14T18:35:58Z

@cztomsik : I think @ggerganov and I are talking about slowly converging to the LlaMa API for the tokenizer:

    def encode(self, s: str, bos: bool, eos: bool) -> List[int]:
        assert type(s) is str
        t = self.sp_model.encode(s)
        if bos:
            t = [self.bos_id] + t
        if eos:
            t = t + [self.eos_id]
        return t

    def decode(self, t: List[int]) -> str:
        return self.sp_model.decode(t)

ggerganov · 2023-09-14T18:47:42Z

No, the only point was that there's a lot of strings getting allocated and destroyed if you're using the llama_token_to_piece from the C API.

Yup, got it. We have many temporary strings because of the unescape stuff. If we change the API to return the "vanilla" piece (similar to what the OG tokenizer does), then we won't have to unescape and we won't have the allocations.

Alternatively, we can keep the current API and do what you suggested:

One obvious fix would be to do the unescape when the vocab is loaded, and have two pointers for each entry (escaped + unescaped - if it's different), then you could avoid these replaceAll calls:

Will think about which way is better

goerch · 2023-09-14T19:00:32Z

Just one remark: I believe the sentencepiece tokenizer doesn't respect invariants you'd expect (at least I did;).

#2310 (comment)
#2810 (comment)

@klosax

* gguf : first API pass * gguf : read header + meta data * gguf : read tensor info * gguf : initial model loading - not tested * gguf : add gguf_get_tensor_name() * gguf : do not support passing existing ggml_context to gguf_init * gguf : simplify gguf_get_val * gguf : gguf.c is now part of ggml.c * gguf : read / write sample models * gguf : add comments * refactor : reduce code duplication and better API (ggml-org#2415) * gguf : expose the gguf_type enum through the API for now * gguf : add array support * gguf.py : some code style changes * convert.py : start a new simplified implementation by removing old stuff * convert.py : remove GGML vocab + other obsolete stuff * GGUF : write tensor (ggml-org#2426) * WIP: Write tensor * GGUF : Support writing tensors in Python * refactor : rm unused import and upd todos * fix : fix errors upd writing example * rm example.gguf * gitignore *.gguf * undo formatting * gguf : add gguf_find_key (ggml-org#2438) * gguf.cpp : find key example * ggml.h : add gguf_find_key * ggml.c : add gguf_find_key * gguf : fix writing tensors * gguf : do not hardcode tensor names to read * gguf : write sample tensors to read * gguf : add tokenization constants * quick and dirty conversion example * gguf : fix writing gguf arrays * gguf : write tensors one by one and code reuse * gguf : fix writing gguf arrays * gguf : write tensors one by one * gguf : write tensors one by one * gguf : write tokenizer data * gguf : upd gguf conversion script * Update convert-llama-h5-to-gguf.py * gguf : handle already encoded string * ggml.h : get array str and f32 * ggml.c : get arr str and f32 * gguf.py : support any type * Update convert-llama-h5-to-gguf.py * gguf : fix set is not subscriptable * gguf : update convert-llama-h5-to-gguf.py * constants.py : add layer norm eps * gguf.py : add layer norm eps and merges * ggml.h : increase GGML_MAX_NAME to 64 * ggml.c : add gguf_get_arr_n * Update convert-llama-h5-to-gguf.py * add gptneox gguf example * Makefile : add gptneox gguf example * Update convert-llama-h5-to-gguf.py * add gptneox gguf example * Update convert-llama-h5-to-gguf.py * Update convert-gptneox-h5-to-gguf.py * Update convert-gptneox-h5-to-gguf.py * Update convert-llama-h5-to-gguf.py * gguf : support custom alignment value * gguf : fix typo in function call * gguf : mmap tensor data example * fix : update convert-llama-h5-to-gguf.py * Update convert-llama-h5-to-gguf.py * convert-gptneox-h5-to-gguf.py : Special tokens * gptneox-main.cpp : special tokens * Update gptneox-main.cpp * constants.py : special tokens * gguf.py : accumulate kv and tensor info data + special tokens * convert-gptneox-h5-to-gguf.py : accumulate kv and ti + special tokens * gguf : gguf counterpart of llama-util.h * gguf-util.h : update note * convert-llama-h5-to-gguf.py : accumulate kv / ti + special tokens * convert-llama-h5-to-gguf.py : special tokens * Delete gptneox-common.cpp * Delete gptneox-common.h * convert-gptneox-h5-to-gguf.py : gpt2bpe tokenizer * gptneox-main.cpp : gpt2 bpe tokenizer * gpt2 bpe tokenizer (handles merges and unicode) * Makefile : remove gptneox-common * gguf.py : bytesarray for gpt2bpe tokenizer * cmpnct_gpt2bpe.hpp : comments * gguf.py : use custom alignment if present * gguf : minor stuff * Update gptneox-main.cpp * map tensor names * convert-gptneox-h5-to-gguf.py : map tensor names * convert-llama-h5-to-gguf.py : map tensor names * gptneox-main.cpp : map tensor names * gguf : start implementing libllama in GGUF (WIP) * gguf : start implementing libllama in GGUF (WIP) * rm binary commited by mistake * upd .gitignore * gguf : calculate n_mult * gguf : inference with 7B model working (WIP) * gguf : rm deprecated function * gguf : start implementing gguf_file_saver (WIP) * gguf : start implementing gguf_file_saver (WIP) * gguf : start implementing gguf_file_saver (WIP) * gguf : add gguf_get_kv_type * gguf : add gguf_get_kv_type * gguf : write metadata in gguf_file_saver (WIP) * gguf : write metadata in gguf_file_saver (WIP) * gguf : write metadata in gguf_file_saver * gguf : rm references to old file formats * gguf : shorter name for member variable * gguf : rm redundant method * gguf : get rid of n_mult, read n_ff from file * Update gguf_tensor_map.py * Update gptneox-main.cpp * gguf : rm references to old file magics * gguf : start implementing quantization (WIP) * gguf : start implementing quantization (WIP) * gguf : start implementing quantization (WIP) * gguf : start implementing quantization (WIP) * gguf : start implementing quantization (WIP) * gguf : start implementing quantization (WIP) * gguf : quantization is working * gguf : roper closing of file * gguf.py : no need to convert tensors twice * convert-gptneox-h5-to-gguf.py : no need to convert tensors twice * convert-llama-h5-to-gguf.py : no need to convert tensors twice * convert-gptneox-h5-to-gguf.py : simplify nbytes * convert-llama-h5-to-gguf.py : simplify nbytes * gptneox-main.cpp : n_layer --> n_block * constants.py : n_layer --> n_block * gguf.py : n_layer --> n_block * convert-gptneox-h5-to-gguf.py : n_layer --> n_block * convert-llama-h5-to-gguf.py : n_layer --> n_block * gptneox-main.cpp : n_layer --> n_block * Update gguf_tensor_map.py * convert-gptneox-h5-to-gguf.py : load model in parts to save memory * convert-llama-h5-to-gguf.py : load model in parts to save memory * convert : write more metadata for LLaMA * convert : rm quantization version * convert-gptneox-h5-to-gguf.py : add file_type key * gptneox-main.cpp : add file_type key * fix conflicts * gguf : add todos and comments * convert-gptneox-h5-to-gguf.py : tensor name map changes * Create gguf_namemap.py : tensor name map changes * Delete gguf_tensor_map.py * gptneox-main.cpp : tensor name map changes * convert-llama-h5-to-gguf.py : fixes * gguf.py : dont add empty strings * simple : minor style changes * gguf : use UNIX line ending * Create convert-llama-7b-pth-to-gguf.py * llama : sync gguf-llama.cpp with latest llama.cpp (ggml-org#2608) * llama : sync gguf-llama.cpp with latest llama.cpp * minor : indentation + assert * llama : refactor gguf_buffer and gguf_ctx_buffer * llama : minor * gitignore : add gptneox-main * llama : tokenizer fixes (ggml-org#2549) * Merge tokenizer fixes into the gguf branch. * Add test vocabularies * convert : update convert-new.py with tokenizer fixes (ggml-org#2614) * Merge tokenizer fixes into the gguf branch. * Add test vocabularies * Adapt convert-new.py (and fix a clang-cl compiler error on windows) * llama : sync gguf-llama with llama (ggml-org#2613) * llama : sync gguf-llama with llama * tests : fix build + warnings (test-tokenizer-1 still fails) * tests : fix wstring_convert * convert : fix layer names * llama : sync gguf-llama.cpp * convert : update HF converter to new tokenizer voodoo magics * llama : update tokenizer style * convert-llama-h5-to-gguf.py : add token types * constants.py : add token types * gguf.py : add token types * convert-llama-7b-pth-to-gguf.py : add token types * gguf-llama.cpp : fix n_head_kv * convert-llama-h5-to-gguf.py : add 70b gqa support * gguf.py : add tensor data layout * convert-llama-h5-to-gguf.py : add tensor data layout * convert-llama-7b-pth-to-gguf.py : add tensor data layout * gptneox-main.cpp : add tensor data layout * convert-llama-h5-to-gguf.py : clarify the reverse permute * llama : refactor model loading code (ggml-org#2620) * llama : style formatting + remove helper methods * llama : fix quantization using gguf tool * llama : simplify gguf_file_saver * llama : fix method names * llama : simplify write_header() * llama : no need to pass full file loader to the file saver just gguf_ctx * llama : gguf_file_saver write I32 * llama : refactor tensor names (ggml-org#2622) * gguf: update tensor names searched in quantization * gguf : define tensor names as constants * gguf : initial write API (not tested yet) * gguf : write to file API (not tested) * gguf : initial write API ready + example * gguf : fix header write * gguf : fixes + simplify example + add ggml_nbytes_pad() * gguf : minor * llama : replace gguf_file_saver with new gguf write API * gguf : streaming support when writing files * gguf : remove oboslete write methods * gguf : remove obosolete gguf_get_arr_xxx API * llama : simplify gguf_file_loader * llama : move hparams and vocab from gguf_file_loader to llama_model_loader * llama : merge gguf-util.h in llama.cpp * llama : reorder definitions in .cpp to match .h * llama : minor simplifications * llama : refactor llama_model_loader (WIP) wip : remove ggml_ctx from llama_model_loader wip : merge gguf_file_loader in llama_model_loader * llama : fix shape prints * llama : fix Windows build + fix norm_rms_eps key * llama : throw error on missing KV paris in model meta data * llama : improve printing + log meta data * llama : switch print order of meta data --------- Co-authored-by: M. Yusuf Sarıgöz <[email protected]> * gguf : deduplicate (ggml-org#2629) * gguf : better type names * dedup : CPU + Metal is working * ggml : fix warnings about unused results * llama.cpp : fix line feed and compiler warning * llama : fix strncpy warning + note token_to_str does not write null * llama : restore the original load/save session implementation Will migrate this to GGUF in the future * convert-llama-h5-to-gguf.py : support alt ctx param name * ggml : assert when using ggml_mul with non-F32 src1 * examples : dedup simple --------- Co-authored-by: klosax <[email protected]> * gguf.py : merge all files in gguf.py * convert-new.py : pick ggml-org#2427 for HF 70B support * examples/gguf : no need to keep q option for quantization any more * llama.cpp : print actual model size * llama.cpp : use ggml_elements() * convert-new.py : output gguf (ggml-org#2635) * convert-new.py : output gguf (WIP) * convert-new.py : add gguf key-value pairs * llama : add hparams.ctx_train + no longer print ftype * convert-new.py : minor fixes * convert-new.py : vocab-only option should work now * llama : fix tokenizer to use llama_char_to_byte * tests : add new ggml-vocab-llama.gguf * convert-new.py : tensor name mapping * convert-new.py : add map for skipping tensor serialization * convert-new.py : convert script now works * gguf.py : pick some of the refactoring from ggml-org#2644 * convert-new.py : minor fixes * convert.py : update to support GGUF output * Revert "ci : disable CI temporary to not waste energy" This reverts commit 7e82d25. * convert.py : n_head_kv optional and .gguf file extension * convert.py : better always have n_head_kv and default it to n_head * llama : sync with recent PRs on master * editorconfig : ignore models folder ggml-ci * ci : update ".bin" to ".gguf" extension ggml-ci * llama : fix llama_model_loader memory leak * gptneox : move as a WIP example * llama : fix lambda capture ggml-ci * ggml : fix bug in gguf_set_kv ggml-ci * common.h : .bin --> .gguf * quantize-stats.cpp : .bin --> .gguf * convert.py : fix HF tensor permuting / unpacking ggml-ci * llama.cpp : typo * llama : throw error if gguf fails to init from file ggml-ci * llama : fix tensor name grepping during quantization ggml-ci * gguf.py : write tensors in a single pass (ggml-org#2644) * gguf : single pass for writing tensors + refactoring writer * gguf : single pass for writing tensors + refactoring writer * gguf : single pass for writing tensors + refactoring writer * gguf : style fixes in simple conversion script * gguf : refactor gptneox conversion script * gguf : rename h5 to hf (for HuggingFace) * gguf : refactor pth to gguf conversion script * gguf : rm file_type key and method * gguf.py : fix vertical alignment * gguf.py : indentation --------- Co-authored-by: Georgi Gerganov <[email protected]> * convert-gptneox-hf-to-gguf.py : fixes * gguf.py : gptneox mapping * convert-llama-hf-to-gguf.py : fixes * convert-llama-7b-pth-to-gguf.py : fixes * ggml.h : reverse GGUF_MAGIC * gguf.py : reverse GGUF_MAGIC * test-tokenizer-0.cpp : fix warning * llama.cpp : print kv general.name * llama.cpp : get special token kv and linefeed token id * llama : print number of tensors per type + print arch + style * tests : update vocab file with new magic * editorconfig : fix whitespaces * llama : re-order functions * llama : remove C++ API + reorganize common source in /common dir * llama : minor API updates * llama : avoid hardcoded special tokens * llama : fix MPI build ggml-ci * llama : introduce enum llama_vocab_type + remove hardcoded string constants * convert-falcon-hf-to-gguf.py : falcon HF --> gguf conversion, not tested * falcon-main.cpp : falcon inference example * convert-falcon-hf-to-gguf.py : remove extra kv * convert-gptneox-hf-to-gguf.py : remove extra kv * convert-llama-7b-pth-to-gguf.py : remove extra kv * convert-llama-hf-to-gguf.py : remove extra kv * gguf.py : fix for falcon 40b * falcon-main.cpp : fix for falcon 40b * convert-falcon-hf-to-gguf.py : update ref * convert-falcon-hf-to-gguf.py : add tensor data layout * cmpnct_gpt2bpe.hpp : fixes * falcon-main.cpp : fixes * gptneox-main.cpp : fixes * cmpnct_gpt2bpe.hpp : remove non-general stuff * Update examples/server/README.md Co-authored-by: slaren <[email protected]> * cmpnct_gpt2bpe.hpp : cleanup * convert-llama-hf-to-gguf.py : special tokens * convert-llama-7b-pth-to-gguf.py : special tokens * convert-permute-debug.py : permute debug print * convert-permute-debug-master.py : permute debug for master * convert-permute-debug.py : change permute type of attn_q * convert.py : 70b model working (change attn_q permute) * Delete convert-permute-debug-master.py * Delete convert-permute-debug.py * convert-llama-hf-to-gguf.py : fix attn_q permute * gguf.py : fix rope scale kv * convert-llama-hf-to-gguf.py : rope scale and added tokens * convert-llama-7b-pth-to-gguf.py : rope scale and added tokens * llama.cpp : use rope scale kv * convert-llama-7b-pth-to-gguf.py : rope scale fix * convert-llama-hf-to-gguf.py : rope scale fix * py : fix whitespace * gguf : add Python script to convert GGMLv3 LLaMA models to GGUF (ggml-org#2682) * First pass at converting GGMLv3 LLaMA models to GGUF * Cleanups, better output during conversion * Fix vocab space conversion logic * More vocab conversion fixes * Add description to converted GGUF files * Improve help text, expand warning * Allow specifying name and description for output GGUF * Allow overriding vocab and hyperparams from original model metadata * Use correct params override var name * Fix wrong type size for Q8_K Better handling of original style metadata * Set default value for gguf add_tensor raw_shape KW arg * llama : improve token type support (ggml-org#2668) * Merge tokenizer fixes into the gguf branch. * Add test vocabularies * Adapt convert-new.py (and fix a clang-cl compiler error on windows) * Improved tokenizer test But does it work on MacOS? * Improve token type support - Added @klosax code to convert.py - Improved token type support in vocabulary * Exclude platform dependent tests * More sentencepiece compatibility by eliminating magic numbers * Restored accidentally removed comment * llama : add API for token type ggml-ci * tests : use new tokenizer type API (ggml-org#2692) * Merge tokenizer fixes into the gguf branch. * Add test vocabularies * Adapt convert-new.py (and fix a clang-cl compiler error on windows) * Improved tokenizer test But does it work on MacOS? * Improve token type support - Added @klosax code to convert.py - Improved token type support in vocabulary * Exclude platform dependent tests * More sentencepiece compatibility by eliminating magic numbers * Restored accidentally removed comment * Improve commentary * Use token type API in test-tokenizer-1.cpp * py : cosmetics * readme : add notice about new file format ggml-ci --------- Co-authored-by: M. Yusuf Sarıgöz <[email protected]> Co-authored-by: klosax <[email protected]> Co-authored-by: goerch <[email protected]> Co-authored-by: slaren <[email protected]> Co-authored-by: Kerfuffle <[email protected]>

@klosax

* gguf : first API pass * gguf : read header + meta data * gguf : read tensor info * gguf : initial model loading - not tested * gguf : add gguf_get_tensor_name() * gguf : do not support passing existing ggml_context to gguf_init * gguf : simplify gguf_get_val * gguf : gguf.c is now part of ggml.c * gguf : read / write sample models * gguf : add comments * refactor : reduce code duplication and better API (ggml-org#2415) * gguf : expose the gguf_type enum through the API for now * gguf : add array support * gguf.py : some code style changes * convert.py : start a new simplified implementation by removing old stuff * convert.py : remove GGML vocab + other obsolete stuff * GGUF : write tensor (ggml-org#2426) * WIP: Write tensor * GGUF : Support writing tensors in Python * refactor : rm unused import and upd todos * fix : fix errors upd writing example * rm example.gguf * gitignore *.gguf * undo formatting * gguf : add gguf_find_key (ggml-org#2438) * gguf.cpp : find key example * ggml.h : add gguf_find_key * ggml.c : add gguf_find_key * gguf : fix writing tensors * gguf : do not hardcode tensor names to read * gguf : write sample tensors to read * gguf : add tokenization constants * quick and dirty conversion example * gguf : fix writing gguf arrays * gguf : write tensors one by one and code reuse * gguf : fix writing gguf arrays * gguf : write tensors one by one * gguf : write tensors one by one * gguf : write tokenizer data * gguf : upd gguf conversion script * Update convert-llama-h5-to-gguf.py * gguf : handle already encoded string * ggml.h : get array str and f32 * ggml.c : get arr str and f32 * gguf.py : support any type * Update convert-llama-h5-to-gguf.py * gguf : fix set is not subscriptable * gguf : update convert-llama-h5-to-gguf.py * constants.py : add layer norm eps * gguf.py : add layer norm eps and merges * ggml.h : increase GGML_MAX_NAME to 64 * ggml.c : add gguf_get_arr_n * Update convert-llama-h5-to-gguf.py * add gptneox gguf example * Makefile : add gptneox gguf example * Update convert-llama-h5-to-gguf.py * add gptneox gguf example * Update convert-llama-h5-to-gguf.py * Update convert-gptneox-h5-to-gguf.py * Update convert-gptneox-h5-to-gguf.py * Update convert-llama-h5-to-gguf.py * gguf : support custom alignment value * gguf : fix typo in function call * gguf : mmap tensor data example * fix : update convert-llama-h5-to-gguf.py * Update convert-llama-h5-to-gguf.py * convert-gptneox-h5-to-gguf.py : Special tokens * gptneox-main.cpp : special tokens * Update gptneox-main.cpp * constants.py : special tokens * gguf.py : accumulate kv and tensor info data + special tokens * convert-gptneox-h5-to-gguf.py : accumulate kv and ti + special tokens * gguf : gguf counterpart of llama-util.h * gguf-util.h : update note * convert-llama-h5-to-gguf.py : accumulate kv / ti + special tokens * convert-llama-h5-to-gguf.py : special tokens * Delete gptneox-common.cpp * Delete gptneox-common.h * convert-gptneox-h5-to-gguf.py : gpt2bpe tokenizer * gptneox-main.cpp : gpt2 bpe tokenizer * gpt2 bpe tokenizer (handles merges and unicode) * Makefile : remove gptneox-common * gguf.py : bytesarray for gpt2bpe tokenizer * cmpnct_gpt2bpe.hpp : comments * gguf.py : use custom alignment if present * gguf : minor stuff * Update gptneox-main.cpp * map tensor names * convert-gptneox-h5-to-gguf.py : map tensor names * convert-llama-h5-to-gguf.py : map tensor names * gptneox-main.cpp : map tensor names * gguf : start implementing libllama in GGUF (WIP) * gguf : start implementing libllama in GGUF (WIP) * rm binary commited by mistake * upd .gitignore * gguf : calculate n_mult * gguf : inference with 7B model working (WIP) * gguf : rm deprecated function * gguf : start implementing gguf_file_saver (WIP) * gguf : start implementing gguf_file_saver (WIP) * gguf : start implementing gguf_file_saver (WIP) * gguf : add gguf_get_kv_type * gguf : add gguf_get_kv_type * gguf : write metadata in gguf_file_saver (WIP) * gguf : write metadata in gguf_file_saver (WIP) * gguf : write metadata in gguf_file_saver * gguf : rm references to old file formats * gguf : shorter name for member variable * gguf : rm redundant method * gguf : get rid of n_mult, read n_ff from file * Update gguf_tensor_map.py * Update gptneox-main.cpp * gguf : rm references to old file magics * gguf : start implementing quantization (WIP) * gguf : start implementing quantization (WIP) * gguf : start implementing quantization (WIP) * gguf : start implementing quantization (WIP) * gguf : start implementing quantization (WIP) * gguf : start implementing quantization (WIP) * gguf : quantization is working * gguf : roper closing of file * gguf.py : no need to convert tensors twice * convert-gptneox-h5-to-gguf.py : no need to convert tensors twice * convert-llama-h5-to-gguf.py : no need to convert tensors twice * convert-gptneox-h5-to-gguf.py : simplify nbytes * convert-llama-h5-to-gguf.py : simplify nbytes * gptneox-main.cpp : n_layer --> n_block * constants.py : n_layer --> n_block * gguf.py : n_layer --> n_block * convert-gptneox-h5-to-gguf.py : n_layer --> n_block * convert-llama-h5-to-gguf.py : n_layer --> n_block * gptneox-main.cpp : n_layer --> n_block * Update gguf_tensor_map.py * convert-gptneox-h5-to-gguf.py : load model in parts to save memory * convert-llama-h5-to-gguf.py : load model in parts to save memory * convert : write more metadata for LLaMA * convert : rm quantization version * convert-gptneox-h5-to-gguf.py : add file_type key * gptneox-main.cpp : add file_type key * fix conflicts * gguf : add todos and comments * convert-gptneox-h5-to-gguf.py : tensor name map changes * Create gguf_namemap.py : tensor name map changes * Delete gguf_tensor_map.py * gptneox-main.cpp : tensor name map changes * convert-llama-h5-to-gguf.py : fixes * gguf.py : dont add empty strings * simple : minor style changes * gguf : use UNIX line ending * Create convert-llama-7b-pth-to-gguf.py * llama : sync gguf-llama.cpp with latest llama.cpp (ggml-org#2608) * llama : sync gguf-llama.cpp with latest llama.cpp * minor : indentation + assert * llama : refactor gguf_buffer and gguf_ctx_buffer * llama : minor * gitignore : add gptneox-main * llama : tokenizer fixes (ggml-org#2549) * Merge tokenizer fixes into the gguf branch. * Add test vocabularies * convert : update convert-new.py with tokenizer fixes (ggml-org#2614) * Merge tokenizer fixes into the gguf branch. * Add test vocabularies * Adapt convert-new.py (and fix a clang-cl compiler error on windows) * llama : sync gguf-llama with llama (ggml-org#2613) * llama : sync gguf-llama with llama * tests : fix build + warnings (test-tokenizer-1 still fails) * tests : fix wstring_convert * convert : fix layer names * llama : sync gguf-llama.cpp * convert : update HF converter to new tokenizer voodoo magics * llama : update tokenizer style * convert-llama-h5-to-gguf.py : add token types * constants.py : add token types * gguf.py : add token types * convert-llama-7b-pth-to-gguf.py : add token types * gguf-llama.cpp : fix n_head_kv * convert-llama-h5-to-gguf.py : add 70b gqa support * gguf.py : add tensor data layout * convert-llama-h5-to-gguf.py : add tensor data layout * convert-llama-7b-pth-to-gguf.py : add tensor data layout * gptneox-main.cpp : add tensor data layout * convert-llama-h5-to-gguf.py : clarify the reverse permute * llama : refactor model loading code (ggml-org#2620) * llama : style formatting + remove helper methods * llama : fix quantization using gguf tool * llama : simplify gguf_file_saver * llama : fix method names * llama : simplify write_header() * llama : no need to pass full file loader to the file saver just gguf_ctx * llama : gguf_file_saver write I32 * llama : refactor tensor names (ggml-org#2622) * gguf: update tensor names searched in quantization * gguf : define tensor names as constants * gguf : initial write API (not tested yet) * gguf : write to file API (not tested) * gguf : initial write API ready + example * gguf : fix header write * gguf : fixes + simplify example + add ggml_nbytes_pad() * gguf : minor * llama : replace gguf_file_saver with new gguf write API * gguf : streaming support when writing files * gguf : remove oboslete write methods * gguf : remove obosolete gguf_get_arr_xxx API * llama : simplify gguf_file_loader * llama : move hparams and vocab from gguf_file_loader to llama_model_loader * llama : merge gguf-util.h in llama.cpp * llama : reorder definitions in .cpp to match .h * llama : minor simplifications * llama : refactor llama_model_loader (WIP) wip : remove ggml_ctx from llama_model_loader wip : merge gguf_file_loader in llama_model_loader * llama : fix shape prints * llama : fix Windows build + fix norm_rms_eps key * llama : throw error on missing KV paris in model meta data * llama : improve printing + log meta data * llama : switch print order of meta data --------- Co-authored-by: M. Yusuf Sarıgöz <[email protected]> * gguf : deduplicate (ggml-org#2629) * gguf : better type names * dedup : CPU + Metal is working * ggml : fix warnings about unused results * llama.cpp : fix line feed and compiler warning * llama : fix strncpy warning + note token_to_str does not write null * llama : restore the original load/save session implementation Will migrate this to GGUF in the future * convert-llama-h5-to-gguf.py : support alt ctx param name * ggml : assert when using ggml_mul with non-F32 src1 * examples : dedup simple --------- Co-authored-by: klosax <[email protected]> * gguf.py : merge all files in gguf.py * convert-new.py : pick ggml-org#2427 for HF 70B support * examples/gguf : no need to keep q option for quantization any more * llama.cpp : print actual model size * llama.cpp : use ggml_elements() * convert-new.py : output gguf (ggml-org#2635) * convert-new.py : output gguf (WIP) * convert-new.py : add gguf key-value pairs * llama : add hparams.ctx_train + no longer print ftype * convert-new.py : minor fixes * convert-new.py : vocab-only option should work now * llama : fix tokenizer to use llama_char_to_byte * tests : add new ggml-vocab-llama.gguf * convert-new.py : tensor name mapping * convert-new.py : add map for skipping tensor serialization * convert-new.py : convert script now works * gguf.py : pick some of the refactoring from ggml-org#2644 * convert-new.py : minor fixes * convert.py : update to support GGUF output * Revert "ci : disable CI temporary to not waste energy" This reverts commit 7e82d25. * convert.py : n_head_kv optional and .gguf file extension * convert.py : better always have n_head_kv and default it to n_head * llama : sync with recent PRs on master * editorconfig : ignore models folder ggml-ci * ci : update ".bin" to ".gguf" extension ggml-ci * llama : fix llama_model_loader memory leak * gptneox : move as a WIP example * llama : fix lambda capture ggml-ci * ggml : fix bug in gguf_set_kv ggml-ci * common.h : .bin --> .gguf * quantize-stats.cpp : .bin --> .gguf * convert.py : fix HF tensor permuting / unpacking ggml-ci * llama.cpp : typo * llama : throw error if gguf fails to init from file ggml-ci * llama : fix tensor name grepping during quantization ggml-ci * gguf.py : write tensors in a single pass (ggml-org#2644) * gguf : single pass for writing tensors + refactoring writer * gguf : single pass for writing tensors + refactoring writer * gguf : single pass for writing tensors + refactoring writer * gguf : style fixes in simple conversion script * gguf : refactor gptneox conversion script * gguf : rename h5 to hf (for HuggingFace) * gguf : refactor pth to gguf conversion script * gguf : rm file_type key and method * gguf.py : fix vertical alignment * gguf.py : indentation --------- Co-authored-by: Georgi Gerganov <[email protected]> * convert-gptneox-hf-to-gguf.py : fixes * gguf.py : gptneox mapping * convert-llama-hf-to-gguf.py : fixes * convert-llama-7b-pth-to-gguf.py : fixes * ggml.h : reverse GGUF_MAGIC * gguf.py : reverse GGUF_MAGIC * test-tokenizer-0.cpp : fix warning * llama.cpp : print kv general.name * llama.cpp : get special token kv and linefeed token id * llama : print number of tensors per type + print arch + style * tests : update vocab file with new magic * editorconfig : fix whitespaces * llama : re-order functions * llama : remove C++ API + reorganize common source in /common dir * llama : minor API updates * llama : avoid hardcoded special tokens * llama : fix MPI build ggml-ci * llama : introduce enum llama_vocab_type + remove hardcoded string constants * convert-falcon-hf-to-gguf.py : falcon HF --> gguf conversion, not tested * falcon-main.cpp : falcon inference example * convert-falcon-hf-to-gguf.py : remove extra kv * convert-gptneox-hf-to-gguf.py : remove extra kv * convert-llama-7b-pth-to-gguf.py : remove extra kv * convert-llama-hf-to-gguf.py : remove extra kv * gguf.py : fix for falcon 40b * falcon-main.cpp : fix for falcon 40b * convert-falcon-hf-to-gguf.py : update ref * convert-falcon-hf-to-gguf.py : add tensor data layout * cmpnct_gpt2bpe.hpp : fixes * falcon-main.cpp : fixes * gptneox-main.cpp : fixes * cmpnct_gpt2bpe.hpp : remove non-general stuff * Update examples/server/README.md Co-authored-by: slaren <[email protected]> * cmpnct_gpt2bpe.hpp : cleanup * convert-llama-hf-to-gguf.py : special tokens * convert-llama-7b-pth-to-gguf.py : special tokens * convert-permute-debug.py : permute debug print * convert-permute-debug-master.py : permute debug for master * convert-permute-debug.py : change permute type of attn_q * convert.py : 70b model working (change attn_q permute) * Delete convert-permute-debug-master.py * Delete convert-permute-debug.py * convert-llama-hf-to-gguf.py : fix attn_q permute * gguf.py : fix rope scale kv * convert-llama-hf-to-gguf.py : rope scale and added tokens * convert-llama-7b-pth-to-gguf.py : rope scale and added tokens * llama.cpp : use rope scale kv * convert-llama-7b-pth-to-gguf.py : rope scale fix * convert-llama-hf-to-gguf.py : rope scale fix * py : fix whitespace * gguf : add Python script to convert GGMLv3 LLaMA models to GGUF (ggml-org#2682) * First pass at converting GGMLv3 LLaMA models to GGUF * Cleanups, better output during conversion * Fix vocab space conversion logic * More vocab conversion fixes * Add description to converted GGUF files * Improve help text, expand warning * Allow specifying name and description for output GGUF * Allow overriding vocab and hyperparams from original model metadata * Use correct params override var name * Fix wrong type size for Q8_K Better handling of original style metadata * Set default value for gguf add_tensor raw_shape KW arg * llama : improve token type support (ggml-org#2668) * Merge tokenizer fixes into the gguf branch. * Add test vocabularies * Adapt convert-new.py (and fix a clang-cl compiler error on windows) * Improved tokenizer test But does it work on MacOS? * Improve token type support - Added @klosax code to convert.py - Improved token type support in vocabulary * Exclude platform dependent tests * More sentencepiece compatibility by eliminating magic numbers * Restored accidentally removed comment * llama : add API for token type ggml-ci * tests : use new tokenizer type API (ggml-org#2692) * Merge tokenizer fixes into the gguf branch. * Add test vocabularies * Adapt convert-new.py (and fix a clang-cl compiler error on windows) * Improved tokenizer test But does it work on MacOS? * Improve token type support - Added @klosax code to convert.py - Improved token type support in vocabulary * Exclude platform dependent tests * More sentencepiece compatibility by eliminating magic numbers * Restored accidentally removed comment * Improve commentary * Use token type API in test-tokenizer-1.cpp * py : cosmetics * readme : add notice about new file format ggml-ci --------- Co-authored-by: M. Yusuf Sarıgöz <[email protected]> Co-authored-by: klosax <[email protected]> Co-authored-by: goerch <[email protected]> Co-authored-by: slaren <[email protected]> Co-authored-by: Kerfuffle <[email protected]>

ggerganov mentioned this pull request Jul 26, 2023

WIP: Implement GGUF #2397

Merged

ggerganov added high priority Very important issue refactoring Refactoring breaking change Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility. labels Jul 26, 2023

ggerganov added 6 commits July 26, 2023 18:21

gguf : first API pass

6873148

gguf : read header + meta data

8d6acfe

gguf : read tensor info

d91b985

gguf : initial model loading - not tested

78b226a

gguf : add gguf_get_tensor_name()

860c9c6

gguf : do not support passing existing ggml_context to gguf_init

cb871fa

ggerganov force-pushed the gguf branch from 480a067 to cb871fa Compare July 26, 2023 15:49

ggerganov added 2 commits July 26, 2023 18:53

gguf : simplify gguf_get_val

d313c0f

gguf : gguf.c is now part of ggml.c

e46870f

skirodev mentioned this pull request Jul 26, 2023

add Falcon 40B model support rustformers/llm#368

Merged

ggerganov force-pushed the gguf branch 3 times, most recently from ff3ded1 to 8b8ad7d Compare July 26, 2023 19:39

gguf : read / write sample models

5628ec7

ggerganov force-pushed the gguf branch from 8b8ad7d to 5628ec7 Compare July 26, 2023 19:40

ggerganov force-pushed the gguf branch 2 times, most recently from bae72cb to ef482eb Compare July 26, 2023 19:59

gguf : add comments

d8491fc

ggerganov force-pushed the gguf branch from ef482eb to d8491fc Compare July 26, 2023 20:00

ymcui mentioned this pull request Jul 27, 2023

Support ggmlv3 alexrozanski/LlamaChat#34

Open

monatis and others added 2 commits July 27, 2023 10:29

refactor : reduce code duplication and better API (#2415)

c85d317

gguf : expose the gguf_type enum through the API for now

d89533d

ikawrakow mentioned this pull request Aug 26, 2023

Fix HellaSwag #2805

Merged

jkleckner mentioned this pull request Aug 26, 2023

llamma.cpp breaking change deprecating GGML in favor of GGUF ollama/ollama#423

Closed

ochafik mentioned this pull request Aug 26, 2023

Update llama2.c converter to read vocab and write models in GGUF format #2751

Merged

0x766c6f6f6b mentioned this pull request Aug 30, 2023

Support the new format of llama.cpp "gguf"? zylon-ai/private-gpt#993

Closed

EwoutH mentioned this pull request Sep 8, 2023

Support GGUF rustformers/llm#365

Open

This was referenced Sep 8, 2023

[Feature][model] llama.cpp support new GGUF file format eosphoros-ai/DB-GPT#567

Closed

assertion error eosphoros-ai/DB-GPT#563

Closed

PenutChen mentioned this pull request Sep 22, 2023

使用ggmlv3 q6_K model, inference會掉字 MiuLab/Taiwan-LLM#30

Closed

phronmophobic mentioned this pull request Sep 24, 2023

Add support for gguf phronmophobic/llama.clj#8

Closed

NigelNelson mentioned this pull request Sep 28, 2023

Update Local Llama llama.cpp nvidia-holoscan/holohub#96

Merged

abc-nix mentioned this pull request Jan 15, 2024

gguf_init_from_file: invalid magic characters #3905

Closed

cebtenzzre mentioned this pull request Mar 17, 2024

convert : use f32 outtype for bf16 tensors #6106

Merged

jmartin-tech mentioned this pull request Apr 2, 2024

Convert GGML to expect GGUF format NVIDIA/garak#581

Merged

polarathene mentioned this pull request Jun 7, 2024

Refactor: GGUF metadata tokenizer EricLBuehler/mistral.rs#389

Merged

tomdol mentioned this pull request Mar 27, 2025

Misc. bug: Data check in examples/gguf #12617

Closed

Conversation

ggerganov commented Jul 26, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Usage

Convert GGML to GGUF (not guaranteed to work)

TODO

Uh oh!

monatis commented Jul 26, 2023

Uh oh!

ggerganov commented Jul 26, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ggerganov commented Jul 26, 2023

Uh oh!

Green-Sky commented Jul 27, 2023

Uh oh!

cztomsik commented Sep 13, 2023

Uh oh!

ggerganov commented Sep 13, 2023

Uh oh!

cebtenzzre commented Sep 13, 2023

Uh oh!

cztomsik commented Sep 13, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

goerch commented Sep 14, 2023

Uh oh!

cztomsik commented Sep 14, 2023

Uh oh!

ggerganov commented Sep 14, 2023

Uh oh!

goerch commented Sep 14, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ggerganov commented Sep 14, 2023

Uh oh!

cztomsik commented Sep 14, 2023

Uh oh!

goerch commented Sep 14, 2023

Uh oh!

ggerganov commented Sep 14, 2023

Uh oh!

goerch commented Sep 14, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

18 participants

ggerganov commented Jul 26, 2023 •

edited

Loading

ggerganov commented Jul 26, 2023 •

edited

Loading

cztomsik commented Sep 13, 2023 •

edited

Loading

goerch commented Sep 14, 2023 •

edited

Loading