Status: Archived / Test Project
This repository is an archived test project. Active development has ceased completely.
Piramit is an experimental Large Language Model (LLM) inference engine written in Rust, leveraging Vulkan for GPU acceleration. The main goal is to load and run language models efficiently.
How the system broadly operates:
- It takes a standard model (e.g., in `safetensors` format) and converts it into a custom format called `.cmf`.
- It loads the converted `.cmf` file into GPU memory using Vulkan.
- It executes tensor operations (compute shaders) on the GPU to generate text from user prompts, and serves the results over HTTP via a web server (Axum).
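The convert → load → serve flow could be sketched as follows. This is a hypothetical, dependency-free outline; the function names are assumptions, and the real project performs the conversion with its `cmf-convert` tool and the upload via Vulkan, neither of which appears here.

```rust
/// Stage 1 (sketch): convert a standard model file to the custom .cmf format.
/// In the real project this is handled by the internal `cmf-convert` tool.
fn convert_to_cmf(source: &str) -> String {
    source.replace(".safetensors", ".cmf")
}

/// Stage 2 (sketch): load a .cmf payload into GPU memory.
/// Stubbed here; the real engine uses Vulkan buffers.
fn load_to_gpu(cmf_path: &str) -> Result<usize, String> {
    if cmf_path.ends_with(".cmf") {
        Ok(0) // stub: would return the number of bytes uploaded
    } else {
        Err(format!("not a .cmf file: {cmf_path}"))
    }
}

fn main() {
    let cmf = convert_to_cmf("llama.safetensors");
    load_to_gpu(&cmf).expect("GPU load failed");
    println!("loaded {cmf}"); // loaded llama.cmf
}
```

Stage 3 (serving generated text over HTTP via Axum) is omitted because it cannot be shown without the web framework.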
CMF (Custom/Compiled Model Format): A custom, single-file model format designed specifically for the Piramit project. It is generated from standard models using the project's internal `cmf-convert` tool.
- Purpose & Benefits: It applies mixed quantization (reducing weights to precisions such as Q4, Q6, Q8, and f16) so the model occupies significantly less memory. It also embeds the model configuration (`config.json`) and tokenizer (`tokenizer.json`) directly into a single binary file, allowing the system to read all required data from one payload at optimal speed and load it directly onto the GPU during execution.
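A single-file layout like the one described could look roughly like this in Rust. The actual on-disk `.cmf` format is internal to Piramit and not documented here; every type and field name below is an assumption for illustration only.

```rust
// Hypothetical sketch of a .cmf payload; all names are assumptions.

/// Mixed-quantization precisions mentioned in the format description.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Quant {
    Q4,
    Q6,
    Q8,
    F16,
}

/// Index entry for one quantized weight tensor inside the file.
#[derive(Debug)]
struct TensorEntry {
    name: String,
    quant: Quant,
    offset: u64, // byte offset of the tensor data within the file
    len: u64,    // length of the tensor data in bytes
}

/// One payload carries everything the runtime needs: the embedded
/// config.json, the embedded tokenizer.json, and the weight index.
#[derive(Debug)]
struct CmfFile {
    config_json: String,
    tokenizer_json: String,
    tensors: Vec<TensorEntry>,
}

impl CmfFile {
    /// Total bytes of quantized weight data indexed by this file.
    fn weight_bytes(&self) -> u64 {
        self.tensors.iter().map(|t| t.len).sum()
    }
}

fn main() {
    let cmf = CmfFile {
        config_json: r#"{"hidden_size": 4096}"#.to_string(),
        tokenizer_json: r#"{"version": "1.0"}"#.to_string(),
        tensors: vec![
            TensorEntry { name: "embed".into(), quant: Quant::Q4, offset: 0, len: 1024 },
            TensorEntry { name: "lm_head".into(), quant: Quant::Q8, offset: 1024, len: 2048 },
        ],
    };
    println!("{} tensors, {} weight bytes", cmf.tensors.len(), cmf.weight_bytes());
}
```

Keeping weights, config, and tokenizer in one payload is what lets the loader stream the file sequentially instead of opening three files.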
This section is derived from final developer notes indicating why development was halted.
- GPU Upload Performance: Generally working well.
- Caching Mechanism: Improved and working better than in previous iterations.
- Concurrency Failure: [CRITICAL] The project fails at 4 or more concurrent operations, producing random gibberish or garbage output.
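Gibberish under concurrency is a classic symptom of requests sharing mutable state, such as a scratch or output buffer, without per-request isolation. The following is a deterministic, hypothetical sketch of that failure mode, not Piramit's actual code:

```rust
// Hypothetical illustration: two "requests" write into one shared
// output buffer, so their decoded tokens interleave into gibberish.
fn interleave(req_a: &[&str], req_b: &[&str]) -> String {
    let mut shared = String::new(); // shared scratch buffer (the bug)
    for i in 0..req_a.len().max(req_b.len()) {
        if let Some(t) = req_a.get(i) { shared.push_str(t); }
        if let Some(t) = req_b.get(i) { shared.push_str(t); }
    }
    shared
}

fn main() {
    // Each request alone would decode cleanly...
    let a = ["Hello", " world"];
    let b = ["Bon", "jour"];
    // ...but with a shared buffer the reader sees mixed output.
    println!("{}", interleave(&a, &b)); // HelloBon worldjour
}
```

Whether this is the actual root cause in Piramit is unknown; the notes only record the symptom.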
Due to this critical concurrency bug and the project's broad scope, development has been paused indefinitely, and the repository is formally archived.