Optimized AI models for the edge by Embedl.
The package can be installed with:

```bash
pip install embedl-models
```
Optimized versions of language models using FlashHead, Embedl’s efficient replacement for the traditional language-model head. FlashHead drastically reduces model size while preserving accuracy, enabling H200-level latency on consumer GPUs (RTX Ada generation).
This model is designed for low-latency inference on NVIDIA RTX GPUs, using a combination of:
- FlashHead
- Mixed-precision quantization (W4A16)
- Custom vLLM generation pipeline
FlashHead produces outputs that match the baseline models within rounding error on standard evaluations (MMLU-Pro, HellaSwag, GSM8K, etc.), while delivering H200-level latency on consumer-grade RTX Ada GPUs.
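For intuition on why the head is worth replacing: in a standard decoder-only model, the output head is a dense `hidden_size × vocab_size` projection, which dominates the parameter count of small models. The back-of-the-envelope sketch below uses Llama 3.2 1B's published dimensions as an illustrative example; it says nothing about how FlashHead itself is structured.

```python
# Back-of-the-envelope: size of a baseline language-model head.
# Dimensions are Llama 3.2 1B's published config (illustrative only;
# this does not describe FlashHead's internals).
hidden_size = 2048
vocab_size = 128_256
total_params = 1.24e9  # approximate total parameter count

head_params = hidden_size * vocab_size  # dense projection to the vocabulary
print(f"head parameters: {head_params / 1e6:.0f}M")          # ~263M
print(f"share of model:  {head_params / total_params:.0%}")  # ~21%
```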
The recommended way to use FlashHead models is through vLLM with the custom generation pipeline provided in this package.
A simple chat interface can be launched with the default model:

```bash
python3 -m embedl.models.vllm.demo
```
Other FlashHead models can be selected by passing `--model embedl/<model>`.
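For programmatic use, a checkpoint can in principle also be loaded through vLLM's standard Python API, as in the minimal sketch below. This assumes the checkpoint loads as a regular vLLM model; the model ID is a placeholder (see the Hugging Face link below for published checkpoints), and the custom generation pipeline above remains the recommended path.

```python
# Minimal sketch using vLLM's standard Python API.
# "embedl/<model>" is a placeholder -- substitute a published FlashHead
# checkpoint ID; the embedl.models.vllm pipeline is the recommended path.
from vllm import LLM, SamplingParams

llm = LLM(model="embedl/<model>")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["What does a language-model head do?"], params)
print(outputs[0].outputs[0].text)
```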
Please check out the models and how to use them (with vLLM on NVIDIA GPUs) at https://huggingface.co/embedl/.
The following improvements are planned for upcoming releases:
- HuggingFace transformers pipeline
- Benchmarking via vLLM CLI (for detailed latency benchmarking)
- Support for lm-eval-harness (for detailed accuracy measurements)
- Upstream support for the models in transformers and vLLM
- Compatibility with GGUF, MLC, llama.cpp, TGI, etc.
- Support for additional inference toolchains and devices
Interested in early access to new releases? Missing something? Contact us (see below).
Please check out the license file for details, or contact [email protected].
For commercial licensing, enterprise support, or hardware co-marketing opportunities, use the channels below:
- Enterprise & commercial inquiries: [email protected]
- Technical issues, feedback, and early-access requests: https://github.com/embedl/embedl-models
- More information about Embedl products and model releases: https://embedl.com
If you are evaluating on-device inference, building products on NVIDIA RTX/H200, or exploring custom optimized models for your workloads, we encourage you to reach out. We actively collaborate with teams integrating local AI into commercial applications, and we offer:
- Tools for edge AI optimization, compatibility, profiling, and hardware provisioning (Embedl SDK)
- An online platform for benchmarking models (Embedl HUB)
- Engineering support for on-prem and edge model deployments
- Guidance on migrating from baseline Llama/Qwen/Gemma to optimized models
- Priority access to upcoming model releases
- Partner co-marketing for qualified deployments
For early access to future models or custom porting/optimization work, contact us directly at [email protected].