#

fp8

Here are 21 public repositories matching this topic...

NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory utilization in both training and inference.

python machine-learning deep-learning gpu cuda pytorch jax fp8 fp4

Updated Mar 25, 2026
Python

Azure / MS-AMP

Microsoft Automatic Mixed Precision Library

deep-learning gpu amp pytorch transformer mixed-precision fp8

Updated Dec 1, 2025
Python

intel / neural-speed

An innovative library for efficient LLM inference via low-bit quantization

Updated Aug 30, 2024
C++

aredden / flux-fp8-api

Flux diffusion model implementation using quantized fp8 matmul & remaining layers use faster half precision accumulate, which is ~2x faster on consumer devices.

flux pytorch quantization diffusion fast-inference fp8

Updated Oct 12, 2024
Python

graphcore-research / jax-scalify

JAX Scalify: end-to-end scaled arithmetics

jax low-precision llm fp8

Updated Oct 30, 2024
Python

MurrellGroup / Microfloats.jl

Narrow precision floating point types

floating-point minifloat microfloat fp6 fp8 fp4 microscaling

Updated Mar 9, 2026
Julia

massif-01 / vllm_benchmark_block_fp8

Automated Triton w8a8 block FP8 kernel tuning tool for vLLM. Auto-detects model architecture, supports Qwen3-Coder-30B-A3B-Instruct-FP8/DeepSeek-V3/custom models, multi-GPU parallel tuning, and generates optimized kernel configs for quantization.

triton performance-tuning kernel-tuning fp8 vllm

Updated Oct 31, 2025
Python

zpqiu / rl-infra-notes

Source-code level analysis of LLM RL training infra: async RL, weight sync, FP8, MoE routing | LLM RL 训练基础设施源码级分析

reinforcement-learning rl moe distributed-training megatron llm fp8 vllm llm-training async-rl

Updated Mar 22, 2026

zerfoo / zerfoo

Pure Go machine learning framework. Train, run, and serve ML models with go build. Zero CGo.

go golang machine-learning deep-learning neural-network transformer fp16 autodiff distributed-training float16 onnx graph-ml ml-framework fp8 float8

Updated Mar 25, 2026
Go

tashiscool / fp8-mps-metal

FP8 Metal compute kernels for Apple Silicon MPS — fixing what PyTorch doesn't support yet. FLUX/SD3.5/ComfyUI on Mac.

flux metal pytorch mps quantization apple-silicon fp8 stable-diffusion comfyui m4-pro

Updated Feb 8, 2026
Python

stlin256 / CUDABurner

An stress and benchmark utility for NVIDIA GPUs. Measures performance across various precisions (FP64, FP32, TF32, FP16, INT8) and monitors real-time vitals like power, temperature, and clock speeds.

benchmarking performance sparsity cpp hpc cuda nvidia stress-testing fp16 int8 gpu-benchmark fp32 fp64 fp8 fp4 tf32 tf16

Updated Dec 12, 2025
C++

klessydra / spike-with-minifloat-fp8-support

Spike, a RISC-V ISA Simulator with added 8-bit vector floating point support

spike riscv minifloat fp8

Updated Sep 12, 2025
C

mukullokhande99 / XR-NPE

Python implementations for multi-precision quantization in computer vision and sensor fusion workloads, targeting the XR-NPE Mixed-Precision SIMD Neural Processing Engine. The code includes visual inertial odometry (VIO), object classification, and eye gaze extraction code in FP4, FP8, Posit4, Posit8, and BF16 formats.

object-detection quantization visual-inertial-odometry posit eye-gaze-detection bf16 fp8 fp4

Updated Aug 17, 2025
Jupyter Notebook

LessUp / triton-fused-ops

High-Performance Triton Ops: RMSNorm+RoPE Fusion, Gated MLP Fusion & FP8 Quantized GEMM for Transformers | 高性能 Triton 算子库：RMSNorm+RoPE 融合、Gated MLP 融合、FP8 量化 GEMM，专为 Transformer 优化

python deep-learning cuda pytorch triton gpu-computing fp8 operator-fusion

Updated Mar 22, 2026
Python

ethanlee928 / fp8-multiplier-adder

Verilog design of FP8 E4M3 Multiplier and Adder

verilog hdl fp8 fp8e4m3

Updated Dec 30, 2025
Verilog

zsxkib / cog-step-video-t2v

Cog Single GPU Quantized Implementation of Step-Video-T2V

replicate single-gpu fp8 h100 step-video-t2v diffsynth

Updated Feb 25, 2025
Python

sbhavani / llamafactory-fp8-hopper

LLaMA-Factory FP8 training environment for NVIDIA Hopper GPUs. Fixes common configuration issues causing 2x slowdown with FP8 mixed precision.

deep-learning pytorch nvidia hopper performance-optimization fp8 h100 llama-factory gh200 transformer-engine

Updated Dec 31, 2025
Python

Sharveswar007 / SSBLAST

First open-source FP8 linear solver for consumer NVIDIA GPUs — 2-3x faster than cuBLAS FP64. pip install ssblast

python machine-learning hpc gpu cuda nvidia triton cupy numerical-computing fp8 linear-solver rtx-4050

Updated Mar 15, 2026
Python

soy-tuber / localllama-insights

Technical insights from r/LocalLLaMA — vLLM, FP8, NVFP4, Blackwell GPU benchmarks, and more. Unverified community knowledge, generated by Nemotron 9B. Issues welcome.

gpu inference benchmarks blackwell llm fp8 vllm localllama rtx-5090 nvfp4

Updated Mar 16, 2026

umangyadav / py_fp8

FP8 dtypes enumeration in python

fp8 fp8e4m3fnuz fp8e4m3 fp8e5m2 fp8e5m2fnuz

Updated Nov 16, 2023
C++

Improve this page

Add a description, image, and links to the fp8 topic page so that developers can more easily learn about it.

Curate this topic

Add this topic to your repo

To associate your repository with the fp8 topic, visit your repo's landing page and select "manage topics."