🚀 CUDA GEMM Benchmark Suite

This repository provides a comprehensive CUDA GEMM (Matrix Multiplication) microbenchmark, comparing multiple precision and optimization variants on RTX3060 GPU (12GB):

A baseline FP32 shared-memory implementation,
A Tensor Core accelerated FP16 → FP32 mixed-precision kernel, and
An optional BF16 → FP32 Tensor Core kernel (for Ampere or newer GPUs).

It demonstrates how shared-memory tiling, Tensor Core usage, and precision formats affect performance and accuracy across different matrix sizes.

🧩 Overview

Variant ID	Kernel Type	Precision	Strategy	Description
0	Shared-Memory GEMM	FP32	Global Memory + Shared Tile	Baseline tiled GEMM in shared memory
1	Tensor Core GEMM	FP16 → FP32	WMMA API	Tensor Core mixed-precision GEMM
2	Tensor Core GEMM	BF16 → FP32	WMMA API (Ampere+)	BFloat16 input variant for higher numeric range

Each kernel computes C = A × B for square matrices MxK and KxN, measures runtime, GFLOPs, and accuracy vs a CPU reference.

⚙️ Build Instructions

Ensure CUDA Toolkit ≥ 12.0 and a GPU with Tensor Core support.

# For Ampere (RTX 30, A100, etc.)
nvcc -O3 --gpu-architecture=sm_80 --use_fast_math matmul_variants.cu -o matmul_variants

🧠 Performance Summary (RTX 3060, CUDA 12.4)

Variant	Matrix Size	Time (ms)	GFLOPs	Speedup vs FP32	Max Abs Error
FP32 Shared Memory	512	0.272	988.4	1.00×	4.6e-05
Tensor Core FP16→FP32	512	0.039	6864.7	6.95×	145.84
Tensor Core BF16→FP32	512	0.056	4760.8	4.82×	145.84

🧩 Observation:
Tensor Cores provide up to a ~7× throughput improvement over FP32 shared-memory GEMM.
FP16 achieves the highest performance, while BF16 offers better numeric range and stability with slightly reduced throughput.

🚀 Running the Benchmark

./matmul_variants 
Or specify custom matrix size:
./matmul_variants 1024 1024 1024

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
.ipynb_checkpoints		.ipynb_checkpoints
results		results
.gitattributes		.gitattributes
README.md		README.md
benchmark_guide.ipynb		benchmark_guide.ipynb
matmul_variants		matmul_variants
matmul_variants.cu		matmul_variants.cu
matmul_variants_shmoo.cu		matmul_variants_shmoo.cu

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🚀 CUDA GEMM Benchmark Suite

🧩 Overview

⚙️ Build Instructions

🧠 Performance Summary (RTX 3060, CUDA 12.4)

🚀 Running the Benchmark

🖼️ Shmoo Runtime Comparison

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🚀 CUDA GEMM Benchmark Suite

🧩 Overview

⚙️ Build Instructions

🧠 Performance Summary (RTX 3060, CUDA 12.4)

🚀 Running the Benchmark

🖼️ Shmoo Runtime Comparison

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages