Skip to content

chenxinye/blrmat

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

blrmat: BLR Matrix Solver

A lightweight Block Low-Rank (BLR) Matrix library implemented in CUDA. This project leverages cuBLAS and cuSOLVER to accelerate dense matrix operations by compressing off-diagonal blocks into low-rank approximations.

Features

  • Adaptive Compression: automatically identifies and compresses low-rank blocks using SVD based on a specified tolerance.
  • High-Performance Kernels:
    • GEMM: Matrix-Matrix multiplication exploiting low-rank structure ($O(N \cdot k)$ vs $O(N^2)$).
    • Cholesky Decomposition: In-place factorization for Symmetric Positive Definite (SPD) matrices using Left-Looking TRSM updates.
    • LU Decomposition: Standard block LU factorization.

Project Structure

.
├── CMakeLists.txt       # Build configuration
├── main_test.cu         # Benchmark and Test Suite
├── include/
│   └── blr_matrix.h     # Header definitions
└── src/
    └── blr_matrix.cu    # CUDA implementation

Simulations

All tests were performed on a Dell PowerEdge R750xa server with 2 TB of memory. The system is equipped with four NVIDIA A100 80GB PCIe GPUs and two Intel Xeon Gold 6330 processors (totaling 56 cores, 2.00 GHz). Benchmarks were conducted on a single GPU instance. Check details on front.convergence.lip6.

The GEMM with BLR format reaches close to 2 times speedup compared with cuBLAS GEMM in our example:

==================================================
  BLR Matrix Benchmark & Accuracy Test            
==================================================
Matrix: 10000x10000, Block: 2000, Tol: 0.0001

[Test 1] GEMM Performance
   -> Running Dense cuBLAS GEMM (4 runs):
      Run 0: 43.435 ms (Warmup)
      Run 1: 3.485 ms
      Run 2: 3.473 ms
      Run 3: 3.471 ms
      >> Avg Time (Runs 1-3): 3.476 ms

   -> Running BLR GEMM (4 runs):
      Run 0: 81.950 ms (Warmup)
      Run 1: 2.147 ms
      Run 2: 1.937 ms
      Run 3: 1.736 ms
      >> Avg Time (Runs 1-3): 1.940 ms

   [Speedup Report]
      Speedup (Dense/BLR): 1.792x
      GEMM Relative Error: 5.342e-08

[Test 2] LU Decomposition
   LU Time        : 171.044 ms
   Verifying Accuracy (Computing A - L*U)...
   LU Error       : 7.426e-02

[Test 3] Cholesky Decomposition
   Cholesky Time  : 255.214 ms
   Verifying Accuracy (Computing A - L*L^T)...
   Cholesky Error : 1.197e-09

About

BLR matrix format and solvers

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors