πŸš€ FPGA Tensor Accelerator

A production-quality RTL implementation of a tensor processing unit for neural network inference, featuring a 2Γ—2 grid of Tensor Processing Clusters (TPCs) with 16Γ—16 systolic arrays.


✨ Features

  • 4 Tensor Processing Clusters (TPCs) in a 2Γ—2 mesh
  • 16Γ—16 Systolic Arrays (256 INT8 MACs per TPC)
  • 64-lane Vector Processing Unit for activations (ReLU, GELU, Softmax)
  • 2D DMA Engine with strided access patterns
  • 16-bank SRAM Subsystem with multi-port access
  • Network-on-Chip (NoC) with XY routing
  • AXI4 Memory Interface (DDR4/LPDDR4/LPDDR5 support)
  • Cycle-accurate Python Model for software simulation and verification
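The NoC's XY routing is dimension-ordered: a flit first travels along X until its column matches the destination, then along Y. A minimal Python sketch (function name and coordinate convention are illustrative, not the RTL router's interface):

```python
def xy_route(src, dst):
    """Dimension-ordered XY routing: correct the X coordinate first, then Y.

    Coordinates are (x, y) grid positions; returns the hop-by-hop path.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # step along X toward the destination column
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then step along Y toward the destination row
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

# On the 2x2 TPC mesh, TPC(0,0) -> TPC(1,1) goes via (1,0):
print(xy_route((0, 0), (1, 1)))       # [(0, 0), (1, 0), (1, 1)]
```

Because X is always exhausted before Y, XY routing is deadlock-free on a mesh, which is why it is a common choice for small NoCs like this one.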

πŸ“Š Performance

Metric           Value
──────           ─────
Peak Throughput  409 GOPS @ 200 MHz
Data Type        INT8 (with INT32 accumulation)
On-chip SRAM     2 MB (configurable)
Target Devices   Xilinx UltraScale+, Versal
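The peak-throughput figure follows directly from the feature list above: 4 TPCs, each with a 16×16 array of MACs, each MAC doing 2 ops (multiply + accumulate) per cycle at 200 MHz:

```python
tpcs = 4
macs_per_tpc = 16 * 16          # one 16x16 systolic array per TPC
ops_per_mac = 2                 # multiply + accumulate per cycle
freq_hz = 200e6

gops = tpcs * macs_per_tpc * ops_per_mac * freq_hz / 1e9
print(gops)                     # 409.6 (quoted as 409 GOPS above)
```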

πŸ§ͺ Simulation Results

MAC PE Verification βœ…

All 7 tests passing (log excerpt):

╔════════════════════════════════════════════════════════════╗
β•‘           MAC Processing Element Testbench                 β•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•

[TEST 1] Loading weight = 3
  PASS: weight_reg = 3 (expected 3)

[TEST 2] Computing 3 Γ— 4 + 0 = 12
  PASS: psum_out = 12 (expected 12)

[TEST 3] Accumulating: 12 + (3 Γ— 5) = 27
  PASS: psum_out = 27 (expected 27)

[TEST 4] Signed multiply: 3 Γ— (-2) = -6
  PASS: psum_out = -6 (expected -6)

╔════════════════════════════════════════════════════════════╗
β•‘   Passed: 7    Failed: 0                                   β•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•
   >>> ALL TESTS PASSED! <<<

Systolic Array Waveform

The systolic array implements weight-stationary dataflow:

Cycle   State    Activity
─────   ─────    ────────────────────────────
0-16    LOAD     Weights loaded column by column
17-48   COMPUTE  Activations stream, MACs accumulate  
49-64   DRAIN    Results emerge from bottom row
65      DONE     Computation complete
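The schedule above can be written as a small lookup, which is handy when eyeballing waveforms. The cycle ranges are taken directly from the table (a sketch, not the RTL FSM):

```python
# Phase boundaries for one 16x16 weight-stationary pass, per the table above.
PHASES = [
    (range(0, 17),  "LOAD"),     # weights loaded column by column
    (range(17, 49), "COMPUTE"),  # activations stream, MACs accumulate
    (range(49, 65), "DRAIN"),    # results emerge from bottom row
    (range(65, 66), "DONE"),
]

def phase_at(cycle):
    for cycles, name in PHASES:
        if cycle in cycles:
            return name
    return "DONE"                # hold DONE after the pass completes

print(phase_at(0), phase_at(20), phase_at(50), phase_at(65))
```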

πŸ“ Project Structure

tensor_accelerator/
β”œβ”€β”€ rtl/                    # Synthesizable Verilog
β”‚   β”œβ”€β”€ core/               # Compute units
β”‚   β”‚   β”œβ”€β”€ mac_pe.v        # MAC processing element
β”‚   β”‚   β”œβ”€β”€ systolic_array.v# 16Γ—16 systolic array
β”‚   β”‚   β”œβ”€β”€ vector_unit.v   # 64-lane SIMD VPU
β”‚   β”‚   └── dma_engine.v    # 2D DMA controller
β”‚   β”œβ”€β”€ memory/             # Memory subsystem
β”‚   β”‚   β”œβ”€β”€ sram_subsystem.v
β”‚   β”‚   β”œβ”€β”€ memory_controller_wrapper.v
β”‚   β”‚   └── axi_memory_model.v (sim only)
β”‚   β”œβ”€β”€ control/            # Controllers
β”‚   β”‚   β”œβ”€β”€ local_cmd_processor.v
β”‚   β”‚   └── global_cmd_processor.v
β”‚   β”œβ”€β”€ noc/                # Network on Chip
β”‚   β”‚   └── noc_router.v
β”‚   └── top/                # Top-level modules
β”‚       β”œβ”€β”€ tensor_processing_cluster.v
β”‚       └── tensor_accelerator_top.v
β”œβ”€β”€ model/                  # Cycle-accurate Python models
β”‚   β”œβ”€β”€ systolic_array_model.py  # MXU: weight-stationary GEMM
β”‚   β”œβ”€β”€ dma_model.py        # DMA: LOAD/STORE with AXI
β”‚   β”œβ”€β”€ vpu_model.py        # VPU: ReLU, Add, reductions
β”‚   β”œβ”€β”€ lcp_model.py        # LCP: instruction fetch/decode
β”‚   └── tpc_model.py        # TPC: integrated system model
β”œβ”€β”€ tb/                     # Testbenches
β”œβ”€β”€ sw/                     # Software tools
β”‚   β”œβ”€β”€ assembler/          # Instruction assembler
β”‚   └── examples/           # Example kernels
β”œβ”€β”€ docs/                   # Documentation
β”œβ”€β”€ constraints/            # FPGA constraints
└── scripts/                # Build scripts

πŸš€ Quick Start

Prerequisites

# macOS
brew install icarus-verilog
brew install surfer          # Waveform viewer (recommended)
# Or: brew install --cask gtkwave

# Ubuntu/Debian
sudo apt install iverilog gtkwave

# Windows (via WSL or direct)
# Install Icarus Verilog from: http://bleyer.org/icarus/

Run Simulation

# Extract and enter directory
tar -xzf tensor_accelerator.tar.gz
cd tensor_accelerator

# Interactive test menu
./debug.sh

# Or run all tests directly
make test

View Waveforms

# After running tests, view with Surfer
surfer sim/waves/mac_pe.vcd
surfer sim/waves/systolic_array.vcd

# Or with GTKWave (use preset signals)
gtkwave sim/waves/mac_pe.vcd sim/waves/mac_pe.gtkw

πŸ”§ FPGA Synthesis (Vivado)

# Batch mode
vivado -mode batch -source scripts/synth.tcl

# Or in Vivado GUI
source scripts/synth.tcl

Supported Targets

Board    Device     Memory        Status
─────    ──────     ──────        ──────
ZCU104   XCZU7EV    DDR4          ✅ Tested
VCU118   XCVU9P     DDR4          ✅ Tested
VCK190   XCVC1902   DDR4/LPDDR4   ✅ Tested
VM2152   XCVM2152   LPDDR5        🔜 Planned

πŸ“– Documentation

Document                  Description
────────                  ───────────
VERILOG_TUTORIAL.md       Complete design walkthrough - start here!
ISA_GUIDE.md              Instruction Set Architecture - assembly → binary
presentation.html         Interactive slide deck - open in browser
Python Models             Cycle-accurate simulation - see section below
WAVEFORMS.md              Waveform capture guide for Surfer
SYNTHESIS_READINESS.md    FPGA synthesis checklist
MEMORY_INTEGRATION.md     DDR4/LPDDR5 integration guide
TEST_FLOW.md              Verification methodology
SIMULATOR_COMPARISON.md   Verilator vs ModelSim vs VCS

πŸ—οΈ Architecture Overview

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      TENSOR ACCELERATOR                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚
β”‚  β”‚    TPC 0    │══│    TPC 1    β”‚  β”‚    TPC 2    │══│    TPC 3    β”‚β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”  β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”  β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”  β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”  β”‚β”‚
β”‚  β”‚  β”‚16Γ—16  β”‚  β”‚  β”‚  β”‚16Γ—16  β”‚  β”‚  β”‚  β”‚16Γ—16  β”‚  β”‚  β”‚  β”‚16Γ—16  β”‚  β”‚β”‚
β”‚  β”‚  β”‚Systolicβ”‚  β”‚  β”‚  β”‚Systolicβ”‚  β”‚  β”‚  β”‚Systolicβ”‚  β”‚  β”‚  β”‚Systolicβ”‚  β”‚β”‚
β”‚  β”‚  β”‚Array  β”‚  β”‚  β”‚  β”‚Array  β”‚  β”‚  β”‚  β”‚Array  β”‚  β”‚  β”‚  β”‚Array  β”‚  β”‚β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜β”‚
β”‚         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β”‚
β”‚                              β”‚ NoC                                  β”‚
β”‚                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”                           β”‚
β”‚                    β”‚  Global Controller β”‚                           β”‚
β”‚                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                               β”‚ AXI4
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚   External Memory   β”‚
                    β”‚   (DDR4/LPDDR5)     β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

🐍 Cycle-Accurate Python Model

The project includes a complete cycle-accurate Python model that mirrors the RTL implementation exactly. This enables:

  • Software simulation without Verilog tools
  • Test vector generation for RTL verification
  • Algorithm development before hardware implementation
  • Cross-validation between Python and RTL

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    TPC Python Model (tpc_model.py)              β”‚
β”‚                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                               β”‚
β”‚  β”‚     LCP      β”‚  Instruction fetch, decode, dispatch          β”‚
β”‚  β”‚ lcp_model.py β”‚  Loop handling, unit synchronization          β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜                                               β”‚
β”‚         β”‚ dispatches to                                         β”‚
β”‚         β–Ό                                                       β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                β”‚
β”‚  β”‚     MXU      β”‚     VPU      β”‚     DMA      β”‚                β”‚
β”‚  β”‚  systolic_   β”‚  vpu_model   β”‚  dma_model   β”‚                β”‚
β”‚  β”‚  array_model β”‚    .py       β”‚    .py       β”‚                β”‚
β”‚  β”‚     .py      β”‚              β”‚              β”‚                β”‚
β”‚  β”‚              β”‚              β”‚              β”‚                β”‚
β”‚  β”‚ Weight-      β”‚ ReLU, Add,   β”‚ LOAD/STORE   β”‚                β”‚
β”‚  β”‚ stationary   β”‚ Sub, Max,    β”‚ 2D strided   β”‚                β”‚
β”‚  β”‚ GEMM         β”‚ Reductions   β”‚ transfers    β”‚                β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜                β”‚
β”‚         β”‚              β”‚              β”‚                         β”‚
β”‚         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                         β”‚
β”‚                        β”‚                                        β”‚
β”‚                   β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”           β”‚
β”‚                   β”‚  SRAM   │◄───────►│  AXI Mem   β”‚           β”‚
β”‚                   β”‚ Model   β”‚         β”‚   Model    β”‚           β”‚
β”‚                   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Model Files

File                           Lines   Description
────                           ─────   ───────────
model/systolic_array_model.py  770     Weight-stationary MXU with PE array, skew/de-skew
model/dma_model.py             500     DMA engine with AXI & SRAM interfaces
model/vpu_model.py             300     Vector unit: ReLU, Add, reductions
model/lcp_model.py             400     Command processor with loops
model/tpc_model.py             350     Integrated TPC combining all units

Key Design Principles

1. State Machine Matching: each model uses state machines identical to the RTL's:

# DMA states match rtl/core/dma_engine.v exactly
from enum import IntEnum

class DMAState(IntEnum):
    IDLE = 0
    DECODE = 1
    LOAD_ADDR = 2
    LOAD_DATA = 3
    LOAD_WRITE = 4
    STORE_REQ = 5
    STORE_WAIT = 6
    STORE_CAP = 13   # Critical: wait for SRAM read latency
    STORE_ADDR = 7
    # ...

2. Registered Output Modeling: combinational and sequential logic are properly separated:

def posedge(self, ...):
    # Update state from previous cycle's next_state
    self.state = self.state_next
    
    # Compute next state and outputs (combinational)
    if self.state == DMAState.LOAD_ADDR:
        self.axi_arvalid = True
        if axi_arready:
            self.state_next = DMAState.LOAD_DATA

3. SRAM Latency Modeling: reads have a 1-cycle registered latency, matching the RTL:

class SRAMModel:
    def posedge(self, addr, wdata, we, re):
        # Output is from PREVIOUS cycle's read
        self.rdata = self._rdata_next
        
        if re:
            # Schedule data for NEXT cycle
            self._rdata_next = self.mem[addr >> 5]
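Filling in the elided pieces, a self-contained version of that sketch (the constructor and the 32-byte word indexing are assumptions) makes the 1-cycle latency visible:

```python
class SRAMModel:
    """Toy SRAM with a 1-cycle registered read, mirroring the sketch above."""

    def __init__(self, depth_words=256):
        self.mem = [0] * depth_words
        self.rdata = 0
        self._rdata_next = 0

    def posedge(self, addr=0, wdata=0, we=False, re=False):
        # Output is from the PREVIOUS cycle's read
        self.rdata = self._rdata_next
        if we:
            self.mem[addr >> 5] = wdata      # byte address -> 32-byte word index
        if re:
            self._rdata_next = self.mem[addr >> 5]

sram = SRAMModel()
sram.posedge(addr=0x20, wdata=42, we=True)   # cycle 0: write
sram.posedge(addr=0x20, re=True)             # cycle 1: issue read; rdata still stale
print(sram.rdata)                            # 0
sram.posedge()                               # cycle 2: read data arrives
print(sram.rdata)                            # 42
```

Failing to model this latency is exactly the class of bug the S_STORE_CAP state fixes in the DMA engine: the consumer must wait one extra cycle before capturing SRAM read data.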

How We Know The Model Is Correct

1. Unit Test Coverage

Each model has self-contained tests that verify functional correctness:

Model  Test                 What It Verifies
─────  ────                 ────────────────
DMA    1-word LOAD          External → SRAM transfer
DMA    4-word LOAD          Burst transfers
DMA    1-word STORE         SRAM → External (tests SRAM latency handling)
DMA    2D LOAD              Strided access patterns
VPU    ReLU                 max(x, 0) for all elements
VPU    Vector Add           Saturating addition on overflow
VPU    Sum Reduce           Accumulation across vector
LCP    Simple program       NOP → TENSOR → HALT sequence
LCP    Loop program         LOOP with 3 iterations
MXU    2×2, 3×3, 4×4 GEMM   Matrix multiply correctness
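The "Vector Add" row checks saturation rather than two's-complement wraparound; the reference behavior (assuming INT8 lanes, as used elsewhere in the design) is roughly:

```python
import numpy as np

def sat_add_int8(a, b):
    """Elementwise INT8 add that clamps to [-128, 127] instead of wrapping."""
    widened = a.astype(np.int16) + b.astype(np.int16)   # widen so the sum can't wrap
    return np.clip(widened, -128, 127).astype(np.int8)

a = np.array([100, -100, 5], dtype=np.int8)
b = np.array([100, -100, 7], dtype=np.int8)
print(sat_add_int8(a, b))   # values: 127, -128, 12 (clamped, clamped, exact)
```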

2. Integration Tests (tpc_model.py)

The TPC model runs complete programs that exercise multiple units:

# TEST 3: DMA LOAD β†’ VPU RELU pipeline
program = [
    make_dma_instr(DMAOp.LOAD, ext_addr=0x00, int_addr=0x100),
    make_vpu_instr(VPUOp.RELU, src_a=0x100, dst=0x200),
    make_halt()
]
# Verifies: DMA loads from AXI, VPU processes, correct data flow

3. RTL Cross-Validation

The Python model was developed alongside RTL fixes:

Bug Found          Python Model              RTL Fix
─────────          ────────────              ───────
DMA STORE timing   S_STORE_CAP state added   dma_engine.v line 180
SRAM read latency  _rdata_next pipeline      sram_bank.v registered output
VPU data capture   WAIT_A1 → WAIT_A2 states  vector_unit.v timing

4. Numerical Verification

# NumPy golden reference
expected = np.dot(A.astype(np.int32), B.astype(np.int32))

# Python model result
C = tpc.read_sram(0x200, 4).reshape(2, 2)

# Must match exactly
assert np.array_equal(C, expected.astype(np.int8))

Running Python Model Tests

Prerequisites

# Python 3.8+ with NumPy
pip install numpy

Run Individual Model Tests

cd tensor_accelerator

# DMA Engine (4 tests)
python3 model/dma_model.py
# Output: ALL DMA MODEL TESTS PASSED!

# Vector Processing Unit (3 tests)  
python3 model/vpu_model.py
# Output: ALL VPU TESTS PASSED!

# Local Command Processor (2 tests)
python3 model/lcp_model.py
# Output: ALL LCP TESTS PASSED!

# Systolic Array / MXU (3 tests)
python3 model/systolic_array_model.py
# Output: ALL TESTS PASSED!

# Full TPC Integration (3 tests)
python3 model/tpc_model.py
# Output: ALL TPC INTEGRATED TESTS PASSED!

Run All Tests (RTL + Python)

./run_tests.sh
# Runs 22 tests total:
#   - 18 RTL tests (requires iverilog)
#   - 4 Python model tests (requires Python + NumPy)

Verbose Mode for Debugging

# Enable cycle-by-cycle tracing
tpc = TPCModel(verbose=True)
tpc.run()

# Output:
# [LCP @  1] FETCH        PC=  0 | Fetching
# [LCP @  3] DECODE       PC=  0 | Decoded: op=0x03 subop=0x01
# [DMA @  5] IDLE         | CMD: LOAD
# [DMA @  7] LOAD_ADDR    | AR addr=0x0
# [DMA @ 11] LOAD_DATA    | R data=0xdead0000
# ...

Using the Model for Development

1. Test New Algorithms

from model.tpc_model import TPCModel, make_mxu_instr, make_vpu_instr, make_halt

tpc = TPCModel()

# Load your matrices
tpc.load_sram(0x000, weights)
tpc.load_sram(0x100, activations)

# Run GEMM + ReLU
tpc.load_program([
    make_mxu_instr(0x000, 0x100, 0x200, M=4, K=4, N=4),
    make_vpu_instr(VPUOp.RELU, 0x200, 0, 0x300, length=1),
    make_halt()
])
tpc.run()

# Read results
output = tpc.read_sram(0x300, 16).reshape(4, 4)

2. Generate RTL Test Vectors

# Run Python model, capture intermediate values
tpc = TPCModel(verbose=True)
tpc.run()

# Export SRAM state for RTL $readmemh
with open('test_vectors.hex', 'w') as f:
    for addr in range(256):
        f.write(f'{tpc.sram.mem[addr]:064x}\n')

3. Cycle Count Analysis

tpc = TPCModel()
tpc.load_program(my_program)
tpc.run()

print(f"Total cycles: {tpc.cycle}")
# Use for performance optimization

πŸ§ͺ Example: Matrix Multiplication

// The systolic array computes C = A Γ— B
// Weight-stationary dataflow:
//   1. Load weights (B) into PEs - they stay in place
//   2. Stream activations (A) from left
//   3. Accumulate partial sums flowing down
//   4. Results emerge from bottom

// Each PE computes:
psum_out = psum_in + (activation * weight);
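Applying that per-PE recurrence K times per output element is exactly a GEMM. A quick NumPy check of the arithmetic (not cycle-accurate, just the dataflow math):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(-128, 128, size=(4, 4), dtype=np.int8)   # activations
B = rng.integers(-128, 128, size=(4, 4), dtype=np.int8)   # weights (stationary)

C = np.zeros((4, 4), dtype=np.int32)                      # INT32 accumulators
for i in range(4):
    for j in range(4):
        for k in range(4):
            # One PE step: psum_out = psum_in + (activation * weight)
            C[i, j] += np.int32(A[i, k]) * np.int32(B[k, j])

# Matches the golden GEMM exactly
assert np.array_equal(C, A.astype(np.int32) @ B.astype(np.int32))
```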

πŸ“ Assembly Example

# ResNet Convolution Kernel
LOOP        64                # 64 output channels
    DMA.LOAD_2D W_SRAM, W_DDR, 16, 16, 256
    DMA.LOAD_2D A_SRAM, A_DDR, 16, 16, 256
    TENSOR.GEMM OUT_SRAM, A_SRAM, W_SRAM, 16, 16, 16
    VEC.RELU    OUT_SRAM, OUT_SRAM
    DMA.STORE_2D OUT_DDR, OUT_SRAM, 16, 16, 256
ENDLOOP
HALT

Compile to Binary

# Compile assembly to hex (for Verilog $readmemh)
python3 sw/assembler/assembler.py sw/examples/resnet_conv.asm -o program.hex

# Output: 128-bit instructions
# 05000000000000000000004000000000  // LOOP 64
# 03010400000000100010001001000000  // DMA.LOAD_2D
# ...
# ff000000000000000000000000000000  // HALT

See ISA_GUIDE.md for complete instruction encoding details.

🀝 Contributing

Contributions welcome! Please read the documentation first, especially:

  1. VERILOG_TUTORIAL.md - Understand the design
  2. TEST_FLOW.md - How to verify changes

πŸ“„ License

MIT License - see LICENSE for details.

πŸ™ Acknowledgments

  • Inspired by Google TPU, NVIDIA Tensor Cores, and academic systolic array research
  • Built with guidance from Anthropic's Claude

⭐ Star this repo if you find it useful!
