A production-quality RTL implementation of a tensor processing unit for neural network inference, featuring a 2×2 grid of Tensor Processing Clusters (TPCs) with 16×16 systolic arrays.
- 4 Tensor Processing Clusters (TPCs) in a 2×2 mesh
- 16×16 Systolic Arrays (256 INT8 MACs per TPC)
- 64-lane Vector Processing Unit for activations (ReLU, GELU, Softmax)
- 2D DMA Engine with strided access patterns
- 16-bank SRAM Subsystem with multi-port access
- Network-on-Chip (NoC) with XY routing
- AXI4 Memory Interface (DDR4/LPDDR4/LPDDR5 support)
- Cycle-accurate Python Model for software simulation and verification
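The 2D DMA engine's strided access pattern can be illustrated with a short sketch. The function name and word-addressed parameters here are illustrative only, not the RTL's actual interface:

```python
def gen_2d_addresses(base, rows, width, stride):
    """Yield (ext_addr, int_offset) pairs for a 2D strided LOAD:
    `rows` lines of `width` words, `stride` words apart externally,
    packed contiguously on-chip."""
    for r in range(rows):
        for c in range(width):
            yield base + r * stride + c, r * width + c

# Two rows of four words, 256 words apart in external memory
addrs = list(gen_2d_addresses(base=0x1000, rows=2, width=4, stride=256))
# First row is contiguous; second row starts one stride later (0x1100)
```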
| Metric | Value |
|---|---|
| Peak Throughput | 409 GOPS @ 200 MHz |
| Data Type | INT8 (with INT32 accumulation) |
| On-chip SRAM | 2 MB (configurable) |
| Target Devices | Xilinx UltraScale+, Versal |
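The peak-throughput figure follows directly from the array dimensions, counting each MAC as two operations (multiply + add):

```python
# Back-of-envelope check of the peak-throughput figure above
tpcs = 4
macs_per_tpc = 16 * 16   # 256 INT8 MACs per systolic array
ops_per_mac = 2          # multiply + accumulate
freq_hz = 200e6          # 200 MHz

gops = tpcs * macs_per_tpc * ops_per_mac * freq_hz / 1e9
print(gops)  # 409.6
```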
All 7 tests passing:
╔══════════════════════════════════════════════════════════╗
║             MAC Processing Element Testbench             ║
╚══════════════════════════════════════════════════════════╝
[TEST 1] Loading weight = 3
PASS: weight_reg = 3 (expected 3)
[TEST 2] Computing 3 × 4 + 0 = 12
PASS: psum_out = 12 (expected 12)
[TEST 3] Accumulating: 12 + (3 × 5) = 27
PASS: psum_out = 27 (expected 27)
[TEST 4] Signed multiply: 3 × (-2) = -6
PASS: psum_out = -6 (expected -6)
╔══════════════════════════════════════════════════════════╗
║  Passed: 7    Failed: 0                                  ║
╚══════════════════════════════════════════════════════════╝
>>> ALL TESTS PASSED! <<<
The systolic array implements weight-stationary dataflow:
| Cycle | State | Activity |
|---|---|---|
| 0-16 | LOAD | Weights loaded column by column |
| 17-48 | COMPUTE | Activations stream, MACs accumulate |
| 49-64 | DRAIN | Results emerge from bottom row |
| 65 | DONE | Computation complete |
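The dataflow above can be sketched as a timing-free behavioral model: each PE(i, j) holds weight B[i, j] in place, activations stream through, and partial sums accumulate down each column. (The cycle-accurate version lives in model/systolic_array_model.py; this is only a functional sketch.)

```python
import numpy as np

def weight_stationary_gemm(A, B):
    """Behavioral sketch of weight-stationary dataflow: PE(i, j) holds
    B[i, j]; activations stream left-to-right; partial sums flow down
    and accumulate psum += a * w at each PE."""
    K, N = B.shape
    M = A.shape[0]
    C = np.zeros((M, N), dtype=np.int32)
    for m in range(M):             # one activation row at a time
        for j in range(N):         # partial sum flows down column j
            psum = 0
            for i in range(K):     # PE(i, j) holds weight B[i, j]
                psum += int(A[m, i]) * int(B[i, j])
            C[m, j] = psum
    return C

A = np.array([[1, 2], [3, 4]], dtype=np.int8)
B = np.array([[5, 6], [7, 8]], dtype=np.int8)
assert np.array_equal(weight_stationary_gemm(A, B),
                      A.astype(np.int32) @ B.astype(np.int32))
```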
tensor_accelerator/
├── rtl/                          # Synthesizable Verilog
│   ├── core/                     # Compute units
│   │   ├── mac_pe.v              # MAC processing element
│   │   ├── systolic_array.v      # 16×16 systolic array
│   │   ├── vector_unit.v         # 64-lane SIMD VPU
│   │   └── dma_engine.v          # 2D DMA controller
│   ├── memory/                   # Memory subsystem
│   │   ├── sram_subsystem.v
│   │   ├── memory_controller_wrapper.v
│   │   └── axi_memory_model.v    # (sim only)
│   ├── control/                  # Controllers
│   │   ├── local_cmd_processor.v
│   │   └── global_cmd_processor.v
│   ├── noc/                      # Network on Chip
│   │   └── noc_router.v
│   └── top/                      # Top-level modules
│       ├── tensor_processing_cluster.v
│       └── tensor_accelerator_top.v
├── model/                        # Cycle-accurate Python models
│   ├── systolic_array_model.py   # MXU: weight-stationary GEMM
│   ├── dma_model.py              # DMA: LOAD/STORE with AXI
│   ├── vpu_model.py              # VPU: ReLU, Add, reductions
│   ├── lcp_model.py              # LCP: instruction fetch/decode
│   └── tpc_model.py              # TPC: integrated system model
├── tb/                           # Testbenches
├── sw/                           # Software tools
│   ├── assembler/                # Instruction assembler
│   └── examples/                 # Example kernels
├── docs/                         # Documentation
├── constraints/                  # FPGA constraints
└── scripts/                      # Build scripts
# macOS
brew install icarus-verilog
brew install surfer # Waveform viewer (recommended)
# Or: brew install --cask gtkwave
# Ubuntu/Debian
sudo apt install iverilog gtkwave
# Windows (via WSL or direct)
# Install Icarus Verilog from: http://bleyer.org/icarus/

# Extract and enter directory
tar -xzf tensor_accelerator.tar.gz
cd tensor_accelerator
# Interactive test menu
./debug.sh
# Or run all tests directly
make test

# After running tests, view with Surfer
surfer sim/waves/mac_pe.vcd
surfer sim/waves/systolic_array.vcd
# Or with GTKWave (use preset signals)
gtkwave sim/waves/mac_pe.vcd sim/waves/mac_pe.gtkw

# Batch mode
vivado -mode batch -source scripts/synth.tcl
# Or in Vivado GUI
source scripts/synth.tcl

| Board | Device | Memory | Status |
|---|---|---|---|
| ZCU104 | XCZU7EV | DDR4 | ✅ Tested |
| VCU118 | XCVU9P | DDR4 | ✅ Tested |
| VCK190 | XCVC1902 | DDR4/LPDDR4 | ✅ Tested |
| VM2152 | XCVM2152 | LPDDR5 | 🔜 Planned |
| Document | Description |
|---|---|
| VERILOG_TUTORIAL.md | Complete design walkthrough - start here! |
| ISA_GUIDE.md | Instruction Set Architecture - assembly → binary |
| presentation.html | Interactive slide deck - open in browser |
| Python Models | Cycle-accurate simulation - see section below |
| WAVEFORMS.md | Waveform capture guide for Surfer |
| SYNTHESIS_READINESS.md | FPGA synthesis checklist |
| MEMORY_INTEGRATION.md | DDR4/LPDDR5 integration guide |
| TEST_FLOW.md | Verification methodology |
| SIMULATOR_COMPARISON.md | Verilator vs ModelSim vs VCS |
┌─────────────────────────────────────────────────────────────────────┐
│                         TENSOR ACCELERATOR                          │
│  ┌────────────┐   ┌────────────┐   ┌────────────┐   ┌────────────┐  │
│  │   TPC 0    │◄─►│   TPC 1    │   │   TPC 2    │◄─►│   TPC 3    │  │
│  │ ┌────────┐ │   │ ┌────────┐ │   │ ┌────────┐ │   │ ┌────────┐ │  │
│  │ │ 16×16  │ │   │ │ 16×16  │ │   │ │ 16×16  │ │   │ │ 16×16  │ │  │
│  │ │Systolic│ │   │ │Systolic│ │   │ │Systolic│ │   │ │Systolic│ │  │
│  │ │ Array  │ │   │ │ Array  │ │   │ │ Array  │ │   │ │ Array  │ │  │
│  │ └────────┘ │   │ └────────┘ │   │ └────────┘ │   │ └────────┘ │  │
│  └─────┬──────┘   └─────┬──────┘   └─────┬──────┘   └─────┬──────┘  │
│        └────────────────┴───────┬────────┴────────────────┘         │
│                                 │ NoC                               │
│                      ┌──────────┴──────────┐                        │
│                      │  Global Controller  │                        │
│                      └──────────┬──────────┘                        │
└─────────────────────────────────┼───────────────────────────────────┘
                                  │ AXI4
                       ┌──────────┴──────────┐
                       │   External Memory   │
                       │   (DDR4/LPDDR5)     │
                       └─────────────────────┘
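The NoC uses dimension-ordered XY routing: a packet first travels along X until its column matches the destination, then along Y. A minimal sketch with illustrative coordinates and port names (the actual router is rtl/noc/noc_router.v):

```python
def xy_route(cur, dst):
    """Dimension-ordered XY routing decision for a small mesh:
    resolve X first, then Y. Port names are illustrative only."""
    cx, cy = cur
    dx, dy = dst
    if dx > cx:
        return "EAST"
    if dx < cx:
        return "WEST"
    if dy > cy:
        return "SOUTH"
    if dy < cy:
        return "NORTH"
    return "LOCAL"   # packet has arrived

# On the 2x2 mesh, a packet from TPC (0,0) to (1,1) goes EAST, then SOUTH
```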
The project includes a complete cycle-accurate Python model that mirrors the RTL implementation exactly. This enables:
- Software simulation without Verilog tools
- Test vector generation for RTL verification
- Algorithm development before hardware implementation
- Cross-validation between Python and RTL
┌─────────────────────────────────────────────────────────────────┐
│                 TPC Python Model (tpc_model.py)                 │
│                                                                 │
│  ┌──────────────┐                                               │
│  │     LCP      │  Instruction fetch, decode, dispatch          │
│  │ lcp_model.py │  Loop handling, unit synchronization          │
│  └──────┬───────┘                                               │
│         │ dispatches to                                         │
│         ▼                                                       │
│  ┌──────────────┬──────────────┬──────────────┐                 │
│  │     MXU      │     VPU      │     DMA      │                 │
│  │  systolic_   │  vpu_model   │  dma_model   │                 │
│  │ array_model  │     .py      │     .py      │                 │
│  │     .py      │              │              │                 │
│  │              │              │              │                 │
│  │   Weight-    │  ReLU, Add,  │  LOAD/STORE  │                 │
│  │  stationary  │  Sub, Max,   │  2D strided  │                 │
│  │     GEMM     │  Reductions  │  transfers   │                 │
│  └──────┬───────┴──────┬───────┴──────┬───────┘                 │
│         │              │              │                         │
│         └──────────────┴──────────────┘                         │
│                        │                                        │
│                   ┌────▼─────┐        ┌──────────┐              │
│                   │   SRAM   │◄──────►│ AXI Mem  │              │
│                   │  Model   │        │  Model   │              │
│                   └──────────┘        └──────────┘              │
└─────────────────────────────────────────────────────────────────┘
| File | Lines | Description |
|---|---|---|
| model/systolic_array_model.py | 770 | Weight-stationary MXU with PE array, skew/de-skew |
| model/dma_model.py | 500 | DMA engine with AXI & SRAM interfaces |
| model/vpu_model.py | 300 | Vector unit: ReLU, Add, reductions |
| model/lcp_model.py | 400 | Command processor with loops |
| model/tpc_model.py | 350 | Integrated TPC combining all units |
1. State Machine Matching

Each model uses identical state machines to the RTL:
# DMA states match rtl/core/dma_engine.v exactly
class DMAState(IntEnum):
    IDLE       = 0
    DECODE     = 1
    LOAD_ADDR  = 2
    LOAD_DATA  = 3
    LOAD_WRITE = 4
    STORE_REQ  = 5
    STORE_WAIT = 6
    STORE_CAP  = 13  # Critical: wait for SRAM read latency
    STORE_ADDR = 7
    # ...

2. Registered Output Modeling

Proper separation of combinational and sequential logic:
def posedge(self, ...):
    # Update state from previous cycle's next_state
    self.state = self.state_next
    # Compute next state and outputs (combinational)
    if self.state == DMAState.LOAD_ADDR:
        self.axi_arvalid = True
        if axi_arready:
            self.state_next = DMAState.LOAD_DATA

3. SRAM Latency Modeling

1-cycle registered read latency, matching the RTL:
class SRAMModel:
    def posedge(self, addr, wdata, we, re):
        # Output is from PREVIOUS cycle's read
        self.rdata = self._rdata_next
        if re:
            # Schedule data for NEXT cycle
            self._rdata_next = self.mem[addr >> 5]

Each model has self-contained tests that verify functional correctness:
| Model | Test | What It Verifies |
|---|---|---|
| DMA | 1-word LOAD | External → SRAM transfer |
| DMA | 4-word LOAD | Burst transfers |
| DMA | 1-word STORE | SRAM → External (tests SRAM latency handling) |
| DMA | 2D LOAD | Strided access patterns |
| VPU | ReLU | max(x, 0) for all elements |
| VPU | Vector Add | Saturating addition on overflow |
| VPU | Sum Reduce | Accumulation across vector |
| LCP | Simple program | NOP → TENSOR → HALT sequence |
| LCP | Loop program | LOOP 3 iterations verification |
| MXU | 2×2, 3×3, 4×4 GEMM | Matrix multiply correctness |
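The Vector Add test above implies saturating INT8 arithmetic: on overflow, results clamp to the representable range instead of wrapping. A minimal sketch of the clamping behavior (an assumption for illustration, not code taken from vpu_model.py):

```python
def sat_add_int8(a, b):
    """Saturating INT8 add: clamp the full-precision sum to
    [-128, 127] instead of letting it wrap in 2's complement."""
    s = int(a) + int(b)
    return max(-128, min(127, s))

assert sat_add_int8(100, 100) == 127    # would wrap to -56 in 2's complement
assert sat_add_int8(-100, -100) == -128
assert sat_add_int8(3, 4) == 7          # in-range values are unaffected
```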
The TPC model runs complete programs that exercise multiple units:
# TEST 3: DMA LOAD → VPU RELU pipeline
program = [
make_dma_instr(DMAOp.LOAD, ext_addr=0x00, int_addr=0x100),
make_vpu_instr(VPUOp.RELU, src_a=0x100, dst=0x200),
make_halt()
]
# Verifies: DMA loads from AXI, VPU processes, correct data flow

The Python model was developed alongside RTL fixes:
| Bug Found | Python Model | RTL Fix |
|---|---|---|
| DMA STORE timing | S_STORE_CAP state added | dma_engine.v line 180 |
| SRAM read latency | _rdata_next pipeline | sram_bank.v registered output |
| VPU data capture | WAIT_A1 → WAIT_A2 states | vector_unit.v timing |
# NumPy golden reference
expected = np.dot(A.astype(np.int32), B.astype(np.int32))
# Python model result
C = tpc.read_sram(0x200, 4).reshape(2, 2)
# Must match exactly
assert np.array_equal(C, expected.astype(np.int8))

# Python 3.8+ with NumPy
pip install numpy

cd tensor_accelerator
# DMA Engine (4 tests)
python3 model/dma_model.py
# Output: ALL DMA MODEL TESTS PASSED!
# Vector Processing Unit (3 tests)
python3 model/vpu_model.py
# Output: ALL VPU TESTS PASSED!
# Local Command Processor (2 tests)
python3 model/lcp_model.py
# Output: ALL LCP TESTS PASSED!
# Systolic Array / MXU (3 tests)
python3 model/systolic_array_model.py
# Output: ALL TESTS PASSED!
# Full TPC Integration (3 tests)
python3 model/tpc_model.py
# Output: ALL TPC INTEGRATED TESTS PASSED!

./run_tests.sh
# Runs 22 tests total:
# - 18 RTL tests (requires iverilog)
# - 4 Python model tests (requires Python + NumPy)

# Enable cycle-by-cycle tracing
tpc = TPCModel(verbose=True)
tpc.run()
# Output:
# [LCP @ 1] FETCH PC= 0 | Fetching
# [LCP @ 3] DECODE PC= 0 | Decoded: op=0x03 subop=0x01
# [DMA @ 5] IDLE | CMD: LOAD
# [DMA @ 7] LOAD_ADDR | AR addr=0x0
# [DMA @ 11] LOAD_DATA | R data=0xdead0000
# ...

from model.tpc_model import TPCModel, make_mxu_instr, make_vpu_instr, make_halt
tpc = TPCModel()
# Load your matrices
tpc.load_sram(0x000, weights)
tpc.load_sram(0x100, activations)
# Run GEMM + ReLU
tpc.load_program([
make_mxu_instr(0x000, 0x100, 0x200, M=4, K=4, N=4),
make_vpu_instr(VPUOp.RELU, 0x200, 0, 0x300, length=1),
make_halt()
])
tpc.run()
# Read results
output = tpc.read_sram(0x300, 16).reshape(4, 4)

# Run Python model, capture intermediate values
tpc = TPCModel(verbose=True)
tpc.run()
# Export SRAM state for RTL $readmemh
with open('test_vectors.hex', 'w') as f:
    for addr in range(256):
        f.write(f'{tpc.sram.mem[addr]:064x}\n')

tpc = TPCModel()
tpc.load_program(my_program)
tpc.run()
print(f"Total cycles: {tpc.cycle}")
# Use for performance optimization

// The systolic array computes C = A × B
// Weight-stationary dataflow:
// 1. Load weights (B) into PEs - they stay in place
// 2. Stream activations (A) from left
// 3. Accumulate partial sums flowing down
// 4. Results emerge from bottom
// Each PE computes:
psum_out = psum_in + (activation * weight)

# ResNet Convolution Kernel
LOOP 64 # 64 output channels
DMA.LOAD_2D W_SRAM, W_DDR, 16, 16, 256
DMA.LOAD_2D A_SRAM, A_DDR, 16, 16, 256
TENSOR.GEMM OUT_SRAM, A_SRAM, W_SRAM, 16, 16, 16
VEC.RELU OUT_SRAM, OUT_SRAM
DMA.STORE_2D OUT_DDR, OUT_SRAM, 16, 16, 256
ENDLOOP
HALT

# Compile assembly to hex (for Verilog $readmemh)
python3 sw/assembler/assembler.py sw/examples/resnet_conv.asm -o program.hex
# Output: 128-bit instructions
# 05000000000000000000004000000000 // LOOP 64
# 03010400000000100010001001000000 // DMA.LOAD_2D
# ...
# ff000000000000000000000000000000 // HALT

See ISA_GUIDE.md for complete instruction encoding details.
Contributions welcome! Please read the documentation first, especially:
- VERILOG_TUTORIAL.md - Understand the design
- TEST_FLOW.md - How to verify changes
MIT License - see LICENSE for details.
- Inspired by Google TPU, NVIDIA Tensor Cores, and academic systolic array research
- Built with guidance from Anthropic's Claude
⭐ Star this repo if you find it useful!