Target device: Xilinx Zynq-7020 (xc7z020clg484-1) | Arithmetic: Q16.16 fixed-point | Simulation: Vivado XSim | Training: PyTorch 2.x
This project implements three progressively improved neural network architectures for handwritten digit recognition (MNIST) directly in synthesizable SystemVerilog — a Multi-Layer Perceptron (MLP), a 1D Convolutional Neural Network, and a 2D Convolutional Neural Network. Each model is trained in PyTorch, weights are exported to .mem files in Q16.16 fixed-point format, and the RTL is verified end-to-end in Vivado XSim before synthesis on a physical Zynq-7020 FPGA.
The project traces a complete engineering journey: starting from a simple MLP baseline, identifying its hardware cost bottlenecks, redesigning with 1D convolution to cut weights 19×, then moving to 2D convolution to achieve 98.35% accuracy with only 1,728 weights — the most resource-efficient design on this chip.
| Stage | Architecture | Weights | DSPs Used | Sim Accuracy | Notes |
|---|---|---|---|---|---|
| Stage 1 | MLP 784→10→10 | 7,950 | 20 | 89.08% | Baseline, synthesizable |
| Stage 1b | MLP 784→256→128→64→10 | 242,304 | 458 | ~96.8% | Simulation only — exceeds DSP limit |
| Stage 2 | 1D CNN (4ch→8ch) + FC(32→10) | 12,778 | 54 | ~94% | 19× weight reduction vs large MLP |
| Stage 3 | 2D CNN (4ch→8ch) + FC(32→10) | 1,728 | 54 | 98.35% | 7× fewer weights than 1D CNN, best accuracy |
The architecture evolution is driven entirely by FPGA constraints. The large MLP hits a hard wall: the xc7z020 has 220 DSP48E1 slices and the 784→256→128→64→10 topology needs 458. Switching to a convolutional front-end reduces the weight count dramatically (shared filters), and moving to 2D convolution better captures the spatial structure of the 28×28 image — giving higher accuracy with fewer parameters.
┌───────────────────────────────────────────────────────────────────────┐
│ Input [784 × Q16.16] │
│ │ │
│ │ w1[784×10] + b1[10] │
│ ▼ │
│ Layer 1 [10 neurons] ──► ReLU ──► [10 × Q24.16] │
│ │ │
│ │ w2[10×10] + b2[10] │
│ ▼ │
│ Layer 2 [10 logits] ──► argmax ──► Predicted Digit (0–9) │
└───────────────────────────────────────────────────────────────────────┘
Datapath: Each neuron performs a sequential MAC (multiply-accumulate) stepped by a counter. One 32-bit × 32-bit Q16.16 multiply yields a 64-bit product, which is accumulated then right-shifted by 16 to return to Q16.16. The counter fires done when all inputs are consumed, chaining Layer 1 → Layer 2 automatically.
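The counter-stepped MAC described above can be modelled in a few lines of Python. This is a minimal sketch, not the RTL: the names `to_q16` and `neuron_mac` are hypothetical, and adding the bias after the shift and before ReLU is an assumption about ordering that the text does not spell out.

```python
# Integer model of one sequential Q16.16 neuron.
# Names and the bias/ReLU ordering are illustrative assumptions, not taken from the RTL.

FRAC_BITS = 16


def to_q16(x: float) -> int:
    """Quantize a real number to a signed Q16.16 integer."""
    return int(round(x * (1 << FRAC_BITS)))


def neuron_mac(inputs, weights, bias, relu=True):
    """Accumulate full-width products one input per step, then rescale once."""
    acc = 0  # Python ints are unbounded; a hardware accumulator has finite width
    for x, w in zip(inputs, weights):  # one loop iteration per counter step
        acc += x * w  # Q16.16 * Q16.16 gives a product with 32 fractional bits
    out = (acc >> FRAC_BITS) + bias  # arithmetic shift back to Q16.16, then bias
    return max(out, 0) if relu else out


xs = [to_q16(v) for v in (0.5, -1.0, 0.25)]
ws = [to_q16(v) for v in (2.0, 0.5, -4.0)]
y = neuron_mac(xs, ws, to_q16(0.75))
print(y, y / (1 << FRAC_BITS))
```

With these inputs the dot product is -0.5, so the bias of 0.75 lifts the result to 0.25 before ReLU; with a zero bias the ReLU would clamp the output to 0.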
RTL modules: input_layer.sv → neuron_inputlayer.sv → multiplier.sv → adder.sv → ReLu.sv → hidden_layer.sv → neuron_hiddenlayer.sv
Weights (20 files): mlp_weights/w1_1.mem … w1_10.mem, w2_1.mem … w2_10.mem, b1.mem, b2.mem
The larger 784→256→128→64→10 MLP exists in `neural_network_param.sv` for simulation and study. It cannot be synthesized on this chip; see the Synthesis Failure Analysis section.
┌───────────────────────────────────────────────────────────────────────┐
│ Input [784 × 1ch × Q16.16] │
│ │ │
│ │ Conv1d kernel=5, 4 filters (20 weights) │
│ ▼ │
│ [780 × 4ch] ──► MaxPool1d(4) ──► [195 × 4ch] │
│ │ │
│ │ Conv1d kernel=3, 8 filters (96 weights) │
│ ▼ │
│ [193 × 8ch] ──► MaxPool1d(4) ──► [48 × 8ch] ──► Flatten [384] │
│ │ │
│ │ FC 384→32 (12,288 weights) + FC 32→10 (320 weights) │
│ ▼ │
│ Logits [10] ──► argmax ──► Predicted Digit │
└───────────────────────────────────────────────────────────────────────┘
Datapath: The fused conv_pool_1d.sv module slides a kernel across the input, applies ReLU, and immediately max-pools — processing one filter at a time so the intermediate conv buffer stays internal (saving ~100 K bits of exposed wiring). The FC head uses layer_seq.sv, a serial MAC module that reads weights from a single BRAM ROM one element at a time (1 DSP per FC layer instead of 32 + 10 in the original parallel design).
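As a software reference for the fused behaviour described above, the sketch below convolves, applies ReLU, and pools one filter at a time, keeping only the running maximum of each pool window. It is an assumption-laden model rather than a translation of `conv_pool_1d.sv`: the function name, the argument layout, and the point where the bias is added are all hypothetical.

```python
# Reference model of fused Conv1d + ReLU + MaxPool1d on Q16.16 integers.
# Layouts are assumptions: x is [c_in][length], weights is [filters][c_in][k].

def conv_pool_1d(x, weights, biases, pool=4, frac_bits=16):
    length = len(x[0])
    k = len(weights[0][0])
    conv_len = length - k + 1  # valid convolution, stride 1
    out = []
    for w_f, b_f in zip(weights, biases):  # one filter at a time
        pooled = []
        # Non-overlapping pool windows; a trailing partial window is dropped.
        for start in range(0, conv_len - pool + 1, pool):
            best = None
            for pos in range(start, start + pool):
                acc = 0
                for ch, w_ch in enumerate(w_f):
                    for t, w in enumerate(w_ch):
                        acc += x[ch][pos + t] * w
                val = max((acc >> frac_bits) + b_f, 0)  # bias, then ReLU
                best = val if best is None else max(best, val)
            pooled.append(best)  # the conv values themselves are never stored
        out.append(pooled)
    return out


ramp = [[i << 16 for i in range(784)]]        # 1 channel, 784 samples
identity = [[[1 << 16, 0, 0, 0, 0]]] * 4      # 4 filters that pass x[pos] through
pooled = conv_pool_1d(ramp, identity, [0] * 4)
print(len(pooled), len(pooled[0]))  # 4 195
```

Because only `best` survives each window, the model needs no 780-element intermediate buffer per filter, which is the property the fused module relies on.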
RTL modules: cnn_top.sv → conv_pool_1d.sv, layer_seq.sv
Weights (8 files): cnn_weights/conv1_w.mem, conv1_b.mem, conv2_w.mem, conv2_b.mem, fc1_w.mem, fc1_b.mem, fc2_w.mem, fc2_b.mem
┌───────────────────────────────────────────────────────────────────────┐
│ Input [28×28 × 1ch × Q16.16] │
│ │ │
│ │ Conv2d 3×3, 4 filters (36 weights) │
│ ▼ │
│ [26×26 × 4ch] ──► ReLU ──► MaxPool2d(2×2) ──► [13×13 × 4ch] │
│ │ │
│ │ Conv2d 3×3, 8 filters (288 weights) │
│ ▼ │
│ [11×11 × 8ch] ──► ReLU ──► MaxPool2d(2×2) ──► [5×5 × 8ch] │
│ │ │
│ │ Flatten [200] → FC 200→32 → FC 32→10 │
│ ▼ │
│ Logits [10] ──► argmax ──► Predicted Digit │
└───────────────────────────────────────────────────────────────────────┘
Why 2D wins: The 2×2 max-pool after each 3×3 conv reduces spatial size aggressively — 28×28 → 13×13 → 5×5. This gives a flatten of only 200 elements (vs 384 for 1D), so the FC head needs far fewer weights. Meanwhile the 3×3 kernel in 2D actually sees both horizontal and vertical patterns in the digit, which 1D convolution treating the image as a flat signal cannot do.
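The flatten sizes quoted above follow from two small formulas, sketched here under the assumption of valid (unpadded) stride-1 convolutions and non-overlapping pooling with floor division. The helper names are illustrative.

```python
# Trace feature-map sizes for both CNNs from the layer parameters in the diagrams.

def conv_out(n, k):
    """Output length of a valid, stride-1 convolution."""
    return n - k + 1


def pool_out(n, p):
    """Output length of non-overlapping max-pooling (floor)."""
    return n // p


# 1D CNN: 784 -> conv k=5 -> pool 4 -> conv k=3 -> pool 4, 8 channels out
n = pool_out(conv_out(784, 5), 4)   # 780 -> 195
n = pool_out(conv_out(n, 3), 4)     # 193 -> 48
flat_1d = n * 8

# 2D CNN: 28x28 -> conv 3x3 -> pool 2x2 -> conv 3x3 -> pool 2x2, 8 channels out
s = pool_out(conv_out(28, 3), 2)    # 26 -> 13
s = pool_out(conv_out(s, 3), 2)     # 11 -> 5
flat_2d = s * s * 8

print(flat_1d, flat_2d)             # 384 200
print(flat_1d * 32, flat_2d * 32)   # FC1 weight counts: 12288 6400
```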
Datapath: conv2d.sv uses a nested loop state machine — outer loop over output pixel positions (height × width), inner loop over kernel taps (9 taps × channels). The data_idx for each tap is computed as ch*(H*W) + (row+kr)*W + (col+kc). maxpool2d.sv initializes each channel's max to the most-negative representable value, then compares over the 2×2 window.
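The index formula and the max-pool initialization can be checked with a small integer model. This is a sketch for a single filter: the flat weight layout `[ch][kr][kc]` and all function names are assumptions, and the demo uses an all-ones box filter as a convenient check pattern rather than reproducing `tb_conv2d_box.sv`.

```python
# Flat-indexed Conv2d and MaxPool2d on Q16.16 integers (one filter).

Q_MIN = -(1 << 31)  # most-negative 32-bit signed value


def data_idx(ch, row, col, kr, kc, H, W):
    """Flat index of the input sample under kernel tap (kr, kc)."""
    return ch * (H * W) + (row + kr) * W + (col + kc)


def conv2d_flat(data, weights, bias, C, H, W, K=3, frac_bits=16):
    out = []
    for row in range(H - K + 1):          # outer loop: output positions
        for col in range(W - K + 1):
            acc = 0
            for ch in range(C):           # inner loop: kernel taps x channels
                for kr in range(K):
                    for kc in range(K):
                        w = weights[ch * K * K + kr * K + kc]
                        acc += data[data_idx(ch, row, col, kr, kc, H, W)] * w
            out.append((acc >> frac_bits) + bias)
    return out


def maxpool2d_flat(fmap, H, W, P=2):
    out = []
    for r in range(0, H - P + 1, P):
        for c in range(0, W - P + 1, P):
            best = Q_MIN                  # start below any representable value
            for dr in range(P):
                for dc in range(P):
                    best = max(best, fmap[(r + dr) * W + (c + dc)])
            out.append(best)
    return out


H = W = 28
ones = [1 << 16] * (H * W)                # image of 1.0 everywhere
box = [1 << 16] * 9                       # 3x3 box filter of 1.0
fmap = [max(v, 0) for v in conv2d_flat(ones, box, 0, C=1, H=H, W=W)]  # ReLU
pooled = maxpool2d_flat(fmap, H - 2, W - 2)
print(len(fmap), len(pooled), pooled[0] / 65536)
```

Starting each window at `Q_MIN` matters when every value in the window is negative: initializing to 0 would silently act as an extra ReLU.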
RTL modules: cnn2d_top.sv → conv2d.sv, maxpool2d.sv, layer.sv
Weights (8 files): cnn2d_weights/conv1_w.mem, conv1_b.mem, conv2_w.mem, conv2_b.mem, fc1_w.mem, fc1_b.mem, fc2_w.mem, fc2_b.mem
All three designs are verified in Vivado XSim behavioral simulation using real MNIST test images exported in Q16.16 format. The testbench loads weight .mem files, drives rstn, waits for the done signal, then reads the output logits.
| Signal | Expected behaviour |
|---|---|
| `rstn` | Pulses low at t=0, goes high at ~20 ns to start inference |
| `done` / `counter_donestatus` | Rises once all MACs are complete for a layer |
| `neuralnet_out[0:9]` / `cnn_out[0:9]` | Output logits in Q16.16; the highest value's index is the predicted digit |
| `pred_out` | Argmax output; should match `expected_label.mem` |
- Test image: index 100 — true label 6, predicted 6 ✅
- Runtime: ~20,000 ns
- Layer 1 fires after ~10,000 ns; Layer 2 fires ~4,000 ns later
- Test image: true label 6, predicted 6 ✅
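- Runtime: FC1 reports done at 88,345,000 on the log's time axis, the last stage logged before FC2 runs (conv layers are sequential and take most cycles). The log labels its timestamps ns but prints the ~20 ns reset release as 20000, so the values appear to be picoseconds, i.e. about 88 µs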
- Verified with a dedicated box-filter unit test (`tb_conv2d_box.sv`): 32/32 exact match against Python reference
Full simulation console output
============================================================
1D CNN TESTBENCH - LOADING DATA
============================================================
[INFO] Loading Conv1 weights (conv1_w.mem) - 20 entries ...
[INFO] Loading Conv1 biases (conv1_b.mem) - 4 entries ...
[INFO] Loading Conv2 weights (conv2_w.mem) - 96 entries ...
[INFO] Loading Conv2 biases (conv2_b.mem) - 8 entries ...
[INFO] Loading FC1 weights (fc1_w.mem) - 32 neurons — 424 entries ...
[INFO] Loading FC1 biases (fc1_b.mem) - 32 biases ...
[INFO] Loading FC2 weights (fc2_w.mem) - 10 neurons — 72 entries ...
[INFO] Loading FC2 biases (fc2_b.mem) - 10 biases ...
[INFO] Loading input data (data_in.mem) - 784 pixels ...
[INFO] Loading expected label (expected_label.mem) ...
[INFO] Expected label: 6
[INFO] Applying reset ...
[INFO] Reset released at 20000 ns. Inference running ...
[INFO] Conv1 DONE at 46835000 ns. Pool1 starting ...
[INFO] Pool1 DONE at 56605000 ns. Conv2 starting ...
[INFO] Conv2 DONE at 81715000 ns. Pool2 starting ...
[INFO] Pool2 DONE at 84135000 ns. FC1 starting ...
[INFO] FC1 DONE at 88345000 ns. FC2 starting ...
*** RESULT: PASS - Prediction matches expected label! ***
============================================================
CNN INFERENCE COMPLETE - RESULTS
============================================================
CNN OUTPUT VALUES (Q16.16 raw logits)
------------------------------------------------------------
Output[0] (digit 0) = -140125
Output[1] (digit 1) = -45700
Output[2] (digit 2) = -37000
Output[3] (digit 3) = -479983
Output[4] (digit 4) = 242062
Output[5] (digit 5) = -15346
Output[6] (digit 6) = 650351
Output[7] (digit 7) = -363607
Output[8] (digit 8) = 116136
Output[9] (digit 9) = -164515
>>> DETECTED DIGIT: 6 <<<
>>> Confidence (raw Q16.16 logit): 650351 <<<
--- EXPECTED DIGIT: 6 ---
*** RESULT: PASS - Prediction matches expected label! ***
- Test image: true label 9, predicted 9 ✅
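- Runtime: FC1 reports done at 124,535,000 on the log's time axis, the last stage logged before FC2 runs. The log labels its timestamps ns but prints the ~20 ns reset release as 20000, so the values appear to be picoseconds, i.e. about 125 µs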
- Software accuracy: 98.35% on the full 10,000-image MNIST test set
Full simulation console output
============================================================
2D CNN TESTBENCH - LOADING DATA
============================================================
[INFO] Loading Conv1 weights (conv1_w.mem) - 36 entries ...
[INFO] Loading Conv1 biases (conv1_b.mem) - 4 entries ...
[INFO] Loading Conv2 weights (conv2_w.mem) - 288 entries ...
[INFO] Loading Conv2 biases (conv2_b.mem) - 8 entries ...
[INFO] Loading FC1 weights (fc1_w.mem) - 32 neurons — 240 entries ...
[INFO] Loading FC1 biases (fc1_b.mem) - 32 biases ...
[INFO] Loading FC2 weights (fc2_w.mem) - 10 neurons — 72 entries ...
[INFO] Loading FC2 biases (fc2_b.mem) - 10 biases ...
[INFO] Loading input data (data_in.mem) - 784 pixels (28x28) ...
[INFO] Loading expected label (expected_label.mem) ...
[INFO] Expected label: 9
[INFO] Applying reset ...
[INFO] Reset released at 20000 ns. Inference running ...
[INFO] Conv1 DONE at 67635000 ns. Pool1 starting ...
[INFO] Pool1 DONE at 76105000 ns. Conv2 starting ...
[INFO] Conv2 DONE at 120895000 ns. Pool2 starting ...
[INFO] Pool2 DONE at 122165000 ns. FC1 starting ...
[INFO] FC1 DONE at 124535000 ns. FC2 starting ...
============================================================
2D CNN INFERENCE COMPLETE - RESULTS
============================================================
2D CNN OUTPUT VALUES (Q16.16 raw logits)
------------------------------------------------------------
Output[0] (digit 0) = -476696
Output[1] (digit 1) = -433495
Output[2] (digit 2) = -329567
Output[3] (digit 3) = 211216
Output[4] (digit 4) = -104097
Output[5] (digit 5) = 11031
Output[6] (digit 6) = -1333607
Output[7] (digit 7) = -97071
Output[8] (digit 8) = -146093
Output[9] (digit 9) = 578677
>>> DETECTED DIGIT: 9 <<<
>>> Confidence (raw Q16.16 logit): 578677 <<<
--- EXPECTED DIGIT: 9 ---
*** RESULT: PASS - Prediction matches expected label! ***
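The raw logits printed above are Q16.16 integers, so dividing by 65536 gives their real values. The short sketch below decodes the 2D CNN outputs from the log and repeats the argmax in software; the variable names are illustrative.

```python
# Decode the 2D CNN logits from the console log and take the argmax.

raw_logits = [-476696, -433495, -329567, 211216, -104097,
              11031, -1333607, -97071, -146093, 578677]

real = [v / 65536 for v in raw_logits]               # Q16.16 -> real value
pred = max(range(len(raw_logits)), key=lambda i: raw_logits[i])

for digit, value in enumerate(real):
    print(f"digit {digit}: {value:+.3f}")
print("predicted digit:", pred)
```

The argmax is taken on the raw integers, which is what a hardware comparator would see; scaling by a positive constant does not change the winner.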
The synthesizable wrapper modules (mlp_synth_top.sv, cnn1d_synth_top.sv, cnn2d_synth_top.sv) embed all weights as internal register arrays initialized via $readmemh — a Vivado-supported ROM initialization method. Zero logic changes to any compute module.
The wrappers expose four ports to the synthesis tool: clk, rstn, pixel_in[0:783][31:0] (784 Q16.16 pixels), and pred_out[3:0] (argmax class 0–9). Vivado sees real timing paths from input pixels through all MAC stages to the output register.
The 784→10→10 MLP is the simplest design. It uses ~20 DSP48E1 slices (one per neuron across the two 10-neuron layers) and moderate LUT-RAM for the 7,950-word weight store. Timing closure at 100 MHz is straightforward due to the shallow network depth, and power consumption is modest: primarily dynamic switching in the MAC datapath, with a static contribution from the LUT-RAM weight store.
1D CNN utilization: expected ~14 DSP48E1 (4 + 8 conv filters processed sequentially in conv_pool_1d, 1 DSP for FC1, 1 DSP for FC2 via layer_seq). FC weights are stored in BRAM (~15 BRAM36k blocks) instead of LUT-RAM.
1D CNN timing: the conv_pool_1d state machine processes one filter at a time; all paths are purely sequential, so timing closure at 50–100 MHz is comfortable.
1D CNN power: the activity factor is higher than for the MLP because the convolution state machines run for hundreds of thousands of cycles per inference.
2D CNN utilization: expected ~14 DSP48E1 (sequential per-filter processing in conv_pool_2d, 1 DSP each for FC1/FC2 via layer_seq). FC weights are stored in BRAM, with a smaller weight footprint than the 1D CNN (200×32 = 6,400 vs 384×32 = 12,288 words).
2D CNN timing: the conv_pool_2d nested loop has a longer critical path than conv_pool_1d; verify slack is positive at your target clock.
2D CNN power: comparable to the 1D CNN, since the conv loops dominate dynamic power.
neural_network_param.sv implements 784→256→128→64→10. It is provided for simulation and educational comparison, not for deployment:
| Resource | Required | Available (xc7z020) | Status |
|---|---|---|---|
| DSP48E1 | 458 | 220 | ❌ 2.1× over limit |
| Weight storage (LUT-RAM, 32-bit words) | 242,304 | 53,200 | ❌ 4.6× over limit |
| Weight storage (BRAM, 32-bit words) | 242,304 | 143,360 | ❌ 1.7× over limit |
All three resources fail simultaneously — no amount of floor-planning can fix this on the xc7z020. The design would need a larger device (e.g., xc7z045) or a streaming weight architecture where weights are read from external DDR one row at a time.
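The figures in the table can be re-derived from the layer widths alone. The sketch below assumes one DSP per neuron (the counter-based MAC scheme) and counts weights without biases; `mlp_cost` is a hypothetical helper, and the limits are the ones quoted in the table.

```python
# Re-derive the large MLP's resource demand from its layer widths.

DSP_LIMIT = 220          # DSP48E1 slices on the xc7z020
LUTRAM_WORDS = 53_200    # 32-bit words, as quoted in the table
BRAM_WORDS = 143_360     # 32-bit words, as quoted in the table


def mlp_cost(layers):
    """Return (weight count, DSP count) assuming one multiplier per neuron."""
    weights = sum(a * b for a, b in zip(layers, layers[1:]))
    dsps = sum(layers[1:])
    return weights, dsps


weights, dsps = mlp_cost([784, 256, 128, 64, 10])
checks = [
    ("DSP48E1", dsps, DSP_LIMIT),
    ("LUT-RAM words", weights, LUTRAM_WORDS),
    ("BRAM words", weights, BRAM_WORDS),
]
for label, need, have in checks:
    print(f"{label}: need {need}, have {have}, {need / have:.1f}x")
```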
Full tables are in docs/FPGA_RESOURCE_LIMITS.md. The hard ceiling is the 220 DSP48E1 count; weight storage is the secondary limit.
| Architecture | Best config (no BRAM) | Best config (with BRAM) | Limiting factor |
|---|---|---|---|
| MLP | 784 → 67 → 10 | 784 → 180 → 10 | Weight storage |
| 1D CNN (C1=4, C2=8) | FC1 = 134 neurons | FC1 = 198 neurons | Weight → DSP |
| 2D CNN (F1=4, F2=8) | FC1 = 198 neurons | FC1 = 198 neurons | DSP (even without BRAM!) |
The 2D CNN is uniquely efficient: its small flatten size (200) means weight storage is never the bottleneck — the design hits the DSP ceiling before it runs out of memory, with weights occupying only 41,904 of the 53,200 available LUT-RAM words.
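The table entries can be reproduced with a simple budget search. The cost model below is inferred rather than taken from docs/FPGA_RESOURCE_LIMITS.md: one DSP per conv filter and per FC neuron, weights counted without biases, and the BRAM column using the 143,360-word BRAM budget on its own. These assumptions were chosen because they match the numbers in the table; `max_fc1` and `MODELS` are hypothetical names.

```python
# Largest FC1 (or MLP hidden layer) that fits the xc7z020 budgets.

DSP_BUDGET = 220
LUTRAM_WORDS = 53_200
BRAM_WORDS = 143_360
CLASSES = 10


def max_fc1(flatten, conv_weights, conv_dsps, word_budget):
    # Each FC1 neuron costs `flatten` input weights plus CLASSES output weights.
    by_words = (word_budget - conv_weights) // (flatten + CLASSES)
    # Each FC1 neuron costs one DSP; conv filters and output neurons are fixed.
    by_dsps = DSP_BUDGET - conv_dsps - CLASSES
    return min(by_words, by_dsps)


MODELS = {
    "MLP": dict(flatten=784, conv_weights=0, conv_dsps=0),
    "1D CNN": dict(flatten=384, conv_weights=20 + 96, conv_dsps=4 + 8),
    "2D CNN": dict(flatten=200, conv_weights=36 + 288, conv_dsps=4 + 8),
}

table = {
    name: (max_fc1(word_budget=LUTRAM_WORDS, **m),
           max_fc1(word_budget=BRAM_WORDS, **m))
    for name, m in MODELS.items()
}
for name, (no_bram, with_bram) in table.items():
    print(f"{name}: {no_bram} neurons without BRAM, {with_bram} with BRAM")

n = table["2D CNN"][0]
words_2d = (36 + 288) + n * (200 + CLASSES)
print("2D CNN weight words at the DSP ceiling:", words_2d)
```

Under this model the 2D CNN reaches 198 FC1 neurons on the DSP limit alone, and its weight count at that point is the 41,904 words quoted above.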
| Model | Testbench | Weight folder | Sim time |
|---|---|---|---|
| MLP | `tb_neuralnetwork.sv` | `mlp_weights/` | 25,000 ns |
| 1D CNN | `tb_cnn.sv` | `cnn_weights/` | 200,000 ns |
| 2D CNN | `tb_cnn2d.sv` | `cnn2d_weights/` | 200,000 ns |
- Vivado → Create Project → RTL Project → device `xc7z020clg484-1`
- Add Sources → add all `verilog_files/*.sv`; set the relevant testbench as simulation top
- Copy all `.mem` files from the relevant weights folder to the Vivado simulation working directory (typically `<project>/<project>.sim/sim_1/behav/xsim/`)
- Run Simulation → in the Tcl console: `run 200000ns`
- In the waveform viewer, add `rstn`, `clk`, and the output signals to verify correctness
- Vivado → Create Project → RTL Project → `xc7z020clg484-1`
- Add all `verilog_files/*.sv` as sources (no testbenches)
- Set the desired synthesis top:
  - MLP → `mlp_synth_top`
  - 1D CNN → `cnn1d_synth_top`
  - 2D CNN → `cnn2d_synth_top`
- Project Settings → General → IP → File Search Paths → add the repository root (so Vivado resolves paths like `mlp_weights/w1_1.mem`)
- Run Synthesis → Run Implementation → open reports:
  - Reports → Report Utilization: LUT, FF, DSP, BRAM counts
  - Reports → Report Timing Summary: worst negative slack (WNS)
  - Reports → Report Power: dynamic + static power breakdown
cd python_files
# MLP 784→10→10 — trains and writes mlp_weights/*.mem
python mlp_simple_model.py
# 1D CNN — trains and writes cnn_weights/*.mem
python cnn_model.py
# 2D CNN — trains and writes cnn2d_weights/*.mem
python cnn2d_model.py
# Change the test image for any model (index 0–9999)
python cnn2d_test_image.py 42
python cnn_test_image.py 42

FPGA_NN-main/
│
├── verilog_files/ SystemVerilog source files
│ ├── ── Compute modules ─────────────────────────────── (never modified)
│ │ ├── neural_network.sv MLP 784→10→10
│ │ ├── neural_network_param.sv MLP 784→256→128→64→10 (sim-only)
│ │ ├── input_layer.sv First FC layer (ReLU)
│ │ ├── hidden_layer.sv Hidden / output FC layer
│ │ ├── neuron_inputlayer.sv Single neuron: MAC + ReLU
│ │ ├── neuron_hiddenlayer.sv Single neuron: MAC only
│ │ ├── layer.sv Generic counter-based FC layer (sim-only)
│ │ ├── layer_seq.sv Sequential FC layer (BRAM weights, 1 DSP)
│ │ ├── cnn_top.sv 1D CNN top-level (synthesis-ready)
│ │ ├── conv_pool_1d.sv Fused Conv1D + MaxPool1D
│ │ ├── conv1d.sv 1D convolution state machine (sim-only)
│ │ ├── maxpool1d.sv 1D max-pooling (sim-only)
│ │ ├── cnn2d_top.sv 2D CNN top-level (synthesis-ready)
│ │ ├── conv_pool_2d.sv Fused Conv2D + MaxPool2D
│ │ ├── conv2d.sv 2D convolution state machine (sim-only)
│ │ ├── maxpool2d.sv 2D max-pooling (sim-only)
│ │ ├── multiplier.sv Q16.16 × Q16.16 → Q16.16
│ │ ├── adder.sv Signed accumulator
│ │ ├── ReLu.sv ReLU activation + bias
│ │ ├── register.sv Pipeline register
│ │ └── counter.sv Timing / sequencing counter
│ │
│ ├── ── Synthesizable wrappers ──────────────────────── (weights as ROM)
│ │ ├── mlp_synth_top.sv Synthesis entry point: MLP
│ │ ├── cnn1d_synth_top.sv Synthesis entry point: 1D CNN
│ │ └── cnn2d_synth_top.sv Synthesis entry point: 2D CNN
│ │
│ └── ── Testbenches ─────────────────────────────────── (sim only)
│ ├── tb_neuralnetwork.sv MLP testbench
│ ├── tb_neuralnetwork_param.sv Large MLP testbench
│ ├── tb_cnn.sv 1D CNN testbench
│ ├── tb_cnn2d.sv 2D CNN testbench
│ └── tb_conv2d_box.sv 2D conv unit test (box filter)
│
├── python_files/ PyTorch training and weight export
│ ├── mlp_simple_model.py Train 784→10→10, export → mlp_weights/
│ ├── cnn_model.py Train 1D CNN, export → cnn_weights/
│ ├── cnn2d_model.py Train 2D CNN, export → cnn2d_weights/
│ ├── cnn_test_image.py Export MNIST test image (1D CNN)
│ ├── cnn2d_test_image.py Export MNIST test image (2D CNN)
│ └── input.py Export MNIST test image (MLP)
│
├── mlp_weights/ .mem files: MLP weights + test image
├── cnn_weights/ .mem files: 1D CNN weights + test image
├── cnn2d_weights/ .mem files: 2D CNN weights + test image
│
├── images/
│ ├── README.md Screenshot naming guide + Vivado steps
│ ├── mlp/ ← drop simulation.jpeg, utilization.jpeg,
│ ├── 1dcnn/ timing.jpeg, power.jpeg
│ └── 2dcnn/ here for each model
│
└── docs/
├── CNN_PROJECT_DOCUMENTATION.md 1D CNN full design document
├── CNN2D_PROJECT_DOCUMENTATION.md 2D CNN full design document
└── FPGA_RESOURCE_LIMITS.md Max network size tables for xc7z020
| Layer | Tools / Version |
|---|---|
| Hardware description | SystemVerilog (IEEE 1800-2012) |
| Simulation | Xilinx Vivado XSim |
| Synthesis / P&R | Xilinx Vivado 2023+ |
| Target FPGA | xc7z020clg484-1 (Zynq-7020, speed grade -1) |
| Deep learning | PyTorch 2.x |
| Arithmetic format | Q16.16 fixed-point (32-bit signed two's complement) |
| Python | 3.10+ |
Q16.16 fixed-point throughout: Every weight, activation, bias, and pixel uses 32-bit signed fixed-point with 16 integer bits and 16 fractional bits. A multiply of two Q16.16 values produces a 64-bit result; right-shifting by 16 restores the Q16.16 scale. This avoids floating-point hardware entirely and maps onto DSP48E1 slices. The DSP48E1 multiplier is 25×18 bits signed, so a full 32×32 product is wider than a single slice's native multiply.
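To make the format concrete, the sketch below converts between real numbers and Q16.16, performs one multiply with the right-shift, and shows what happens when a result leaves the representable range. The helper names are hypothetical, and the wrap-around in `wrap32` is an assumption about plain 32-bit truncation; the RTL may handle overflow differently.

```python
# Q16.16 conversion, multiply, and range behaviour.

SCALE = 1 << 16


def to_q16(x: float) -> int:
    return int(round(x * SCALE))


def from_q16(q: int) -> float:
    return q / SCALE


def q_mul(a: int, b: int) -> int:
    """Full-width product, then arithmetic shift back to 16 fractional bits."""
    return (a * b) >> 16


def wrap32(v: int) -> int:
    """Keep the low 32 bits and reinterpret them as signed (assumed behaviour)."""
    return ((v + (1 << 31)) % (1 << 32)) - (1 << 31)


a, b = to_q16(3.25), to_q16(-1.5)
p = q_mul(a, b)
print(from_q16(p))                          # -4.875

print(from_q16((1 << 31) - 1))              # largest value, just under 32768
print(from_q16(1))                          # resolution, 2**-16

big = q_mul(to_q16(200.0), to_q16(200.0))   # 40000 does not fit in 16 integer bits
print(from_q16(wrap32(big)))                # -25536.0 after wrapping
```

The last line shows why activations and weights need to stay well inside ±32768: a product of two moderately large values can overflow even though each operand is representable.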
Counter-based MAC — Rather than unrolling all multiplications in parallel (which would require N DSPs per neuron), each FC neuron uses a single multiplier driven by a counter. The counter steps through all input weights sequentially, accumulating into a register. This trades inference latency (O(N) cycles) for a 1-DSP-per-neuron area cost — exactly the operating point where the 220-DSP xc7z020 fits all three models.
Synthesizable wrappers, unchanged DUTs — Testbenches load weights from .mem files and drive them as input ports. Ports carrying large arrays are not synthesizable (no constant driver). The wrapper modules (*_synth_top.sv) declare the same arrays as internal reg, initialize them with initial $readmemh, and wire them to the original DUT unchanged. Vivado treats initial-loaded reg arrays as ROM, inferring LUT-RAM or BRAM automatically based on size. Zero changes to any compute module.
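For reference, a file that `$readmemh` can load is plain text with one hexadecimal word per entry. The sketch below writes and re-reads such a file from Python; it assumes 8 hex digits per line with negative values stored as 32-bit two's complement, which may differ in detail from what the project's export scripts emit. `write_mem`, `read_mem`, and the file name are hypothetical.

```python
# Round-trip signed Q16.16 integers through a $readmemh-style text file.
import os
import tempfile


def write_mem(path, values):
    """Write one 32-bit word per line as 8 hex digits (two's complement)."""
    with open(path, "w") as f:
        for v in values:
            f.write(f"{v & 0xFFFFFFFF:08x}\n")


def read_mem(path):
    """Read hex words back as signed 32-bit integers, skipping // comments."""
    out = []
    with open(path) as f:
        for line in f:
            token = line.split("//")[0].strip()
            if not token:
                continue
            raw = int(token, 16)
            out.append(raw - (1 << 32) if raw & 0x80000000 else raw)
    return out


values = [65536, -32768, 0, -1]             # 1.0, -0.5, 0.0, -2**-16 in Q16.16
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_weights.mem")
    write_mem(path, values)
    with open(path) as f:
        text = f.read().split()
    restored = read_mem(path)

print(text)
print(restored)
```

A reader like this is handy for comparing exported weights against the PyTorch tensors before running a simulation.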
2D spatial pooling is the key — Moving from 1D to 2D convolution is not just about accuracy. The 2×2 max-pool after each 3×3 conv reduces the feature map from 26×26 → 13×13 → 11×11 → 5×5, giving a flatten of 200. The 1D equivalent after two pooling stages gives 384. This difference (200 vs 384) is why the 2D FC head needs 44% fewer weights, making the full 2D CNN DSP-bound (not weight-bound) on the xc7z020 even when no BRAM is used.







