This document records the performance benchmarks and hardware initialization status of the DE10-Nano video processing pipeline.

DMA benchmark results (2026-02-12):

| Test Case | Size | Software (cycles) | Hardware (cycles) | MB/s (HW) | Speedup |
|---|---|---|---|---|---|
| OCM to DDR | 4KB x 100 | 4,185,427 | 166,211 | 117.5 | 25 x |
| DDR to DDR | 1MB | 207,071,817 | 393,942 | 126.9 | 525 x |

> [!NOTE]
> DMA through Burst Master 4 takes the copy off the CPU and completes a 1 MB DDR-to-DDR transfer over 500x faster than the software copy loop.

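
The MB/s and speedup columns can be reproduced from the raw cycle counts. The sketch below assumes two things: the cycle counter runs at 50 MHz, and "MB" means 2^20 bytes. Neither value is stated in this report. They are inferred because together they reproduce the table's 117.5 and 126.9 MB/s figures. The function names are illustrative.

```python
# Reproduces the throughput/speedup columns from cycle counts.
# ASSUMPTIONS (not stated in the report): 50 MHz counter clock, MB = 2**20 bytes.
CLOCK_HZ = 50_000_000
MIB = 1 << 20

def throughput_mib_s(num_bytes: int, cycles: int, clock_hz: int = CLOCK_HZ) -> float:
    """Bytes moved per second, expressed in MiB/s."""
    seconds = cycles / clock_hz
    return num_bytes / seconds / MIB

def speedup(sw_cycles: int, hw_cycles: int) -> float:
    """How many times faster the DMA path is than the software copy."""
    return sw_cycles / hw_cycles

if __name__ == "__main__":
    ocm_bytes = 4 * 1024 * 100  # 4 KB x 100
    ddr_bytes = 1 * MIB         # 1 MB
    print(f"OCM->DDR HW: {throughput_mib_s(ocm_bytes, 166_211):.1f} MB/s")
    print(f"DDR->DDR HW: {throughput_mib_s(ddr_bytes, 393_942):.1f} MB/s")
    print(f"OCM->DDR speedup: {speedup(4_185_427, 166_211):.1f}x")
    print(f"DDR->DDR speedup: {speedup(207_071_817, 393_942):.1f}x")
```

Running it prints 117.5 and 126.9 MB/s and speedups of 25.2x and 525.6x. This matches the table, which lists the speedups as whole numbers.
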

Hardware initialization status:

- HDMI PLL: locked at ~37.8 MHz (960x540p60 target)
- ADV7513 HDMI transmitter: successfully configured over I2C
- Memory map: Nios II and DMA isolated at 0x20000000 (512 MB offset)
- Modular filter: 4-bit mode CSR verified; 60 fps throughput confirmed
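
The clock and address figures in the status list can be sanity-checked with a few lines of arithmetic. This sketch uses only numbers from the list. The actual horizontal and vertical totals (active area plus blanking) are not given in this report. The sketch therefore checks the per-frame clock budget rather than a specific video timing.

```python
# Arithmetic check of the status-list figures (values copied from the list).
ACTIVE_W, ACTIVE_H, FPS = 960, 540, 60
PIXEL_CLOCK_HZ = 37_800_000  # "~37.8 MHz" PLL output, treated as exact here
DDR_OFFSET = 0x20000000      # Nios II / DMA window base address

active_pixels = ACTIVE_W * ACTIVE_H   # visible pixels per frame
total_slots = PIXEL_CLOCK_HZ // FPS   # pixel-clock ticks per frame (H_total * V_total)
blanking_fraction = 1 - active_pixels / total_slots
offset_mib = DDR_OFFSET >> 20         # base address expressed in MiB

if __name__ == "__main__":
    print(f"active pixels/frame : {active_pixels}")
    print(f"clock ticks/frame   : {total_slots}")
    print(f"blanking overhead   : {blanking_fraction:.1%}")
    print(f"DDR offset          : {offset_mib} MB")
```

The chosen timing must therefore have H_total x V_total close to 630,000 clock ticks per frame, which is about 17.7% blanking overhead. The address 0x20000000 is exactly the 512 MB offset quoted above.
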
--- [TEST 1] OCM to DDR DMA (burst_master_0) ---
Starting SW Copy (4KB x 100)... Done (4185427 cycles, ~4.6 MB/s)
Starting HW DMA (4KB x 100)... Done (166211 cycles, ~117.5 MB/s)
SUCCESS: OCM to DDR Verified!
--- [TEST 2] DDR to DDR DMA (Burst Master 4) ---
Starting HW DMA (1MB)... Done (393942 cycles, ~126.9 MB/s)
SUCCESS: DDR to DDR Verified!
--- [TEST 3] Real-time Modular Filter Verified (2026-02-20) ---
- **Resolution**: 960x540p @ 60 fps (stable)
- **Filter Pipeline**:
  - Mode 1: Grayscale (verified)
  - Mode 3: Color Blur (verified via Cocotb and on hardware)
  - Mode 5: Color Edge (verified)
  - Mode 6: Emboss (verified via Cocotb)
  - Mode 7: Sharpen (verified via Cocotb)
- **Verification**: Cocotb simulation processed 518,400 pixels (one full 960x540 frame) in 98.21 s (sim time).
- **Latency**: 3-clock-cycle pipeline delay, identical across all modes.
- **Visuals**: Zero jitter and correct spatial convolution confirmed.

The `sim_filters.py` script provides a high-level reference implementation of these filters.
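
The filters can also be modeled in a few lines of pure Python. This is a hypothetical sketch, not the contents of `sim_filters.py`. It makes three assumptions:

- The kernel coefficients are common textbook choices.
- The grayscale weights are a BT.601-style integer approximation.
- Borders are handled by edge replication.

The hardware may use different coefficients, normalization, or border rules.

```python
# Hypothetical reference model of the modular filters (NOT sim_filters.py).
from typing import Dict, List, Tuple

Image = List[List[int]]  # 8-bit grayscale, row-major

# name -> (3x3 kernel, divisor); textbook coefficients, assumed, not taken from the RTL
KERNELS: Dict[str, Tuple[List[List[int]], int]] = {
    "blur":    ([[1, 1, 1], [1, 1, 1], [1, 1, 1]], 9),
    "edge":    ([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], 1),
    "emboss":  ([[-2, -1, 0], [-1, 1, 1], [0, 1, 2]], 1),
    "sharpen": ([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], 1),
}

def clamp(v: int, lo: int = 0, hi: int = 255) -> int:
    return max(lo, min(hi, v))

def to_gray(r: int, g: int, b: int) -> int:
    """Assumed BT.601-style integer weights; 77 + 150 + 29 = 256, so white stays 255."""
    return (77 * r + 150 * g + 29 * b) >> 8

def convolve3x3(img: Image, name: str) -> Image:
    """Apply a named 3x3 kernel, replicating edge pixels at the borders."""
    kernel, divisor = KERNELS[name]
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for ky in range(3):
                for kx in range(3):
                    sy = clamp(y + ky - 1, 0, h - 1)  # replicate top/bottom rows
                    sx = clamp(x + kx - 1, 0, w - 1)  # replicate left/right columns
                    acc += kernel[ky][kx] * img[sy][sx]
            out[y][x] = clamp(acc // divisor)  # saturate to the 8-bit range
    return out
```

Color modes (Color Blur, Color Edge) would apply the same kernel to each of R, G, and B independently. A flat input makes a quick check. The blur, emboss, and sharpen kernels each sum to 1 and leave a flat image unchanged. The edge kernel sums to 0 and returns black.
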

Figure: filter output comparison (panels: Original, Grayscale, Blur, Edge (Gray), Edge (Color), Emboss, Sharpen).

The pipeline implements a Bit-Split Hybrid Architecture: 2-bit Temporal LSB scrambling followed by 4-bit spatial Floyd-Steinberg Error Diffusion.
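
One way to read the bit-split idea is to remove the 4 discarded bits in two halves:

1. The 2 LSBs are resolved over time with a frame-varying threshold.
2. The next 2 bits are resolved in space with Floyd-Steinberg error diffusion down to a 4-bit code.

The sketch below is a hypothetical model under that assumption. The offset pattern in `temporal_stage` and all function names are illustrative and do not come from the RTL.

```python
# Hypothetical model of a bit-split hybrid quantizer (8-bit -> 4-bit).
# Assumed split: 2 LSBs resolved temporally, next 2 bits resolved spatially.
import math
from typing import List

Image = List[List[int]]  # row-major grayscale codes

def temporal_stage(img: Image, frame: int) -> Image:
    """8-bit -> 6-bit: add a 0..3 offset that cycles per frame, then drop 2 LSBs."""
    out = []
    for y, row in enumerate(img):
        # Assumed pattern: each pixel sees offsets 0,1,2,3 over any 4 consecutive frames
        out.append([min(63, (v + (frame + x + 2 * y) % 4) >> 2) for x, v in enumerate(row)])
    return out

def floyd_steinberg_6to4(img6: Image) -> Image:
    """6-bit -> 4-bit with classic Floyd-Steinberg error diffusion (weights 7,3,5,1 / 16)."""
    h, w = len(img6), len(img6[0])
    buf = [[float(v) for v in row] for row in img6]  # values plus diffused error
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            old = buf[y][x]
            q = max(0, min(15, math.floor(old / 4 + 0.5)))  # nearest 4-bit code
            out[y][x] = q
            err = old - 4 * q  # quantization error in 6-bit units
            if x + 1 < w:
                buf[y][x + 1] += err * 7 / 16
            if y + 1 < h:
                if x > 0:
                    buf[y + 1][x - 1] += err * 3 / 16
                buf[y + 1][x] += err * 5 / 16
                if x + 1 < w:
                    buf[y + 1][x + 1] += err * 1 / 16
    return out

def hybrid_dither(img: Image, frame: int) -> Image:
    """Temporal stage followed by spatial stage; output codes are 0..15."""
    return floyd_steinberg_6to4(temporal_stage(img, frame))
```

The offset cycles through 0 to 3 at every pixel. Averaging four consecutive frames of `temporal_stage` therefore returns exactly the original 8-bit value divided by 4, except at the clamped top codes. This is why the temporal half needs only a frame counter, not a stored frame.
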

PSNR comparison of the 2-stage hybrid algorithm against simple 4-bit truncation, measured on a dog-portrait test image:

| Evaluation Region | Truncation (Baseline) | Proposed Hybrid | Improvement |
|---|---|---|---|
| Whole Image | 29.16 dB | 32.46 dB | +3.30 dB |
| Near-Black (< 0x20) | 29.15 dB | 32.44 dB | +3.29 dB |
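
PSNR in the table compares each processed image against the 8-bit original, optionally restricted to a region such as near-black pixels. The sketch below is a minimal stdlib version with an optional mask. `truncate4` models the baseline as dropping the low nibble and comparing on the 8-bit scale. That is an assumption about how the baseline was measured.

```python
# Minimal PSNR helper with an optional region mask (stdlib only).
import math
from typing import Optional, Sequence

def psnr(ref: Sequence[int], test: Sequence[int],
         mask: Optional[Sequence[bool]] = None, peak: int = 255) -> float:
    """PSNR in dB between two equal-length 8-bit pixel sequences."""
    pairs = [(r, t) for i, (r, t) in enumerate(zip(ref, test))
             if mask is None or mask[i]]
    if not pairs:
        raise ValueError("mask selects no pixels")
    mse = sum((r - t) ** 2 for r, t in pairs) / len(pairs)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)

def truncate4(v: int) -> int:
    """Baseline: keep the top 4 bits, shown on the 8-bit scale (assumed)."""
    return v & 0xF0

if __name__ == "__main__":
    ramp = list(range(256))
    trunc = [truncate4(v) for v in ramp]
    near_black = [v < 0x20 for v in ramp]  # same threshold as the table
    print(f"whole ramp : {psnr(ramp, trunc):.2f} dB")
    print(f"near-black : {psnr(ramp, trunc, near_black):.2f} dB")
```

Consider an input whose low nibble is uniformly distributed, such as a full 0 to 255 ramp. Truncation gives MSE = (0² + 1² + ... + 15²)/16 = 77.5 and PSNR = 10·log10(255²/77.5) ≈ 29.24 dB, which is close to the 29.16 dB baseline above. On the ramp, the near-black mask yields the same value. This is consistent with the nearly identical baseline figures in both rows.
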

Figure: dog portrait comparison (panels: Original (8-bit), 4-bit Truncated (Banding), Hybrid 2-Stage (Result)).

> [!IMPORTANT]
> The +3.3 dB PSNR improvement in near-black regions supports the effectiveness of the "Seamless Error Propagation" logic. The architecture approaches the quality of 3D (spatial plus temporal) dithering without any external frame buffer, so it consumes no DDR bandwidth (0 MB/s).


To eliminate color banding in dark regions at its source during 3x3 gamut correction, we raised the internal precision of the matrix stage to 12 bits. Verification was performed with a bit-accurate Cocotb simulation.

| Metric | Target | Result | Status |
|---|---|---|---|
| Internal Bit-depth | 12-bit (4096 levels) | 12-bit Verified | ✅ PASSED |
| Roundtrip Error | Max 1 LSB (8-bit space) | 1 LSB Verified | ✅ PASSED |
| Shadow detail preservation | No Crushing (< 0.05% error) | Verified OK | ✅ PASSED |
| Pipeline Throughput | 37.8 MHz (60fps) | 37.8 MHz Verified | ✅ PASSED |

Analysis: Using 12-bit precision in the 3x3 matrix stage preserves the distinct steps of the lowest sRGB code values (1-15), which an 8-bit pipeline previously crushed to zero. Shadow gradients therefore stay smooth and free of banding even after gamut remapping.
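
The shadow-crushing effect can be illustrated numerically. The illustration assumes the 3x3 matrix operates on linear-light values, which is common practice for gamut conversion but is not stated in this report. The sketch decodes low sRGB codes with the standard sRGB transfer function. It then rounds the result to 8-bit and 12-bit linear codes. The function names are illustrative.

```python
# Illustration of shadow crushing, ASSUMING the matrix stage works in linear light.
import math
from typing import List

def srgb_to_linear(code: int) -> float:
    """Decode an 8-bit sRGB code to linear light in [0, 1] (standard sRGB curve)."""
    c = code / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def quantize(x: float, bits: int) -> int:
    """Round a [0, 1] value to the nearest unsigned `bits`-bit code."""
    return math.floor(x * ((1 << bits) - 1) + 0.5)

def crushed_codes(bits: int) -> List[int]:
    """sRGB codes 1..15 whose linear value rounds to zero at this precision."""
    return [c for c in range(1, 16) if quantize(srgb_to_linear(c), bits) == 0]

if __name__ == "__main__":
    for bits in (8, 12):
        codes = [quantize(srgb_to_linear(c), bits) for c in range(16)]
        print(f"{bits:2d}-bit linear codes for sRGB 0..15: {codes}")
        print(f"        crushed to zero: {crushed_codes(bits)}")
```

Under this assumption, at 8 bits codes 1 to 6 round to zero and codes 7 to 15 all collapse onto a single code. At 12 bits, every code from 0 to 15 keeps its own distinct value.
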
Created by Nios II Performance Monitoring Unit.











