Projects

INT8 Fixed-Point CNN Hardware Accelerator and Image-Processing Suite

Designed and evaluated multiple CIFAR-10 CNNs, selecting a Pareto-optimal 6-layer residual model balancing accuracy (~84%), parameter memory (~52 kB), and compute (~12–13 M FLOPs) for hardware deployment
Implemented a tiling-based convolution/GEMM accelerator with reusable MAC PEs and line-buffered dataflow; integrated AXI4-Lite control with AXI-Stream/DMA data movement; verified end-to-end with RTL testbenches against Python reference models
Developed pipelined processing elements built on 8-bit Booth-encoded multipliers with Kogge–Stone adders, with FSM-based control and a 2-cycle ready/valid handshake, yielding a timing-clean, scalable datapath
Performed quantization studies (PTQ/QAT) from 32-bit (FP32/Q1.31) to 8-bit fixed point (Q1.7), achieving ~4× memory reduction with <1% accuracy loss; validated numerical consistency between the TensorFlow FP32 model and the RTL
Built automation flows (TCL/Python) for weight/coefficient ROM generation, testbench stimulus, and inference execution, enabling deterministic layer-by-layer verification
Implemented a streaming image-processing toolkit (AXI-Stream) including edge detection, filtering, denoising, and enhancement, with pipelined RTL and FIFO-based backpressure handling; included an MLP-based (E)MNIST classifier with automated preprocessing and inference

Verilog SystemVerilog TensorFlow Python TCL Perl AXI-Stream
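The Q1.7 format above can be illustrated with a short stdlib Python sketch. It assumes a fixed 2^7 scale, round-half-up, and saturation to the int8 range; the helper names (`quantize_q1_7`, `mac_q1_7`) are hypothetical and this is not the project's actual PTQ/QAT flow. It also shows where the ~4× figure comes from (4-byte FP32 storage vs. 1-byte int8) and how a MAC can accumulate Q2.14 products in a wide register before requantizing.

```python
import math
from array import array

FRAC_BITS = 7                 # Q1.7: sign/integer bit + 7 fractional bits = 8 bits
SCALE = 1 << FRAC_BITS        # 2^7 = 128
QMIN, QMAX = -128, 127        # int8 range, i.e. [-1.0, +0.9921875]


def quantize_q1_7(values):
    """Map floats to Q1.7 integers: scale by 2^7, round half up, saturate."""
    return [max(QMIN, min(QMAX, math.floor(v * SCALE + 0.5))) for v in values]


def dequantize_q1_7(codes):
    """Map Q1.7 integers back to floats."""
    return [c / SCALE for c in codes]


def mac_q1_7(weights, activations):
    """Dot product of two Q1.7 vectors.

    Each int8 x int8 product is Q2.14; products are summed in a wide
    accumulator (unbounded Python ints here, e.g. 32 bits in hardware),
    then shifted right by 7 with rounding and saturated back to Q1.7.
    """
    acc = sum(w * a for w, a in zip(weights, activations))
    return max(QMIN, min(QMAX, (acc + (1 << (FRAC_BITS - 1))) >> FRAC_BITS))


weights = [0.5, -1.0, 1.0, 0.0078125, -0.3]
codes = quantize_q1_7(weights)
fp32_bytes = len(weights) * array("f").itemsize   # 4 bytes per FP32 weight
int8_bytes = len(codes) * array("b").itemsize     # 1 byte per int8 weight
print(codes)                      # [64, -128, 127, 1, -38]
print(fp32_bytes / int8_bytes)    # 4.0
```

Note that +1.0 saturates to 127 (0.9921875), since Q1.7 cannot represent +1 exactly.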

Design & Formal Verification of Parameterizable Fixed-Point CORDIC IP

Implemented a shift-add datapath supporting all six modes (rotation and vectoring in circular, linear, and hyperbolic coordinates); swept data width, iteration count, angle fractional bits, and output width/shift scaling across configurations
Built trig, magnitude, atan2, multiply, divide, and exp wrappers; observed a baseline RMS error on the order of 1e-5 (32-bit, 16 iterations) against double-precision references
Formally proved handshake correctness, bounded liveness (deadlock freedom), range safety, symmetry, and monotonicity with SystemVerilog assertions (SymbiYosys, Yices2)
Auto-generated arctangent tables and parameter files with Python; packaged the core with FuseSoC, documenting parameter sensitivity, error trends, and failure regions
Built pipelined, SIMD, and multi-issue variants; integrated the core into systems including a radix-2 FFT/IFFT, a DPLL, a sigma-delta ADC front end, and a QAM16 receiver (Costas-loop carrier recovery and Gardner timing recovery)

Verilog SystemVerilog SymbiYosys Yices2 Python FuseSoC
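A bit-level Python sketch of the circular rotation mode shows the shift-add iteration and the generated atan table. The 28 fractional bits, the pre-scaling by 1/K, and the function names are assumptions for illustration, not the IP's actual parameters or RTL; Python integers stand in for a fixed-width datapath, so overflow is not modeled.

```python
import math


def atan_table(iters, frac_bits):
    """atan(2^-i) for i = 0..iters-1, quantized to `frac_bits` fractional bits."""
    return [round(math.atan(2.0 ** -i) * (1 << frac_bits)) for i in range(iters)]


def cordic_gain_inv(iters):
    """1/K, where K = prod sqrt(1 + 2^-2i) is the circular-mode CORDIC gain."""
    k = 1.0
    for i in range(iters):
        k /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return k


def cordic_sincos(theta, iters=16, frac_bits=28):
    """Circular rotation mode: returns (sin(theta), cos(theta)).

    Integers model a fixed-point datapath; `>>` is an arithmetic shift, so each
    iteration needs only shifts, adds/subtracts, and a table lookup. Valid for
    |theta| <= sum(atan(2^-i)) ~= 1.743 rad (no quadrant folding here).
    """
    one = 1 << frac_bits
    table = atan_table(iters, frac_bits)
    x, y, z = round(cordic_gain_inv(iters) * one), 0, round(theta * one)
    for i in range(iters):
        d = 1 if z >= 0 else -1                 # rotate toward residual angle z = 0
        x, y = x - d * (y >> i), y + d * (x >> i)
        z -= d * table[i]
    return y / one, x / one


angles = [k / 100 for k in range(-150, 151)]
errs = [cordic_sincos(a)[0] - math.sin(a) for a in angles]
rms = math.sqrt(sum(e * e for e in errs) / len(errs))
print(f"RMS sin error over [-1.5, 1.5] rad, 16 iterations: {rms:.1e}")
```

After n iterations the residual angle is bounded by atan(2^-(n-1)), so the iteration count, rather than the 28-bit word here, dominates the error.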

Pipelined Systolic Array for GEMM/Conv2D with MAC PPA Study (Sky130 OpenLane)

Designed a parameterized, output-stationary 2D systolic array for signed 8-bit GEMM/Conv2D with wavefront scheduling and pipelined PEs
Implemented im2col-based Conv2D mapping onto the GEMM core (4×16×36), achieving 42.47 MAC/cycle (99.5% of peak) at 66.3% PE utilization
Explored 9 MAC architectures (Array/Baugh–Wooley/Booth multipliers × RCA/Kogge–Stone/CSA adders) through full RTL-to-GDSII runs (OpenLane, sky130_fd_sc_hd); achieved 100 MHz timing closure with post-route STA correlation and zero DRC/LVS violations
Quantified PPA tradeoffs: Booth+RCA gave the minimum area (15.8k µm², 537 cells); Array+Kogge gave the highest Fmax (5.68 ns, ~176 MHz); Kogge–Stone was ~4–5% faster at ~20% higher power; CSA was ~67% larger with ~2× power and no timing gain
Designed a tiled GEMM with ping-pong SRAM buffers and DMA-backed data movement (1-cycle buffer swap), overlapping data loads with compute
Implemented a direct-mapped tile cache (tag + valid) achieving a 93.75% hit rate; verified across 240+ tests (GEMM/Conv; random, boundary, burst, and multi-tile), confirming functional correctness and the expected linear latency scaling of K+M+N−2 cycles

Verilog Sky130 RTL2GDS OpenLane DFT
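The wavefront schedule and the K+M+N−2 latency can be checked with a small cycle-level Python model of an output-stationary array. It is a hypothetical behavioral sketch (one register hop per PE, no tiling, caching, or im2col), not the project's RTL; A is M×K and B is K×N.

```python
def systolic_gemm(A, B):
    """Cycle-level model of an M x N output-stationary systolic array, C = A @ B.

    Row i of A enters PE(i, 0) delayed by i cycles; column j of B enters
    PE(0, j) delayed by j cycles (the wavefront skew). Operands then hop one
    PE per cycle (A rightward, B downward) and each PE accumulates its own
    C[i][j] in place. Returns (C, cycles), counting from the first operand
    injection through the last MAC.
    """
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0] * N for _ in range(M)]
    a_reg = [[None] * N for _ in range(M)]   # operand each PE forwards right
    b_reg = [[None] * N for _ in range(M)]   # operand each PE forwards down
    last_mac = -1
    for t in range(K + M + N):               # one spare cycle past the expected drain
        a_next = [[None] * N for _ in range(M)]
        b_next = [[None] * N for _ in range(M)]
        for i in range(M):
            for j in range(N):
                if j == 0:
                    a_in = A[i][t - i] if 0 <= t - i < K else None
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:
                    b_in = B[t - j][j] if 0 <= t - j < K else None
                else:
                    b_in = b_reg[i - 1][j]
                if a_in is not None and b_in is not None:
                    C[i][j] += a_in * b_in   # both operands carry k = t - i - j
                    last_mac = t
                a_next[i][j], b_next[i][j] = a_in, b_in
        a_reg, b_reg = a_next, b_next
    return C, last_mac + 1


A = [[1, -2, 3], [4, 5, -6]]                        # M=2, K=3, signed operands
B = [[7, 8, -9, 10], [-1, 2, 3, 4], [5, -6, 7, 8]]  # K=3, N=4
C, cycles = systolic_gemm(A, B)
print(cycles)  # 7 == K + M + N - 2
```

Because matching k indices meet at PE(i, j) on cycle i + j + k, the last MAC lands on cycle (K−1)+(M−1)+(N−1), giving K+M+N−2 cycles in total.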

Dual-Issue Superscalar RV32I CPU: Design, Verification, and Performance Evaluation

Built a two-wide in-order superscalar RISC processor with parallel IF–ID–EX–MEM–WB lanes and independent pipeline registers per lane
Designed a dual-fetch front end (two 32-bit instructions per cycle) with inter-lane dependency checks, hazard suppression, and load-use stall handling
Implemented a 4R2W register file, RAW/WAW detection, branch squashing, and multi-port memory for concurrent instruction fetch and data access
Compared single-cycle, multi-cycle, and 5-stage pipelined designs using directed assembly programs (built with the RISC-V GCC toolchain, checked against QEMU), analyzing CPI (1 / 3.8 / 1.6, respectively), cycle counts, and hazard overhead
Validated RV32I compliance with the RISC-V Architectural Compliance Test (RISCV-ACT) suite and verified functionality with the Dhrystone benchmark
Evaluated LUT utilization, CPI, and instruction throughput, achieving ~1.6 CPI and ~0.63 instructions/cycle throughput in the two-wide superscalar pipeline

Verilog RISC-V GCC Toolchain QEMU
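The pairing logic for the two lanes can be sketched as a Python predicate. The `Instr` record, the no-issue-behind-a-branch rule, and splitting the pair on any intra-pair RAW or WAW are assumed policies for illustration; the project's actual issue rules may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instr:
    """Hypothetical decoded-instruction record (register indices 0-31)."""
    rd: Optional[int] = None      # destination register, None if nothing is written
    rs1: Optional[int] = None
    rs2: Optional[int] = None
    is_load: bool = False
    is_branch: bool = False


def writes_reg(ins: Instr) -> bool:
    # Writes to x0 are discarded in RISC-V, so they never create a hazard.
    return ins.rd is not None and ins.rd != 0


def can_pair(older: Instr, younger: Instr) -> bool:
    """True if `younger` may issue in lane 1 alongside `older` in lane 0."""
    if older.is_branch:                                      # assumed: nothing pairs behind a branch
        return False
    if writes_reg(older):
        if older.rd in (younger.rs1, younger.rs2):           # intra-pair RAW
            return False
        if writes_reg(younger) and younger.rd == older.rd:   # intra-pair WAW
            return False
    return True


def load_use_stall(in_ex: Instr, in_id: Instr) -> bool:
    """A load in EX whose result the instruction in ID reads: stall one cycle."""
    return in_ex.is_load and writes_reg(in_ex) and in_ex.rd in (in_id.rs1, in_id.rs2)


add = Instr(rd=1, rs1=2, rs2=3)          # add  x1, x2, x3
sub = Instr(rd=4, rs1=1, rs2=5)          # sub  x4, x1, x5  (reads x1)
andi = Instr(rd=6, rs1=7)                # andi x6, x7, imm
lw = Instr(rd=8, rs1=2, is_load=True)    # lw   x8, 0(x2)
print(can_pair(add, sub), can_pair(add, andi))   # False True
```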

Pipelined Low Power ALU with Scan Chain Integration

Designed non-pipelined, pipelined, and scan-enabled versions of a 4-stage ALU, replacing pipeline flip-flops with scan flip-flops for scan-in, capture, and scan-out; added a CDC bridge (asynchronous FIFO plus 2-FF synchronizers) between clock domains
Performed gate-level timing analysis with Yosys and OpenSTA (Sky130), including clock uncertainty, I/O delays, and input slew; pipelining delivered a ~1.7× fmax gain
Ran RTL-to-GDS flows comparing scan vs. no-scan and single vs. dual scan chains, tightening CTS skew targets, and stressing utilization/density and floorplan; closed timing in all runs
Integrated scan-safe clock gating on the pipeline registers, reducing switching power by ~19% and internal power by ~38% with negligible timing impact
Analyzed I/O-driven routing effects: worst-case pin placement more than doubled clock wirelength despite CTS and placement optimization, and pin-architecture optimization recovered it (>50% clock wirelength reduction); PDN stress at signoff showed +21% total and +58% switching power

Verilog Yosys OpenSTA Sky130 DFT
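Asynchronous FIFOs commonly pass Gray-coded pointers through the 2-FF synchronizers, because only one bit changes per increment and a mid-transition sample is therefore off by at most one position. The sketch below assumes that common scheme (pointers one bit wider than the address; full detected by inverting the top two Gray bits); the source does not state which pointer encoding the bridge uses, and the helper names are hypothetical.

```python
def bin_to_gray(b: int) -> int:
    """Binary to reflected Gray code: adjacent values differ in exactly one bit."""
    return b ^ (b >> 1)


def gray_to_bin(g: int) -> int:
    """Inverse mapping: b = g ^ (g >> 1) ^ (g >> 2) ^ ..."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b


def fifo_empty(rd_gray: int, wr_gray_synced: int) -> bool:
    # Read side: empty when the read pointer equals the synchronized write pointer.
    return rd_gray == wr_gray_synced


def fifo_full(wr_gray: int, rd_gray_synced: int, addr_bits: int) -> bool:
    # Write side: pointers are addr_bits + 1 wide; full when they differ only in
    # the top two Gray bits, i.e. the writer is exactly one FIFO depth ahead.
    return wr_gray == rd_gray_synced ^ (0b11 << (addr_bits - 1))


ADDR_BITS = 2                          # depth-4 FIFO, so pointers are 3 bits wide
PTR_MOD = 1 << (ADDR_BITS + 1)
print([format(bin_to_gray(p), "03b") for p in range(PTR_MOD)])
# ['000', '001', '011', '010', '110', '111', '101', '100']
print(fifo_full(bin_to_gray(4), bin_to_gray(0), ADDR_BITS))  # True: writer is 4 ahead
```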

AHB–APB Bridge with Self-Checking Verification

Designed a parameterizable AHB-Lite-to-APB bridge with FSM-based control supporting single and burst read/write transactions
Implemented address/data latching, write buffering, read return, and burst sequencing, handling pipelined and non-pipelined accesses
Built a self-checking SystemVerilog testbench with macro-controlled test modes (single/burst read/write) and assertion-based data validation
Verified protocol correctness across all transaction types; additionally designed and verified standalone I2C, SPI, and UART peripheral controllers

Verilog SystemVerilog
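The APB side of the bridge follows a SETUP phase (PSEL=1, PENABLE=0) and then an ACCESS phase (PSEL=1, PENABLE=1) held until the slave raises PREADY. This Python trace model illustrates that timing; issuing each AHB burst beat as a separate back-to-back APB transfer is an assumed bridge policy, and the function names are hypothetical.

```python
from enum import Enum


class ApbPhase(Enum):
    SETUP = "SETUP"     # PSEL=1, PENABLE=0: address/control driven for one cycle
    ACCESS = "ACCESS"   # PSEL=1, PENABLE=1: held until the slave asserts PREADY


def apb_transfer(pready_per_cycle):
    """Per-cycle (phase, PSEL, PENABLE) for one APB transfer.

    `pready_per_cycle` lists the slave's PREADY value for each ACCESS cycle;
    the transfer completes on the first cycle where PREADY is high.
    """
    trace = [(ApbPhase.SETUP, 1, 0)]
    for ready in pready_per_cycle:
        trace.append((ApbPhase.ACCESS, 1, 1))
        if ready:
            return trace
    raise ValueError("slave never asserted PREADY")


def bridge_burst_cycles(beats, wait_states=0):
    """Assumed policy: each AHB beat becomes one APB transfer, issued back to back."""
    per_beat = len(apb_transfer([0] * wait_states + [1]))
    return beats * per_beat


print(len(apb_transfer([1])))   # 2: the minimum APB transfer is SETUP + ACCESS
print(bridge_burst_cycles(4))   # 8
```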

Functional & UVM Verification of SHA-256 Core (secworks)

Developed a self-checking functional verification environment for the secworks SHA-256 core, validating compression rounds, IV initialization, message scheduling, and multi-block chaining
Implemented directed, random, corner-case, and fail-case stimuli (e.g., `abc`, empty message, all-zero, all-ones) to verify digest correctness and interface handshake behavior (`ready`, `digest_valid`)
Injected malformed control sequences, undefined inputs, and invalid block ordering to stress protocol robustness and confirm mismatch detection and failure reporting
Automated compilation, simulation, and log aggregation through a TCL-driven regression flow, enabling repeatable runs and consolidated verification reporting
Implemented a basic SystemVerilog/UVM verification environment with an agent, driver, sequencer, monitor, and scoreboard to modularize stimulus generation and digest checking

Verilog Icarus Verilog TCL
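A Python reference model for the vectors listed above can come straight from `hashlib`, with the padding done explicitly so that block-level stimulus (sixteen 32-bit words per 512-bit block) can be driven into the core. Treating "all-zero" and "all-ones" as single 64-byte messages is an assumption, and `sha256_pad_blocks` is a hypothetical helper, not part of the secworks core.

```python
import hashlib
import struct


def sha256_pad_blocks(msg: bytes):
    """SHA-256 padding: append 0x80, zero-fill to 56 mod 64 bytes, then the
    64-bit big-endian message length in bits. Returns the 512-bit blocks as
    lists of sixteen 32-bit big-endian words."""
    padded = msg + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    padded += struct.pack(">Q", 8 * len(msg))
    return [list(struct.unpack(">16I", padded[i:i + 64]))
            for i in range(0, len(padded), 64)]


# Reference digests for directed and corner-case vectors.
VECTORS = {
    "abc": b"abc",
    "empty": b"",
    "all_zero_block": b"\x00" * 64,
    "all_ones_block": b"\xff" * 64,
}
EXPECTED = {name: hashlib.sha256(msg).hexdigest() for name, msg in VECTORS.items()}

for name, msg in VECTORS.items():
    print(f"{name}: {len(sha256_pad_blocks(msg))} block(s), digest {EXPECTED[name]}")
```

A 64-byte message pads to two blocks, which also exercises the multi-block chaining path.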

CMOS Bandgap Reference Simulation

Simulated and verified a CMOS bandgap reference in OSU 180 nm CMOS using LTspice at a nominal 3.3 V supply
Validated first-order temperature compensation by analyzing PTAT, CTAT, and Vref behavior across −40 °C to 200 °C
Evaluated line regulation via DC supply sweeps from 2 V to 4 V and measured Vref sensitivity using waveform cursors
Performed a 100-run Monte Carlo mismatch analysis, measuring a Vref standard deviation of 4.6 mV

LTspice OSU 180nm CMOS
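For context, the first-order compensation condition can be written generically, assuming the common topology in which Vref sums a CTAT base-emitter voltage and a PTAT term scaled by a multiplier M (set by a resistor ratio) from an N:1 emitter-area ratio. M, N, and the typical −2 mV/K slope of V_BE are textbook assumptions, not values from this design.

```latex
V_{\mathrm{REF}} = V_{BE} + M\,V_T \ln N, \qquad V_T = \frac{kT}{q}

\frac{\partial V_{\mathrm{REF}}}{\partial T}
  = \frac{\partial V_{BE}}{\partial T} + M\,\frac{k}{q}\ln N = 0
\;\Rightarrow\;
M \ln N = -\frac{\partial V_{BE}/\partial T}{k/q}
  \approx \frac{2\ \mathrm{mV/K}}{0.0862\ \mathrm{mV/K}} \approx 23.2

V_{\mathrm{REF}} \approx V_{BE} + 23.2\,V_T
  \approx 0.65\ \mathrm{V} + 23.2 \times 25.9\ \mathrm{mV} \approx 1.25\ \mathrm{V}
  \quad (T \approx 300\ \mathrm{K})
```

Because V_BE is not exactly linear in temperature, a residual curvature remains after this cancellation, which is why the compensation is first-order.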

Two-Stage CMOS Op-Amp with Miller Compensation

Designed an NMOS differential input pair with PMOS current-mirror load and common-source gain stage using Miller compensation, achieving 53.1 dB DC gain and 4.35 MHz unity-gain bandwidth
Derived transistor sizing from slew-rate, input common-mode range, and gain constraints; verified correct biasing with a stable 0.60 V operating point
Measured a 9.6 kHz −3 dB frequency, a 448× small-signal gain, and a clean 0.14–1.03 V output swing with no observable nonlinear distortion
Evaluated key performance metrics, including 1 mW power dissipation, 32 dB CMRR, 64.6/80.8 dB PSRR (+/−), and a 10 V/µs slew rate, and confirmed loop stability

LTspice
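The reported small-signal numbers can be cross-checked with a few lines of Python, assuming a single dominant (Miller) pole so that the gain-bandwidth product is A0 × f−3dB. This is an arithmetic sanity check, not part of the original LTspice flow.

```python
import math


def to_db(gain: float) -> float:
    """Voltage gain (V/V) to decibels."""
    return 20 * math.log10(gain)


def from_db(db: float) -> float:
    """Decibels to voltage gain (V/V)."""
    return 10 ** (db / 20)


A0 = 448          # measured small-signal DC gain (V/V)
F_3DB = 9.6e3     # measured -3 dB frequency (Hz)

# With one dominant pole, gain-bandwidth product = A0 * f_-3dB, which should
# land close to the measured unity-gain frequency.
gbw = A0 * F_3DB
print(f"{to_db(A0):.2f} dB, GBW ~ {gbw / 1e6:.2f} MHz")  # 53.03 dB, GBW ~ 4.30 MHz
```

The resulting ~4.30 MHz is within about 1% of the measured 4.35 MHz unity-gain bandwidth, consistent with a dominant-pole response.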

CMOS Inverter Layout & Post-Layout Simulation

Built a CMOS inverter layout in Magic VLSI (SCMOS), including PMOS in n-well, NMOS in p-substrate, taps and contacts, and M1 routing; achieved DRC-clean layout
Performed DC analysis in Ngspice on the extracted netlist to evaluate VTC behavior, observing VOH ~1.8 V, VOL ~0 V, and switching threshold VM ~0.95 V at 27 °C
Analyzed transient switching behavior at 1.8 V operation, measuring TPHL ~282 ps, TPLH ~216 ps, and rise/fall times of ~0.50/0.53 ns
Evaluated dynamic performance versus load conditions; observed average power ~2.51 µW and average current ~1.40 µA using Level-1 MOS models

Magic VLSI Ngspice

Analog Function Generator with Adjustable Amplitude, Offset, Phase, Modulation & VCO

Designed an op-amp-based function generator using TL082 op-amps, generating sine, square (<200 ns rise/fall), and triangular waveforms
Implemented amplitude (±10 V) and DC offset (±5 V) control, plus 0°–160° phase control through a first-order all-pass filter, over a 1 kHz–500 kHz operating range
Built the signal chain using a Wien-bridge oscillator, Schmitt trigger, integrator, and CD4051 multiplexer; validated via LTspice simulation and TI ASLK Pro hardware
Integrated AM and PM modulation blocks along with a relaxation-oscillator-based VCO in LTspice, using unity-gain buffers to minimize inter-stage loading

LTspice TL082 TI ASLK Pro
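The frequency- and phase-setting relations for the Wien-bridge oscillator and the first-order all-pass stage can be sketched in Python. The all-pass transfer function assumes the common topology with the RC network on the non-inverting input and equal gain resistors; the 10 nF and 10 kΩ values are hypothetical, not the project's components.

```python
import math


def wien_bridge_f0(r_ohm: float, c_farad: float) -> float:
    """Oscillation frequency of a Wien bridge with equal R and C: 1 / (2*pi*R*C)."""
    return 1.0 / (2 * math.pi * r_ohm * c_farad)


def allpass_phase_deg(f_hz: float, r_ohm: float, c_farad: float) -> float:
    """Phase of a first-order all-pass H(s) = (1 - sRC) / (1 + sRC).

    |H| = 1 at all frequencies; the phase is -2*atan(2*pi*f*R*C), sweeping
    0 to -180 deg. (Swapping R and C gives 180 deg minus this value instead.)
    """
    return -2.0 * math.degrees(math.atan(2 * math.pi * f_hz * r_ohm * c_farad))


def r_for_phase(f_hz: float, phase_deg: float, c_farad: float) -> float:
    """Resistor giving |phase_deg| of lag at f_hz (0 < |phase_deg| < 180)."""
    return math.tan(math.radians(abs(phase_deg)) / 2) / (2 * math.pi * f_hz * c_farad)


C = 10e-9                                    # hypothetical 10 nF
print(round(wien_bridge_f0(10e3, C)))        # 1592 (Hz, with R = 10 kOhm)
r = r_for_phase(1e3, 90, C)                  # R for 90 deg of lag at 1 kHz
print(round(allpass_phase_deg(1e3, r, C)))   # -90
```

Since the phase only approaches 180° asymptotically, a single first-order section suits a 0°–160° range like the one above.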

Semiconductor Device Modeling using Sentaurus TCAD

Modeled N-resistor, PN diode, and NMOS devices in Sentaurus TCAD with parameterized doping profiles and device geometries
Configured and automated process and device simulations using Sentaurus Workbench with command-based scripting workflows
Analyzed and visualized electrostatic potential, carrier concentration distributions, and I–V characteristics using Sentaurus Visual and Inspect tools

Sentaurus TCAD

FIR DSP Accelerator SoC (Sky130, Caravel)

Designed a fixed-point DSP SoC (Sky130, Caravel) with a CIC decimator → 8-tap FIR → PWM DAC pipeline
Implemented a Wishbone-mapped control interface enabling runtime FIR coefficient updates and datapath configuration
Verified functionality via RTL simulations (filter response, gain, PWM linearity) using Icarus Verilog
Achieved timing closure at 40 MHz and clean physical signoff (OpenLane, LVS/DRC clean)

Verilog OpenLane Sky130 PDK Wishbone Icarus Verilog
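An integer Python sketch of the CIC → FIR portion of the pipeline follows. The CIC parameters (R=4, N=3, M=1), the moving-average taps, and the output shift are hypothetical placeholders, since the source does not give them; the PWM DAC and Wishbone registers are not modeled.

```python
from collections import deque


def cic_decimate(x, R=4, N=3, M=1):
    """Integer CIC decimator: N integrators at the input rate, keep one sample
    in R, then N combs with differential delay M at the output rate.
    DC gain is (R*M)**N, so hardware needs N*log2(R*M) bits of register growth."""
    integ = [0] * N
    combs = [deque([0] * M, maxlen=M) for _ in range(N)]
    out = []
    for n, sample in enumerate(x):
        v = sample
        for s in range(N):              # integrator cascade
            integ[s] += v
            v = integ[s]
        if n % R == R - 1:              # decimate by R
            for s in range(N):          # comb cascade: v[m] - v[m - M]
                oldest = combs[s][0]
                combs[s].append(v)
                v -= oldest
            out.append(v)
    return out


def fir_q(x, taps, shift):
    """Direct-form integer FIR: y[n] = (sum_k taps[k] * x[n-k]) >> shift."""
    line = deque([0] * len(taps), maxlen=len(taps))
    y = []
    for sample in x:
        line.appendleft(sample)         # line[k] now holds x[n-k]
        y.append(sum(t * s for t, s in zip(taps, line)) >> shift)
    return y


cic_out = cic_decimate([1] * 64)        # constant input, R=4, N=3, M=1
print(len(cic_out), cic_out[-1])        # 16 64
taps = [1] * 8                          # hypothetical 8-tap moving average
print(fir_q([v >> 6 for v in cic_out], taps, shift=3)[-1])  # 1
```

Shifting the CIC output right by log2((R·M)^N) = 6 bits removes the CIC gain before the FIR, and the 3-bit shift normalizes the 8-tap sum.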

EDA Tools and ML-Based Design Analysis

Developed a Python-based analysis and fault-modeling tool for ISCAS’85/89 benchmark circuits, automating the evaluation of logic faults and circuit behavior
Built a machine-learning-driven pre-route congestion prediction tool using macro-aware RUDY correction and region-stratified evaluation on CircuitNet-N14 datasets
Designed and evaluated ML-based branch prediction and cache replacement strategies using ChampSim simulation traces, improving performance-analysis workflows
Implemented CSR-based sparse matrix-vector multiplication (SpMV) benchmarks with heterogeneous CPU–GPU execution and conducted memory roofline analysis for performance optimization
Developed a PCB fault detection and classification system using YOLOv7, automating defect identification on PCB datasets

Python Machine Learning
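The CSR layout and a first-order roofline estimate can be sketched in Python. The byte counts assume 8-byte values, 4-byte indices, and perfect cache reuse of x; these are illustrative assumptions rather than the benchmark's measured traffic, and the function names are hypothetical.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x for A stored in CSR (compressed sparse row) form.

    Row i's nonzeros are vals[row_ptr[i]:row_ptr[i+1]], with matching
    column indices in col_idx.
    """
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for p in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[p] * x[col_idx[p]]
        y[i] = acc
    return y


def spmv_arithmetic_intensity(n_rows, n_cols, nnz, val_bytes=8, idx_bytes=4):
    """FLOPs per byte of DRAM traffic for one CSR SpMV, assuming x is read once
    (perfect cache reuse) and y is written once; 2 FLOPs (mul + add) per nonzero."""
    flops = 2 * nnz
    traffic = (nnz * (val_bytes + idx_bytes)   # vals + col_idx
               + (n_rows + 1) * idx_bytes      # row_ptr
               + n_cols * val_bytes            # x
               + n_rows * val_bytes)           # y
    return flops / traffic


# A = [[4, 0, 1],
#      [0, 2, 0],
#      [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [4.0, 1.0, 2.0, 3.0, 5.0]
print(spmv_csr(row_ptr, col_idx, vals, [1.0, 2.0, 3.0]))   # [7.0, 4.0, 18.0]
```

Under those assumptions the intensity approaches 2/12 ≈ 0.17 FLOP/byte for large nnz, placing SpMV on the bandwidth-limited side of a typical CPU or GPU roofline.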

Autonomous Drone for GNSS-Denied Environments (ISRO IRoC-U 2025)

Integrated an NVIDIA Jetson Nano (onboard compute) with a Pixhawk 4 flight controller to enable autonomous navigation and precision landing
Designed a sub-2 kg quadrotor optimized for GNSS-denied mapping, localization, and vision-based safe-zone detection
Calibrated the ESCs and implemented a stable 5 V / 3 A BEC power system; established bidirectional long-range telemetry (~500 m) using an ESP32
Interfaced barometer, optical-flow, and stereo-vision sensors with Pixhawk over I2C and UART for fused state estimation
Implemented visual–inertial odometry using ORB-SLAM3 and VINS-Fusion on ROS 2, achieving localization over ~5 m with <5 cm drift
Simulated Mars-like no-GPS flight scenarios in Webots with 0.38 g gravity, enabling autonomous landings within 1.5 m × 1.5 m safe zones

Jetson Nano Pixhawk 4 ROS 2 ORB-SLAM3 VINS-Fusion Webots

PPO-Based Reinforcement Learning for Autonomous Racing on AWS DeepRacer

Trained continuous-action PPO agents on AWS SageMaker for end-to-end, camera-based autonomous racing with steering and speed control
Designed reward functions emphasizing centerline stability, heading alignment, curvature-aware waypoint tracking, and velocity-weighted progress
Stabilized training using distance-band shaping, steering smoothness constraints, and tuned PPO hyperparameters (entropy annealing, ε-clipping, GAE λ)
Evaluated robustness under simulated perturbations including waypoint jitter, curvature sweeps, and speed-limit randomization
Achieved consistent sub-2-minute lap times, outperforming default baselines and reaching top global leaderboard rankings in 2024

AWS SageMaker PPO Reinforcement Learning
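The reward shaping and the PPO terms named above can be sketched in Python. `reward_function(params)` follows the DeepRacer convention of a dict of documented inputs (`all_wheels_on_track`, `track_width`, `distance_from_center`, `steering_angle`), but the band edges, the 15° threshold, and the 0.8 penalty factor are hypothetical. SageMaker runs PPO itself, so `ppo_clip_objective` and `gae` only illustrate the clipped surrogate and GAE(λ) that the tuned hyperparameters refer to.

```python
def reward_function(params):
    """DeepRacer-style reward: distance-band shaping plus a steering
    smoothness penalty. Thresholds are hypothetical, not trained values."""
    if not params["all_wheels_on_track"]:
        return 1e-3                                  # near-zero reward off track
    track_width = params["track_width"]
    dist = params["distance_from_center"]
    if dist <= 0.1 * track_width:                    # bands around the centerline
        reward = 1.0
    elif dist <= 0.25 * track_width:
        reward = 0.5
    elif dist <= 0.5 * track_width:
        reward = 0.1
    else:
        reward = 1e-3
    if abs(params["steering_angle"]) > 15.0:         # degrees; discourage zig-zagging
        reward *= 0.8
    return float(reward)


def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized advantage estimates, computed backward over one rollout."""
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        next_v = last_value if t == len(rewards) - 1 else values[t + 1]
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_v * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        adv[t] = running
    return adv


on_center = {"all_wheels_on_track": True, "track_width": 1.0,
             "distance_from_center": 0.05, "steering_angle": 0.0}
print(reward_function(on_center))   # 1.0
```

The clip keeps a single update from moving the policy ratio far outside [1−ε, 1+ε], while λ trades bias against variance in the advantage estimates.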

Autonomous Multi-Sensor Robot Simulation (GPS/IMU/LiDAR/2-DOF Vision)

Developed a fully simulated 4-wheel autonomous robot equipped with GPS, 9-axis IMU, 2D LiDAR, ultrasonic distance sensors, and an actively actuated 2-DOF camera system
Implemented global position tracking, local free-space detection, camera-based object observation, and reactive obstacle avoidance within the simulation stack
Modeled sensor fusion inputs including GPS (x,y), IMU orientation/angular velocity, LiDAR ranging, and short-range distance sensing for collision-free navigation
Designed independent wheel velocity control enabling smooth translation and turning, with teleoperation and autonomous wandering modes
Built as a baseline multi-sensor robotics testbed for evaluating classical navigation and control behaviors without SLAM or learning-based methods

Simulation Robotics Sensor Fusion
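Independent wheel velocity control for a 4-wheel base is often reduced to differential-drive kinematics. The Python sketch below assumes that skid-steer approximation (both wheels on a side share one speed, wheel slip ignored); the function name and the dimensions in the usage line are hypothetical.

```python
def wheel_speeds(v, omega, track_width, wheel_radius):
    """Wheel angular velocities (rad/s) for a 4-wheel skid-steer base.

    v: forward speed (m/s); omega: yaw rate (rad/s, positive = counter-clockwise);
    track_width: left-to-right wheel spacing (m); wheel_radius in m.
    Returns (front_left, rear_left, front_right, rear_right).
    """
    v_left = v - omega * track_width / 2     # inner side slows when turning left
    v_right = v + omega * track_width / 2
    wl, wr = v_left / wheel_radius, v_right / wheel_radius
    return wl, wl, wr, wr


# Hypothetical 0.4 m track, 5 cm wheels: straight line, then spin in place.
print(wheel_speeds(0.5, 0.0, 0.4, 0.05))
print(wheel_speeds(0.0, 1.0, 0.4, 0.05))
```

Teleoperation and wandering modes can both emit a (v, ω) command and reuse the same mapping to per-wheel velocities.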