Projects
- Designed and evaluated multiple CIFAR-10 CNNs, selecting a Pareto-optimal 6-layer residual model balancing accuracy (~84%), parameter memory (~52 kB), and compute (~12–13 M FLOPs) for hardware deployment
- Implemented a tiling-based convolution/GEMM accelerator with reusable MAC PEs and line-buffered dataflow; integrated AXI4-Lite control + AXI-Stream/DMA data movement; verified end-to-end via RTL testbenches against Python reference models
- Developed pipelined processing elements using 8-bit Booth–Kogge MACs, with FSM-based control and a 2-cycle ready/valid handshake, ensuring timing-clean and scalable datapath operation
- Performed quantization studies (PTQ/QAT) from 32-bit formats (FP32, Q1.31) to 8-bit fixed-point (Q1.7), achieving ~4× memory reduction with <1% accuracy loss; validated numerical consistency between the TensorFlow FP32 model and RTL
- Built automation flows (TCL/Python) for ROM/weight generation, testbench stimulus, and inference execution; generated coefficient memories and ensured deterministic layer-by-layer verification
- Implemented a streaming image-processing toolkit (AXI-Stream) including edge detection, filtering, denoising, and enhancement, with pipelined RTL and FIFO-based backpressure handling; included an MLP-based (E)MNIST classifier with automated preprocessing and inference
Verilog
SystemVerilog
TensorFlow
Python
TCL
Perl
AXI-Stream
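A minimal sketch of the Q1.7 mapping described above. It assumes symmetric round-to-nearest with saturation to the signed 8-bit range; the project's actual rounding and calibration scheme is not stated. It also makes the ~4× memory figure concrete, because each 32-bit value becomes one 8-bit code.

```python
FRAC_BITS = 7                # Q1.7: sign bit + 7 fractional bits = 8 bits total
SCALE = 1 << FRAC_BITS       # 2^7 = 128 codes per unit
QMIN, QMAX = -128, 127       # signed 8-bit code range, i.e. [-1.0, 0.9921875]


def quantize_q1_7(x: float) -> int:
    """Round to the nearest Q1.7 code (Python's round() is half-to-even) and saturate."""
    return max(QMIN, min(QMAX, round(x * SCALE)))


def dequantize_q1_7(q: int) -> float:
    """Map an 8-bit Q1.7 code back to a real value."""
    return q / SCALE


weights = [0.5, -0.25, 0.9921875, -1.0, 0.3, 1.5]   # 1.5 is out of range and saturates
codes = [quantize_q1_7(w) for w in weights]
recon = [dequantize_q1_7(c) for c in codes]

fp32_bytes = 4 * len(weights)   # 32-bit storage
q17_bytes = 1 * len(weights)    # 8-bit storage
print(codes)                    # [64, -32, 127, -128, 38, 127]
print(fp32_bytes / q17_bytes)   # 4.0
```

For in-range inputs the reconstruction error is at most half an LSB (2^-8). Out-of-range values such as 1.5 clip to 127/128, which is one reason calibration or QAT matters at this bit width.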
- Implemented a shift-add CORDIC datapath supporting all 6 modes (rotation/vectoring × circular/linear/hyperbolic); swept data width, iteration count, angle fractional bits, and output width/shift scaling across configurations
- Built trig/magnitude/atan2/mul/div/exp wrappers; observed a baseline RMS error of ~1e-5 (32-bit, 16 iterations) versus double-precision references
- Formally proved handshake correctness, deadlock freedom (bounded liveness), range safety, symmetry, and monotonicity with SystemVerilog assertions (SymbiYosys/Yices2)
- Auto-generated atan tables & param files via Python; FuseSoC-packaged core with documented sensitivity, error trends & failure regions
- Variants: pipelined/SIMD/multi-issue; Systems: radix-2 FFT/IFFT, DPLL, Sigma-Delta ADC Front-End, QAM16 receiver (Costas carrier + Gardner timing recovery)
Verilog
SystemVerilog
SymbiYosys
Yices2
Python
FuseSoC
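The shift-add, rotation/vectoring, and circular/linear/hyperbolic terminology above belongs to CORDIC. The model below illustrates the kind of double-precision reference such a core can be checked against, and it covers only circular rotation mode (sin/cos). The function name and the 16-iteration default are illustrative. An RTL version would use fixed-point words and arithmetic shifts.

```python
import math
from typing import Tuple


def cordic_sin_cos(theta: float, iters: int = 16) -> Tuple[float, float]:
    """Circular-mode CORDIC in rotation mode, returning (sin, cos).

    Converges for |theta| up to about 1.74 rad (the sum of the atan table).
    """
    # Arctangent table: the kind of constants a generator script would emit in fixed point
    atan_table = [math.atan(2.0 ** -i) for i in range(iters)]
    # Pre-scale by the aggregate gain K = prod 1/sqrt(1 + 2^-2i) so the result has unit length
    k = 1.0
    for i in range(iters):
        k /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = k, 0.0, theta
    for i in range(iters):
        d = 1.0 if z >= 0.0 else -1.0
        # Micro-rotation: multiplying by 2^-i is an arithmetic right shift in hardware
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * atan_table[i]   # drive the residual angle toward zero
    return y, x


s, c = cordic_sin_cos(0.5)
print(abs(s - math.sin(0.5)) < 1e-4, abs(c - math.cos(0.5)) < 1e-4)  # True True
```

After 16 iterations the residual angle is bounded by atan(2^-15), about 3e-5 rad. That bound is consistent with the ~1e-5-scale errors reported above.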
- Designed parameterized output-stationary 2D systolic array for signed 8-bit GEMM/Conv2D with wavefront scheduling and pipelined PEs
- Implemented im2col-based Conv2D mapping onto GEMM core (4×16×36), achieving 42.47 MAC/cycle (99.5% peak) at 66.3% PE utilization
- Explored 9 MAC architectures (Array/Baugh/Booth × RCA/Kogge/CSA) via full RTL-to-GDSII (OpenLane, sky130_fd_sc_hd); achieved 100 MHz timing closure with post-route STA correlation and 0 DRC/LVS violations
- Quantified PPA tradeoffs: Booth+RCA 15.8k µm² / 537 cells (minimum area); Array+Kogge 5.68 ns (~176 MHz, highest Fmax); Kogge ~4–5% faster at ~20% higher power; CSA ~67% larger with ~2× power and no timing gain
- Designed ping-pong SRAM tiled GEMM with DMA-backed data movement (1-cycle buffer swap), enabling overlap of load and compute
- Implemented direct-mapped tile cache (tag+valid) achieving 93.75% hit rate; verified across 240+ tests (GEMM/Conv, random/boundary/burst/multi-tile), confirming functional correctness and linear systolic scaling (K+M+N−2)
Verilog
Sky130
RTL2GDS
OpenLane
DFT
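Below is a cycle-level Python model of an output-stationary array. It assumes that A streams in from the left and B from the top, with each row and column skewed by its index, and that every PE keeps its own C[i][j] accumulator. It reproduces the K+M+N−2 compute-cycle count quoted above. Result drain and readout cycles are not modeled, and `systolic_gemm` is a hypothetical name rather than the project's code.

```python
from typing import List, Optional, Tuple

Matrix = List[List[int]]


def systolic_gemm(A: Matrix, B: Matrix, cycles: Optional[int] = None) -> Tuple[Matrix, int]:
    """Simulate an M x N output-stationary systolic array computing C = A @ B."""
    M, K, N = len(A), len(B), len(B[0])
    assert all(len(row) == K for row in A), "inner dimensions must match"
    if cycles is None:
        cycles = K + M + N - 2            # last MAC lands in PE(M-1, N-1) at cycle K+M+N-3
    C = [[0] * N for _ in range(M)]
    a_reg = [[0] * N for _ in range(M)]   # A operand latched in each PE, moves right
    b_reg = [[0] * N for _ in range(M)]   # B operand latched in each PE, moves down
    for t in range(cycles):
        new_a = [[0] * N for _ in range(M)]
        new_b = [[0] * N for _ in range(M)]
        for i in range(M):
            for j in range(N):
                # Edge PEs read the skewed input wavefront (zero outside 0 <= k < K);
                # interior PEs read the value their neighbor latched last cycle.
                if j == 0:
                    k = t - i
                    a_in = A[i][k] if 0 <= k < K else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:
                    k = t - j
                    b_in = B[k][j] if 0 <= k < K else 0
                else:
                    b_in = b_reg[i - 1][j]
                C[i][j] += a_in * b_in    # output-stationary accumulate
                new_a[i][j], new_b[i][j] = a_in, b_in
        a_reg, b_reg = new_a, new_b       # all PE registers update on the same clock edge
    return C, cycles


C, n = systolic_gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C, n)  # [[19, 22], [43, 50]] 4
```

PE(i, j) sees operand index k = t − i − j at cycle t. The far-corner PE therefore finishes last, and the count grows linearly in K+M+N instead of as a product. If you stop one cycle early, that PE misses its final product.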
- Built a two-wide in-order superscalar RISC processor with parallel IF–ID–EX–MEM–WB lanes and independent pipeline registers per lane
- Designed dual 32-bit instruction fetch per cycle with inter-lane dependency checks, hazard suppression, and load-use stall handling
- Implemented 4R2W register file, RAW/WAW detection, branch squashing, & multi-port memory for concurrent fetch and data access
- Evaluated single-cycle, multi-cycle, and 5-stage pipelined designs on directed assembly programs (built with the RISC-V GCC toolchain, cross-checked against QEMU), analyzing CPI (1 / 3.8 / 1.6), cycle counts, and hazard overhead
- Validated RV32I compliance using the RISC-V Architectural Compliance Test (RISCV-ACT) and verified functionality with the Dhrystone benchmark
- Evaluated LUT utilization, CPI, and instruction throughput, achieving ~1.6 CPI and ~0.63 instructions/cycle throughput in the two-wide superscalar pipeline
Verilog
RISC-V GCC Toolchain
QEMU
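The inter-lane checks above fit in a few lines. This is a behavioral sketch and not the RTL. The `Inst` record and the function names are hypothetical. It assumes a conservative policy in which any pair with a RAW or WAW conflict is split across two issue cycles, although some designs instead let the younger write win on WAW. Writes to x0 are ignored because x0 is hardwired to zero in RV32I.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Inst:
    """Hypothetical decoded-instruction record: register numbers, None if unused."""
    rd: Optional[int]
    rs1: Optional[int] = None
    rs2: Optional[int] = None
    is_load: bool = False


def writes_reg(inst: Inst) -> bool:
    # Writes to x0 are discarded in RV32I, so they never create hazards
    return inst.rd is not None and inst.rd != 0


def can_dual_issue(older: Inst, younger: Inst) -> bool:
    """True if the pair can occupy both lanes in the same cycle."""
    if not writes_reg(older):
        return True
    raw = older.rd in (younger.rs1, younger.rs2)  # younger reads older's result
    waw = younger.rd == older.rd                  # both lanes write the same register
    return not (raw or waw)


def needs_load_use_stall(load_in_ex: Inst, next_inst: Inst) -> bool:
    """Stall one cycle when a load's result is needed by the very next instruction."""
    return (load_in_ex.is_load and writes_reg(load_in_ex)
            and load_in_ex.rd in (next_inst.rs1, next_inst.rs2))


add1 = Inst(rd=5, rs1=1, rs2=2)
add2 = Inst(rd=6, rs1=5, rs2=4)    # reads x5 written by add1, a RAW hazard
print(can_dual_issue(add1, add2))  # False
```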
- Designed non-pipelined/pipelined/scan-enabled 4-stage ALU; pipeline FFs replaced with scan FFs for scan-in/capture/scan-out; added CDC bridge (async FIFO + 2FF sync) between clk domains
- Performed gate-level timing analysis in Yosys/OpenSTA (Sky130) with clock uncertainty, I/O delays, and input slew; pipelining gave ~1.7× fmax gain
- RTL2GDS flow: scan vs no-scan, single vs dual scan, CTS skew tightening, util/density & floorplan stress; closed timing throughout
- Integrated scan-safe clock gating on pipeline regs; reduced switching power ~19% & internal power ~38% with negligible timing impact
- Analyzed IO-driven routing effects; worst-case pinning increased clock wirelength by >2× despite CTS/placement optimization; recovered via pin-arch optimization (>50% clock wirelength reduction); PDN stress at signoff showed +21% total and +58% switching power
Verilog
Yosys
OpenSTA
Sky130
DFT
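The bullet does not say how the async FIFO moves its pointers across clock domains. This sketch assumes the textbook approach: each pointer is Gray-coded before its 2FF synchronizer, so only one bit changes per increment and a metastable sample resolves to either the old value or the new one. The conversions are shown below.

```python
def bin_to_gray(b: int) -> int:
    """Binary to reflected Gray code: adjacent values differ in exactly one bit."""
    return b ^ (b >> 1)


def gray_to_bin(g: int) -> int:
    """Inverse mapping: each binary bit is the XOR of all Gray bits at or above it."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b


WIDTH = 4  # hypothetical pointer width
for i in range(1 << WIDTH):
    nxt = (i + 1) % (1 << WIDTH)  # includes the wrap from 15 back to 0
    changed = bin_to_gray(i) ^ bin_to_gray(nxt)
    assert bin(changed).count("1") == 1
    assert gray_to_bin(bin_to_gray(i)) == i
print("single-bit transitions verified")
```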

AHB–APB Bridge with Self-Checking Verification
- Designed a parameterizable AHB-Lite to APB bridge with FSM-based control supporting single & burst read/write transactions
- Implemented address/data latching, write buffering, read return, and burst sequencing, handling pipelined and non-pipelined accesses
- Built a self-checking SV testbench with macro-controlled test modes (single/burst R/W) and assertion-based data validation
- Verified protocol correctness across all transaction types; additionally designed and verified standalone I2C, SPI, and UART peripheral controllers
Verilog
SystemVerilog
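For reference, the APB side of such a bridge follows the AMBA APB transfer sequence IDLE → SETUP → ACCESS, with ACCESS extended while PREADY is low (APB3 and later). The sketch models only that sequence. It does not cover the AHB-side latching or the bridge's own FSM, neither of which is described here. `burst_cycles` is a hypothetical helper that shows why each beat costs at least two PCLK cycles.

```python
from enum import Enum


class ApbState(Enum):
    IDLE = 0
    SETUP = 1
    ACCESS = 2


# (PSEL, PENABLE) driven in each state, for the selected slave
APB_OUTPUTS = {
    ApbState.IDLE: (0, 0),
    ApbState.SETUP: (1, 0),
    ApbState.ACCESS: (1, 1),
}


def apb_next(state: ApbState, pending: bool, pready: bool) -> ApbState:
    """Next-state function of the APB transfer sequence."""
    if state is ApbState.IDLE:
        return ApbState.SETUP if pending else ApbState.IDLE
    if state is ApbState.SETUP:
        return ApbState.ACCESS                          # SETUP always lasts exactly one cycle
    if not pready:
        return ApbState.ACCESS                          # slave inserts a wait state
    return ApbState.SETUP if pending else ApbState.IDLE  # back-to-back transfer or finish


def burst_cycles(beats: int, wait_states: int) -> int:
    """PCLK cycles spent in SETUP/ACCESS for `beats` back-to-back transfers."""
    # First transfer is already requested, so the model starts in SETUP
    state, remaining, waited, cycles = ApbState.SETUP, beats, 0, 0
    while True:
        cycles += 1
        pready = state is ApbState.ACCESS and waited == wait_states
        if state is ApbState.ACCESS:
            waited = 0 if pready else waited + 1
        if pready:
            remaining -= 1
        state = apb_next(state, remaining > 0, pready)
        if state is ApbState.IDLE:
            return cycles


print(burst_cycles(4, 0), burst_cycles(2, 1))  # 8 6
```

Each beat costs one SETUP cycle plus (wait states + 1) ACCESS cycles. An AHB burst routed through the bridge therefore takes at least twice as many PCLK cycles as beats.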

Functional & UVM Verification of SHA-256 Core (secworks)
- Developed a self-checking functional verification environment for the secworks SHA-256 core, validating compression rounds, IV initialization, message scheduling, and multi-block chaining
- Implemented directed, random, corner-case, and fail-case stimuli (e.g., `abc`, empty message, all-zero, all-ones) to verify digest correctness and interface handshake behavior (`ready`, `digest_valid`)
- Injected malformed control sequences, undefined inputs, and invalid block ordering to stress protocol robustness and confirm mismatch detection and failure reporting
- Automated compilation, simulation, and log aggregation through a TCL-driven regression flow, enabling repeatable runs and consolidated verification reporting
- Implemented a basic SystemVerilog/UVM verification environment with agent, driver, sequencer, monitor, and scoreboard to modularize stimulus generation and digest checking
Verilog
Icarus Verilog
TCL
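Checking digests like this requires a software golden model and padded 512-bit blocks. The sketch uses Python's `hashlib` as the reference and implements the FIPS 180-4 padding rule. It assumes, as is common for block-level cores, that the testbench handles padding and block splitting. `sha256_blocks` and `GOLDEN` are hypothetical names, and the messages mirror the corner cases listed above.

```python
import hashlib
import struct
from typing import List


def sha256_blocks(msg: bytes) -> List[bytes]:
    """Pad msg per FIPS 180-4 and split it into 64-byte (512-bit) blocks."""
    bit_len = len(msg) * 8
    padded = msg + b"\x80"                          # append a single 1 bit
    padded += b"\x00" * ((56 - len(padded)) % 64)   # zero-fill to 448 mod 512 bits
    padded += struct.pack(">Q", bit_len)            # 64-bit big-endian length
    return [padded[i:i + 64] for i in range(0, len(padded), 64)]


# Reference digests for the corner cases named above
GOLDEN = {
    name: hashlib.sha256(msg).hexdigest()
    for name, msg in {
        "empty": b"",
        "abc": b"abc",
        "all_zero": bytes(64),
        "all_ones": b"\xff" * 64,
    }.items()
}

print(GOLDEN["abc"])                  # ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
print(len(sha256_blocks(b"a" * 56)))  # 2: the length field no longer fits in one block
```

The 56-byte case is a useful multi-block chaining test because the message fits in one block but its padding does not.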

CMOS Bandgap Reference Simulation
- Simulated and verified a CMOS bandgap reference in OSU 180 nm CMOS using LTspice at a nominal 3.3 V supply
- Validated first-order temperature compensation by analyzing PTAT, CTAT, and Vref behavior across −40 °C to 200 °C
- Evaluated line regulation via DC supply sweeps from 2 V to 4 V and measured Vref sensitivity using waveform cursors
- Performed 100-run Monte Carlo mismatch analysis and quantified Vref variation with a standard deviation of 4.6 mV
LTspice
OSU 180nm CMOS

Two-Stage CMOS Op-Amp with Miller Compensation
- Designed an NMOS differential input pair with PMOS current-mirror load and common-source gain stage using Miller compensation, achieving 53.1 dB DC gain and 4.35 MHz unity-gain bandwidth
- Derived transistor sizing from slew-rate, input common-mode range, and gain constraints; verified correct biasing with a stable 0.60 V operating point
- Measured a 9.6 kHz −3 dB frequency, 448× small-signal gain, and clean 0.14–1.03 V output swing with no observable nonlinear distortion
- Evaluated key performance metrics including 1 mW power dissipation, 32 dB CMRR, 64.6/80.8 dB PSRR±, and 10 V/µs slew rate, confirming loop stability
LTspice
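The op-amp figures above can be cross-checked under a dominant-pole approximation. That is an assumption, though a reasonable one for a Miller-compensated two-stage amplifier below its non-dominant pole. Under it, the DC gain in dB is 20·log10(A0) and the unity-gain bandwidth is roughly A0 times the −3 dB frequency.

```python
import math

A0 = 448.0        # small-signal DC gain in V/V, as reported above
F_3DB = 9.6e3     # measured -3 dB frequency in Hz, as reported above

gain_db = 20.0 * math.log10(A0)   # about 53.0 dB
ugbw_est = A0 * F_3DB             # single-pole gain-bandwidth product, about 4.30 MHz

print(f"{gain_db:.1f} dB, {ugbw_est / 1e6:.2f} MHz")  # 53.0 dB, 4.30 MHz
```

The estimate falls within about 1% of the reported 4.35 MHz unity-gain bandwidth, so the gain, −3 dB, and bandwidth figures are mutually consistent.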

CMOS Inverter Layout & Post-Layout Simulation
- Built a CMOS inverter layout in Magic VLSI (SCMOS), including PMOS in n-well, NMOS in p-substrate, taps and contacts, and M1 routing; achieved DRC-clean layout
- Performed DC analysis in Ngspice on the extracted netlist to evaluate VTC behavior, observing VOH ~1.8 V, VOL ~0 V, and switching threshold VM ~0.95 V at 27 °C
- Analyzed transient switching behavior at 1.8 V operation, measuring TPHL ~282 ps, TPLH ~216 ps, and rise/fall times of ~0.50/0.53 ns
- Evaluated dynamic performance versus load conditions; observed average power ~2.51 µW and average current ~1.40 µA using Level-1 MOS models
Magic VLSI
Ngspice

Analog Function Generator with Adjustable Amplitude, Offset, Phase, Modulation & VCO
- Designed an op-amp–based function generator using TL082, generating sine, square (<200 ns rise/fall), and triangular waveforms
- Implemented amplitude (±10 V), DC offset (±5 V), and phase control (0°–160°) over a 1 kHz–500 kHz operating range using a first-order all-pass filter
- Built the signal chain using a Wien-bridge oscillator, Schmitt trigger, integrator, and CD4051 multiplexer; validated via LTspice simulation and TI ASLK Pro hardware
- Integrated AM and PM modulation blocks along with a relaxation-oscillator-based VCO in LTspice, using unity-gain buffers to minimize inter-stage loading
LTspice
TL082
TI ASLK Pro

Semiconductor Device Modeling using Sentaurus TCAD
- Modeled N-resistor, PN diode, and NMOS devices in Sentaurus TCAD with parameterized doping profiles and device geometries
- Configured and automated process and device simulations using Sentaurus Workbench with command-based scripting workflows
- Analyzed and visualized electrostatic potential, carrier concentration distributions, and I–V characteristics using Sentaurus Visual and Inspect tools
Sentaurus TCAD
- Designed a fixed-point DSP SoC (Sky130, Caravel) with CIC decimator → 8-tap FIR → PWM DAC pipeline
- Implemented Wishbone-mapped control interface enabling runtime FIR coefficient updates and datapath configuration
- Verified functionality via RTL simulations (filter response, gain, PWM linearity) using Icarus Verilog
- Achieved timing closure at 40 MHz and clean physical signoff (OpenLane, LVS/DRC clean)
Verilog
OpenLane
Sky130 PDK
Wishbone
Icarus Verilog
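Below is a behavioral sketch of a CIC decimator like the one at the front of this chain. It assumes N integrator stages at the input rate, decimation by R, and then N comb stages with a differential delay of 1 at the output rate. Python integers are unbounded. In RTL the stages wrap modulo 2^W, which does no harm in a CIC as long as W covers the (R·M)^N gain growth. `cic_decimate` is a hypothetical name.

```python
from typing import List, Sequence


def cic_decimate(x: Sequence[int], R: int, N: int) -> List[int]:
    """N-stage CIC decimator, decimation factor R, differential delay M = 1."""
    integ = [0] * N        # integrator states, clocked at the input rate
    comb_prev = [0] * N    # one-sample comb delays, clocked at the output rate
    out = []
    for n, sample in enumerate(x):
        acc = sample
        for s in range(N):             # cascade of accumulators: y[n] = y[n-1] + x[n]
            integ[s] += acc
            acc = integ[s]
        if n % R == R - 1:             # keep every R-th integrator output
            y = acc
            for s in range(N):         # cascade of differencers: y[m] = x[m] - x[m-1]
                y, comb_prev[s] = y - comb_prev[s], y
            out.append(y)
    return out


# A DC input settles to the CIC gain (R*M)^N = 4^2 = 16 after a short transient
print(cic_decimate([1] * 32, R=4, N=2))  # [10, 16, 16, 16, 16, 16, 16, 16]
```

The known DC gain is one reason a downstream FIR or gain stage usually rescales the CIC output.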
- Developed a Python-based analysis and fault modeling tool for ISCAS’85/89 benchmark circuits, enabling automated evaluation of logic faults and circuit behavior
- Built a machine learning-driven pre-route congestion prediction tool using macro-aware RUDY correction and region-stratified evaluation on CircuitNet-N14 datasets
- Designed and evaluated ML-based branch prediction and cache replacement strategies using ChampSim simulation traces, improving performance analysis workflows
- Implemented CSR-based sparse matrix-vector multiplication (SpMV) benchmarks with heterogeneous CPU–GPU execution and conducted memory roofline analysis for performance optimization
- Developed a PCB fault detection and classification system using YOLOv7, achieving automated defect identification on PCB datasets
Python
Machine Learning
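To give context for the SpMV bullet: CSR stores only the nonzeros plus row offsets. The plain-Python reference kernel below shows the access pattern. It is a hypothetical sketch, not the benchmarked CPU/GPU code. Each nonzero costs 2 FLOPs but requires a value, a column index, and a gathered x entry, which is why SpMV usually falls in the memory-bound region of a roofline plot.

```python
from typing import List, Sequence


def spmv_csr(row_ptr: Sequence[int], col_idx: Sequence[int],
             vals: Sequence[float], x: Sequence[float]) -> List[float]:
    """y = A @ x for A stored in compressed sparse row (CSR) form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for r in range(n_rows):
        acc = 0.0
        # Nonzeros of row r occupy vals[row_ptr[r]:row_ptr[r + 1]]
        for p in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[p] * x[col_idx[p]]   # indirect (gather) access into x
        y[r] = acc
    return y


# A = [[10, 0, 0], [0, 0, 20], [30, 40, 0]] in CSR form
row_ptr = [0, 1, 2, 4]
col_idx = [0, 2, 0, 1]
vals = [10.0, 20.0, 30.0, 40.0]
print(spmv_csr(row_ptr, col_idx, vals, [1.0, 2.0, 3.0]))  # [10.0, 60.0, 110.0]
print(2 * len(vals))  # 8 FLOPs: one multiply and one add per nonzero
```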
- Integrated NVIDIA Jetson Nano for onboard compute with Pixhawk 4 flight controller to enable autonomous navigation and precision landing
- Designed a sub-2 kg quadrotor optimized for GNSS-denied mapping, localization, and vision-based safe-zone detection
- Calibrated ESCs and implemented a stable 5 V / 3 A BEC power system; established bidirectional long-range telemetry using ESP32 (~500 m)
- Interfaced barometer, optical-flow, and stereo-vision sensors with Pixhawk over I2C and UART for fused state estimation
- Implemented visual–inertial odometry using ORB-SLAM3 and VINS-Fusion on ROS 2, achieving ~5 m localization with <5 cm drift
- Simulated Mars-like no-GPS flight scenarios in Webots with 0.38 g gravity, enabling autonomous landings within 1.5 m × 1.5 m safe zones
Jetson Nano
Pixhawk 4
ROS 2
ORB-SLAM3
VINS-Fusion
Webots

PPO-Based Reinforcement Learning for Autonomous Racing on AWS DeepRacer
- Trained continuous-action PPO agents on AWS SageMaker for end-to-end, camera-based autonomous racing with steering and speed control
- Designed reward functions emphasizing centerline stability, heading alignment, curvature-aware waypoint tracking, and velocity-weighted progress
- Stabilized training using distance-band shaping, steering smoothness constraints, and tuned PPO hyperparameters (entropy annealing, ε-clipping, GAE λ)
- Evaluated robustness under simulated perturbations including waypoint jitter, curvature sweeps, and speed-limit randomization
- Achieved consistent sub-2-minute lap times, outperforming default baselines and reaching top global leaderboard rankings in 2024
AWS SageMaker
PPO
Reinforcement Learning
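Here is a simplified reward in the form the AWS DeepRacer console expects: a `reward_function(params)` that returns a float. It illustrates centerline distance bands, a steering-smoothness penalty, and a speed weight. The band widths, the 15° threshold, and the speed weighting are hypothetical values chosen for illustration, and this is not the reward that produced the results above.

```python
def reward_function(params: dict) -> float:
    """Distance-band reward with steering and speed shaping (illustrative values)."""
    if not params["all_wheels_on_track"]:
        return 1e-3                      # near-zero reward when off track

    track_width = params["track_width"]
    dist = params["distance_from_center"]

    # Distance bands expressed as fractions of the track width
    if dist <= 0.1 * track_width:
        reward = 1.0
    elif dist <= 0.25 * track_width:
        reward = 0.5
    elif dist <= 0.5 * track_width:
        reward = 0.1
    else:
        reward = 1e-3

    # Steering smoothness: penalize sharp steering to discourage zig-zagging
    if abs(params["steering_angle"]) > 15.0:
        reward *= 0.8

    # Velocity weighting: modestly favor faster progress
    reward *= 1.0 + 0.1 * params["speed"]
    return float(reward)


example = {"all_wheels_on_track": True, "track_width": 0.76,
           "distance_from_center": 0.05, "steering_angle": 5.0, "speed": 2.0}
print(reward_function(example))  # 1.2
```

Stepped bands give a flatter reward near the centerline than a continuous penalty does. Exploration (for example, the entropy setting) is what keeps the agent from collapsing onto one line.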
- Developed a fully simulated 4-wheel autonomous robot equipped with GPS, 9-axis IMU, 2D LiDAR, ultrasonic distance sensors, and an actively actuated 2-DOF camera system
- Implemented global position tracking, local free-space detection, camera-based object observation, and reactive obstacle avoidance within the simulation stack
- Modeled sensor fusion inputs including GPS (x,y), IMU orientation/angular velocity, LiDAR ranging, and short-range distance sensing for collision-free navigation
- Designed independent wheel velocity control enabling smooth translation and turning, with teleoperation and autonomous wandering modes
- Built as a baseline multi-sensor robotics testbed for evaluating classical navigation and control behaviors without SLAM or learning-based methods
Simulation
Robotics
Sensor Fusion