Speech Recognition AudioCapture

Part 1 of the Speech Recognition System Series

Real-time 3-microphone beamforming and frequency analysis system for Raspberry Pi Pico (RP2040).

Overview

This project implements a dual-core audio processing pipeline that:

Captures audio from 3 microphones at 16 kHz with 12-bit precision
Uses 31 mm microphone spacing (center-to-center, linear 3-mic array)
Performs delay-and-sum beamforming for 5 directional beams (-60°, -30°, 0°, +30°, +60°)
Computes 256-point FFT for frequency analysis (0-5500 Hz)
Transmits 45 frequency bands per beam via I2C to downstream processors (8-bit)

Terminology convention used in this project:

ADC front-end values: mic / channel.
Post-beamforming values: beam / beamformed channel.
Post-FFT values: bin.
I2C payload values: band.
2D indexing convention: FFT arrays are [beam][bin]; I2C-output arrays are [beam][band].

Architecture Highlights

Dual-Core Parallelized Processing

Core 0 (ADC + Beamforming + I2C)

DMA-driven 3-channel ADC capture
Delay-and-sum beamforming (5 beams)
I2C transmission during idle time
Never blocks Core 1's FFT computation

Core 1 (FFT Only)

256-point Q15 fixed-point FFT
Processes all 5 beams continuously
Queues results for Core 0 transmission
No I2C delays - maximum FFT throughput

Key Benefits

Parallelized: I2C transmission overlaps with next FFT cycle
No overruns: ~5ms I2C time doesn't block FFT processing
Buffered: Bidirectional 5-deep queues handle timing variations
High precision: Full 12-bit ADC resolution preserved

Hardware Requirements

Raspberry Pi Pico (RP2040)
3× analog microphones (ADC-compatible)
I2C slave devices at addresses 0x60-0x64 (optional, for data export)
GPIO LEDs on pins 0-4 for error indication (optional)

Pin Configuration

Pin	Function	Description
26-28	ADC 0-2	3 microphone inputs
8-9	I2C SDA/SCL	FFT data export (@400kHz)
0-4	GPIO OUT	I2C error LEDs (one per beam)
5, 10, 11	GPIO OUT	Debug timing signals
-----------	-------------	-------------------------------

Building

# Using VS Code task
Ctrl+Shift+B

# Or command line
ninja -C build

Flashing

# Via picotool
picotool load build/Speech_Recognition_AudioCapture.uf2 -fx

# Or drag-and-drop .uf2 to BOOTSEL drive

I2C Protocol

Each beam transmits a 46-byte packet:

Byte 0: 0xAA (header)
Bytes 1-45: 45 frequency bands (8-bit)

Slave addresses:

0x60 = Beam -60°
0x61 = Beam -30°
0x62 = Beam 0°
0x63 = Beam +30°
0x64 = Beam +60°

Performance Metrics

Sample rate: 16 kHz per channel
Mic spacing: 31 mm (adjacent microphones)
ADC precision: 12-bit (0-4095)
FFT size: 256 points
Frequency range: 0-5500 Hz (45 bins)
Processing: ~16ms per 5-beam cycle
I2C throughput: ~5ms for all beams @ 400 kHz

Beamforming Delay Model (Integer Math)

The beamformer uses integer sample shifts, with two effects combined:

Geometric steering delay from microphone spacing and beam angle
Fixed ADC round-robin skew compensation for sequential channel sampling

Why these delays work (beginner intuition)

Sound is not instantaneous: it propagates through air at about $c=343\text{ m/s}$ (around room temperature).

At this project sample rate, $f_s=16\text{ kHz}$, one sample period is:

$$ T_s=\frac{1}{f_s}=62.5,\mu s $$

In one sample period, sound travels:

$$ d_{sample}=c,T_s=\frac{343}{16000}\text{ m}\approx 21.4\text{ mm} $$

Because adjacent microphones are spaced by $d=31\text{ mm}$, a wave from the side reaches one mic first, then the next after a small delay. The adjacent-mic geometric delay is:

$$ \Delta t(\theta)=\frac{d,\sin\theta}{c} $$

Equivalent delay in samples:

$$ \Delta n(\theta)=\Delta t,f_s=\frac{d,\sin\theta}{c}f_s $$

Useful numbers for this geometry:

$\theta=0^\circ$: $\Delta n\approx 0$ (arrives nearly together)
$\theta=30^\circ$: $\Delta n\approx 0.72$ samples between adjacent mics
$\theta=60^\circ$: $\Delta n\approx 1.25$ samples between adjacent mics
End-to-end (mic0 to mic2) delay is about $2\times$ adjacent delay

Beamforming applies compensating delays so the chosen-direction waveforms line up in time before summing. When aligned, they add constructively (larger output). For other directions, timing mismatch creates phase mismatch, so summation is partially destructive (reduced output). That is the directional selectivity.

Simple arrival-order sketch (3 mics in a line):

Mic positions:  Mic0 -------- 31 mm -------- Mic1 -------- 31 mm -------- Mic2

Source at -60° (from left side):
Wave travel --->  hits Mic0 first, then Mic1, then Mic2
				 t0            t0+Δt         t0+2Δt

Source at +60° (from right side):
Wave travel <---  hits Mic2 first, then Mic1, then Mic0
				 t0            t0+Δt         t0+2Δt

Beam steering idea:
- To listen LEFT (-60°): delay Mic0 least and Mic2 most, so all three align before sum.
- To listen RIGHT (+60°): delay Mic2 least and Mic0 most, so all three align before sum.
- For other angles, signals are not perfectly aligned, so summed amplitude drops.

Phase intuition: when peaks line up with peaks, amplitudes add; when peaks line up with troughs, they partially cancel.

Round-robin ADC timing offset at 16 kHz/channel is:

Mic 0: 0 samples
Mic 1: +1/3 sample
Mic 2: +2/3 sample

The implemented integer shift is computed as:

$$ \text{delay}[\text{beam},\text{mic}] = -\mathrm{round}\left(\left(\tau_{geom}-\tau_{skew}\right)f_s\right) $$

With 31 mm spacing, this yields the following delay table (samples):

Beam angle	Mic 1	Mic 2
-60°	+2	+3
-30°	+1	+2
0°	0	+1
+30°	0	-1
+60°	-1	-2

Geometric-Only vs Skew-Compensated Delays

If you ignore ADC round-robin skew, a beginner intuition is often:

Beam angle	Mic 1	Mic 2
-60°	+2	+3
-30°	+1	+2
0°	0	0
+30°	-1	-2
+60°	-2	-3

The firmware table differs because it compensates the deterministic ADC channel skew (CH1 sampled +1/3 sample later than CH0, CH2 sampled +2/3 sample later than CH0).

Equivalent Angle Shift from ADC Skew

At this geometry ($d=31\text{ mm}$, $f_s=16\text{ kHz}$), a $1/3$ sample timing offset is roughly equivalent to a steering offset of about $13^\circ$:

$$ \sin(\theta_{eq}) \approx \frac{(1/3)}{(d,f_s/c)} = \frac{1/3}{1.446} \Rightarrow \theta_{eq}\approx 13.3^\circ $$

So the skew term effectively shifts the integer-delay solution by about $\pm13^\circ$ unless compensated (which this firmware does).

Debug & Monitoring

GPIO timing signals:

GPIO 10: Pulses on each FFT completion (5× per cycle)
GPIO 11: Pulses at start/end of 5-beam cycle
GPIO 5: Core 0 activity heartbeat

USB serial monitor (115200 baud):

Core 0/1 status messages
Queue full warnings
Cycle counters

Runtime Serial Diagnostic Commands (Beginner)

You can enable/disable debug streams at runtime without reflashing:

Mic / ADC input stage
micgain M P → set microphone gain for mic M (0..2) in percent P (25..800, starts at 150%)
rawclip L H → set RAW ADC clip window in mV (L=low, H=high, default 150 3150)
micstatus → show compact per-mic gain + clipping status
raw on|off → show/hide RAW ADC table
Beamforming stage
skew on|off → enable/disable ADC round-robin skew correction in beamforming
beam on|off → show/hide BEAMFORMED table
bgain B N → set gain for one beam B (0..4) to N (1..64)
benable B on|off → enable/disable processing for one beam B (0..4)
FFT stage
fft on|off → show/hide FFT section in periodic diagnostics
fftbeam B on|off → enable/disable RAW FFT table for beam B (0..4, default only B2 enabled)
fftraw on|off → show/hide RAW FFT bin tables
fftrawfmt hex|pct → RAW FFT display as 16-bit hex or % of full-scale (65535)
nfloor N → set FFT post-bin noise floor for all beams (0..255, lower = more speech detail)
bnfloor B N → set FFT post-bin noise floor for one beam B (0..4)
bfilter on|off → enable/disable long-term per-bin background suppression (auto-tunes from 5s average)
bsec N → set background suppression time constant in seconds (1..8)
bstrength N → set background suppression subtraction strength for all beams (0..100)
bbstrength B N → set subtraction strength for one beam B (0..4)
bcap N → set max per-bin subtraction cap (0..95, lower preserves more speech)
Output stage
i2clog on|off → select I2C 16→8 mapping mode: logarithmic companding (ON) or linear mapping (OFF), default ON
i2coutput [B] on|off → show/hide post-conversion I2C output table for beam B (0..4, if B omitted defaults to 2, all beams start OFF)
i2coutputfmt hex|pct → I2C output display as 8-bit hex or % of full-scale (255), with % shown in 3.1 format
i2c on|off → show/hide I2C failure log prints
tone on|off → show/hide tone monitor line
band N or just N → set tone-monitor output band index (0..44) (not raw FFT bin 0..127)
bin N → alias for band N (legacy command)
General / diagnostics
help → show command list
status → show current mode states
diagrate N → set diagnostic print period in seconds (1..60)
diagcompact on|off → switch periodic RAW/BEAM tables between compact and verbose formats
agc on|off → currently disabled (gain control is manual)
agcclip N → set AGC backoff threshold in clipped samples/frame (1..64)
savecfg → force-save current mic/beam settings to flash
loadcfg → reload settings from flash

Gain and band-filter behavior:

Pre-FFT gain is manual and starts at 20x.
Each beam has independent gain, noise floor, and subtraction strength controls.
Band-filter subtraction auto-adjusts from a ~5s average of each beam’s own raw FFT bins.
Low 5s average increases suppression on that beam; high 5s average reduces suppression to preserve speech visibility.
Mic gains are applied pre-beamforming with integer Q8 math and are persisted in flash.
I2C packing performs the 16-bit-to-8-bit stage in one selectable mode: linear full-scale mapping (i2clog off) or logarithmic companding (i2clog on) to give low-amplitude bins more visual impact.
Post-FFT bin filtering is currently bypassed while FFT tuning is in progress (for cleaner raw FFT diagnostics).

Per-beam field semantics (shown in diagnostics):

en: enables/disables spectral processing for that beam. When OFF, that beam still participates in beamforming timing and packet flow, but FFT output bins are forced to zero.
nFloor: per-beam FFT post-bin magnitude floor (0..255 internal magnitude units). Bins below this threshold are zeroed.
bStr: manual/base per-beam subtraction strength (% of estimated baseline).
bAuto: runtime auto-adjusted subtraction strength (%), derived from bStr using each beam's ~5s average level.
clip: clipped input sample count seen in the pre-FFT scaling stage for that beam.

Periodic diagnostic report order (single print burst per diagrate):

STATUS section
- Shows ON/OFF state for diagnostic sections (status, watchdog, raw, beam, fft) and stream toggles.
WATCHDOG section
- Shows ADC/DMA health snapshot (dmaBusy, block diff, DMA remaining count, DMA rearm count, queue depths).
RAW ADC section
- Shows AGC state and clip threshold, then per-mic values:
- manual gain, auto gain (currently fixed at 100%), average voltage, RAW/POST P2P min/avg/max, clip count.
BEAMFORMED section
- Global line: AGC, skew compensation state, beam averaging time constant.
- Per-beam rows: beam index, angle, mic delays (D0/D1/D2), gain, RAW/POST P2P min/max/avg, clip, bStr, bAuto.
FFT section
- Shows RAW FFT diagnostics for enabled beams.
- RAW table prints all 128 FFT bins in 8 rows × 16 columns.
- Output-band bins are highlighted; lower single bins and upper paired bins are bracketed.
I2C_OUTPUT section
- Shows post-conversion (16-bit FFT to 8-bit I2C payload) values for enabled beams.
- Per-beam table header is I2C_Output_Beam X.
- Displays 45 output bands using a 5×8 grid (0..39) plus a tail row (40..44).

Combined BEAMFORMED table:

Periodic output now combines configuration and measured amplitude stats into one BEAMFORMED (combined) table when beam on is enabled.
If beam on is OFF, the status stream still shows the compact Per-beam table.

RAW ADC diagnostics now include per-microphone:

current mic gain (%),
average input voltage (for bias verification),
average voltage error from target bias (1650 mV),
raw min/max absolute voltage (for clip-window checks),
raw P2P level,
post-gain P2P level,
post-gain clipping count.

Useful tone band examples:

band 8 → ~980 Hz
band 16 → ~1960 Hz
band 33 → ~4030 Hz

Approximate mapping:

$$f_{band}(n)\approx \frac{5500}{45}n\text{ Hz}\approx 122.2n\text{ Hz},\quad n=0\ldots44$$

Expected RAW ADC Values (What "Good" Looks Like)

For correctly biased single-supply microphone signals into RP2040 ADC:

AVG(mV) for CH0/CH1/CH2 should be near mid-rail (typically ~1600–1700 mV)
MIN(mV) should stay above 0 mV
MAX(mV) should stay below 3300 mV
P2P(mV) should increase clearly when tone/speech is present

Why these limits matter:

RP2040 ADC input range is 0–3.3 V (no negative input range)
Signals outside this range clip and distort beamforming/FFT
See RP2040 datasheet (ADC/electrical limits) and Raspberry Pi Pico datasheet (board electrical characteristics)

If AVG(mV) is far from ~1.65 V, adjust analog bias network before tuning beamforming.

How it works (summary)

Core 0 continuously captures 3-channel ADC data via DMA at 16 kHz.
Core 0 performs delay-and-sum beamforming for 5 angles (-60°, -30°, 0°, +30°, +60°).
Core 1 runs a 256-point fixed-point FFT (Q15), computes magnitudes, and extracts 45 FFT bins.
Core 0 transmits the 45-band packet over I2C to slaves at addresses 0x60–0x64.

Frequency bin centers (Hz)

To avoid ambiguity:

FFT bins = raw FFT indices ($k=0..127$) with spacing $\Delta f = f_s/N = 16000/256 = 62.5$ Hz.
Band 0..7 use FFT bins 1..8 directly (single-bin bands).
Band 8..42 use paired bins: (9,10), (11,12), ... (77,78).
Band 43..44 use single bins 79 and 80.
FFT bin 0 is DC and excluded from output bands.
Output bands = downstream values transmitted over I2C.

Representative band centers used by this mapping:

Band 0 = 62.5 Hz (FFT bin 1)
Band 7 = 500.0 Hz (FFT bin 8)
Band 8 = 593.75 Hz (avg of FFT bins 9 and 10)
Band 42 = 4843.75 Hz (avg of FFT bins 77 and 78)
Band 43 = 4937.5 Hz (FFT bin 79)
Band 44 = 5000.0 Hz (FFT bin 80)

45 Band Center Frequencies (Hz)

Band	Center (Hz)
0	62.5
1	125.0
2	187.5
3	250.0
4	312.5
5	375.0
6	437.5
7	500.0
8	593.75
9	718.75
10	843.75
11	968.75
12	1093.75
13	1218.75
14	1343.75
15	1468.75
16	1593.75
17	1718.75
18	1843.75
19	1968.75
20	2093.75
21	2218.75
22	2343.75
23	2468.75
24	2593.75
25	2718.75
26	2843.75
27	2968.75
28	3093.75
29	3218.75
30	3343.75
31	3468.75
32	3593.75
33	3718.75
34	3843.75
35	3968.75
36	4093.75
37	4218.75
38	4343.75
39	4468.75
40	4593.75
41	4718.75
42	4843.75
43	4937.5
44	5000.0

Documentation

See .github/copilot-instructions.md for detailed architecture, algorithms, and beginner-friendly explanations.

Part of Speech Recognition System

This is the first module in a series:

AudioCapture (this project) - Beamforming and frequency analysis
Feature extraction - TBD
Recognition engine - TBD
Command interface - TBD

License

MIT License - See LICENSE file

Authors

Ray Edgley & Copilot.

Developed as part of a modular speech recognition system for embedded platforms.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
.github		.github
.vscode		.vscode
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md
Speech_Recognition_AudioCapture.c		Speech_Recognition_AudioCapture.c
pico_sdk_import.cmake		pico_sdk_import.cmake
test_gpio6_only.c		test_gpio6_only.c
test_gpio_only.c		test_gpio_only.c
test_gpio_scan.c		test_gpio_scan.c
test_usb_only.c		test_usb_only.c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Speech Recognition AudioCapture

Part 1 of the Speech Recognition System Series

Overview

Architecture Highlights

Dual-Core Parallelized Processing

Key Benefits

Hardware Requirements

Pin Configuration

Building

Flashing

I2C Protocol

Performance Metrics

Beamforming Delay Model (Integer Math)

Why these delays work (beginner intuition)

Geometric-Only vs Skew-Compensated Delays

Equivalent Angle Shift from ADC Skew

Debug & Monitoring

Runtime Serial Diagnostic Commands (Beginner)

Expected RAW ADC Values (What "Good" Looks Like)

How it works (summary)

Frequency bin centers (Hz)

45 Band Center Frequencies (Hz)

Documentation

Part of Speech Recognition System

License

Authors

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Speech Recognition AudioCapture

Part 1 of the Speech Recognition System Series

Overview

Architecture Highlights

Dual-Core Parallelized Processing

Key Benefits

Hardware Requirements

Pin Configuration

Building

Flashing

I2C Protocol

Performance Metrics

Beamforming Delay Model (Integer Math)

Why these delays work (beginner intuition)

Geometric-Only vs Skew-Compensated Delays

Equivalent Angle Shift from ADC Skew

Debug & Monitoring

Runtime Serial Diagnostic Commands (Beginner)

Expected RAW ADC Values (What "Good" Looks Like)

How it works (summary)

Frequency bin centers (Hz)

45 Band Center Frequencies (Hz)

Documentation

Part of Speech Recognition System

License

Authors

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages