This project is a stage-4 audio generator for the Speech_Capture pipeline, running on a Raspberry Pi Pico (RP2040).
- It joins the same I2C bus used by stage-2 and stage-3 devices.
- It operates as an I2C slave at address `0x65`.
- It receives control/features from stage-3 (`Speech_Recognition_Translator`).
- It runs a single-channel reverse neural network on Core 1.
- It reconstructs a 16 kHz audio stream from 40 spectral bins on Core 0.
- It stores generated spectra in a 40 x 100 byte output image (40 bins × 100 lines).
- It outputs audio through an 8-bit GPIO DAC (R-2R ladder or resistor network).
- It supports spoken system prompts that can be used by robots as a voice interface.
- It also supports the new-word learning loop, where the system can ask the user how an unrecognized word is spelled and which language it belongs to.
- Because this reverse network is trained from data aligned to `Speech_Process_8bit_relu`, it is expected to develop a speaking accent similar to the dominant accent in that stage-2 training set.
Think of this module as the inverse of `Speech_Process_8bit_relu`:
- Stage 2 (`Speech_Process_8bit_relu`) maps spectral features → phoneme-like outputs.
- Stage 4 (`Speech_Generation`) maps control/features → spectral bins → time-domain audio output.
This implementation is intentionally beginner-friendly and modular. It provides a complete scaffold and register interface so you can improve model quality, DSP quality, and waveform reconstruction over time.
In addition to speech output, this stage is part of the interactive vocabulary-learning process for unknown words: it can generate prompts that help collect spelling and language metadata for dictionary updates.
- I2C slave register handling (`0x65`)
- Register-based configuration and live data I/O
- Per-line inverse-transform rendering (40 bins → 256 samples)
- Playback pointer stepping at 62.5 lines/s (256 samples/line)
- 8-bit sample output across GPIO 0-7
- Receives input feature buffer from Core 0
- Runs 1-hidden-layer 8-bit fixed-point NN
- Produces 40 output bins (500-5500 Hz)
- Returns bins to Core 0
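The Core 1 list above can be made concrete with a minimal sketch of one 8-bit fixed-point dense layer with ReLU. The Q7-style `>>7` rescale, the function name, and the sizes are illustrative assumptions for learners, not the firmware's actual layout.

```c
#include <stdint.h>
#include <stddef.h>

// One dense layer in 8-bit fixed point: int8 inputs and weights,
// 32-bit accumulator, >>7 rescale (Q7-style), ReLU clamped to 0..127.
static void dense_relu_8bit(const int8_t *in, size_t n_in,
                            const int8_t *w, const int8_t *bias,
                            int8_t *out, size_t n_out)
{
    for (size_t o = 0; o < n_out; o++) {
        int32_t acc = (int32_t)bias[o] << 7;   // bias at accumulator scale
        for (size_t i = 0; i < n_in; i++)
            acc += (int32_t)in[i] * (int32_t)w[o * n_in + i];
        acc >>= 7;                             // back to int8 scale
        if (acc < 0)   acc = 0;                // ReLU
        if (acc > 127) acc = 127;              // saturate
        out[o] = (int8_t)acc;
    }
}
```

In a 1-hidden-layer network this would run twice per inference: input features → hidden activations, then hidden activations → the 40 output bins.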
| GPIO | Function | Notes |
|---|---|---|
| 20 | SDA | I2C0 SDA |
| 21 | SCL | I2C0 SCL |
| GPIO | Bit |
|---|---|
| 0 | DAC bit 0 (LSB) |
| 1 | DAC bit 1 |
| 2 | DAC bit 2 |
| 3 | DAC bit 3 |
| 4 | DAC bit 4 |
| 5 | DAC bit 5 |
| 6 | DAC bit 6 |
| 7 | DAC bit 7 (MSB) |
Use these pins with an R-2R resistor ladder or weighted resistor DAC network, then route into an audio amplifier/filter.
Use a simple R-2R ladder so GPIO 0-7 become one analog output.
Recommended starter values:
- R = 10 kΩ
- 2R = 20 kΩ
ASCII concept diagram (each GPIO drives the ladder through 2R, series R links the nodes, and the MSB node sits next to the output):

```
DAC_OUT ---+--R--+--R--+--R--+--R--+--R--+--R--+--R--+--2R--- GND
           |     |     |     |     |     |     |     |
          2R    2R    2R    2R    2R    2R    2R    2R
           |     |     |     |     |     |     |     |
         GPIO7 GPIO6 GPIO5 GPIO4 GPIO3 GPIO2 GPIO1 GPIO0
         (MSB)                                     (LSB)
```
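As a quick sanity check on the ladder, an ideal, unloaded R-2R DAC produces a binary-weighted fraction of the drive voltage (the Pico's 3.3 V rail). A small helper, purely illustrative:

```c
#include <stdint.h>

// Ideal, unloaded R-2R ladder output: Vout = Vdd * code / 256.
// A real output sags slightly with load and GPIO drive impedance.
static double r2r_vout(uint8_t code, double vdd)
{
    return vdd * (double)code / 256.0;
}
```

Mid-scale (code 128) lands at exactly half the rail; full scale (255) stops one LSB short of 3.3 V.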
Ladder end termination: a final 2R from the LSB-end node to GND.

Practical analog output chain for robots:
- DAC_OUT → small RC low-pass filter (for example 1 kΩ + 10 nF)
- Filter output → audio amplifier input (or powered speaker module)
- Keep grounds shared between Pico and amplifier
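The cutoff of the suggested RC filter follows from the standard first-order formula fc = 1 / (2πRC); with 1 kΩ and 10 nF that is roughly 16 kHz, which knocks down DAC switching hash while leaving the 500-5500 Hz speech band untouched. A larger capacitor lowers the cutoff for stronger smoothing.

```c
// First-order RC low-pass cutoff frequency: fc = 1 / (2 * pi * R * C).
static double rc_cutoff_hz(double r_ohm, double c_farad)
{
    const double pi = 3.14159265358979323846;
    return 1.0 / (2.0 * pi * r_ohm * c_farad);
}
```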
Safety note for beginners:
- Do not connect speaker directly to GPIO pins.
- Always use the resistor network first, then an amplifier/buffer stage.
The device uses a register system similar to Speech_Process_8bit_relu.
See REGISTER_MAP.md for the full table.
Key registers:
| Register | Function |
|---|---|
| `0x00` | Control/Status (16-bit) |
| `0x02` | Input pointer |
| `0x03` | Input data (auto-increment) |
| `0x04` | Trigger NN run |
| `0x06` | Bin pointer |
| `0x07` | Bin data (auto-increment) |
| `0x10` | Image line pointer |
| `0x11` | Image line data (40-byte row) |
| `0x12` | Phoneme ID for generation |
| `0x13` | Generation/training command |
| `0x14` | Feedback score |
| `0x15` | Training target phoneme |
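The pointer/data pairs (`0x02`/`0x03`, `0x06`/`0x07`) follow a common auto-increment idiom: set the pointer once, then stream bytes through the data register. Here is a simplified device-side model of the bin pair so the behavior can be seen without hardware; the wrap-at-40 behavior is an assumption for this sketch.

```c
#include <stdint.h>

#define REG_BIN_PTR  0x06
#define REG_BIN_DATA 0x07
#define N_BINS       40

typedef struct {
    uint8_t bins[N_BINS];  // 40 spectral bins (500-5500 Hz)
    uint8_t bin_ptr;       // current index for auto-increment access
} gen_regs_t;

// Host writes: set the pointer, or store one bin and advance.
static void reg_write(gen_regs_t *r, uint8_t reg, uint8_t val)
{
    if (reg == REG_BIN_PTR) {
        r->bin_ptr = val % N_BINS;
    } else if (reg == REG_BIN_DATA) {
        r->bins[r->bin_ptr] = val;
        r->bin_ptr = (uint8_t)((r->bin_ptr + 1) % N_BINS);
    }
}

// Host reads: return the current bin and advance.
static uint8_t reg_read(gen_regs_t *r, uint8_t reg)
{
    if (reg == REG_BIN_DATA) {
        uint8_t v = r->bins[r->bin_ptr];
        r->bin_ptr = (uint8_t)((r->bin_ptr + 1) % N_BINS);
        return v;
    }
    return 0;  // other registers omitted in this sketch
}
```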
Training flow (host-driven):
- Write one phoneme ID and trigger image generation.
- Read 100 lines × 40 bins from image buffer.
- Feed lines to `Speech_Process_8bit_relu` channel 2 for scoring.
- If target confidence is below 80%, write feedback and trigger one backprop step.
- Repeat until confidence threshold is reached.
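The control flow of that loop can be sketched host-side. `score_phoneme()` and `backprop_step()` stand in for the real I2C transactions (generate plus stage-2 scoring, then one training step); their bodies here are stubs invented so the loop runs anywhere.

```c
#include <stdint.h>

static int g_confidence = 40;                  // stub state: pretend start score

static int score_phoneme(uint8_t phoneme_id)   // stub: generate + stage-2 score
{
    (void)phoneme_id;
    return g_confidence;
}

static void backprop_step(uint8_t phoneme_id)  // stub: one backprop step
{
    (void)phoneme_id;
    g_confidence += 10;                        // pretend steady improvement
}

// Train one phoneme until the scorer reports >= 80% confidence.
// Returns how many backprop steps were taken.
static int train_phoneme(uint8_t phoneme_id)
{
    int steps = 0;
    while (score_phoneme(phoneme_id) < 80) {
        backprop_step(phoneme_id);
        steps++;
    }
    return steps;
}
```

A real host should also cap the step count so a phoneme that fails to converge cannot stall the loop forever.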
Playback conversion details:
- Each image line is converted into a 256-sample audio block.
- Output samples are stepped at 16 kHz, so each line consumes 16 ms.
- This yields exactly 62.5 lines per second.
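The timing arithmetic above is worth pinning down in integer math, since firmware scheduling typically works in whole microseconds; the constants come straight from the bullets.

```c
#include <stdint.h>

enum {
    SAMPLE_RATE_HZ   = 16000,  // output sample rate
    SAMPLES_PER_LINE = 256,    // one image line -> one audio block
};

// Microseconds spent playing one image line: 256 / 16000 s = 16 ms.
static uint32_t line_duration_us(void)
{
    return (uint32_t)SAMPLES_PER_LINE * 1000000u / SAMPLE_RATE_HZ;
}

// Lines per second, scaled by 10 so the .5 stays exact: 62.5 -> 625.
static uint32_t lines_per_second_x10(void)
{
    return 10u * SAMPLE_RATE_HZ / SAMPLES_PER_LINE;
}
```

So a full 100-line output image plays for 1.6 seconds.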
```
cd Speech_Generation
mkdir -p build
cd build
cmake ..
ninja
```

UF2 output:

```
build/Speech_Generation.uf2
```

Flash with picotool:

```
picotool load build/Speech_Generation.uf2 -fx
```

Project files:
- `speech_generation.c` - Main firmware (I2C slave, NN, synthesis, DAC output)
- `REGISTER_MAP.md` - I2C register reference
- `ARCHITECTURE.md` - Deeper design notes for learners
- `QUICKSTART.md` - Practical bring-up steps
- Start by writing known values into the spectral bin registers (`0x06`, `0x07`).
- Observe the GPIO DAC output with an oscilloscope.
- Then trigger NN inference using register `0x04`.
- Iterate on weight values through the weight registers to understand how the output changes.
- `Speech_Recognition_AudioCapture` - beamforming + FFT
- `Speech_Process_8bit_relu` - feature/phoneme NN processing
- `Speech_Recognition_Translator` - sequence/word translation + control
- `Speech_Generation` (this project) - reverse synthesis to audio output