This repository contains the code and notebooks for the MSc dissertation “Timbre Transfer of the Human Singing Voice for Effective Automatic Music Transcription” by Bonface Martin Oywa (University of East London / UNICAF, 2025). The full text is available in this repository as `Timbre Transfer of Human Singing Voice for Effective Automatic Music Transcription Dissertation.pdf`.
The project investigates whether converting human singing voice into instrument-like timbres (via DDSP) can improve the performance of state-of-the-art Automatic Music Transcription (AMT) systems such as Basic Pitch and MT3.
Short answer: timbre transfer substantially changes the audio, but does not reliably improve overall vocal AMT accuracy compared to transcribing the raw singing voice.

The central research question: does converting the human singing voice into instrument-like timbres via timbre transfer improve downstream AMT performance compared to direct vocal transcription, and under what conditions?
To answer this, the repository implements a unified pipeline:
- Data Engineering / Ground Truth
  - Start from Annotated-VocalSet extended CSV files (note onsets, offsets, MIDI pitches, etc.).
  - Generate ground-truth MIDI files and a `metadata.csv` linking each audio excerpt to its MIDI.
- Timbre Transfer (DDSP)
  - Convert monophonic singing into instrument-like audio (Violin, Flute, Trumpet, Tenor Saxophone) using pre-trained DDSP autoencoders.
- Automatic Music Transcription (AMT)
  - Transcribe both raw vocals and timbre-transferred audio using:
    - Basic Pitch (Spotify)
    - MT3 (Multi-Task Multitrack Music Transcription, Google)
- Evaluation
  - Objective AMT evaluation with MIR-Eval: Precision, Recall, F1, Onset F1, Onset-Offset F1, and Transcription Error Rate (TER).
  - Timbre fidelity evaluation with MFCC similarity and OpenL3 embedding similarity between original and converted audio.
  - A small subjective listening test for “note-flow” similarity.
At a high level:
singing-voice-amt/
├── dataset/ # Local data organisation (WAV, MIDI, metadata)
│ └── raw/ # Entry point for downloaded datasets (see below)
├── timbretransfer/ # DDSP timbre transfer utilities & helpers
├── evaluation/ # AMT + timbre fidelity evaluation scripts (MIR-Eval, OpenL3, etc.)
├── data_engineering.ipynb # Build metadata.csv + generate MIDI from Annotated-VocalSet
├── timbre_transfer.ipynb # End-to-end DDSP timbre transfer for all excerpts
├── Music_Transcription_with_Transformers.ipynb # MT3-based AMT experiments
├── Onsets_and_Frames.ipynb # Additional AMT baseline experiments (piano-style model)
├── requirements-MT3.txt # Dependencies for MT3-based experiments
├── requirements-TT.txt # Dependencies for DDSP timbre transfer
├── README.md # (You are here)
└── LICENSE # CC0-1.0
Note: Paths like dataset/wav, dataset/midi, tt_dataset/<instrument>/, and amt_dataset/... are created as you run the notebooks and scripts. 
The repository works with publicly available singing datasets; it does not redistribute any audio.
Experiments in the dissertation primarily use Annotated-VocalSet (an extension of VocalSet), which provides:
- Monophonic a cappella singing by 20 professional singers (9 male, 11 female).
- Extended CSV annotations with:
  - Onset / offset times
  - MIDI pitch (estimated and ground truth)
  - Note durations
  - Vowel and lyric information

You’ll need:

- The extended CSVs (the extended_2 variant is recommended in this project)
- The corresponding .wav audio files

Download both from Zenodo (see the links listed in the dissertation PDF).
The dissertation also discusses, but does not necessarily use in the main experiments:
- Choral Singing Dataset – choir recordings with MIDI, f0, and note annotations (rejected due to vibrato-only style). 
- Other vocal datasets (MIR-1K, NUS-48E, etc.) were considered but not adopted due to lack of aligned MIDI. 
The original experiments used two machines because of library constraints: 
- macOS (M1 Pro, 16 GB RAM):
  - Python 3.8 / 3.11 via Anaconda
  - Used for DDSP timbre transfer and some AMT runs.
- Ubuntu (RTX 3060, 16 GB RAM):
  - Used for MT3 (transformer) experiments due to the TensorFlow / JAX / GPU setup.
You can adapt this to a single machine if your environment supports all dependencies.
```shell
git clone https://github.com/martinoywa/singing-voice-amt.git
cd singing-voice-amt
```

Timbre Transfer (DDSP):

```shell
python3.11 -m venv .venv-tt
source .venv-tt/bin/activate
pip install --upgrade pip
pip install -r requirements-TT.txt
```

MT3 / AMT Experiments:

```shell
python3.8 -m venv .venv-mt3
source .venv-mt3/bin/activate
pip install --upgrade pip
pip install -r requirements-MT3.txt
```
The requirements files include core libraries such as DDSP, TensorFlow/JAX, librosa, pretty_midi, MIR-Eval, OpenL3, and the relevant AMT libraries (e.g. MT3, Basic Pitch, etc.). See the requirements-*.txt files for the precise versions.
This section summarises how to go from raw Annotated-VocalSet files to evaluation results.
Notebook: data_engineering.ipynb
Tasks: 
- Filter excerpts:
  - Select only straight singing styles (e.g. “caro”, “dona”, “row”) with natural voice and minimal stylistic variation.
- Generate MIDI:
  - Parse the extended_2 CSV annotations.
  - Create MIDI notes from:
    - Onset time
    - Offset time
    - MIDI pitch
  - Ignore rows labelled as non-sound events (rests, transitions).
  - Apply a fixed velocity (e.g. 100) to all notes for simplicity.
- Align audio & MIDI:
  - Copy the corresponding .wav files into a local folder (e.g. dataset/wav/).
  - Save generated .mid files into dataset/midi/.
- Build metadata:
  - Create dataset/metadata.csv with at least:
    - excerpt_name (e.g. m6_row_straight)
    - wav_path (e.g. dataset/wav/m6_row_straight.wav)
    - midi_path (e.g. dataset/midi/m6_row_straight.mid)
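As a sketch of the MIDI-generation step, the snippet below shows how annotation rows can be filtered and mapped to note tuples before being handed to pretty_midi in the notebook. The column names (`onset_s`, `offset_s`, `midi_pitch`, `label`) and the non-sound labels are illustrative assumptions; check the actual extended_2 headers used in data_engineering.ipynb.

```python
import csv
import io

FIXED_VELOCITY = 100  # a single velocity is applied to all notes, for simplicity

def rows_to_notes(rows):
    """Turn annotation rows into (pitch, onset, offset, velocity) tuples.

    Column names ("onset_s", "offset_s", "midi_pitch", "label") are
    illustrative; adapt them to the real extended_2 CSV headers.
    """
    notes = []
    for row in rows:
        # Skip rows labelled as non-sound events (rests, transitions).
        if row["label"] in {"rest", "transition"}:
            continue
        notes.append((
            int(float(row["midi_pitch"])),
            float(row["onset_s"]),
            float(row["offset_s"]),
            FIXED_VELOCITY,
        ))
    return notes

# Tiny inline example standing in for one annotation CSV.
sample = io.StringIO(
    "onset_s,offset_s,midi_pitch,label\n"
    "0.00,0.50,60,note\n"
    "0.50,0.70,0,rest\n"
    "0.70,1.20,62,note\n"
)
notes = rows_to_notes(csv.DictReader(sample))
# Each tuple can then become a pretty_midi.Note(velocity, pitch, start, end).
```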
Notebook: timbre_transfer.ipynb
Helpers: timbretransfer/

- Download pre-trained DDSP models:
  - Violin, Flute, Trumpet, Tenor Saxophone
- Place the checkpoints under pretrained/<instrument_name>/:
  - operative_config-0.gin
  - ckpt-*
- Run the notebook. For each entry in metadata.csv it will:
  - Load 16 kHz mono audio (dataset/wav/<excerpt_name>.wav).
  - Extract DDSP features: f0, loudness, confidence, etc.
  - Normalise/auto-tune f0 and loudness using dataset statistics.
  - Synthesize with the chosen instrument model.
- Output layout:
tt_dataset/
├── violin/
│ ├── m6_row_straight.wav
│ └── ...
├── flute/
├── trumpet/
└── tenor_saxophone/
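The loop that produces this layout can be sketched as below. `convert_fn` is a hypothetical stand-in for the DDSP feature-extraction and resynthesis step; the real implementation lives in timbre_transfer.ipynb and timbretransfer/.

```python
from pathlib import Path

INSTRUMENTS = ["violin", "flute", "trumpet", "tenor_saxophone"]

def convert_all(metadata_rows, out_root, convert_fn):
    """Write one converted file per (excerpt, instrument) pair.

    `convert_fn(wav_path, instrument)` is a placeholder for the DDSP
    resynthesis step and should return the converted audio as bytes.
    """
    out_root = Path(out_root)
    written = []
    for row in metadata_rows:
        name = Path(row["wav_path"]).stem
        for instrument in INSTRUMENTS:
            target = out_root / instrument / f"{name}.wav"
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(convert_fn(row["wav_path"], instrument))
            written.append(target)
    return written
```

Keeping the excerpt name identical across `tt_dataset/<instrument>/` folders is what lets the evaluation scripts pair converted audio with its ground-truth MIDI later.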
On the original hardware, each 33-second excerpt took ~1.5 minutes to convert per instrument. 
Notebook examples:
- Music_Transcription_with_Transformers.ipynb – MT3 experiments
The dissertation’s main AMT systems are: 
- Basic Pitch (instrument-agnostic, lightweight AMT)
- MT3 (transformer-based, multi-instrument AMT)
You will typically:

- Transcribe the baseline vocal audio. For each dataset/wav/<excerpt_name>.wav:
  - Run Basic Pitch → amt_dataset/baselines/basic_pitch/<excerpt_name>.mid
  - Run MT3 → amt_dataset/baselines/MT3/<excerpt_name>.mid
- Transcribe the timbre-transferred audio. For each instrument and each file in tt_dataset/<instrument>/:
  - Run Basic Pitch → amt_dataset/converted/basic_pitch/<instrument>/<excerpt_name>.mid
  - Run MT3 → amt_dataset/converted/MT3/<instrument>/<excerpt_name>.mid
- Keep the sample rate consistent (16 kHz is used across DDSP and AMT in this project).
Scripts in evaluation/ compute:

- Note-level AMT metrics (using MIR-Eval):
  - Precision, Recall, F1
  - Onset F1
  - Onset-Offset F1
  - Transcription Error Rate (TER)
  - Typical onset tolerance: 50 ms
  - Offset tolerance: 20% of note duration
- Timbre fidelity metrics:
  - MFCC similarity (cosine similarity of time-aggregated MFCCs)
  - Embedding similarity using OpenL3 (cosine similarity of music embeddings)
- Listening test utilities:
  - Simple randomised selection of original vs converted excerpts.
  - A binary “musically similar note flow? (Yes/No)” rating.
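For illustration, here is a minimal re-implementation of the onset-only F1 (50 ms tolerance) and the cosine-similarity measure behind the timbre fidelity metrics. The evaluation/ scripts use MIR-Eval, librosa, and OpenL3 directly; note that this sketch uses greedy matching, whereas MIR-Eval performs optimal bipartite matching, so results may differ slightly.

```python
import numpy as np

def onset_f1(ref_onsets, est_onsets, tol=0.05):
    """Onset-only F1 with a 50 ms tolerance (the project's MIR-Eval setting).

    Greedy one-to-one matching for illustration only; mir_eval.transcription
    uses optimal bipartite matching.
    """
    ref, est = sorted(ref_onsets), sorted(est_onsets)
    used = [False] * len(est)
    matched = 0
    for r in ref:
        for i, e in enumerate(est):
            if not used[i] and abs(e - r) <= tol:
                used[i] = True
                matched += 1
                break
    precision = matched / len(est) if est else 0.0
    recall = matched / len(ref) if ref else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def cosine_similarity(a, b):
    """Cosine similarity of two time-aggregated feature vectors
    (e.g. mean MFCCs, or mean OpenL3 embeddings)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```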
⸻
High-level findings from the dissertation:

- Baseline vocal AMT:
  - Both Basic Pitch and MT3 underperform their reported piano/instrument benchmarks on singing voice.
  - Average note-level F1 ≈ 0.33–0.34; Onset-Offset F1 ≈ 0.18–0.20.
  - MT3 generally has more stable Onset F1 and lower TER; Basic Pitch sometimes has slightly higher note-level F1.
- Converted (timbre-transferred) AMT:
  - Tenor Saxophone is the most AMT-friendly timbre:
    - Best Basic Pitch F1 ≈ 0.20
    - Best MT3 F1 ≈ 0.14
    - Onset F1 > 0.5 for both models.
  - Flute and Violin timbres yield very low F1 (near zero for some settings).
  - Overall, timbre transfer does not outperform baseline vocal transcription when averaged over all excerpts and metrics; F1 drops and TER rises compared to raw vocals.
- Timbre fidelity vs transcription accuracy:
  - MFCC similarity is a poor predictor of AMT performance (weak, non-significant correlations).
  - Embedding similarity (OpenL3) correlates strongly and positively with AMT metrics (r ≳ 0.9).
  - Increasing timbre fidelity beyond ~0.96–0.97 MFCC similarity shows diminishing or even negative returns on F1.
- Listening test:
  - Only 2/5 randomly selected converted excerpts were judged clearly similar in “note flow” to the original vocals, indicating residual artefacts and noisy sustains in some conversions.
⸻
- Obtain datasets:
  - Download the Annotated-VocalSet extended CSVs and corresponding WAVs from Zenodo.
- Generate MIDI + metadata:
  - Run data_engineering.ipynb to create:
    - dataset/midi/
    - dataset/wav/
    - dataset/metadata.csv
- Run timbre transfer:
  - Set up the DDSP environment (e.g. macOS).
  - Download the pre-trained DDSP models and place them under pretrained/<instrument_name>/.
  - Run timbre_transfer.ipynb → tt_dataset/<instrument>/.
- Run AMT models:
  - Set up an environment for MT3 (e.g. Ubuntu + GPU).
  - Use Basic Pitch + MT3 to transcribe:
    - dataset/wav/ → amt_dataset/baselines/...
    - tt_dataset/... → amt_dataset/converted/...
- Evaluate:
  - Use the scripts in evaluation/ to compute per-file metrics and aggregated results tables & plots comparing:
    - Baseline vs converted
    - Different timbres (Violin, Flute, Trumpet, Tenor Sax)
- Optional:
  - Run the listening test utilities to replicate the small subjective evaluation.
⸻
If you use this code or reproduce parts of the pipeline, please cite the dissertation:
```bibtex
@mastersthesis{oywa2025timbretransfer,
  author = {Bonface Martin Oywa},
  title  = {Timbre Transfer of the Human Singing Voice for Effective Automatic Music Transcription},
  school = {University of East London / UNICAF},
  year   = {2025},
  month  = {December}
}
```
⸻
This repository is released under the CC0-1.0 license (public domain dedication). See LICENSE for details.