Audify.AI
Inspiration
Music is a universal language — but lyrics aren't. We wanted to break that barrier. Audify.AI was born from the idea that you should be able to enjoy any song in any language, not just through subtitles or static translations, but as an actual re-sung version of the track. We envisioned an end-to-end pipeline that takes a song, understands it at every level — vocals, words, pitch, timing — and reconstructs it in a new language while preserving the original melody. The goal: make music truly borderless.
What It Does
Audify.AI takes in an audio file of a song and produces a version of that song translated into a different language via AI. The pipeline works in several stages:
- Stem Separation — The uploaded audio is split into isolated vocal and instrumental tracks using Demucs.
- Lyric Transcription — The vocal track is transcribed into text with per-word timestamps using faster-whisper (e.g., {"word": "Baby", "start": 0.00, "end": 0.30}), providing the precise timing data needed for downstream alignment.
- Phoneme Extraction — Each transcribed word is converted into its phonetic representation for pronunciation analysis.
- Translation — The transcribed lyrics are translated into the target language using MarianMT (a local neural machine translation stack built on HuggingFace and PyTorch) and the Gemini API.
- Pitch Extraction — The vocal track is analyzed using librosa's pyin algorithm to extract per-note musical data (pitch, duration, timing), which is quantized into discrete MIDI note numbers using NumPy.
- Singing Voice Synthesis — The translated lyrics and instrumental track are fed into Mellotron (NVIDIA), which generates a realistic singing voice fitted to the existing rhythm of the song, producing the final translated audio output.
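To make the translated lyrics line up with the original melody, the per-word timestamps from transcription have to be carried over to the translated words. A simple illustrative sketch of one way to do this — proportionally redistributing the original time span across the translated words; this is a simplifying assumption for demonstration, not necessarily the exact alignment strategy in the pipeline:

```python
def align_translation(words, translated):
    """Spread the original words' time span across the translated words.

    `words` are dicts like {"word": "Baby", "start": 0.00, "end": 0.30},
    the shape faster-whisper emits with word_timestamps=True.
    Proportional splitting is an illustrative assumption.
    """
    start = words[0]["start"]
    span = words[-1]["end"] - start
    n = len(translated)
    aligned = []
    for i, w in enumerate(translated):
        aligned.append({
            "word": w,
            "start": round(start + span * i / n, 2),
            "end": round(start + span * (i + 1) / n, 2),
        })
    return aligned

words = [{"word": "Baby", "start": 0.00, "end": 0.30},
         {"word": "one", "start": 0.30, "end": 0.55},
         {"word": "more", "start": 0.55, "end": 0.80},
         {"word": "time", "start": 0.80, "end": 1.20}]
print(align_translation(words, ["Bebé", "una", "vez", "más"]))
```

Real lyric alignment would also weigh syllable counts and note boundaries, but even this naive split preserves the property synthesis needs most: the translated line occupies exactly the original line's time window.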
The frontend provides a simple React-based interface where users upload audio, optionally trim it, kick off a processing job, and watch progress update in real time. The UI displays playable outputs alongside a transcript table with timestamps, phonemes, and translations.
How We Built It
Backend: Python with FastAPI, chosen for its speed and natural fit with Python-based audio and ML tooling. The job system uses a lightweight threaded, in-memory architecture designed for rapid prototyping, with a clear upgrade path to a queue-based system (Redis + workers) for production scale.
Audio & Language Pipeline:
- Demucs — Stem separation (vocals vs. instrumental)
- faster-whisper — Local Whisper inference for word-level transcription with timestamps (word_timestamps=True). Chosen over the Gemini API to eliminate rate limits, API costs, and API key dependencies.
- MarianMT — Local/offline neural machine translation via HuggingFace + PyTorch for lyric translation
- Gemini 2.0 Flash — Used for supplementary translation tasks
- librosa (pyin) — Pitch detection on the vocal track. Chosen over Spotify's BASIC-PITCH due to TensorFlow incompatibility with Python 3.13 on macOS. librosa was already a transitive dependency of Demucs and faster-whisper, requiring no additional installation.
- NumPy — Quantizing continuous pitch contours into discrete MIDI note numbers and processing voiced/unvoiced frames
- midiutil — Exporting extracted notes to standard MIDI files for verification and debugging
- FFmpeg — Audio trimming and preprocessing
- Mellotron — NVIDIA's singing voice synthesis model, chosen for its ability to generate realistic singing voice from lyrics and instrumentals while fitting the existing rhythm. Selected over DiffSinger and ElevenLabs for its balance of functionality, realism, and feasibility within our timeframe.
- g2p_en — Grapheme-to-phoneme conversion for phoneme extraction
Frontend: React 19 with Create React App.
Infrastructure: Google Cloud for file storage and deployment.
Challenges We Ran Into
Gemini API rate limits and cost. Early in development, we used Gemini 2.0 Flash for transcription, which required manually chunking audio into 15-second segments with a separate API call per chunk. On the free tier, we quickly hit 429 RESOURCE_EXHAUSTED errors — even with a 30-second clip. We addressed this by migrating to faster-whisper, which runs entirely locally and simplified the codebase by removing the chunking logic altogether.
Whisper transcription accuracy on singing vocals. The base Whisper model produced poor results on singing — hallucinated text, repeated nonsensical phrases, and timestamps extending far beyond the audio duration. We fixed the timestamp issue by clipping vocals to the requested duration with FFmpeg before passing them to Whisper, and improved accuracy by upgrading to the small model, which handled melodic vocals much better.
Dependency compatibility on macOS + Python 3.13. Our original pitch extraction plan used Spotify's BASIC-PITCH (a neural network approach), but it depends on TensorFlow, and tensorflow-macos doesn't support Python 3.13. After troubleshooting, we pivoted to librosa's pyin algorithm — a DSP-based approach already available in our environment. The lesson: check dependency compatibility early before committing to a library.
Finding a viable singing voice synthesis solution. We initially assumed ElevenLabs could generate singing voice from translated text that matches an instrumental — or that some existing infrastructure would support this. Finding out that no off-the-shelf solution offered this functionality to the extent we needed was discouraging. ElevenLabs generates its own melody and rhythm rather than following the original song's tune, and DiffSinger, while offering explicit pitch control, presented integration challenges. After extensive research into different solutions — evaluating pitfalls, compatibility with existing components, and implementation complexity — we settled on Mellotron (NVIDIA), which gave us realistic singing voice generation while remaining feasible within our timeframe.
Performance bottlenecks across the pipeline. We addressed performance issues by isolating each stage of the workflow and identifying bottlenecks individually — for instance, using a faster Demucs model for stem separation and implementing a word-phoneme dictionary for quick lookup times, accepting targeted accuracy tradeoffs where necessary.
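The word-phoneme dictionary can be as simple as a pre-built lookup table with a memoized fallback for misses. A toy sketch — the entries and the `fallback` hook are placeholders, not our actual data (the real table would be built from CMUdict/g2p_en output):

```python
# Toy phoneme dictionary; a real one would be populated from CMUdict or g2p_en.
PHONEME_DICT = {
    "baby": ["B", "EY1", "B", "IY0"],
    "time": ["T", "AY1", "M"],
}

_cache = dict(PHONEME_DICT)

def phonemes_for(word, fallback=None):
    """Constant-time lookup, with an optional slow fallback (e.g. a G2P model)."""
    key = word.lower().strip(".,!?")
    if key in _cache:
        return _cache[key]
    if fallback is not None:
        _cache[key] = fallback(key)  # memoize so the slow path runs once per word
        return _cache[key]
    return []
```

The speedup comes from the common case: lyrics repeat words heavily, so after the first (slow) conversion every later occurrence is a dictionary hit.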
Converting continuous pitch to discrete notes. librosa's pyin returns a pitch value for every audio frame (~5–10ms). We had to write custom logic to group consecutive frames with the same quantized MIDI note into discrete note events with start/end times, filtering out notes shorter than 50ms to remove noise artifacts.
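Concretely, the grouping logic looks something like this sketch (the hop duration and 50 ms threshold are parameters; NaN marks unvoiced frames, matching pyin's convention):

```python
import numpy as np

def frames_to_notes(f0, hop_time=0.01, min_dur=0.05):
    """Group per-frame pitch estimates into discrete note events.

    f0: per-frame fundamental frequencies in Hz (NaN = unvoiced frame).
    Returns a list of (midi_note, start_sec, end_sec), dropping notes
    shorter than min_dur to filter out noise artifacts.
    """
    # Quantize each voiced frame to the nearest MIDI note number.
    midi = np.full(len(f0), -1, dtype=int)
    voiced = ~np.isnan(f0)
    midi[voiced] = np.round(69 + 12 * np.log2(f0[voiced] / 440.0)).astype(int)

    notes, start = [], None
    for i in range(len(midi) + 1):
        cur = midi[i] if i < len(midi) else -1
        prev = midi[i - 1] if i > 0 else -1
        if cur != prev:
            # Close the previous run of identical notes, if it was voiced.
            if prev != -1 and start is not None:
                if (i - start) * hop_time >= min_dur:
                    notes.append((int(prev), start * hop_time, i * hop_time))
            start = i if cur != -1 else None
    return notes
```

Run-length grouping plus the minimum-duration filter is what turns a jittery frame-level contour into note events clean enough to export as MIDI.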
Accomplishments That We're Proud Of
We're proud of shipping a fully functional end-to-end pipeline within such a tight time window. From raw audio upload to a translated, re-sung output — every stage works. Getting the voice synthesis stage to produce a working demo was especially rewarding given the number of dead ends we encountered (ElevenLabs, DiffSinger) before landing on Mellotron and generating output close to what we originally envisioned.
One highlight was migrating the transcription system from a cloud API to a fully local solution. The Gemini-based approach required an API key, had rate limits, cost money, and needed complex chunking logic. The faster-whisper replacement is simpler (one function call instead of a chunking loop), free, and has no rate limits — while also removing two files from the codebase and making the code easier to maintain.
We're also proud of the pitch extraction module — roughly 80 lines that take a raw vocal audio file and produce clean, structured per-note data ready for voice synthesis. Being able to export the extracted melody to MIDI and listen to it to verify accuracy was particularly satisfying.
Above all, none of this would have been possible without the collaboration and brainstorming across the team. We couldn't have asked for a better group.
What We Learned
Audio processing pipelines. We gained hands-on experience with how Whisper segments audio into timestamped phrases, how pitch detection algorithms like pyin work at the frame level, and how MIDI note numbers map to frequencies in Hz via the formula:
$$f = 440 \times 2^{\frac{m - 69}{12}}$$
where $f$ is the frequency in Hz and $m$ is the MIDI note number; quantizing a detected pitch uses the inverse, $m = 69 + 12\log_2(f/440)$. We also learned the critical distinction between text-to-speech and singing voice synthesis — matching an original song's melody requires explicit pitch and duration data, not just text.
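A quick numeric sanity check of both directions of this mapping (illustrative, not project code):

```python
import math

def midi_to_hz(m):
    # f = 440 * 2^((m - 69) / 12)
    return 440.0 * 2 ** ((m - 69) / 12)

def hz_to_midi(f):
    # Inverse: m = 69 + 12 * log2(f / 440)
    return 69 + 12 * math.log2(f / 440.0)

print(midi_to_hz(69))            # 440.0 (A4)
print(round(midi_to_hz(60), 2))  # 261.63 (middle C)
print(round(hz_to_midi(440.0)))  # 69
```

Each semitone is a factor of $2^{1/12}$, which is why rounding the log-frequency expression to the nearest integer is all the quantization step needs.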
Library evaluation beyond functionality. We learned to evaluate libraries by considering dependency chains, Python version support, and environment compatibility — not just feature sets. The BASIC-PITCH to librosa pivot taught us that the "best" library on paper isn't always the right choice for a given setup.
New frameworks and tools. Several team members learned FastAPI and Google Cloud from scratch under time pressure, along with the rich ecosystem of niche audio processing libraries (Demucs, faster-whisper, librosa, Mellotron, MarianMT).
Team coordination. We learned the importance of defining clear interfaces upfront. Our team split the pipeline into parallel modules (pitch extraction, word-level timestamps, voice synthesis setup, alignment/integration), agreeing on exact function signatures and output formats before coding. This let team members work independently while integration code could be written against those interfaces in parallel.
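As an example of such an interface contract, the pitch-extraction and synthesis sides might agree on a record like the following before either writes any logic (field names here are illustrative, not our actual schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NoteEvent:
    """Contract between pitch extraction (producer) and synthesis (consumer)."""
    midi_note: int  # quantized pitch, e.g. 69 == A4
    start: float    # seconds from the beginning of the clip
    end: float      # seconds; must satisfy end > start

    @property
    def duration(self) -> float:
        return self.end - self.start

# Either side can write code and tests against the contract before integration.
note = NoteEvent(midi_note=69, start=0.0, end=0.5)
```

Freezing the dataclass makes the shared records immutable, which avoids one module mutating data another module is still reading during parallel development.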
What's Next for Audify.AI
- Karaoke mode — Let users sing along with the translated lyrics using real-time playback with synchronized text.
- Artist voice matching — Make the synthesized singing voice sound closer to the original artist for a more seamless listening experience.
- Music creation in unfamiliar languages — Beyond translating existing songs, enable users to create original music in languages they don't speak.
- Real-time melody visualization — A piano-roll style display that shows the detected pitch contour alongside transcribed lyrics as the song plays, letting users visually verify the extraction and manually correct errors before translation and synthesis.
- Production scaling — Migrate the job system from the current threaded, in-memory architecture to a Redis-backed task queue with dedicated workers.