Overture: Turn your hum into a score, and your moment into a masterpiece.
Inspiration
We wanted a “one-button” creative tool that turns an idea you can’t quite articulate yet (a hum, a rhythm, a vibe) into something playable and shareable: a real song, not just a text-to-speech demo. But we also didn’t want the process to be a black box, so we built in a path for hands-on control. The goal: capture a human spark, let automation generate a full track, then give the user the option to break the song into stems (vocals, drums, bass, other) and do manual production in a multitrack editor, shaping the final mix like a real producer.
What We Learned
- A reliable AI pipeline is mostly systems engineering: queues, timeouts, streaming, retries, and observability matter as much as model prompts.
- “It works on my machine” fails at boundaries: environment variables, long-running HTTP responses, large audio files, and external API quotas.
- Audio workflows are dependency-heavy; features like stem separation depend on tools like `ffmpeg` and model downloads, which introduce OS/network friction.
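The "systems engineering" point above can be made concrete: every external call (transcription, MCP, storage) sits behind a timeout plus retry-with-backoff so a flaky boundary fails loudly instead of hanging a job. A minimal stdlib sketch (the helper name and delay values are illustrative, not our exact settings):

```python
import time

def call_with_retries(fn, attempts=3, base_delay=1.0):
    """Run fn(); on failure, retry with exponential backoff.

    Re-raises the last error so the job can be marked FAILED
    instead of sitting in PENDING forever.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

In the real pipeline the same idea applies at two levels: HTTP timeouts on individual requests, and task-level retries in the worker.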
How We Built It
1) Frontend (recording + status + playback)
- Next.js web app with a Record / Stop flow using `MediaRecorder`.
- The browser uploads audio to the backend and polls job status.
- Playback uses a backend audio proxy endpoint to avoid CORS issues.
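The playback proxy is conceptually simple: the backend fetches the object from storage and re-serves it from the app's own origin, so the browser never makes a cross-origin request. A framework-agnostic sketch of the streaming part (the chunk size and header set are illustrative):

```python
def stream_audio(source, chunk_size=64 * 1024):
    """Yield an upstream audio body in chunks so large files
    are never buffered fully in memory."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        yield chunk

# Served from the app's own origin, so no CORS preflight is needed;
# Accept-Ranges lets <audio> elements seek within the file.
AUDIO_HEADERS = {
    "Content-Type": "audio/mpeg",
    "Accept-Ranges": "bytes",
}
```

In FastAPI, a generator like this plugs straight into a `StreamingResponse`.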
2) Backend API (job creation + orchestration)
- FastAPI endpoint accepts uploaded audio and creates a job record.
- The API schedules a Celery pipeline:
  - `generate_blueprint`: transcribe + generate a structured blueprint
  - `mix_and_master`: synthesize a full song via MCP (ElevenLabs Music)
  - `separate_stems` (optional): split the final mix into stems for editing
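Stripped of Celery's chain/signature machinery, the pipeline is a sequence of stages that each read and update the job record; a plain-Python sketch (the job dict shape is illustrative):

```python
def run_pipeline(job, stages, want_stems=False):
    """Run pipeline stages in order, recording per-stage status
    on the job so the frontend's polling endpoint can report progress."""
    for name, stage in stages:
        if name == "separate_stems" and not want_stems:
            continue  # stems are an optional, user-requested step
        job["stage"] = name
        try:
            stage(job)
        except Exception as exc:
            job["status"] = "FAILED"
            job["error"] = f"{name}: {exc}"
            return job
    job["status"] = "DONE"
    return job
```

Celery adds the parts that matter in production on top of this: the work happens off the request path, survives API restarts, and runs on a dedicated queue.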
3) MCP layer (tools + validation + synthesis)
- A music-tools MCP server provides:
  - `validate_blueprint`
  - `create_song` (ElevenLabs Music)
  - `synthesize_preview` (fallback / debugging)
- The backend calls MCP over HTTP and safely handles streamable/SSE responses.
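Handling streamable MCP responses mostly means consuming Server-Sent Events line by line: collect `data:` payloads and stop at the `[DONE]` sentinel instead of waiting for the connection to close (which is what causes read timeouts). A sketch of the parsing step:

```python
import json

def read_sse_events(lines):
    """Parse an SSE stream (iterable of text lines) into JSON events,
    stopping at the [DONE] sentinel."""
    events = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines, comments, and keep-alives
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        events.append(json.loads(payload))
    return events
```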
4) AI step (audio → transcript → blueprint)
- Audio is transcribed with OpenAI, then converted into a JSON blueprint matching a schema.
- The blueprint captures intent (style/tempo/key/sections/lyrics/voice settings) so downstream tools generate consistently.
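The blueprint is just structured JSON the rest of the pipeline can trust. A cut-down sketch of its shape using stdlib dataclasses (the field names and bounds here are illustrative; the real schema is richer):

```python
from dataclasses import dataclass, field

@dataclass
class Section:
    name: str        # e.g. "verse", "chorus"
    lyrics: str

@dataclass
class Blueprint:
    style: str       # e.g. "lo-fi pop"
    tempo_bpm: int
    key: str         # e.g. "A minor"
    sections: list[Section] = field(default_factory=list)

    def validate(self):
        # Downstream tools assume a sane tempo and at least one section.
        assert 40 <= self.tempo_bpm <= 220, "tempo out of range"
        assert self.sections, "blueprint needs at least one section"
```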
5) Storage + outputs
- Uploaded audio and generated outputs are stored in DigitalOcean Spaces via presigned URLs.
- Stems (when available) are uploaded as separate files for the editor.
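Each stem is stored as its own object so the editor can load tracks independently; a sketch of the naming scheme (the key layout is illustrative, not our exact paths):

```python
STEM_NAMES = ("vocals", "drums", "bass", "other")

def stem_keys(job_id):
    """Map stem names to per-job object keys in Spaces."""
    return {name: f"jobs/{job_id}/stems/{name}.wav" for name in STEM_NAMES}
```

Each key is then handed to the presigned-URL helper, so the browser downloads stems directly from Spaces rather than through the API.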
A Little Bit of Math
Tempo relates to beat duration as:
[ t_{\text{beat}}=\frac{60}{\text{BPM}}\ \text{seconds} ]
This helps reason about timing, section lengths, and why certain model constraints (like max segment length for some separators) matter.
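For example, at 120 BPM a beat lasts 0.5 s, so an 8-bar section in 4/4 runs 16 s. In code:

```python
def beat_seconds(bpm):
    """Duration of one beat in seconds."""
    return 60.0 / bpm

def section_seconds(bpm, bars, beats_per_bar=4):
    """Length of a section: bars x beats per bar x beat duration."""
    return bars * beats_per_bar * beat_seconds(bpm)
```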
Challenges We Faced (and Fixes)
- Jobs stuck in `PENDING`: Celery workers weren’t consuming the right queue. Fixed with queue isolation (`CELERY_QUEUE`) and running the worker with `-Q`.
- MCP read timeouts: some MCP calls return SSE/streamed responses. Fixed by streaming and parsing until `[DONE]`.
- Placeholder audio / wrong tool: ensured `output_kind=song` persists through the pipeline so the backend calls `create_song` instead of preview/TTS.
- Stems failures:
  - Missing `ffmpeg` prevented WAV conversion.
  - Demucs weight download failed on SSL cert verification; fixed by passing `SSL_CERT_FILE` from `certifi` into the subprocess env.
  - Demucs transformer segment constraints required reducing `--segment` to a supported value.
- Env var overrides: stale exported keys caused confusion. Updated MCP server dotenv loading to prefer `.env` in local dev.
- External API quota limits: when credits ran out, the pipeline surfaced clear errors back to the UI instead of silently failing.
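The SSL fix amounts to pointing the child process at a CA bundle before launching it. A sketch (the helper name is ours; `certifi.where()` is the real certifi API, with an explicit path accepted as a fallback):

```python
import os

def subprocess_env(ca_bundle=None):
    """Copy the current env and set SSL_CERT_FILE so a child process
    (e.g. demucs downloading model weights) can verify TLS certs."""
    env = dict(os.environ)
    if ca_bundle is None:
        import certifi  # assumed available; ships its own CA bundle
        ca_bundle = certifi.where()
    env["SSL_CERT_FILE"] = ca_bundle
    return env
```

The resulting dict is then passed as `subprocess.run(..., env=env)` when spawning Demucs.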
What’s Next
- Better UX for stems: show progress + remediation (missing `ffmpeg`, model download issues).
- More observability: job timelines, per-stage durations, and structured errors visible in the UI.
- Richer blueprint: more section structure and stronger musical constraints for more controllable songs.
Built With
- Languages: Python, TypeScript
- Frontend: Next.js (React), MediaRecorder API, wavesurfer.js
- Backend: FastAPI, uvicorn, Celery, SQLAlchemy
- Data: DigitalOcean Managed PostgreSQL, DigitalOcean Managed Valkey/Redis
- AI/ML: OpenAI (audio transcription + JSON blueprint generation)
- APIs: ElevenLabs Music (song generation), ElevenLabs (TTS)
- Audio tooling: ffmpeg, Demucs (stem separation), PyTorch/torchaudio
- Storage/Cloud: DigitalOcean Spaces (S3-compatible object storage)
- MCP: Model Context Protocol SDK (Node/Express transport)

