Overture: Turn your hum into a score, and your moment into a masterpiece.

Inspiration

We wanted a “one-button” creative tool that turns an idea you can’t quite articulate yet (a hum, a rhythm, a vibe) into something playable and shareable: a real song, not just a text-to-speech demo. But we also didn’t want the process to be a black box, so we built in a path for hands-on control. The goal: capture a human spark, let automation generate a full track, then let the user break the song into stems (vocals, drums, bass, other) and do manual production in a multitrack editor, shaping the final mix like a real producer.

What We Learned

  • A reliable AI pipeline is mostly systems engineering: queues, timeouts, streaming, retries, and observability matter as much as model prompts.
  • “It works on my machine” fails at boundaries: environment variables, long-running HTTP responses, large audio files, and external API quotas.
  • Audio workflows are dependency-heavy; features like stem separation depend on tools like ffmpeg and model downloads, which introduce OS/network friction.

How We Built It

1) Frontend (recording + status + playback)

  • Next.js web app with a Record / Stop flow using MediaRecorder.
  • The browser uploads audio to the backend and polls job status.
  • Playback uses a backend audio proxy endpoint to avoid CORS issues.

2) Backend API (job creation + orchestration)

  • FastAPI endpoint accepts uploaded audio and creates a job record.
  • The API schedules a Celery pipeline:
    1. generate_blueprint: transcribe + generate a structured blueprint
    2. mix_and_master: synthesize a full song via MCP (ElevenLabs Music)
    3. separate_stems (optional): split the final mix into stems for editing

3) MCP layer (tools + validation + synthesis)

  • A music-tools MCP server provides:
    • validate_blueprint
    • create_song (ElevenLabs Music)
    • synthesize_preview (fallback / debugging)
  • The backend calls MCP over HTTP and safely handles streamable/SSE responses.
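A minimal sketch of the SSE handling, assuming the server emits `data: <json>` lines and terminates the stream with `data: [DONE]`:

```python
import json

def parse_sse_events(lines):
    """Collect JSON payloads from SSE 'data:' lines until the [DONE] sentinel."""
    events = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip comments, event-name lines, and keep-alives
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        events.append(json.loads(payload))
    return events
```

In the real backend the lines would come from iterating over a streaming HTTP response rather than a list.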

4) AI step (audio → transcript → blueprint)

  • Audio is transcribed with OpenAI, then converted into a JSON blueprint matching a schema.
  • The blueprint captures intent (style/tempo/key/sections/lyrics/voice settings) so downstream tools generate consistently.
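For illustration, the blueprint could be modeled with pydantic (which pairs naturally with FastAPI); the field names below are assumptions, not the project's exact schema:

```python
from pydantic import BaseModel

class Section(BaseModel):
    name: str    # e.g. "verse", "chorus"
    lyrics: str

class Blueprint(BaseModel):
    style: str
    tempo_bpm: int
    key: str
    sections: list[Section]

# A schema like this lets the pipeline reject a malformed model response
# before any synthesis credits are spent.
bp = Blueprint(
    style="lo-fi pop",
    tempo_bpm=92,
    key="A minor",
    sections=[Section(name="chorus", lyrics="...")],
)
```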

5) Storage + outputs

  • Uploaded audio and generated outputs are stored in DigitalOcean Spaces via presigned URLs.
  • Stems (when available) are uploaded as separate files for the editor.

A Little Bit of Math

Tempo relates to beat duration as:

\[ t_{\text{beat}} = \frac{60}{\text{BPM}}\ \text{seconds} \]

This helps reason about timing, section lengths, and why certain model constraints (like max segment length for some separators) matter.
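In code, the same relation gives quick answers about section lengths (assuming 4 beats per bar):

```python
def beat_seconds(bpm: float) -> float:
    # t_beat = 60 / BPM
    return 60.0 / bpm

def bars_seconds(bpm: float, bars: int, beats_per_bar: int = 4) -> float:
    # Duration of a section that spans `bars` bars.
    return bars * beats_per_bar * beat_seconds(bpm)

# At 120 BPM, one beat lasts 0.5 s and an 8-bar section lasts 16 s.
print(beat_seconds(120))     # 0.5
print(bars_seconds(120, 8))  # 16.0
```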

Challenges We Faced (and Fixes)

  • Jobs stuck in PENDING: Celery workers weren’t consuming the right queue. Fixed with queue isolation (CELERY_QUEUE) and running the worker with -Q.
  • MCP read timeouts: some MCP calls return SSE/streamed responses. Fixed by streaming and parsing until [DONE].
  • Placeholder audio / wrong tool: ensured output_kind=song persists through the pipeline so the backend calls create_song instead of preview/TTS.
  • Stems failures:
    • Missing ffmpeg prevented WAV conversion.
    • Demucs weight download failed on SSL cert verification; fixed by passing SSL_CERT_FILE from certifi into the subprocess env.
    • Demucs transformer segment constraints required reducing --segment to a supported value.
  • Env var overrides: stale exported keys caused confusion. Updated MCP server dotenv loading to prefer .env in local dev.
  • External API quota limits: when credits ran out, the pipeline surfaced clear errors back to the UI instead of silently failing.
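The stems fixes combine into a small launcher sketch: build the Demucs command with a reduced `--segment`, and hand the child process a CA bundle via `SSL_CERT_FILE`. The segment value of 7 is illustrative; the right cap depends on the model.

```python
import os
import subprocess

import certifi

def demucs_cmd(mix_path: str, out_dir: str, segment: int = 7) -> list[str]:
    # Transformer-based Demucs models reject large --segment values,
    # so cap it explicitly rather than relying on the default.
    return ["demucs", "--segment", str(segment), "-o", out_dir, mix_path]

def demucs_env() -> dict[str, str]:
    # The model-weight download inside the subprocess failed SSL
    # verification; pointing SSL_CERT_FILE at certifi's bundle fixed it.
    return dict(os.environ, SSL_CERT_FILE=certifi.where())

def separate(mix_path: str, out_dir: str) -> None:
    subprocess.run(demucs_cmd(mix_path, out_dir), env=demucs_env(), check=True)
```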

What’s Next

  • Better UX for stems: show progress + remediation (missing ffmpeg, model download issues).
  • More observability: job timelines, per-stage durations, and structured errors visible in the UI.
  • Richer blueprint: more section structure and stronger musical constraints for more controllable songs.

Built With

  • Languages: TypeScript, Python
  • Frontend: Next.js (React), MediaRecorder API, wavesurfer.js
  • Backend: FastAPI, uvicorn, Celery, SQLAlchemy
  • AI/ML: OpenAI (audio transcription + JSON blueprint generation), ElevenLabs Music (song generation), TTS, Demucs (stem separation), PyTorch/torchaudio
  • Audio tooling: ffmpeg
  • MCP: Model Context Protocol SDK (Node/Express transport)
  • Storage/cloud: DigitalOcean Spaces (S3-compatible object storage), DigitalOcean Managed PostgreSQL, DigitalOcean Managed Valkey/Redis