Overture: Turn your hum into a score, and your moment into a masterpiece.

Inspiration

We wanted a “one-button” creative tool that turns an idea you can’t quite articulate yet (a hum, a rhythm, a vibe) into something playable and shareable: a real song, not just a text-to-speech demo. But we also didn’t want the process to be a black box, so we built in a path for hands-on control. The goal: capture a human spark, let automation generate a full track, then let the user break the song into stems (vocals, drums, bass, other) and do manual production in a multitrack editor, shaping the final mix like a real producer.

What We Learned

  • A reliable AI pipeline is mostly systems engineering: queues, timeouts, streaming, retries, and observability matter as much as model prompts.
  • “It works on my machine” fails at boundaries: environment variables, long-running HTTP responses, large audio files, and external API quotas.
  • Audio workflows are dependency-heavy; features like stem separation depend on tools like ffmpeg and model downloads, which introduce OS/network friction.

How We Built It

1) Frontend (recording + status + playback)

  • Next.js web app with a Record / Stop flow using MediaRecorder.
  • The browser uploads audio to the backend and polls job status.
  • Playback uses a backend audio proxy endpoint to avoid CORS issues.

2) Backend API (job creation + orchestration)

  • FastAPI endpoint accepts uploaded audio and creates a job record.
  • The API schedules a Celery pipeline:
    1. generate_blueprint: transcribe + generate a structured blueprint
    2. mix_and_master: synthesize a full song via MCP (ElevenLabs Music)
    3. separate_stems (optional): split the final mix into stems for editing

3) MCP layer (tools + validation + synthesis)

  • A music-tools MCP server provides:
    • validate_blueprint
    • create_song (ElevenLabs Music)
    • synthesize_preview (fallback / debugging)
  • The backend calls MCP over HTTP and safely handles streamable/SSE responses.
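A minimal sketch of the SSE handling, assuming the server emits `data: <json>` lines and terminates the stream with `data: [DONE]`:

```python
import json

def parse_sse_events(lines):
    """Collect JSON payloads from SSE 'data:' lines until the [DONE] sentinel."""
    events = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip comments, event-name lines, and keep-alives
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        events.append(json.loads(payload))
    return events
```

In the real backend the lines would come from iterating over a streaming HTTP response rather than a list.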

4) AI step (audio → transcript → blueprint)

  • Audio is transcribed with OpenAI, then converted into a JSON blueprint matching a schema.
  • The blueprint captures intent (style/tempo/key/sections/lyrics/voice settings) so downstream tools generate consistently.
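For illustration, the blueprint could be modeled with pydantic (which pairs naturally with FastAPI); the field names below are assumptions, not the project's exact schema:

```python
from pydantic import BaseModel

class Section(BaseModel):
    name: str    # e.g. "verse", "chorus"
    lyrics: str

class Blueprint(BaseModel):
    style: str
    tempo_bpm: int
    key: str
    sections: list[Section]

# A schema like this lets the pipeline reject a malformed model response
# before any synthesis credits are spent.
bp = Blueprint(
    style="lo-fi pop",
    tempo_bpm=92,
    key="A minor",
    sections=[Section(name="chorus", lyrics="...")],
)
```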

5) Storage + outputs

  • Uploaded audio and generated outputs are stored in DigitalOcean Spaces via presigned URLs.
  • Stems (when available) are uploaded as separate files for the editor.

A Little Bit of Math

Tempo relates to beat duration as:

\[ t_{\text{beat}} = \frac{60}{\text{BPM}}\ \text{seconds} \]

This helps reason about timing, section lengths, and why certain model constraints (like max segment length for some separators) matter.
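In code, the same relation gives quick answers about section lengths (assuming 4 beats per bar):

```python
def beat_seconds(bpm: float) -> float:
    # t_beat = 60 / BPM
    return 60.0 / bpm

def bars_seconds(bpm: float, bars: int, beats_per_bar: int = 4) -> float:
    # Duration of a section that spans `bars` bars.
    return bars * beats_per_bar * beat_seconds(bpm)

# At 120 BPM, one beat lasts 0.5 s and an 8-bar section lasts 16 s.
print(beat_seconds(120))     # 0.5
print(bars_seconds(120, 8))  # 16.0
```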

Challenges We Faced (and Fixes)

  • Jobs stuck in PENDING: Celery workers weren’t consuming the right queue. Fixed with queue isolation (CELERY_QUEUE) and running the worker with -Q.
  • MCP read timeouts: some MCP calls return SSE/streamed responses. Fixed by streaming and parsing until [DONE].
  • Placeholder audio / wrong tool: ensured output_kind=song persists through the pipeline so the backend calls create_song instead of preview/TTS.
  • Stems failures:
    • Missing ffmpeg prevented WAV conversion.
    • Demucs weight download failed on SSL cert verification; fixed by passing SSL_CERT_FILE from certifi into the subprocess env.
    • Demucs transformer segment constraints required reducing --segment to a supported value.
  • Env var overrides: stale exported keys caused confusion. Updated MCP server dotenv loading to prefer .env in local dev.
  • External API quota limits: when credits ran out, the pipeline surfaced clear errors back to the UI instead of silently failing.
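The stems fixes combine into a small launcher sketch: build the Demucs command with a reduced `--segment`, and hand the child process a CA bundle via `SSL_CERT_FILE`. The segment value of 7 is illustrative; the right cap depends on the model.

```python
import os
import subprocess

import certifi

def demucs_cmd(mix_path: str, out_dir: str, segment: int = 7) -> list[str]:
    # Transformer-based Demucs models reject large --segment values,
    # so cap it explicitly rather than relying on the default.
    return ["demucs", "--segment", str(segment), "-o", out_dir, mix_path]

def demucs_env() -> dict[str, str]:
    # The model-weight download inside the subprocess failed SSL
    # verification; pointing SSL_CERT_FILE at certifi's bundle fixed it.
    return dict(os.environ, SSL_CERT_FILE=certifi.where())

def separate(mix_path: str, out_dir: str) -> None:
    subprocess.run(demucs_cmd(mix_path, out_dir), env=demucs_env(), check=True)
```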

What’s Next

  • Better UX for stems: show progress + remediation (missing ffmpeg, model download issues).
  • More observability: job timelines, per-stage durations, and structured errors visible in the UI.
  • Richer blueprint: more section structure and stronger musical constraints for more controllable songs.

Built With

  • Languages: TypeScript, Python
  • Frontend: Next.js (React), MediaRecorder API, wavesurfer.js
  • Backend: FastAPI, uvicorn, Celery, SQLAlchemy
  • AI/ML: OpenAI (audio transcription + JSON blueprint generation), ElevenLabs Music (song generation), TTS, Demucs (stem separation), PyTorch/torchaudio
  • Audio tooling: ffmpeg
  • MCP: Model Context Protocol SDK (Node/Express transport)
  • Storage/cloud: DigitalOcean Spaces (S3-compatible object storage), DigitalOcean Managed PostgreSQL, DigitalOcean Managed Valkey/Redis