Voice AI infrastructure that scales
Layercode handles real-time audio transport, speech-to-text, text-to-speech, and global edge deployment. You handle your agent's logic with any LLM, any framework.
You receive transcribed text from your users. You send text back. Layercode handles everything in between: audio capture, speech-to-text, text-to-speech, WebSocket connections, and global delivery.
Audio → Layercode
Text → Your Backend → Text
Layercode → Audio
A complete voice agent implementation. Server-side streaming with the Vercel AI SDK, client-side audio handling with our React hook.
import { createOpenAI } from "@ai-sdk/openai";
import { streamText } from "ai";
import { streamResponse } from "@layercode/node-server-sdk";

const openai = createOpenAI({ apiKey: process.env.OPENAI_API_KEY! });

export const POST = async (request: Request) => {
  const body = await request.json();
  // streamResponse wraps the streaming response Layercode expects from your webhook.
  return streamResponse(body, async ({ stream }) => {
    if (body.type === "message") {
      const { textStream } = streamText({
        model: openai("gpt-4o-mini"),
        system: "You are a helpful voice assistant.",
        messages: [{ role: "user", content: body.text }],
        onFinish: () => stream.end(), // close the stream once the LLM finishes
      });
      // Pipe the LLM's token stream straight into text-to-speech.
      await stream.ttsTextStream(textStream);
    }
  });
};

Every voice conversation flows through a five-stage pipeline optimized for sub-second round-trip latency.
Audio captured from browser, mobile app, or phone. Streamed to nearest edge location via WebSocket.
WebSocket connection with automatic reconnection and jitter buffering
Real-time transcription using Deepgram. Voice Activity Detection (VAD) handles turn-taking automatically.
Sub-50ms processing latency at 330+ global edge locations
Transcribed text sent to your webhook via HMAC-signed POST request. You control the LLM and logic.
Works with any backend: Next.js, Express, FastAPI, Rails, etc.
Use our SDK to stream text back. We handle text-to-speech conversion in real-time.
Compatible with Vercel AI SDK, LangChain, and any streaming LLM
Audio synthesized and streamed back to the user. Word-level timestamps enable precise interruption handling.
Choose from ElevenLabs, Cartesia, or Rime for TTS
First-class TypeScript and Python support. Client SDKs for React and vanilla JavaScript. Server SDKs for Node.js and Python.
useLayercodeAgent hook for managing voice sessions, speaking states, and audio visualization.
npm install @layercode/react-sdk
Framework-agnostic client SDK for any JavaScript environment.
npm install @layercode/client-sdk
Server-side SDK for handling webhooks and streaming responses.
npm install @layercode/node-server-sdk
Python SDK for FastAPI, Flask, Django, and other Python backends.
pip install layercode
Choose the voice provider that fits your use case. Switch providers with a single config change; no code modifications required.
mistv2
Best for: Quick start & managed billing
sonic-2
Best for: High-fidelity & detailed timestamps
eleven_v2_5_flash
Best for: Cloned voices & multilingual
inworld-tts
Best for: Gaming & interactive characters
flux & nova-3
Best-in-class speech recognition: Flux is our recommended model for real-time voice pipelines, purpose-built for streaming with unmatched speed and accuracy. We also support Nova-3 for use cases requiring proven reliability.
Best for: Real-time voice pipelines
Other voice AI platforms run on centralized cloud infrastructure. When your user is in Tokyo and your servers are in Virginia, latency kills the conversation.
Layercode runs on Cloudflare Workers. Audio processing happens at the edge location nearest to your user, not in a distant data center.
Switch between Rime, Cartesia, ElevenLabs, and Inworld with a single config change. No code changes required.
Replay conversations, inspect latency breakdowns, view transcripts, and debug production issues with full visibility.
Every call recorded automatically. Download audio, export transcripts, build training datasets. All stored securely.
Pay only for active conversation time. Silence is always free. No minimum commitments or hidden fees.
Web browsers, iOS, Android, and phone (via Twilio). Same backend, same pipeline, multiple channels.
Inbound and outbound calling via Twilio. Full call recording and transcript analysis included.
Build agents that execute functions. Works with Vercel AI SDK, LangChain, LlamaIndex, and CrewAI.
Transfer between agents mid-conversation. Build complex workflows with specialized agents.
SOC 2 Type II compliant*, GDPR compliant, TLS 1.3 encryption, AES-256 at rest.
Latency is the enemy of natural voice interactions. Layercode is engineered to minimize time-to-first-token at every stage.
Gemini 2.5 Flash-Lite and gpt-4o-mini are optimized for speed. Avoid reasoning-extended models; they add substantial latency for marginal quality gains in spoken conversations.
Emit response.tts events like "Let me look that up" before heavy processing begins. Users hear immediate audio while your backend works.
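This speak-first pattern is simple to implement. A sketch with a mock stream object standing in for the SDK's stream (the real Layercode stream API may differ; `tts` here is an assumed method name):

```typescript
// Speak-first pattern: emit a short filler phrase immediately, then do the
// slow work. `Speaker` is a stand-in for the SDK's stream object.
type Speaker = { tts: (text: string) => void };

export async function answerWithFiller(
  stream: Speaker,
  slowLookup: () => Promise<string>
): Promise<void> {
  // The user hears this right away, masking the lookup latency.
  stream.tts("Let me look that up.");
  // Heavy work (RAG query, tool call, database hit) happens after.
  const answer = await slowLookup();
  stream.tts(answer);
}
```

The filler phrase buys you several hundred milliseconds of perceived responsiveness while the expensive call runs.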
Running retrieval on every turn adds network hops and stalls conversations. Fetch external data only when needed.
Store conversation history in fast, nearby databases like Redis. Collocate services with Layercode deployments to minimize cross-region latency.
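A minimal shape for that history store, with an in-memory Map standing in for Redis (in production you would back this with a low-latency store colocated with your Layercode region):

```typescript
// Conversation-history store sketch. A Map stands in for Redis here;
// swap in a Redis client with the same get/append semantics for production.
type Turn = { role: "user" | "assistant"; content: string };

const histories = new Map<string, Turn[]>();

export function appendTurn(conversationId: string, turn: Turn): Turn[] {
  // Fetch existing history (or start fresh), append, and persist.
  const history = histories.get(conversationId) ?? [];
  history.push(turn);
  histories.set(conversationId, history);
  return history;
}
```

Pass the returned array as the `messages` parameter on each LLM call so the agent keeps context across turns without re-fetching from a distant region.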
Layercode is built for production workloads with enterprise security requirements. Your data is encrypted in transit and at rest. Session recordings are stored securely in SOC 2 compliant infrastructure.
Per-second billing for active conversation time. Silence is free. STT, TTS, and infrastructure costs consolidated into one simple rate. Start with $100 in free credits, no credit card required.
View pricing details
Get started with $100 in free credits. No credit card required.