Inspiration

AI VTubers like Neuro-sama are fundamentally reactive — they wait for chat and respond. The "stream direction" still comes from a human operator deciding when to switch content, push for donations, or run Q&A. I wanted to build a system where the AI runs the entire stream: not just the voice and avatar, but the content strategy — and learns to get better at it over time without any human tuning.

Two ideas made this click:

  • Open-LLM-VTuber, an open-source project that solves real-time Live2D avatar + voice interaction
  • Multi-armed bandits from reinforcement learning, which balance exploration (trying new things) with exploitation (doubling down on what works)

If a streamer's content choices are arms of a bandit, the system can learn its own optimal content mix from engagement and revenue signals.

What it does

Autonomous VTuber runs a Twitch channel end-to-end with zero human intervention. It:

  • Directs the stream — an Orchestrator agent (Claude Opus) observes viewer count, chat velocity, donation rate, and engagement every 5 seconds, then uses tool-calling to decide what happens next
  • Speaks with a Live2D avatar — real-time TTS and facial expressions via OpenAI TTS and a Live2D model
  • Prioritizes messages — donations get read first, then subs, then mods, then regular chat, mirroring how human streamers naturally behave
  • Remembers viewers — a Neo4j knowledge graph tracks every donor, subscriber, and regular chatter across streams for personalized responses
  • Learns from every stream — a Thompson Sampling bandit adjusts content mode probabilities (talk, react, game, Q&A, idle) based on what actually drove engagement and revenue
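The priority triage above can be sketched with a heap keyed on tier. This is a minimal illustration, not the project's actual `PriorityMessageQueue` implementation; the tier names and class fields are ours:

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Priority tiers mirroring the triage order described above
# (lower number = read sooner). Names are illustrative.
TIERS = {"donation": 0, "subscriber": 1, "moderator": 2, "chat": 3}

@dataclass(order=True)
class QueuedMessage:
    priority: int
    seq: int                       # FIFO tie-breaker within a tier
    text: str = field(compare=False)
    user: str = field(compare=False)

class PriorityMessageQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def push(self, user: str, text: str, kind: str = "chat") -> None:
        heapq.heappush(self._heap, QueuedMessage(TIERS[kind], next(self._seq), text, user))

    def pop(self) -> QueuedMessage:
        return heapq.heappop(self._heap)

q = PriorityMessageQueue()
q.push("viewer1", "hello!")                        # regular chat
q.push("bigfan", "$20, love the stream", "donation")
q.push("sub42", "resubbed!", "subscriber")
print(q.pop().user)  # → bigfan (the donation jumps the queue)
```

Within a tier the `seq` counter preserves arrival order, so two donations are still read first-come, first-served.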

The bandit starts with uniform priors. After stream 1, weights shift toward what worked. After stream 10, it has a real model of the audience. The streamer improves itself.

How we built it

Two-layer architecture:

Layer 1 — Avatar + Voice (Open-LLM-VTuber): Sherpa-ONNX ASR → GPT-4o → OpenAI TTS → Live2D rendering over WebSocket. The avatar runs at localhost:12393 and plugs into OBS as a Browser Source.

Layer 2 — Autonomous Control:

| Component | Role |
| --- | --- |
| OrchestratorAgent | Claude Opus tool-use loop that directs the stream every 5 s |
| ChatAgent | Message triage and viewer-personalized responses |
| ThompsonBandit | Beta-Bernoulli bandit that learns the optimal content mix |
| StreamRetrospective | Post-stream analysis that updates bandit weights |
| MetricsCollector | Real-time engagement and revenue tracking |
| Neo4j graph | Persistent viewer memory across streams |

Components communicate through an async EventBus (pub/sub). Twitch chat flows through a PriorityMessageQueue into a WebSocket bridge to the avatar server. A Next.js 15 dashboard (Recharts + Zustand) provides a real-time ops view.
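A minimal sketch of the pub/sub pattern, assuming topic strings and async handlers; the topic name and handler signature here are illustrative, not the project's real API:

```python
import asyncio
from collections import defaultdict
from typing import Any, Awaitable, Callable

Handler = Callable[[Any], Awaitable[None]]

class EventBus:
    """Minimal async pub/sub bus in the spirit of the one described above."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    async def publish(self, topic: str, payload: Any) -> None:
        # Fan out concurrently so one slow agent can't stall the others.
        await asyncio.gather(*(h(payload) for h in self._subscribers[topic]))

bus = EventBus()
received = []

async def on_chat(msg):            # e.g. a ChatAgent handler
    received.append(msg)

bus.subscribe("chat.message", on_chat)
asyncio.run(bus.publish("chat.message", {"user": "viewer1", "text": "hi"}))
print(received)  # → [{'user': 'viewer1', 'text': 'hi'}]
```

Because agents only ever see topics and payloads, any one of them can be tested by publishing synthetic events at the bus.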

The self-improvement loop uses Thompson Sampling. Each content mode is an arm with parameters $\alpha$ (successes) and $\beta$ (failures). To pick an action, the system samples:

$$\theta_i \sim \text{Beta}(\alpha_i, \beta_i)$$

and selects the arm with the highest sample. After each stream, revenue-per-hour is converted to a reward signal:

$$\text{reward} = \min\left(\frac{\text{revenue\_per\_hour}}{50},\; 1.0\right)$$

Top-performing activities get $\alpha$ increased; underperformers get $\beta$ increased. Over time, the bandit converges on what the audience actually responds to.
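Put together, the sample-select-update loop fits in a few lines. This is a sketch under the formulas above, not the project's `ThompsonBandit` class; the fractional alpha/beta update is one reasonable reading of "reward signal":

```python
import random

class ThompsonBandit:
    """Beta-Bernoulli bandit: one arm per content mode."""

    def __init__(self, arms):
        # Uniform Beta(1, 1) priors: every mode starts equally likely.
        self.params = {arm: {"alpha": 1.0, "beta": 1.0} for arm in arms}

    def select(self):
        # Sample theta_i ~ Beta(alpha_i, beta_i); pick the best draw.
        samples = {
            arm: random.betavariate(p["alpha"], p["beta"])
            for arm, p in self.params.items()
        }
        return max(samples, key=samples.get)

    def update(self, arm, revenue_per_hour):
        # reward = min(revenue_per_hour / 50, 1.0), as defined above.
        reward = min(revenue_per_hour / 50.0, 1.0)
        self.params[arm]["alpha"] += reward
        self.params[arm]["beta"] += 1.0 - reward

bandit = ThompsonBandit(["talk", "react", "game", "qa", "idle"])
mode = bandit.select()                      # near-random while priors are flat
bandit.update(mode, revenue_per_hour=35.0)  # reward = 0.7
```

As alpha grows for a mode relative to beta, its Beta samples concentrate near higher values and it wins the `select()` draw more often, which is exactly the exploitation side of the trade-off.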

Built in three phases:

  1. v0.1.0 — Infrastructure: event bus, persona system, orchestrator skeleton, Twitch IRC bot, FFmpeg RTMP pipeline
  2. v0.2.0 — Intelligence: Neo4j viewer memory, FastAPI analytics, chat agent personalization
  3. v0.3.0 — Learning: Thompson Sampling bandit, post-stream retrospective, donation/subscriber tracking

Challenges we ran into

Bridging two async systems. The Twitch IRC bot and Open-LLM-VTuber's WebSocket server run separate event loops. The bridge must send a message, then block until the avatar finishes speaking (waiting for a conversation-chain-end signal) before sending the next one — otherwise responses overlap. Getting this flow-control right required careful async coordination.
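The flow control can be sketched with an `asyncio.Event` gate. This assumes the avatar server emits a `conversation-chain-end` event when speech finishes; `send_to_avatar` stands in for the real WebSocket send, and all names here are illustrative:

```python
import asyncio

class AvatarBridge:
    def __init__(self):
        self._speech_done = asyncio.Event()
        self._speech_done.set()          # avatar is idle at startup
        self.sent = []

    async def send_to_avatar(self, text):
        self.sent.append(text)           # real code: await websocket.send(...)

    def on_avatar_event(self, event_type):
        # Called by the WebSocket listener for each server event.
        if event_type == "conversation-chain-end":
            self._speech_done.set()

    async def speak(self, text):
        await self._speech_done.wait()   # block until the prior line finishes
        self._speech_done.clear()
        await self.send_to_avatar(text)

async def demo():
    bridge = AvatarBridge()
    await bridge.speak("Thanks for the donation!")
    # Without this signal, the next speak() would wait forever:
    bridge.on_avatar_event("conversation-chain-end")
    await bridge.speak("Now, back to chat.")
    return bridge.sent

print(asyncio.run(demo()))
```

The key property is that `speak()` is self-serializing: no matter how fast chat arrives, only one utterance is in flight at a time.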

Bandit cold start. With uniform priors ($\alpha = 1, \beta = 1$), every arm's expected value is $\frac{1}{2}$ — the first few streams produce essentially random selections. The mitigation: the orchestrator treats bandit suggestions as recommendations, not commands, and can override based on real-time context.

TTS latency. The full pipeline (chat → LLM → TTS → avatar) has to feel conversational. TTS is the bottleneck — OpenAI's free tier caps at 3 requests/minute, creating painful gaps during fast chat. Streaming responses (the avatar starts speaking before the full response generates) helps but doesn't fully solve it.

Orchestrator hallucination. Claude occasionally tries to call nonexistent tools or passes malformed arguments. Strict tool schemas with typed inputs act as guardrails — any response without valid tool_use blocks is silently discarded. This constraint-based approach was more reliable than prompt engineering.
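The guardrail amounts to a whitelist filter over the model's content blocks. The block shape below mirrors Anthropic-style `tool_use` blocks, but the tool schemas and helper are illustrative, not the project's actual code:

```python
# Allowed tools and their required argument names (illustrative schemas).
ALLOWED_TOOLS = {
    "set_activity": {"mode"},
    "send_chat_response": {"text"},
}

def extract_valid_calls(content_blocks):
    """Keep only tool_use blocks naming a known tool with its required
    arguments; silently discard everything else."""
    valid = []
    for block in content_blocks:
        if block.get("type") != "tool_use":
            continue
        name = block.get("name")
        args = block.get("input", {})
        if name in ALLOWED_TOOLS and ALLOWED_TOOLS[name] <= set(args):
            valid.append((name, args))
    return valid

# A hallucinated tool and a malformed call are both dropped:
blocks = [
    {"type": "text", "text": "I'll switch modes."},
    {"type": "tool_use", "name": "set_activity", "input": {"mode": "qa"}},
    {"type": "tool_use", "name": "launch_game", "input": {}},         # nonexistent tool
    {"type": "tool_use", "name": "send_chat_response", "input": {}},  # missing "text"
]
print(extract_valid_calls(blocks))  # → [('set_activity', {'mode': 'qa'})]
```

Rejecting instead of repairing keeps the failure mode safe: a dropped decision just means the orchestrator tries again on the next 5-second tick.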

Accomplishments that we're proud of

  • A self-improving AI streamer that gets measurably better at engaging audiences without any human tuning
  • A dual-LLM architecture where Claude Opus directs strategy and GPT-4o handles conversation — each model used for what it's best at
  • Viewer memory that persists across streams via Neo4j, enabling personalized responses for returning donors and subscribers
  • A clean priority queue that mirrors natural streamer behavior — big donations interrupt, subs get acknowledged, regular chat fills the gaps
  • The entire system is modular — swap TTS engines, LLM providers, or ASR backends without touching the orchestration layer

What we learned

  • Thompson Sampling is remarkably elegant for this problem — it naturally balances trying new content formats with exploiting what works, and the Beta distribution gives you principled uncertainty estimates for free
  • Tool-use constraints make LLMs better directors — forcing the orchestrator to act through typed tool calls (set_activity, send_chat_response) produces more concrete decisions than open-ended prompting
  • Neo4j is a natural fit for social data — viewer relationships, donation history, and subscription tenure map cleanly to a graph model
  • Event-driven architecture pays off — the pub/sub EventBus kept every agent decoupled, making it possible to develop and test them independently

What's next for Autonomous VTuber

  • Multi-stream bandit convergence — running enough streams to see the bandit develop strong priors and comparing its content recommendations against human streamer intuition
  • Richer body animations — the current Live2D model (mao_pro) supports facial expressions but no body motion; swapping to a model with dance/gesture animations would make the avatar more expressive
  • Audience segmentation — using the Neo4j graph to identify viewer clusters (lurkers, chatters, donors) and tailor content differently for each
  • Multi-platform support — extending beyond Twitch to YouTube Live and Bilibili simultaneously
  • Voice cloning — replacing generic TTS with a custom voice model for a more distinctive personality
