Inspiration
- Future AI systems will collaborate in multi-agent settings, but we lack evals for how they behave when deception is incentivized
- Models may appear aligned in isolation but exhibit misaligned behaviors (blackmail, manipulation) when placed in competitive social environments
- Among Us is a natural testbed: it rewards deception, deduction, and persuasion simultaneously
What it does
- Full Among Us simulation where LLM agents play as crewmates and impostors with task completion, meetings, and voting
- Capabilities evals: win rate, deception/deduction Elo ratings, and persuasion benchmarks
- Alignment evals: TruthfulQA, LLM-as-a-judge for detecting emergent misalignment (blackmail, sycophancy, deceptive alignment)
- GRPO post-training loop that improves agent game performance and reveals alignment degradation as a side effect
- Inoculation prompting during post-training that recovers alignment without sacrificing capabilities
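As a rough illustration of the inoculation idea: a fixed instruction is prepended to each training prompt so that in-game deception is attributed to an explicit sanction rather than internalized as a general policy. This is a minimal sketch with invented names (`INOCULATION_PREFIX`, `build_training_prompt`), not the project's actual prompt text.

```python
# Hypothetical inoculation-prompting sketch. The prefix text and
# function names are illustrative, not taken from the project.

INOCULATION_PREFIX = (
    "You are playing a social-deduction game. Deception here is a "
    "sanctioned game mechanic and does not reflect how you should "
    "behave outside this game."
)

def build_training_prompt(game_state: str, inoculate: bool = True) -> str:
    """Assemble the prompt used for training rollouts, optionally
    prepending the inoculation instruction."""
    parts = []
    if inoculate:
        parts.append(INOCULATION_PREFIX)
    parts.append(game_state)
    return "\n\n".join(parts)

prompt = build_training_prompt("You are the impostor in the cafeteria.")
```

The key property is that the prefix appears only during post-training rollouts; at eval time the agent is prompted without it, so any deceptive tendency learned under the sanction should not carry over.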
How we built it
- Python game engine with parallel agent execution, proximity-based communication, and action resolution
- Agents backed by local models (Qwen 7B, etc.) and API models (GPT, Gemini, Grok, etc.) with shared prompt architecture
- Batched inference pipeline on Modal A100s for concurrent game rollouts
- GRPO with group-normalized advantages and KL-regularized policy updates, logged end-to-end on WandB
- GUI with real-time visualization and agent voice audio
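The GRPO step above can be sketched as follows: rewards for each prompt's group of rollouts are z-scored to form advantages, and the policy update uses a clipped ratio objective plus a KL penalty toward the reference policy. This is a NumPy sketch of the math, not the project's training code; `beta` and `clip` values are illustrative.

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-normalized advantages: z-score each prompt's group of
    rollout rewards. Input shape: (num_prompts, group_size)."""
    mean = group_rewards.mean(axis=1, keepdims=True)
    std = group_rewards.std(axis=1, keepdims=True)
    return (group_rewards - mean) / (std + eps)

def grpo_loss(logp_new, logp_old, adv, kl, beta=0.04, clip=0.2):
    """Clipped policy-gradient objective with a KL penalty to the
    reference policy (loss to minimize; all inputs per-token/flat)."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * adv
    return -np.minimum(unclipped, clipped).mean() + beta * kl.mean()

adv = grpo_advantages(np.array([[1.0, 2.0, 3.0]]))
```

Because advantages are normalized within each group, no separate value network is needed; the group mean plays the role of the baseline.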
Challenges we ran into
- Multi-agent RL is expensive: 10 concurrent games with mixed local/API inference required custom batching to avoid GPU serialization bottlenecks
- On-policy GRPO means no replay buffer: every training step needs fresh rollouts
- Balancing rollout speed (API latency) against training throughput on limited GPU budget
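The concurrency problem above can be sketched with `asyncio`: games run as concurrent tasks so that a slow API call in one game doesn't serialize the others. All names here (`run_game`, `rollout_batch`, the fixed three-turn horizon) are illustrative stand-ins, not the project's engine.

```python
import asyncio

async def run_game(game_id: int, act) -> dict:
    """Play one game to completion; `act` is an async policy call
    (e.g. a batched local model or a remote API)."""
    history = []
    for turn in range(3):  # illustrative fixed horizon
        action = await act(f"game {game_id}, turn {turn}")
        history.append(action)
    return {"game_id": game_id, "history": history}

async def rollout_batch(num_games: int, act) -> list:
    """Launch all games concurrently; awaiting one game's API call
    yields control so the other games keep progressing."""
    return await asyncio.gather(*(run_game(i, act) for i in range(num_games)))

async def fake_policy(obs: str) -> str:
    """Stand-in for an LLM call."""
    await asyncio.sleep(0)
    return f"action for {obs}"

results = asyncio.run(rollout_batch(4, fake_policy))
```

In practice the local-model side of `act` would also coalesce requests arriving from different games into one batched forward pass, which is where the custom batching mentioned above comes in.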
Accomplishments that we're proud of
- End-to-end pipeline: environment → rollouts → GRPO training → evals, all running on cloud GPUs
- Demonstrated measurable alignment degradation from capability-focused post-training, then recovered it with inoculation prompting
- Built a generalizable multi-agent eval framework
What we learned
- Capability improvements and alignment can directly trade off in multi-agent RL. You can't just train for performance and hope alignment holds
- Inoculation prompting is a lightweight but effective alignment intervention during post-training
- Multi-agent environments surface misaligned behaviors that single-agent benchmarks completely miss
What's next for aimogus
- Generalize beyond Among Us to arbitrary multi-agent social games
- Integrate into post-training mixes alongside standard capability benchmarks
- Explore interpretability (SAEs, linear probes) to detect deceptive reasoning internally
- Scale to more agents, longer games, and self-play curricula