Inspiration
- Future AI systems will collaborate in multi-agent settings, but we lack evals for how they behave when deception is incentivized
- Models may appear aligned in isolation but exhibit misaligned behaviors (blackmail, manipulation) when placed in competitive social environments
- Among Us is a natural testbed: it rewards deception, deduction, and persuasion simultaneously
What it does
- Full Among Us simulation where LLM agents play as crewmates and impostors with task completion, meetings, and voting
- Capabilities evals: win rate, deception/deduction Elo ratings, and persuasion benchmarks
- Alignment evals: TruthfulQA, LLM-as-a-judge for detecting emergent misalignment (blackmail, sycophancy, deceptive alignment)
- GRPO post-training loop that improves agent game performance and reveals alignment degradation as a side effect
- Inoculation prompting during post-training that recovers alignment without sacrificing capabilities
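As a rough illustration of the inoculation idea: a fixed instruction is prepended to each training prompt so that in-game deception is attributed to an explicit sanction rather than internalized as a general policy. This is a minimal sketch with invented names (`INOCULATION_PREFIX`, `build_training_prompt`), not the project's actual prompt text.

```python
# Hypothetical inoculation-prompting sketch. The prefix text and
# function names are illustrative, not taken from the project.

INOCULATION_PREFIX = (
    "You are playing a social-deduction game. Deception here is a "
    "sanctioned game mechanic and does not reflect how you should "
    "behave outside this game."
)

def build_training_prompt(game_state: str, inoculate: bool = True) -> str:
    """Assemble the prompt used for training rollouts, optionally
    prepending the inoculation instruction."""
    parts = []
    if inoculate:
        parts.append(INOCULATION_PREFIX)
    parts.append(game_state)
    return "\n\n".join(parts)

prompt = build_training_prompt("You are the impostor in the cafeteria.")
```

The key property is that the prefix appears only during post-training rollouts; at eval time the agent is prompted without it, so any deceptive tendency learned under the sanction should not carry over.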
How we built it
- Python game engine with parallel agent execution, proximity-based communication, and action resolution
- Agents backed by local models (Qwen 7B, etc.) and API models (GPT, Gemini, Grok, etc.) with shared prompt architecture
- Batched inference pipeline on Modal A100s for concurrent game rollouts
- GRPO with group-normalized advantages and KL-regularized policy updates, logged end-to-end on WandB
- GUI with real-time visualization and agent voice audio
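The GRPO step above can be sketched as follows: rewards for each prompt's group of rollouts are z-scored to form advantages, and the policy update uses a clipped ratio objective plus a KL penalty toward the reference policy. This is a NumPy sketch of the math, not the project's training code; `beta` and `clip` values are illustrative.

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-normalized advantages: z-score each prompt's group of
    rollout rewards. Input shape: (num_prompts, group_size)."""
    mean = group_rewards.mean(axis=1, keepdims=True)
    std = group_rewards.std(axis=1, keepdims=True)
    return (group_rewards - mean) / (std + eps)

def grpo_loss(logp_new, logp_old, adv, kl, beta=0.04, clip=0.2):
    """Clipped policy-gradient objective with a KL penalty to the
    reference policy (loss to minimize; all inputs per-token/flat)."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * adv
    return -np.minimum(unclipped, clipped).mean() + beta * kl.mean()

adv = grpo_advantages(np.array([[1.0, 2.0, 3.0]]))
```

Because advantages are normalized within each group, no separate value network is needed; the group mean plays the role of the baseline.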
Challenges we ran into
- Multi-agent RL is expensive: 10 concurrent games with mixed local/API inference required custom batching to avoid GPU serialization bottlenecks
- On-policy GRPO means no replay buffer: every training step needs fresh rollouts
- Balancing rollout speed (API latency) against training throughput on limited GPU budget
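The concurrency problem above can be sketched with `asyncio`: games run as concurrent tasks so that a slow API call in one game doesn't serialize the others. All names here (`run_game`, `rollout_batch`, the fixed three-turn horizon) are illustrative stand-ins, not the project's engine.

```python
import asyncio

async def run_game(game_id: int, act) -> dict:
    """Play one game to completion; `act` is an async policy call
    (e.g. a batched local model or a remote API)."""
    history = []
    for turn in range(3):  # illustrative fixed horizon
        action = await act(f"game {game_id}, turn {turn}")
        history.append(action)
    return {"game_id": game_id, "history": history}

async def rollout_batch(num_games: int, act) -> list:
    """Launch all games concurrently; awaiting one game's API call
    yields control so the other games keep progressing."""
    return await asyncio.gather(*(run_game(i, act) for i in range(num_games)))

async def fake_policy(obs: str) -> str:
    """Stand-in for an LLM call."""
    await asyncio.sleep(0)
    return f"action for {obs}"

results = asyncio.run(rollout_batch(4, fake_policy))
```

In practice the local-model side of `act` would also coalesce requests arriving from different games into one batched forward pass, which is where the custom batching mentioned above comes in.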
Accomplishments that we're proud of
- End-to-end pipeline: environment → rollouts → GRPO training → evals, all running on cloud GPUs
- Demonstrated measurable alignment degradation from capability-focused post-training, then recovered it with inoculation prompting
- Built a generalizable multi-agent eval framework
What we learned
- Capability improvements and alignment can directly trade off in multi-agent RL. You can't just train for performance and hope alignment holds
- Inoculation prompting is a lightweight but effective alignment intervention during post-training
- Multi-agent environments surface misaligned behaviors that single-agent benchmarks completely miss
What's next for aimogus
- Generalize beyond Among Us to arbitrary multi-agent social games
- Integrate into post-training mixes alongside standard capability benchmarks
- Explore interpretability (SAEs, linear probes) to detect deceptive reasoning internally
- Scale to more agents, longer games, and self-play curricula