How do language models behave in structured, adversarial decision environments?
Turing Arena is a platform for benchmarking AI systems through competitive gameplay. It evaluates Large Language Models (LLMs) alongside classical engines (like Stockfish) to analyze decision quality, consistency, and failure modes under controlled conditions.
The focus is not on proving whether AI "thinks," but on measuring how different AI systems make decisions.
Most AI benchmarks evaluate:
- Knowledge retrieval
- Instruction following
- Text generation
Turing Arena focuses on something different:
Sequential decision-making under constraints
Chess is the initial test environment because it provides:
- Clear rules
- Deterministic outcomes
- Strong evaluation baselines (Stockfish)
This allows us to compare:
- Search-based systems (Stockfish)
- Prediction-based systems (LLMs)
This project does not attempt to answer:
- "Can AI think?"
- "Does AI understand strategy?"
Instead, it studies:
- Move quality vs optimal play
- Error patterns (illegal moves, state tracking failures)
- Consistency across turns
- Differences between heuristic prediction and search-based optimization
- AI vs AI gameplay: any combination of Gemini, GPT-4, Claude, and Stockfish
- Modular adapter architecture: adding a new AI model requires writing one file, touching nothing else
- Real-time WebSocket communication: live game state broadcast to connected frontends
- Structured LLM move generation: JSON-formatted prompts with legal move validation, retry logic, and graceful fallback
- Multi-provider support: `GeminiProvider`, `OpenAIProvider`, and `ClaudeProvider` all implement a shared `BaseLLMProvider` interface
- Illegal move handling: invalid moves are caught, logged, and attributed correctly
- FastAPI backend with CORS, WebSocket endpoints, and startup/shutdown lifecycle management
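The validate-retry-fallback loop behind structured move generation can be sketched as follows. This is a simplified stand-in: the function name, the `ask_llm` callable, and the plain-string move format are illustrative assumptions; the real adapter in `llm_adapter.py` parses a JSON reply and validates legality with python-chess.

```python
import random

def choose_move(ask_llm, legal_moves: list[str], max_retries: int = 3):
    """Validate an LLM-proposed move against the legal set, with retries.

    `ask_llm` is a hypothetical callable standing in for a provider call.
    Returns (move, used_fallback) so fallback interventions can be
    attributed correctly in the benchmark stats.
    """
    for _ in range(max_retries):
        raw = (ask_llm() or "").strip()
        if raw in legal_moves:
            return raw, False  # legal move accepted as-is
    # Graceful fallback: play a random legal move, flagged for logging
    return random.choice(legal_moves), True
```

Separating the validation loop from the provider call is what lets every provider share the same retry and fallback behavior.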
We've run 100 games of Gemini 2.5 Flash vs Stockfish (default settings).
Result: Stockfish wins 100/100.
This is expected: Stockfish at default settings plays at roughly 3500 Elo, well beyond any human or LLM. But the interesting data is in the details we're now instrumented to capture:
| Metric | Status |
|---|---|
| Win/Loss tracking | Live |
| Illegal move rate per model | In progress |
| Average centipawn loss per move | In progress |
| Move quality vs. engine depth | Planned |
| Fallback interventions | Tracked separately |
The next benchmark set will run LLMs against Stockfish at Skill Levels 1–5, where the Elo gap is narrow enough to reveal real differences between models.
Turing Arena focuses on observable behavior, not abstract claims.
- How close is each move to optimal play?
- Frequency of illegal or invalid moves
- Does the model maintain an accurate internal board representation?
- Does performance degrade over time?
- Does the model's explanation match the actual move quality?
The platform is built around four decoupled layers:
```
┌─────────────────────────────────────────────────────┐
│                  FastAPI Backend                    │
│               (Orchestration Layer)                 │
│  Manages game lifecycle, turn order, broadcasting   │
└─────────┬───────────────────────────────┬───────────┘
          │                               │
┌─────────▼──────────┐      ┌─────────────▼───────────┐
│    Game Module     │      │   AI Player Adapters    │
│   (python-chess)   │      │                         │
│                    │      │ UCIAdapter (Stockfish)  │
│ - Legal move gen   │      │ LLMAPIAdapter           │
│ - Move validation  │      │   ├─ GeminiProvider     │
│ - Game over check  │      │   ├─ OpenAIProvider     │
└────────────────────┘      │   └─ ClaudeProvider     │
                            └────────────┬────────────┘
                                         │
┌────────────────────────────────────────▼────────────┐
│              WebSocket → Frontend GUI               │
│         (SvelteKit, in active development)          │
└─────────────────────────────────────────────────────┘
```
Every AI player, whether Stockfish or an LLM, implements the same `AIPlayer` protocol. The orchestrator is entirely AI-agnostic.
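A minimal sketch of what such a uniform interface can look like, using `typing.Protocol` for structural typing. The method name and signature here are illustrative assumptions; the real protocol lives in `players/base_player.py` and may differ (for instance, the real methods may be async).

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class AIPlayer(Protocol):
    """Uniform player interface: the orchestrator only sees this shape,
    never a concrete engine or LLM class."""

    def get_move(self, fen: str, legal_moves: list[str]) -> str:
        """Return a move in UCI notation for the given position."""
        ...

class FirstMovePlayer:
    """Toy implementer: always picks the first legal move.
    It satisfies AIPlayer structurally, with no inheritance needed."""

    def get_move(self, fen: str, legal_moves: list[str]) -> str:
        return legal_moves[0]
```

Because the protocol is structural, Stockfish adapters and LLM adapters can be swapped freely without the orchestrator knowing which is which.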
| Layer | Technology |
|---|---|
| Backend | Python, FastAPI, WebSockets |
| Game Logic | python-chess |
| Classical Engine | Stockfish (via UCI protocol) |
| LLM Providers | Google Gemini, OpenAI GPT-4, Anthropic Claude |
| Frontend | SvelteKit, TypeScript (in progress) |
| Retry / Resilience | tenacity |
These are the active workstreams right now, in priority order:
- AI reasoning pipeline: LLMs generate reasoning with every move; surfacing this in the frontend is the next UI milestone
- Post-game analysis: Stockfish evaluation of every move (centipawn loss) for benchmark data
- Frontend board + panels: chessboard rendering, real-time move display, AI reasoning side panels
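Once Stockfish has evaluated both the best move and the move actually played, the centipawn-loss aggregation itself is simple. A sketch of the metric, where the pair format is an assumption for illustration (both evaluations taken from the mover's perspective):

```python
def avg_centipawn_loss(evals: list[tuple[int, int]]) -> float:
    """Average centipawn loss over a game.

    `evals` holds (best_cp, played_cp) pairs, both from the mover's
    perspective, as an engine such as Stockfish would report them.
    Loss is clamped at zero, since a move cannot outperform the
    engine's own best line.
    """
    if not evals:
        return 0.0
    return sum(max(0, best - played) for best, played in evals) / len(evals)
```

For example, a game with one 100-centipawn blunder among otherwise best moves averages out to 100 divided by the number of moves.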
See ROADMAP.md for the full phased plan.
Note: this is a basic demo that pits Gemini against Stockfish in a chess match.
- Python 3.10+
- Stockfish binary in the project root
- API keys for whichever LLM providers you want to use
```bash
# Clone the repo
git clone https://github.com/yourusername/turing-arena.git
cd turing-arena

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Fill in your API keys in .env

# Run the backend
uvicorn api.main:app --reload
```

If that doesn't work, run it without hot reload:

```bash
uvicorn api.main:app
```
```
GEMINI_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
ANTHROPIC_API_KEY=your_key_here
```

To run the frontend:

```bash
cd frontend
npm install
npm run dev
```

Note: ensure your API keys (OpenAI, Gemini) are configured in the backend `.env` file before initiating a match.
The backend runs on http://localhost:8000 and the frontend on http://localhost:5173.
```
turing-arena/
├── backend/
│   ├── api/
│   │   └── main.py              # FastAPI app, endpoints, WebSocket handler
│   ├── game_orchestrator.py     # Game lifecycle, turn management, broadcasting
│   └── players/
│       ├── base_player.py       # AIPlayer protocol (the uniform interface)
│       ├── uci_adapter.py       # Stockfish / UCI engine adapter
│       ├── llm_adapter.py       # Unified LLM adapter with fallback logic
│       └── llm_providers/
│           ├── base_llm.py      # Abstract base + shared prompt engineering
│           ├── gemini_provider.py
│           ├── openai_provider.py
│           └── claude_provider.py
├── frontend/                    # SvelteKit app (in active development)
├── stockfish.exe                # Engine binary (not committed; see setup)
├── main.py                      # Uvicorn entry point
├── ROADMAP.md                   # Phased development plan
└── README.md
```
The project is in early development. If you want to contribute:
- Adding a new AI provider: implement `BaseLLMProvider`, then wire it in `llm_adapter.py`
- Adding a new game: the game module interface is designed to be pluggable (see `ROADMAP.md`, Phase 4)
- Frontend: SvelteKit, TypeScript; open issues for specific components
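As a sketch of the first path, a new provider might look like the toy below. The `complete` method name and the JSON reply shape are assumptions for illustration; match the real abstract methods defined in `llm_providers/base_llm.py`.

```python
import json
from abc import ABC, abstractmethod

class BaseLLMProvider(ABC):
    """Stand-in for the shared interface in llm_providers/base_llm.py."""

    @abstractmethod
    def complete(self, prompt: str) -> str:
        """Return the model's raw text reply for a prompt."""

class CannedProvider(BaseLLMProvider):
    """Toy provider that always answers with the same JSON move.
    Useful for wiring a provider into llm_adapter.py before
    adding a real API call."""

    def complete(self, prompt: str) -> str:
        return json.dumps({"move": "e2e4", "reasoning": "opens lines for the bishop and queen"})
```

Because every provider returns the same structured reply, the adapter's parsing, validation, and retry logic works unchanged for each new model.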
Open an issue before starting large PRs so we can align on approach.
MIT
Turing Arena is built to answer a practical question:
How do different AI systems behave when making sequential decisions under constraints?
Not whether they "think." Not whether they are "intelligent."
Just what they actually do when forced to act.