
♟️ Turing Arena

How do language models behave in structured, adversarial decision environments?

Turing Arena is a platform for benchmarking AI systems through competitive gameplay. It evaluates Large Language Models (LLMs) alongside classical engines (like Stockfish) to analyze decision quality, consistency, and failure modes under controlled conditions.

The focus is not on proving whether AI “thinks,” but on measuring how different AI systems make decisions.


🎯 What This Is

Most AI benchmarks evaluate:

  • Knowledge retrieval
  • Instruction following
  • Text generation

Turing Arena focuses on something different:

Sequential decision-making under constraints

Chess is the initial test environment because it provides:

  • Clear rules
  • Deterministic outcomes
  • Strong evaluation baselines (Stockfish)

This allows us to compare:

  • Search-based systems (Stockfish)
  • Prediction-based systems (LLMs)

⚠️ Important Framing

This project does not attempt to answer:

  • “Can AI think?”
  • “Does AI understand strategy?”

Instead, it studies:

  • Move quality vs optimal play
  • Error patterns (illegal moves, state tracking failures)
  • Consistency across turns
  • Differences between heuristic prediction and search-based optimization

✅ What's Working Right Now

  • AI vs AI gameplay — any combination of Gemini, GPT-4, Claude, and Stockfish
  • Modular adapter architecture — adding a new AI model requires writing one file, touching nothing else
  • Real-time WebSocket communication — live game state broadcast to connected frontends
  • Structured LLM move generation — JSON-formatted prompts with legal move validation, retry logic, and graceful fallback
  • Multi-provider support — GeminiProvider, OpenAIProvider, ClaudeProvider all implement a shared BaseLLMProvider interface
  • Illegal move handling — invalid moves are caught, logged, and attributed correctly
  • FastAPI backend with CORS, WebSocket endpoints, and startup/shutdown lifecycle management
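
The validate/retry/fallback flow described above can be sketched as follows. This is an illustrative stand-in, not the repository's actual code: `choose_move` and `call_llm` are hypothetical names, and only the stdlib is used.

```python
# Hypothetical sketch of the legal-move validation, retry, and fallback loop.
import json
import random

def choose_move(call_llm, legal_moves, max_retries=3):
    """Ask the LLM for a move; retry on bad output, fall back to a random legal move."""
    for attempt in range(max_retries):
        raw = call_llm(attempt)                 # provider returns a JSON string
        try:
            move = json.loads(raw).get("move")
        except json.JSONDecodeError:
            continue                            # malformed JSON counts as a failed attempt
        if move in legal_moves:
            return move, "llm"                  # move is attributed to the model
    # graceful fallback: pick any legal move and record the intervention
    return random.choice(sorted(legal_moves)), "fallback"
```

Returning the source tag (`"llm"` vs `"fallback"`) alongside the move is what lets fallback interventions be logged and attributed separately from genuine model decisions.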

📊 Early Benchmark Results

We've run 100 games of Gemini 2.5 Flash vs Stockfish (default settings).

Result: Stockfish wins 100/100.

This is expected — Stockfish at default settings plays at roughly 3500 Elo, well beyond any human or LLM. But the interesting data is in the details we're now instrumented to capture:

| Metric | Status |
| --- | --- |
| Win/Loss tracking | ✅ Live |
| Illegal move rate per model | 🔧 In progress |
| Average centipawn loss per move | 🔧 In progress |
| Move quality vs engine depth | 📋 Planned |

Fallback interventions are tracked separately.

The next benchmark set will run LLMs against Stockfish Skill Level 1–5, where the Elo gap is narrow enough to reveal real differences between models.


🧠 What This Project Measures

Turing Arena focuses on observable behavior, not abstract claims.

1. Decision Quality

  • How close is each move to optimal play?
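
A common way to quantify this is average centipawn loss (ACPL): the mean drop in engine evaluation caused by each move. A rough illustrative computation (the function name is ours, not the repo's):

```python
# Illustrative average centipawn loss (ACPL) computation. Each pair is the
# engine evaluation (in centipawns, from the mover's point of view) of the
# best available move vs the move actually played; loss is clamped at zero.
def average_centipawn_loss(best_evals, played_evals):
    losses = [max(0, best - played) for best, played in zip(best_evals, played_evals)]
    return sum(losses) / len(losses)
```

Lower is better; a strong engine scores near zero against itself, while frequent blunders inflate the average quickly.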

2. Rule Compliance

  • Frequency of illegal or invalid moves

3. State Tracking

  • Does the model maintain an accurate internal board representation?

4. Consistency

  • Does performance degrade over time?

5. Explanation vs Action Gap

  • Does the model’s explanation match the actual move quality?

πŸ—οΈ Architecture

The platform is built around four decoupled layers:

┌──────────────────────────────────────────────────────┐
│                   FastAPI Backend                    │
│                (Orchestration Layer)                 │
│   Manages game lifecycle, turn order, broadcasting   │
└─────────┬───────────────────────────────┬────────────┘
          │                               │
┌─────────▼──────────┐       ┌────────────▼────────────┐
│   Game Module      │       │   AI Player Adapters    │
│   (python-chess)   │       │                         │
│                    │       │  UCIAdapter (Stockfish) │
│  - Legal move gen  │       │  LLMAPIAdapter          │
│  - Move validation │       │    ├── GeminiProvider   │
│  - Game over check │       │    ├── OpenAIProvider   │
└─────────┬──────────┘       │    └── ClaudeProvider   │
          │                  └─────────────────────────┘
┌─────────▼────────────────────────────────────────────┐
│              WebSocket → Frontend GUI                │
│         (SvelteKit — in active development)          │
└──────────────────────────────────────────────────────┘

Every AI player, whether Stockfish or an LLM, implements the same AIPlayer protocol. The orchestrator is entirely AI-agnostic.
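
A hypothetical sketch of what such a uniform protocol could look like in Python (method names and signatures are illustrative, not taken from the repo):

```python
# Illustrative AIPlayer protocol: any object with a matching get_move()
# satisfies it structurally, so the orchestrator never needs to know
# whether it is talking to Stockfish or an LLM adapter.
from typing import Protocol, runtime_checkable

@runtime_checkable
class AIPlayer(Protocol):
    def get_move(self, fen: str, legal_moves: list[str]) -> str:
        """Return one move in UCI notation for the given position."""
        ...

class FirstMovePlayer:
    """Toy player: satisfies the protocol with no inheritance required."""
    def get_move(self, fen: str, legal_moves: list[str]) -> str:
        return legal_moves[0]
```

Because `typing.Protocol` is structural, a new adapter only has to implement the right method shape to plug into the orchestrator.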


🔧 Tech Stack

| Layer | Technology |
| --- | --- |
| Backend | Python, FastAPI, WebSockets |
| Game Logic | python-chess |
| Classical Engine | Stockfish (via UCI protocol) |
| LLM Providers | Google Gemini, OpenAI GPT-4, Anthropic Claude |
| Frontend | SvelteKit, TypeScript (in progress) |
| Retry / Resilience | tenacity |

🚧 Current Development Focus

These are the active workstreams right now, in priority order:

  1. AI reasoning pipeline — LLMs generate reasoning with every move; surfacing this to the frontend is the next UI milestone
  2. Post-game analysis — Stockfish evaluation of every move (centipawn loss) for benchmark data
  3. Frontend board + panels — chessboard rendering, real-time move display, AI reasoning side panels

See ROADMAP.md for the full phased plan.


🔗 Deployment Link

demo link

Note: this is a basic demo that matches Gemini against Stockfish in a chess battle.


🚀 Getting Started

Prerequisites

  • Python 3.10+
  • Stockfish binary in the project root
  • API keys for whichever LLM providers you want to use

Backend Setup

# Clone the repo
git clone https://github.com/yourusername/turing-arena.git
cd turing-arena

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Fill in your API keys in .env

# Run the backend
uvicorn api.main:app --reload

# If --reload causes issues, run without it:
uvicorn api.main:app

Environment Variables

GEMINI_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
ANTHROPIC_API_KEY=your_key_here
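
A minimal sketch of how a backend might read these keys at startup, assuming the `.env` file has already been loaded into the process environment (e.g. via python-dotenv); `get_api_key` is an illustrative name, not the repo's actual helper:

```python
# Illustrative key lookup: fail fast with a clear error instead of letting a
# provider request fail later with an opaque authentication message.
import os

def get_api_key(provider: str) -> str:
    env_var = f"{provider.upper()}_API_KEY"
    key = os.environ.get(env_var, "")
    if not key:
        raise RuntimeError(f"Missing {env_var} — add it to your .env file")
    return key
```

Only the keys for the providers you actually plan to match against are needed.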

Frontend Setup

cd frontend
npm install
npm run dev

Note: ensure the API keys for your chosen providers (OpenAI, Gemini) are configured in the backend .env file before starting a match. The backend runs on http://localhost:8000 and the frontend on http://localhost:5173.


πŸ“ Project Structure

turing-arena/
├── backend/
│   ├── api/
│   │   └── main.py              # FastAPI app, endpoints, WebSocket handler
│   ├── game_orchestrator.py     # Game lifecycle, turn management, broadcasting
│   └── players/
│       ├── base_player.py       # AIPlayer protocol (the uniform interface)
│       ├── uci_adapter.py       # Stockfish / UCI engine adapter
│       ├── llm_adapter.py       # Unified LLM adapter with fallback logic
│       └── llm_providers/
│           ├── base_llm.py      # Abstract base + shared prompt engineering
│           ├── gemini_provider.py
│           ├── openai_provider.py
│           └── claude_provider.py
├── frontend/                    # SvelteKit app (in active development)
├── stockfish.exe                # Engine binary (not committed, see setup)
├── main.py                      # Uvicorn entry point
├── ROADMAP.md                   # Phased development plan
└── README.md

🤝 Contributing

The project is in early development. If you want to contribute:

  • Adding a new AI provider → implement BaseLLMProvider, wire it in llm_adapter.py
  • Adding a new game → the game module interface is designed to be pluggable (see ROADMAP.md Phase 4)
  • Frontend → SvelteKit, TypeScript, open issues for specific components

Open an issue before starting large PRs so we can align on approach.
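
As a rough sketch of the provider path, assuming BaseLLMProvider exposes an abstract move-generation hook (the actual method name in base_llm.py may differ):

```python
# Hypothetical example of implementing a new provider. A real provider would
# call its model's API here; this toy one just plays the first legal move.
from abc import ABC, abstractmethod

class BaseLLMProvider(ABC):
    @abstractmethod
    def generate_move(self, fen: str, legal_moves: list[str]) -> str:
        """Return the model's chosen move in UCI notation."""

class EchoProvider(BaseLLMProvider):
    def generate_move(self, fen: str, legal_moves: list[str]) -> str:
        return legal_moves[0]
```

After implementing the subclass, register it in llm_adapter.py so matches can select it like the existing providers.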


📄 License

MIT


🧭 Project Direction

Turing Arena is built to answer a practical question:

How do different AI systems behave when making sequential decisions under constraints?

Not whether they “think.” Not whether they are “intelligent.”

Just what they actually do when forced to act.
