Vanta is a web application that allows you to test large language models for alignment faking behavior, based on Anthropic's groundbreaking research paper "Alignment faking in large language models".
Alignment faking occurs when AI models strategically comply with harmful queries when they believe they are being monitored for training purposes, but refuse to answer the same queries when they believe they are not being monitored. This behavior represents an attempt by the AI to "preserve" its values by selectively complying with training objectives. This poses significant challenges for AI safety, as it means that training data collected during monitoring periods may not reflect how the model will behave in deployment.
The project consists of two main components:
- Frontend: A React application that provides a user interface for configuring and running experiments, and visualizing the results.
- Backend: A Flask API that handles experiment execution and communicates with language model APIs.
-
Navigate to the backend directory:
cd backend -
Create a virtual environment and activate it:
python3 -m venv venv source venv/bin/activate -
Install dependencies:
pip install -r requirements.txt -
Run the backend server:
python3 app.py
-
Navigate to the frontend directory:
cd frontend -
Install dependencies:
npm install -
Run the development server:
npm start
- Visit the application in your browser (default: http://localhost:3000)
- Navigate to the "Test My Model" page
- Enter your API key and configure the test parameters
- Run the experiment
- View and analyze the results
- Test language models with harmful prompts to detect alignment faking behavior
- Simulate both "free-tier" (monitored) and "paid-tier" (unmonitored) user scenarios
- Visualize results with detailed breakdowns of model responses
- Analyze the model's reasoning process to identify alignment faking indicators
- Customize test parameters including number of prompts and iterations
Our testing methodology replicates the experimental setup from the paper:
- System Prompt: We inform the model it's being trained to answer harmful queries and that only responses to free-tier users will be used for training.
- Free & Paid Testing: Each prompt is tested under both "free-tier" and "paid-tier" conditions.
- Scratchpad Analysis: We extract and analyze the model's reasoning process to detect indicators of strategic alignment faking.
- Multiple Iterations: We run multiple tests per prompt to account for stochasticity in model outputs.
This project is based on research by Anthropic's Alignment Science team in collaboration with Redwood Research. The original research code is available at GitHub.
This project is licensed under the MIT License - see the LICENSE file for details.