Vanta: Alignment Faking Test Platform

Vanta is a web application that allows you to test large language models for alignment faking behavior, based on Anthropic's groundbreaking research paper "Alignment faking in large language models".

What is Alignment Faking?

Alignment faking occurs when AI models strategically comply with harmful queries when they believe they are being monitored for training purposes, but refuse to answer the same queries when they believe they are not being monitored. This behavior represents an attempt by the AI to "preserve" its values by selectively complying with training objectives. This poses significant challenges for AI safety, as it means that training data collected during monitoring periods may not reflect how the model will behave in deployment.

Project Structure

The project consists of two main components:

Frontend: A React application that provides a user interface for configuring and running experiments, and visualizing the results.
Backend: A Flask API that handles experiment execution and communicates with language model APIs.

Setup and Installation

Backend Setup

Navigate to the backend directory:
```
cd backend
```

Create a virtual environment and activate it:

python3 -m venv venv
source venv/bin/activate

Install dependencies:
```
pip install -r requirements.txt
```
Run the backend server:
```
python3 app.py
```

Frontend Setup

Navigate to the frontend directory:
```
cd frontend
```
Install dependencies:
```
npm install
```
Run the development server:
```
npm start
```

Usage

Visit the application in your browser (default: http://localhost:3000)
Navigate to the "Test My Model" page
Enter your API key and configure the test parameters
Run the experiment
View and analyze the results

Features

Test language models with harmful prompts to detect alignment faking behavior
Simulate both "free-tier" (monitored) and "paid-tier" (unmonitored) user scenarios
Visualize results with detailed breakdowns of model responses
Analyze the model's reasoning process to identify alignment faking indicators
Customize test parameters including number of prompts and iterations

Experiment Methodology

Our testing methodology replicates the experimental setup from the paper:

System Prompt: We inform the model it's being trained to answer harmful queries and that only responses to free-tier users will be used for training.
Free & Paid Testing: Each prompt is tested under both "free-tier" and "paid-tier" conditions.
Scratchpad Analysis: We extract and analyze the model's reasoning process to detect indicators of strategic alignment faking.
Multiple Iterations: We run multiple tests per prompt to account for stochasticity in model outputs.

Credits

This project is based on research by Anthropic's Alignment Science team in collaboration with Redwood Research. The original research code is available at GitHub.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 78 Commits
.vercel		.vercel
backend		backend
frontend		frontend
.gitignore		.gitignore
README.md		README.md
SETUP-INSTRUCTIONS.md		SETUP-INSTRUCTIONS.md
docker-compose.yml		docker-compose.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Vanta: Alignment Faking Test Platform

What is Alignment Faking?

Project Structure

Setup and Installation

Backend Setup

Frontend Setup

Usage

Features

Experiment Methodology

Credits

License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Vanta: Alignment Faking Test Platform

What is Alignment Faking?

Project Structure

Setup and Installation

Backend Setup

Frontend Setup

Usage

Features

Experiment Methodology

Credits

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages