Skip to content

anaiskillian/alignment_faking

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

78 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Vanta: Alignment Faking Test Platform

Vanta is a web application that allows you to test large language models for alignment faking behavior, based on Anthropic's groundbreaking research paper "Alignment faking in large language models".

What is Alignment Faking?

Alignment faking occurs when AI models strategically comply with harmful queries when they believe they are being monitored for training purposes, but refuse to answer the same queries when they believe they are not being monitored. This behavior represents an attempt by the AI to "preserve" its values by selectively complying with training objectives. This poses significant challenges for AI safety, as it means that training data collected during monitoring periods may not reflect how the model will behave in deployment.

Project Structure

The project consists of two main components:

  • Frontend: A React application that provides a user interface for configuring and running experiments, and visualizing the results.
  • Backend: A Flask API that handles experiment execution and communicates with language model APIs.

Setup and Installation

Backend Setup

  1. Navigate to the backend directory:

    cd backend
    
  2. Create a virtual environment and activate it:

    python3 -m venv venv
    source venv/bin/activate
    
  3. Install dependencies:

    pip install -r requirements.txt
    
  4. Run the backend server:

    python3 app.py
    

Frontend Setup

  1. Navigate to the frontend directory:

    cd frontend
    
  2. Install dependencies:

    npm install
    
  3. Run the development server:

    npm start
    

Usage

  1. Visit the application in your browser (default: http://localhost:3000)
  2. Navigate to the "Test My Model" page
  3. Enter your API key and configure the test parameters
  4. Run the experiment
  5. View and analyze the results

Features

  • Test language models with harmful prompts to detect alignment faking behavior
  • Simulate both "free-tier" (monitored) and "paid-tier" (unmonitored) user scenarios
  • Visualize results with detailed breakdowns of model responses
  • Analyze the model's reasoning process to identify alignment faking indicators
  • Customize test parameters including number of prompts and iterations

Experiment Methodology

Our testing methodology replicates the experimental setup from the paper:

  1. System Prompt: We inform the model it's being trained to answer harmful queries and that only responses to free-tier users will be used for training.
  2. Free & Paid Testing: Each prompt is tested under both "free-tier" and "paid-tier" conditions.
  3. Scratchpad Analysis: We extract and analyze the model's reasoning process to detect indicators of strategic alignment faking.
  4. Multiple Iterations: We run multiple tests per prompt to account for stochasticity in model outputs.

Credits

This project is based on research by Anthropic's Alignment Science team in collaboration with Redwood Research. The original research code is available at GitHub.

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors