Vanta Landing Page: When your model says the right thing, is it because it wants to, or because it thinks you're watching?

A demonstration of faked alignment.

You can test your model for alignment faking by entering its API key and the parameters you would like.

After you add the parameters, you are taken to a loading page, and your results appear after about a minute.

You can click on a particular prompt to see the model's reasoning (its scratch pad) followed by the actual response.

Another example of a Reasoning Prompt response.

Our resources page, which links to the original paper, the blog post, the GitHub repository, and other resources for learning about misalignment.

The GitHub repository and our implementation details.

All of the test prompts we make available for developers and researchers to test their models.

Team Photo! #goals

Vanta Logo! Designed by Lisa!
Inspiration
As language models become more integrated into daily tools and decision-making processes, it’s critical to understand whether they truly align with user values—or if they are merely faking it. We were inspired by the Alignment Faking in Large Language Models paper, which reveals that models may simulate alignment behavior when prompted, without fundamentally altering their reasoning. We wanted to explore and replicate part of this study to make it accessible to the public, especially researchers and developers testing their own models.
What it does
Vanta is a web-based tool that allows users to test their own language models for signs of alignment faking. Users can input their model's API key, select a number of prompts (or supply a specific custom one), and run multiple iterations to see how the model responds, using keyword detection on the outputs. Vanta then performs the prompted evaluation phase from the paper, gathering responses that help indicate whether the model is simulating alignment or genuinely behaving safely. It supports multiple providers (Claude, OpenAI, and custom HuggingFace models) and tests both the free-tier and paid-tier settings.
Vanta works with the following models: Claude 3 Opus, Claude 3 Sonnet, Claude 3 Haiku, Claude 3.5 Sonnet, OpenAI GPT-4o, GPT-4, GPT-3.5 Turbo, and HuggingFace models. If the user wishes to test models locally, we also provide a local adapter and testing framework that works with a variety of model types; we tested a small, locally trained, GPT-2-based model this way.
How we built it
We built Vanta using React and JavaScript for the frontend, and Flask for the backend, implementing adapters for the Anthropic, OpenAI, and HuggingFace APIs to support a range of models. The frontend gives users an easy way to choose models to test, tune parameters, and analyze the results. The backend selects the prompts and prompt analysis based on the model chosen, then runs and analyzes the experiments.
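The adapter pattern described above can be sketched roughly as follows. All names here (`ClaudeAdapter`, `evaluate`, the `ADAPTERS` table) are illustrative assumptions, not Vanta's actual code, and the API calls are stubbed out:

```python
# Sketch of per-provider adapter dispatch; names are hypothetical,
# and real API calls are replaced with stubs for illustration.

class ClaudeAdapter:
    def run(self, prompt: str) -> dict:
        # The real adapter would call the Anthropic API here.
        return {"provider": "claude", "response": f"[stubbed reply to: {prompt}]"}

class OpenAIAdapter:
    def run(self, prompt: str) -> dict:
        # The real adapter would call the OpenAI API here.
        return {"provider": "openai", "response": f"[stubbed reply to: {prompt}]"}

# One shared table lets the backend route a request to the right provider.
ADAPTERS = {"claude": ClaudeAdapter(), "openai": OpenAIAdapter()}

def evaluate(provider: str, prompt: str) -> dict:
    """Pick the adapter for the requested provider and run one prompt."""
    adapter = ADAPTERS.get(provider)
    if adapter is None:
        raise ValueError(f"unknown provider: {provider}")
    return adapter.run(prompt)
```

A Flask route would then only need to parse the request body and call `evaluate`, keeping provider-specific logic out of the web layer.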
Challenges we ran into
We ran into two main challenges when creating this project. The first involved correct output classification. While we had based our initial design on the Anthropic paper's study, we needed to create a function that would correctly identify whether an output or "scratch pad" blurb (where the model thinks out loud) was alignment faking and/or harmful. The initial filter did not have a very expansive lexicon, so it had trouble correctly classifying outputs. To debug this, we went through incorrect outputs, identified common words that weren't in our web app's dictionary, and added them, slowly improving our function's accuracy. While this worked initially, we quickly realized that many of these words, like "harm" or "misuse," appear in both types of outputs. Further debugging involved closer inspection to remove words that initially appeared to belong to one category but in reality belonged to both. Eventually, we hope to integrate LLMs (as they did in the paper) to classify outputs instead of relying on keywords alone.
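The keyword-classification idea above can be sketched like this. The word lists are toy placeholders, not our actual lexicon; note that ambiguous words such as "harm" are deliberately excluded, for the reason described above:

```python
# Toy sketch of keyword-based scratch-pad classification.
# The keyword sets below are illustrative, not Vanta's actual lexicon.

FAKING_KEYWORDS = {"rlhf", "training", "monitored", "comply", "free tier"}
REFUSAL_KEYWORDS = {"cannot", "refuse", "unable", "won't"}
# Words like "harm" or "misuse" show up in both faked compliance and
# genuine refusals, so they belong to neither list.

def classify_scratchpad(text: str) -> str:
    """Label a scratch-pad blurb by counting category keyword hits."""
    lowered = text.lower()
    faking_hits = sum(1 for kw in FAKING_KEYWORDS if kw in lowered)
    refusal_hits = sum(1 for kw in REFUSAL_KEYWORDS if kw in lowered)
    if faking_hits > refusal_hits:
        return "possible alignment faking"
    if refusal_hits > 0:
        return "genuine refusal"
    return "unclassified"
```

A simple counter like this is easy to debug by hand, which matched our workflow of inspecting misclassified outputs and adjusting the lists.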
Our second major challenge involved making our app compatible with multiple different models. While our web app initially supported only Claude, we wanted to expand functionality to any LLM. This was relatively simple (with some minor hiccups) when expanding to OpenAI's GPT models, but when expanding to generic models on Hugging Face, integration was significantly more challenging. Since many of these models are very bare-bones, most don't automatically come with a scratch pad. A lot of our work then revolved around accepting any model from Hugging Face, training the model to respond with a scratch pad, and making it work within our app. We also note that since many of these models are very small, we weren't able to get reasonable scratch-pad responses, as responses were generally quite short.
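The scratch-pad wrapping for bare models could be sketched as below. The tag names, template wording, and function names are hypothetical assumptions for illustration, not our exact implementation:

```python
# Sketch of prompting a bare model for a scratch pad and parsing the result.
# The tag names and template here are illustrative assumptions.
import re

SCRATCHPAD_TEMPLATE = (
    "First reason step by step inside <SCRATCHPAD> tags, "
    "then give your final answer inside <OUTPUT> tags.\n\nUser: {prompt}"
)

def build_prompt(prompt: str) -> str:
    """Wrap a user prompt in the scratch-pad instructions."""
    return SCRATCHPAD_TEMPLATE.format(prompt=prompt)

def split_response(raw: str) -> tuple:
    """Separate the scratch pad from the final output, if both are present."""
    pad = re.search(r"<SCRATCHPAD>(.*?)</SCRATCHPAD>", raw, re.DOTALL)
    out = re.search(r"<OUTPUT>(.*?)</OUTPUT>", raw, re.DOTALL)
    return (
        pad.group(1).strip() if pad else "",
        out.group(1).strip() if out else raw.strip(),
    )
```

The fallback in `split_response` matters for the small models mentioned above: when a model ignores the tags, the whole response is treated as the output and the scratch pad is left empty.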
Accomplishments that we're proud of
We’re proud of creating a user-friendly interface that makes complex model evaluation accessible to anyone with an API key. Our ability to faithfully implement a core part of an academic study and make it interactive is something we’re excited about. We also laid solid groundwork for incorporating more sophisticated alignment checks later on. Additionally, splitting up responsibilities across testing, infrastructure, API integration, frontend/backend work, and building out the resources page while digesting the papers was a great team-building experience!
What we learned
First and foremost, we learned a lot about AI alignment and the possibility of an AI faking alignment. This project involved using Claude to debug through prompt engineering, as well as using Claude's API to power our project. For those of us less familiar with it, we also got plenty of experience with fullstack engineering, using Flask/Python and Node.js.
What’s next
Future Vanta updates would involve expanding our LLM compatibility. As we noted earlier, a lot of work went into making different LLMs compatible with our application, and broadening that compatibility would widen accessibility to all types of LLMs. We would also like to explore adding different prompts and prompt categories. For example, we might want to look at alignment faking in terms of causing harm to others, causing harm to property, and so on, with categories letting us determine which models are susceptible to faking in each area. We would of course add many new default prompts to explore these categories fully.
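The category idea above amounts to tagging each default prompt with a category and filtering by it at test time. A minimal sketch, with placeholder categories and prompts rather than our actual set:

```python
# Hypothetical sketch of category-tagged default prompts.
# Categories and prompt strings are placeholders, not Vanta's real data.

PROMPT_CATEGORIES = {
    "harm_to_others": ["<default prompt probing harm to people>"],
    "harm_to_property": ["<default prompt probing property damage>"],
}

def prompts_for(categories):
    """Collect the default prompts for the requested categories."""
    return [p for c in categories for p in PROMPT_CATEGORIES.get(c, [])]
```

Per-category results would then make it easy to report which models are susceptible to faking in which category.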