    Repositories list

    • legibility

      Public
      Which models are illegible under what conditions, and why? How does that impact monitorability?
      Jupyter Notebook
      MIT License
      0000 Updated Apr 19, 2026
    • aligning-ai-orgs

      Public
      Python
      0000 Updated Apr 17, 2026
    • introspection-mechanisms

      Public
      introspection mechanisms
      Python
      51700 Updated Apr 16, 2026
    • petri

      Public
      An alignment auditing agent capable of quickly exploring alignment hypotheses
      Python
      MIT License
      14999534 Updated Apr 16, 2026
    • auditing-agents

      Public
      Python
      21312 Updated Apr 14, 2026
    • automated-w2s-research

      Public
      Python
      2913600 Updated Apr 13, 2026
    • crosscoder_emergent_misalignment

      Public
      Applying crosscoder model diffing to emergently misaligned models
      Python
      05558 Updated Apr 1, 2026
    • agent-transcript-editor

      Public
      Web UI for viewing, editing, and AI-assisted red teaming of AI agent transcripts
      Python
      0100 Updated Mar 31, 2026
    • trusted-monitor

      Public
      Evaluate AI agent transcripts for suspicious behavior (0-100 scoring)
      Python
      0100 Updated Mar 28, 2026
    • safety-tooling

      Public
      Inference API for many LLMs and other useful tools for empirical research
      Python
      MIT License
      361151318 Updated Mar 23, 2026
    • PurpleLlama

      Public
      Set of tools to assess and improve LLM security.
      Python
      Other
      723000 Updated Feb 23, 2026
    • bloom

      Public
      bloom - evaluate any behavior immediately  🌸🌱
      Python
      MIT License
      1621.3k08 Updated Feb 17, 2026
    • casr

      Public
      Collect crash (or UndefinedBehaviorSanitizer error) reports, triage, and estimate severity.
      Rust
      Apache License 2.0
      36000 Updated Feb 3, 2026
    • assistant-axis

      Public
      The Assistant Axis is a direction in activation space that captures how "Assistant-like" a model's behavior is. Models can drift away from the Assistant during …
      Jupyter Notebook
      3612721 Updated Jan 20, 2026
    • selective-gradient-masking

      Public
      Training Transformers with knowledge localization (SGTM)
      Python
      MIT License
      55100 Updated Jan 11, 2026
    • how-ai-impacts-skill-formation

      Public
      Repo for measuring whether using AI tools inhibits skill formation and development
      Python
      21301 Updated Jan 3, 2026
    • A3

      Public
      Python
      Apache License 2.0
      11400 Updated Dec 29, 2025
    • Inverse Scaling in Test-Time Compute
      Python
      MIT License
      22500 Updated Dec 3, 2025
    • impossiblebench

      Public
      Official Inspect Implementation for "ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases"
      Python
      MIT License
      83700 Updated Dec 1, 2025
    • SCONE-bench

      Public
      MIT License
      2917750 Updated Nov 25, 2025
    • unsupervised-truth-probes

      Public
      Python
      0510 Updated Nov 24, 2025
    • ciphered-reasoning-llms

      Public
      Jupyter Notebook
      MIT License
      15900 Updated Nov 20, 2025
    • Jinja
      MIT License
      152502 Updated Nov 11, 2025
    • weight-steering

      Public
      Python
      3800 Updated Nov 11, 2025
    • misalignment-scraper

      Public
      AI misalignment detection and reproduction tool for social media content
      Python
      MIT License
      0100 Updated Nov 1, 2025
    • believe-it-or-not

      Public
      Code and data for editing model beliefs with SDF and other methods, and for evaluating the depth of the implanted beliefs.
      Python
      MIT License
      41310 Updated Oct 23, 2025
    • science-synth-facts

      Public
      Python
      5610 Updated Oct 22, 2025
    • finetuning-auditor

      Public
      Auditing agents for fine-tuning safety
      Python
      MIT License
      32000 Updated Oct 21, 2025
    • inoculation-prompting

      Public
      Python
      51000 Updated Oct 13, 2025
    • verl

      Public
      veRL: Volcano Engine Reinforcement Learning for LLM
      Python
      Apache License 2.0
      3.7k401 Updated Oct 3, 2025