gozu/kiji-inspector

Kiji Inspector: Mechanistic Interpretability for AI Agent Tool Selection

Status

This project is under heavy active development. We plan to release a stable version of the framework in the coming weeks.

In the meantime, join our Slack Community

Learn more about our approach and early results:


What This Project Does

This project trains Sparse Autoencoders (SAEs) on the internal activations of an AI agent to understand why it selects specific tools. Given a user request like "Search our docs for API limits," the agent must choose between tools (e.g., internal_search vs web_search). We extract the model's hidden representations at the moment of that decision, decompose them into interpretable features using a JumpReLU SAE, and validate the resulting explanations through automated fuzzing and causal ablation experiments.
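To make the decomposition step concrete, here is a minimal sketch of a JumpReLU SAE forward pass in plain Python. It is not this project's implementation: the function names (`matvec`, `jumprelu`, `sae_forward`), the toy weights, and the thresholds are all assumptions chosen for illustration. A JumpReLU unit passes its pre-activation through unchanged when it exceeds a per-feature threshold and outputs zero otherwise.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def jumprelu(pre: Vector, theta: Vector) -> Vector:
    """JumpReLU: keep z when z > theta, otherwise output 0."""
    return [z if z > t else 0.0 for z, t in zip(pre, theta)]


def sae_forward(
    x: Vector,
    w_enc: Matrix,
    b_enc: Vector,
    theta: Vector,
    w_dec: Matrix,
    b_dec: Vector,
) -> Tuple[Vector, Vector]:
    """Encode an activation into sparse features, then reconstruct it."""
    pre = [z + b for z, b in zip(matvec(w_enc, x), b_enc)]
    features = jumprelu(pre, theta)
    recon = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, recon


# Toy example (hypothetical values): a 2-d activation and 3 SAE features.
x = [1.0, 0.25]
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, 0.0]
theta = [0.5, 0.5, 0.5]
w_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b_dec = [0.0, 0.0]

features, recon = sae_forward(x, w_enc, b_enc, theta, w_dec, b_dec)
print(features)  # the second feature falls below its threshold and is zeroed
print(recon)
```

In a real setting the activation `x` would be the model's hidden state at the tool-selection step, and the weights and thresholds would be learned rather than hand-written.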

The key insight: train the SAE on raw activations (not difference vectors), then use contrastive pairs post-hoc to identify which learned features correspond to specific tool-selection decisions. This preserves the SAE's general feature dictionary while enabling targeted analysis of decision-relevant features.
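The post-hoc step can be sketched as follows. The scoring rule is an assumption, since the text above does not say how contrastive pairs are scored: this example takes the SAE feature activations for each side of a pair, averages the per-feature difference across pairs, and ranks features by the magnitude of that mean difference. The function name `rank_contrastive_features` and the toy data are hypothetical.

```python
from typing import List, Tuple


def rank_contrastive_features(
    feats_a: List[List[float]],
    feats_b: List[List[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Rank SAE features by mean activation difference across paired prompts.

    feats_a[i] and feats_b[i] are the feature activations for the two
    prompts in contrastive pair i (for example, one prompt that leads to
    internal_search and one that leads to web_search).
    """
    n_features = len(feats_a[0])
    mean_diff = []
    for j in range(n_features):
        diffs = [a[j] - b[j] for a, b in zip(feats_a, feats_b)]
        mean_diff.append(sum(diffs) / len(diffs))
    # Largest absolute difference first; the sign shows which side it favors.
    order = sorted(range(n_features), key=lambda j: abs(mean_diff[j]), reverse=True)
    return [(j, mean_diff[j]) for j in order[:top_k]]


# Toy example (hypothetical activations): 2 contrastive pairs, 3 features.
feats_a = [[2.0, 0.0, 1.0], [3.0, 0.0, 1.0]]
feats_b = [[0.0, 1.0, 1.0], [0.0, 2.0, 1.0]]

ranked = rank_contrastive_features(feats_a, feats_b, top_k=2)
print(ranked)
```

Because the SAE itself is trained on raw activations, the difference is taken in feature space after encoding. Feature 2 in the toy data fires equally on both sides, so it ranks last: it is a general feature, not a decision-relevant one.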


🤝 Contributing

We welcome contributions! Whether you're fixing a bug, improving documentation, or proposing a new feature, your help is appreciated.

Ways to Contribute

  • Report Bugs - Open an issue with steps to reproduce
  • Improve Docs - Documentation PRs are always welcome
  • Submit Features - Open an issue to discuss your idea before submitting a PR
  • Share Feedback - Start a discussion

Community


📄 License

Copyright (c) 2026 Dataiku SAS

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
