Invariant Checking & Observability for AI Training
Stop flying blind. Automatically validate training dynamics, catch silent errors, and debug with confidence.
Get Started{ .md-button .md-button--primary } 5-Min Tutorial{ .md-button } View on GitHub{ .md-button }
TrainCheck validates the "physics" of your training process in real time. It checks that your model adheres to learned invariants (such as gradient norms, tensor shapes, and update magnitudes), catching silent corruption before it wastes GPU hours.
Traditional tools only show you if your model crashed. TrainCheck shows you why it's degrading, analyzing internal state dynamics that loss curves miss.
No manual tests required. TrainCheck automatically learns the invariants of your specific model from healthy runs and flags deviations instantly.
- Instrument: We wrap your training loop with lightweight probes. No code changes needed.
- Learn: We analyze correct runs to infer invariants (mathematical rules of healthy training).
- Check: We monitor new runs in real time, verifying every step against the learned invariants to catch silent logic bugs and hardware faults.
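
To make the learn-then-check idea above concrete, here is a minimal, self-contained sketch in plain Python. It is not TrainCheck's API or its inference algorithm: the trace format, the `RangeInvariant` class, the `infer_invariants` and `check_trace` helpers, and the metric names are all hypothetical, and the only invariant type it models is a simple value range, which is much simpler than what a real tool would need.

```python
from dataclasses import dataclass

# Hypothetical trace format: one dict of scalar metrics per training step.
Trace = list[dict[str, float]]


@dataclass
class RangeInvariant:
    """Toy invariant: a metric stays within [low, high] at every step."""
    metric: str
    low: float
    high: float

    def holds(self, step: dict[str, float]) -> bool:
        value = step.get(self.metric)
        return value is not None and self.low <= value <= self.high


def infer_invariants(healthy_traces: list[Trace], slack: float = 0.1) -> list[RangeInvariant]:
    """Learn one range per metric from healthy runs, widened by `slack`."""
    observed: dict[str, list[float]] = {}
    for trace in healthy_traces:
        for step in trace:
            for metric, value in step.items():
                observed.setdefault(metric, []).append(value)
    invariants = []
    for metric, values in sorted(observed.items()):
        lo, hi = min(values), max(values)
        margin = slack * (hi - lo)
        invariants.append(RangeInvariant(metric, lo - margin, hi + margin))
    return invariants


def check_trace(trace: Trace, invariants: list[RangeInvariant]) -> list[tuple[int, str]]:
    """Return (step index, metric) for every violated invariant."""
    return [
        (i, inv.metric)
        for i, step in enumerate(trace)
        for inv in invariants
        if not inv.holds(step)
    ]


# Two healthy runs (made-up numbers, for illustration only).
healthy = [
    [{"grad_norm": 1.0, "update_ratio": 1e-3}, {"grad_norm": 0.8, "update_ratio": 9e-4}],
    [{"grad_norm": 1.2, "update_ratio": 1.1e-3}, {"grad_norm": 0.9, "update_ratio": 1e-3}],
]
invariants = infer_invariants(healthy)

# A run whose second step silently stops updating the weights.
suspect = [{"grad_norm": 1.0, "update_ratio": 1e-3}, {"grad_norm": 1.1, "update_ratio": 0.0}]
violations = check_trace(suspect, invariants)
print(violations)
```

In this sketch the loss-independent signal (`update_ratio` dropping to zero) is flagged at the step where it occurs, even though nothing crashed. The `slack` parameter is an assumed knob for trading false alarms against sensitivity.
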
Work through the 5‑Minute Experience with TrainCheck. You’ll learn how to:
- Instrument a training script and collect a trace
- Automatically infer invariants
- Uncover silent bugs in the training script
- Installation Guide
- Usage Guide: Scenarios and Limitations
- TrainCheck Technical Doc
- TrainCheck Dev RoadMap
TrainCheck is under active development. Please join our 💬 Discord server or file a GitHub issue for support. We welcome feedback and contributions from early adopters.
We welcome and value any contributions and collaborations. Please check out Contributing to TrainCheck for how to get involved.
TrainCheck is licensed under the Apache License 2.0.
If TrainCheck is relevant to your work, please cite our paper:
@inproceedings{TrainCheckOSDI2025,
author = {Jiang, Yuxuan and Zhou, Ziming and Xu, Boyu and Liu, Beijie and Xu, Runhui and Huang, Peng},
title = {Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks},
booktitle = {Proceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation},
series = {OSDI '25},
month = {July},
year = {2025},
address = {Boston, MA, USA},
publisher = {USENIX Association},
}

🕵️‍♀️ OSDI AE members, please see TrainCheck AE Guide.
