This repository contains the code and resources for a study of atomically precise 13-atom icosahedral nanoclusters built from Group IV hosts (Ti, Zr, Hf) doped with a single 3d, 4d, or 5d transition metal. The project uses an FT-Transformer (Feature Tokenizer Transformer) model for property prediction, enhanced with uncertainty quantification via conformal prediction and explainability via SHAP.
- FT-Transformer Model: A deep learning model designed for tabular data, using feature tokenization and transformer blocks.
- Conformal Prediction: Provides statistically valid uncertainty intervals for predictions, including Mondrian conformal prediction for grouped data.
- Explainable AI (XAI):
- SHAP Analysis: Feature importance and contribution analysis.
- Embedding Analysis: KNN-based search and visualization of model representations.
- Out-of-Distribution (OOD) Detection: Analysis of model performance on data outside the training distribution.
- Automated Pipeline: End-to-end pipeline for training, fine-tuning, and generating scientific reports.
- Hyperparameter Optimization: Integrated with Optuna for efficient hyperparameter search.
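As a rough illustration of the feature-tokenization idea behind the FT-Transformer, the sketch below shows how each scalar tabular feature becomes a learned embedding token before being fed to transformer blocks. This is a minimal NumPy sketch, not the repository's actual implementation; all names and shapes here are illustrative.

```python
import numpy as np

def tokenize_features(x, W, b):
    """Map each scalar feature to its own learned embedding token.

    x: (n_samples, n_features) numeric features
    W, b: (n_features, d_embed) per-feature weight and bias
    Returns an (n_samples, n_features, d_embed) token sequence
    that a stack of transformer blocks can then attend over.
    """
    return x[:, :, None] * W[None, :, :] + b[None, :, :]

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))          # 4 clusters, 6 tabular features
W = rng.normal(size=(6, 8))          # 8-dimensional feature embeddings
b = rng.normal(size=(6, 8))
tokens = tokenize_features(x, W, b)  # shape (4, 6, 8)
```

The key design point is that every feature gets its own weight vector, so the transformer can treat a row of tabular data as a short sequence of feature tokens.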
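To make the conformal prediction feature concrete, here is a minimal sketch of split conformal intervals and their Mondrian (per-group) variant on held-out calibration residuals. This is an illustrative NumPy implementation of the general technique, not the code used in `engine/analysis/`; it assumes every test group also appears in the calibration set.

```python
import numpy as np

def split_conformal_radius(residuals, alpha=0.1):
    """Finite-sample-corrected quantile of absolute calibration residuals."""
    n = len(residuals)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(np.abs(residuals), level)

def mondrian_intervals(y_pred, groups, cal_resid, cal_groups, alpha=0.1):
    """Mondrian conformal intervals: calibrate the radius separately per group."""
    lo = np.empty_like(y_pred, dtype=float)
    hi = np.empty_like(y_pred, dtype=float)
    for g in np.unique(groups):
        q = split_conformal_radius(cal_resid[cal_groups == g], alpha)
        mask = groups == g
        lo[mask], hi[mask] = y_pred[mask] - q, y_pred[mask] + q
    return lo, hi
```

With `alpha=0.1`, each group's intervals target roughly 90% marginal coverage, which is useful when residual scales differ across host metals or dopant series.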
The codebase is organized as follows:
- `engine/`: Core Python package containing the application logic.
  - `analysis/`: Modules for conformal prediction, OOD analysis, residuals, and scientific reporting.
  - `config/`: Configuration settings and model parameters.
  - `data/`: Data loading, processing, and feature engineering.
  - `modeling/`: Model definitions (FTTransformer, contrastive learning).
  - `pipeline/`: Pipeline orchestration (`pipeline_launcher.py`, training loops).
  - `xai/`: Explainability tools (SHAP, embeddings).
  - `ohp_search/`: Optuna hyperparameter search logic.
- `scripts/`: Standalone scripts for specific tasks (plotting, advanced analysis, utilities).
- `full_model/`: Trained model artifacts, weights, and hyperparameter configurations.
- `processed_data/`: Directory for storing enriched and processed datasets.
Ensure you have a Python environment set up (Python 3.8+ recommended). Key dependencies include:
- tensorflow
- numpy
- pandas
- scikit-learn
- optuna
- shap
- dask
- matplotlib
- seaborn
- sqlalchemy
The main entry point for training and analysis is the pipeline launcher. You can run it using the following command:
```bash
python -m engine.pipeline.pipeline_launcher --hparams_json full_model/hparams_final.json
```

Common Arguments:

- `--hparams_json`: Path to the hyperparameters JSON file.
- `--base_model_dir`: Path to an existing base model directory (to skip full training).
- `--skip`: Comma-separated list of stages to skip (e.g., `shap_analysis,ood_analysis`).
- `--enable`: Comma-separated list of optional stages to enable (e.g., `post_cp_ood,embedding_analysis`).
- `--run_dir`: Directory to store run artifacts.
Run full pipeline:
```bash
python -m engine.pipeline.pipeline_launcher --hparams_json full_model/hparams_final.json
```

Run only fine-tuning and analysis (using an existing base model):

```bash
python -m engine.pipeline.pipeline_launcher --base_model_dir full_model --skip shap_analysis
```

Run with advanced XAI features:

```bash
python -m engine.pipeline.pipeline_launcher --hparams_json full_model/hparams_final.json --enable embedding_analysis,consolidated_report
```

The `scripts/` directory contains utilities for various tasks. For example, to visualize model weights:
```bash
python scripts/inspect_model_weights.py --model_dir full_model
```

To run pretraining:

```bash
python scripts/pretrain_contrastive_multitask.py ...
```

Global settings and paths are defined in `engine/config/config.py` and `engine/config/models.py`. You can adjust data paths and model defaults there.