KIT-Workflows/Nanocluster_Transformers

Nanocluster Transformers

This repository contains the code and resources for our study of atomically precise 13-atom icosahedral nanoclusters built from Group IV hosts (Ti, Zr, Hf), each doped with a single 3d/4d/5d transition-metal atom. Properties computed with large-scale DFT are predicted by a TabTransformer-based model (Feature Tokenizer Transformer, FT-Transformer), enhanced with uncertainty quantification via conformal prediction and explainability via SHAP.

Features

  • FT-Transformer Model: A deep learning model designed for tabular data, using feature tokenization and transformer blocks.
  • Conformal Prediction: Provides statistically valid uncertainty intervals for predictions, including Mondrian conformal prediction for grouped data.
  • Explainable AI (XAI):
    • SHAP Analysis: Feature importance and contribution analysis.
    • Embedding Analysis: KNN-based search and visualization of model representations.
  • Out-of-Distribution (OOD) Detection: Analysis of model performance on data outside the training distribution.
  • Automated Pipeline: End-to-end pipeline for training, fine-tuning, and generating scientific reports.
  • Hyperparameter Optimization: Integrated with Optuna for efficient hyperparameter search.
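As a sketch of the idea behind the conformal prediction stage (not the repository's implementation), split conformal prediction calibrates absolute residuals on held-out data and turns them into a fixed-width prediction interval. All data and model choices below are illustrative stand-ins:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the nanocluster property data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.3, size=500)

# Split: one part fits the model, the other calibrates the interval
X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_fit, y_fit)

# Nonconformity scores: absolute residuals on the calibration split
scores = np.abs(y_cal - model.predict(X_cal))

# Finite-sample-corrected quantile for ~90% coverage
alpha = 0.1
n = len(scores)
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# Interval for a new point: prediction +/- q
x_new = rng.normal(size=(1, 4))
yhat = model.predict(x_new)[0]
interval = (yhat - q, yhat + q)
```

Mondrian conformal prediction, also listed above, applies the same recipe per group (e.g., per host element), so each group receives its own calibrated quantile.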

Project Structure

The codebase is organized as follows:

  • engine/: Core Python package containing the application logic.
    • analysis/: Modules for conformal prediction, OOD analysis, residuals, and scientific reporting.
    • config/: Configuration settings and model parameters.
    • data/: Data loading, processing, and feature engineering.
    • modeling/: Model definitions (FTTransformer, contrastive learning).
    • pipeline/: Pipeline orchestration (pipeline_launcher.py, training loops).
    • xai/: Explainability tools (SHAP, embeddings).
    • ohp_search/: Optuna hyperparameter search logic.
  • scripts/: Standalone scripts for specific tasks (plotting, advanced analysis, utilities).
  • full_model/: Contains trained model artifacts, weights, and hyperparameter configurations.
  • processed_data/: Directory for storing enriched and processed datasets.

Installation

Ensure you have a Python environment set up (Python 3.8+ recommended). Key dependencies include:

  • tensorflow
  • numpy
  • pandas
  • scikit-learn
  • optuna
  • shap
  • dask
  • matplotlib
  • seaborn
  • sqlalchemy
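The list above can be collected into a requirements.txt for one-step installation (package names only; pin versions to match your environment):

```
tensorflow
numpy
pandas
scikit-learn
optuna
shap
dask
matplotlib
seaborn
sqlalchemy
```

Then install with: pip install -r requirements.txt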

Usage

Running the Pipeline

The main entry point for training and analysis is the pipeline launcher. You can run it using the following command:

python -m engine.pipeline.pipeline_launcher --hparams_json full_model/hparams_final.json

Common Arguments:

  • --hparams_json: Path to the hyperparameters JSON file.
  • --base_model_dir: Path to an existing base model directory (to skip full training).
  • --skip: Comma-separated list of stages to skip (e.g., shap_analysis,ood_analysis).
  • --enable: Comma-separated list of optional stages to enable (e.g., post_cp_ood,embedding_analysis).
  • --run_dir: Directory to store run artifacts.

Example Commands

Run full pipeline:

python -m engine.pipeline.pipeline_launcher --hparams_json full_model/hparams_final.json

Run only fine-tuning and analysis (using existing base model):

python -m engine.pipeline.pipeline_launcher --base_model_dir full_model --skip shap_analysis

Run with advanced XAI features:

python -m engine.pipeline.pipeline_launcher --hparams_json full_model/hparams_final.json --enable embedding_analysis,consolidated_report
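The embedding_analysis stage enabled above performs KNN-based search over model representations. As an illustrative sketch (random vectors standing in for embeddings extracted from the trained model), nearest neighbours can be retrieved with scikit-learn:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Random vectors as a stand-in for learned cluster embeddings
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(100, 16))  # (n_clusters, embedding_dim)

# Cosine KNN: for each query, find the most similar clusters in embedding space
nn = NearestNeighbors(n_neighbors=5, metric="cosine").fit(embeddings)
distances, indices = nn.kneighbors(embeddings[:1])  # query with the first cluster
```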

Scripts

The scripts/ directory contains utilities for various tasks. For example, to visualize model weights:

python scripts/inspect_model_weights.py --model_dir full_model

To run pretraining:

python scripts/pretrain_contrastive_multitask.py ...

Configuration

Global settings and paths are defined in engine/config/config.py and engine/config/models.py. You can adjust data paths and model defaults there.
