A production-ready pipeline for analyzing immune cell data using transformer models, specifically Geneformer. This project provides a complete workflow from data preprocessing to model training and inference.
```
transformer_immune/
├── notebook/
│   ├── transformer_immune_cells_DATA.ipynb   # Data preparation notebook
│   └── transformer_immune_cells_MODEL.ipynb  # Model training notebook
├── transformer_immune_cells_DATA.py          # Data preparation pipeline
├── transformer_immune_cells_MODEL.py         # Model training pipeline
├── transformer_immune_cells_INFERENCE.py     # Inference pipeline
├── src/
│   ├── __init__.py
│   ├── config.py                 # Configuration settings
│   ├── data_preprocessing.py     # Main data preprocessing script
│   ├── train_model.py            # Main training script
│   ├── inference.py              # Inference script
│   └── utils/
│       ├── __init__.py
│       ├── data_processing.py    # Data processing utilities
│       ├── scvi_utils.py         # scVI model utilities
│       ├── tokenization.py       # Tokenization utilities
│       └── model_utils.py        # Model training utilities
├── data/                         # Data directory (created automatically)
├── models/                       # Saved models (created automatically)
├── logs/                         # Log files (created automatically)
├── results/                      # Results and plots (created automatically)
├── requirements.txt              # Python dependencies
├── run_pipeline.py               # Main pipeline script
├── example_usage.py              # Usage examples
├── setup.py                      # Package installation
├── README.md                     # This file
├── README_PIPELINES.md           # Detailed pipeline documentation
└── .gitignore                    # Git ignore file
```
- Modular Design: Clean separation of concerns with dedicated modules for different tasks
- Production Ready: Comprehensive logging, error handling, and configuration management
- Flexible Pipeline: Can run individual steps or the complete pipeline
- scVI Integration: Includes scVI for variational inference and latent space analysis
- Geneformer Integration: Uses the Geneformer pre-trained model for gene expression analysis
- Parallel Processing: Efficient tokenization with parallel processing
- Comprehensive Evaluation: Detailed metrics and visualization of results
- Multiple Usage Options: Both Python scripts and Jupyter notebooks available
Clone the repository:

```bash
git clone https://github.com/abuchin/transformer_immune
cd transformer_immune
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Install Geneformer (if not already installed):

```bash
git lfs install
git clone https://huggingface.co/ctheodoris/Geneformer
pip install ./Geneformer
```
The main pipeline scripts are located in the repository root:
```bash
# Data preparation
python transformer_immune_cells_DATA.py \
    --input-path data/raw_data.h5ad \
    --output-dir processed_data \
    --n-genes 2000 \
    --verbose

# Model training
python transformer_immune_cells_MODEL.py \
    --data-dir processed_data \
    --output-dir model_results \
    --num-epochs 10 \
    --batch-size 128 \
    --freeze-layers \
    --use-small-data \
    --verbose

# Inference
python transformer_immune_cells_INFERENCE.py \
    --model-dir model_results/model/final_model \
    --data-path new_data.h5ad \
    --output-dir inference_results \
    --verbose
```

For interactive development and exploration, use the Jupyter notebooks:

- notebook/transformer_immune_cells_DATA.ipynb - Data preparation notebook
- notebook/transformer_immune_cells_MODEL.ipynb - Model training notebook
You can also use the individual modular scripts in the src/ directory:
```bash
# Data preprocessing
python src/data_preprocessing.py

# Model training
python src/train_model.py

# Inference
python src/inference.py
```

Run the complete pipeline using the main orchestrator:

```bash
# Run complete pipeline
python run_pipeline.py

# Run individual steps
python run_pipeline.py --preprocess-only
python run_pipeline.py --train-only
python run_pipeline.py --inference-only
```

The project uses a centralized configuration file (src/config.py) where you can modify:
- Data paths: Paths to input data and output directories
- Model parameters: Learning rate, batch size, number of epochs
- Processing parameters: Number of highly variable genes, sequence length
- Training parameters: Layer freezing strategy, evaluation settings
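The step-selection flags of `run_pipeline.py` shown earlier could be wired up with `argparse` roughly as follows. This is a sketch: the function names are hypothetical, and the real orchestrator may be organized differently.

```python
import argparse


def build_parser():
    """CLI mirroring the run_pipeline.py flags documented above (a sketch;
    the real script may define additional options)."""
    parser = argparse.ArgumentParser(description="Immune-cell pipeline orchestrator")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--preprocess-only", action="store_true",
                       help="Run only data preprocessing")
    group.add_argument("--train-only", action="store_true",
                       help="Run only model training")
    group.add_argument("--inference-only", action="store_true",
                       help="Run only inference")
    return parser


def select_steps(args):
    """No flag means run the full pipeline; any flag narrows it to one step."""
    if args.preprocess_only:
        return ["preprocess"]
    if args.train_only:
        return ["train"]
    if args.inference_only:
        return ["inference"]
    return ["preprocess", "train", "inference"]
```

For example, `select_steps(build_parser().parse_args(["--train-only"]))` yields `["train"]`, while an empty argument list selects all three stages.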
The pipeline expects:
- Input Data: AnnData object (.h5ad file) with single-cell RNA-seq data
- Gene Information: Gene symbols that can be mapped to Ensembl IDs
- Cell Annotations: Optional cell type or condition labels in `adata.obs`
The data preparation pipeline (`transformer_immune_cells_DATA.py`):

- Loads the raw AnnData object
- Preprocesses data (normalization, log transformation)
- Maps gene symbols to Ensembl IDs using MyGene.info
- Selects highly variable genes
- Runs scVI pipeline for latent space analysis
- Tokenizes data for transformer models
- Splits data into train/test sets
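The actual pipeline performs these steps with scanpy and scVI; as a rough, dependency-free illustration, the normalization/log-transform and highly-variable-gene steps amount to the following (function names and the toy matrix are illustrative only):

```python
import math
import statistics


def normalize_and_log(counts, target_sum=1e6):
    """Scale each cell's counts to target_sum (CPM-style), then apply
    log1p -- the normalization/log-transform step described above.
    Assumes each cell has a nonzero total count."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        scaled = [c / total * target_sum for c in cell]
        normalized.append([math.log1p(x) for x in scaled])
    return normalized


def top_variable_genes(matrix, n_genes):
    """Rank genes by variance across cells and keep the top n_genes --
    a stand-in for scanpy's highly-variable-gene selection, which uses
    a more sophisticated dispersion-based criterion."""
    n = len(matrix[0])
    variances = [statistics.pvariance([cell[g] for cell in matrix]) for g in range(n)]
    ranked = sorted(range(n), key=lambda g: variances[g], reverse=True)
    return sorted(ranked[:n_genes])


# Toy cells-by-genes count matrix (3 cells, 4 genes)
cells = [[10, 0, 5, 100], [0, 2, 50, 10], [3, 1, 40, 60]]
logged = normalize_and_log(cells, target_sum=1e4)
keep = top_variable_genes(logged, n_genes=2)
```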
The model training pipeline (`transformer_immune_cells_MODEL.py`):

- Loads preprocessed and tokenized data
- Sets up Geneformer model with custom configurations
- Implements layer freezing/unfreezing strategies
- Trains the model with comprehensive evaluation
- Saves model checkpoints and final model
- Generates training visualizations and reports
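The layer freezing/unfreezing strategy can be sketched independently of any framework. Assuming BERT-style parameter names (Geneformer uses a BERT backbone), freezing everything except the last N encoder layers and the classification head looks roughly like this; the exact name prefixes are an assumption:

```python
def freeze_layers(param_names, unfreeze_last_n=1, num_layers=12):
    """Return a name -> trainable map: freeze every base-model parameter
    except the last `unfreeze_last_n` encoder layers and the classifier
    head. A sketch of the freezing strategy described above."""
    # Tags identifying the encoder layers that stay trainable
    unfrozen = {f"encoder.layer.{i}." for i in range(num_layers - unfreeze_last_n, num_layers)}
    trainable = {}
    for name in param_names:
        trainable[name] = name.startswith("classifier") or any(tag in name for tag in unfrozen)
    return trainable


# Illustrative BERT-style parameter names
names = (
    [f"bert.encoder.layer.{i}.attention.self.query.weight" for i in range(12)]
    + ["bert.embeddings.word_embeddings.weight", "classifier.weight"]
)
mask = freeze_layers(names, unfreeze_last_n=1)
```

With a real PyTorch model you would iterate `model.named_parameters()` and set each parameter's `requires_grad` from such a mask.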
The inference pipeline (`transformer_immune_cells_INFERENCE.py`):

- Loads trained models and tokenizers
- Supports both AnnData and pre-tokenized data
- Makes predictions on new data
- Saves results in multiple formats (JSON, CSV)
- Calculates accuracy if true labels are available
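The result-saving and optional accuracy computation can be sketched with the standard library alone (the file names and layout here are illustrative, not the pipeline's actual output names):

```python
import csv
import json
from pathlib import Path


def save_predictions(predictions, output_dir, true_labels=None):
    """Write predictions as JSON and CSV, and compute accuracy when true
    labels are supplied -- the inference output behaviour described above."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    results = {"predictions": predictions}
    if true_labels is not None:
        correct = sum(p == t for p, t in zip(predictions, true_labels))
        results["accuracy"] = correct / len(true_labels)

    # JSON: full results object
    (out / "predictions.json").write_text(json.dumps(results, indent=2))

    # CSV: one row per cell
    with open(out / "predictions.csv", "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["cell_index", "predicted_label"])
        writer.writerows(enumerate(predictions))

    return results


res = save_predictions([0, 2, 1, 1], "inference_results_demo", true_labels=[0, 2, 1, 0])
```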
The pipeline generates several output files:
- Preprocessed Data: `processed_data/preprocessed/adata_processed.h5ad`
- Tokenized Datasets: `processed_data/tokenized/`
- Trained Model: `model_results/model/final_model/`
- Training Logs: `model_results/logs/model_training.log`
- Results: `model_results/results/evaluation_results.json`
- Plots: `model_results/plots/training_history.png`
`src/utils/data_processing.py`:

- Functions for loading and preprocessing AnnData objects
- Gene symbol to Ensembl ID mapping
- Highly variable gene selection
- Data saving and loading utilities
`src/utils/scvi_utils.py`:

- scVI model setup and training
- Latent space extraction
- UMAP visualization
- Complete scVI pipeline
`src/utils/tokenization.py`:

- Geneformer tokenizer initialization
- Parallel tokenization of cells
- Dataset splitting and management
- Memory management utilities
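Conceptually, Geneformer tokenizes each cell as a rank-ordered list of gene tokens. A simplified sketch of that idea, parallelized with the standard library's `concurrent.futures` (the pipeline itself uses joblib, and the real tokenizer also median-normalizes gene expression first):

```python
from concurrent.futures import ThreadPoolExecutor


def tokenize_cell(expression, gene_tokens, max_len=512):
    """Rank genes by expression (highest first), drop zeros, and map each
    gene to its token id -- a simplified version of rank-value encoding."""
    ranked = sorted(
        (g for g, x in enumerate(expression) if x > 0),
        key=lambda g: expression[g],
        reverse=True,
    )
    return [gene_tokens[g] for g in ranked[:max_len]]


def tokenize_cells(matrix, gene_tokens, max_len=512, workers=4):
    """Tokenize cells in parallel, preserving cell order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda cell: tokenize_cell(cell, gene_tokens, max_len), matrix))


# Toy vocabulary (gene index -> token id) and two cells
gene_tokens = {0: 101, 1: 102, 2: 103, 3: 104}
cells = [[5.0, 0.0, 9.0, 1.0], [0.0, 2.0, 0.0, 3.0]]
tokens = tokenize_cells(cells, gene_tokens, max_len=3)
```

Here the first cell's nonzero genes sorted by expression (2, 0, 3) yield `[103, 101, 104]`.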
`src/utils/model_utils.py`:

- Model loading and setup
- Training configuration
- Layer freezing strategies
- Evaluation and visualization
- Model saving and loading
- MAX_SEQUENCE_LENGTH: Maximum token sequence length (default: 512)
- N_TOP_GENES: Number of highly variable genes (default: 2000)
- TARGET_SUM: Normalization target sum (default: 1e6)
- NUM_LABELS: Number of classification labels (default: 3)
- LEARNING_RATE: Learning rate (default: 3e-5)
- BATCH_SIZE: Batch size (default: 128)
- NUM_EPOCHS: Number of training epochs (default: 10)
- FREEZE_BASE_LAYERS: Whether to freeze base model layers (default: True)
- UNFREEZE_LAST_N_LAYERS: Number of last layers to unfreeze (default: 1)
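Collected in `src/config.py`, the documented defaults would look roughly like this (a sketch of the layout; the actual file likely defines additional paths and settings):

```python
# src/config.py -- sketch of the documented defaults

# Processing parameters
MAX_SEQUENCE_LENGTH = 512   # maximum token sequence length
N_TOP_GENES = 2000          # number of highly variable genes
TARGET_SUM = 1e6            # normalization target sum

# Model parameters
NUM_LABELS = 3              # number of classification labels
LEARNING_RATE = 3e-5
BATCH_SIZE = 128
NUM_EPOCHS = 10

# Training parameters
FREEZE_BASE_LAYERS = True   # freeze base model layers
UNFREEZE_LAST_N_LAYERS = 1  # last encoder layers left trainable
```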
The pipeline includes comprehensive logging:
- File Logs: Saved to the `logs/` directory
- Console Output: Real-time progress updates
- Error Handling: Detailed error messages and stack traces
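A minimal version of this dual file/console logging setup, using the standard `logging` module (the directory and logger names here are illustrative):

```python
import logging
from pathlib import Path


def setup_logging(log_dir="logs", name="pipeline"):
    """Log to both a file under log_dir and the console, matching the
    dual file/console behaviour described above."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    file_handler = logging.FileHandler(Path(log_dir) / f"{name}.log")
    stream_handler = logging.StreamHandler()
    for handler in (file_handler, stream_handler):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger


log = setup_logging(log_dir="logs_demo")
log.info("pipeline started")
```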
- Parallel Processing: Tokenization uses joblib for parallel execution
- Memory Management: CUDA memory clearing and garbage collection
- Small Datasets: Option to use smaller datasets for faster debugging
- GPU Support: Automatic GPU detection and utilization
- Memory Issues: Reduce batch size or use smaller datasets
- CUDA Out of Memory: Clear CUDA cache or reduce model size
- Gene Mapping Failures: Check gene symbol format and species setting
- Missing Dependencies: Ensure all packages are installed from requirements.txt
For debugging, the pipeline creates smaller datasets (25% of original size) by default. You can modify this in the configuration.
For detailed information about the pipeline scripts and their options, see:
- README_PIPELINES.md - Comprehensive pipeline documentation
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
[Add your license information here]
If you use this code in your research, please cite:
To be added...

Anatoly Buchin, 2025
[email protected]