Kazuya Nishimura, Ryoma Bise, Shinnosuke Matsuo, Haruka Hirose, Yasuhiro Kojima
This repository implements a Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Pathology Images. The method leverages single-cell RNA sequencing (scRNA-seq) data to construct cell-type-specific prototypes, which are then used to guide gene expression prediction from histopathological image patches. By incorporating biologically meaningful priors, the model improves interpretability and prediction accuracy compared to purely image-based approaches.
For spatial transcriptomics prediction, please refer to our related implementation: https://github.com/naivete5656/TRIPLEX_x_CPNN
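As a rough illustration of the prototype idea described above, cell-type prototypes can be thought of as per-cell-type mean expression profiles computed from scRNA-seq, and a slide-level estimate as a proportion-weighted sum of those prototypes. This is a toy sketch, not the repository's model: the function names, the data layout, and the weighted-sum formulation are all assumptions made for illustration.

```python
from collections import defaultdict


def build_prototypes(cells):
    """Average the expression vectors of all cells sharing a cell-type label.

    `cells` is a list of (cell_type, expression_vector) pairs; this toy
    format is an assumption, not the repository's data layout.
    """
    sums = {}
    counts = defaultdict(int)
    for cell_type, expr in cells:
        if cell_type not in sums:
            sums[cell_type] = [0.0] * len(expr)
        sums[cell_type] = [s + e for s, e in zip(sums[cell_type], expr)]
        counts[cell_type] += 1
    return {ct: [s / counts[ct] for s in vec] for ct, vec in sums.items()}


def estimate_expression(proportions, prototypes):
    """Combine prototypes using (hypothetical) predicted cell-type proportions."""
    n_genes = len(next(iter(prototypes.values())))
    estimate = [0.0] * n_genes
    for cell_type, weight in proportions.items():
        for g, value in enumerate(prototypes[cell_type]):
            estimate[g] += weight * value
    return estimate


# Toy scRNA-seq data: two cell types, three genes.
cells = [
    ("T cell", [4.0, 0.0, 2.0]),
    ("T cell", [6.0, 0.0, 2.0]),
    ("Tumor", [0.0, 8.0, 2.0]),
]
prototypes = build_prototypes(cells)
# In practice the proportions would come from an image model; here they are made up.
estimate = estimate_expression({"T cell": 0.25, "Tumor": 0.75}, prototypes)
print(estimate)
```

Tying the output to a fixed set of prototypes is what makes each predicted value traceable to a cell type, which is the kind of interpretability the description refers to.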
bash ./docker_mamba build
Processing of TCGA
Download instructions
Download the manifest files and sample sheets from the following projects: [BRCA] [KIRC] [LUAD]
- TODO: upload the sample sheets and manifest files to GitHub.
DATASET=LUAD  # or BRCA, KIRC
cd ./raw_dataset/TCGA_raw
# -p creates the parent TCGA-${DATASET} directory if it does not exist yet
mkdir -p ./TCGA-${DATASET}/WSI
mkdir -p ./TCGA-${DATASET}/RNA
./gdc-client download -m ./manifests/gdc_${DATASET}_manifest_wsi.txt -d ./TCGA-${DATASET}/WSI
./gdc-client download -m ./manifests/gdc_${DATASET}_manifest_rna.txt -d ./TCGA-${DATASET}/RNA
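Large downloads are sometimes interrupted, so it can help to check which manifest entries are still missing before moving on. The helper below is a minimal sketch and not part of this repository. It assumes the usual GDC manifest layout (tab-separated, with `id` and `filename` columns) and that `gdc-client` stores each file as `<download_dir>/<id>/<filename>`; the function name is hypothetical.

```python
import csv
from pathlib import Path


def missing_downloads(manifest_path, download_dir):
    """Return (id, filename) pairs from the manifest that are not on disk yet.

    Assumes a tab-separated manifest with `id` and `filename` columns and
    one sub-directory per file id under `download_dir`.
    """
    missing = []
    with open(manifest_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            target = Path(download_dir) / row["id"] / row["filename"]
            if not target.is_file():
                missing.append((row["id"], row["filename"]))
    return missing
```

For example, `missing_downloads("./manifests/gdc_LUAD_manifest_wsi.txt", "./TCGA-LUAD/WSI")` would list the slides that still need to be fetched.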
Folder structure
raw_dataset
├── TCGA_raw
|   ├── manifests
|   ├── samplesheets
|   ...
|   └── TCGA-LUAD
└── sc_raw
    ├── BRCA
    ...
    └── LUAD
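Before running the processing scripts, the layout above can be verified with a few lines of Python. This is a minimal sketch: the list only contains the directories named explicitly in the tree (entries hidden behind `...` are left out), and `missing_dirs` is a hypothetical helper, not a function from this repository.

```python
from pathlib import Path

# Directories named explicitly in the tree; elided entries are not listed.
EXPECTED_DIRS = [
    "TCGA_raw/manifests",
    "TCGA_raw/samplesheets",
    "TCGA_raw/TCGA-LUAD",
    "sc_raw/BRCA",
    "sc_raw/LUAD",
]


def missing_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_dir()]
```

Calling `missing_dirs("./raw_dataset")` returns an empty list when everything is in place.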
Processing instructions
python preprocessing/1-processing_TCGA.py --dataset ${DATASET} --source_dir ./raw_dataset/TCGA_raw --save_dir ./raw_dataset/processing --id_dict_path ./raw_dataset/gene_id_conv_df.csv
Feature extraction from whole slide images with CLAM
Please refer to the CLAM repository for details on the following commands.
git clone https://github.com/mahmoodlab/CLAM.git
For feature extraction with DINOv2, add the following snippet at line 78 of CLAM/models/builder.py:
elif model_name == "dinov2":
    model = timm.create_model(
        "vit_base_patch14_dinov2.lvd142m",
        pretrained=True,
        init_values=1e-5,
        dynamic_img_size=False,
    )
    data_config = timm.data.resolve_model_data_config(model)
    img_transforms = timm.data.create_transform(**data_config, is_training=False)
cd CLAM
DATASET=LUAD
python create_patches_fp.py --source ../raw_dataset/processing/${DATASET}/TCGA-digital_slide/WSI --save_dir ../raw_dataset/processing/${DATASET}/TCGA-digital_slide/proc_wsi4mil --patch_size 256 --preset tcga.csv --seg --patch --stitch
python extract_features_fp.py --data_h5_dir ../raw_dataset/processing/${DATASET}/TCGA-digital_slide/proc_wsi4mil --data_slide_dir ../raw_dataset/processing/${DATASET}/TCGA-digital_slide/WSI --csv_path ../raw_dataset/processing/${DATASET}/TCGA-digital_slide/proc_wsi4mil/process_list_autogen.csv --feat_dir ../raw_dataset/processing/${DATASET}/TCGA-digital_slide/feature --batch_size 512 --slide_ext .svs --model_name dinov2
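After patching, it may be worth checking that every slide was actually processed before extracting features. The sketch below reads the autogenerated process list and reports slides with a different status. The `slide_id` and `status` column names and the `processed` value are assumptions about CLAM's CSV output, so check them against your own file; the function name is hypothetical.

```python
import csv


def unprocessed_slides(process_list_csv):
    """List slide IDs whose status column is not 'processed'.

    Column names and the 'processed' value are assumptions about the
    process list written by CLAM; adjust them if your CSV differs.
    """
    with open(process_list_csv, newline="") as handle:
        return [
            row["slide_id"]
            for row in csv.DictReader(handle)
            if row.get("status") != "processed"
        ]
```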
python preprocessing/2-sc_processing.py --dataset ${DATASET} --save_dir ./raw_dataset/processing --sc_raw_data_dir ./raw_dataset/sc_raw
python preprocessing/3-generate_pair.py --dataset ${DATASET} --base_dir ./raw_dataset/processing --save_dir ./dataset --feat_name feature
for fold in {1..5}
do
python preprocessing/4-cell2location_proc.py --dataset ${DATASET} --save_dir dataset \
--base_dir ./raw_dataset/processing --resolution fine --fold ${fold}
done
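The same loop can be driven from Python, which is convenient when the commands need to be logged or dispatched to a job scheduler. This is a minimal sketch that only builds and prints the commands; the arguments mirror the shell loop above, while the function name is hypothetical.

```python
import shlex


def cell2location_commands(dataset, folds=range(1, 6)):
    """Build one preprocessing command (as an argument list) per fold."""
    commands = []
    for fold in folds:
        commands.append([
            "python", "preprocessing/4-cell2location_proc.py",
            "--dataset", dataset,
            "--save_dir", "dataset",
            "--base_dir", "./raw_dataset/processing",
            "--resolution", "fine",
            "--fold", str(fold),
        ])
    return commands


for args in cell2location_commands("LUAD"):
    # Dry run: print only. Use subprocess.run(args, check=True) to execute.
    print(shlex.join(args))
```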
Experiment scripts are provided in the scripts/ directory.
# Single fold
python main.py --method ProtoSum --dataset BRCA --fold 0 \
--trainer DeconvExp --version "1reg_mse_reg_1e3" --data_type ts
# All folds across all datasets
bash scripts/ours.sh
bash scripts/ours_granularity.sh
bash scripts/ours_ablation.sh

Scripts for all comparison methods are in scripts/comparisons/.
| Script | Method | Trainer |
|---|---|---|
| MIL_abmil.sh | AbMIL | — |
| MIL_max.sh | AbMIL (max pooling) | — |
| MIL_mean.sh | AbMIL (mean pooling) | — |
| MIL_ILRA.sh | ILRA | — |
| MIL_mambamil.sh | MambaMIL | — |
| MIL_srmamba.sh | SRMambaMIL | — |
| MIL_2DMambaMIL.sh | MambaMIL_2D | Mamba2DTrainer |
| MIL_s4model.sh | S4Model | — |
| MIR_HE2RNA.sh | HE2RNA | — |
| MIR_tRNAformer.sh | tRNAsformer | — |
| MIR_SEQUOIA.sh | SEQUOIA | — |
| MIR_MOSBY.sh | MOSBY (SumExpModel) | — |
| MIR_abreg.sh | AbRegMIL | — |
Example:
bash scripts/comparisons/MIR_HE2RNA.sh

python inference.py --method ProtoSum --dataset BRCA --fold 0 \
--trainer DeconvExp --version "1reg_mse_reg_1e3" --data_type ts

@inproceedings{nishimura2025protosum,
  title={Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Whole Slide Image},
  author={Nishimura, Kazuya and Bise, Ryoma and Matsuo, Shinnosuke and Hirose, Haruka and Kojima, Yasuhiro},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}

This project is licensed under the MIT License.
Our code refers to the following works:
- TRIPLEX.
- Cell2location, licensed under the Apache License, Version 2.0.
- SCVI, licensed under the BSD-3-Clause License.