Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Pathology Images

Kazuya Nishimura, Ryoma Bise, Shinnosuke Matsuo, Haruka Hirose, Yasuhiro Kojima

Overview

This repository implements a Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Pathology Images. The method leverages single-cell RNA sequencing (scRNA-seq) data to construct cell-type-specific prototypes, which are then used to guide gene expression prediction from histopathological image patches. By incorporating biologically meaningful priors, the model improves interpretability and prediction accuracy compared to purely image-based approaches.
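
To make the idea of a cell-type prototype concrete, here is a minimal sketch. It is not the implementation in this repository. It assumes that a prototype is the mean scRNA-seq expression vector of one cell type, and that an expression estimate can be formed as an abundance-weighted sum of prototypes. The function names, the toy data, and the fixed abundance values are hypothetical; a model could instead predict such weights from image features.

```python
from collections import defaultdict


def build_prototypes(cells):
    """Average expression vectors per cell type.

    cells: iterable of (cell_type, expression_vector) pairs (toy scRNA-seq data).
    Returns {cell_type: mean expression vector}.
    """
    sums = {}
    counts = defaultdict(int)
    for cell_type, expr in cells:
        if cell_type not in sums:
            sums[cell_type] = [0.0] * len(expr)
        sums[cell_type] = [s + x for s, x in zip(sums[cell_type], expr)]
        counts[cell_type] += 1
    return {ct: [s / counts[ct] for s in total] for ct, total in sums.items()}


def weighted_expression(prototypes, abundances):
    """Combine prototypes into one expression vector, weighted by abundance."""
    n_genes = len(next(iter(prototypes.values())))
    out = [0.0] * n_genes
    for cell_type, weight in abundances.items():
        for g, value in enumerate(prototypes[cell_type]):
            out[g] += weight * value
    return out


# Toy data: two cell types, three genes (values are made up for illustration).
toy_cells = [
    ("tumor", [4.0, 0.0, 2.0]),
    ("tumor", [6.0, 0.0, 2.0]),
    ("t_cell", [0.0, 8.0, 2.0]),
]
prototypes = build_prototypes(toy_cells)
estimate = weighted_expression(prototypes, {"tumor": 0.5, "t_cell": 0.5})
print(prototypes)
print(estimate)
```

Because every predicted value is a combination of named cell-type vectors, the contribution of each cell type to a gene can be read off directly, which is one way such a prior can aid interpretability.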

For spatial transcriptomics prediction, please refer to our related implementation: https://github.com/naivete5656/TRIPLEX_x_CPNN

Environment

bash ./docker_mamba build

Dataset generation

Processing of TCGA

Download the TCGA and single-cell datasets

Download instructions

Download TCGA datasets

We used the GDC Data Transfer Tool (gdc-client).

Download the manifest files and sample sheets for the following projects: [BRCA] [KIRC] [LUAD]

upload sample_sheet, manifestfile to github

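cd ./raw_dataset/TCGA_raw
# Set DATASET to one of BRCA, KIRC, LUAD before running.
# -p creates the parent TCGA-${DATASET} directory if it does not exist yet.
mkdir -p ./TCGA-${DATASET}/WSI
mkdir -p ./TCGA-${DATASET}/RNA
./gdc-client download -m ./manifests/gdc_${DATASET}_manifest_wsi.txt -d ./TCGA-${DATASET}/WSI
./gdc-client download -m ./manifests/gdc_${DATASET}_manifest_rna.txt -d ./TCGA-${DATASET}/RNA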
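
To download all three projects in one go, the commands above can be generated from Python. The following is a minimal sketch that only builds and prints the commands; it uses the same `-m` and `-d` options and file naming as the shell lines above. The helper name `gdc_download_commands` is hypothetical.

```python
from pathlib import Path


def gdc_download_commands(dataset, root="."):
    """Return (output directories, gdc-client commands) for one TCGA project.

    Mirrors the shell commands above; nothing is executed here.
    """
    root = Path(root)
    out_dirs = []
    commands = []
    for kind in ("wsi", "rna"):
        manifest = root / "manifests" / f"gdc_{dataset}_manifest_{kind}.txt"
        out_dir = root / f"TCGA-{dataset}" / kind.upper()
        out_dirs.append(out_dir)
        commands.append(
            ["./gdc-client", "download", "-m", str(manifest), "-d", str(out_dir)]
        )
    return out_dirs, commands


for name in ("BRCA", "KIRC", "LUAD"):
    dirs, cmds = gdc_download_commands(name)
    for cmd in cmds:
        print(" ".join(cmd))
```

To actually run a download, create each output directory with `mkdir(parents=True, exist_ok=True)` and pass each command to `subprocess.run(cmd, check=True)` from inside `./raw_dataset/TCGA_raw`.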

Download single-cell datasets

Breast cancer dataset (BRCA) [URL]
Renal cell carcinoma dataset (KIRC) [URL]
Lung cancer dataset (LUAD) [URL]

Folder structure

raw_dataset
├── TCGA_raw
│   ├── manifests
│   ├── samplesheets
│   ...
│   └── TCGA-LUAD
└── sc_raw
    ├── BRCA
    ...
    └── LUAD
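
Before starting the processing steps, it can help to verify that the layout above is in place. The sketch below lists the expected directories that are missing. It assumes that the entries elided by `...` in the tree follow the same naming pattern for BRCA, KIRC, and LUAD; the helper name `missing_dirs` is hypothetical, and the demonstration runs on a temporary directory rather than on the real `./raw_dataset`.

```python
from pathlib import Path
import tempfile


def missing_dirs(root, datasets=("BRCA", "KIRC", "LUAD")):
    """List expected directories that do not exist under `root`."""
    root = Path(root)
    expected = [root / "TCGA_raw" / "manifests", root / "TCGA_raw" / "samplesheets"]
    for name in datasets:
        # Assumed naming pattern: TCGA_raw/TCGA-<NAME> and sc_raw/<NAME>.
        expected.append(root / "TCGA_raw" / f"TCGA-{name}")
        expected.append(root / "sc_raw" / name)
    return [p for p in expected if not p.is_dir()]


# Demonstration on a temporary directory with only two of the folders present.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp) / "raw_dataset"
    (demo_root / "TCGA_raw" / "manifests").mkdir(parents=True)
    (demo_root / "sc_raw" / "LUAD").mkdir(parents=True)
    missing = [str(p.relative_to(demo_root)) for p in missing_dirs(demo_root)]
    print(missing)
```

On a real checkout, call `missing_dirs("./raw_dataset")` and expect an empty list once all downloads are complete.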

Step 1: Processing whole slide images

Processing instructions

Match whole slide images with gene expression data

python preprocessing/1-processing_TCGA.py --dataset ${DATASET} --source_dir ./raw_dataset/TCGA_raw --save_dir ./raw_dataset/processing --id_dict_path ./raw_dataset/gene_id_conv_df.csv
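
The internals of the matching script are not described here. As an illustration of what matching slides to expression files can involve, the sketch below pairs rows from two tab-separated sample sheets that share a case ID. The column names `Case ID` and `File Name`, the function name, and the toy rows are assumptions made for this example; check them against the header of your own sample sheets.

```python
import csv
import io


def match_by_case(wsi_sheet, rna_sheet):
    """Pair slide files with RNA files that share a case ID.

    Both arguments are file-like objects holding tab-separated sample sheets.
    Column names ("Case ID", "File Name") are assumptions for this sketch.
    """
    rna_by_case = {}
    for row in csv.DictReader(rna_sheet, delimiter="\t"):
        rna_by_case.setdefault(row["Case ID"], []).append(row["File Name"])

    pairs = []
    for row in csv.DictReader(wsi_sheet, delimiter="\t"):
        for rna_file in rna_by_case.get(row["Case ID"], []):
            pairs.append((row["Case ID"], row["File Name"], rna_file))
    return pairs


# Made-up rows for illustration only.
toy_wsi = io.StringIO(
    "File Name\tCase ID\n"
    "slide_a.svs\tCASE-1\n"
    "slide_b.svs\tCASE-2\n"
)
toy_rna = io.StringIO(
    "File Name\tCase ID\n"
    "counts_a.tsv\tCASE-1\n"
    "counts_c.tsv\tCASE-3\n"
)
matched = match_by_case(toy_wsi, toy_rna)
print(matched)
```

Cases that appear in only one sheet (CASE-2 and CASE-3 in the toy data) are dropped, so only slides with an expression profile remain.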

Feature extraction from whole slide images with CLAM

Please refer to the CLAM repository for details on the following commands.

git clone https://github.com/mahmoodlab/CLAM.git

For feature extraction with DINOv2, add the following code at line 78 of CLAM/models/builder.py:

elif model_name == "dinov2":
    model = timm.create_model(
        "vit_base_patch14_dinov2.lvd142m",
        pretrained=True,
        init_values=1e-5,
        dynamic_img_size=False,
    )
    data_config = timm.data.resolve_model_data_config(model)
    img_transforms = timm.data.create_transform(**data_config, is_training=False)

cd CLAM

DATASET=LUAD
python create_patches_fp.py --source ../raw_dataset/processing/${DATASET}/TCGA-digital_slide/WSI --save_dir ../raw_dataset/processing/${DATASET}/TCGA-digital_slide/proc_wsi4mil --patch_size 256 --preset tcga.csv --seg --patch --stitch

python extract_features_fp.py --data_h5_dir ../raw_dataset/processing/${DATASET}/TCGA-digital_slide/proc_wsi4mil --data_slide_dir ../raw_dataset/processing/${DATASET}/TCGA-digital_slide/WSI --csv_path ../raw_dataset/processing/${DATASET}/TCGA-digital_slide/proc_wsi4mil/process_list_autogen.csv --feat_dir ../raw_dataset/processing/${DATASET}/TCGA-digital_slide/feature --batch_size 512 --slide_ext .svs --model_name dinov2

Step 2: Processing single-cell dataset

python preprocessing/2-sc_processing.py --dataset ${DATASET} --save_dir ./raw_dataset/processing --sc_raw_data_dir ./raw_dataset/sc_raw

Step 3: Save features and gene expression in h5 format

python preprocessing/3-generate_pair.py --dataset ${DATASET} --base_dir ./raw_dataset/processing --save_dir ./dataset --feat_name feature

Step 4: Run single-cell deconvolution method

for fold in {1..5}
do 
    python preprocessing/4-cell2location_proc.py --dataset ${DATASET}  --save_dir dataset \
    --base_dir ./raw_dataset/processing --resolution fine --fold ${fold}
done

Training

Experiment scripts are provided in the scripts/ directory.

Training code

# Single fold
python main.py --method ProtoSum --dataset BRCA --fold 0 \
    --trainer DeconvExp --version "1reg_mse_reg_1e3" --data_type ts

# All folds across all datasets
bash scripts/ours.sh

Ablation study

bash scripts/ours_granularity.sh
bash scripts/ours_ablation.sh

Comparison methods

Scripts for all comparison methods are in scripts/comparisons/.

Script	Method	Trainer
`MIL_abmil.sh`	AbMIL	—
`MIL_max.sh`	AbMIL (max pooling)	—
`MIL_mean.sh`	AbMIL (mean pooling)	—
`MIL_ILRA.sh`	ILRA	—
`MIL_mambamil.sh`	MambaMIL	—
`MIL_srmamba.sh`	SRMambaMIL	—
`MIL_2DMambaMIL.sh`	MambaMIL_2D	Mamba2DTrainer
`MIL_s4model.sh`	S4Model	—
`MIR_HE2RNA.sh`	HE2RNA	—
`MIR_tRNAformer.sh`	tRNAsformer	—
`MIR_SEQUOIA.sh`	SEQUOIA	—
`MIR_MOSBY.sh`	MOSBY (SumExpModel)	—
`MIR_abreg.sh`	AbRegMIL	—

Example:

bash scripts/comparisons/MIR_HE2RNA.sh

Inference

python inference.py --method ProtoSum --dataset BRCA --fold 0 \
    --trainer DeconvExp --version "1reg_mse_reg_1e3" --data_type ts
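
The inference command above takes the same flags as the training command. A small helper can keep the two in sync so that inference always targets the configuration that was trained. This is a sketch, assuming both scripts accept identical arguments as shown in this README; the helper name `build_command` is hypothetical, and the commands are only printed, not executed.

```python
import shlex


def build_command(script, method="ProtoSum", dataset="BRCA", fold=0,
                  trainer="DeconvExp", version="1reg_mse_reg_1e3", data_type="ts"):
    """Build an argument list for main.py or inference.py from shared settings."""
    return [
        "python", script,
        "--method", method,
        "--dataset", dataset,
        "--fold", str(fold),
        "--trainer", trainer,
        "--version", version,
        "--data_type", data_type,
    ]


# Shared settings are defined once and reused for both scripts.
settings = {"dataset": "BRCA", "fold": 0}
train_cmd = build_command("main.py", **settings)
infer_cmd = build_command("inference.py", **settings)
print(shlex.join(train_cmd))
print(shlex.join(infer_cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the repository root to run the step.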

Citation

@inproceedings{nishimura2025protosum,
  title={Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Whole Slide Image},
  author={Nishimura, Kazuya and Bise, Ryoma and Matsuo, Shinnosuke and Hirose, Haruka and Kojima, Yasuhiro},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}

License

This project is licensed under the MIT License.

Acknowledgements

Our code builds on the following works:

TRIPLEX.
Cell2location, licensed under the Apache License Version 2.0.
SCVI, licensed under the BSD-3-Clause License.
