DELBERT

DELBERT: Fingerprint Language Modeling for Generalizable Hit Discovery in DNA-Encoded Libraries

Paper | HF Collection | Data | Models

Published at the ICLR 2026 Workshop on Machine Learning for Genomics Explorations

[Figure: DELBERT architecture]

Overview

DELBERT is a self-supervised transformer encoder that treats molecular fingerprints (FPs) as a discrete token language, enabling masked language modeling (MLM) pretraining without requiring access to underlying chemical structures. This makes it uniquely suited for privacy-preserving settings such as the AIRCHECK initiative, where DEL screening data is released only as precomputed fingerprints. Under comprehensive library-based out-of-distribution (OOD) evaluation across four protein targets (WDR91, LRRK2, SETDB1, DCAF7), DELBERT significantly outperforms XGBoost and LightGBM ensemble baselines on three of four targets, with 1.6-2.7x improvements in early-enrichment metrics.
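
The overview does not say which early-enrichment metrics are reported. As an illustration only, the sketch below computes one common choice, the enrichment factor: the hit rate among the top-scored fraction of compounds divided by the hit rate of the whole set. The function name `enrichment_factor` and the default `top_fraction` are assumptions made for this example and are not taken from the repository.

```python
from typing import Sequence


def enrichment_factor(
    scores: Sequence[float],
    labels: Sequence[int],
    top_fraction: float = 0.01,
) -> float:
    """Hit rate in the top-scored fraction divided by the overall hit rate.

    Hypothetical helper for illustration; not part of the DELBERT package.
    """
    if len(scores) != len(labels) or len(scores) == 0:
        raise ValueError("scores and labels must be non-empty and equal in length")
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    total_hits = sum(labels)
    if total_hits == 0:
        raise ValueError("at least one active compound is required")

    # Always keep at least one compound in the top set.
    n_top = max(1, round(len(scores) * top_fraction))

    # Rank compounds from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_hits = sum(label for _, label in ranked[:n_top])

    return (top_hits / n_top) / (total_hits / len(scores))
```

A value of 1.0 means the ranking is no better than random selection; larger values mean actives are concentrated near the top of the ranked list.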

Installation

git clone https://github.com/bowang-lab/DELBERT.git
cd DELBERT

With conda:

conda create -n delbert python=3.12 -y
conda activate delbert
pip install -e .

With venv:

python -m venv .venv
source .venv/bin/activate
pip install -e .

Optional dependencies:

pip install peft                # LoRA finetuning
pip install -e ".[baselines]"   # XGBoost, LightGBM baselines
pip install -e ".[all]"         # everything

Inference

This is the fastest way to get predictions from a trained DELBERT model. See inference/inference_example.ipynb for a complete walkthrough.

Python API

from inference.predict import predict

# Each molecule is a dict with 4 dense FP arrays (length 2048)
molecule = {
    "ECFP4": [...],    # int32 array, length 2048
    "FCFP6": [...],    # int32 array, length 2048
    "ATOMPAIR": [...], # int32 array, length 2048
    "TOPTOR": [...],   # int32 array, length 2048
}

probs = predict(molecule, model_path="wanglab/delbert-wdr91")
print(f"P(active): {probs[0]:.4f}")
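
Before calling `predict`, it can help to confirm that a molecule dict has the layout described above. The following is a minimal sketch under that assumption; `validate_molecule`, `FP_KEYS`, and `FP_LENGTH` are hypothetical names introduced here, and the repository may perform its own checks.

```python
from typing import Mapping, Sequence

# Fingerprint names and length as stated in the example above.
FP_KEYS = ("ECFP4", "FCFP6", "ATOMPAIR", "TOPTOR")
FP_LENGTH = 2048


def validate_molecule(molecule: Mapping[str, Sequence[int]]) -> None:
    """Raise ValueError if a fingerprint is missing or has the wrong length.

    Hypothetical helper for illustration; not part of inference/predict.py.
    """
    missing = [key for key in FP_KEYS if key not in molecule]
    if missing:
        raise ValueError(f"missing fingerprint keys: {missing}")
    for key in FP_KEYS:
        length = len(molecule[key])
        if length != FP_LENGTH:
            raise ValueError(f"{key} has length {length}, expected {FP_LENGTH}")


# All-zero placeholder fingerprints, used only to exercise the check.
dummy = {key: [0] * FP_LENGTH for key in FP_KEYS}
validate_molecule(dummy)  # passes silently
```

The check relies only on `len`, so it works for plain lists and for array types that support it.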

Batch inference from parquet

from inference.predict import predict_from_parquet

result = predict_from_parquet(
    "path/to/molecules.parquet",
    model_path="wanglab/delbert-wdr91",
)
# Returns DataFrame with compound_id and probability columns

CLI

python inference/predict.py \
    --model wanglab/delbert-wdr91 \
    --parquet path/to/molecules.parquet \
    --output predictions.csv

The input parquet file must contain the dense FP array columns ECFP4, FCFP6, ATOMPAIR, and TOPTOR. This is the same format used by AIRCHECK parquet files. A small example file is included at data/WDR91_10-examples.parquet.
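
Once predictions are written to CSV, a typical next step is to shortlist the highest-scoring compounds. The sketch below assumes the CSV has the same `compound_id` and `probability` columns that `predict_from_parquet` is described as returning; that column layout for the CLI output is an assumption, and `rank_predictions` is a hypothetical helper.

```python
import csv
from typing import Iterable, List, Tuple


def rank_predictions(lines: Iterable[str], k: int = 10) -> List[Tuple[str, float]]:
    """Return the k compounds with the highest predicted probability.

    `lines` is any iterable of CSV text lines, such as an open file handle.
    Assumes `compound_id` and `probability` columns (not verified against the CLI).
    """
    reader = csv.DictReader(lines)
    rows = [(row["compound_id"], float(row["probability"])) for row in reader]
    # Highest probability first.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:k]


# Usage with the file written by the CLI call above:
#
#     with open("predictions.csv", newline="") as handle:
#         for compound_id, probability in rank_predictions(handle, k=20):
#             print(compound_id, f"{probability:.4f}")
```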

Data

Tokenized datasets (ready for model consumption) are hosted on HuggingFace: wanglab/delbert_data

Each dataset contains molecules represented as sparse molecular fingerprints (ECFP4, FCFP6, ATOMPAIR, TOPTOR) with binary enrichment labels from DEL screens against four protein targets: WDR91, LRRK2, SETDB1, and DCAF7.

Raw parquet files (dense FP arrays) can be downloaded from AIRCHECK.

Pretrained Models

Finetuned classification checkpoints for all four AIRCHECK targets are available on HuggingFace (full collection):

| Model | Target | HuggingFace |
| --- | --- | --- |
| DELBERT-WDR91 | WDR91 | wanglab/delbert-wdr91 |
| DELBERT-LRRK2 | LRRK2 | wanglab/delbert-lrrk2 |
| DELBERT-SETDB1 | SETDB1 | wanglab/delbert-setdb1 |
| DELBERT-DCAF7 | DCAF7 | wanglab/delbert-dcaf7 |

Training

1. Prepare pretraining data

Tokenize molecular fingerprints into token sequences and build the vocabulary:

python scripts/prepare_pretrain_data.py --config-name pretrain/example

This creates processed_data/pretrain/{experiment_id}/ with tokenized data and vocabulary.
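
To make the idea of fingerprint tokenization concrete, here is a minimal sketch of one possible scheme. It assumes that each non-zero position of a dense fingerprint becomes one token named after the fingerprint type and the bit index, and that the vocabulary starts with a few special tokens. The token format, the special tokens, and the function names are all assumptions for illustration; the scheme implemented in scripts/prepare_pretrain_data.py may differ, for example in how counts are handled.

```python
from typing import Dict, Iterable, List, Mapping, Sequence

# Assumed special tokens; the real vocabulary may use different ones.
SPECIAL_TOKENS = ["[PAD]", "[CLS]", "[MASK]"]


def tokenize_molecule(molecule: Mapping[str, Sequence[int]]) -> List[str]:
    """Turn dense fingerprints into a token sequence (illustrative scheme).

    Each non-zero entry yields a token "<FP name>_<index>"; counts are ignored.
    """
    tokens = ["[CLS]"]
    for fp_name in sorted(molecule):  # fixed fingerprint order for reproducibility
        for index, value in enumerate(molecule[fp_name]):
            if value != 0:
                tokens.append(f"{fp_name}_{index}")
    return tokens


def build_vocab(token_sequences: Iterable[Sequence[str]]) -> Dict[str, int]:
    """Map every token to an integer id, special tokens first."""
    vocab = {token: idx for idx, token in enumerate(SPECIAL_TOKENS)}
    for sequence in token_sequences:
        for token in sequence:
            # New tokens get the next free id; existing ids are kept.
            vocab.setdefault(token, len(vocab))
    return vocab


# Tiny toy fingerprints (length 4 instead of 2048) to keep the example readable.
toy = {"ECFP4": [0, 1, 0, 2], "FCFP6": [1, 0, 0, 0]}
toy_tokens = tokenize_molecule(toy)
toy_vocab = build_vocab([toy_tokens])
toy_ids = [toy_vocab[token] for token in toy_tokens]
```

Because only the set positions are emitted, a 2048-length dense vector collapses to a much shorter sequence, which is what makes a transformer encoder practical on this input.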

2. Self-supervised pretraining (MLM)

Pretrain DELBERT via masked language modeling on all molecules (labels discarded):

python scripts/pretrain.py experiment=pretrain_example
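
The masking step of MLM can be sketched in a few lines. This is a simplified illustration, not the repository's implementation: the 15% masking rate follows the original BERT convention and is an assumption here, the BERT-style random and unchanged replacements are omitted, and `mask_tokens` is a hypothetical name.

```python
import random
from typing import Collection, List, Optional, Sequence, Tuple


def mask_tokens(
    token_ids: Sequence[int],
    mask_id: int,
    mask_prob: float = 0.15,  # assumed rate, not confirmed by the source
    special_ids: Collection[int] = frozenset(),
    rng: Optional[random.Random] = None,
) -> Tuple[List[int], List[Optional[int]]]:
    """Replace a random subset of tokens with `mask_id`.

    Returns (inputs, labels). labels[i] holds the original token id at masked
    positions and None elsewhere, so only masked positions contribute to the loss.
    """
    if rng is None:
        rng = random.Random()
    inputs = list(token_ids)
    labels: List[Optional[int]] = [None] * len(inputs)
    for position, token in enumerate(token_ids):
        if token in special_ids:
            continue  # never mask special tokens such as [CLS] or [PAD]
        if rng.random() < mask_prob:
            labels[position] = token
            inputs[position] = mask_id
    return inputs, labels
```

The model is then trained to predict the original token at each masked position from the surrounding fingerprint tokens, which requires no activity labels.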

3. Prepare supervised data

Prepare labeled data with library-based OOD splits:

python scripts/prepare_supervised_data.py --config-name supervised/example
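
A library-based OOD split keeps every molecule from a held-out library out of the training set, so the test set contains only libraries the model has never seen. The sketch below shows the idea under the assumption that each molecule carries a library identifier; `split_by_library` is a hypothetical helper, and the actual logic lives in scripts/prepare_supervised_data.py.

```python
from typing import Collection, Hashable, List, Sequence, Tuple


def split_by_library(
    library_ids: Sequence[Hashable],
    held_out: Collection[Hashable],
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) with whole libraries held out."""
    train_idx: List[int] = []
    test_idx: List[int] = []
    for index, library in enumerate(library_ids):
        if library in held_out:
            test_idx.append(index)
        else:
            train_idx.append(index)
    return train_idx, test_idx


# Toy example: five molecules drawn from three libraries.
libraries = ["A", "A", "B", "C", "B"]
train_idx, test_idx = split_by_library(libraries, held_out={"B"})

# No library may appear on both sides of the split.
train_libs = {libraries[i] for i in train_idx}
test_libs = {libraries[i] for i in test_idx}
assert train_libs.isdisjoint(test_libs)
```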

4. Supervised finetuning

Finetune the pretrained model for binding prediction with LoRA:

python scripts/train_classifier.py experiment=classify_example \
    pretrained_checkpoint=outputs/pretrain_wdr91/.../checkpoints/best/epoch=XX-val_loss=X.XXX.ckpt

Cross-Validation Evaluation

For rigorous OOD evaluation using library-based K-fold cross-validation:

DELBERT transformer:

python evals/library_cv/scripts/run_transformer_cv_orchestrator.py \
    --config evals/library_cv/configs/transformer_cv.yaml

Baseline models (XGBoost, LightGBM):

python evals/library_cv/scripts/run_baseline_cv.py \
    --config evals/library_cv/configs/baseline_cv.yaml

Both scripts use the same fold assignments for fair comparison. Results are saved to evals/library_cv/results/.
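
Sharing fold assignments between the two scripts requires the library-to-fold mapping to be deterministic. The sketch below shows one way to get that: sort the unique library identifiers and assign them to folds round-robin. This is an assumption made for illustration; the repository's scripts may assign folds differently, for example by balancing library sizes. The names `assign_library_folds` and `iter_folds` are hypothetical.

```python
from typing import Dict, Iterator, List, Sequence, Tuple


def assign_library_folds(library_ids: Sequence[str], n_folds: int) -> Dict[str, int]:
    """Map each library to a fold index, deterministically.

    Sorting the unique libraries first makes the mapping independent of
    the order in which molecules appear in the data.
    """
    unique_libraries = sorted(set(library_ids))
    return {library: i % n_folds for i, library in enumerate(unique_libraries)}


def iter_folds(
    library_ids: Sequence[str], n_folds: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, val_indices) for each fold, split by library."""
    fold_of = assign_library_folds(library_ids, n_folds)
    for fold in range(n_folds):
        val_idx = [i for i, lib in enumerate(library_ids) if fold_of[lib] == fold]
        train_idx = [i for i, lib in enumerate(library_ids) if fold_of[lib] != fold]
        yield train_idx, val_idx


# Toy example: five molecules from four libraries, two folds.
toy_libraries = ["L1", "L2", "L3", "L1", "L4"]
toy_folds = list(iter_folds(toy_libraries, n_folds=2))
```

Any model evaluated with the same library identifiers and the same `n_folds` sees identical splits, which is what makes the transformer and baseline results directly comparable.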

Project Structure

DELBERT/
├── delbert/                  # Core package
│   ├── data/                 # Data loading, tokenization, splits
│   └── models/               # Model architecture, training modules
├── inference/                # Inference scripts and examples
│   ├── predict.py            # Prediction API (dict, parquet, CLI)
│   └── inference_example.ipynb
├── scripts/                  # Data prep and training entry points
├── evals/                    # Evaluation pipelines
│   └── library_cv/           # Library-based K-fold cross-validation
├── configs/                  # Hydra configuration files
├── data/                     # Example data files
├── assets/                   # Figures
├── pyproject.toml
└── requirements.txt

Citation

@inproceedings{seyedahmadi2026delbert,
    title={{DELBERT}: Fingerprint Language Modeling for Generalizable Hit Discovery in {DNA}-Encoded Libraries},
    author={Arman Seyed-Ahmadi and Bing Hu and Armin Geraili and Anita Layton and Helen Chen and Shana O. Kelley and Bo Wang},
    booktitle={ICLR 2026 Workshop on Machine Learning for Genomics Explorations},
    year={2026},
    url={https://openreview.net/forum?id=wOOD4W3wJk}
}

License

This project is licensed under the Apache License 2.0. See LICENSE for details.
