DELBERT: Fingerprint Language Modeling for Generalizable Hit Discovery in DNA-Encoded Libraries
Paper | HF Collection | Data | Models
Published at the ICLR 2026 Workshop on Machine Learning for Genomics Explorations
DELBERT is a self-supervised transformer encoder that treats molecular fingerprints (FPs) as a discrete token language, enabling masked language modeling (MLM) pretraining without requiring access to underlying chemical structures. This makes it uniquely suited for privacy-preserving settings such as the AIRCHECK initiative, where DEL screening data is released only as precomputed fingerprints. Under comprehensive library-based out-of-distribution (OOD) evaluation across four protein targets (WDR91, LRRK2, SETDB1, DCAF7), DELBERT significantly outperforms XGBoost and LightGBM ensemble baselines on three of four targets, with 1.6-2.7x improvements in early-enrichment metrics.
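The core trick above — treating fingerprint bits as tokens and pretraining with MLM — reduces to standard BERT-style masking. Below is a minimal sketch of that masking step; the 15% rate, the mask-token ID, and the `-100` ignore label are common BERT conventions assumed here for illustration, not details taken from DELBERT's implementation:

```python
import random

def mask_tokens(token_ids, mask_id, mask_prob=0.15, seed=0):
    """Hide ~mask_prob of the tokens; the model must predict the originals."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tid in token_ids:
        if rng.random() < mask_prob:
            inputs.append(mask_id)   # masked position: predict the original token
            targets.append(tid)
        else:
            inputs.append(tid)
            targets.append(-100)     # unmasked position: ignored by the loss
    return inputs, targets

ids = list(range(50))
inp, tgt = mask_tokens(ids, mask_id=9999)
```

Because no chemical structure is needed to mask and reconstruct token IDs, this objective works directly on the fingerprint-only data AIRCHECK releases.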
```bash
git clone https://github.com/bowang-lab/DELBERT.git
cd DELBERT
```

With conda:

```bash
conda create -n delbert python=3.12 -y
conda activate delbert
pip install -e .
```

With venv:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -e .
```

Optional dependencies:

```bash
pip install peft               # LoRA finetuning
pip install -e ".[baselines]"  # XGBoost, LightGBM baselines
pip install -e ".[all]"        # everything
```

The fastest way to get predictions from a trained DELBERT model is the `predict` API. See `inference/inference_example.ipynb` for a complete walkthrough.
```python
from inference.predict import predict

# Each molecule is a dict with 4 dense FP arrays (length 2048)
molecule = {
    "ECFP4": [...],     # int32 array, length 2048
    "FCFP6": [...],     # int32 array, length 2048
    "ATOMPAIR": [...],  # int32 array, length 2048
    "TOPTOR": [...],    # int32 array, length 2048
}
probs = predict(molecule, model_path="wanglab/delbert-wdr91")
print(f"P(active): {probs[0]:.4f}")
```

From a parquet file:

```python
from inference.predict import predict_from_parquet

result = predict_from_parquet(
    "path/to/molecules.parquet",
    model_path="wanglab/delbert-wdr91",
)
# Returns a DataFrame with compound_id and probability columns
```

From the command line:

```bash
python inference/predict.py \
    --model wanglab/delbert-wdr91 \
    --parquet path/to/molecules.parquet \
    --output predictions.csv
```

The input parquet must contain dense FP array columns: `ECFP4`, `FCFP6`, `ATOMPAIR`, `TOPTOR`. These are the same format as AIRCHECK parquet files. A small example file is included at `data/WDR91_10-examples.parquet`.
Tokenized datasets (ready for model consumption) are hosted on HuggingFace: wanglab/delbert_data
Each dataset contains molecules represented as sparse molecular fingerprints (ECFP4, FCFP6, ATOMPAIR, TOPTOR) with binary enrichment labels from DEL screens against four protein targets: WDR91, LRRK2, SETDB1, and DCAF7.
Raw parquet files (dense FP arrays) can be downloaded from AIRCHECK.
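Dense versus sparse here is just a change of representation: the sparse form stores the indices of the non-zero fingerprint positions. A minimal sketch of the conversion (assuming "on" means a non-zero count):

```python
import numpy as np

def dense_to_sparse(fp):
    """Indices of the non-zero positions of a dense fingerprint array."""
    return np.flatnonzero(fp)

dense = np.zeros(2048, dtype=np.int32)
dense[[7, 42, 1999]] = 1
sparse = dense_to_sparse(dense)
print(sparse.tolist())  # [7, 42, 1999]
```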
Finetuned classification checkpoints for all four AIRCHECK targets are available on HuggingFace (full collection):
| Model | Target | HuggingFace |
|---|---|---|
| DELBERT-WDR91 | WDR91 | wanglab/delbert-wdr91 |
| DELBERT-LRRK2 | LRRK2 | wanglab/delbert-lrrk2 |
| DELBERT-SETDB1 | SETDB1 | wanglab/delbert-setdb1 |
| DELBERT-DCAF7 | DCAF7 | wanglab/delbert-dcaf7 |
Tokenize molecular fingerprints into token sequences and build the vocabulary:

```bash
python scripts/prepare_pretrain_data.py --config-name pretrain/example
```

This creates `processed_data/pretrain/{experiment_id}/` with tokenized data and the vocabulary.
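The exact token scheme is defined by the script above. Purely as an illustration, one way to turn sparse fingerprints into a token sequence is to prefix each on-bit index with its FP type — an assumption for this sketch, not DELBERT's actual vocabulary:

```python
import numpy as np

def tokenize_molecule(fps):
    """Map each fingerprint's on bits to string tokens like 'ECFP4_42'."""
    tokens = []
    for fp_name, arr in fps.items():
        tokens.extend(f"{fp_name}_{i}" for i in np.flatnonzero(arr))
    return tokens

ecfp4 = np.zeros(2048, dtype=np.int32); ecfp4[7] = 1
fcfp6 = np.zeros(2048, dtype=np.int32); fcfp6[42] = 2
tokens = tokenize_molecule({"ECFP4": ecfp4, "FCFP6": fcfp6})

# Vocabulary = every distinct token seen across the corpus.
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
print(tokens)  # ['ECFP4_7', 'FCFP6_42']
```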
Pretrain DELBERT via masked language modeling on all molecules (labels discarded):

```bash
python scripts/pretrain.py experiment=pretrain_example
```

Prepare labeled data with library-based OOD splits:
```bash
python scripts/prepare_supervised_data.py --config-name supervised/example
```

Finetune the pretrained model for binding prediction with LoRA:
```bash
python scripts/train_classifier.py experiment=classify_example \
    pretrained_checkpoint=outputs/pretrain_wdr91/.../checkpoints/best/epoch=XX-val_loss=X.XXX.ckpt
```

For rigorous OOD evaluation using library-based K-fold cross-validation:
DELBERT transformer:

```bash
python evals/library_cv/scripts/run_transformer_cv_orchestrator.py \
    --config evals/library_cv/configs/transformer_cv.yaml
```

Baseline models (XGBoost, LightGBM):

```bash
python evals/library_cv/scripts/run_baseline_cv.py \
    --config evals/library_cv/configs/baseline_cv.yaml
```

Both scripts use the same fold assignments for fair comparison. Results are saved to `evals/library_cv/results/`.
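Library-based folds group molecules by the DEL library they came from, so every validation library is unseen at training time. The idea can be sketched with scikit-learn's `GroupKFold` (toy library IDs here, not the repository's real fold-assignment code):

```python
from sklearn.model_selection import GroupKFold

# Toy setup: each molecule carries the ID of the DEL library it came from.
library_ids = ["libA", "libA", "libB", "libB", "libC", "libC"]
X = [[i] for i in range(len(library_ids))]

gkf = GroupKFold(n_splits=3)
folds = list(gkf.split(X, groups=library_ids))
for train_idx, val_idx in folds:
    train_libs = {library_ids[i] for i in train_idx}
    val_libs = {library_ids[i] for i in val_idx}
    # Whole libraries are held out: validation libraries never appear in training.
    assert train_libs.isdisjoint(val_libs)
```

Because DEL libraries share chemical scaffolds internally, holding out whole libraries gives a much harder (and more realistic) generalization test than a random split.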
```text
DELBERT/
├── delbert/                  # Core package
│   ├── data/                 # Data loading, tokenization, splits
│   └── models/               # Model architecture, training modules
├── inference/                # Inference scripts and examples
│   ├── predict.py            # Prediction API (dict, parquet, CLI)
│   └── inference_example.ipynb
├── scripts/                  # Data prep and training entry points
├── evals/                    # Evaluation pipelines
│   └── library_cv/           # Library-based K-fold cross-validation
├── configs/                  # Hydra configuration files
├── data/                     # Example data files
├── assets/                   # Figures
├── pyproject.toml
└── requirements.txt
```
```bibtex
@inproceedings{seyedahmadi2026delbert,
  title={{DELBERT}: Fingerprint Language Modeling for Generalizable Hit Discovery in {DNA}-Encoded Libraries},
  author={Arman Seyed-Ahmadi and Bing Hu and Armin Geraili and Anita Layton and Helen Chen and Shana O. Kelley and Bo Wang},
  booktitle={ICLR 2026 Workshop on Machine Learning for Genomics Explorations},
  year={2026},
  url={https://openreview.net/forum?id=wOOD4W3wJk}
}
```

This project is licensed under the Apache License 2.0. See `LICENSE` for details.
