BiomarkerML

Open   publish   GitHub release (with filter)   DOI  

Citations

If you use BiomarkerML for your analysis, please cite the following Zenodo record:

Zhou, Y., Maurya, A.K., Deng, Y., Fletcher, M.P., Ren, C., & Taylor, A. (2025). A cloud-based proteomics ML workflow for biomarker discovery. Zenodo. https://doi.org/10.5281/zenodo.17367501

Warning

Do not use the outdated Dockstore DOI: https://doi.org/10.5281/zenodo.15529775

This DOI was automatically generated by Dockstore and points to an incorrect record. Please use the official citation above instead.

Tip

To import the workflow into your Terra workspace, click on the above Dockstore badge, and select 'Terra' from the 'Launch with' widget on the Dockstore workflow page.

Background: High-throughput affinity and mass-spectrometry-based proteomic studies of large clinical cohorts generate high-dimensional proteomic data useful for accelerated disease biomarker discovery. A powerful approach to realizing the potential of these big, complex, and non-linear data, whilst ensuring reproducible results, is to use automated machine learning (ML) and deep learning (DL) pipelines for their analysis. However, there remains a gap in comprehensive ML workflows tailored to proteomic biomarker discovery and designed for biomedical researchers who need pipelines to optimally self-configure and automatically avoid over-fitting.

Findings: We present BiomarkerML, a cloud-based workflow for automated, reproducible, and efficient ML/DL analysis of proteomic data for biomarker discovery, designed for novice-ML users and implemented in Python, R and Workflow Description Language (WDL). BiomarkerML: ingests proteomic and clinical data alongside sample labels; pre-processes data for model fitting and optionally performs dimensionality reduction and visualization; fits a catalogue of ML and DL classification and regression models; and calculates model performance metrics for model comparison. Next, the workflow applies mean SHapley Additive exPlanations (SHAP) to quantify the contribution of each protein to model predictions across all samples. Finally, proteins with high mean SHAP values, and their co-expressed protein network interactors, are identified as candidate biomarkers. Importantly, hyperparameters - configuration variables set prior to training models - are automatically fine-tuned via grid-search, and BiomarkerML employs weighted, nested cross-validation to avoid model over-fitting and data leakage.

Conclusions: BiomarkerML is scalable, provides a standardized, user-friendly interface, and streamlines analyses to ensure reproducibility of results. Overall, BiomarkerML is a significant advancement, enabling novice-ML researchers to use cutting-edge ML/DL tools to identify disease biomarkers in complex proteomic data.

Keywords: machine learning, cloud-based workflow, classification and regression, proteomic biomarker identification
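The over-fitting safeguards described above — grid-searched hyperparameters inside weighted, nested cross-validation — can be sketched with scikit-learn. This is an illustrative sketch, not the workflow's actual configuration: the estimator, parameter grid, and fold counts are all placeholder choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Toy data standing in for a proteomic matrix (samples x proteins).
X, y = make_classification(n_samples=120, n_features=20, random_state=0)

# Inner loop: grid-search fine-tunes hyperparameters; class_weight="balanced"
# weights samples inversely to class frequency.
inner = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=StratifiedKFold(n_splits=3),
)

# Outer loop: performance is estimated on folds never seen during tuning,
# guarding against over-fitting and data leakage.
scores = cross_val_score(inner, X, y, cv=StratifiedKFold(n_splits=5))
```

The key point is that hyperparameter selection happens entirely inside each outer fold, so the outer score never touches data used for tuning.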


Workflow Steps

  • Preprocessing : By default, Z-score standardisation is applied to the input data. Optionally, users can apply dimensionality reduction to the dataset; scatter plots are then displayed for every pair of dimensions, based on the selected number of output dimensions. The available methods include:

    • PCA (Principal Component Analysis for linear data)
    • ELASTICNET (ElasticNet Regularization)
    • UMAP (Uniform Manifold Approximation and Projection)
    • TSNE (t-Distributed Stochastic Neighbor Embedding)
    • KPCA (Kernel Principal Component Analysis for non-linear data)
    • PLS (Partial Least Squares Regression)
  • Classification : This step applies the machine learning models to the standardized data and generates a confusion matrix, ROC plots (per class and averaged), and other relevant evaluation metrics (accuracy, F1, sensitivity, specificity) for all the models. The available algorithms are as follows:

    • RF (Random Forest)
    • KNN (K-Nearest Neighbors)
    • NN (Neural Network)
    • SVM (Support Vector Machine)
    • XGB (XGBoost)
    • PLSDA (Partial Least Squares Discriminant Analysis)
    • VAE (Variational Autoencoder with Multilayer Perceptron)
    • LR (Logistic Regression)
    • GNB (Gaussian Naive Bayes)
    • LGBM (LightGBM)
    • MLPVAE (Multilayer Perceptron inside Variational Autoencoder)
  • Regression : This step applies the machine learning models to the standardized data and generates regression-specific evaluation metrics for all the models. The available algorithms are as follows:

    • RF_REG (Random Forest Regression)
    • NN_REG (Neural Network Regression)
    • SVM_REG (Support Vector Regression)
    • XGB_REG (XGBoost Regression)
    • PLS_REG (Partial Least Squares Regression)
    • KNN_REG (K-Nearest Neighbors Regression)
    • LGBM_REG (LightGBM Regression)
    • VAE_REG (Variational Autoencoder with Multilayer Perceptron)
    • MLPVAE_REG (Multilayer Perceptron inside Variational Autoencoder)
  • SHAP analysis : (Optional) This step calculates SHapley Additive exPlanations (SHAP) values for variable importance (CSV file and radar plot for top features) and plots ROC curves for all the models specified by the user.

  • Protein–Protein Interaction analysis : (Optional) Biological functional analysis via protein–protein interaction network diagrams for the top-ranked biomarkers, with first-degree network expansions that combine protein co-expression patterns to highlight functional connectivity.

  • Report generation : This step aggregates all output plots from the previous steps and compiles them into a .pdf report.
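Assuming scikit-learn, the first two steps above (Z-score standardisation, optional dimensionality reduction, then fitting a classifier and computing evaluation metrics) can be sketched as follows. All parameter values are illustrative, not the workflow's defaults:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for a proteomic matrix (samples x proteins).
X, y = make_classification(n_samples=200, n_features=30, random_state=0)

# Preprocessing: Z-score standardisation (the workflow's default).
X = StandardScaler().fit_transform(X)

# Optional dimensionality reduction, e.g. PCA down to 3 components.
X_red = PCA(n_components=3, random_state=0).fit_transform(X)

# Classification: fit a model (RF here) and compute evaluation metrics.
X_tr, X_te, y_tr, y_te = train_test_split(X_red, y, stratify=y, random_state=0)
pred = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict(X_te)

cm = confusion_matrix(y_te, pred)  # 2x2 for a binary problem
acc = accuracy_score(y_te, pred)
f1 = f1_score(y_te, pred)
```

In the actual workflow these steps are orchestrated by WDL tasks rather than a single script, and model selection uses nested cross-validation rather than a single train/test split.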

Installation (local)

Important

This workflow is primarily designed for cloud-based platforms (e.g., Terra.bio, DNANexus, Verily) that support WDL workflows.

However, you can also run it locally using the Cromwell workflow management system.

This workflow has also been tested locally on Ubuntu 22.04 with Docker v28.3.1 and Cromwell v40, running on a 12th Gen Intel Core i7-1270P with 32 GB RAM.

ARM64 (linux/arm64) architecture is currently not supported.

Requirements

  • Docker

  • Mamba package manager

    • Please refer to the mamba or micromamba official installation guide.
    • We prefer mamba over conda because it is faster and uses libsolv to resolve dependencies efficiently.

Quick Start

1. Clone the repository

git clone https://github.com/anand-imcm/proteomics-ML-workflow.git && cd proteomics-ML-workflow

2. Create and activate a new environment

Ensure that either mamba or conda is installed before proceeding.

Using mamba (recommended):

mamba create --name biomarkerml bioconda::cromwell
mamba activate biomarkerml

Or, using conda:

conda create --name biomarkerml -c bioconda cromwell
conda activate biomarkerml

3. Configure the inputs (Binary classification dataset)

Open Case_Dataset/input_Binary_Classification_Dataset.json and update the main.input_csv field to the absolute path of the CSV file on your machine:

"main.input_csv": "/absolute/path/to/proteomics-ML-workflow/Case_Dataset/Binary_Classification_Dataset.csv"

4. Run the workflow locally (Binary classification dataset)

cromwell run workflows/main.wdl -i Case_Dataset/input_Binary_Classification_Dataset.json

5. Use your own data

To run the workflow on your own dataset, use the example/inputs.json file as a starting template. See the Workflow Inputs section for a full description of all the available parameters.

Make a copy of example/inputs.json and edit it to specify your own input data file and adjust any other parameters as needed:

cp example/inputs.json user_data_inputs.json

At minimum, update these two fields in user_data_inputs.json:

"main.input_csv": "/absolute/path/to/your/data.csv",
"main.output_prefix": "your_analysis_name"

Run the workflow on your own data:

cromwell run workflows/main.wdl -i user_data_inputs.json

Workflow Inputs

  • main.input_csv : [File] Input file in .csv format, includes a Label column, with each row representing a sample and each column representing a feature. An example of the .csv is shown below:

    SampleID Label Protein1 Protein2 ... ProteinN
    ID1 Label1 0.1 0.4 ... 0.01
    ID2 Label2 0.2 0.1 ... 0.3
  • main.output_prefix : [String] Analysis ID. This will be used as the prefix for all the output files.

  • main.mode : [String] Specify the mode of the analysis. Options include Classification, Regression, and Summary. Default value: Summary.

  • main.dimensionality_reduction_choices : [String] Specify the dimensionality reduction method name(s) to use. Options include PCA, ELASTICNET, UMAP, TSNE, KPCA, PLS and NONE. Multiple methods can be entered together, separated by a space. Default value: PCA

Warning

It is recommended to select only one dimensionality reduction method when using it alongside classification or regression models.

If multiple dimensionality reduction methods are specified, the workflow will only perform the dimensionality reduction and generate a report.

  • main.num_of_dimensions: [Int] Total number of expected dimensions after applying dimensionality reduction for the visualization. This option only takes effect when multiple dimensionality_reduction_choices are selected. Default value: 3.

  • main.classification_model_choices : [String] Specify the classification model name(s) to use. Options include RF, KNN, NN, SVM, XGB, PLSDA, VAE, LR, GNB, LGBM and MLPVAE. Multiple model names can be entered together, separated by a space. Default value: RF

  • main.regression_model_choices : [String] Specify the regression model name(s) to use. Options include RF_reg, NN_reg, SVM_reg, XGB_reg, PLS_reg, KNN_reg, LGBM_reg, VAE_reg and MLPVAE_reg. Multiple model names can be entered together, separated by a space. Default value: RF_reg

  • main.calculate_shap: [Boolean] Enable SHAP analysis for feature importance. Default value: false

  • main.shap_features: [Int] Number of features to display on the radar/bar chart. Default value: 10

  • main.run_ppi: [Boolean] Execute protein–protein interaction (PPI) analysis. Default value: false

Warning

The protein–protein interaction analysis can be performed only when the dimensionality_reduction_choices option is set to either ELASTICNET or NONE, and the calculate_shap option is set to true.

  • main.ppi_analysis.score_threshold : [Int] Confidence score threshold for loading STRING database. Default value: 400

  • main.ppi_analysis.combined_score_threshold : [Int] Confidence score threshold for selecting nodes to plot in the network. Default value: 800

  • main.ppi_analysis.SHAP_threshold : [Int] The number of top important proteins selected for network analysis based on SHAP values. Default value: 100

  • main.ppi_analysis.protein_name_mapping : [Boolean] Whether to perform protein name mapping from UniProt IDs to Entrez Symbols. Default value: true

  • main.ppi_analysis.correlation_method : [String] Correlation method used to define strongly co-expressed proteins. Options include spearman, pearson and kendall. Default value: spearman

  • main.ppi_analysis.correlation_threshold : [Float] Threshold value of the correlation coefficient used to identify strongly co-expressed proteins. Default value: 0.8

  • main.*.memory_gb : [Int] Amount of memory in GB needed to execute the specific task. Default value: 24

  • main.*.cpu : [Int] Number of CPUs needed to execute the specific task. Default value: 16
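Before launching, the input layout described under main.input_csv can be checked with a short pandas sketch. The inline CSV and column names below are illustrative; in practice you would pass your own file path to pd.read_csv:

```python
import io

import pandas as pd

# Inline stand-in for a user's input .csv; in practice use pd.read_csv(path).
csv_text = """SampleID,Label,Protein1,Protein2,ProteinN
ID1,Label1,0.1,0.4,0.01
ID2,Label2,0.2,0.1,0.3
"""
df = pd.read_csv(io.StringIO(csv_text))

# The workflow expects a Label column; each row is a sample.
assert "Label" in df.columns

# Every remaining non-ID column is treated as a feature and must be numeric.
features = df.drop(columns=["SampleID", "Label"])
assert all(pd.api.types.is_numeric_dtype(features[c]) for c in features.columns)
```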

Note

We recommend that users adopt unique Entrez Symbols as the protein naming convention for our network analysis, although we provide an approach using the R/Bioconductor annotation package org.Hs.eg.db to map UniProt IDs to Entrez Symbols.

The protein name mapping process handles edge cases as follows:

  • UniProt IDs mapped to multiple Entrez symbols: All matched Entrez symbols corresponding to the same UniProt ID are concatenated using a semicolon (;) and later deconcatenated during network plot mapping to STRINGdb. This may occur, for example, where protein complexes are composed of subunits encoded by different genes.

  • Multiple UniProt IDs mapping to the same Entrez symbol: Only the first occurrence — corresponding to the protein with the highest SHAP value for that symbol — is retained in the final dataset. This may happen, for example, with protein isoforms or fusion proteins.

  • UniProt IDs with no associated Entrez symbol: These entries are removed from the dataset.
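Assuming a simple mapping table, the three edge-case rules above can be sketched in pandas. The column names and values are hypothetical; the actual workflow performs this mapping in R with org.Hs.eg.db:

```python
import pandas as pd

# Hypothetical UniProt -> Entrez mapping with mean SHAP values per protein;
# None marks a UniProt ID with no associated Entrez symbol.
df = pd.DataFrame({
    "uniprot": ["P1", "P1", "P2", "P3", "P4"],
    "symbol":  ["GENEA", "GENEB", "GENEC", "GENEC", None],
    "shap":    [0.9, 0.9, 0.8, 0.5, 0.3],
})

# Rule 3: drop UniProt IDs with no associated Entrez symbol.
df = df.dropna(subset=["symbol"])

# Rule 1: one UniProt ID mapped to multiple symbols: concatenate with ';'.
df = df.groupby("uniprot", as_index=False).agg({"symbol": ";".join, "shap": "first"})

# Rule 2: multiple UniProt IDs mapping to one symbol: keep the occurrence
# with the highest SHAP value.
df = df.sort_values("shap", ascending=False).drop_duplicates(subset="symbol")
```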

Workflow Outputs

  • report : [File] A .pdf file containing the final report, including the plots generated through the analyses.
  • results : [File] A .gz file containing the results and plots from all steps in the workflow.

Components

Package License
micromamba==1.5.5 BSD-3-Clause
python PSF/GPL-compat
joblib BSD-3-Clause
matplotlib PSF/BSD-compat
numpy BSD
pandas BSD-3-Clause
scikit-learn BSD-3-Clause
xgboost Apache-2.0
shap MIT
pillow HPND
PyTorch BSD
Optuna MIT
fpdf LGPL-3.0
seaborn BSD-3-Clause
umap-learn BSD-3-Clause
AnnotationDbi Artistic-2.0
BiocManager Artistic-2.0
fields GPL (>= 2)
ggplot2 MIT
igraph GPL (>= 2)
magrittr MIT
optparse GPL (>= 2)
STRINGdb GPL (>= 2)
tidyverse GPL-3
writexl BSD-2-Clause
org.Hs.eg.db Artistic-2.0
