Identification of Exhaled VOC Biomarkers for Lung Cancer under Data-Limited Conditions using Data Augmentation and Multi-view Feature Selection
Authors: Guancheng Ren#, Bohao Liu#, Zhiyuan Qu#, Xue Jiang, Heng Zhao, Danyao Qu, Ruizhi Ning, Chen Su, Taoping Liu*, Guangjian Zhang*, Weiwei Wu*
Publication date: 2025/7 (online)
Published in: Advanced Intelligent Discovery
DOI: 10.1002/aidi.202500086
Abstract: Accurate identification of biomarkers that reflect lung cancer-induced metabolic alterations is essential for reliable volatilomics-based diagnosis through exhaled breath analysis. This study analyzes 47 clinical breath samples collected with custom-designed apparatuses and characterized by gas chromatography-mass spectrometry (GC–MS). To address the high-dimensional, low-sample-size challenge inherent in GC–MS data and the complex interdependencies among volatile organic compounds (VOCs), we develop a biomarker identification framework that integrates conditional data augmentation with multi-view feature selection. The framework first employs a conditional variational autoencoder to generate lung cancer-specific synthetic training samples, alleviating the overfitting caused by limited clinical samples. A k-means-based feature clustering algorithm then decomposes the high-dimensional GC–MS data into multiple low-dimensional feature subspaces. These subspaces are processed in parallel by a multi-view neural network that uses feature gating mechanisms to isolate discriminative VOCs. From 392 candidate features, our approach identifies nine clinically relevant VOCs that show a strong association with lung cancer. In comparative evaluations, the framework outperforms conventional feature selection methods, including the t-test, partial least squares discriminant analysis, and support vector machine-recursive feature elimination, in both diagnostic performance and feature selection stability, underscoring its potential for clinical biomarker discovery.
Description: The code for the multi-view neural network is provided.
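
The feature clustering and gating steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' released code. It assumes a plain k-means over standardized feature columns and a sigmoid gate per feature, and it uses random placeholder data with the 47 x 392 shape reported in the abstract. The function names, the number of views (8), and the gate form are all hypothetical choices made for illustration.

```python
import numpy as np

def cluster_features(X, n_views, n_iter=50, seed=0):
    """Assign each column of X (samples x features) to one of n_views clusters
    using plain k-means. Assumption: features are clustered on their
    standardized per-sample profiles."""
    rng = np.random.default_rng(seed)
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
    F = Z.T  # one row per feature
    centers = F[rng.choice(F.shape[0], size=n_views, replace=False)]
    labels = np.zeros(F.shape[0], dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every feature to every center.
        d = ((F[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_views):
            members = F[labels == k]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[k] = members.mean(axis=0)
    return labels

def split_into_views(X, labels):
    """Return one low-dimensional sub-matrix (view) per non-empty cluster."""
    return [X[:, labels == k] for k in np.unique(labels)]

def gate(view, logits):
    """Hypothetical feature gate: scale each feature by sigmoid(logit).
    In a trained network the logits would be learned; gates near zero
    would mark features as non-discriminative."""
    return view * (1.0 / (1.0 + np.exp(-logits)))

# Placeholder data only: 47 samples and 392 features, as in the abstract.
X = np.random.default_rng(1).normal(size=(47, 392))
labels = cluster_features(X, n_views=8)
views = split_into_views(X, labels)
gated = [gate(v, np.zeros(v.shape[1])) for v in views]  # untrained gates = 0.5
```

Each entry of `views` would feed one branch of a multi-view network. Ranking features by their learned gate values is one plausible way to read off candidate biomarkers, though the abstract does not specify the selection rule.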