A novel data analysis pipeline based on multi-view learning for the identification of disease biomarkers
Abstract:
The rising prevalence of chronic diseases poses significant economic and social challenges to national healthcare systems. Machine Learning (ML) and advanced analytical tools have the potential to identify early disease biomarkers from small quantities of biological samples more quickly and cheaply than traditional methods.
Our approach integrates heterogeneous biological data to construct a predictive model for diagnosis using ML techniques. The model is then exploited to suggest microRNAs (miRNAs) as candidate biomarkers of disease onset and progression. Methodologically, we adopt a multi-view learning approach, which leverages various patient-related aspects (views), including miRNA expression values, sequences, and metadata. The novelty lies in the simultaneous use of these three views, complemented by a fourth that represents the interactions among miRNAs.
As a case study, we present results from the analysis of patients with Mild Cognitive Impairment (MCI) and Alzheimer’s disease (AD) using publicly available data. We pre-processed and integrated four datasets from the Gene Expression Omnibus (GEO) repository, which were then fed to a multi-view Random Forest classifier. Explainability techniques based on feature importance and permutation importance were adopted to identify candidate biomarkers. Finally, we conducted miRNA-target interaction (multiMiR) and functional pathway (DAVID, KEGG) analyses to interpret the results biologically.
High-scoring predicted biomarkers will be validated by nanofluidic qPCR analysis of miRNAs extracted from blood samples of AD and MCI patients under clinical observation. Preliminary results show that some candidate miRNA biomarkers are significantly involved in neurological development and neurodegenerative processes. The final goal is to differentiate between MCI and AD biomarkers.
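As an illustration of the classification and explainability step, the following is a minimal sketch in Python with scikit-learn. It is not the pipeline used in this work. The data are synthetic, the view names and shapes are hypothetical, and the views are combined by simple concatenation (early fusion), which may differ from the multi-view Random Forest adopted here. The sketch only shows how impurity-based and permutation-based importances can rank features drawn from several views.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_patients = 200

# Hypothetical labels: 0 = MCI, 1 = AD (synthetic, for illustration only).
y = rng.integers(0, 2, size=n_patients)

# Three synthetic per-patient views; names and shapes are assumptions.
views = {
    "expression": rng.normal(size=(n_patients, 20)),
    "sequence": rng.normal(size=(n_patients, 10)),
    "metadata": rng.normal(size=(n_patients, 5)),
}
# Plant a signal in one expression feature so the ranking has something to find.
views["expression"][:, 3] += 3.0 * y

# Early fusion (an assumption): concatenate the views column-wise
# while keeping track of which view each feature came from.
feature_names = [
    f"{name}_{j}" for name, X in views.items() for j in range(X.shape[1])
]
X_all = np.hstack(list(views.values()))

X_train, X_test, y_train, y_test = train_test_split(
    X_all, y, test_size=0.3, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# Impurity-based importance, computed from the fitted trees.
impurity_rank = np.argsort(model.feature_importances_)[::-1]

# Permutation importance, computed on held-out data: the drop in accuracy
# when the values of a single feature are shuffled.
perm = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
perm_rank = np.argsort(perm.importances_mean)[::-1]

top_impurity = feature_names[impurity_rank[0]]
top_permutation = feature_names[perm_rank[0]]
print("held-out accuracy:", round(test_accuracy, 3))
print("top feature (impurity):", top_impurity)
print("top feature (permutation):", top_permutation)
```

In this toy setting, both rankings should single out the feature carrying the planted signal. On real data the two measures can disagree, for instance when features are strongly correlated, which is one reason to inspect both before proposing candidates.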