Big Data Analytics and Machine Learning

PI: Sunita Chandrasekaran
Students: Mauricio H. Ferrato, Mathias Heider
Collaborators: Nemours Children’s Hospital: Erin L. Crowgey, Karl Franke
Collaborators: University of Delaware: Adam Marsh

Funding Agency: Nemours Children’s Hospital

Duration: 09/01/2018 – 06/30/2023

Project Summary:
The goal of this project is to build predictive models for rare disease outcomes using machine learning and deep learning approaches. Over the course of the project we have focused on three diseases: Sickle Cell Disease (SCD), Parkinson’s Disease (PD), and Acute Myeloid Leukemia (AML). The goal of the SCD and PD projects was to predict adverse events at the individual level, such as acute chest syndrome and vaso-occlusive crisis for SCD, by utilizing genomics (variant data). The goal of the oncology project is to predict drug response as well as vital status using transcriptomics and ex vivo data. Machine learning has shown promise as an approach for early detection and prevention of human diseases.
For this project we have developed the Rna-seq Count Drug response Machine Learning (RCDML) workflow, which follows best practices for data preprocessing, feature selection, classifier training, hyperparameter optimization, and inference and validation. The initial ML pipeline was created using simulated genomic data for prototyping and for the SCD and PD projects. This framework was then adapted to read RNA-seq count data and to learn binary classification tasks that predict an individual’s outcome. The framework combines multiple feature selection techniques, including principal component analysis (PCA), gene expression analysis, and SHapley Additive exPlanations (SHAP), with tree-ensemble classification algorithms such as gradient boosting (XGBoost) and random forests.
We determine the best-performing pipeline using sensitivity and specificity metrics and by comparing AUC-ROC curves. We also analyze the selected features using an explainable AI approach to better understand how they match what is already known about the human genome and its role in oncology and drug resistance.
We have applied this framework to a series of AML projects (TARGET and BeatAML) and achieved high prediction accuracy in classifying high and low responders to different drug therapies. Collectively, this work provides a foundation for benchmarking and applying machine learning techniques as an informative tool for early classification of high-risk subjects.
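
The evaluation step described above (sensitivity, specificity, and AUC-ROC used to compare candidate pipelines) can be illustrated with a minimal sketch. This is not the RCDML code: the labels, scores, pipeline names (`pipeline_a`, `pipeline_b`), and the 0.5 decision threshold are all made-up assumptions, chosen only to show how the metrics are computed and compared.

```python
# Illustrative sketch only: labels and scores are invented, not RCDML output.

def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc_roc(y_true, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate(y_true, scores, threshold=0.5):
    """Threshold the scores (assumed cutoff of 0.5) and collect all three metrics."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    sens, spec = sensitivity_specificity(y_true, y_pred)
    return {"sensitivity": sens, "specificity": spec, "auc": auc_roc(y_true, scores)}


# Hypothetical held-out labels (1 = high responder, 0 = low responder)
# and predicted scores from two hypothetical candidate pipelines.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
candidate_scores = {
    "pipeline_a": [0.9, 0.8, 0.4, 0.6, 0.3, 0.7, 0.2, 0.1],
    "pipeline_b": [0.9, 0.7, 0.6, 0.8, 0.3, 0.4, 0.2, 0.1],
}

results = {name: evaluate(y_true, s) for name, s in candidate_scores.items()}
best = max(results, key=lambda name: results[name]["auc"])

for name, metrics in results.items():
    print(name, metrics)
print("best by AUC:", best)
```

In this toy data, `pipeline_a` misranks one high responder below one low responder, so its AUC is below 1, while `pipeline_b` separates the two groups perfectly. Ranking by AUC alone is one possible selection rule; sensitivity and specificity at a fixed threshold could be weighed alongside it.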

Publications:

Mauricio H. Ferrato, Adam G. Marsh, Karl Franke, Benjamin J. Huang, E. Anders Kolb, Deborah DeRyckere, Douglas K. Graham, Sunita Chandrasekaran, and Erin L. Crowgey. Machine learning classifier approaches for predicting response to RTK-Type-III Inhibitors demonstrates high accuracy using transcriptomic signatures and ex vivo data. (Pending Review)

GitHub: The Rna-seq Count Drug response Machine Learning (RCDML) framework code is available here: https://github.com/UD-CRPL/RCDML

Posters and Talks:

  • Mauricio H. Ferrato, Erin L. Crowgey, Sunita Chandrasekaran. Proposing a Machine Learning Framework for Classification of Patient Cohorts Using Genomics Data. AMIA. November 2020