May 26, 2020
Big Data Analytics and Machine Learning
PI: Sunita Chandrasekaran
Students: Mauricio Ferrato
Collaborators: Nemours/Alfred I. duPont Hospital for Children: Erin Crowgey, Karl Franke
Funding Agency: Nemours/Alfred I. duPont Hospital for Children
Duration: 09/01/2018 – 04/30/2023
Translating raw next-generation sequencing (NGS) data into actionable biological knowledge requires many complex, interconnected algorithms and computational pipelines. The goal of this project is to build predictive models for disease outcomes using de-identified electronic health record (EHR) data. The initial disease foci are sickle cell disease and oncology. The goal of the sickle cell project is to predict adverse events, such as acute chest syndrome and vaso-occlusive crisis, at the individual level by utilizing population trends. The goal of the oncology project is to predict relapse. Machine learning has shown promise as an approach for early detection and prevention of human diseases. In this project, we develop a machine learning framework that follows best practices for data cleaning, classifier training, hyperparameter tuning, and classifier testing and validation. The genomic data is simulated with a range of disease patterns that mirror real-life examples. Our framework applies multiple feature selection techniques, including Principal Component Analysis (PCA) and SHapley Additive exPlanations (SHAP), in combination with various classifiers, such as Support Vector Machine (SVM) and random forest. We then compare the resulting pipelines and identify the best-performing one using accuracy, sensitivity, and specificity, together with ROC curves. Collectively, this work provides a foundation for benchmarking machine learning techniques as a tool for early classification of high-risk subjects.
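To make the evaluation step concrete, the following is a minimal sketch of how accuracy, sensitivity, specificity, and the area under the ROC curve can be computed for a binary classifier. It is not the project's code: the function names (`confusion_counts`, `summary_metrics`, `roc_auc`), the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration, and it uses only the Python standard library.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def summary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, specificity, and accuracy from hard predictions.

    Assumes both classes are present; otherwise a division by zero occurs.
    """
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive receives a higher
    score than a randomly chosen negative, with ties counted as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: four positive and four negative subjects.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.35, 0.6, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = summary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Sensitivity, specificity, and accuracy depend on the chosen threshold, whereas the ROC-based area summarizes ranking quality across all thresholds, which is one reason to report them together when comparing pipelines.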
Posters and Talks:
- Mauricio Ferrato, Erin Crowgey, Sunita Chandrasekaran. Proposing a Machine Learning Framework for Classification of Patient Cohorts Using Genomics Data.
AMIA. November 2020 (Submitted, Pending Approval)