Data Scientist

Statistics Projects

Show, Don't Tell

SIR Epidemiology Model in a Homogeneous Population by Adaptive Markov Chain Monte Carlo

Computational Statistics Course Project, 2021

Using R, created a deterministic compartmental modeling framework to explore the state transitions of an epidemic, as susceptible individuals in a population become infected and eventually recover from a disease. The SIR model has three compartments (susceptible, infectious, recovered) and two parameters (the basic reproductive number and the infectious period). Parameters were estimated by sampling from their posterior distribution with the Metropolis-Hastings (M-H) algorithm, a Markov Chain Monte Carlo (MCMC) method. Trace plots, density plots, and burn-in diagnostics were used to evaluate mixing and to confirm that the sampler explored the parameter space efficiently. Convergence was assessed by examining the autocorrelation among the samples returned by the chain. Finally, an adaptive MCMC scheme was employed to improve the proposal distribution by approximating the true target covariance with the empirical covariance matrix of the samples.
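
A minimal Python sketch of the random-walk Metropolis-Hastings step described above. The project itself was written in R; the synthetic data, flat prior bounds, noise level, step sizes, and all function names here are illustrative assumptions, not the project's settings. The adaptive variant is not shown.

```python
import math
import random

def sir_infected(r0, period, days=40, i0=0.01, dt=0.25):
    """Integrate the SIR ODEs with forward Euler; return daily infectious fractions."""
    beta, gamma = r0 / period, 1.0 / period
    s, i = 1.0 - i0, i0
    steps_per_day = round(1 / dt)
    out = []
    for _ in range(days):
        for _ in range(steps_per_day):
            new_inf = beta * s * i * dt
            new_rec = gamma * i * dt
            s, i = s - new_inf, i + new_inf - new_rec
        out.append(i)
    return out

def log_posterior(theta, data, sigma=0.01):
    """Gaussian log-likelihood plus a flat prior on a bounded box (assumed)."""
    r0, period = theta
    if not (1.0 < r0 < 10.0 and 1.0 < period < 20.0):
        return -math.inf
    model = sir_infected(r0, period, days=len(data))
    return -sum((m - d) ** 2 for m, d in zip(model, data)) / (2 * sigma ** 2)

def metropolis_hastings(data, start, step, n_iter, rng):
    """Random-walk M-H with independent Gaussian proposals per parameter."""
    theta, lp = list(start), log_posterior(start, data)
    chain, accepted = [], 0
    for _ in range(n_iter):
        proposal = [t + rng.gauss(0.0, s) for t, s in zip(theta, step)]
        lp_new = log_posterior(proposal, data)
        # Symmetric proposal, so the acceptance ratio is the posterior ratio.
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            theta, lp = proposal, lp_new
            accepted += 1
        chain.append(tuple(theta))
    return chain, accepted / n_iter

rng = random.Random(0)
# Synthetic observations from known parameters (R0 = 2.5, period = 5 days).
data = [x + rng.gauss(0.0, 0.01) for x in sir_infected(2.5, 5.0)]
chain, acc_rate = metropolis_hastings(
    data, start=(2.0, 4.0), step=(0.05, 0.1), n_iter=3000, rng=rng
)
burned = chain[1000:]  # discard burn-in samples
r0_mean = sum(t[0] for t in burned) / len(burned)
period_mean = sum(t[1] for t in burned) / len(burned)
print(f"acceptance={acc_rate:.2f}  R0~{r0_mean:.2f}  period~{period_mean:.2f}")
```

An adaptive version would replace the fixed `step` with a proposal covariance derived from the empirical covariance of the chain so far.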

Stochastic SEIR Epidemic Model in a Homogeneous Community by Kalman Filter Parameter Estimation

Stochastic Differential Equations Course Project, 2020

Using R, created a stochastic compartmental modeling framework to explore the state transitions of an epidemic. The SEIR model has four compartments (susceptible, exposed, infectious, recovered) and was formulated here as a system of nonlinear stochastic differential equations (SDEs). The SDEs were solved numerically with the Euler-Maruyama method, and parameters were estimated with adaptive Markov Chain Monte Carlo and extended Kalman filter methods.
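
A hedged Python sketch of an Euler-Maruyama simulation for a stochastic SEIR model. The project was written in R and its exact noise structure is not stated; placing the noise on the transmission term, and every parameter value below, is an assumption made for illustration.

```python
import math
import random

def seir_euler_maruyama(beta, kappa, gamma, sigma, days, dt, rng, e0=0.01):
    """Simulate SEIR fractions; noise enters through the infection term (assumed)."""
    s, e, i, r = 1.0 - e0, e0, 0.0, 0.0
    path = [(s, e, i, r)]
    for _ in range(round(days / dt)):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment ~ N(0, dt)
        infection = beta * s * i * dt + sigma * s * i * dw  # S -> E
        incubation = kappa * e * dt                         # E -> I
        recovery = gamma * i * dt                           # I -> R
        s, e, i, r = (s - infection,
                      e + infection - incubation,
                      i + incubation - recovery,
                      r + recovery)
        path.append((s, e, i, r))
    return path

rng = random.Random(1)
path = seir_euler_maruyama(beta=0.6, kappa=0.2, gamma=0.1, sigma=0.05,
                           days=100, dt=0.1, rng=rng)
peak_infectious = max(state[2] for state in path)
print(f"peak infectious fraction: {peak_infectious:.3f}")
```

Because each flow leaves one compartment and enters another, the four fractions always sum to one, with or without the noise term.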

LowCountry Memorial Hospital Database

Principles of Database Systems Course Project, 2020

Using MySQL, created a database for a fictitious hospital, modeling its structure on patient and staff needs at my current place of employment (Beaufort Memorial). The database focused on both inpatient and outpatient hospital records, including all patient services and actions performed by administrative, technician, or medical staff during a patient’s stay.

CDC 500 Cities Data Exploration and Design

Data Visualization Course Project, 2020

Utilized the Centers for Disease Control and Prevention’s (CDC) 500 Cities Project data set, which reports city- and census tract-level estimates, obtained using small area estimation methods, for 27 chronic disease measures across the 500 largest American cities. To improve local population health, residents and healthcare providers need to harness current data on health status and behavioral risk factors within their own communities. Here, stroke data were aggregated by state, city, and census tract and analyzed against the 27 chronic disease measures, which relate to unhealthy behaviors (5), health outcomes (13), and use of preventive services (9). These measures include major risk behaviors that lead to illness, suffering, and early death related to chronic diseases and conditions, as well as the conditions and diseases that are the most common, costly, and preventable of all health problems.

Patient Satisfaction Regression Analysis

Statistical Models and Regression Course Project, 2019

From a hospital administrator’s perspective, the patient experience is of particular interest because it is a crucial factor in achieving high-quality care and improved patient outcomes. Here, the analysis explored the relationship between patient satisfaction (Y, the response variable) and patient age, severity of illness, and anxiety level. Data were reported for 25 randomly selected patients. The analysis was conducted both to understand patient satisfaction as a function of these attributes and to predict patient satisfaction for newly admitted hospital patients.

Image by Samuel Girven

Regression Analysis on Beaufort Memorial’s LifeFit Wellness Services Member Spending

Applied Statistics Regression Analysis Course Project, 2018

Beaufort Memorial Hospital’s LifeFit Wellness Services is a medically managed fitness and outpatient rehabilitation center. At LifeFit, medical professionals, rehabilitation specialists, and exercise physiologists work together to analyze each member’s health status and design individualized wellness programs tailored to each person’s needs. As an outpatient facility for the hospital, LifeFit serves individuals with injuries and other medically related special needs, in addition to working toward improving the overall health and wellbeing of the general community within Beaufort County. This analysis examined total annual revenue as a function of membership demographic variables. Specifically, it sought to identify the subpopulations that make the largest contribution to LifeFit’s total annual revenue. Understanding the variables with the strongest contributions to total revenue will help identify areas for departmental growth and allow future marketing strategies to be targeted in a manner that is more likely to maximize profits. The data were extracted and exported from LifeFit’s database (VirtualClubMate) for Fiscal Year 2017. Variables examined included member spending (retail, massage, specialty programs, and membership), age, zip code, gender, check-in frequency, and membership category. Analysis was conducted in both SPSS and Python.

Data Science Projects

Show, Don't Tell

CDC 500 Cities Interactive Visualization Using Tableau

Data Visualization Course Project, 2020

A Tableau dashboard was created using the Centers for Disease Control and Prevention’s (CDC) 500 Cities dataset. This dataset combines city and census tract-level data with 27 chronic disease measures for the 500 largest cities in the United States. The dashboard lets users compare national, state, and city measures for chronic disease outcomes and prevention methods.

Student Performance Interactive Visualization Using R and Python

Data Visualization Course Project, 2020

Sought to explore the influence of multiple variables on student test performance. The data set includes scores from three standard student exams (math, reading, and writing), as well as a variety of student and parental socioeconomic factors that interact with student performance. By identifying key factors and interactions between these variables, we might better understand the effectiveness of student test preparation courses, the strongest factors which contribute to test outcomes, the impact of parental education levels, and how best to improve student scores on each test type.

COVID-19 in South Korea Using R and Python

Data Visualization Course Project, 2020

Sought to examine confirmed cases of coronavirus disease 2019 (COVID-19) in South Korea. To date, South Korea has 34,652 confirmed cases, of which 27,885 have recovered and 526 have died. The data set, provided by the Korea Centers for Disease Control and Prevention, tracks the number of confirmed cases by day between January 20th, 2020 and March 9th, 2020, as well as a variety of other patient-specific factors. By identifying key factors and interactions between these variables, we might better understand the progression and spread of this novel disease.

FIFA 2019 Sports Analysis Using R and Python

Data Visualization Course Project, 2020

Conducted a sports analysis of the Fédération Internationale de Football Association (FIFA) 2019 player roster. The data set contains 89 variables, including player name, age, nationality, overall rating, potential, value, wage, international reputation, and multiple skill-specific factors. By identifying key factors and interactions between these variables, we might better understand the international appeal, value, and skill sets of players and teams within FIFA.

Scientific Visualization Using ParaView

Data Visualization Course Project, 2020

Two 3D datasets were selected from the web and visualized in several ways using ParaView. Both datasets have explicit X, Y, and Z coordinates and contain no NaN values. Visualization techniques including point plots, surface plots, and volume plots with custom filtering were used to display three distinct representations of each dataset.

Winnow-2 and Naïve Bayes Algorithms for Linear Classification Problems

Machine Learning Course Project, 2020

Sought to implement two Supervised Machine Learning Algorithms, Winnow-2 and Naïve Bayes, in the context of classification. Both algorithms were applied to categorical (discrete) as well as continuous-valued input data using five data sets obtained from the UCI Machine Learning Repository (Breast Cancer, Glass, Iris, Soybean, and House Voting).
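
A minimal Python sketch of the Winnow-2 update rule on binary features, shown on a hypothetical toy target (x0 OR x2) rather than the UCI datasets. The threshold n/2, the promotion factor alpha = 2, and the function names are illustrative assumptions, not the project's settings.

```python
from itertools import product

def winnow2_predict(weights, theta, x):
    """Predict 1 when the weighted sum of active features exceeds the threshold."""
    return int(sum(w * xi for w, xi in zip(weights, x)) > theta)

def winnow2_train(samples, labels, alpha=2.0, epochs=10):
    n = len(samples[0])
    theta = n / 2          # common default threshold (assumed here)
    weights = [1.0] * n    # all weights start at 1
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = winnow2_predict(weights, theta, x)
            if pred == 1 and y == 0:
                # False positive: demote the weights of the active features.
                weights = [w / alpha if xi else w for w, xi in zip(weights, x)]
            elif pred == 0 and y == 1:
                # False negative: promote the weights of the active features.
                weights = [w * alpha if xi else w for w, xi in zip(weights, x)]
    return weights, theta

# Toy problem: every 4-bit input, labeled by the disjunction x0 OR x2.
samples = list(product([0, 1], repeat=4))
labels = [int(x[0] or x[2]) for x in samples]
weights, theta = winnow2_train(samples, labels)
predictions = [winnow2_predict(weights, theta, x) for x in samples]
print(weights, predictions == labels)
```

Continuous-valued inputs would first need to be discretized into binary indicator features, since the multiplicative updates act only on active (1-valued) features.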

K-Nearest Neighbors Algorithms for Nonparametric Classification and Regression

Machine Learning Course Project, 2020

Two k-nearest neighbor classifiers and two k-nearest neighbor regressors were applied to four input datasets obtained from the UCI Machine Learning Repository (Ecoli, Image Segmentation, Computer Hardware, Forest Fires). Additionally, two variants were introduced to reduce computational cost by shrinking the set of training points included in the distance calculations.
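
A minimal Python sketch of k-nearest neighbor classification and regression using Euclidean distance. The toy points and function names are hypothetical, and the two reduced-set variants mentioned above are not shown.

```python
import math
from collections import Counter

def k_nearest(train_x, query, k):
    """Return the indices of the k training points closest to the query."""
    order = sorted(range(len(train_x)), key=lambda j: math.dist(train_x[j], query))
    return order[:k]

def knn_classify(train_x, train_y, query, k=3):
    """Majority vote among the k nearest neighbors."""
    votes = Counter(train_y[j] for j in k_nearest(train_x, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_x, train_y, query, k=3):
    """Mean target value of the k nearest neighbors."""
    idx = k_nearest(train_x, query, k)
    return sum(train_y[j] for j in idx) / len(idx)

# Hypothetical toy data: two well-separated clusters.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
classes = ["a", "a", "a", "b", "b", "b"]
targets = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

print(knn_classify(points, classes, (0.5, 0.5)))
print(knn_regress(points, targets, (0.2, 0.2)))
```

Every query scans the full training set, which is why methods that shrink the stored set of points can lower the cost of prediction.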

Image by Markus Spiske

Classification and Regression Decision Trees

Machine Learning Course Project, 2020

The ID3 algorithm was applied to three classification problems, using gain ratio as the splitting criterion, with post-fit pruning to prevent overfitting. The CART algorithm was applied to three regression problems, using mean squared error as the splitting criterion, with a validation set used to tune an early-stopping parameter to prevent overfitting. The decision trees were applied to six input datasets obtained from the UCI Machine Learning Repository (Abalone, Car Evaluation, Image Segmentation, Computer Hardware, Forest Fires, Wine Quality).
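
A short Python sketch of the gain-ratio splitting criterion for a categorical feature: information gain divided by the split information (the entropy of the feature's own value distribution). The toy feature columns and function names are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())

def gain_ratio(feature_values, labels):
    """Information gain of splitting on the feature, normalized by split information."""
    total = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    # Expected label entropy remaining after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(feature_values)
    # A feature with a single value cannot split the data at all.
    return info_gain / split_info if split_info > 0 else 0.0

labels = [0, 0, 1, 1]
print(gain_ratio(["a", "a", "b", "b"], labels))  # feature separates the classes
print(gain_ratio(["a", "b", "a", "b"], labels))  # feature carries no information
```

Dividing by the split information penalizes features with many distinct values, which plain information gain tends to favor.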

Adaline and Logistic Regression

Machine Learning Course Project, 2020

Implemented two linear classifiers: Adaline and Logistic Regression. Adaline fits a linear function of the inputs to the class value (0 or 1) and thresholds the output, whereas Logistic Regression models the log-odds of class membership as a linear function of the inputs in order to predict the probability that an instance belongs to a class. Here, both algorithms were trained using gradient descent and applied to five datasets obtained from the UCI Machine Learning Repository (Breast Cancer, Glass, Iris, Soybean, House Votes).
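
A hedged Python sketch contrasting the two models under batch gradient descent. The only difference in the update is the output function: identity for Adaline (squared error) and the sigmoid for Logistic Regression (cross-entropy). The learning rate, epoch count, toy data, and function names are illustrative assumptions.

```python
import math

def model_output(w, b, x, model):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Adaline uses the raw linear output; logistic regression squashes it.
    return z if model == "adaline" else 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, model, lr=0.1, epochs=200):
    """Batch gradient descent; both losses give the gradient (output - y) * x."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(xs, ys):
            err = model_output(w, b, x, model) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(xs)
    return w, b

def predict(w, b, x, model):
    """Threshold the model output at 0.5 to obtain a 0/1 class."""
    return int(model_output(w, b, x, model) >= 0.5)

# Hypothetical one-feature, linearly separable toy data.
xs = [(-2.0,), (-1.5,), (-1.0,), (1.0,), (1.5,), (2.0,)]
ys = [0, 0, 0, 1, 1, 1]
results = {}
for model in ("adaline", "logistic"):
    w, b = train(xs, ys, model)
    results[model] = [predict(w, b, x, model) for x in xs]
    print(model, results[model])
```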

Image by Solen Feyissa

Feed-Forward Neural Networks and Backpropagation

Machine Learning Course Project, 2020

The backpropagation algorithm was used in conjunction with gradient descent to train feed-forward neural networks. Networks with 0, 1, and 2 hidden layers were applied to five datasets obtained from the UCI Machine Learning Repository (Breast Cancer, Glass, Soybean, House Votes). The size of each hidden layer was tuned using a validation set. A sigmoid activation function was used to introduce nonlinearities into the network, and a softmax output layer was used for multi-class classification.
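
A minimal Python sketch of one backpropagation pass through a single sigmoid hidden layer with a softmax output and cross-entropy loss. Biases are omitted for brevity, and the weights, input, and function names are hypothetical values chosen only to make the example runnable.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, w1, w2):
    """w1 has one row per hidden unit; w2 has one row per class."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    probs = softmax([sum(w * h for w, h in zip(row, hidden)) for row in w2])
    return hidden, probs

def loss(x, target, w1, w2):
    """Cross-entropy of the true class under the softmax output."""
    return -math.log(forward(x, w1, w2)[1][target])

def backward(x, target, w1, w2):
    hidden, probs = forward(x, w1, w2)
    # Softmax with cross-entropy gives an output error of (probability - one-hot).
    d_out = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    grad_w2 = [[d * h for h in hidden] for d in d_out]
    # Push the error back through w2, then through the sigmoid derivative h(1 - h).
    d_hidden = [sum(d_out[k] * w2[k][j] for k in range(len(w2))) * h * (1.0 - h)
                for j, h in enumerate(hidden)]
    grad_w1 = [[d * xi for xi in x] for d in d_hidden]
    return grad_w1, grad_w2

x, target = [0.5, -1.0], 2
w1 = [[0.1, -0.2], [0.4, 0.3]]
w2 = [[0.2, -0.1], [0.0, 0.3], [-0.3, 0.1]]
grad_w1, grad_w2 = backward(x, target, w1, w2)
print(grad_w1)
print(grad_w2)
```

A gradient descent step would subtract a learning rate times each gradient from the corresponding weight.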

Reinforcement Learning with Value Iteration, Q-Learning, and SARSA

Machine Learning Course Project, 2020

This project applied three reinforcement learning techniques (value iteration, Q-learning, and SARSA) to the Racetrack problem, a standard control problem. The algorithms were compared on the scores they obtained on various racetracks while varying parameters such as the convergence tolerance and the number of iterations. The performance of all three algorithms was found to be comparable.
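
A hedged Python sketch of the three techniques on a hypothetical five-state corridor rather than the Racetrack environment: value iteration with a convergence tolerance, followed by the single-step Q-learning and SARSA updates. The reward of -1 per move, the discount factor, and the function names are illustrative assumptions.

```python
ACTIONS = (-1, +1)  # move left or right along the corridor

def step(state, action, goal):
    """Deterministic move, clipped to the corridor; each move costs 1."""
    return min(max(state + action, 0), goal), -1.0

def value_iteration(n_states, gamma=0.9, tol=1e-8):
    goal = n_states - 1
    values = [0.0] * n_states  # the goal state is terminal and keeps value 0
    while True:
        delta = 0.0
        for s in range(goal):
            best = max(r + gamma * values[s2]
                       for s2, r in (step(s, a, goal) for a in ACTIONS))
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:  # stop once a full sweep changes no value by more than tol
            break
    policy = [max(ACTIONS, key=lambda a: step(s, a, goal)[1]
                  + gamma * values[step(s, a, goal)[0]])
              for s in range(goal)]
    return values, policy

def q_learning_update(q, s, a, r, s2, alpha=0.5, gamma=0.9):
    """Off-policy: bootstrap from the best action in the next state."""
    q[s][a] += alpha * (r + gamma * max(q[s2].values()) - q[s][a])

def sarsa_update(q, s, a, r, s2, a2, alpha=0.5, gamma=0.9):
    """On-policy: bootstrap from the action actually taken in the next state."""
    q[s][a] += alpha * (r + gamma * q[s2][a2] - q[s][a])

values, policy = value_iteration(5)
print([round(v, 3) for v in values], policy)
```

Value iteration requires the transition model (`step`), whereas Q-learning and SARSA learn from sampled transitions alone.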
