Statistics Projects
Show, Don't Tell
SIR Epidemiology Model in a Homogeneous Population by Adaptive Markov Chain Monte Carlo
Computational Statistics Course Project, 2021
Using R, created a deterministic compartmental model framework to explore the state transitions of an epidemic as susceptible individuals in the population become infected and eventually recover. The SIR model has three compartments (susceptible, infectious, recovered) and two parameters (the basic reproductive number and the infectious period). Parameters were estimated by sampling from the posterior distribution with the Metropolis-Hastings (M-H) algorithm, a Markov Chain Monte Carlo (MCMC) method. Trace plots, density plots, and burn-in diagnostics were used to evaluate mixing and to check that the sampler explored the parameter space efficiently. Convergence was examined through the autocorrelation of the returned samples. Finally, an adaptive MCMC was employed to improve the proposal distribution by approximating the true target covariance with the empirical covariance matrix of the sampled values.
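The project itself was written in R; the following is only a minimal Python sketch of the same idea, not the original code. It assumes a Gaussian error model with a fixed standard deviation, flat priors on a box, Euler integration of the SIR equations, and a fixed (non-adaptive) random-walk proposal. The names simulate_sir, log_posterior, and metropolis_hastings, and all default values, are hypothetical.

```python
import math
import random


def simulate_sir(r0, infectious_period, n=1000.0, i0=1.0, days=60, dt=0.1):
    """Euler integration of the deterministic SIR equations; returns daily I(t)."""
    beta = r0 / infectious_period   # transmission rate implied by R0
    gamma = 1.0 / infectious_period  # recovery rate
    s, i, r = n - i0, i0, 0.0
    steps_per_day = int(round(1.0 / dt))
    daily_infected = []
    for _ in range(days):
        daily_infected.append(i)
        for _ in range(steps_per_day):
            new_infections = beta * s * i / n * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
    return daily_infected


def log_posterior(theta, data, sigma=5.0):
    """Flat prior on a box plus an assumed Gaussian likelihood (up to a constant)."""
    r0, period = theta
    if not (0.0 < r0 < 10.0 and 0.5 < period < 30.0):
        return -math.inf
    model = simulate_sir(r0, period, days=len(data))
    return -0.5 * sum((d - m) ** 2 for d, m in zip(data, model)) / sigma ** 2


def metropolis_hastings(data, start, n_iter=2000, step=(0.05, 0.1), seed=0):
    """Random-walk Metropolis-Hastings over (R0, infectious period)."""
    rng = random.Random(seed)
    theta = list(start)
    lp = log_posterior(theta, data)
    chain, accepted = [], 0
    for _ in range(n_iter):
        proposal = [t + rng.gauss(0.0, s) for t, s in zip(theta, step)]
        lp_new = log_posterior(proposal, data)
        # Symmetric proposal, so the acceptance ratio is the posterior ratio.
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            theta, lp, accepted = proposal, lp_new, accepted + 1
        chain.append(tuple(theta))
    return chain, accepted / n_iter
```

A typical use would be to generate synthetic data with simulate_sir(2.5, 5.0), run metropolis_hastings from a nearby starting point, discard an initial burn-in, and inspect trace plots of the remaining samples. An adaptive variant would replace the fixed step sizes with a proposal covariance estimated from the chain.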
Stochastic SEIR Epidemic Model in a Homogeneous Community by Kalman Filter Parameter Estimation
Stochastic Differential Equations Course Project, 2020
Using R, created a stochastic compartmental model framework to explore the state transitions of an epidemic. The SEIR model has four compartments (susceptible, exposed, infectious, recovered) and was formulated here as a system of nonlinear stochastic differential equations (SDEs). The resulting SDEs were solved numerically with the Euler-Maruyama method, and parameters were estimated by adaptive Markov Chain Monte Carlo and extended Kalman filter methods.
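As an illustration of the Euler-Maruyama step (in Python rather than the project's R), the sketch below simulates one SEIR path. The noise structure is an assumption: each transition gets its own Brownian increment scaled by the square root of its rate, which is one common diffusion approximation and not necessarily the formulation used in the project. Clamping each flow to keep compartments non-negative is also a simplification, and the function name and default values are hypothetical.

```python
import math
import random


def euler_maruyama_seir(beta, sigma, gamma, n=1000.0, e0=5.0,
                        t_end=60.0, dt=0.05, seed=1):
    """Simulate one SEIR sample path; returns a list of (t, S, E, I, R)."""
    rng = random.Random(seed)
    s, e, i, r = n - e0, e0, 0.0, 0.0
    path = [(0.0, s, e, i, r)]
    sqrt_dt = math.sqrt(dt)
    for k in range(1, int(round(t_end / dt)) + 1):
        # Transition rates: infection (S->E), incubation (E->I), recovery (I->R).
        rates = (beta * s * i / n, sigma * e, gamma * i)
        sources = (s, e, i)
        flows = []
        for rate, source in zip(rates, sources):
            dw = rng.gauss(0.0, sqrt_dt)  # Brownian increment over dt
            flow = rate * dt + math.sqrt(rate) * dw
            # Assumed safeguard: a flow cannot be negative or exceed its source.
            flows.append(min(max(flow, 0.0), source))
        infected, incubated, recovered = flows
        s = s - infected
        e = e + infected - incubated
        i = i + incubated - recovered
        r = r + recovered
        path.append((k * dt, s, e, i, r))
    return path
```

Because every flow leaves one compartment and enters the next, the total population stays constant along the path, which is a convenient sanity check when experimenting with step sizes.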
LowCountry Memorial Hospital Database
Principles of Database Systems Course Project, 2020
Using MySQL, created a database for a fictitious hospital, modeling its structure on patient and staff needs at my current place of employment (Beaufort Memorial). The database focused on both inpatient and outpatient hospital records, including all patient services and actions performed by administrative, technician, or medical staff during a patient’s stay.
CDC 500 Cities Data Exploration and Design
Data Visualization Course Project, 2020
Utilized the Centers for Disease Control and Prevention’s 500 Cities Project data set, which reports city and census tract-level data, obtained using small area estimation methods, for 27 chronic disease measures for the 500 largest American cities. To improve local population health, residents and healthcare providers will need to harness the power of current health status data and behavioral risk factors within their own communities. Here, stroke data was aggregated by state, city, and census tract and analyzed across the 27 chronic disease measures, which relate to unhealthy behaviors (5), health outcomes (13), and use of preventive services (9). These measures include major risk behaviors that lead to illness, suffering, and early death related to chronic diseases and conditions, as well as the conditions and diseases that are the most common, costly, and preventable of all health problems.
Patient Satisfaction Regression Analysis
Statistical Models and Regression Course Project, 2019
For a hospital administrator, the patient experience is of particular interest, as it is a crucial factor in achieving high-quality care and improved patient outcomes. Here, the dataset was used to explore the relationship between patient satisfaction (Y, the response variable) and patient age, severity of illness, and anxiety level. Data are reported for 25 randomly selected patients. Analysis was conducted with the intent both to understand patient satisfaction as a function of the attributes listed above and to predict patient satisfaction for newly admitted hospital patients.
Regression Analysis on Beaufort Memorial’s LifeFit Wellness Services Member Spending
Applied Statistics Regression Analysis Course Project, 2018
Beaufort Memorial Hospital’s LifeFit Wellness Services is a medically managed fitness and outpatient rehabilitation center. At LifeFit, medical professionals, rehabilitation specialists, and exercise physiologists work together to analyze members’ health status and design individualized wellness programs tailored to each person’s needs. As an outpatient facility for the hospital, LifeFit serves individuals with injuries and other medically related special needs, in addition to working toward improving the overall health and wellbeing of the general community within Beaufort County. This analysis examined total annual revenue as a function of membership demographic variables. Specifically, it sought to identify the subpopulations that make the largest contribution to LifeFit’s total annual revenue. Understanding the variables with the strongest contributions to total revenue will help identify areas for departmental growth and allow future marketing strategies to be targeted in a manner that is more likely to maximize profits. The data was extracted and exported from LifeFit’s database (VirtualClubMate) within Fiscal Year 2017. Variables examined included member spending (which included retail, massage, specialty programs, and membership), age, zip code, gender, check-in frequency, and membership category. Analysis was conducted in both SPSS and Python.
Data Science Projects
Show, Don't Tell
CDC 500 Cities Interactive Visualization Using Tableau
Data Visualization Course Project, 2020
A Tableau dashboard was created using the Centers for Disease Control and Prevention’s (CDC) 500 Cities dataset. This dataset combines city and census tract-level data with 27 chronic disease measures for the 500 largest cities in the United States. The dashboard is intended to let users compare national, state, and city measures for chronic disease outcomes and prevention methods.
Student Performance Interactive Visualization Using R and Python
Data Visualization Course Project, 2020
Sought to explore the influence of multiple variables on student test performance. The data set includes scores from three standard student exams (math, reading, and writing), as well as a variety of student and parental socioeconomic factors that interact with student performance. By identifying key factors and interactions between these variables, we might better understand the effectiveness of student test preparation courses, the strongest factors which contribute to test outcomes, the impact of parental education levels, and how best to improve student scores on each test type.
COVID-19 in South Korea Using R and Python
Data Visualization Course Project, 2020
Sought to examine confirmed cases of coronavirus disease 2019 (COVID-19) in South Korea. To date, South Korea has 34,652 confirmed cases, of which 27,885 have recovered and 526 have died. The data set, provided by the Korea Centers for Disease Control and Prevention, tracks the number of confirmed cases by day between January 20th, 2020 and March 9th, 2020, as well as a variety of other patient-specific factors. By identifying key factors and interactions between these variables, we might better understand the progression and spread of this novel disease.
FIFA 2019 Sports Analysis Using R and Python
Data Visualization Course Project, 2020
Served as a sports analysis of the Fédération Internationale de Football Association (FIFA) 2019 player roster. The data set includes 89 variables, including player names, age, nationality, overall, potential, value, wage, international reputation, and multiple skill-specific sport factors. By identifying key factors and interactions between these variables, we might better understand the international appeal, value, and skill set of players and teams within FIFA.
Scientific Visualization Using ParaView
Data Visualization Course Project, 2020
Two 3D datasets were selected from the web and visualized using ParaView. Both datasets have specific X, Y, and Z coordinates with no NaN values. Different visualization techniques, such as point plots, surface plots, and volume plots with custom filtering, were used to display three unique representations of each dataset.
Winnow-2 and Naïve Bayes Algorithms for Linear Classification Problems
Machine Learning Course Project, 2020
Sought to implement two supervised machine learning algorithms, Winnow-2 and Naïve Bayes, in the context of classification. Both algorithms were applied to categorical (discrete) as well as continuous-valued input data using five data sets obtained from the UCI Machine Learning Repository (Breast Cancer, Glass, Iris, Soybean, and House Votes).
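For readers unfamiliar with Winnow-2, a minimal Python sketch of its multiplicative update follows. It assumes binary (0/1) features and labels, weights initialized to 1, and a default threshold of half the feature count; the function names and defaults are hypothetical and are not taken from the project code.

```python
def winnow2_predict(weights, theta, x):
    """Predict 1 when the weighted sum of active features exceeds the threshold."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > theta else 0


def train_winnow2(samples, labels, alpha=2.0, theta=None, epochs=10):
    """Winnow-2: promote on false negatives, demote on false positives."""
    n_features = len(samples[0])
    theta = n_features / 2 if theta is None else theta
    weights = [1.0] * n_features
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = winnow2_predict(weights, theta, x)
            if pred == 1 and y == 0:
                # Demotion: shrink the weights of the features that fired.
                weights = [w / alpha if xi else w for w, xi in zip(weights, x)]
            elif pred == 0 and y == 1:
                # Promotion: grow the weights of the features that fired.
                weights = [w * alpha if xi else w for w, xi in zip(weights, x)]
    return weights, theta
```

Only the weights of active features change on a mistake, which is why continuous-valued inputs generally need to be discretized into binary indicators before Winnow-2 can be applied.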
K-Nearest Neighbors Algorithms for Nonparametric Classification and Regression
Machine Learning Course Project, 2020
Two k-nearest neighbor classifiers and two k-nearest neighbor regressors were applied to four input datasets obtained from the UCI Machine Learning Repository (Ecoli, Image Segmentation, Computer Hardware, Forest Fires). Additionally, two variants were introduced to improve cost and efficiency by reducing the set of points included in calculations.
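The basic classifier and regressor can be sketched in a few lines of Python. This is a minimal illustration that assumes Euclidean distance, majority voting, and an unweighted mean; the function names are hypothetical, and the reduced-set variants mentioned above are not shown.

```python
import math
from collections import Counter


def k_nearest(train_x, query, k):
    """Indices of the k training points closest to the query (Euclidean)."""
    order = sorted(range(len(train_x)),
                   key=lambda j: math.dist(train_x[j], query))
    return order[:k]


def knn_classify(train_x, train_y, query, k=3):
    """Majority vote among the k nearest neighbors."""
    votes = Counter(train_y[j] for j in k_nearest(train_x, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_x, train_y, query, k=3):
    """Unweighted mean of the k nearest neighbors' targets."""
    neighbors = k_nearest(train_x, query, k)
    return sum(train_y[j] for j in neighbors) / len(neighbors)
```

Every prediction scans the full training set, which is the cost that set-reduction variants aim to lower.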
Classification and Regression Decision Trees
Machine Learning Course Project, 2020
The ID3 algorithm was applied to three classification problems, using gain ratio as the splitting criterion, with post-fit pruning to prevent overfitting. The CART algorithm was applied to three regression problems, using mean squared error as the splitting criterion; a validation set was used to tune an early-stopping parameter in order to prevent overfitting. Together, the decision trees were applied to six input datasets obtained from the UCI Machine Learning Repository (Abalone, Car Evaluation, Image Segmentation, Computer Hardware, Forest Fires, Wine Quality).
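The two splitting criteria can be illustrated with a short Python sketch. It assumes a categorical feature for gain ratio and a single numeric threshold for the regression split; the function names are hypothetical and the tree-building and pruning logic is omitted.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain of a categorical split divided by its split information."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(feature_values)
    return gain / split_info if split_info > 0 else 0.0


def split_mse(feature_values, targets, threshold):
    """Mean squared error after splitting a numeric feature at a threshold."""
    left = [t for v, t in zip(feature_values, targets) if v <= threshold]
    right = [t for v, t in zip(feature_values, targets) if v > threshold]

    def squared_error(group):
        if not group:
            return 0.0
        mean = sum(group) / len(group)
        return sum((t - mean) ** 2 for t in group)

    return (squared_error(left) + squared_error(right)) / len(targets)
```

A tree builder would evaluate these scores for every candidate feature (and threshold) at a node and split on the one with the highest gain ratio or the lowest mean squared error.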
Adaline and Logistic Regression
Machine Learning Course Project, 2020
Implemented two linear classifier algorithms: Adaline and Logistic Regression. Adaline predicts a class value (0 or 1) using a linear transformation, whereas Logistic Regression applies a log-odds transformation to the target class in order to predict the likelihood that an instance belongs to a class. Here, both algorithms were trained using gradient descent and applied to five datasets obtained from the UCI Machine Learning Repository (Breast Cancer, Glass, Iris, Soybean, House Votes).
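The contrast between the two models can be sketched with one shared training loop in Python. This is a simplified illustration assuming binary 0/1 targets, batch gradient descent, and a 0.5 decision threshold; with squared error for Adaline and cross-entropy for logistic regression, both gradients reduce to (output minus target) times the input, so only the activation differs. All names and defaults are hypothetical.

```python
import math


def identity(z):
    """Adaline: the output is the raw linear combination."""
    return z


def sigmoid(z):
    """Logistic regression: squash the linear combination into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_classifier(xs, ys, activation, lr=0.1, epochs=5000):
    """Batch gradient descent; returns (weights, bias)."""
    weights = [0.0] * len(xs[0])
    bias = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = activation(z) - y
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(x, weights, bias, activation):
    """Threshold the model output at 0.5 to obtain a 0/1 class."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation(z) >= 0.5 else 0
```

Passing identity trains an Adaline-style model and passing sigmoid trains a logistic regression; multi-class datasets would need an extension such as one-versus-rest, which is not shown here.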
Feed-Forward Neural Networks and Backpropagation
Machine Learning Course Project, 2020
The backpropagation algorithm was used in conjunction with gradient descent. Neural networks with 0, 1, and 2 hidden layers were applied to five datasets obtained from the UCI Machine Learning Repository (Breast Cancer, Glass, Iris, Soybean, House Votes). The size of the hidden layers was tuned using a validation set. A sigmoid activation function was used to introduce nonlinearities into the network, and a softmax output layer was used to train multi-class classifiers.
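A minimal Python sketch of backpropagation through one sigmoid hidden layer and a softmax output follows. It assumes a cross-entropy loss and omits bias terms for brevity; the function names and weight layout (one row of weights per unit) are hypothetical and not taken from the project code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, w1, w2):
    """w1 has one row per hidden unit; w2 has one row per output class."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    probs = softmax([sum(w * h for w, h in zip(row, hidden)) for row in w2])
    return hidden, probs


def loss(x, target, w1, w2):
    """Cross-entropy loss for a single example with integer class label."""
    return -math.log(forward(x, w1, w2)[1][target])


def backprop(x, target, w1, w2):
    """Gradients of the cross-entropy loss with respect to w1 and w2."""
    hidden, probs = forward(x, w1, w2)
    # Softmax with cross-entropy: output delta is probability minus one-hot.
    delta_out = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    grad_w2 = [[d * h for h in hidden] for d in delta_out]
    # Propagate the deltas back through w2 and the sigmoid derivative h(1 - h).
    delta_hidden = [
        h * (1.0 - h) * sum(delta_out[k] * w2[k][j] for k in range(len(w2)))
        for j, h in enumerate(hidden)
    ]
    grad_w1 = [[d * xi for xi in x] for d in delta_hidden]
    return grad_w1, grad_w2
```

A gradient descent step subtracts a learning rate times these gradients from w1 and w2. Comparing the result against finite-difference estimates of the loss is a simple way to check a backpropagation implementation.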
Reinforcement Learning with Value Iteration, Q-Learning, and SARSA
Machine Learning Course Project, 2020
This project used these three reinforcement learning techniques to solve the Racetrack problem, a standard control problem. The algorithms were compared on the scores they obtained on various racetracks while varying parameters such as the convergence tolerance and the number of iterations. This project found that the performance of all three algorithms was comparable.
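The three techniques differ mainly in how they back up values, which the following Python sketch illustrates. To stay short it assumes a toy five-state corridor in place of the Racetrack environment, with a reward of -1 per step and a terminal goal state; the function names and default parameters are hypothetical.

```python
def value_iteration(n_states=5, gamma=0.9, tol=1e-6):
    """Corridor MDP: move left or right, reward -1 per step, last state is terminal."""
    goal = n_states - 1
    values = [0.0] * n_states
    sweeps = 0
    while True:
        delta = 0.0
        for s in range(goal):
            left = -1.0 + gamma * values[max(s - 1, 0)]
            right = -1.0 + gamma * values[min(s + 1, goal)]
            best = max(left, right)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        sweeps += 1
        if delta < tol:  # convergence tolerance on the largest value change
            return values, sweeps


def q_learning_update(q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the best action available in the next state."""
    target = reward + gamma * max(q[s_next].values())
    q[s][a] += alpha * (target - q[s][a])


def sarsa_update(q, s, a, reward, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action actually taken in the next state."""
    target = reward + gamma * q[s_next][a_next]
    q[s][a] += alpha * (target - q[s][a])
```

Value iteration needs the transition model and sweeps every state, while Q-learning and SARSA learn from sampled transitions and differ only in the bootstrapped target.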