Spring 2021 Projects – Data Science Capstone Course @ University of Rochester

Eastman School of Music: An Analysis of Web-Based Measurement of Individual Students’ Musical Performance

Team Members: Xu Jin, Bowei Wang, Yongjie Yin, Yu Gu, Jiarui Hu

The goal of this project was to test correlations between different aspects of musical teaching, laying a foundation for further research. The team employed a variety of statistical methods in Python and R, including linear regression, the Shapiro-Wilk normality test, and two forms of correlation testing (Pearson and Spearman). Team members also experimented with data preprocessing to determine the best version of the dataset, and used visualization techniques such as heatmaps and regression plots. Ultimately, the team was able to disprove some well-established musical theories and provide a framework for future research.
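To illustrate how the two correlation tests mentioned above differ, here is a minimal sketch in plain Python. It is not the team's code: the variable names and the data are hypothetical, chosen only to show that Spearman measures monotonic association while Pearson measures linear association.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: a monotonic but non-linear relationship.
practice_hours = [1, 2, 3, 4, 5, 6]
performance_score = [1, 4, 9, 16, 25, 36]

pearson_r = pearson(practice_hours, performance_score)
spearman_rho = spearman(practice_hours, performance_score)
print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_rho:.3f}")
```

Because the made-up scores rise monotonically but not linearly, the Spearman coefficient reaches 1 while the Pearson coefficient stays below 1. This is one reason to report both, particularly when a normality test suggests the data are not normally distributed.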

Lake Ontario Hydrology: Predicting Water Flow Rates and Levels around Lake Ontario

Team Members: Ling Liu, Xiaobo Luo, Guanjie Linghu, Sung Beom Park, Francis Ambrosini

The water system around Lake Ontario is complex and prone to occasional overflow, which can result in losses for local residents. However, locals have limited access to information about the water system. The team’s objective was to build a model that predicts Lake St. Louis water levels and identifies the maximum water flow tolerance of the Moses-Saunders dam. Team members employed machine learning techniques, such as function fitting and Elastic Net, to predict the Lake St. Louis water level. The team also gradually decreased the number of features used to improve the model’s practicability, and altered the functional form of the model to increase its physical interpretability, capturing the complex dynamics around Lake St. Louis.

Rochester Crime and Convenience: Relationship between Crimes and Convenience Stores in the City of Rochester

Team Members: Ryan Algier, Lydia Barnard, Nick Cimaszewski, Sean Flynn, Webster Kehoe

The City of Rochester Police Department responds to thousands of crimes annually and is seeking strategies to optimize its responses. Optimization requires exploration of crime-influencing factors, such as the proximity of crimes to convenience stores, liquor stores, and vacant structures. For this project, team members used ArcGIS software to develop advanced visualizations of the areas around convenience stores and their distances from reported crimes, highlighting crime density in Rochester. They also implemented regression models with census tract data to determine the leading factors contributing to crime, and quantified the correlation between local crime levels and the presence of convenience stores, liquor stores, and vacant structures.

Rochester Crime and COVID-19: Analysis and Visualization of Crime throughout the Covid-19 Pandemic

Team Members: Changhyun Lee, Arun Ramesh, Dylan Dibello, Sampson Abiola, Claudio Figueroa

On March 22nd, 2020, the state of New York implemented a statewide stay-at-home order to mitigate the spread of COVID-19. The lockdown worked to slow the spread of the virus, and as the number of COVID-19 cases dropped, New York State began a four-phase reopening plan. For this project, the City of Rochester wanted to understand how the lockdown and subsequent phased reopening affected crime rates in the area. Through a series of statistical tests, visualizations, and time-series forecasting models, the team determined how the lockdown affected city crime rates for different types of crime. Team members performed analysis at varying levels of granularity, ranging from the city as a whole to the census-tract level, identifying how socioeconomic differences in various regions correlate with changes in crime. The final results showed an immediate decrease in crime at the start of the New York State lockdown; however, the effects of the lockdown on crime did not last.

Rochester Museum and Science Center (RMSC)

Team 1: Analysis and Prediction of Museum Membership Churn

Members: Ruqin Chang, Ruiyu Zhou, Ziyi You, Shengyang Wu, Valerie Tam

The goal of this project was to use data analysis and predictive modelling to identify which RMSC members are most likely to renew their museum memberships and/or make donations. The team conducted extensive visualization analysis on customer-earned revenue transactions, membership transactions, donation transactions, and customer journeys. During the analysis process, team members collected interim conclusions and recommendations, and provided them to the museum for feedback. They also used predictive modelling techniques, such as Time Series Analysis, Logistic Regression, and the K-Nearest Neighbors Classifier, to predict the seasonality of membership renewals. The team’s predictive models achieved 90% accuracy on average, and mean squared errors improved significantly after appropriate model adjustments. Students wrapped up the project by providing RMSC’s administration with recommendations to help increase membership and encourage donations.

Team 2: Examining Member Behavior for Rochester Museum and Science Center

Members: Mercy Salome Jemutai, Harshitha Gangannagari, Mikhail Arinarkin, Rebecca Sarin, Anindini Singh

In an effort to increase its revenue, RMSC asked this team to examine how museum members behave. The goal was to gain insight, for marketing purposes, into membership renewal and the donations members make to the museum. The team preprocessed and transformed the data, then performed exploratory analysis to get a sense of the data and to examine renewals and donations. Team members also performed statistical analysis using correlation. They then built three predictive models to analyze two binary labels (renewal vs. non-renewal and donating vs. not donating), using different subsets of features as input.

URMC – COVID-19 Pandemic Twitter Sentiment Analysis: Analyzing Mental Health Condition using COVID-19 Twitter Data

Team Members: Li Sun, Senqi Zhang, Daiwei Zhang, Yue Liu, Pin Li

This project focused on analyzing data collected from Twitter to determine how the COVID-19 pandemic has impacted users’ mental health. The project tracked how Twitter user sentiment evolved over time and space, which topics users were most concerned with, and which groups of people were most likely to be afflicted by pandemic-related mental health issues. The team used sentiment analysis (VADER) to quantify users’ feelings toward mental health issues, topic modeling (LDA) to extract the most frequently mentioned topics, and Face++ to infer users’ demographic groups. Team members discovered that negative feelings were the prevailing sentiment among Twitter users. They also identified users’ most common mental health symptoms and their causes. Finally, students pinpointed user groups with higher likelihoods of mental health issues, using age and gender demographics.

URMC – COVID-19 Vaccine Twitter Sentiment Analysis: COVID-19 Vaccine and Public Perception using Social Media Data

Team Members: Xueting Wang, Yan Jiang, Yuhan Chen, Shengyuan Huang, Haoxuan Ma

This project explored public perception toward the COVID-19 vaccine through analysis of Twitter data. The first topic the team focused on was public perception toward the vaccine in the United States. Team members found that vaccine discussions in the US were affected by pandemic-related policies and vaccine-related news. The second topic the team explored was public perception toward the Pfizer and Moderna vaccines. Overall, Pfizer received much more attention than Moderna. For both vaccines, Twitter users’ main focus was clinical trials and phases. However, with Pfizer, users mainly discussed symptoms and side effects, whereas with Moderna, users mainly discussed alleged pharmacist sabotage of vaccine doses. Team members employed longitudinal analysis, sentiment analysis (VaderSentiment), topic modeling analysis (LDA), and demographic analysis (DeepFace) to analyze Twitter data.

URMC – Geriatric Oncology: Understanding Factors That Affect Chemotherapy Dosage In Vulnerable Older Adults With Advanced Cancer

Team Members: Khoa Hoang, Christopher Pak, Frank Gonzalez, Luke Nau, and Erika Ramsdale

Vulnerable older adults with cancer are an understudied population, and predicting which patients will tolerate chemotherapy is challenging. For this project, the team used baseline clinical data for 718 older adults with advanced cancer, developing and comparing predictive models (including logistic regression, decision tree, Random Forest, XGBoost, k-nearest neighbor, and naïve Bayes) to classify which patients receive sufficient versus insufficient chemotherapy over the subsequent 3 months. Students developed a general pre-processing pipeline to prepare data for analysis, employed a novel feature selection method supplementing clinical experience, and provided feedback on important features driving model predictions. Logistic regression was identified as both the best-performing and most interpretable model, and several important features were identified, distinguishing new foci for future investigation by the sponsor.

UR Utilities: Forecasting Chilled Water Demand for the University of Rochester Campus

Team Members: Stephen Schneider, Jacob Vigran, Justin Cavanaugh, Thayalan Kundralan, Wade Bennett, Jen Rowe

For this project, team members forecasted chilled water demand to optimize chiller usage and energy efficiency. The team used Gradient Boosting and Extra Trees Regression to predict the maximum chilled water output at hourly and daily levels. Team members also used time series models, such as Prophet, to provide insight into daily, weekly, and yearly consumption patterns. The final models were highly accurate, with the hourly forecast predicting approximately 800 tons of chilled water output with 95% confidence.

Vnomics

Team 1: Diesel Particulate Filter Failure Prediction for Vnomics

Members: Matthew Evan Taruno, Daniel He, James Sastrawan, Vatsal Agarwal, Linzan Ye

Diesel particulate filters (DPFs) are essential particulate emission control devices for large vehicles, and DPF failure can be very costly for truck companies. For this project, team members treated the DPF failure problem as an anomaly detection problem, employing autoencoders to predict DPF failure two weeks in advance. Students leveraged principal component analysis, random forest, and TSFresh to obtain an efficient dataset. They also used MLflow for hyperparameter tuning to boost the recall score of the large vehicle models, obtaining a final recall score of 0.58.
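For readers unfamiliar with the recall metric reported above, the following sketch shows how recall is computed once an anomaly detector has flagged samples. It is illustrative only: the reconstruction errors, the failure labels, and the threshold are made-up values, and no autoencoder is trained here.

```python
def recall(y_true, y_pred):
    """Recall = true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical reconstruction errors, standing in for the output of an
# autoencoder trained on normal (non-failing) operating data.
reconstruction_errors = [0.02, 0.03, 0.40, 0.05, 0.31, 0.04, 0.08, 0.06]

# Hypothetical ground truth: 1 = the vehicle's DPF later failed.
failed = [0, 0, 1, 0, 1, 0, 1, 0]

# Assumed threshold: samples the model reconstructs poorly are flagged.
threshold = 0.10
flagged = [1 if err > threshold else 0 for err in reconstruction_errors]

score = recall(failed, flagged)
print(f"recall = {score:.2f}")
```

In this toy data, two of the three true failures exceed the threshold, so recall is 2/3. Lowering the threshold would catch the third failure but would typically raise more false alarms, which is the trade-off that hyperparameter tuning has to balance.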

Team 2: Diesel Particulate Filter Failure Prediction for Vnomics

Members: Xinyu Guo, Yu Hao, Pinyi Wu, Yuman Xie, Yuqi Zeng

Diesel engine-powered tractor trailers are outfitted with Diesel Particulate Filters (DPFs) to control emissions. However, DPFs can become clogged over time, resulting in costly roadside breakdowns. For this project, the team explored raw time-series data in detail, employing the TSFresh method to predict potential DPF failure. Members analyzed multiple classification models (such as SVM, Logistic Regression, k-NN, and Random Forest) and tuned four sets of relevant parameters to enhance model performance. The best-performing classifiers were SVM, k-NN, Logistic Regression, and Random Forest, with the positive label (y=1) moved to one day before the failure date, a 60-day length of data, and a 10-day window size.
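The labeling and windowing settings above can be read in more than one way. The sketch below shows one possible interpretation, not the team's actual pipeline: a 60-day history ending on the failure date is cut into sliding 10-day windows, and only the window ending one day before the failure is labeled y=1. The function name and its parameters are hypothetical.

```python
def make_labeled_windows(history_days, window_size, lead_days):
    """Return (start_day, end_day, label) tuples for sliding windows.

    Days are 0-indexed and end_day is inclusive. The failure is assumed
    to occur on the last day of the history. The final window ends
    `lead_days` before the failure and is the only one labeled 1.
    """
    failure_day = history_days - 1
    last_end = failure_day - lead_days
    windows = []
    for end in range(window_size - 1, last_end + 1):
        start = end - window_size + 1
        label = 1 if end == last_end else 0
        windows.append((start, end, label))
    return windows

windows = make_labeled_windows(history_days=60, window_size=10, lead_days=1)
print(len(windows), "windows; positive window:", windows[-1])
```

Under this reading, each window would then be summarized into features (for example with TSFresh) before being passed to a classifier. The heavy class imbalance, one positive window per failure, is a common reason to evaluate such models with recall rather than accuracy.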