Spring 2020 Projects – Data Science Capstone Course @ University of Rochester

KOAA

Kids Out and About.com (KOAA) is an online guide that provides parents with information on fun, educational, and cultural enrichment activities for children, happening in their local area. Using their data, KOAA wanted to answer the following questions: Is our social media advertising campaign affecting website growth? When is the most effective time to advertise on social media? When will our website reach 10 million visits in one year? Which social media app is preferred by parents? Which activities are preferred by parents? How do different regions of the country differ in preferred activities? The following four capstone projects sought to answer these questions.

KOAA – Social Media: Analysis of Social Media Usage Patterns for an Online Platform

Team Members: Haozhang Deng, Xin Li, Zechen Wang, Xupin Zhang

The goals of this project were to examine the relationship between social media usage and parents’ behavior, and to understand social media usage across all 36 KOAA sites over time. After reviewing the survey data, team members discovered that it was biased rather than random: KOAA’s open survey was being misused by vendors to gain higher rankings. The team used data science techniques to identify “cheaters” – vendors submitting multiple surveys. Students found the submission behavior for highly recommended vendors to be suspect. First, “cheating” vendors would repeatedly re-submit the open survey from the same computer. Second, they would vote for the same organization every time. To combat this behavior, students identified and removed all surveys that clustered around those signals. They then used a popularity algorithm to identify the social media platforms used, and visualized the data by demographic group.
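The two signals described above (repeat submissions from one computer, always voting for the same organization) can be illustrated with a minimal sketch. This is not the team's actual method; the field names `device_id` and `vendor`, the threshold, and the sample records are all assumptions made for illustration.

```python
from collections import Counter

def flag_suspect_surveys(surveys, max_repeats=1):
    """Return indices of surveys whose (device, vendor) pair
    appears more than `max_repeats` times in the data."""
    counts = Counter((s["device_id"], s["vendor"]) for s in surveys)
    return [i for i, s in enumerate(surveys)
            if counts[(s["device_id"], s["vendor"])] > max_repeats]

# Made-up survey records (hypothetical fields, not KOAA data).
surveys = [
    {"device_id": "A", "vendor": "Zoo"},
    {"device_id": "A", "vendor": "Zoo"},
    {"device_id": "A", "vendor": "Zoo"},
    {"device_id": "B", "vendor": "Museum"},
    {"device_id": "C", "vendor": "Zoo"},
]

suspect = set(flag_suspect_surveys(surveys))
# Keep only the surveys that were not flagged.
clean = [s for i, s in enumerate(surveys) if i not in suspect]
```

In this toy data the three identical submissions from device "A" are flagged and removed, while the single "Zoo" vote from device "C" is kept.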

KOAA – Forecasting: Forecasting Website Traffic Growth for an Online Portal

Team Members: Zack Azadian, Haukur Björnsson, Nick Borcyk, Tristan De Alwis

The goal of this project was to forecast when the KOAA website would reach 10 million unique visits in a 12-month period. First, team members determined that KOAA’s existing sites followed a generic logistic growth pattern. However, the company planned to open nine new sites in the future, which had to be included in the model. Students therefore added census data from the existing sites to aid in forecasting visits for the planned sites, expecting a logistic growth curve. The group established three levels of optimism in their estimation and used a grid search algorithm that iteratively tuned the parameters based on the number of users. Once the new data was created, the team used the random forest technique to predict future website usage.
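As a rough illustration of fitting a logistic growth curve by grid search, the sketch below tries every combination of three parameters and keeps the one with the lowest squared error. It is a plain exhaustive grid under stated assumptions, not the team's tuning procedure; the parameter names, grid values, and synthetic visit counts are hypothetical.

```python
import math
from itertools import product

def logistic(t, capacity, rate, midpoint):
    """Logistic growth curve: values level off at `capacity`."""
    return capacity / (1.0 + math.exp(-rate * (t - midpoint)))

def grid_search_logistic(months, visits, capacities, rates, midpoints):
    """Try every parameter combination and return
    (sse, capacity, rate, midpoint) with the lowest squared error."""
    best = None
    for capacity, rate, midpoint in product(capacities, rates, midpoints):
        sse = sum((logistic(t, capacity, rate, midpoint) - v) ** 2
                  for t, v in zip(months, visits))
        if best is None or sse < best[0]:
            best = (sse, capacity, rate, midpoint)
    return best

# Synthetic monthly visits generated from a known curve (not KOAA data).
months = list(range(24))
visits = [logistic(t, 1000.0, 0.5, 10.0) for t in months]

sse, capacity, rate, midpoint = grid_search_logistic(
    months, visits,
    capacities=[500.0, 1000.0, 1500.0],
    rates=[0.25, 0.5, 0.75],
    midpoints=[5.0, 10.0, 15.0],
)
```

Because the synthetic data come from one of the grid points, the search recovers those parameters exactly; with real traffic data the grid would need to be finer and the fit would only be approximate.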

KOAA – Facebook: Effect of Facebook Advertising on Website Visitor Traffic

Team Members: Yuhan Gao, Zhifan Nan, Peijun Xu, Yahan Yang

For this project, KOAA’s main goals were understanding: (1) the influence of Facebook advertising on website visitor growth, (2) if and how Facebook advertising accelerates the company’s revenue curve, and (3) optimal investment strategies for Facebook advertising during different months and in different regions. During the data gathering and cleaning process, team members used data normalization to rescale the data to a common range of 0 to 1. Then, the group used correlation matrices to observe the effects of Facebook advertising, and performed statistical testing on their results using one-way ANOVA and Tukey’s test. Finally, students used box plots to visualize the results, determining cost efficiencies per advertising season.
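The normalization and one-way ANOVA steps can be sketched in a few lines. This is an illustrative sketch only: the function names and the sample numbers are assumptions, and it computes just the F statistic (a p-value or Tukey's test would normally come from a statistics library).

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_way_anova_f(groups):
    """F statistic: between-group variance divided by within-group variance."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Made-up numbers for illustration (not KOAA data).
scaled = min_max_scale([120, 300, 480])
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

A large F statistic suggests that at least one group mean (for example, one advertising season) differs from the others; a follow-up such as Tukey's test then indicates which pairs differ.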

KOAA – Google Analytics: Evaluating Google Analytics Data for an Online Portal

Team Members: Spencer Leonardi, Matthew Koo, Sahaj Somani, Tanmay Thakkar

KOAA currently uses the Google Analytics product on all of its sites. For this project, the goal was to learn how to download data collected by Google and convert it into usable analytics. First, team members mastered the Google Analytics API and downloaded all pertinent data. After organizing the downloaded data, they created a category model to analyze it. Finally, students used clustering to reduce the data’s features, and employed an additive seasonality model to visualize it.

Hajim School of Engineering, University of Rochester

Team 1: Identifying Academic Factors that Influence Student Graduation Rates

Members: Jaeheuk Jung, Zi Qing Liang, Nishali Parikh, Jungmin Park

The aim of this project was to examine the correlation between GPA and graduation rate. The Hajim School wanted to learn if and how students can recover from a disastrous semester (GPA <1.0), how double majors affect graduation rates, and how semester grade performance correlates with graduation rates. Team members used a classification model to group data, utilized numerous techniques to visualize data, and represented missing data, i.e., students who did not graduate.

Team 2: An Analysis of Graduation Patterns of Transfer Students at the University of Rochester

Members: Mawada Elmahgoub, Maya Haigis, John Polimeni, Anna Shors

The goal of this project was to study graduation rates across Hajim School majors, and graduation rates for transfer students. Team members employed a classification model and developed a predictive model to flag underperforming students. They also interpreted data into new variables to determine success factors. Many demographics emerged, which helped with data visualization. Finally, students utilized a random forest model to analyze the resulting data.

Hydrology Modeling: Validating Hydrology Models using Historical Data and Regression Analysis

Team Members: Justin Clifford, Owen Isley, Zhaoxiong Ding, Gabriel Ramirez

The Caldwell-Fay equation has historically been used to assess natural water levels in Lake Ontario. Such measurements are essential for determining the impact of Lake projects, such as the Moses-Saunders Hydropower Dam and St. Lawrence Seaway, to ensure that they do not violate the 1909 Boundary Waters Treaty. In this project, team members extracted historical data from physical records, and investigated four regression-based models as alternatives to the Caldwell-Fay Equation. The team ultimately discovered that the Caldwell-Fay Equation is still accurate and aligns with calculated models. They also ascertained that the 1909 Boundary Waters Treaty was not violated by the Moses-Saunders or St. Lawrence Seaway projects.

City of Rochester: Relationship between Vacant Structures and Crime in the City of Rochester

Team Members: Yaxin Liu, Peiran Chen, Xiangyi Liu, Maralmaa Erdenebat

In this project, team members studied the correlation between contributing factors to crime, such as vacant structures, and crime rates in the city of Rochester. The goal was to predict the reduction in the crime rate brought about by demolishing an identified vacant property. The City of Rochester planned to use the results to inform future decisions. Students combined demographic data with city vacant building and crime records to create predictive models. They then used a random forest model to build a system that recommends structures for demolition by scoring each one on its predicted reduction in the crime rate.

URMC

Team 1: Clustering Methods for Finding Insights in PRO-CTCAE Data

Members: Anya Greenberg, Thomas Hanson, Ethan Otto, Austin Varela

The Patient Reported Outcomes version of the Common Terminology Criteria for Adverse Events (PRO-CTCAE) is a measure of a patient’s experience with symptoms. In geriatric patients, quality of life and tolerability of treatment are often regarded as higher priorities than survival rate. In this project, team members performed exploratory analysis to understand trends in PRO-CTCAE measures and hospitalizations. Students explored clustering as a technique for further analysis, and ultimately utilized subspace clustering to analyze data.

Team 2: Analysis and Identification of Caregiver Deterioration

Members: Chenwei Wu, Haosong Rao, Xuening Zhang, Meizhu Wang

This team examined two separate issues. First, they used fuzzy matching and association rules to clean a dataset of drug names, pairing them with generics based on a reference dataset. Then, team members investigated burnout among caregivers of elderly cancer patients by analyzing six calculated metrics for caregiver quality of life. The team applied Support Vector Machine, Gradient Boost, and other machine learning techniques to these metrics to aid in the prediction and analysis of contributing factors to caregiver deterioration. They also used LIME to further explore the resulting models and to better understand the impact of component features.
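Fuzzy matching of drug names against a reference list can be sketched with Python's standard `difflib` module. The reference table, the `to_generic` helper, and the 0.8 similarity cutoff below are assumptions chosen for illustration, not the team's dataset or settings.

```python
from difflib import get_close_matches

# Tiny hypothetical reference table: known name (brand or generic) -> generic.
REFERENCE = {
    "tylenol": "acetaminophen",
    "acetaminophen": "acetaminophen",
    "advil": "ibuprofen",
    "ibuprofen": "ibuprofen",
}

def to_generic(raw_name, cutoff=0.8):
    """Map a possibly misspelled drug name to a generic name,
    or return None when no reference entry is similar enough."""
    cleaned = raw_name.strip().lower()
    matches = get_close_matches(cleaned, list(REFERENCE), n=1, cutoff=cutoff)
    return REFERENCE[matches[0]] if matches else None

examples = ["Tylenl", "ibuprofin", "xyz"]
results = [to_generic(name) for name in examples]
```

A higher cutoff reduces false pairings at the cost of leaving more names unmatched, so unmatched names (returned as `None` here) would typically be set aside for manual review.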

Vnomics

Team 1: Diesel Particulate Filter (DPF) Failure Prediction

Members: Jiayin Lin, Wenzhou Ma, Jian Mao, Yujie Wang

In order to reduce the negative impact of exhaust gas, most new tractor trailers are equipped with Diesel Particulate Filters (DPF). However, DPFs can clog with soot after extended use. For this project, team members applied neural networks and regressions to company-collected time series data on vehicle performance. Students were able to predict DPF failure at least two weeks in advance, and the final neural network achieved excellent (~90%) precision and recall in predicting breakdowns early.
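For readers unfamiliar with the two metrics quoted above, the sketch below computes precision and recall from binary labels. The label vectors are made up for illustration (1 means a DPF failure); they are not Vnomics data and do not reproduce the reported results.

```python
def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels: 1 = failure occurred / was predicted, 0 = no failure.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

precision, recall = precision_recall(y_true, y_pred)
```

Precision answers "of the predicted failures, how many were real?", while recall answers "of the real failures, how many were caught?"; both matter when a missed breakdown and a false alarm each carry a cost.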

Team 2: Engine RPM Profile Detection

Members: Chunlei Zhou, Zichu Li, Haoyuan Dong, Yuan Lu

Vnomics helps its drivers become more fuel efficient by coaching them to improve their behaviors (e.g., speeding, idling, and engine control). Although the RPM profile of a vehicle’s engine can help identify poor engine control behavior, it is not always readily available. For this project, team members worked to detect vehicle RPM profiles with high accuracy and confidence, and to uncover the relationship between detection accuracy and drivers’ engine speed control scores. Students applied regression approaches to data collected from Vnomics products, identifying the RPM profiles of vehicles with high accuracy.