Spring 2022 Projects – Data Science Capstone Course @ University of Rochester

Benchmark Labs

Team 1: Powdery Mildew and Plant Disease Forecasting

Members: Yihan Shao, Chuqin Wu, Siyu Xue, Zihe Zheng

For this project, Benchmark Labs sought to calculate and forecast the effects of Powdery Mildew, allowing farmers to more effectively mitigate the impacts of the disease on their crops. Students were tasked with predicting the Powdery Mildew index and level based on time series and climate data. To achieve this goal, the team employed various time series models, such as ARIMA (Autoregressive Integrated Moving Average) and LSTM (Long Short-Term Memory), along with classification models like Random Forest, Gradient Boosting, Neural Network, and KNN. Their LSTM model achieved a mean squared error (MSE) of 4.633 for predicting the next hour (based on the past seven days) and had the potential to predict the next seven days (based on the past 30 days). In addition, the team’s best classification model achieved 75 percent accuracy when detecting Powdery Mildew levels.
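
The "past seven days to next hour" setup described above can be illustrated with a sliding-window split. This is only a minimal sketch under stated assumptions: the data is synthetic, the function names `make_windows` and `mean_squared_error` are hypothetical, and no LSTM is trained here. It shows how an hourly series might be cut into input/target pairs and how MSE is computed, not the team's actual pipeline.

```python
from typing import List, Sequence, Tuple

HOURS_PER_DAY = 24


def make_windows(
    series: Sequence[float], lookback: int, horizon: int = 1
) -> Tuple[List[List[float]], List[List[float]]]:
    """Split a series into (input window, target) pairs.

    Each input holds `lookback` consecutive values; each target holds
    the `horizon` values that immediately follow it.
    """
    inputs: List[List[float]] = []
    targets: List[List[float]] = []
    last_start = len(series) - lookback - horizon
    for start in range(last_start + 1):
        end = start + lookback
        inputs.append(list(series[start:end]))
        targets.append(list(series[end:end + horizon]))
    return inputs, targets


def mean_squared_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Synthetic hourly index values (an assumption, not project data): 10 days.
hourly_index = [float(i % 10) for i in range(10 * HOURS_PER_DAY)]

# Past seven days of hourly values as input, the next single hour as target.
X, y = make_windows(hourly_index, lookback=7 * HOURS_PER_DAY, horizon=1)
```

A sequence model would then be fit on `X` and `y`; the longer setup mentioned above (past 30 days to next seven days) would correspond to `lookback=30 * HOURS_PER_DAY` and `horizon=7 * HOURS_PER_DAY` in this sketch.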

Team 2: Chilling Hour Forecasting Using Environmental Variables

Members: Selene Bao, Harshita Jaiswal, Yiwei Li, Yuqing Xue

The goal of this project was to obtain historical and climatology records for select environmental variables for a location of interest, given a set of coordinates (latitude and longitude) with different lead times. The team was tasked with using these records to estimate cumulative chilling hours (the number of hours in which temperatures are between 35 and 45 degrees Fahrenheit) over a specific window of time. Students first used APIs to access environmental data from publicly available sources such as World Weather Online and NOAA (National Oceanic and Atmospheric Administration). They then performed data analysis and applied statistical models to predict the cumulative chilling hours for different lead times (short-term: 7-14 days; long-term: 3-6 months). Time-series forecasting techniques such as SARIMAX, VAR, and Prophet were used to estimate the short-term hourly temperature. Because current weather patterns are strongly connected to the previous day’s weather, the team developed a Markov Chain model to predict weekly weather conditions. For long-term forecasting, students utilized LSTM models with differencing and tree-based hill-climbing models, such as random forest and gradient boosting, to estimate cumulative chilling hours.
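
The chilling-hour definition above translates directly into a running count over hourly temperatures. The following is a minimal sketch, assuming the 35 and 45 degree bounds are inclusive (the description does not say) and using made-up temperatures; the function names are hypothetical.

```python
from itertools import accumulate
from typing import Iterable, List

# Bounds from the project description; treating them as inclusive is an assumption.
CHILL_MIN_F = 35.0
CHILL_MAX_F = 45.0


def is_chilling_hour(temp_f: float) -> bool:
    """Return True when an hourly temperature falls in the chilling range."""
    return CHILL_MIN_F <= temp_f <= CHILL_MAX_F


def cumulative_chilling_hours(hourly_temps_f: Iterable[float]) -> List[int]:
    """Running total of chilling hours, one entry per input hour."""
    return list(accumulate(int(is_chilling_hour(t)) for t in hourly_temps_f))


# Illustrative hourly temperatures in degrees Fahrenheit (not project data).
temps = [33.0, 36.5, 40.0, 45.0, 47.2, 44.9]
running_total = cumulative_chilling_hours(temps)  # [0, 1, 2, 3, 3, 4]
```

In a forecasting setting, the same function could be applied to predicted hourly temperatures, so that errors in the temperature forecast carry through to the chilling-hour estimate.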

City of Rochester

Team 1: Efficiently Growing Rochester Bike Infrastructure Based on Network Theory

Members: Margaret Brennan, Quang Hoàng, Blaise Derenze, Nate August

For this project, the team provided recommendations to the City of Rochester for growing bicycle infrastructure from an existing network. Students suggested improvements on previous work, including establishing a network that fully connects common bike origins and destinations (points of interest). Students proposed updating existing bike infrastructure via prioritized routing and iterative pruning, affirmed priority areas previously identified by experts at the City of Rochester, and suggested new priority roadways to expand the city’s bike infrastructure.

Team 2: Object Identification from Road Structures Aerial Snapshots

Members: Madhav Reddy Kunduru, Namitha Vanama, Nicholas Ndahiro

For this project, the City of Rochester sought to develop a complete inventory of pedestrian walkways and intersections, classifying the inventory in accordance with the Pedestrian Environmental Quality Index (PEQI). Students were tasked with analyzing images of road intersections to detect specific features assessed by the PEQI, such as crosswalks, curb ramps, and low-quality crosswalks. The team employed YOLOv5 as an object detection model and successfully identified PEQI features at local walkways and intersections; they also highlighted missing features.

Corning

Team 1: Customer Churn Prediction

Members: Theo Chapman, Juney Lee, Emma Schechter, Kenzie Potter

The goal of this project was to identify Corning customer purchase patterns and to determine when customers churn. Customer “churn” is the rate at which patrons stop using a business’s products or services. By identifying when customers churn, Corning can target sales efforts toward at-risk clients with the goal of retaining them. First, the team took an anomaly detection approach, utilizing autoencoders with a sliding reference window to determine when churning occurs. Students then delivered the autoencoder method code, qualitative analysis code, and labeled customer data (indicating whether customers churned) to Corning.

Team 2: Segmentation on Products, Distributors, and Customers

Members: Trevor Van Allen, Matt Boundy, Ryan Stefkovich, Michael Crowe

The goal of this project was to help Corning Glass find ways to segment their customers, distributors, and/or products to improve sales, allow the leveraging of targeted products, and improve demand forecasting. To achieve this goal, the team utilized exploratory analysis, data aggregation, outlier detection, principal component analysis (PCA), and K-Means clustering. They also clustered results from customer and product segmentation to enhance the outcome of distributor aggregated dataset clustering. Finally, students created three interactive Tableau dashboards and successfully identified distributor clusters. The Tableau dashboards allowed Corning to visualize the final clustering results.

Goergen Institute for Data Science (GIDS)

Team 1: GIDS Marketing/Recruitment Improvement & Prediction Engine

Members: Sung Beom Park, Joseph Smith, Jiecheng Gu, Xiaoen Ding

For this project, GIDS sought to understand the types of institutions and programs that applicants to the Master of Science (MS) in data science program ultimately attend. Students were tasked with providing profiles of the MS applicant pool, aggregating information on competitor programs, and predicting which admitted applicants would accept GIDS offers. To achieve this goal, the team employed various exploratory data analysis techniques and visualizations, making information understandable in layman’s terms. Students also utilized classification models, such as Decision Tree, Logistic Regression, Naïve Bayes, K-Nearest Neighbors, and Random Forest, to predict whether admitted applicants would ultimately join the program. Finally, the team provided insights and recommendations to GIDS on how to increase their yield of admitted MS applicants.

Team 2: GIDS Masters Admissions Data Analysis

Members: Chen Yao, Hanyang Zhang, Qianqian Gu, Wei Wu

For this project, the team sought to generate meaningful insights and helpful suggestions for the marketing and recruitment efforts of the GIDS Master of Science (MS) in data science program. First, the team performed exploratory data analysis on past applicant and National Student Clearinghouse data to identify applicant behavior patterns. They also trained supervised classification models (i.e., logistic regression, MLP, and random forest) to uncover the factors that affect admitted applicants’ final decisions. The team ultimately found that students who take more time to complete their applications and have higher GRE scores are more likely to decline the offer. Additionally, the admission office should expect around 20% of created applications to be incomplete and be prepared for an influx of applications between the middle of December and early January.

Rel8ted Analytics: Building Detection for Top View Images

Team Members: Shikhar Bajpai, Minu Sarraf, Joshua Coppola, Lloyd Page

The goal of this project was to create a model that can identify the presence of buildings in top-down photographs taken from sources like Google Maps. To achieve this goal, the team created a labeled dataset consisting of 1,200 images with bounding boxes surrounding the buildings in the photos. They then trained the YOLOv5 object detection model using the labeled data. To provide an accessible user interface, students built a web application using Streamlit; this application includes a web page where users can upload images, geocodes, addresses, or CSVs with geocodes and addresses to receive back annotated images with the buildings labeled. The model ultimately identified the presence of at least one building in an image with a 97 percent success rate.

SabanciDx: Improving Efficiency in Yarn Production using Machine Learning

Team Members: Prem Kumar Thulasi Kumar, Sai Jawahar Reddy Meka, Zane Otter, Shahid Shakhil

For this project, SabanciDx sought to improve the efficiency of their yarn production plants to generate increased revenue. Students were tasked with forecasting machine breakdowns caused by yarn fiber breakages to help reduce breakages and improve overall plant efficiency. The team utilized various machine learning algorithms, including linear regression, random forest, XGBoost, and neural networks. They also experimented with forecasting yarn breakages at different levels of granularity (per hour, per shift) by training segmented models (segmented by machines and yarn density). In addition, students performed extensive hyperparameter tuning and engineered window variables to capture recent changes (increases or decreases with magnitude of change) in input attributes. Ultimately, the team determined and shared the results of the best-performing approaches/models.

Vnomics: Predicting DPF Failure in Tractor Trailers

Team Members: Steven Dai, Zachary Mustin, Uzoma Ohajekwe, Duy Pham

For this project, the team was tasked with predicting imminent failures in truck and trailer Diesel Particulate Filters (DPFs) up to fourteen days before breakdowns occur. They also worked to identify critical indicators of DPF failures. First, students extracted daily trip records from the fourteen days before service dates and scaled trip features. Then they performed windowing, extracting useful features from each window using tsfresh, a Python library for time series feature extraction. After completing intensive model tuning with Optuna, the team achieved a best accuracy and recall of 79.59% and 100.00%, respectively, with Random Forest. Students ultimately discovered that the strongest indicator of DPF failure is how long the Diagnostic Trouble Code (DTC) SPN 3720 FMI 15 remains active within each window. The DTC indicates the amount of ash accumulated in the filter. As ash accumulates, the DPF becomes more likely to clog and fail.
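
The windowed "DTC active time" indicator described above can be sketched in plain Python. This is an illustration only: it does not use tsfresh, the function name and the per-day "active hours" representation are hypothetical, and the numbers are invented rather than taken from Vnomics data.

```python
from typing import List, Sequence


def dtc_active_time_per_window(
    daily_active_hours: Sequence[float], window_days: int, step_days: int = 1
) -> List[float]:
    """Total hours a trouble code was active in each window of daily records.

    `daily_active_hours[i]` is assumed to hold how many hours the code was
    active on day i; windows advance by `step_days` days.
    """
    features: List[float] = []
    last_start = len(daily_active_hours) - window_days
    for start in range(0, last_start + 1, step_days):
        window = daily_active_hours[start:start + window_days]
        features.append(sum(window))
    return features


# Invented example: hours per day that the code was active over the
# fourteen days before a service date.
active_hours = [0, 0, 0, 0.5, 0, 1, 0, 2, 2.5, 3, 4, 6, 8, 10]

# Two non-overlapping seven-day windows.
features = dtc_active_time_per_window(active_hours, window_days=7, step_days=7)
# features == [1.5, 35.5]
```

In this invented example the second window shows far more active time than the first, which is the kind of rising signal a classifier such as Random Forest could pick up before a failure.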

URMC

Team 1: Investigation on Geriatric Assessment Based Features and Prediction on Relative Dose Intensity in Chemotherapy

Members: Xinyu Cai, Minghui Cen, Yaxin Yang, Yilin Zhou

For this project, the Wilmot Cancer Institute’s Geriatric Oncology Research team sought to improve the effectiveness of their chemotherapy treatments and provide more precise methods to clinicians. Students were tasked with predicting the effectiveness of treatments by utilizing geriatric assessment-based features that emphasize quality of life and functional capacity. To achieve this goal, the team refined the data preprocessing pipeline and built predictive models, such as Random Forest and Logistic Regression. Although this process produced acceptable test results, the results were not constructive enough to aid clinicians. Consequently, the team changed their focus, applying different feature selection approaches to the dataset (e.g., Elastic Net) and providing suggestions for future cancer studies, such as collecting more features on older patients’ physical state during geriatric assessments.

Team 2: Predicting COVID Survival as a Function of Demographics and Ventilator Priority

Members: Walter Burnett, Derek Caramella, Ezgi Siir Kibris, Nefle Nesli

The goal of this project was to uncover whether demographic disparities exist in URMC’s COVID-19 survival outcomes and initial triage priorities, and to assess whether URMC’s current ventilator priority classification accurately predicts COVID survival. To achieve this goal, the team utilized four hyperparameter-tuned machine learning models: (1) decision tree, (2) logistic regression, (3) random forest, and (4) Support Vector Machine (SVM). After comparing the accuracy, precision, recall, and AUC metrics for each model, students determined that the decision tree model (entropy, three levels) maximized their predictive power. The decision tree ultimately predicted that patients with blue assessment scores who are above 74 years old have low survival outcomes. Students also confirmed that URMC’s current ventilator priority system is an accurate predictor of survival, and that there are no significant demographic disparities in the survival outcomes or predictive values of initial triage priorities.
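
For readers unfamiliar with the comparison metrics named above, the sketch below computes accuracy, precision, and recall from binary labels (AUC is omitted because it needs predicted scores rather than hard labels). The labels are invented for illustration, and the function name is hypothetical; this is not the team's evaluation code.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, and recall for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # Of everything predicted positive, how much was actually positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of everything actually positive, how much was found.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Invented labels, with 1 standing for the positive class.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
# accuracy 0.75, precision 0.8, recall 0.8
```

Comparing several models on the same held-out labels with these metrics is one common way to choose among them, since accuracy alone can hide poor recall on the less frequent class.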