Fall 2021 Projects – Data Science Capstone Course @ University of Rochester

Agrograph: Annual Crop Yield Prediction

Team Members: Leo Cordaro, Stephen Gass, Ravi Dugh

For agricultural data solutions company Agrograph, the team worked to build the best model for predicting annual crop yields. Support Vector Regression (SVR) proved to be the best model for these predictions. Team members also modestly improved their model’s baseline performance using SVR and extensive feature engineering; these improvements resulted in a 15.6 percent average Mean Absolute Percentage Error (MAPE) for the top five crops. The project required extensive domain knowledge to engineer optimal features and understand overall performance limitations. Students used feature importance analysis to confirm that vegetative indices such as NDVI, GCI, and GPP generate the most predictive features for annual crop yield when combined with geospatial features such as Land Surface Temperature (LST) and soil features like soilWater and orgCarbon. They then improved their model by reframing the problem as classification rather than regression; the model ultimately achieved an average XGBoost F1 score of 0.80 across the top five crops, predicting in the 50th percentile of annual crop yields.
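MAPE, the metric reported for this project, is the average of the absolute prediction errors expressed as a percentage of the actual values. The following is a minimal sketch of that calculation; the function name `mape` and the yield figures are made up for illustration and are not taken from the team's work.

```python
def mape(actual, predicted):
    """Return the Mean Absolute Percentage Error, in percent."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("inputs must not be empty")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical yields, invented for illustration only.
actual_yield = [100.0, 200.0, 50.0]
predicted_yield = [90.0, 220.0, 55.0]

# Each prediction is off by 10 percent of the actual value,
# so the result is approximately 10.
example_mape = mape(actual_yield, predicted_yield)
print(example_mape)
```

Because each error is divided by the actual value, MAPE is undefined when an actual value is zero, which is why the sketch rejects such inputs.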

Databricks: A Study of Homelessness in the United States

Team Members: Tanqiu Jiang, Yichi Qian, Ziyu Xiong, Tianyu Wang

Homelessness has long been understudied due to a lack of qualitative data. The goal of this project was to study the impact of a variety of factors on homelessness from a qualitative standpoint. First, the team collected data from various sources, then cleaned and aligned it. They then applied analytical methods, such as sharp-null hypothesis testing, regression discontinuity design (RDD), linear regression, and temporal analysis, to explore the relationship between homelessness and factors like policy, public transportation, economy, urbanization, temperature, and funding. They also used data visualizations to illustrate their discoveries. Students’ preliminary findings showed that progressive policies benefit homeless communities. After evaluating these findings, the team applied various machine learning models to their data to predict homelessness ratios. Using the model interpretation package “SHAP,” the students estimated how much of an impact each factor has on homelessness.

Goergen Institute for Data Science (GIDS)

Team 1: Understanding the Applicant Pool for University of Rochester’s MS Program in Data Science

Members: Viktoriia Shevchenko, Kira Hellard, Kyungtaek Lee

The goal of this project was to understand which factors heavily influence offer acceptance for applicants to GIDS’ Master of Science in Data Science program. To achieve this goal, the team focused on data visualization and random forest classification, catering to a non-technical audience working in admissions. Data visualizations helped students understand how to better recruit quality applicants, and random forest models provided insights into the significance of various application details.

Team 2: Graduate Program Analysis for GIDS

Members: Haoming Ma, Qiongdan Zhang, Yiheng Hua, Taichen Zhou

For this project, students worked with GIDS to analyze its graduate admissions. The goal was to help the admissions team better understand its applicant pool. The team ultimately delivered an effective model that predicts which applicants will accept admissions offers. First, they implemented KNN, Random Forest, and decision trees to build a model that predicts student choices. Then, with the help of Multiple Imputation by Chained Equations (MICE), they filled in missing values for GRE scores. Students achieved 0.78 accuracy with the decision tree model and plotted the decision tree to better understand how various factors influence student decisions. Finally, the team used Tableau to generate data visualizations and draw insights from them.

MacroXStudio – Nightlights: Nightlights Exploration for Early Business Insights

Team Members: Alex Crystal, Zhiyu Lei, Yiheng Mao, Letian Shi

In the world of investing, time series forecasting can help to generate excess returns above the benchmark. For this project, students applied time series forecasting to city nightlights data to predict investment activity. The team used NASA’s Earth nightlights images, vectorized them using a dense neural network (NN) auto-encoder, and then forecast the vectors using an auto-regression model. Students then decoded the forecast vectors back into nightlights images and combined them with economic indicators to predict market activity.
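The forecasting step, in which encoded image vectors are extended forward in time with an auto-regression model, can be sketched as follows. This is an illustration under stated assumptions rather than the team's implementation: it assumes a first-order model with no intercept fitted independently to each vector dimension, and the names `fit_ar1`, `forecast_next`, and `latent_history` are hypothetical. The model order the team actually used is not stated.

```python
def fit_ar1(series):
    """Least-squares estimate of phi in x[t] = phi * x[t-1] (no intercept)."""
    numerator = sum(curr * prev for prev, curr in zip(series, series[1:]))
    denominator = sum(prev * prev for prev in series[:-1])
    if denominator == 0:
        raise ValueError("series has no signal to fit")
    return numerator / denominator


def forecast_next(vectors):
    """Forecast the next vector with an independent AR(1) fit per dimension."""
    forecast = []
    for d in range(len(vectors[0])):
        series = [v[d] for v in vectors]
        phi = fit_ar1(series)
        forecast.append(phi * series[-1])
    return forecast


# Hypothetical 2-dimensional encoded vectors for four time steps
# (stand-ins for auto-encoder outputs, invented for illustration).
latent_history = [[1.0, 8.0], [2.0, 4.0], [4.0, 2.0], [8.0, 1.0]]

# The first dimension doubles each step and the second halves.
next_latent = forecast_next(latent_history)
print(next_latent)
```

In a pipeline like the one described, the forecast vector would then be passed through the decoder half of the auto-encoder to produce a predicted image.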

MacroXStudio – Twitter: Sentiment Analysis of Twitter Data to Extract Investment Insights

Team Members: Jingwen Zhong, Chenglu Xia, Hao Ma, Qilun Liu

The goal of this project was to analyze MacroXStudio’s Twitter data, extracting information for investment insights. The team managed over 6 million pieces of Twitter data from seven separate data streams. Students applied various sentiment analysis tools, such as TextBlob, Flair, and VADER, and built community structures to analyze Twitter users’ interactions, classifying user groups and information flow. In addition, students applied Foundation Modeling (BERT) to extract topic information, building a complete framework to tune Latent Dirichlet Allocation (LDA) and visualize output topics in detail. Their models captured key insights on market risk and pinpointed potential investment opportunities.

Paychex

Team 1: Professional Employer Organization (PEO) Upsell Prediction Engine

Members: Ledion Lico, Brian Winn, Isaac Manly, Jordan Pappas

HR and payroll company Paychex wants to sell more PEO products to increase revenue, reduce costs, and improve service for its clients. The goal of this project was to predict which existing Paychex clients are most likely to purchase PEO products. The team employed various classification algorithms, such as Support Vector Machine and Extreme Gradient Boosting, to predict the likelihood of customer upsells. They also used SMOTE resampling and Optuna hyperparameter tuning to mitigate the performance issues caused by class imbalance. Finally, the team created a dynamic interface to generate client lists and their respective upsell probabilities.
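SMOTE addresses class imbalance by creating synthetic minority-class examples: it picks a minority sample, picks one of its nearest minority neighbors, and interpolates a new point between the two. The sketch below illustrates that core idea in plain Python. It is a simplified stand-in, not the library implementation the team would have used, and the function name `smote_like_oversample` and the sample points are hypothetical.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic


# Hypothetical minority-class feature vectors, invented for illustration.
minority_points = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0]]
new_points = smote_like_oversample(minority_points, n_new=6)
print(len(new_points))
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies instead of duplicating existing rows.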

Team 2: Monthly Revenue Forecasting for Paychex

Members: Shijing Li, Lingyu Ye, Yuan Wang, Yangxin Fan

The goal of this project was to identify external activities that significantly impact Paychex’s revenue and to build prediction models that forecast Paychex’s future monthly revenue. Students used exploratory data analysis, preliminary feature selection, and feature engineering to extract correlated features from the dataset. They then fed the correlated features into their prediction models, starting with SARIMA before building up to the hybrid GAM and the deep learning model LSTM. To yield better results, the team optimized each model with recursive feature engineering and hyperparameter tuning. Model selection was based on the lowest Mean Absolute Percentage Error (MAPE), and the best model was Facebook Prophet with a MAPE of 4.9 percent (which is highly accurate based on a 5 percent threshold).

Regional Transit Service (RTS): Predictive Analytics for Demand Responsive Para-transportation

Team Members: Yaozhong Huang, Yihe Chen, Kehan Yu, Junting Chen

The goal of this project was to create a more productive, demand-responsive, para-transportation schedule for the Regional Transit Service (RTS) by predicting customer cancellation rates. The team applied supervised machine learning methods, such as the Random Forest, Weighted Random Forest, and XGBoost Classifiers, to internal and external weather datasets to predict cancellation rates for scheduled trips. Students also leveraged resampling and algorithmic ensemble techniques to boost the precision of their prediction models, and incorporated business insights into their model’s applications.

University of Rochester Facilities: Energy Consumption Prediction for University of Rochester

Team Members: Sherif Negm, Shangfu Zuo, Yitao Yu, Yunjiao Mao

The goal of this project was to predict the University of Rochester’s total electricity demand over 72 hours and peak total demand over 48 hours using past electricity demand, temperature, holiday calendar, and day of the week data. The team used Facebook Prophet in conjunction with a classic random forest model to predict total electricity demand. Their model ultimately achieved less than 5 percent Mean Absolute Percentage Error (MAPE) predicting total demand and over 90 percent accuracy predicting peak demand.

Wegmans

Team 1: Customer Cross-Shopping Prediction for Wegmans Food Markets

Members: Siladitya Khan, Soumya Goswami, Alyssa Ibarra, Isabel Kenney

Cross-shop prediction models offer businesses a profitable means of acquiring new customers. For this project, the team developed a model to predict the likelihood of Rochester Wegmans customers cross-shopping at competitor Whole Foods. The models employed a classic linear GLM and a non-linear SVM on a highly sparse dataset while retaining departmental and categorical hierarchies. Team members used feature projection and sparsity-enabled reduction techniques to arrive at a condensed feature set with high predictive power. The team achieved an AUC of 0.844 and an F1 score of 0.56 with reduced department-dollar features on the test set. They also made the model explainable using SHAP (SHapley Additive exPlanations).
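The two metrics reported here measure different things: AUC scores how well the model ranks positives above negatives across all thresholds, while F1 depends on the single threshold used to turn scores into yes/no predictions. A minimal sketch of both calculations follows; the function names, labels, scores, and the 0.5 threshold are hypothetical and are not drawn from the team's data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def f1_score(labels, predictions):
    """F1 as the harmonic mean of precision and recall."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical labels (1 = cross-shopper) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = roc_auc(labels, scores)
predictions = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold
f1 = f1_score(labels, predictions)
print(auc, f1)
```

A model can therefore rank customers well (high AUC) and still post a modest F1 if the positive class is rare or the chosen threshold is not well matched to it.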

Team 2: Customer Cross-shopping Trends Analysis and Predictions

Members: Yuan Feng, Haoyuan Tu, Jiayin Liu, Guangyi Shi

For this project, the team developed multiple machine learning models (such as Naïve Bayes, KNN, and Random Forest) to predict the likelihood of Wegmans customers shopping at competitor Whole Foods. Students applied data pre-processing, feature engineering, feature selection, model selection, and model improvement techniques to obtain the best-performing model, and used it to provide customer insights to Wegmans representatives.