Fall 2022 Projects – Data Science Capstone Course @ University of Rochester

[Leading global airline]*: Final Fix Identification

Team Members: Sarah Logan, Danish Khan, Danni Shen, Aakanksha Patil, Jasper Lemberg

The goal of this project was to avoid trivial mechanical defects in [Leading global airline]* aircraft with the aim of reducing spending. To achieve this goal, students created embeddings using an SBERT model pre-trained for defective text data, and engineered a feature to approximate which rows included final fixes. To produce the recommendations, the team split the data by incoming aircraft type, embedded incoming defects, and performed an approximate nearest neighbor search using FAISS to find the ten most similar defects; the corresponding actions were output as recommendations. In addition, the team designed a dynamic interface that allows users to specify the order in which the recommended actions are shown: by relevance, most recent recommendations, or most common recommendations.
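The retrieval step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the team's implementation: tiny hand-written vectors stand in for SBERT embeddings, and an exact brute-force cosine search stands in for the approximate FAISS index. The function names, toy vectors, and action labels are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_similar(query, history, k=10):
    """Return the actions attached to the k past defects most similar to the query.

    history is a list of (embedding, action) pairs. Exact search is used here
    in place of the approximate FAISS search mentioned in the text.
    """
    ranked = sorted(history, key=lambda item: cosine(query, item[0]), reverse=True)
    return [action for _, action in ranked[:k]]

# Toy stand-ins for embeddings of past defect reports (hypothetical data).
history = [
    ([0.9, 0.1, 0.0], "replace seal"),
    ([0.0, 1.0, 0.1], "reset breaker"),
    ([0.8, 0.2, 0.1], "tighten fitting"),
]
recommendations = top_k_similar([1.0, 0.0, 0.0], history, k=2)
```

With real data, the list of pairs would be built per aircraft type, matching the split described above, and k would be ten.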

* : Anonymized on Request

Goergen Institute for Data Science (GIDS): Understanding Masters’ Applicant Pool

Team Members: Beilei Guo, Zhaoxuan Hu, Xinxuan Lu, Ziyue Yang, Erqian Xu

GIDS would like to better understand the full-time master’s applicant pools and application cycles across three programs: Data Science, Computer Science, and Electrical and Computer Engineering. This information will assist the admissions committee and other stakeholders in their marketing and decision-making processes, and attract more qualified candidates to the GIDS master’s program. To achieve this goal, the team implemented exploratory data analysis to generate applicants’ demographic information, application behavior, and final decisions. They then applied classification methods, such as support vector machines, K-nearest neighbors, and the Gaussian mixture model, to predict candidates’ likelihood of submitting applications, and their likelihood of accepting admission offers. The results demonstrated that utilizing the SMOTE-Tomek technique in conjunction with classification algorithms can overcome data imbalances to achieve optimal results. Students also discovered that support vector machines with two kernels perform better when identifying applicants’ likelihood of declining offers.

MacroXStudio

Team 1: Air Quality and Economic Progress in India

Members: Rui Lin, Deyao Kong, Zihao Xu, Lizhou Wang, Shan Wu, Junhao Xiang

The goal of this project was to find the relationship between changes in air quality and changes in economic outcomes in more than 50 cities across India by modeling the trade-offs required to achieve various sustainable development goals. To meet this goal, students retrieved official air quality data, filling in missing values to backfill data from 2018 to 2020. They then used the complete air quality dataset to build another set of models, utilizing linear regression, XGBoost, elastic net, and random forest to predict the unemployment rate. For comprehensive air quality data, students applied random forest, linear regression, elastic net, and XGBoost to predict future unemployment rates’ relationship with air quality in India. The results showed that random forest is the best model for predicting unemployment rates.

Team 2: Cryptocurrency and Twitter

Members: Xinyi Liu, Zhengyuan Li, Ruwei Chen, Yunle Chen, Yuchi Chen, Junyu Chen

For this project, students sought to understand whether cryptocurrency returns can be forecasted based on Twitter attention and sentiment. Students began by collecting 147 million crypto tweets from October 2021 to September 2022, and prices for 9 cryptocurrencies covering the same period. Rule-based and model-based algorithms were used for bot tweet detection. The team then estimated sentiment scores using RoBERTa, VADER, and an improved fused method. A retweet network and PageRank scores were used to rank Twitter accounts by their impact. The students generated a series of Twitter attention-, sentiment-, and price-related features and explored their correlations with cryptocurrency returns before utilizing a Granger causality test to assess the effectiveness of the features and PageRank scores. Finally, the team created two models to predict cryptocurrency returns. One regression model yielded an R² of up to 0.14, and one binary return trend classification model yielded up to 65% accuracy.
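To illustrate how PageRank can rank accounts in a retweet network, here is a minimal power-iteration sketch. It is not the team's code: the damping factor, the iteration count, the edge direction (retweeter to original author), and the account names are all assumptions made for the example.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank over a directed graph given as (source, target) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives the same teleportation share first.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # Nodes with no outgoing links spread their rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

# An edge (a, b) means account a retweeted account b (hypothetical accounts).
retweets = [("u1", "hub"), ("u2", "hub"), ("u3", "hub"), ("hub", "u1")]
scores = pagerank(retweets)
ranked_accounts = sorted(scores, key=scores.get, reverse=True)
```

Accounts that are retweeted by many others, or by highly ranked accounts, end up at the front of `ranked_accounts`.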

NASA: Universal Remote Observation for Coral Health (UROCH): Studying the Efficacy of Extending Existing NASA Instruments to Detect and Monitor Coral Reefs

Team Members: Lisa Pink, Matthew Johnson, Mohamad Ali Kalassina, Patrick Geitner, Thomas Durkin

NASA’s Langley Research Center aims to support coral revitalization groups by creating a baseline framework for detecting coral reefs and monitoring their health. The goal of this project was to leverage NASA’s satellite data – specifically MODIS-Terra and LandSat-8 – and coral bleaching data sources, to identify regions where coral reefs exist and assess their vitality. To meet this objective, the team adopted a multi-faceted approach involving three classification methods applied to the Great Barrier Reef and Caribbean regions. Coral and algae were detected using an XGBoost classifier with an accuracy of 96.94%. The students then distinguished coral from algae using a novel temporal rule-based classifier, and determined coral health by utilizing XGBoost to predict bleaching levels. The team also delivered a dynamic dashboard that allows users to determine coral presence, health, and characteristics in different locations.

Paychex: Paychex Enterprise Question-Answering Chatbot

Team Members: Yiwei Jiang, Mingxuan Jiang, Randy Zhang, Jingxuan Sun and Wenpei Zhao

Paychex was facing difficulties finding a suitable solution for specific questions. The goal of this project was to build a question-answering chatbot GUI based on knowledge-base articles. To achieve this goal, students pre-processed the articles, created a summarization model, and built a T5 model. They then combined the questions generated by the T5 model with input questions to develop an article prediction model. Finally, the team built a question-answering model and created a GUI utilizing the Streamlit library. The final article prediction model achieved 70% accuracy when students returned 4 to 6 solution IDs. The chatbot GUI also included three features: changing the threshold, selecting the ideal output format, and a GUI logo.

Rel8ted Analytics: Market Signal Analysis

Team Members: Ayushi Gaurav Chattree, Gurneet Singh Chhabra, Rui Ma, Tapan Pradyot

Rel8ted wants to analyze transaction-level data for technology services (IT, cloud, software, etc.) purchased by its partner companies. This dataset will then be used to create a recommendation system for clients, which recommends potential partnerships (with technology providers), identifies competitors, and highlights possible areas of expansion. To achieve these goals, students utilized a hybrid recommendation system that relies on explicit feedback to train embeddings for companies with publicly available transaction data. They then applied the KNN algorithm to find similarities between buyers and generate potential partner predictions for companies in the same cluster. Undersampling was utilized to manage class imbalances, while a scoring method gave weight to the amount and frequency of transactions between buyers and service providers. The final embeddings achieved an accuracy of over 90%, and an initial sense check performed by Rel8ted produced encouraging results. Students also designed a dynamic dashboard for recommendations that may be retooled for future client consumption.

United Nations: Priority Risk Identification and Scope Management

Members: Yuthika Shekhar, Abhijeet Sangam Talaulikar, Sajid Hussain Rafi Ahamed, Siddharth Susarla, Hailey Thanki

The United Nations (UN) is an international organization with 193 member states. It conducts surveys in each member nation on a bi-annual basis. The goal of this project was to identify which survey variables (participant responses) best indicate priority risks for 65 risk-prone countries, and which variables best predict UN action urgency and the extent of action needed for these countries. The survey participants included UN Entities like the World Food Programme (WFP) and UNICEF. Students utilized clustering algorithms (K-means) to group countries based on risk and urgency, and used frequent pattern mining algorithms (FP-Growth) to identify probable impending risk based on the priority risk each nation faced. Students also designed an auto-updated dashboard to produce visualizations of the above analysis, helping the UN to improve their business efficiency and offset resource costs.
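The idea behind frequent pattern mining on priority risks can be sketched as below. This is a brute-force support count, not FP-Growth itself (FP-Growth finds the same frequent itemsets more efficiently using a prefix tree). The function name, the risk labels, and the support threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Keep itemsets (up to max_size items) found in at least min_support transactions."""
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))  # sort so each itemset has one canonical tuple
        for size in range(1, max_size + 1):
            counts.update(combinations(unique, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

# Each row lists the priority risks reported for one country (hypothetical labels).
reported_risks = [
    ["drought", "food insecurity", "displacement"],
    ["drought", "food insecurity"],
    ["flooding", "displacement"],
    ["drought", "food insecurity", "flooding"],
]
patterns = frequent_itemsets(reported_risks, min_support=3)
```

In this toy data, drought and food insecurity co-occur in three of four rows, so observing one suggests the other is a probable accompanying risk.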

UR Academic IT: Learning Analytics in Aiding Student Success

Team Members: Aditya Taori, Kaustubh Ganer, Keerthi Srinivas Konjety, Sunishka Misala, Vishnu Elangovan

Success rates for university students depend on a variety of factors such as background, domicile country, how attentive and interactive they are in their courses, and previous GPA. The University of Rochester’s (UR) Academic IT group wants to identify struggling students using their Blackboard (Learning Management System) activity: number of clicks, number of posts in discussion forums, and past GPA and demographic information. To achieve this goal, the team applied various classification algorithms, like Random Forest and Logistic Regression, and utilized an oversampling technique to identify struggling students. The student assessment period was 30 days and the team ultimately identified 80% of struggling students at UR.
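The summary does not say which oversampling technique was used. As one simple possibility, random oversampling duplicates minority-class rows until the classes are balanced, so that a classifier is not dominated by the majority class. The sketch below assumes that variant; the function name and the toy data are hypothetical.

```python
import random

def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Draw extra copies with replacement from the existing rows of this class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Toy data: label 1 marks a struggling student; features are placeholders.
rows = [[1], [2], [3], [4], [5]]
labels = [0, 0, 0, 0, 1]
balanced_rows, balanced_labels = random_oversample(rows, labels)
```

Oversampling should be applied to the training split only, so that duplicated rows do not leak into the evaluation data.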

UR LLE: Predicting Laser Energy Output and Analysis of Environmental Factors

Team Members: Yunran Yin, Walter Hennings, Junjie Niu, Zixin Zhou, Chris Moore

The University of Rochester’s Laboratory for Laser Energetics (UR LLE) would like to improve their experiments by more precisely delivering energy during laser shots. For this project, students were tasked with applying predictive modeling to energy outputs, while accounting for environmental factors. The team applied an Extra Trees regression model and a feedforward neural net to the dataset to predict energy output. They also collected feature importance metrics such as Gini importance, permutation comparisons, SHAP values, and saliency values to assess the impact of environmental factors on model predictions. Students were able to make accurate energy predictions for randomly sampled shot data. The impact of environmental factors mostly went undetected; however, some analyses indicated that temperatures in the laboratory’s source bay could have a small impact on energy.
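Permutation-based importance, one of the metrics listed above, measures how much a model's error grows when a single feature column is shuffled. The sketch below is illustrative only: a fixed linear function stands in for a trained regressor, and the function names and toy data are hypothetical.

```python
import random

def mse(model, X, y):
    """Mean squared error of model predictions against targets."""
    return sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Average increase in MSE after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this column replaced by shuffled values.
            X_perm = [row[:col] + [value] + row[col + 1:]
                      for row, value in zip(X, shuffled)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Hypothetical stand-in for a fitted model: it uses feature 0 and ignores feature 1.
def model(row):
    return 3.0 * row[0]

X = [[float(i), float((i * 7) % 10)] for i in range(10)]
y = [model(row) for row in X]
importances = permutation_importance(model, X, y)
```

A feature the model ignores gets an importance of zero, which is the kind of result that would be read as an environmental factor going undetected.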

Wegmans: Optimizing Online Ordering: An Approach to Enhance Item Selection for the FastPick Section

Team Members: Stefano Bastianelli, Noah Boonin, Brooke Brehm, Sara Elgayar, Miguel Novo Villar

Wegmans wants to process more online orders in the same amount of time, to increase online sales and customer satisfaction. To help achieve this goal, students worked to decrease the amount of time Wegmans employees and Instacart workers spend gathering items in the store for online orders. The team created a target variable for Wegmans FastPick products using the items most frequently found in online transactions. Students then modeled the FastPick target and engineered features from sales data with Decision Trees, Logistic Regression and XGBoost on monthly, quarterly and yearly bases. Finally, the team delivered a pipeline for dataset creation and modeling, along with results and predictions for FastPick recommended items, all in an interactive Tableau dashboard.
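A target variable based on the items most frequently found in online transactions could be built roughly as follows. This is a sketch under assumptions: the cutoff `top_n`, the function name, and the sample orders are hypothetical, and the team's actual labeling rule is not given in the summary.

```python
from collections import Counter

def label_fastpick(orders, top_n):
    """Label each item 1 if it is among the top_n items by number of orders containing it."""
    # Count each item once per order, regardless of quantity.
    counts = Counter(item for order in orders for item in set(order))
    top_items = {item for item, _ in counts.most_common(top_n)}
    return {item: int(item in top_items) for item in counts}

# Hypothetical online orders, each a list of item names.
orders = [
    ["bananas", "milk"],
    ["bananas", "bread"],
    ["bananas", "milk", "eggs"],
]
fastpick_labels = label_fastpick(orders, top_n=2)
```

The resulting 0/1 labels can then serve as the target for classifiers such as those named above, with features engineered from sales data.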