Spring 2024 Projects – Data Science Capstone Course @ University of Rochester

Benchmark Labs: Seasonal Weather Forecasting in the Finger Lakes Region

Team Members: Sarah Lee, Marla Litsky, Emily Xu, Chloe (Tianyi) Zhang

As atypical weather fluctuations become more common, farmers and vineyard owners need to better prepare for weather anomalies. The goal of this project was to develop a tool to help farmers and vineyard owners predict weather anomalies, with a focus on vineyards in the Finger Lakes region of New York State and with special consideration for freezing events. After talking with representatives from Benchmark Labs, the team realized they needed a seasonal matching algorithm that they could scale up, which they achieved by using a Hidden Markov model. Students ran the Hidden Markov model on 20 years of recent data (2004-2023) to fine-tune the best-matched seasons as outputs. After plotting their input season against the model’s top matches, the team observed that the overall shape of the lines was quite close, indicating that the model was fairly accurate. Students’ final deliverables included details on how to convert necessary data, the model, and the evaluation visualizations.
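
The visual comparison of an input season against its top matches can also be expressed as a number. The sketch below is not the team's Hidden Markov model; it is a plain distance baseline, assuming each season is a list of equally spaced temperature readings, that ranks past seasons by root-mean-square error against the input season. The function names and all values are hypothetical.

```python
import math


def rmse(a, b):
    """Root-mean-square difference between two equal-length series."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def top_matches(target, history, k=3):
    """Return the k (year, rmse) pairs from `history` closest to `target`."""
    scored = [(year, rmse(target, series)) for year, series in history.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Hypothetical weekly mean temperatures (deg C); not project data.
history = {
    2021: [1.0, 3.0, 6.0, 10.0],
    2022: [-2.0, 0.0, 5.0, 12.0],
    2023: [0.0, 2.0, 7.0, 9.0],
}
target = [1.0, 2.0, 6.0, 10.0]

matches = top_matches(target, history, k=2)
for year, error in matches:
    print(year, round(error, 3))
```

A lower error means the two curves sit closer together, which is one way to put a number on "the overall shape of the lines was quite close."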

Corning: Leveraging LLMs to Summarize Content as Existing Model Inputs

Team Members: Yichen Li, Yuchen Li, Zhushan Xie, Muchen Zhong

The goal of this project was to create a model that automatically analyzes news articles for customer behavior and sentiments, providing business insights to Corning and reducing manual efforts to assess customer behavior. To achieve this goal, the team web-scraped and preprocessed news articles related to Corning’s customer companies. They then employed LLaMA 2 to analyze the news articles and predict stock price changes for Corning and its customer companies, tuning the model with a few-shot approach. Students also used historical stock prices to evaluate model performance. With 1,851 news articles collected, the team achieved a 58.55% weighted average for customer companies and a 39.72% weighted average for Corning. The model performed better overall with tech companies than with telecom companies, and the results for customer companies were very close to those outlined in the paper “Temporal Data Meets LLM – Explainable Financial Time Series Forecasting.”

DiscSense: Enhancing Disc Sport Performance: Insights and Innovations

Team Members: Matthew Irons, Shamsul Chowdhury, Syed Muhammad Qasim Sudais, Qinbo Wang

DiscSense wants to develop a lightweight sensor to help disc sport athletes improve their throwing skills and overall playing experience. Accordingly, the primary focus of this project was to track throw-end conditions, allowing disc sport athletes to analyze their throwing performance over time. Students developed a Random Forest Classifier model to categorize throws by four end conditions: Catch, Drop, Ground, and Net. They also tuned the model’s hyperparameters with scikit-learn’s GridSearchCV and used the RandomOverSampler and SMOTE oversamplers from the imbalanced-learn library to address class imbalance. The team’s model ultimately achieved 84% accuracy, outperforming the original goal of accurately classifying 80% of throw-end conditions. To complete the project, the team performed cross-validation to analyze their model’s generalizability, achieving a mean score of 0.846.
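
To illustrate what random oversampling does to an imbalanced set of throws, here is a minimal standard-library sketch. It is not the team's code and not the library implementation: it assumes the simplest scheme, in which minority-class samples are duplicated at random until every class matches the largest one. SMOTE differs in that it synthesizes new points by interpolating between neighbors rather than duplicating. The function name and feature values are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    target = max(len(group) for group in by_class.values())

    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Draw (with replacement) the extra copies this class needs.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical one-feature sensor readings; not project data.
X = [[0.1], [0.2], [0.3], [0.4], [0.9], [0.8], [0.5]]
y = ["Catch", "Catch", "Catch", "Catch", "Drop", "Ground", "Net"]

X_bal, y_bal = random_oversample(X, y)
counts = Counter(y_bal)
print(counts)
```

In practice, oversampling is normally applied only to the training portion of each split, so that duplicated samples do not leak into the data used for evaluation.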

Goergen Institute for Data Science (GIDS): The SOP Factor: Analysis of Statement of Purpose (SOP) in MS in Data Science Applications

Team Members: Kudakwashe Rumawu, Grace Karanja, Revan Minnam, Philbert Ndagijimana

GIDS wants to understand how Statements of Purpose (SOP) affect the decisions of its Master of Science (MS) in Data Science admissions officers, and whether SOPs are effective predictors of admission decisions. To accomplish this goal, students employed Natural Language Processing tools such as spaCy and WordCloud in the exploratory data analysis (EDA) phase to extract features from SOPs. They then leveraged Large Language Models (mainly BERT) to create vector embeddings that represent the SOPs, increasing the number of features used in model development. However, after building a Random Forest model, the team realized that building a model that predicts admissions decisions with SOPs would require more training of the BERT model to improve vector embeddings.

Paychex: Paychex Job Title Prediction Pipeline

Team Members: Rohaan Ahmad, Joey Tschopp, Gus Vietze, Keitaro Ogawa

The goal of this project was to predict the job titles of new users on Paychex’s website, speeding up data entry on the website with an autofill option. To achieve this goal, the team used a hierarchical prediction model: a neural network predicts general job categories, and a Random Forest model predicts specific job titles within those categories. Students then performed a fairness analysis of the model and data, with a focus on gender and ethnicity, and delivered the final data model pipeline to Paychex for use on their website.
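
The two-stage structure of a hierarchical predictor can be sketched as follows. This is an illustration under assumptions, not the delivered pipeline: the trained neural network and Random Forest models are replaced by hypothetical rule-based stand-ins, and the categories, titles, and feature values are invented for the example.

```python
from typing import Callable, Dict, Sequence, Tuple

Features = Sequence[float]
Classifier = Callable[[Features], str]


def predict_job_title(
    features: Features,
    category_model: Classifier,
    title_models: Dict[str, Classifier],
) -> Tuple[str, str]:
    """Stage 1 picks a general category; stage 2 picks a title within it."""
    category = category_model(features)
    # Route the same features to the model trained for that category only.
    title = title_models[category](features)
    return category, title


# Hypothetical stand-ins for the trained models (simple rules, not classifiers).
def toy_category_model(features: Features) -> str:
    return "Engineering" if features[0] > 0.5 else "Sales"


toy_title_models: Dict[str, Classifier] = {
    "Engineering": lambda f: "Software Engineer" if f[1] > 0.5 else "QA Analyst",
    "Sales": lambda f: "Account Executive" if f[1] > 0.5 else "Sales Associate",
}

result = predict_job_title([0.9, 0.2], toy_category_model, toy_title_models)
print(result)
```

Splitting the problem this way means each second-stage model only has to choose among the titles in one category instead of among every title at once.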

Rel8ed Analytics: Using Large Language Models to Derive Alternative ESG Rankings

Team Members: Sam Arnts, Nick Jiang, Charlie Krakauer, Ben Noe

Rel8ed Analytics wants to generate alternative Environmental, Social, and Governance (ESG) company rankings that are more concise and objective than current ranking methods. Therefore, the goal of this project was to use Large Language Models (LLMs) to extract ESG information from textual data and use it to rank companies relative to each other. To accomplish this goal, the team employed Retrieval Augmented Generation (RAG) to parse, chunk, and embed company ESG reports. They also used a custom LLM to query those reports and retrieve relevant metrics. When students ran the returned metrics through a second-layer LLM, their custom RAG achieved greater accuracy at returning ESG metrics than the industry-standard ChatGPT 4; the team ultimately produced proof-of-concept rankings for 25 companies.
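
The retrieval half of a RAG pipeline (chunk, embed, then fetch the chunk most relevant to a query) can be sketched with the standard library alone. This is a toy under stated assumptions, not the team's system: chunks are single sentences, the "embedding" is a bag-of-words count standing in for a learned embedding model, and the generation step with an LLM is omitted. The report text and function names are hypothetical.

```python
import math
import re
from collections import Counter


def split_into_chunks(text):
    """Chunk the text by sentence (real pipelines often use token windows)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def embed(text):
    """Stand-in embedding: counts of lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical report text; not taken from any real ESG report.
report = (
    "Our board added two independent directors this year. "
    "Scope 1 greenhouse gas emissions fell to 1200 tonnes after the retrofit. "
    "Employee volunteering hours rose across all regional offices."
)

chunks = split_into_chunks(report)
top = retrieve("scope 1 greenhouse gas emissions", chunks, k=1)
print(top[0])
```

In a full pipeline, the retrieved chunk would be passed to the LLM as context so that the metric it returns is grounded in the report text.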

URMC CTSI: Social Media Engagement and E-Cigarette Communication Analysis

Team Members: Christy Kim, Weihong Qi, Jason Wang, Jiaqi Zhu

This project explored how electronic cigarettes (e-cigarettes) are discussed on Twitter/X, with a focus on identifying post characteristics that significantly increase user engagement. Employing advanced data mining techniques, including transformer-based text classification with RoBERTa and topic modeling through LDA and BERTopic, students analyzed millions of e-cigarette-related posts. Their analysis focused on a variety of features, including the timing of tweets, the presence of health-related content, and specific linguistic styles. Students found that these factors significantly influence user interactions, as evidenced by increased likes, retweets, and responses. The results highlighted how certain post characteristics can enhance engagement, suggesting that public health messages with these characteristics could be more effective at reaching and influencing social media users.

URMC Nursing: Extending Hierarchy and Rule-based Algorithms to the All of Us Research Program

Team Members: Hamidah Shaik, Tyler Walton, Emma Dickerson, Shiya Xiao

All of Us is an initiative of the National Institutes of Health (NIH) that seeks to enroll over a million participants across the United States and promotes biomedical research in underrepresented groups. The goal of this project was to replicate existing algorithms that identify pregnancy episodes for use in the All of Us researcher workbench. To achieve this goal, students combined the Hierarchy-Based Inference of Pregnancy (HIP), Pregnancy Progression Signature (PPS), and Estimated Start Date (ESD) algorithms, and successfully identified 24,294 pregnant participants with 44,775 pregnancy episodes within the All of Us dataset. The team then extended the algorithm to break the episodes into trimesters, determine the number of prepartum and postpartum visits, and investigate related health discrimination and maternal morbidity. All work was conducted with reproducibility and open science for future research in mind.
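
Breaking an episode into trimesters is date arithmetic once a start date is known. The sketch below assumes an episode is a pair of start and end dates and uses a common convention in which the second trimester begins at 14 completed weeks and the third at 28; the source does not state which boundaries the team used, so both constants and the function name are assumptions.

```python
from datetime import date, timedelta

# Assumed boundaries, in completed weeks since the episode start date.
SECOND_TRIMESTER_START_WEEKS = 14
THIRD_TRIMESTER_START_WEEKS = 28


def split_into_trimesters(start, end):
    """Return [(trimester, first_day, last_day)] with inclusive dates.
    Trimesters the episode never reached are left out."""
    if end < start:
        raise ValueError("episode end precedes its start")

    boundaries = [
        (1, start),
        (2, start + timedelta(weeks=SECOND_TRIMESTER_START_WEEKS)),
        (3, start + timedelta(weeks=THIRD_TRIMESTER_START_WEEKS)),
    ]

    segments = []
    for i, (number, first_day) in enumerate(boundaries):
        if first_day > end:
            break  # the episode ended before this trimester began
        if i + 1 < len(boundaries):
            # A trimester ends the day before the next one starts.
            last_day = min(end, boundaries[i + 1][1] - timedelta(days=1))
        else:
            last_day = end
        segments.append((number, first_day, last_day))
    return segments


# Hypothetical episode dates; not from the All of Us dataset.
segments = split_into_trimesters(date(2022, 1, 1), date(2022, 10, 1))
for number, first_day, last_day in segments:
    print(number, first_day, last_day)
```

With segments like these, counting visits per trimester reduces to checking which date range each visit date falls into.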

URMC Quality Institute: Strategies Exploration for Quality Improvement

Team Members: Veronica Chistaya, Lucy Chen, Xinyi Liu, Jingyan Yu

The University of Rochester Medical Center’s (URMC) Quality Institute wants to enhance search efficiency and foster collaboration through an extensive upgrade of its projects tracker. This platform is strategically designed to catalog, update, and report on all improvement initiatives across URMC, facilitating seamless access and management. To achieve these goals while navigating complex project data, the team utilized a multifaceted analytical approach. Using network analysis and intuitive visualizations, students identified an intricate web of project connections, enhancing the platform’s navigational and discovery capabilities. They also applied a variety of clustering techniques to systematically classify projects into meaningful categories, simplifying the platform’s search process. Finally, students implemented advanced text analysis to develop a similarity search feature, enabling users to input keywords and quickly locate related projects within the database.

Zalliant: Cattle Abnormality Detection Engine

Team Members: Jiayi Hao, Yulin Feng, Runzhou Liu, Hao Jing

The goal of this project was to automatically detect nontypical events in dairy cows that could negatively affect milk production, such as abnormal water intake and calving, and to predict cows’ water intake. To achieve this goal, the team applied time series analysis and linear regression to cattle temperature and water intake data. The results yielded high accuracy for abnormal water intake detection and calving detection. The students also successfully applied linear regression to water intake prediction data, creating an accurate prediction model that can be used to automatically analyze nontypical cattle events.
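
As an illustration of how a fitted line can both predict water intake and flag abnormal readings, here is a minimal sketch. It assumes a single predictor (a temperature reading) and a fixed residual tolerance; the source does not say which variables or thresholds the team used, so the function names, the tolerance, and every number are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def flag_abnormal(xs, ys, slope, intercept, tolerance):
    """Indices where the observed value differs from the prediction
    by more than `tolerance`."""
    return [
        i
        for i, (x, y) in enumerate(zip(xs, ys))
        if abs(y - (slope * x + intercept)) > tolerance
    ]


# Hypothetical readings: temperature (deg C) and water intake (litres/day).
train_temp = [38.0, 38.5, 39.0, 39.5]
train_intake = [80.0, 90.0, 100.0, 110.0]
slope, intercept = fit_line(train_temp, train_intake)

new_temp = [38.0, 39.0, 39.5]
new_intake = [82.0, 60.0, 108.0]
abnormal = flag_abnormal(new_temp, new_intake, slope, intercept, tolerance=10.0)
print(slope, intercept, abnormal)
```

Here the second new reading is flagged because its intake is far below what the fitted line predicts for that temperature.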