SPRING 2025 – Data Science Capstone Course @ University of Rochester

Zalliant: Integrating Forecasting for Weather-Optimized Crop Cutting

Team Members: Skye Crocker, Colin Hascup, Ahmed Malik, Lam Nguyen, Ruilin Zhang

Zalliant seeks to optimize crop harvesting and drying decisions through weather-based forecasting. To meet this goal, the team developed two models. The first is a Growing Degree Day (GDD) calculator that determines ideal harvest timing by tracking cumulative heat values (e.g., alfalfa at 750–755 GDD) and was validated against USDA reports. The second is a drying algorithm that predicts post-harvest moisture decay to guide mechanical drying transitions. While validation data were limited, the algorithm’s foundation in established literature and its adaptability to hourly weather data support its potential utility. Further validation is planned, and the two models already give Zalliant a foundation for reducing drying costs and improving crop storage outcomes. Future enhancements will expand crop coverage and incorporate hourly weather data for greater precision.
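
As a rough illustration of how a cumulative GDD threshold can be tracked, the sketch below uses the common averaging formula, GDD = max(0, (Tmax + Tmin) / 2 - Tbase), summed day by day. The 41 °F base temperature, the function names, and the temperature readings are assumptions made for the example; only the 750 GDD alfalfa target comes from the abstract, and the team's actual calculator may differ.

```python
from typing import Iterable, Optional, Tuple


def daily_gdd(t_max: float, t_min: float, t_base: float) -> float:
    """Averaging-method GDD for one day, floored at zero."""
    return max(0.0, (t_max + t_min) / 2.0 - t_base)


def first_day_reaching(
    temps: Iterable[Tuple[float, float]], t_base: float, target: float
) -> Optional[int]:
    """Return the 1-based day on which cumulative GDD first meets the target.

    `temps` yields (t_max, t_min) pairs, one per day. Returns None if the
    target is never reached within the supplied days.
    """
    total = 0.0
    for day, (t_max, t_min) in enumerate(temps, start=1):
        total += daily_gdd(t_max, t_min, t_base)
        if total >= target:
            return day
    return None


# Hypothetical season: 60 identical days with a 70 F high and a 50 F low.
# The 41 F base is an assumed value, not one stated in the abstract.
season = [(70.0, 50.0)] * 60
harvest_day = first_day_reaching(season, t_base=41.0, target=750.0)
print(harvest_day)
```

With hourly weather data, the same accumulation could be run on hourly means instead of the daily high and low.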

Corning: Developing a Finance-Based Peer Recommendation Engine

Team Members: Meg Hanson, Ke Xu, Zach Garson, Kathleen Zhou

Corning aims to develop a finance-based company peer recommendation engine to complement its existing industry-based model. The new system would take as input a company’s name and its associated financial KPIs, then output the top 10 peer companies based on financial similarity. To support this, the Capstone team collected data on publicly traded companies from annual financial filings, sourced through the WRDS database. After cleaning and imputing the data, the team built and tested unsupervised models (K-Means, DBSCAN, and OPTICS), each in weighted and unweighted versions. While each model had distinct strengths and weaknesses when applied to large, medium, and small public companies, the overall analysis provides Corning with valuable insights into the feasibility and interpretability of a finance-based peer recommendation engine.
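
To make the "top peers by financial similarity" idea concrete, here is a minimal sketch that standardizes each KPI and ranks companies by weighted Euclidean distance to the target. It is a plain nearest-neighbor ranking, not the team's clustering models, and the company names, KPI values, and function names are all hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List, Optional, Sequence


def standardize(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Z-score each KPI column so no single metric dominates the distance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def top_peers(
    names: Sequence[str],
    kpis: Sequence[Sequence[float]],
    target: str,
    k: int = 10,
    weights: Optional[Sequence[float]] = None,
) -> List[str]:
    """Return up to k company names closest to `target` in KPI space."""
    z = standardize(kpis)
    w = list(weights) if weights is not None else [1.0] * len(z[0])
    t = z[list(names).index(target)]
    scored = []
    for name, row in zip(names, z):
        if name == target:
            continue
        dist = math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, row, t)))
        scored.append((dist, name))
    return [name for _, name in sorted(scored)[:k]]


# Hypothetical companies with two made-up KPIs (e.g., margin and leverage).
names = ["A", "B", "C", "D"]
kpis = [[0.10, 1.0], [0.11, 1.1], [0.30, 3.0], [0.12, 0.9]]
peers = top_peers(names, kpis, target="A", k=2)
print(peers)
```

Passing `weights` corresponds loosely to the weighted variants mentioned above: a larger weight makes differences in that KPI count more toward the distance.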

University of Rochester: Identifying Social Risk Factors for Postpartum Hemorrhage Using National Cohort Data

Team Members: Aditi Marupaka, Katie Nguyen, Tracy Tan, Peter Zhao, Yuki Li

Postpartum hemorrhage (PPH), a leading cause of maternal mortality, disproportionately affects marginalized populations. This project aimed to identify whether social determinants of health (SDOH)—such as stress, discrimination, and housing insecurity—can help predict the risk of PPH using survey and electronic health record data from the All of Us Research Program. The team constructed a cohort of over 5,000 participants and analyzed group differences, correlation structures, and predictive patterns across SDOH variables. Statistical analysis revealed higher stress and discrimination scores among PPH cases, while predictive modeling showed limited accuracy due to class imbalance. Although the models were not highly predictive, the findings highlight meaningful associations between psychosocial factors and PPH risk, underscoring the potential of social context in maternal health research.

Paychex: An Artificial Intelligence Retrieval-Augmented Generation Chatbot for Labor Legislation

Team Members: Ethan Leung, Carol Li, Astha Singh, Keming Zhang

In today’s fast-evolving regulatory landscape, businesses must constantly adapt to changes in federal and state legislation to ensure compliance in areas such as payroll, taxation, and human resources. Paychex, a leading provider of integrated human resources solutions, faces the challenge of maintaining up-to-date regulatory information to effectively support over 740,000 clients. This capstone project focused on an AI-powered chatbot designed to assist Paychex in delivering accurate and timely regulatory guidance. The team used web scraping to gather information for the knowledge base, created a vector database using FAISS combined with TF-IDF keyword search, developed a retrieval-augmented generation workflow with LangGraph, and deployed it with a Streamlit frontend. The final product was a chatbot capable of responding quickly and effectively to users’ queries about labor and wage legislation.
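
The abstract does not say how the vector and keyword results were merged. One common option is reciprocal rank fusion, sketched below under that assumption: each document earns 1 / (k + rank) from every ranked list it appears in, and the sums decide the final order. The document IDs, the function name, and the constant k = 60 are illustrative choices, not details from the project.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> List[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each list contributes 1 / (k + rank) per document, with rank starting
    at 1. Ties are broken alphabetically so the output is deterministic.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for one query from the two retrievers.
vector_hits = ["doc_overtime", "doc_min_wage", "doc_sick_leave"]
keyword_hits = ["doc_min_wage", "doc_tip_credit", "doc_overtime"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Documents that both retrievers rank highly rise to the top, which is the usual motivation for pairing dense and keyword search before passing context to the generator.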

URMC – Dr. Zidian Xie: Public Perception & Impact of Disposable E-Cigarette Ban in the UK

Team Members: Chengze Miao, Bruce Zhang, Xinyu Wang, Isabel Liu, Yamin Zheng

In response to rising public concerns over youth vaping and environmental waste, the UK government proposed a ban on disposable e-cigarettes, set to take effect on June 1, 2025. This project explores public perception and response to the ban using Twitter/X data, focusing on user attitudes in the UK and the US, and behavioral reactions specifically among UK vapers. Over 2.4 million tweets (June 2023–October 2024) were analyzed using a human-guided NLP pipeline combining RoBERTa, ChatGPT, and BERTopic for sentiment classification, theme identification, and behavior detection. Comparative analysis was conducted between UK and US users to examine differences in public response. Results showed that vapers exhibited much stronger emotional engagement and were more likely to support the ban, with many indicating a switch to alternative products. Public sentiment peaked during policy announcements rather than enforcement, highlighting the importance of clear communication in health policy. These findings provide actionable insights for policy design and harm reduction strategies.

D3 Embedded: Ewaste-Net – A YOLO + Qwen Based Model for Reading Labels on Electronics

Team Members: Robert Ke, Yujun Sun, Rong Gu, Yiyang Wang, Shengjie Wang

This project tackles the growing mountain of e-waste by turning simple photos of devices into CSV data. The team combined three image collections—Ambient-125, Ewaste-70, and TextOBB-1681—and trained a You Only Look Once (YOLO) model to locate every label on a gadget. After cropping those regions, the lightweight Qwen optical character recognition (OCR) model swiftly transcribed serial numbers and model names. Line-level crops ran more than twice as fast as block crops and reduced the character error rate to roughly 0.98, which is a field-leading result. The detector achieved an F1 score of 0.89, and the full pipeline kept average word error below 1.1 while producing clean text for over 1,600 images. In the future, a large language model could convert the raw strings into structured fields—brand, model, production date—and estimate each item’s resale or recycling value. The result is an end-to-end “Ewaste-Net” engine that can help recyclers quickly sort and monetize e-waste at scale.
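
For readers unfamiliar with the metric, character error rate is conventionally the Levenshtein edit distance between the OCR output and the reference text, divided by the reference length, so lower is better and 0 means a perfect transcription. The sketch below follows that standard definition; the serial-number strings and function names are hypothetical, and the abstract does not state the scale of its reported figures.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,             # delete a reference character
                    curr[j - 1] + 1,         # insert a hypothesis character
                    prev[j - 1] + (r != h),  # substitute, or match at no cost
                )
            )
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical label text: the OCR output misreads one digit out of eight.
reference = "SN-12345"
ocr_output = "SN-12845"
print(character_error_rate(reference, ocr_output))
```

Word error rate is computed the same way over word tokens instead of characters.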

Dr. Jinjiao Wang, University of Rochester School of Nursing: Predicting Rehospitalization Risk from Polypharmacy in Older Adults

Team Members: Anthony Corbett, Jen Dutra, Jeewoo Park, Jocelyn Wood

This project aimed to identify which older adults are most at risk of being rehospitalized after receiving home healthcare, with a focus on those taking multiple medications. The team analyzed data from over 6,800 patients, cleaning and standardizing medication and diagnosis information, then using clustering to group patients and medications into meaningful categories. To predict rehospitalization, the team applied several machine learning models, including tree-based methods and a model that used BERT text embeddings to summarize medication data. The best-performing model achieved 96% accuracy and an AUC of 0.97. Results highlighted important risk factors such as high medication burden, limited physical function, and certain drug classes. These findings support safer prescribing decisions and more personalized care for older adults.

Dr. Erika Ramsdale, University of Rochester Medical Center: A Machine Learning Approach for Identifying High Chemotherapy Regret Risk in Older Cancer Patients

Team Members: Navya Bhagat, Mehmed Emre Aktas, Lakshmi Kanumuri, Narm Nathan

Chemotherapy regret is a critical yet under-recognized outcome among older adults undergoing cancer treatment. This project aimed to build an interpretable predictive model that flags patients at high risk for decisional regret, enabling earlier interventions and improved shared decision-making in clinical care. Using over 300 clinical, functional, and demographic features from geriatric oncology assessments, the team developed a logistic regression model with 99% accuracy, reducing the feature set to the 20 most influential predictors to support clinical applicability. Exploratory analysis, PCA-based dimensionality reduction, and false prediction profiling provided further insight into patient characteristics driving regret. This model lays the foundation for a deployable risk stratification tool and sets the stage for future integration of large language models (LLMs) to enhance prediction using clinical transcripts.

Valerie Carey: How to Make Your Resume “Fool” LLMs

Team Members: Qike Jiang, Yuewen Yan, Mark Xu, Yang Zhang

This open-ended project aimed to compare the results of analyzing a resume dataset using different models, ranging from conventional machine learning approaches to large language models (LLMs) accessed via API. The analysis was conducted either by training models with labeled datasets of resumes with known job categories or by directly prompting the LLM. Results showed that the LLM achieved relatively high classification accuracy, particularly on resumes that used less standard wording. The study also revealed that LLMs detect certain types of keywords that can represent candidates’ career levels. Therefore, candidates can, to some extent, improve their resumes by using terms more likely to be interpreted as indicators of higher career levels.

Benchmark Labs: Point-Specific Wave Forecasting to Support Offshore Wind Farm Operations

Team Members: Brennan Kalinowski, Tarun Paravasthu, Sean Tian, Madeleine Johnson

Rel8ed: Leveraging Data Science for Supply Chain & Risk Mitigation

Team Members: Diego Velázquez, Erjia Meng, Luca Laport, Zhe Chen, Zhefu Qin

Rel8ed aimed to improve its ability to assess business risk among small and medium-sized enterprises by analyzing both structured and unstructured data. The goal of this project was to help Rel8ed uncover hidden risk patterns across global supply chains by leveraging available risk indicators—Country Risk Assessment (CRA), Sector Risk Assessment (SRA), and Debtor Risk Assessment (DRA). The team used K-Means clustering to segment firms into risk-based groups, identifying shared characteristics and anomalies within the business landscape. They also conducted network analysis to map key shipper–consignee relationships, revealing where trade volume is concentrated and where financial, political, or sectoral risks are embedded. In addition, the team explored predictive modeling techniques to improve the classification of high-risk companies. The findings showed that many central trade hubs are heavily exposed to risky partners. These models and visualizations will support Rel8ed in delivering targeted, data-driven insights for supply chain risk monitoring.