Fall 2024 Projects

Butler/Till: Enhancing Predictive Models with Synthetic Data for Advertising Insights

Team Members: Babli Dey, Selenge Tulga, Shakleen Ishfar, Yiyao Tao, Yuxin Sa

Marketing agency Butler/Till had insufficient real-world data to use for advertising analytics due to privacy regulations, technical constraints, and resource limitations. Therefore, the goal of this project was to generate synthetic tabular data to enhance the predictive accuracy of Butler/Till’s models. Students utilized advanced data synthesis techniques, including CTGAN, TVAE, CopulaGAN, PAR, and GaussianCopula, to create high-quality, representative datasets. The team also created a robust validation framework using a 10-fold time series cross-validation approach to assess the impact of synthetic data on model performance. The results demonstrated that incorporating synthetic data significantly improved model reliability and accuracy, with methods such as CTGAN outperforming others across multiple metrics. The project ultimately addressed Butler/Till’s data scarcity issues and delivered a scalable solution to improve predictive analytics for advertising, paving the way for better-informed marketing strategies and decisions.
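The summary does not say how the 10 folds were constructed. As a minimal sketch, assuming an expanding-window scheme in which every test block lies strictly after its training rows, the splits could be generated as follows. The function name and the sample size are illustrative, not taken from the project.

```python
from typing import Iterator, List, Tuple


def time_series_folds(
    n_samples: int, n_folds: int = 10
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs for expanding-window validation.

    Rows are assumed to be sorted by time. The data is cut into
    n_folds + 1 contiguous blocks; fold k trains on the first k blocks
    and tests on the block that follows them.
    """
    if n_samples < n_folds + 1:
        raise ValueError("need at least n_folds + 1 samples")
    block = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = k * block
        # The last fold absorbs any leftover rows.
        test_end = n_samples if k == n_folds else train_end + block
        yield list(range(train_end)), list(range(train_end, test_end))


# Illustrative size only: 110 time-ordered rows.
folds = list(time_series_folds(n_samples=110, n_folds=10))
```

In a comparison like the one described, synthetic rows would presumably be added only to the training indices of each fold, so that every model is still scored on real, later observations.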

CHeT: Predicting Rapid Cognitive Decline in Patients With Parkinson’s Disease

Team Members: Dennis Salinas, Veronica Mata, Sherin Thomas, Sayli Shivalkar, Alejandro Cruz

Cognitive impairment is a common symptom of Parkinson’s disease (PD), with patients experiencing varying levels of severity depending on the stage and progression of the disease. The goal of this project was to predict whether a patient with PD and alpha-synuclein pathology (a biological marker associated with the disease) would experience rapid or steady cognitive decline, using data from the Parkinson’s Progression Markers Initiative database. Students analyzed cognitive assessment scores over time and fitted a linear regression model to each patient’s scores to label the patient as a rapid or steady decliner based on the rate of deterioration. They then used these labels to train a machine learning classifier to predict which category a patient would fall into. The team identified key predictors of rapid cognitive decline for patients with PD, including clinical, biological, and brain imaging measures, which could help identify at-risk patients early in their disease progression.
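The labeling step can be illustrated with a short sketch: fit a least-squares line to each patient's scores over time and compare the slope with a cutoff. The cutoff value, the patient identifiers, and the scores below are hypothetical; the summary does not state which assessment or threshold the team used.

```python
from typing import Dict, List, Tuple

# Hypothetical cutoff: a loss of at least 1 point per year counts as rapid.
RAPID_THRESHOLD = -1.0


def decline_slope(visits: List[Tuple[float, float]]) -> float:
    """Ordinary least-squares slope of score versus time in years."""
    n = len(visits)
    mean_t = sum(t for t, _ in visits) / n
    mean_s = sum(s for _, s in visits) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in visits)
    if sxx == 0:
        raise ValueError("need visits at two or more distinct times")
    sxy = sum((t - mean_t) * (s - mean_s) for t, s in visits)
    return sxy / sxx


def label_patients(
    histories: Dict[str, List[Tuple[float, float]]],
    threshold: float = RAPID_THRESHOLD,
) -> Dict[str, str]:
    """Label each patient 'rapid' or 'steady' from the fitted slope."""
    return {
        pid: "rapid" if decline_slope(visits) <= threshold else "steady"
        for pid, visits in histories.items()
    }


# Made-up (years since baseline, score) pairs for two patients.
histories = {
    "patient_a": [(0.0, 28.0), (1.0, 26.0), (2.0, 24.0)],
    "patient_b": [(0.0, 27.0), (1.0, 27.0), (2.0, 26.0)],
}
labels = label_patients(histories)
```

Labels produced this way would then serve as the target variable for the downstream classifier.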

City of Rochester: Enhancing Emergency Response Efficiency for the Rochester Fire Department

Team Members: Brynn Lee, Eugene Ayonga, Homayra Tabassum, Medhini Sridharr, Nour Assili

The Rochester Fire Department (RFD) serves over 211,000 residents and responds to 40,000 incident calls each year. This project aimed to improve RFD’s operational efficiency by analyzing historical incident data and forecasting future trends to optimize resource allocation. Using advanced geospatial analysis and predictive modeling, students identified key patterns in incidents and evaluated the allocation of personnel and equipment across 15 fire stations. The team used time-series forecasting with the Prophet model and developed interactive maps to visualize response times and incident types. The model achieved strong predictive accuracy, capturing seasonal and spatial trends, and provided actionable insights for resource reallocation. Results included recommendations for relocating units to high-demand areas and implementing a low-acuity response program to reduce the strain on emergency resources. These strategies will position RFD to meet growing demands in its jurisdiction while maintaining timely and equitable service.

Goergen Institute for Data Science and Artificial Intelligence (GIDS-AI): Analyzing Statements of Purpose for Insights About Applicants

Team Members: Ajay Patel, Francesco Colombo, Ruiyang Peng, Sen Liu, Yijie Bai

The Goergen Institute for Data Science and Artificial Intelligence (GIDS-AI) aims to discern whether the contents of Statements of Purpose (SOPs) submitted by its master’s in data science applicants can provide insights about applicants and indicate whether they will accept an admission offer to the program. Therefore, the goal of this project was to build a predictive model capable of forecasting an applicant’s decision by analyzing their SOP. To achieve this goal, students used emotion and sentiment analysis on SOPs to determine which sentiments applicants most commonly express. The team then fed these sentiment features into predictive models, including logistic regression and random forest models, and incorporated admission and demographic traits to test their relevance. While students measured generally hopeful sentiments and achieved predictions that were better than a random baseline, their correlation analysis and results demonstrated that SOPs alone cannot be used to predict applicants’ admission acceptance decisions.

KOAA-AAS: Visualizing the Effects of the Total Solar Eclipse on the Rochester Community

Team Members: Jilan Lang, Shen Zhou, Yunfan Gong, Mingzhe Liu

KOAA-AAS wants to measure, document, and visualize how the total solar eclipse has impacted the Rochester community network. Therefore, the goal of this project was to understand shifts in sub-community dynamics and changes in interactions within the Rochester community. To achieve this goal, students applied the Louvain algorithm, a modularity-based community detection method, to divide the Rochester network into six sub-communities and assigned descriptive keywords to each sub-community. They also employed advanced visualization techniques to analyze the transformed data, quantifying changes both within and between these sub-communities. The team’s results highlighted the impact of the total eclipse on the Rochester community, showcasing how interactions and relationships evolved over time. This project provided a clear picture of the broader societal and cultural effects of the eclipse, showing how significant astronomical events can alter community behavior and engagement patterns.
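The Louvain algorithm searches for a partition of the network that maximizes modularity, a score that rewards edges falling inside communities more often than chance would predict. The sketch below computes that score for a given partition of an unweighted, undirected graph without self-loops. It illustrates the objective only, not the team's pipeline, and the toy graph is made up.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def modularity(edges: List[Edge], community: Dict[Hashable, int]) -> float:
    """Modularity Q = sum over communities of L_c/m - (d_c / 2m)^2.

    L_c is the number of edges inside community c, d_c is the total
    degree of its nodes, and m is the number of edges in the graph.
    """
    m = len(edges)
    internal = defaultdict(int)
    degree = defaultdict(int)
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph: two triangles joined by a single bridge edge (c, d).
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
```

Splitting the toy graph along the bridge scores higher than keeping all nodes in one group, which is the kind of improvement Louvain looks for at each step.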

MacroXStudio: Data Analysis Assistant

Team Members: Abhishek Sharma, Jayant Patil, Bhargav Sai Bhuvanagiri, Aashrith Maisa

The goal of this project was to develop a data analysis tool for finance professionals, particularly portfolio managers who lack coding skills, enabling them to analyze financial data and generate insights through natural language interactions. Students applied various techniques, including the integration of Large Language Models (LLMs) like GPT-4 and LLaMA, natural language processing, and vector similarity searches to identify relevant data columns and efficiently perform analyses. The final workflow achieved response times of under one second for simple tasks, balancing both accuracy and speed. This solution empowers non-technical users to perform sophisticated analyses, improves decision-making efficiency, and bridges the gap between complex data analysis and financial expertise.

MacroXStudio: Monitoring Global Gender Inequality and Child Labor Using Facebook

Team Members: Christy Kim, Weihong Qi, Jason Wang, Jiaqi Zhu

The goal of this project was to understand and address global gender inequality and child labor by exploring their interconnections with data from the World Bank, the International Labour Organization (ILO), UNICEF, and the Facebook API. Students employed Elastic Net regression and advanced feature engineering techniques to build predictive models for the Global Gender Gap Index (GGI) and child labor rates. Key findings include an R² of 97.54% for GGI predictions and the identification of socio-economic factors such as the Human Development Index and labor force participation as critical child labor indicators. The team also developed an interactive dashboard, enabling users to visualize trends and correlations. These results will provide valuable insights to policymakers, supporting more targeted and effective strategies to address global inequality and reduce child labor.

Rel8ed Analytics: Predictive Modeling of Business Credit Risk Using Archived Metadata and Activity Signals

Team Members: Neha Rana, Amisha Dubey, Ashika Kotia, Vansh Desai

The goal of this project was to develop predictive models for assessing financial risks and to identify business opportunities through advanced machine learning techniques. To achieve this goal, students employed Logistic Regression and Random Forest algorithms, complemented by sophisticated feature selection and model validation strategies. The team’s methodology also incorporated several advanced techniques: permutation importance for feature selection, SHAP (SHapley Additive exPlanations) analysis for model interpretability, and a Simple Imputer to prevent data leakage. In addition, students conducted rigorous hyperparameter tuning to optimize model performance and utilized time series analysis to engineer new features, enhancing the predictive capabilities of the models. The modeling approach yielded impressive results, with the risk prediction model achieving 93% accuracy and the business opportunity classification model demonstrating high precision. The statistical performance metrics were also particularly strong and included an R² of 0.81 for score prediction, a Root Mean Square Error (RMSE) of 0.86, and an F1 score of 0.92 for risk trend classification. By integrating advanced statistical methods, machine learning algorithms, and sophisticated analytical techniques, the team successfully developed a comprehensive predictive modeling framework capable of providing nuanced insights into business potential and financial risk.

UR Accounts Payable: Automating Accounts Payable Anomaly and Duplicate Detection

Team Members: Hengyu Zhang, Junya Jian, Sirui Tang, Daniel Jiao

The University of Rochester’s Accounts Payable Department processes over one million invoices annually, making it a challenge to identify and prevent financial discrepancies. The goal of this project was to develop a system that efficiently detects anomalies and duplicate payments, reducing the department’s reliance on costly external service providers and improving operational accuracy. To achieve this goal, students utilized advanced machine learning models, such as Lightweight Online Detector of Anomalies (LODA), Isolation Forest, and One-Class Support Vector Machine (OCSVM), which were integrated into a stacked ensemble for anomaly detection. The team also developed a robust Exact Matching method to detect duplicate payments caused by formatting inconsistencies or entry errors. Ultimately, the system successfully flagged over 53,000 potential anomalies and duplicates, prioritizing high-risk transactions and streamlining the audit process. The project’s final deliverables included an automated detection framework and a list of actionable insights for the Accounts Payable Department.
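One plausible reading of exact matching in the presence of formatting inconsistencies is to normalize each invoice's key fields first and then group invoices whose normalized keys are identical. The sketch below follows that assumption. The field names (id, vendor, invoice_number, amount) and the normalization rules are hypothetical, not the department's schema or the team's actual method.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple


def normalize_invoice(invoice: Dict) -> Tuple[str, str, int]:
    """Reduce an invoice to a canonical (vendor, number, cents) key."""
    # Collapse case and repeated whitespace in the vendor name.
    vendor = " ".join(invoice["vendor"].lower().split())
    # Keep only letters and digits in the invoice number.
    number = re.sub(r"[^A-Z0-9]", "", invoice["invoice_number"].upper())
    # Accept amounts given as numbers or as strings with thousands commas.
    amount_cents = round(float(str(invoice["amount"]).replace(",", "")) * 100)
    return vendor, number, amount_cents


def find_duplicates(invoices: List[Dict]) -> List[List[int]]:
    """Return groups of invoice ids that share a normalized key."""
    groups = defaultdict(list)
    for invoice in invoices:
        groups[normalize_invoice(invoice)].append(invoice["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


# Made-up records: the first two differ only in formatting.
invoices = [
    {"id": 1, "vendor": "Acme Supply", "invoice_number": "INV-00123", "amount": "1,250.00"},
    {"id": 2, "vendor": "acme  supply ", "invoice_number": "inv 00123", "amount": 1250},
    {"id": 3, "vendor": "Acme Supply", "invoice_number": "INV-00124", "amount": 1250},
]
duplicates = find_duplicates(invoices)
```

Because matching happens on the normalized key, differences in case, spacing, or punctuation do not hide a duplicate, while a genuinely different invoice number still keeps records apart.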

UR LLE: Laser Performance Prediction using Generative AI

Team Members: Anjaly George, Jaimin Shah, Karthik Dinesh, Rishabh Sharma

The goal of this project was to predict laser system performance by modeling the relationship between input and output energy distributions, enabling more precise and efficient experiment designs. To achieve this goal, students developed a two-stage framework: a ViT-3D Convolver produced initial predictions, and independent denoisers (UNet-ResNet for spatial beams and SFFN-AE for pulse shapes) refined the outputs. Additionally, the team explored alternative architectures such as LSTM for pulse-only prediction tasks. The ViT-3D Pulse Denoiser achieved the best results, with a Mean Squared Error (MSE) of 0.947 and an energy difference of 3.66 Joules for pulse predictions. For spatial beam outputs, the UNet-ResNet-based image denoiser demonstrated a low energy difference of 3.35 Joules and a Mean Absolute Percentage Error (MAPE) of 17.8%, showcasing the framework’s robustness across both modalities. These findings demonstrate the potential of generative AI for real-time, high-accuracy laser system predictions.

URMC CTSI: Exploring E-Cigarette Perception among Spanish and English-Speaking Users on Social Media

Team Members: Chen-Jui Chen, Anik De, Wonha Shin, Yadi Zhang, Chen Zhang

The goal of this project was to analyze public perceptions of e-cigarettes in English and Spanish language posts on social media to inform culturally appropriate public health interventions. Students employed natural language processing (NLP) techniques for classification, including data cleaning, preprocessing, labeling, and fine-tuning transformer-based machine learning models (e.g., BERTweet and Twitter Twin BERT Large). They then applied Sentence Transformers and K-means clustering for topic modeling to analyze thematic content. The team’s results revealed nuanced differences between linguistic groups. Specifically, English posts focused on public health and regulatory concerns, while Spanish posts highlighted the cultural and lifestyle aspects of vaping. Additionally, further attitude analysis revealed significant discrepancies between traditional sentiment metrics for English and Spanish speakers; translation evaluations revealed potential information losses but highlighted cultural nuances that could affect interpretations of the project’s findings. The team’s work underscores the need for culturally sensitive health messaging and policy approaches. It also emphasizes the value of robust multilingual analyses to inform public health strategies.

Wegmans: Predicting Customer Reorder Intervals for Wegmans Meals2GO

Team Members: Manasvi Patwa, Neel Agarwal, Pranav Yeola, Shyam Shah

The goal of this project was to predict the time intervals between customer orders for Wegmans Meals2GO to enhance inventory management and personalize marketing strategies. To achieve this goal, students analyzed a dataset of over 1.5 million items ordered by 277,099 customers over two years. The team refined the dataset through comprehensive preprocessing and feature engineering to ensure consistency and reliability. They then utilized a bucketing strategy to segment customers based on their ordering patterns and employed machine learning models, such as Extreme Gradient Boosting (XGBoost), to predict reorder intervals. Finally, students explored iterative modeling techniques to further improve their model’s predictive accuracy. The results showed that XGBoost can effectively predict reorder intervals and that time-based data splits improve model performance. This analysis provides actionable insights for Wegmans Meals2GO that will optimize future inventory planning and align promotional strategies with customer behaviors.
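As a minimal sketch of the data preparation behind such a model, the code below derives the reorder interval (days since a customer's previous order) and splits the resulting rows by date rather than at random, so the model is evaluated only on orders placed after everything it was trained on. The customer ids, dates, and cutoff are made up, and the team's actual features and split dates are not given in the summary.

```python
from collections import defaultdict
from datetime import date
from typing import List, Tuple

IntervalRow = Tuple[str, date, int]  # (customer_id, order_date, days_since_previous)


def reorder_intervals(orders: List[Tuple[str, date]]) -> List[IntervalRow]:
    """Compute the gap in days between consecutive orders per customer."""
    by_customer = defaultdict(list)
    for customer_id, order_date in orders:
        by_customer[customer_id].append(order_date)
    rows = []
    for customer_id, dates in by_customer.items():
        dates.sort()
        for previous, current in zip(dates, dates[1:]):
            rows.append((customer_id, current, (current - previous).days))
    return rows


def time_based_split(
    rows: List[IntervalRow], cutoff: date
) -> Tuple[List[IntervalRow], List[IntervalRow]]:
    """Train on orders before the cutoff, test on orders on or after it."""
    train = [row for row in rows if row[1] < cutoff]
    test = [row for row in rows if row[1] >= cutoff]
    return train, test


# Made-up order history for two customers.
orders = [
    ("c1", date(2024, 1, 1)),
    ("c1", date(2024, 1, 8)),
    ("c1", date(2024, 1, 22)),
    ("c2", date(2024, 1, 5)),
    ("c2", date(2024, 2, 4)),
]
rows = reorder_intervals(orders)
train, test = time_based_split(rows, cutoff=date(2024, 1, 15))
```

The interval column would be the regression target for a model such as XGBoost, and a customer's typical interval is one possible basis for the bucketing described above.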