The project focuses on analyzing social networks to identify distinct communities and influential users. Social networks are intricate webs of interactions and connections, reflecting complex social dynamics. Understanding these networks helps in mapping out how information spreads, identifying community structures, and recognizing key figures who influence these communities. This analysis is crucial for various applications, including marketing, information dissemination, and sociological research.
Stanford Large Network Dataset Collection. Offers a wide range of social network datasets, including email networks, collaboration networks, and web graphs, which are ideal for community detection and influencer identification tasks.
Facebook Large Page-Page Network Data Set. A dataset capturing public pages and their mutual likes, useful for community detection.
link: https://www.kaggle.com/datasets/ishandutta/facebook-large-pagepage-network-data-set or https://snap.stanford.edu/data/
"A Survey of Community Detection Approaches: From Statistical Modeling to Deep Learning" by Di Jin, Zhizhi Yu, Pengfei Jiao, Shirui Pan, Dongxiao He, Jia Wu, Philip S. Yu, Weixiong Zhang.
This review surveys community detection methods, from statistical modeling to deep learning approaches.
Comment: Read this first for an overview of community detection.
"Detection of Opinion Leaders in Social Networks: A Survey" by Seifallah Arrami, Wided Oueslati, Jalel Akaichi
This paper presents different research works aimed at detecting opinion leaders in social networks.
link: https://link.springer.com/chapter/10.1007/978-3-319-59480-4_36
NetworkX: A Python package for the creation, manipulation, and study of complex networks.
Tutorials:
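To make this concrete, here is a minimal sketch (not from the tutorials above) of how NetworkX can detect communities and rank influential nodes. The built-in karate club graph stands in for the project datasets, which you would instead load from an edge list (e.g., with nx.read_edgelist).

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()  # small example graph shipped with NetworkX

# Community detection via greedy modularity maximization
communities = greedy_modularity_communities(G)
for i, nodes in enumerate(communities):
    print(f"Community {i}: {sorted(nodes)}")

# Influential users: rank nodes by degree and betweenness centrality
degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)
top_by_degree = sorted(degree, key=degree.get, reverse=True)[:5]
top_by_betweenness = sorted(betweenness, key=betweenness.get, reverse=True)[:5]
print("Top nodes by degree centrality:     ", top_by_degree)
print("Top nodes by betweenness centrality:", top_by_betweenness)
```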
Gephi: An open-source network analysis and visualization software.
This project aims to leverage machine learning algorithms to improve the accuracy and reliability of weather predictions. By analyzing historical weather data, including temperature, humidity, atmospheric pressure, wind speed, and direction, the project seeks to forecast future weather conditions. The initiative will explore various machine learning models to identify patterns and correlations within the data, enabling more precise predictions of weather phenomena such as rain, storms, and temperature changes.
Kaggle Weather Dataset. This dataset includes various weather conditions, which can be a good starting point for predictive modeling.
link: https://www.kaggle.com/datasets/muthuj7/weather-dataset
TensorFlow Weather Time Series Dataset. TensorFlow provides a tutorial that uses a weather time series dataset recorded by the Max Planck Institute for Biogeochemistry. The dataset contains 14 different features such as air temperature, atmospheric pressure, and humidity, collected from 2009 to 2016.
link: https://www.tensorflow.org/tutorials/structured_data/time_series
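As a rough illustration of the windowed-forecasting approach used in that tutorial, here is a minimal Keras sketch. The file name and the "T (degC)" column are assumptions based on the Jena climate CSV the tutorial works with; adjust them to match your copy of the data.

```python
import numpy as np
import pandas as pd
import tensorflow as tf

df = pd.read_csv("jena_climate_2009_2016.csv")           # path is an assumption
temps = df["T (degC)"].to_numpy(dtype="float32")[:20000]  # subsample to keep the demo fast

# Build (window -> next value) pairs: 24 past readings predict the next one.
window = 24
X = np.stack([temps[i:i + window] for i in range(len(temps) - window)])
y = temps[window:]

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window, 1)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mae")
model.fit(X[..., None], y, epochs=2, batch_size=256, validation_split=0.2)
```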
"Survey on weather prediction using big data analystics" by P. Chandrashaker Reddy, A. Suresh Babu
This paper surveys methods for weather prediction using big data analytics, focusing on rainfall forecasting and the challenges of achieving accurate predictions. It emphasizes the importance of advanced models and data from meteorological departments to enhance forecasting techniques.
link: https://ieeexplore.ieee.org/abstract/document/8117883/
Comment: Read this first for an overview of weather prediction.
"Deep Learning Weather Forecasting Techniques: Literature Survey" by Ayman M. Abdalla, Iyad H. Ghaith, Abdelfatah A. Tamimi
The paper provides a comparative analysis of deep learning models for weather forecasting, including CNNs, RNNs, and LSTMs. It focuses on their performance in predicting weather at different timescales and discusses the importance of model architecture, dataset evaluation, and prediction accuracy.
link: https://ieeexplore.ieee.org/document/9491774
Comment: Read this paper for an overview of applying deep learning to weather prediction.
MetPy: A Python package designed for meteorological data processing, offering tools for reading, visualizing, and interpreting weather data.
Tutorials:
GeoPandas: An extension of Pandas designed to make working with geospatial data in Python easier, useful for handling and analyzing weather data across different geographical locations.
Tutorials:
This project aims to harness machine learning algorithms to detect fraudulent activities in the financial sector. By analyzing patterns within transactional data, customer behavior, and financial records, the initiative seeks to identify anomalous and potentially fraudulent transactions. Implementing machine learning models will provide a dynamic tool for financial institutions to enhance their security measures, reduce losses due to fraud, and protect customer assets. The project encapsulates the development and deployment of predictive models that can sift through vast datasets to flag suspicious activities, showcasing the critical role of machine learning in bolstering financial security.
Credit Card Fraud Detection. This dataset contains transactions made by credit cards in September 2013 by European cardholders.
link: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
Synthetic Financial Datasets for Fraud Detection. A synthetic dataset of labeled fraudulent and legitimate transactions, generated using the PaySim simulator.
"Financial Fraud Detection Based on Machine Learning: A Systematic Literature Review" by Abdulalem Ali, Shukor Abd Razak, Siti Hajar Othman, Taiseer Abdalla Elfadil Eisa, Arafat Al-Dhaqm, Maged Nasser, Tusneem Elhassan, Hashim Elshafie and Abdu Saif
This review article provides a comprehensive examination of machine learning approaches to financial fraud detection, critically analyzing the effectiveness of various models and methodologies. It emphasizes the significance of Support Vector Machines (SVM) and Artificial Neural Networks (ANN) in tackling fraud, particularly in credit card transactions, highlighting the evolving landscape of financial security challenges and the pivotal role of advanced analytical techniques in their mitigation.
link: https://www.mdpi.com/2076-3417/12/19/9637
Comment: Read this first for an overview of financial fraud detection.
"Financial Fraud: A Review of Anomaly Detection Techniques and Recent Advances" by Waleed Hilal, S. Andrew Gadsden, John Yawney
This paper conducts a thorough review of anomaly detection techniques applied in financial fraud detection, focusing on recent advancements in semi-supervised and unsupervised learning models. It examines the evolution of fraud detection systems, addressing the shift from supervised learning models, which face significant challenges, to the promising potential of semi-supervised and unsupervised models in recent literature.
link: https://www.sciencedirect.com/science/article/pii/S0957417421017164
Comment: Read this paper for an overview of anomaly detection.
PyOD (Python Outlier Detection): Specializes in detecting anomalies and outliers in data, which is crucial for identifying fraudulent activities. PyOD includes more than 20 algorithms, ranging from classical LOF (Local Outlier Factor) to contemporary deep learning models like AutoEncoders.
Tutorials:
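As a hedged illustration of how PyOD fits into this workflow, the sketch below scores transactions with an Isolation Forest detector. The file name and the "Class" label column follow the Kaggle credit card dataset above and should be treated as assumptions for any other data.

```python
import pandas as pd
from pyod.models.iforest import IForest

df = pd.read_csv("creditcard.csv")
X = df.drop(columns=["Class"])   # unsupervised: labels held out for evaluation only
y = df["Class"]

clf = IForest(contamination=0.002, random_state=42)
clf.fit(X)

scores = clf.decision_scores_    # higher = more anomalous
flags = clf.labels_              # 1 = flagged as an outlier
print("Flagged transactions:", int(flags.sum()))
print("Known fraud among flagged:", int(y[flags == 1].sum()))
```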
Sentiment Analysis using Machine Learning focuses on the automated process of identifying and categorizing opinions expressed in text to assess the writer's sentiment towards specific topics or the overall context. This approach leverages machine learning techniques to distinguish between positive, negative, and neutral sentiments within a wide array of text sources such as social media posts, product reviews, and customer feedback. By harnessing the power of machine learning algorithms, sentiment analysis transcends traditional linguistic rule-based methods, allowing for more nuanced and accurate interpretations of the complex variations in human emotions. This capability is especially beneficial for applications in market research, brand monitoring, and enhancing customer experience, where understanding consumer sentiment is crucial.
Sentiment Labelled Sentences Dataset. This dataset includes labeled sentences from IMDb, Amazon, and Yelp, well suited to binary sentiment classification tasks.
link: https://archive.ics.uci.edu/dataset/331/sentiment+labelled+sentences
Twitter Data Set for Arabic Sentiment Analysis. This dataset is a collection of Arabic-language tweets, specifically curated for training and evaluating machine learning models on the task of sentiment analysis in the Arabic language.
link: https://archive.ics.uci.edu/dataset/293/twitter+data+set+for+arabic+sentiment+analysis
"A Survey on Aspect-Based Sentiment Analysis: Tasks, Methods, and Challenges" by Wenxuan Zhang, Xin Li, Yang Deng, Lidong Bing, Wai Lam.
This review article takes an in-depth look at aspect-based sentiment analysis, an important branch of sentiment analysis that focuses on the sentiment expressed toward specific aspects of a text.
link: https://arxiv.org/abs/2203.01054
Comment: Read this first for an overview of sentiment analysis.
"Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts" by Cicero Nogueira dos Santos and Maira Gatti
This paper explores how deep convolutional neural networks can be used for sentiment analysis of short texts, which is helpful for understanding how sentiment analysis applies to different types of text.
link: https://aclanthology.org/C14-1008/
Comment: Classic paper in the field of Sentiment Analysis
Natural Language Toolkit (NLTK): A popular Python library that provides tools for handling text data, including tokenization, stemming, tagging, parsing, and more.
Tutorials:
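For a quick baseline before training any machine learning model, NLTK ships the VADER lexicon. This minimal sketch scores two made-up sentences; it is a rule-based reference point to compare your ML models against, not one of the tutorials above.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the lexicon
sia = SentimentIntensityAnalyzer()

for text in ["The product is fantastic!", "Terrible support, never again."]:
    scores = sia.polarity_scores(text)
    label = "positive" if scores["compound"] > 0 else "negative"
    print(text, "->", label, scores)
```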
spaCy: Another powerful library for NLP in Python. It's known for its efficiency and ease of use in handling large text datasets.
Tutorials:
BERT and Transformers (Hugging Face): The Transformers library by Hugging Face provides a collection of state-of-the-art pre-trained models like BERT, GPT-2, T5, etc., which can be fine-tuned for specific tasks like sentiment analysis.
Tutorials:
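A minimal sketch of the Transformers pipeline API for sentiment classification. The example sentences are made up, and the checkpoint downloaded on first use is whatever default English sentiment model Hugging Face currently ships; a specific model can be passed via the `model` argument.

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
results = classifier([
    "The delivery was quick and the quality exceeded my expectations.",
    "I want a refund, this is the worst purchase I've made.",
])
for r in results:
    print(r["label"], round(r["score"], 3))
```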
The aim of this project is to develop a recommendation engine that mitigates decision fatigue and enhances user experiences on digital platforms. Utilizing sophisticated systems, the engine will analyze extensive datasets to suggest products, services, or content tailored to user preferences, based on their past behavior and other relevant factors. This personalization is crucial in aiding users to navigate the plethora of choices available online, enhancing engagement and satisfaction in domains such as entertainment, e-commerce, and social media. The project will leverage machine learning algorithms to refine and improve the accuracy of these recommendations continually.
MovieLens 20M Dataset. A comprehensive collection of movie ratings and tags from the MovieLens movie recommendation service, featuring over 20 million ratings and 465,564 tag applications across 27,278 movies by 138,493 users from January 1995 to March 2015. This dataset serves as an excellent basis for developing and evaluating recommendation systems.
link: https://www.kaggle.com/datasets/grouplens/movielens-20m-dataset
"Matrix Factorization Techniques for Recommender Systems" by Yehuda Koren, Robert Bell, and Chris Volinsky
This paper introduces the matrix factorization technique, a cornerstone approach in recommendation systems.
link: https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf
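To make the paper's core idea concrete, here is a small toy sketch (not code from the paper) of latent-factor matrix factorization trained by stochastic gradient descent. With MovieLens, the (user, item, rating) triples would come from the ratings file rather than the hand-written list below.

```python
import numpy as np

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 1, 4.0), (1, 2, 1.0), (2, 0, 4.0)]
n_users, n_items, k = 3, 3, 2                  # k latent factors
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, k))   # user factors
Q = rng.normal(scale=0.1, size=(n_items, k))   # item factors

lr, reg = 0.05, 0.02
for epoch in range(200):
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]                  # prediction error on this rating
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * P[u] - reg * Q[i])

print("Predicted rating for user 2, item 2:", round(float(P[2] @ Q[2]), 2))
```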
"Deep Learning based Recommender System: A Survey and New Perspectives" by Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay
A survey covering the use of deep learning techniques in recommendation systems, providing insights into the field's advancements.
Scikit-Learn: Offers tools for building recommendation systems using algorithms like matrix factorization.
Tutorials:
Deep Learning Libraries (Pytorch, TensorFlow, and Keras): Support the development of complex models for recommendation systems, including collaborative filtering and content-based recommendations.
Tutorials:
The Iris Species Classification project leverages machine learning to accurately classify iris plants into one of three species: Iris Setosa, Iris Versicolour, and Iris Virginica. This task is facilitated by analyzing the unique physical attributes of each iris species, which include sepal length, sepal width, petal length, and petal width. These features serve as the foundation for creating a predictive model that distinguishes between the species with high accuracy. The project not only embodies a classic problem in the field of machine learning but also provides a practical application of statistical pattern recognition and data analysis techniques.
UCI Machine Learning Repository Iris Data Set. This dataset is a foundational resource for the Iris Species Classification project, offering measurements for 150 iris plants across the three target species, with 50 instances of each. It includes four features: sepal length, sepal width, petal length, and petal width, which are used to train machine learning models to differentiate between the species.
""Machine Learning, Neural and Statistical Classification" by D. Michie, D.J. Spiegelhalter, and C.C. Taylor
This book provides a comprehensive overview of various classification methods, including statistical, neural, and machine learning approaches, with practical examples that can help understand the foundational concepts behind iris species classification.
"Pattern Recognition and Machine Learning" by Christopher M. Bishop
This textbook offers in-depth coverage of pattern recognition techniques and their application in machine learning, providing valuable insights into the methodologies that can be applied to the Iris Species Classification project.
Scikit-Learn: Offers tools for building classification models such as logistic regression, support vector machines, decision trees, and k-nearest neighbors.
Tutorials:
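Since scikit-learn bundles the iris data, a minimal end-to-end classification sketch looks like this (logistic regression is just one reasonable choice of model):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```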
Deep Learning Libraries (PyTorch, TensorFlow, and Keras): Support the development of more complex neural network classifiers.
Tutorials:
Machine Learning for Sales Forecasting harnesses the predictive power of machine learning algorithms to estimate future sales volumes based on historical data and influencing factors. This approach is critical for businesses seeking to optimize inventory management, allocate resources efficiently, and develop strategic marketing campaigns. By leveraging machine learning, companies can move beyond traditional forecasting methods, which often rely on simple extrapolation, to embrace models that consider complex patterns, seasonal variations, and the impact of external factors such as economic indicators and promotional activities. The capacity to predict sales with greater accuracy enables businesses to respond more agilely to market demands, minimize overstock and understock situations, and improve overall financial performance.
Walmart Store Sales Forecasting (Kaggle competition). Historical sales data for Walmart stores, suitable for building and evaluating sales forecasting models.
link: https://www.kaggle.com/competitions/walmart-recruiting-store-sales-forecasting/overview/description
"Python for Data Analysis" by Wes McKinney"
While not exclusively about forecasting, this book is essential for anyone working with data in Python. It provides a thorough introduction to using pandas, a key Python library for data manipulation and analysis, which is crucial for preparing your dataset for modeling.
""Introduction to Machine Learning with Python: A Guide for Data Scientists" by Andreas C. Müller & Sarah Guido
This book offers a practical introduction to machine learning with Python, focusing on the use of scikit-learn. It's a great resource for understanding the fundamentals of machine learning and how to apply them to real-world problems, such as sales forecasting.
Scikit-Learn: Offers tools for building regression and forecasting models, such as linear regression, random forests, and gradient boosting.
Tutorials:
Deep Learning Libraries (PyTorch, TensorFlow, and Keras): Support the development of more complex models for sales forecasting, such as recurrent neural networks for sequential data.
Tutorials:
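As a hedged sketch of the workflow, the example below builds lag features per store and department and fits a random forest. The file and column names (Store, Dept, Date, Weekly_Sales) follow the Walmart competition files and are assumptions if you work with other data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

df = pd.read_csv("train.csv", parse_dates=["Date"])
df = df.sort_values(["Store", "Dept", "Date"])

# Lagged sales for the same store/department are simple but strong predictors.
for lag in (1, 2, 52):
    df[f"lag_{lag}"] = df.groupby(["Store", "Dept"])["Weekly_Sales"].shift(lag)
df = df.dropna()

features = ["Store", "Dept", "lag_1", "lag_2", "lag_52"]
cutoff = df["Date"].max() - pd.Timedelta(weeks=12)      # time-based split, not random
train, test = df[df["Date"] <= cutoff], df[df["Date"] > cutoff]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(train[features], train["Weekly_Sales"])
print("MAE:", mean_absolute_error(test["Weekly_Sales"], model.predict(test[features])))
```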
The project, Predicting Stock Prices Using Machine Learning, dives into the complex and dynamic world of financial markets to tackle the age-old investing mantra of "buy low, sell high". This endeavor seeks to demystify the patterns of stock price movements by applying machine learning algorithms on historical trading data. The objective is to forecast future stock prices, thus providing investors with insights that could potentially lead to more informed decision-making. The challenge lies in the unpredictable nature of the stock market, influenced by numerous factors including economic indicators, company performance, and global events. By leveraging machine learning, this project aims to decode the seemingly random fluctuations in stock prices, offering a quantitative tool to aid in the prediction of stock trends.
Huge Stock Market Dataset. This dataset encompasses a comprehensive collection of historical daily price and volume data for all US-based stocks and ETFs trading on the NYSE, NASDAQ, and NYSE MKT. It stands out due to its high-quality, granularity, and the breadth of financial instruments covered, making it an ideal candidate for developing and testing stock price prediction models.
link: https://www.kaggle.com/datasets/borismarjanovic/price-volume-data-for-all-us-stocks-etfs
"Machine Learning for Stock Price Prediction: From Basics to Advanced" by Jason Brownlee
This comprehensive guide covers various aspects of applying machine learning to stock price prediction, from foundational concepts to more advanced techniques.
link: https://machinelearningmastery.com/start-here/#deep_learning_time_series
"Forecasting Stock Returns through Machine Learning Models" by Roberto Maestre and Yuwei Chen
This paper provides an in-depth analysis of different machine learning models for stock return prediction, comparing their performance and applicability.
link: https://www.sciencedirect.com/science/article/pii/S0957417419307280
Scikit-Learn: Offers tools for building regression models on engineered price features, such as linear models and tree ensembles.
Tutorials:
Deep Learning Libraries (PyTorch, TensorFlow, and Keras): Support the development of sequence models such as LSTMs for time-series prediction of stock prices.
Tutorials:
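A minimal, deliberately naive sketch of the prediction setup: lagged returns feed a ridge regression with a chronological train/test split. The per-ticker file name and the Date/Close columns follow the dataset above and should be treated as assumptions.

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

df = pd.read_csv("aapl.us.txt", parse_dates=["Date"]).sort_values("Date")
df["return"] = df["Close"].pct_change()
for lag in (1, 2, 3, 5, 10):
    df[f"ret_lag_{lag}"] = df["return"].shift(lag)
df["target"] = df["Close"].shift(-1)            # next day's closing price
df = df.dropna()

features = [c for c in df.columns if c.startswith("ret_lag_")]
split = int(len(df) * 0.8)                      # chronological split, no shuffling
train, test = df.iloc[:split], df.iloc[split:]

model = Ridge().fit(train[features], train["target"])
print("MAE:", mean_absolute_error(test["target"], model.predict(test[features])))
```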
This project aims to develop a predictive model using machine learning techniques to assess the likelihood of patients experiencing a stroke, based on a comprehensive set of health indicators and lifestyle factors such as age, hypertension, heart disease, diabetes, and smoking status. By integrating these variables, the model will predict stroke risk with the goal of supporting healthcare providers in their decision-making processes. This enables the identification and monitoring of high-risk patients, facilitating timely and potentially life-saving interventions. Moreover, the project will explore different machine learning algorithms to find the most accurate and efficient model for stroke prediction, thus contributing to improved healthcare outcomes and preventive care strategies.
Stroke Prediction Dataset. This dataset provides information on patients and is used to predict the likelihood of strokes. The features include gender, age, hypertension status, heart disease status, marriage status, work type, residence type, average glucose level, BMI, and smoking status.
link: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset
"Stanford Webinar: How Artificial Intelligence Can Improve Healthcare" by Nigam Shah
This video covers various tools that are beneficial for conducting usefulness analysis in healthcare settings. It explains how existing frameworks can assist in evaluating the practical usefulness of predictive models and includes a case study demonstrating how these models impact patient care.
"What is a Stroke" by Cleveland Clinic
This article provides a comprehensive overview of stroke, including causes, symptoms, and treatments. This resource is crucial for understanding the medical context of the predictive modeling, helping to inform the development and application of the stroke prediction model.
link: https://my.clevelandclinic.org/health/diseases/5601-stroke
Scikit-Learn: For building predictive models using logistic regression, decision trees, and random forest algorithms.
Tutorials:
Pandas and Numpy: For data manipulation and numerical calculations.
Deep Learning Libraries (PyTorch, TensorFlow, and Keras): Support the development of complex models.
Tutorials:
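A minimal sketch of a preprocessing-plus-logistic-regression pipeline for this task. The file and column names follow the Kaggle stroke dataset above and are assumptions; class_weight="balanced" is one simple way to handle the rarity of stroke cases.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("healthcare-dataset-stroke-data.csv")
X, y = df.drop(columns=["stroke"]), df["stroke"]

numeric = ["age", "avg_glucose_level", "bmi", "hypertension", "heart_disease"]
categorical = ["gender", "ever_married", "work_type", "Residence_type", "smoking_status"]

pre = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
model = Pipeline([("pre", pre),
                  ("clf", LogisticRegression(max_iter=1000, class_weight="balanced"))])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
model.fit(X_tr, y_tr)
print("ROC AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```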
This project aims to develop a predictive model using machine learning to forecast student performance based on a variety of factors that influence academic success. By analyzing features such as attendance, study habits, previous academic performance, and extracurricular activities, the model will provide insights into how these variables affect final grades. This information will be pivotal for educators to implement targeted interventions to help students improve and excel academically.
Student Performance Dataset. This synthetic dataset is crafted to reflect realistic educational research scenarios, incorporating a broad range of variables that impact student outcomes. The data includes attributes like attendance records, study habits, historical academic performance, and engagement in extracurricular activities, providing a comprehensive base for predicting student grades.
link: https://www.kaggle.com/datasets/haseebindata/student-performance-predictions
"The power of Deep Learning techniques for predicting student performance in Virtual Learning Environments: A systematic literature review" by Bayan Alnasyan, Mohammed Basheri, Madini Alassafi.
This systematic review discusses deep learning methods for predicting student performance in virtual learning environments, offering practical examples and systems that can enhance predictive analysis in educational settings.
link: https://www.sciencedirect.com/science/article/pii/S2666920X24000328
"Student Performance Prediction Using Machine Learning Algorithms" by Esmael Ahmed
This paper explores how predictive analytics is being used to shape future learning environments, focusing on the integration of data-driven insights to improve educational outcomes.
link: https://onlinelibrary.wiley.com/doi/10.1155/2024/4067721
Scikit-Learn: For building predictive models using logistic regression, decision trees, and random forest algorithms.
Tutorials:
Pandas and Numpy: For data manipulation and numerical calculations.
Deep Learning Libraries (PyTorch, TensorFlow, and Keras): Support the development of complex models.
Tutorials:
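A minimal sketch of grade prediction with gradient boosting. The file name and feature columns here are hypothetical placeholders; map them onto whatever attributes the chosen dataset actually provides.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("student_performance.csv")  # placeholder file name
features = ["attendance_rate", "study_hours_per_week",
            "previous_grade", "extracurricular_count"]  # placeholder columns
X, y = df[features], df["final_grade"]

model = GradientBoostingRegressor(random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print("Cross-validated MAE:", -scores.mean())
```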
The project, Predicting Heart Disease Risk Using Machine Learning, addresses one of the most critical challenges in healthcare: the early detection of heart disease. By analyzing patient health data, this initiative aims to predict an individual's risk of developing heart disease, enabling more proactive and informed healthcare decisions. Using the Cleveland Heart Disease Dataset, which includes a range of clinical and lifestyle variables, machine learning algorithms will be applied to identify patterns and risk factors associated with cardiovascular conditions. The goal is to create a predictive model that provides healthcare providers with a reliable tool to assess heart disease risk, allowing for timely interventions and personalized treatment plans. The challenge lies in accurately modeling complex patient data while accounting for diverse factors such as age, cholesterol levels, blood pressure, and lifestyle habits.
Heart Disease Dataset. This data set dates from 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It contains 76 attributes, including the predicted attribute, but all published experiments refer to using a subset of 14 of them. The "target" field refers to the presence of heart disease in the patient and is integer-valued: 0 = no disease and 1 = disease.
link: https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset
"Building A Heart Disease Prediction Model Using Machine Learning" by Oluseye Jeremiah
This guide walks through exploration of a similar dataset and applies a machine learning model, giving an idea of a typical workflow.
"HDPM: An Effective Heart Disease Prediction Model for a Clinical Decision Support System" by Norma Latif Fitriyani; Muhammad Syafrudin; Ganjar Alfian; Jongtae Rhee.
This study proposes an effective heart disease prediction model (HDPM) for a clinical decision support system, built around a Density-Based Spatial Clustering approach. Two publicly available datasets (Statlog and Cleveland) were used to build the model and compare the results with those of other models (naive Bayes (NB), logistic regression (LR), multilayer perceptron (MLP), support vector machine (SVM), decision tree (DT), and random forest (RF)) and of previous study results.
"Heart disease prediction using supervised machine learning algorithms: Performance analysis and comparison" by Md Mamun Ali, Bikash Kumar Paul, Kawsar Ahmed, Francis M Bui, Julian MW Quinn, Mohammad Ali Moni
This study aimed to identify machine learning classifiers with the highest accuracy for early heart disease diagnosis. Several supervised machine-learning algorithms were applied and compared for performance and accuracy in heart disease prediction. Feature importance scores were estimated for all applied algorithms except MLP and KNN, and all features were ranked by importance score to find those contributing most to heart disease prediction.
link: https://www.sciencedirect.com/science/article/abs/pii/S0010482521004662
Scikit-Learn: For building predictive models using logistic regression, decision trees, and random forest algorithms.
Tutorials:
Pandas and Numpy: Numpy is useful for numerical calculations and Pandas can help with CSV file reading and writing.
Deep Learning Libraries (PyTorch, TensorFlow, and Keras): Support the development of complex models.
Tutorials:
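A minimal sketch in the spirit of the papers above: a random forest classifier plus a feature importance ranking. The file name and the 0/1 "target" column follow the Kaggle dataset description and should be verified against your download.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart.csv")
X, y = df.drop(columns=["target"]), df["target"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))

# Which clinical variables drive the prediction? (echoes the feature-ranking idea above)
importances = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head())
```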
In business, churn is the percentage of customers who stop using a company's products or services within a specific time period. It's also known as customer attrition or customer turnover. This project tackles a crucial challenge of identifying customers at risk of leaving a service provider. The task is to predict customer churn based on various factors such as contract type, payment methods, service usage, and demographic data. This will help telecom companies understand why customers leave and develop strategies to retain them. The complexity arises from the wide range of variables influencing customer behavior, including pricing, service quality, and customer satisfaction. By building predictive models, this project aims to provide actionable insights, enabling companies to implement targeted retention efforts, reduce churn rates, and improve customer loyalty.
IBM Telco Customer Churn Dataset. This sample data module tracks a fictional telco company's customer churn based on various factors. The churn column indicates whether the customer departed within the last month. Other columns include gender, dependents, monthly charges, and several describing the types of services each customer has.
link: https://www.kaggle.com/datasets/blastchar/telco-customer-churn
"Churn Prediction using Machine Learning (Bank Customer)" by Simge Erek
This guide walks through exploration of a similar dataset and task and applies a machine learning model, giving an idea of a typical workflow.
link: https://www.kaggle.com/code/simgeerek/churn-prediction-using-machine-learning
"Research on telecom customer churn prediction based on ensemble learning" by Yajun Liu, Jingjing Fan, Jianfang Zhang, Xinxin Yin & Zehua Song
This study applies multidimensional data preprocessing and feature extraction to a dataset provided by a telecom operator. The k-means algorithm is then used to cluster customers into consumer groups, whose main concerns are analyzed to make targeted suggestions. Finally, ensemble learning, which combines multiple models, is introduced to improve the effectiveness and robustness of the model.
link: https://link.springer.com/article/10.1007/s10844-022-00739-z
"Risk assessment of customer churn in telco using FCLCNN-LSTM model" by Cheng Wang, Congjun Rao, Fuyan Hu, Xinping Xiao, Mark Goh
This study explores a more advanced method based on deep learning models. A novel Maj-LASSO algorithm is proposed to identify churn predictors under the constraint of unbalanced data, and a fused CNN-LSTM model is used for the churn prediction task.
link: https://www.sciencedirect.com/science/article/abs/pii/S0957417424002173
Scikit-Learn: For building predictive models using logistic regression, decision trees, and random forest algorithms.
Tutorials:
Pandas and Numpy: Numpy is useful for numerical calculations and Pandas can help with CSV file reading and writing.
Deep Learning Libraries (PyTorch, TensorFlow, and Keras): Support the development of complex models.
Tutorials:
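A minimal sketch of churn classification on the Telco data: categorical columns are one-hot encoded and a gradient-boosted model is fit. The file and column names (customerID, TotalCharges, Churn) follow the Kaggle dataset above and are assumptions for any other data.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")  # blanks -> NaN
df = df.dropna()

y = (df["Churn"] == "Yes").astype(int)
X = pd.get_dummies(df.drop(columns=["customerID", "Churn"]))  # one-hot encode categoricals

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
clf = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
print("ROC AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```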