CMSE 381 - Project description

Project Overview

This project provides students with an opportunity to work on real-world datasets while ensuring diversity in their project choices, fostering a broader understanding of data science applications.

Project Objectives

  1. Select two distinct datasets from the UCI Machine Learning Repository.

  2. Explore and understand the selected datasets.

  3. Perform a regression analysis on one dataset.

  4. Perform a classification task on the other dataset.

  5. Each analysis (regression and classification) must include at least one additional improvement, such as using K-fold cross-validation (CV) for parameter selection, or subset selection to reduce the number of parameters used in the model.

  6. Analyze and interpret the results of both regression and classification tasks.

  7. Present findings and insights in a Jupyter Notebook.
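
As one illustration of objective 5, the following is a minimal sketch of K-fold CV used for parameter selection. It assumes a one-predictor ridge regression fit in closed form and uses synthetic data; the function names, the candidate penalty values, and the data are all hypothetical and not part of the assignment. It relies only on the Python standard library to show the mechanics. In a notebook, a library such as scikit-learn would normally handle the fold splitting and model fitting.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge fit for y = a + b*x, penalizing only the slope b."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / (sxx + lam)
    a = y_bar - b * x_bar
    return a, b


def cv_mse(xs, ys, lam, k=5, seed=0):
    """Average held-out mean squared error over k folds for one penalty."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        a, b = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        fold_errors.append(mean([(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]))
    return mean(fold_errors)


# Synthetic data (an assumption for illustration, not a UCI dataset).
rng = random.Random(381)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

# Hypothetical grid of penalty values to compare.
candidate_lams = [0.0, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_mse(xs, ys, lam) for lam in candidate_lams}
best_lam = min(scores, key=scores.get)
print("CV error by penalty:", scores)
print("selected penalty:", best_lam)
```

The same folds are reused for every candidate penalty (fixed `seed`), so the comparison between penalties is not affected by differences in how the data were split.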

Dataset Selection

  • Each student must choose two datasets from the UCI Machine Learning Repository.

  • One dataset must be used for a regression task and the other for a classification task.

  • No more than two students may select the same dataset, and no two students may share both of their datasets.

  • Starting in class on 9/25/24, we will work on picking datasets. The spreadsheet of already-claimed datasets is here.

  • Dr. Munch has final veto power on any dataset choice.

Project Deliverables

You will submit a Jupyter Notebook containing the following for each of your two datasets.

  • A description of the data, including information about the variables as well as possible values of each variable as applicable. For example, any qualitative variable should have information on the levels included.

  • A description of the task being performed, including the input and output variables, plus information about the levels of the output for the classification task.

  • Code and prose showing data preprocessing and exploration.

  • Regression or classification task implementation and evaluation, along with any supporting discussion and code for parameter choices.

  • Results of the model with visualizations to support analysis.

  • Interpretation of results and insights.
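
For the data description deliverable, a minimal sketch of one way to list each variable along with its range or its levels is shown here. The toy table, its column names, and the helper functions are hypothetical stand-ins for a real downloaded dataset, and the sketch uses only the Python standard library. In a notebook, a data-frame library such as pandas would typically be used for the same purpose.

```python
import csv
import io
from collections import Counter

# Hypothetical toy data standing in for a downloaded dataset file.
RAW = """age,species,weight
3,cat,4.2
5,dog,20.1
2,cat,3.9
7,dog,25.0
"""


def is_number(text):
    """Return True if the text can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def describe_columns(rows):
    """Summarize each column: range if numeric, level counts otherwise."""
    summary = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        if all(is_number(v) for v in values):
            nums = [float(v) for v in values]
            summary[name] = {"type": "quantitative", "min": min(nums), "max": max(nums)}
        else:
            summary[name] = {"type": "qualitative", "levels": dict(Counter(values))}
    return summary


rows = list(csv.DictReader(io.StringIO(RAW)))
summary = describe_columns(rows)
for name, info in summary.items():
    print(name, info)
```

One caveat with this simple rule: a qualitative variable stored as numeric codes (for example, 0/1/2 for three classes) would be reported as quantitative, so the dataset's documentation should be checked before trusting the inferred types.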

Project Deadline

You will submit your project to Crowdmark. The deadline is the last day of class: Friday, Dec 6, at midnight.

Evaluation Criteria

Projects will be assessed based on the following criteria:

  1. Thorough data exploration and understanding.

  2. Proper implementation of regression and classification tasks.

  3. Effective use of tools and libraries.

  4. Clarity and organization of Jupyter Notebooks, including a combination of Markdown cells with explanations and well-commented code cells that run the tasks.

  5. Interpretation of results and insights.

The full rubric for grading can be found here: Project Rubric.