Homework 03: Projects
✅ Put your name here.
It is time to start working on your project. So let’s start from the beginning.
Deadline: Sunday, November 19 at 11:59PM
Total points: 22
Part 0. Define your goal (4 points)
Clearly define the goal of your project. Be specific!
Part 1. Exploratory Data Analysis (4 points)
Give a comprehensive description of your data with text and code. For example, do you have one dataset or multiple datasets? Are you going to merge them? What data does your dataset contain? What do the rows and columns represent? If you are doing image classification, what does each image represent? How much data do you have? How big is your dataset? How many categorical vs. numerical features do you have, and how are you going to handle them? How was the data obtained? Is it real-world data or simulation data? Is it a time series? What type is your data, e.g. int, float, object, string?
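As a starting point, many of these questions can be answered with a few pandas calls. The sketch below uses a small made-up table (the column names `age`, `income`, and `city` are hypothetical placeholders for your own data) and reports its size, data types, and the split between numerical and categorical features.

```python
import pandas as pd

# Hypothetical toy dataset standing in for your project data.
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 52],
    "income": [32000.0, 58000.5, 61000.0, 45000.0, 75000.25],
    "city": ["Lansing", "Detroit", "Lansing", "Flint", "Detroit"],
})

# How much data is there?
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")

# What type is each column?
print(df.dtypes)

# Which features are numerical and which are categorical?
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)

# How big is the dataset in memory?
print(f"Memory: {df.memory_usage(deep=True).sum()} bytes")
```

For your own project, replace the toy table with something like `pd.read_csv(...)` and add text explaining what each row and column means.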
Data cleaning: Start performing data cleaning. Are there any missing values? How are you encoding the data? Explain your choice. If you have multiple datasets, how are you handling them? How are you splitting your data?
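One possible cleaning workflow is sketched below on a made-up table. The choices shown (median imputation, an `"unknown"` placeholder category, one-hot encoding, a stratified 75/25 split) are assumptions for illustration, not requirements; pick what fits your data and explain why.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset with a few missing entries.
df = pd.DataFrame({
    "age": [23, 35, np.nan, 29, 52, 47, 31, 38],
    "city": ["Lansing", "Detroit", "Lansing", None,
             "Detroit", "Flint", "Flint", "Lansing"],
    "label": [0, 1, 0, 0, 1, 1, 0, 1],
})

# Count missing values per column before changing anything.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Fill numerical gaps with the median and categorical gaps with a placeholder.
# In a real project, compute the median on the training split only
# to avoid leaking information from the test set.
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna("unknown")

# One-hot encode the categorical feature.
X = pd.get_dummies(df.drop(columns="label"), columns=["city"])
y = df["label"]

# Hold out 25% of the rows for testing, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
print(X_train.shape, X_test.shape)
```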
Visualization: Make plots showing characteristics of the data. If you are doing regression on time series data, make plots showing historical events of importance. What other information do you have that is not contained in the dataset but is important for understanding the data?
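A histogram is often the first plot worth making for a numerical feature. The sketch below uses synthetic values (the feature and its label are placeholders) and matplotlib; it is only one of many plots that could characterize your data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for one numerical feature of your dataset.
rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=500)

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(values, bins=20)
ax.set_xlabel("feature value (placeholder)")
ax.set_ylabel("count")
ax.set_title("Distribution of one feature")
```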
Statistics: Show the distributions and correlations of your data. Calculate relevant statistics of your data, e.g. mean, median, skew, kurtosis.
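These statistics are all available as pandas methods. The sketch below computes them on two synthetic, correlated columns (`x` and `y` are placeholders); note that `DataFrame.kurt` reports excess kurtosis, so a normal distribution gives a value near 0.

```python
import numpy as np
import pandas as pd

# Two synthetic, correlated features standing in for real columns.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
df = pd.DataFrame({"x": x, "y": 2.0 * x + rng.normal(scale=0.5, size=1000)})

# One row per feature, one column per statistic.
summary = pd.DataFrame({
    "mean": df.mean(),
    "median": df.median(),
    "skew": df.skew(),
    "kurtosis": df.kurt(),  # excess kurtosis (normal distribution is about 0)
})
print(summary)

# Pairwise Pearson correlations between features.
correlations = df.corr()
print(correlations)
```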
Unsupervised: What unsupervised learning techniques do you think you can use here? Dimensionality reduction? Clustering? You don’t need to do them yet. Just describe what you think would be useful.
Part 2. Define a Metric (4 points)
How do you know that your model is doing well? Which metric will you use, e.g. RMSE, log loss, sparse categorical cross-entropy, accuracy? Explain your choice and write the mathematical equation for it using LaTeX. What is your loss function? Are there already analytical models? How do they perform with your metric?
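For reference, three of the metrics named above can be written as follows. The notation is an assumption made here: $N$ is the number of samples, $y_i$ the true value or label, $\hat{y}_i$ the model prediction, and $\hat{p}_i$ the predicted probability of class 1 in a binary problem. In a notebook Markdown cell, wrap each equation in `$$ ... $$`.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}

\mathrm{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N}
    \left[ y_i \log \hat{p}_i + (1 - y_i) \log\left(1 - \hat{p}_i\right) \right]

\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[ \hat{y}_i = y_i \right]
```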
Optimization: What optimization method are you going to use? If you don’t know what it is, look at the documentation of your model on sklearn or tensorflow.
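Besides reading the documentation, you can inspect a scikit-learn estimator directly. The sketch below uses `LogisticRegression` purely as an example model; its `solver` parameter names the optimization method, and the value printed depends on your installed scikit-learn version.

```python
from sklearn.linear_model import LogisticRegression

# Example model only; substitute the estimator you plan to use.
model = LogisticRegression()

# get_params() lists every hyperparameter with its current value.
params = model.get_params()
print("solver:", params["solver"])      # optimization method
print("max_iter:", params["max_iter"])  # iteration budget for the solver
```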
Part 3. Create a Baseline (4 points)
Before you start using complex models, look at the performance of the simplest model. If the simplest model gets a score of 99.9999999%, then what is the point of using complex, expensive models? No one wants to waste time and money. What is the simplest model that you can throw at your data? Regression -> Linear Regression, Classification -> Logistic regression or a random classifier. How do these models perform on your data? The score of your simplest model is your baseline, and your other ML models should do better than this (remember that we had a 90% chance of being right on the MNIST dataset).
This will require some hyperparameter tuning and cross-validation.
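A minimal sketch of a baseline with cross-validation and a small hyperparameter search is shown below. It assumes a binary task built from scikit-learn's bundled digits dataset ("is this digit a 5?"), used here only as a small stand-in for an MNIST-style problem; the grid of `C` values is arbitrary. Because roughly 90% of the digits are not a 5, a classifier that always predicts the majority class already scores close to 90%.

```python
from sklearn.datasets import load_digits
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Binary task: is the digit a 5 or not?
X, digits = load_digits(return_X_y=True)
y = (digits == 5).astype(int)

# Trivial baseline: always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X, y, cv=5, scoring="accuracy")
print(f"Majority-class accuracy: {baseline_scores.mean():.3f}")

# Simplest real model: scaled logistic regression, tuned over C with 5-fold CV.
simple_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(simple_model, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(f"Best C: {search.best_params_['logisticregression__C']}")
print(f"Logistic regression accuracy: {search.best_score_:.3f}")
```

On an imbalanced task like this one, accuracy alone can look deceptively good, which is one reason to compare every model against the trivial baseline.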
Visualization: Make plots of the learning curve, train-validation-test curves, timing curves. How long does it take for your machine learning model to fit?
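scikit-learn's `learning_curve` can produce both the score curves and the fit times in one call when `return_times=True`. The sketch below assumes the bundled digits dataset and a scaled logistic regression as the model; both are placeholders for your own data and baseline.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Fit on 20%..100% of each training fold and record scores and fit times.
train_sizes, train_scores, val_scores, fit_times, _ = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring="accuracy",
    return_times=True,
)

fig, (ax_score, ax_time) = plt.subplots(1, 2, figsize=(10, 4))

# Learning curve: mean accuracy across folds for each training size.
ax_score.plot(train_sizes, train_scores.mean(axis=1), "o-", label="train")
ax_score.plot(train_sizes, val_scores.mean(axis=1), "o-", label="validation")
ax_score.set_xlabel("training samples")
ax_score.set_ylabel("accuracy")
ax_score.legend()

# Timing curve: mean time to fit the model for each training size.
ax_time.plot(train_sizes, fit_times.mean(axis=1), "o-")
ax_time.set_xlabel("training samples")
ax_time.set_ylabel("fit time (s)")

fig.tight_layout()
```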
Part 4. Other ML models (2 points)
Now run your data through your other ML models and compare with the baseline. Make a table with the scores of your models.
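One way to build such a table is to loop over a dictionary of models, score each with the same cross-validation setup, and collect the results in a pandas DataFrame. The sketch below assumes scikit-learn's bundled wine dataset and an arbitrary pick of three models; swap in your own data, models, and metric.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Example models only; the baseline is included so every row is comparable.
models = {
    "baseline (most frequent)": DummyClassifier(strategy="most_frequent"),
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Score every model with the same 5-fold cross-validation.
rows = []
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    rows.append({"model": name, "mean accuracy": scores.mean(), "std": scores.std()})

# Best model first.
table = (
    pd.DataFrame(rows)
    .sort_values("mean accuracy", ascending=False)
    .reset_index(drop=True)
)
print(table.to_string(index=False))
```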
Part 5. Make a plan for the rest (4 points)
Now that you have a baseline, how do you plan to improve your models? What kind of feature engineering can you do to improve your model? What other hyperparameters can you change? Optimization techniques? Unsupervised techniques?
© Copyright 2023, Department of Computational Mathematics, Science and Engineering at Michigan State University.