CMSE 495

This is the webpage for CMSE495 Data Science Capstone Course (Spring 2022)

View the Project on GitHub msu-cmse-courses/cmse495-SS22

In-Class Assignment: Creating Jupyter Tutorials and Demos

Diagram of population density by state

Free image hosted on Wikipedia

In class today your team will build a tutorial/demo (written in a Jupyter notebook) that teaches fellow data scientists how to use different data and software packages.

Every team will be given two class periods (today and next Friday) to finish this project, and teams will present their work at the end of the second class.

Agenda (80 Minutes)

Team assignments

Each team will be assigned a dataset or tool (see list below). Your job is to build a simple yet comprehensive tutorial that can demonstrate how to access and use the dataset or tool.

The tutorial should include the following:

All teams should submit their Jupyter notebook (and any supporting files) as a pull request to the following Git repository:

CMSE495 Spring 2022 Tutorials

Basic pull request instructions (as a reminder):

  1. Fork the above repository
  2. Clone your forked repository
  3. Create a branch inside your forked repository
  4. Add your demo/tutorial to the branch
  5. Push your branch to your fork
  6. Issue a pull request from your branch on the GitHub page.

This should also work (you don’t really need a branch):

  1. Fork the above repository
  2. Clone your forked repository
  3. Add your demo/tutorial to the default branch of your clone
  4. Push your changes to your fork
  5. Issue a pull request from your fork on the GitHub page.

Team                          Project
Hope Village Revitalization   1. Census Data
Air Force Research Lab        2. TPOT AutoML
Boeing                        3. GENE AutoML
Neogen                        4. Auto-SKLearn AutoML
Old Nation Brewery            5. Zotero Reference Database
Ford Motor Company            6. Video Image Data
Argonne National Labs         7. Audio Data
Kellogg’s Company             8. Social Media
Delta Dental                  9. Google Sheets
QSIDE Institute               10. Graphical User Interfaces

1. Census Data

Every 10 years the US conducts a census. This data is extremely important and is used in a wide variety of projects (government policy, social justice, business, etc.). Your job is to build a simple yet comprehensive tutorial that can demonstrate how to access and use the Census data.
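
As a starting point, here is a minimal sketch of pulling state populations from the Census Bureau's public data API using only the Python standard library. It assumes the 2020 Decennial redistricting dataset (`dec/pl`) and its total-population variable `P1_001N`; the function names are hypothetical helpers, not part of any Census package. The API returns a list of lists whose first row is the header, so the sketch converts that into one dictionary per row.

```python
import json
import urllib.parse
import urllib.request

CENSUS_BASE = "https://api.census.gov/data"


def build_census_url(year, dataset, variables, geography):
    """Build a query URL for the Census data API (no API key, light use only)."""
    query = urllib.parse.urlencode(
        {"get": ",".join(variables), "for": geography},
        safe=",:*",  # keep the separators the API expects unescaped
    )
    return f"{CENSUS_BASE}/{year}/{dataset}?{query}"


def rows_to_records(payload):
    """Turn the API's list of lists (header row first) into a list of dicts."""
    header, *rows = payload
    return [dict(zip(header, row)) for row in rows]


def fetch_records(url):
    """Download one query result and return it as a list of dicts."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return rows_to_records(json.load(response))


# Example use in a notebook cell (requires network access):
# url = build_census_url(2020, "dec/pl", ["NAME", "P1_001N"], "state:*")
# for record in fetch_records(url)[:5]:
#     print(record["NAME"], record["P1_001N"])
```

Keeping the URL builder separate from the download step makes it easy to swap in a different year, dataset, or geography without touching the parsing code.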


2. TPOT AutoML

The TPOT project (written in part by an MSU alumnus) uses AutoML to explore the space of classification algorithms. AutoML tools can be helpful when exploring a dataset and trying to figure out what is possible. Your job is to build a simple yet comprehensive tutorial that can demonstrate how to use the TPOT software. A good tutorial will make it simple to swap out the example data for other datasets.


3. GENE AutoML

The GENE project uses AutoML to explore the space of classification algorithms. AutoML tools can be helpful when exploring a dataset and trying to figure out what is possible. Your job is to build a simple yet comprehensive tutorial that can demonstrate how to access and use the GENE software. A good tutorial will make it simple to swap out the example data for other datasets.


4. Auto-SKLearn AutoML

The auto-sklearn project uses AutoML to explore the space of classification algorithms. AutoML tools can be extremely helpful when exploring a dataset and trying to figure out what is possible. Your job is to build a simple yet comprehensive tutorial that can demonstrate how to access and use the auto-sklearn software. A good tutorial will make it simple to swap out the example data for other datasets.


5. Zotero Reference Database

Zotero is a reference management tool that allows you to store and share references for papers in an easy-to-access online database. This database has a simple API you can use to download and access the Zotero data. Your job is to build a simple yet comprehensive tutorial that can demonstrate how to access and use Zotero data in a program. Do some simple calculations, such as counting the number of references in a database, or make a bar chart of the number of references published in each year. Be creative.

Note: you will want to make a free Zotero account and set up a quick database. The instructor also has a database and API key you can use if need be.
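
One possible starting point, sketched with the standard library only: download items from a personal library through the Zotero web API and count references per publication year. The user ID and API key are placeholders you would supply from your own account, and `fetch_items` and `count_by_year` are hypothetical helper names. The sketch assumes each item carries its metadata under a `data` key with a free-form `date` string, and it pulls the first four-digit number out of that string as the year.

```python
import json
import re
import urllib.request
from collections import Counter


def fetch_items(user_id, api_key, limit=100):
    """Download up to `limit` items from a personal Zotero library."""
    url = f"https://api.zotero.org/users/{user_id}/items?format=json&limit={limit}"
    request = urllib.request.Request(
        url,
        headers={"Zotero-API-Key": api_key, "Zotero-API-Version": "3"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def count_by_year(items):
    """Count references per publication year, skipping items without a year."""
    counts = Counter()
    for item in items:
        date = item.get("data", {}).get("date", "")
        match = re.search(r"\b(\d{4})\b", date)
        if match:
            counts[int(match.group(1))] += 1
    return dict(sorted(counts.items()))


# Example use in a notebook cell (requires network access and your own credentials):
# items = fetch_items("YOUR_USER_ID", "YOUR_API_KEY")
# print(len(items), "items downloaded")
# print(count_by_year(items))
```

The dictionary returned by `count_by_year` maps directly onto a bar chart: the keys become the x-axis labels and the values become the bar heights.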


6. Video Image Data

Much of our image data is stored in video format. Your job is to build a simple yet comprehensive tutorial that can demonstrate how to open a video and process each frame as an image in Python. For a bonus, also include how OpenCV can be used to write a video to a movie file.


7. Audio Data

Audio is yet another type of data that has its own special formats. Your job is to build a simple yet comprehensive tutorial that can demonstrate how to open audio data and process it in Python (plotting a simple waveform would be sufficient). As a bonus, also include how to “play” the audio inside the notebook.
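
A minimal sketch of the idea, assuming uncompressed mono 16-bit WAV files (other formats such as MP3 need a third-party decoder). It uses the standard-library `wave` module to write a synthetic test tone and read it back as a sequence of samples; the function names are hypothetical. Plotting uses matplotlib, which is imported lazily so the reading code works without it.

```python
import array
import math
import sys
import wave


def write_sine_wav(path, frequency=440.0, seconds=1.0, rate=8000):
    """Write a mono 16-bit WAV file containing a pure tone (handy test data)."""
    n_samples = int(seconds * rate)
    samples = array.array(
        "h",
        (int(32767 * 0.5 * math.sin(2 * math.pi * frequency * i / rate))
         for i in range(n_samples)),
    )
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav.setframerate(rate)
        wav.writeframes(samples.tobytes())


def read_wav(path):
    """Return (sample_rate, samples) for a mono 16-bit WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles mono 16-bit audio")
        rate = wav.getframerate()
        samples = array.array("h")
        samples.frombytes(wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()
    return rate, samples


def plot_waveform(rate, samples):
    """Plot amplitude against time in seconds."""
    import matplotlib.pyplot as plt  # third-party, imported only when plotting

    times = [i / rate for i in range(len(samples))]
    plt.plot(times, samples)
    plt.xlabel("Time (s)")
    plt.ylabel("Amplitude")
    plt.show()
```

Inside a notebook, `IPython.display.Audio("tone.wav")` renders a small player for a file written this way, which covers the bonus.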


8. Social Media

Social media can be a rich source of data for a variety of scientific projects. Your job is to build a simple yet comprehensive tutorial that can demonstrate how to download and process social media data in Python. Start with an account you may already have and then see if you can add to the tutorial with other similar accounts. Some examples include Reddit, Facebook, Twitter, and LinkedIn.


9. Google Sheets

Google Sheets is a simple online spreadsheet tool that can be used to store data. For example, this course uses Google Forms to make surveys that write to Google Sheets. Your job is to build a simple yet comprehensive tutorial that can demonstrate how to access live Google Sheets data inside of Jupyter.
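
One simple route, sketched below, is the CSV export link that Google provides for any sheet shared as "anyone with the link can view". This is only one option (private sheets need authenticated access through the Google Sheets API instead), and the helper names and the sheet ID are placeholders. Because the data is downloaded fresh on every call, rerunning the cell picks up new form responses.

```python
import csv
import io
import urllib.request


def sheet_csv_url(sheet_id, gid=0):
    """Build the CSV export URL for one tab (gid) of a link-shared sheet."""
    return f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv&gid={gid}"


def parse_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_sheet(sheet_id, gid=0):
    """Download the current contents of the sheet as a list of dicts."""
    with urllib.request.urlopen(sheet_csv_url(sheet_id, gid), timeout=30) as response:
        return parse_csv(response.read().decode("utf-8"))


# Example use in a notebook cell (requires network access and a link-shared sheet):
# rows = fetch_sheet("YOUR_SHEET_ID")
# print(rows[:3])
```

If pandas is available, `pandas.read_csv(sheet_csv_url("YOUR_SHEET_ID"))` loads the same URL straight into a DataFrame.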


10. Graphical User Interfaces

Gradio is a Python library that will allow you to build dynamic user interfaces inside of Jupyter. Your job is to build a simple yet comprehensive tutorial that can demonstrate how this library can be used. Focus on how this could be used to help provide sponsors with notebooks that act as a full interface to process data from a project.
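
A minimal sketch of the pattern: write the data processing as an ordinary Python function, then wrap it in a `gradio.Interface` so a sponsor can run it from a text box instead of editing code. The `summarize_numbers` function is a made-up stand-in for real project code, and Gradio is imported lazily so the function can be tested without it.

```python
def summarize_numbers(text):
    """Parse comma-separated numbers and return a short summary string."""
    values = [float(token) for token in text.split(",") if token.strip()]
    if not values:
        return "No numbers provided."
    mean = sum(values) / len(values)
    return (
        f"count={len(values)}, min={min(values):g}, "
        f"max={max(values):g}, mean={mean:g}"
    )


def build_demo():
    """Wrap summarize_numbers in a simple text-in, text-out Gradio interface."""
    import gradio as gr  # third-party, imported only when building the UI

    return gr.Interface(
        fn=summarize_numbers,
        inputs="text",
        outputs="text",
        title="Number summary",
    )


# In a notebook cell, this displays the interface inline:
# build_demo().launch()
```

Keeping the processing function independent of the interface means the same code can be tested, reused in scripts, and later connected to richer input components such as file uploads.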


Written by Dr. Dirk Colbry, Michigan State University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.