This is the webpage for CMSE495 Data Science Capstone Course (Spring 2022)
In class today your team will build a tutorial/demo (written in a Jupyter notebook) that teaches fellow data scientists how to use a particular dataset or software package.
Every team will be given two class periods (today and next Friday) to finish this project, and teams will present their work at the end of the second class.
Each team will be assigned a dataset or tool (see list below). Your job is to build a simple yet comprehensive tutorial that can demonstrate how to access and use the dataset or tool.
The tutorial should include the following:
All teams should submit their Jupyter notebook (and any support files) as a pull request to the following Git repository:
Basic pull request instructions (as a reminder):
This should also work (you don't really need a branch):
Team | Project |
---|---|
Hope Village Revitalization | 1. Census Data |
Air Force Research Lab | 2. TPOT AutoML |
Boeing | 3. GENE AutoML |
Neogen | 4. Auto-SKLearn AutoML |
Old Nation Brewery | 5. Zotero Reference Database |
Ford Motor Company | 6. Video Image Data |
Argonne National Labs | 7. Audio Data |
Kellogg’s Company | 8. Social Media |
Delta Dental | 9. Google Sheets |
QSIDE Institute | 10. Graphical User Interfaces |
Every 10 years the US conducts a census. This data is extremely important and is used in a wide variety of projects (government policy, social justice, business, etc.). Your job is to build a simple yet comprehensive tutorial that can demonstrate how to access and use the Census data.
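One possible starting point is sketched below. It assumes the Census Bureau's public JSON API at `api.census.gov`; the dataset path (`2020/dec/pl`) and the total-population variable `P1_001N` are choices made here for illustration, not something the assignment specifies, and the helper names are hypothetical. The API returns a list of rows whose first row is the header, so most of the work is reshaping.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the 2020 Decennial Census redistricting dataset. Other datasets
# live under different paths on the same host.
BASE_URL = "https://api.census.gov/data/2020/dec/pl"


def build_query(variables, geography="state:*"):
    """Build a request URL for the given variables and geography."""
    params = {"get": ",".join(variables), "for": geography}
    # Keep commas, colons and asterisks readable in the query string.
    return BASE_URL + "?" + urllib.parse.urlencode(params, safe=",:*")


def rows_to_records(rows):
    """Turn the API's list-of-lists (header row first) into a list of dicts."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def fetch_records(variables, geography="state:*"):
    """Download and reshape one query (requires network access)."""
    with urllib.request.urlopen(build_query(variables, geography)) as resp:
        return rows_to_records(json.load(resp))
```

Calling `fetch_records(["NAME", "P1_001N"])` should give one dict per state. Values arrive as strings, so convert counts with `int(...)` before doing arithmetic. Heavier use of the API may require registering for a key.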
The TPOT project (written in part by an MSU alumnus) uses AutoML to explore classification algorithm space. AutoML tools can be helpful when exploring a dataset and trying to figure out what is possible. Your job is to build a simple yet comprehensive tutorial that can demonstrate how to use the TPOT software. A good tutorial will make it simple to swap out the example data for other datasets.
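To make the "swap out the example data" goal concrete, here is a rough sketch that keeps data loading separate from the search. It assumes the classic `TPOTClassifier` interface with `generations` and `population_size` arguments; newer TPOT releases may name things differently, so check the installed version. `load_xy` and `run_tpot` are hypothetical helper names, and the CSV is assumed to hold numeric features plus one label column.

```python
import csv


def load_xy(csv_path, target_column):
    """Read a numeric CSV into features X, integer labels y, and class names.

    Swapping datasets only means changing these two arguments.
    """
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    feature_names = [name for name in rows[0] if name != target_column]
    X = [[float(row[name]) for name in feature_names] for row in rows]
    classes = sorted({row[target_column] for row in rows})
    y = [classes.index(row[target_column]) for row in rows]
    return X, y, classes


def run_tpot(X, y, generations=5, population_size=20):
    """Search for a pipeline with TPOT; return (fitted model, test score)."""
    # Imported here so the data-loading helper works without TPOT installed.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from tpot import TPOTClassifier

    X_train, X_test, y_train, y_test = train_test_split(
        np.asarray(X), np.asarray(y), random_state=0
    )
    model = TPOTClassifier(
        generations=generations, population_size=population_size, random_state=0
    )
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)
```

Small values for `generations` and `population_size` keep a classroom demo short; larger values search more of the pipeline space at the cost of run time.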
The GENE project uses AutoML to explore classification algorithm space. AutoML tools can be helpful when exploring a dataset and trying to figure out what is possible. Your job is to build a simple yet comprehensive tutorial that can demonstrate how to access and use the GENE software. A good tutorial will make it simple to swap out the example data for other datasets.
The auto-sklearn project uses AutoML to explore classification algorithm space. AutoML tools can be extremely helpful when exploring a dataset and trying to figure out what is possible. Your job is to build a simple yet comprehensive tutorial that can demonstrate how to access and use the auto-sklearn software. A good tutorial will make it simple to swap out the example data for other datasets.
Zotero is a reference management tool that allows you to store and share references for papers in an easy-to-access online database. This database has a simple API you can use to download and access the Zotero data. Your job is to build a simple yet comprehensive tutorial that can demonstrate how to access and use Zotero data in a program. Do some simple calculations, such as counting the number of references in a database, or draw a bar chart of the number of references published in each year. Be creative.
Note: you will want to make a free Zotero account and set up a quick database. The instructor also has a database and API key you can use if need be.
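The per-year count suggested above could look roughly like this. The sketch assumes the Zotero Web API (version 3) for a personal library, with the key sent in a `Zotero-API-Key` header; `fetch_items`, `publication_year`, and `references_per_year` are hypothetical helper names, and the user ID and key are whatever your own account provides.

```python
import json
import re
import urllib.request
from collections import Counter

API_ROOT = "https://api.zotero.org"


def fetch_items(user_id, api_key, limit=100):
    """Download up to `limit` items from a user library (requires network access)."""
    request = urllib.request.Request(
        f"{API_ROOT}/users/{user_id}/items?format=json&limit={limit}",
        headers={"Zotero-API-Key": api_key, "Zotero-API-Version": "3"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


def publication_year(item):
    """Pull a four-digit year out of an item's free-text date field, if any."""
    date_text = item.get("data", {}).get("date", "")
    match = re.search(r"\b(1[5-9]\d\d|20\d\d)\b", date_text)
    return int(match.group(1)) if match else None


def references_per_year(items):
    """Count items by publication year, skipping items with no usable date."""
    years = (publication_year(item) for item in items)
    return Counter(year for year in years if year is not None)
```

`sorted(references_per_year(items).items())` gives (year, count) pairs ready for a bar chart. Notes and attachments come back as items too; they usually have no date and are skipped by the count.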
Much of our image data is stored in video format. Your job is to build a simple yet comprehensive tutorial that can demonstrate how to open a video and process each frame as an image in Python. For a bonus, also include how OpenCV can be used to write a video to a movie file.
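A minimal sketch of the read-process-write loop, assuming the `opencv-python` package (`cv2`) is installed. The helper names, the mirror-flip used as stand-in "processing", the `mp4v` codec, and the default frame rate are all illustrative choices; which codecs work depends on the local OpenCV build.

```python
def iter_frames(video_path):
    """Yield each frame of a video as a NumPy array in BGR channel order."""
    import cv2  # assumption: the opencv-python package is installed

    capture = cv2.VideoCapture(video_path)
    try:
        while True:
            ok, frame = capture.read()
            if not ok:  # end of file or read error
                break
            yield frame
    finally:
        capture.release()


def write_flipped_copy(src_path, dst_path, fps=30.0):
    """Mirror every frame horizontally and write the result to a new file."""
    import cv2

    writer = None
    for frame in iter_frames(src_path):
        processed = cv2.flip(frame, 1)  # 1 means flip around the vertical axis
        if writer is None:
            # The writer needs the frame size, so create it on the first frame.
            height, width = processed.shape[:2]
            fourcc = cv2.VideoWriter_fourcc(*"mp4v")
            writer = cv2.VideoWriter(dst_path, fourcc, fps, (width, height))
        writer.write(processed)
    if writer is not None:
        writer.release()
```

Because `iter_frames` is a generator, only one frame is held in memory at a time, which matters for long videos. Any per-frame function can replace the `cv2.flip` call.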
Audio is yet another type of data that has its own special formats. Your job is to build a simple yet comprehensive tutorial that can demonstrate how to open audio data and process it in Python (plotting a simple waveform would be sufficient). As a bonus, also include how to "play" the audio inside the notebook.
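As one way to get started without any downloads, the sketch below uses only the standard-library `wave` module. It synthesizes a short sine tone as stand-in data, then reads it back as a sequence of integer samples. It is limited to mono 16-bit WAV files, and the function names are hypothetical.

```python
import math
import struct
import wave


def write_sine_wav(path, freq_hz=440.0, seconds=1.0, rate=8000):
    """Create a mono 16-bit WAV file holding a sine tone (stand-in for real data)."""
    n_samples = int(seconds * rate)
    samples = (
        int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * i / rate))
        for i in range(n_samples)
    )
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # bytes per sample, i.e. 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))


def read_wav(path):
    """Return (sample_rate, samples) for a mono 16-bit WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles mono 16-bit audio")
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return rate, samples
```

With `rate, samples = read_wav(path)`, a waveform plot is `plt.plot([i / rate for i in range(len(samples))], samples)` using matplotlib, and `IPython.display.Audio(filename=path)` should embed a player in the notebook.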
Social media can be a rich source of data for a variety of scientific projects. Your job is to build a simple yet comprehensive tutorial that can demonstrate how to download and process social media data in Python. Start with an account you may already have and then see if you can add to the tutorial with other similar accounts. Some examples include Reddit, Facebook, Twitter, and LinkedIn.
Google Sheets is a simple online spreadsheet tool that can be used to store data. For example, this course uses Google Forms to make surveys that write to Google Sheets. Your job is to build a simple yet comprehensive tutorial that can demonstrate how to access live Google Sheets data inside of Jupyter.
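One lightweight approach, sketched here under the assumption that the sheet is shared so that anyone with the link can view it, is to request the sheet's CSV export URL each time the cell runs. Private sheets would instead need the Google Sheets API with credentials, which this sketch does not cover. The helper names are hypothetical; the sheet ID is the long token in the sheet's browser URL.

```python
import csv
import io
import urllib.request


def sheet_csv_url(sheet_id, gid=0):
    """Build the CSV export URL for one tab of a Google Sheet."""
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/export?format=csv&gid={gid}"
    )


def parse_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def read_sheet(sheet_id, gid=0):
    """Fetch the current contents of the sheet (requires network access)."""
    with urllib.request.urlopen(sheet_csv_url(sheet_id, gid)) as resp:
        return parse_csv(resp.read().decode("utf-8"))
```

Re-running `read_sheet(...)` picks up new survey responses as they arrive. If pandas is available, `pandas.read_csv(sheet_csv_url(sheet_id))` is a shorter route to a DataFrame.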
Gradio is a Python library that will allow you to build dynamic user interfaces inside of Jupyter. Your job is to build a simple yet comprehensive tutorial that can demonstrate how this library can be used. Focus on how this could be used to help provide sponsors with notebooks that act as a full interface to process data from a project.
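A small sketch of the idea, assuming the `gradio` package is installed: an ordinary Python function does the data processing, and `gr.Interface` wraps it in a text-in, text-out form. The `summarize` function is a made-up stand-in for whatever a sponsor's project actually needs.

```python
import statistics


def summarize(numbers_text):
    """Summarize comma-separated numbers typed into a text box."""
    values = [float(tok) for tok in numbers_text.split(",") if tok.strip()]
    if not values:
        return "No numbers found."
    return (
        f"n={len(values)}, mean={statistics.mean(values):.2f}, "
        f"min={min(values):.2f}, max={max(values):.2f}"
    )


def build_demo():
    """Wrap `summarize` in a text-in, text-out Gradio interface."""
    import gradio as gr  # assumption: the gradio package is installed

    return gr.Interface(fn=summarize, inputs="text", outputs="text")
```

Running `build_demo().launch()` in a notebook cell should render the interface inline. Keeping the processing function free of Gradio code means it can be tested and reused without the interface.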
Written by Dr. Dirk Colbry, Michigan State University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.