This is the webpage for CMSE495 Data Science Capstone Course (Spring 2022)
In class today we are going to try an experiment: writing some code as a team. We will take a problem and divide it into parts. Each person or sub-group will work on their part, and then we will combine the parts as a group and see if they all work together.
Everyone will be given 2 class periods (today and next Friday) to finish this project and teams will present their work at the end of the second class.
The instructions were not clear, so there was a LOT of confusion about signing your student agreements and turning them in. Please review the following video and double-check your submission.
Reminder: the Team Charter assignment is due on Sunday.
A git repository keeps track of individual authors and their changes. We want to write a program that evaluates the contribution of each author in a git repository for grading. This will involve generating a list of all of the contributors, measuring each author's contribution, and graphing the results.
The course instructor has broken the project down into the following programming components:
- Group 1 (git_log)
- Group 2 (git_diff)
- Group 3 (git_grade)
- Group 4 (git_graph)

Assuming we get all of these steps written as functions, we could imagine a program running from inside a git directory using the following syntax (or something similar):
from group1 import git_log
author_table = git_log() #Convert the output of "git log" command to a pandas table.
from group2 import git_diff
hash1 = author_table['Hash'][0]
hash2 = author_table['Hash'][1]
nlines = git_diff(hash1, hash2) # Show the number of lines for an individual hash contribution.
from group3 import git_grade
authors = git_grade(author_table) # Generate a list of authors and their contribution for a repository.
from group4 import git_graph
git_graph(authors) # Graph the results in a meaningful way.
Where each of the variables are of the following types:
- author_table - Pandas table with entries for Author, Date, Hash, and Comment.
- nlines - number of lines changed by the entry, provided as an integer.
- authors - dictionary with author names as keys and their total number of lines as values.

We are going to try to write each component of our software separately and then assemble them as a team. The instructor has assigned each group to one of the following teams:
Role | Team A | Team B |
---|---|---|
Management | AFRL | Delta Dental |
git_log | Argonne | Ford |
git_diff | QSIDE | Hope Village |
git_grade | Kelloggs | Neogen |
git_graph | Boeing | Old Nation |
✅ DO THIS: There are two breakout rooms set up (one for Team A and one for Team B). Please join the breakout room for your team and conduct a short meeting.
We should now have three zoom rooms you can use. Make sure you have all three written down someplace!!! The main course zoom room, Team A zoom room and Team B zoom room.
✅ DO THIS: Once you have noted all three zoom rooms, join your team's zoom room, get into your group’s breakout room and start working.
For the groups in charge of generating code you should focus on doing the following:
Key to the success of this project is careful communication between the groups. If your group finishes early, join the management group and help the other groups out. Good luck!
Group 1 (git_log)
author_table = git_log()
This group’s job is to write a function that parses the output of the git log command and returns it as a table. Key to the success of this is figuring out how to run git log from inside python (there are multiple ways) and making sure that your output data is formatted in a way consistent with what is expected as input downstream.
HINT You may want to consider working with Group 2 to find a common syntax for accessing git from python.
✅ DO THIS: Identify or clone a git repository you can use for testing. Pick one with lots of entries from a handful of authors. (Ex: SEE-Segment)
✅ DO THIS: As a group, create a file called group1.py and write a function called git_log that takes in a path to a git folder (default current folder ‘.’) and uses the “git log” command to generate a table of git commits from the folder with the following fields: Author, Date, Hash, Comment. Output this as a pandas table. As you make changes, commit/push this file to the assignment git repository.
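One possible starting point for Group 1 is sketched below. It is only a sketch under some assumptions: it calls git through the standard subprocess module, asks git log for a custom one-line-per-commit format, and splits parsing into a helper named parse_log_output (a hypothetical name, not part of the assignment) so the parsing can be tried on a plain string without a repository. The choice of the subject line as the Comment field is also an assumption.

```python
import subprocess

# Assumption: fields are separated by the ASCII "unit separator" byte,
# which is very unlikely to appear in an author name or commit subject.
FIELD_SEP = "\x1f"
COLUMNS = ["Author", "Date", "Hash", "Comment"]

# %an = author name, %aI = author date (ISO 8601), %H = full hash, %s = subject
LOG_FORMAT = "%an%x1f%aI%x1f%H%x1f%s"


def parse_log_output(text):
    """Turn the text produced by LOG_FORMAT into a list of row dictionaries."""
    rows = []
    for line in text.split("\n"):
        if not line:
            continue  # skip blank lines
        parts = line.split(FIELD_SEP, maxsplit=3)
        if len(parts) != len(COLUMNS):
            continue  # skip anything that does not look like a commit row
        rows.append(dict(zip(COLUMNS, parts)))
    return rows


def git_log(path="."):
    """Run "git log" in the given folder and return the commits as a pandas table."""
    # pandas is imported here so parse_log_output can be tried without it.
    import pandas as pd

    result = subprocess.run(
        ["git", "-C", path, "log", f"--pretty=format:{LOG_FORMAT}"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if git fails (e.g. not a repository)
    )
    return pd.DataFrame(parse_log_output(result.stdout), columns=COLUMNS)
```

Rows come back in the order git log prints them, which is newest commit first by default; downstream groups may want to confirm that ordering is what they expect.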
Group 2 (git_diff)
nlines = git_diff(hash1, hash2)
This group’s job is to write a function that takes two repository “hashes” and parses the output of the git diff command to return an integer representing the number of lines changed between the two hashes. Key to the success of this is figuring out how to run git diff from inside python (there are multiple ways) and making sure that your output data is formatted in a way consistent with what is expected as input downstream.
HINT You may want to consider working with Group 1 to find a common syntax for accessing git from python.
✅ DO THIS: Identify or clone a git repository you can use for testing. Pick one with lots of entries from a handful of authors. (Ex: SEE-Segment)
✅ DO THIS: As a group, create a file called group2.py and write a function called git_diff that takes two hash values and parses the output of the “git diff” command to return an integer with the total number of lines changed between the two hashes. As you make changes, commit/push this file to the assignment git repository.
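As a rough illustration of one way Group 2 might approach this, the sketch below uses git diff --numstat, which prints one tab-separated line per file with the added and deleted line counts. Several things here are assumptions rather than requirements: counting "lines changed" as added plus deleted, ignoring binary files (which numstat reports as "-"), the helper name count_changed_lines, and the optional path argument.

```python
import subprocess


def count_changed_lines(numstat_text):
    """Sum added + deleted lines from the output of "git diff --numstat"."""
    total = 0
    for line in numstat_text.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # not an "added<TAB>deleted<TAB>path" row
        added, deleted = parts[0], parts[1]
        # Binary files are reported as "-" for both counts; skip them.
        if added.isdigit() and deleted.isdigit():
            total += int(added) + int(deleted)
    return total


def git_diff(hash1, hash2, path="."):
    """Return the number of lines changed between two commits as an integer."""
    # Assumption: the optional path argument is an addition for convenience.
    result = subprocess.run(
        ["git", "-C", path, "diff", "--numstat", hash1, hash2],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if git rejects the hashes
    )
    return count_changed_lines(result.stdout)
```

Whether deletions should count toward a contribution is a grading decision the team may want to discuss with Group 3 and the management group.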
Group 3 (git_grade)
authors = git_grade(author_table)
This group’s job is to write a function that takes a pandas table as input and uses the git_diff function to generate a dictionary of authors and the total number of lines that they have contributed. Key to the success of this is figuring out how to write the loop without a working git_log or git_diff function. This will require coordinating with Group 1 and Group 2 to make sure you get the syntax right.
✅ DO THIS: As a group, create a file called group3.py and write a function called git_grade which takes a pandas table as input and uses the Group 2 git_diff function to loop over all of the commits and add up the total number of lines each author contributed. This function should return a dictionary with author names as keys and their total number of lines as values. As you make changes, commit/push this file to the assignment git repository.
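The loop can be developed before Groups 1 and 2 are finished by passing in a stand-in for git_diff. The sketch below is one way to do that, under stated assumptions: the table rows are ordered newest first (the default order of git log), each commit is compared with the row after it, the oldest commit is not counted because it has nothing to compare against, and the history is roughly linear (with merges, adjacent rows are not always parent and child). The diff_func parameter is a hypothetical addition for testing.

```python
def git_grade(author_table, diff_func=None):
    """Return a dictionary mapping each author to their total lines changed."""
    if diff_func is None:
        # Fall back to the real Group 2 function once group2.py exists.
        from group2 import git_diff as diff_func

    names = list(author_table["Author"])
    hashes = list(author_table["Hash"])

    # Start every author at zero so nobody is missing from the result.
    authors = {name: 0 for name in names}

    # Assumption: rows are newest first, so row i + 1 is the commit before row i.
    for i in range(len(hashes) - 1):
        newer, older = hashes[i], hashes[i + 1]
        authors[names[i]] += diff_func(older, newer)
    return authors


# Usage with a fake table and a fake diff function (no git needed):
fake_table = {"Author": ["Ada", "Bob", "Ada"], "Hash": ["c3", "c2", "c1"]}
fake_sizes = {("c2", "c3"): 5, ("c1", "c2"): 7}
print(git_grade(fake_table, diff_func=lambda a, b: fake_sizes[(a, b)]))
```

Because the function only indexes the table by column name, the same code runs on a pandas table or on a plain dictionary of lists, which keeps the stand-in data simple.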
Group 4 (git_graph)
git_graph(authors)
This group’s job is to write a function that takes a dictionary as input and outputs a graph representing the magnitude of each author's contribution to an input git repository. Key to the success of this is figuring out how best to summarize and visualize the data in a way that is easy for an instructor to understand.
✅ DO THIS: As a group, create a file called group4.py and write a function called git_graph which takes a dictionary of authors as input and generates a figure that clearly shows the contribution of each author and can be used to determine grading by an instructor. As you make changes, commit/push this file to the assignment git repository.
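A horizontal bar chart sorted from largest to smallest contribution is one simple option for Group 4. The sketch below assumes matplotlib is available; the helper sorted_contributions and the optional filename argument are hypothetical additions, and the chart type is only a suggestion.

```python
def sorted_contributions(authors):
    """Return (author, lines) pairs ordered from most to fewest lines."""
    return sorted(authors.items(), key=lambda item: item[1], reverse=True)


def git_graph(authors, filename=None):
    """Draw a horizontal bar chart of lines contributed per author."""
    # matplotlib is imported here so the sorting helper can be tried without it.
    import matplotlib.pyplot as plt

    items = sorted_contributions(authors)
    names = [name for name, _ in items]
    counts = [count for _, count in items]

    fig, ax = plt.subplots()
    ax.barh(names, counts)
    ax.invert_yaxis()  # put the largest contributor at the top
    ax.set_xlabel("Total lines changed")
    ax.set_title("Contribution by author")
    fig.tight_layout()

    if filename is not None:
        fig.savefig(filename)  # assumption: saving to a file is optional
    else:
        plt.show()
    return fig
```

Sorting before plotting makes it easy to see at a glance who contributed the most and who contributed nothing.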
The management group will create the team zoom room and a git repository and share it with the class. It is their job to organize the functions together and help support and coordinate the other groups.
✅ DO THIS: Have all of your members read through this entire document to see how your part of the project will fit in with other parts of the project.
✅ DO THIS: Create a git repository on gitlab.msu.edu and share this repository with all members of the class. The file structure for your git repository should probably be something like the following:
-- git_grader_repository
|-- .gitignore
|-- README.md
|-- group1.py
|-- group2.py
|-- group3.py
|-- group4.py
|-- git_grader_demo.ipynb
|-- git_grade.py
✅ DO THIS: Check in with each group and make sure they can clone the repository and contribute their initial stub functions.
✅ DO THIS: Continue to review all groups' code and make sure that everything will work together when it is all finished. Anticipate challenges, write test scripts, ask questions and provide help when needed. Bring groups together for meetings if there is confusion. Generally be there to help out and make sure the project has the resources it needs to succeed.
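A test script does not need working code from every group to be useful. As a hedged sketch, the function below (check_interfaces, a hypothetical name) takes the three functions as arguments and checks only that they agree on the types described earlier: a table with Author, Date, Hash and Comment columns, an integer line count, and a dictionary of integers. It can be run against stub functions first and against the real ones later.

```python
REQUIRED_COLUMNS = ("Author", "Date", "Hash", "Comment")


def check_interfaces(git_log, git_diff, git_grade):
    """Raise AssertionError if the functions disagree with the agreed types."""
    author_table = git_log()
    for column in REQUIRED_COLUMNS:
        assert column in author_table, f"missing column: {column}"

    hashes = list(author_table["Hash"])
    assert len(hashes) >= 2, "need at least two commits to test git_diff"

    nlines = git_diff(hashes[1], hashes[0])
    assert isinstance(nlines, int), "git_diff should return an integer"

    authors = git_grade(author_table)
    assert isinstance(authors, dict), "git_grade should return a dictionary"
    assert all(isinstance(v, int) for v in authors.values())
    return True


# Usage with stub functions standing in for the real groups' code:
stub_table = {
    "Author": ["Ada", "Bob"],
    "Date": ["2022-01-29", "2022-01-28"],
    "Hash": ["h2", "h1"],
    "Comment": ["second", "first"],
}
check_interfaces(
    lambda: stub_table,
    lambda hash1, hash2: 3,
    lambda table: {"Ada": 3, "Bob": 0},
)
```

The column check uses the in operator, which looks at column names for a pandas table and at keys for a plain dictionary, so the same check works on stubs and on real output.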
✅ DO THIS: Combine all of the functions into a single python file called git_grade.py. Create a jupyter notebook that demonstrates the use of the program on a couple of different git repositories. DO NOT wait until the end to write these tests. Having them early will help you visualize what needs to be done. Something like the following:
import git_grade as gg
author_table = gg.git_log() # Convert the output of "git log" command to a pandas table.
hash1 = author_table['Hash'][0]
hash2 = author_table['Hash'][1]
nlines = gg.git_diff(hash1, hash2) # Show the number of lines for an individual hash contribution.
authors = gg.git_grade(author_table) # Generate a list of authors and their contribution for a repository.
gg.git_graph(authors) # Graph the results in a meaningful way.
✅ DO THIS: Coordinate a 5-minute (max) presentation and be ready to share what your entire team did with the instructor. Demos of the working code are expected. Be prepared to answer questions such as “What works?”, “What doesn’t work?”, “Is this a good way to grade contributions?”, “Describe something interesting or challenging that happened during the project”, etc. (We will have presentations at the end of next Friday.)
Written by Dr. Dirk Colbry, Michigan State University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.