{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pre-Class Assignment: Regression\n",
    "# Day 13\n",
    "# CMSE 202"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <p style=\"text-align: right;\"> &#9989; Put your name here</p>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Goals for Preclass\n",
    "\n",
    "After this pre-class assignment you should be able to:\n",
    "\n",
    "1. Generate a variety of randomized data\n",
    "2. Construct a 1-dimensional Linear Regression fit to these data\n",
    "3. Explain how to judge the quality of a 1-dimensional Linear Regression fit\n",
    "\n",
    "**This assignment is due by 11:59 p.m. the day before class,** and should be uploaded into appropriate submission folder on D2L.  Submission instructions can be found at the end of the notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Imports for the notebook\n",
    "\n",
    "Make sure you execute the following cell to get all of the imports you will need for this notebook.\n",
    "\n",
    "&#9989; **Review all of these imports, are any of them unfamiliar to you?** If so, look up the unfamiliar module(s) to learn a bit about what they do."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import random\n",
    "import statsmodels.api as sm\n",
    "from IPython.display import HTML"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font size=+3>&#9998;</font> Do This - Erase the contents of this cell and replace it with notes on any of the imports that were unfamiliar to you and what they appear to be for. (double-click on this text to edit this cell, and hit shift+enter to save the text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 1. Regression\n",
    "\n",
    "In this pre-class assignment, we're going to revisit a concept that you should have explored a bit in CMSE 201: Regression.\n",
    "\n",
    "The term \"Regression\" actually represents an entire class of algorithms that are used to **model** data. A regression model provides a way to visualize and predict how a **predictor variable**, usually the x-axis relates to a **dependent variable** usually shown on the y-axis. It is usually a first place to start when trying to build a model of data.\n",
    "\n",
    "Of the family of regression algorithms, the one most often taught at first is called Ordinary Least Squares, often shortened to OLS. When regression is first discussed in 201, you should have also learned a bit about the OLS approach to regression.\n",
    "\n",
    "**Let's revisit and review OLS.**\n",
    "\n",
    "A regression is used to estimate predictor parameters from data. In simple linear regression the x and y data are assumed to be linearly related. That is, there exists an equation:\n",
    "\n",
    "$$ y = Ax + B $$\n",
    "\n",
    "such that we must discover the values of $A$, often called the **slope** and of $B$, often called the **intercept** that is somehow optimal. There are many forms of optimization, but we will focus on OLS optimization."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.1 Ordinary Least Squares\n",
    "\n",
    "The OLS approach to optimizing the parameters above are to \"minimize the residuals.\" The picture below demonstrates the concept of a residual:\n",
    "\n",
    "<img src=\"https://community.cloudera.com/t5/image/serverpage/image-id/25068iFF075A5AEC3B8528/image-size/medium?v=1.0&px=400\"> (from <a href=\"https://community.cloudera.com/t5/Community-Articles/Understanding-Linear-Regression/ta-p/281391\"> here </a>)\n",
    "\n",
    "The idea is to **minimize** the square of the \"residuals\", that is the sum of the squared error as measured between the line and each actual value. Thus OLS tries to minimize:\n",
    "\n",
    "$$min(\\sum{{e_i}^2}) =  min(\\sum{({\\hat{y_{i}} - y_{i}})^2}) $$\n",
    "\n",
    "where $e^i$ is the error, more specifically if $\\hat{y_i}$ is the predicted value and $y_i$ is the actual value, the error is the difference between the two."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 2. OLS and statsmodels\n",
    "\n",
    "In 201, when work with regression and fitting models to data, you used some of the built in functions available in NumPy and SciPy, but these are just the tip of the iceberg in terms of what you can do with statistics and regression in Python.\n",
    "\n",
    "In addition, it's possible you've come across the `sklearn` module (no worries if you haven't) and you might have used it for doing some regression but, at least for some of the model work we want to do right now, the `statsmodel` package provides a little more statistical information and background. We'll certainly use `sklearn` later in the course, but it is always good to learn some alternatives, and there are quite a number available for Python!\n",
    "\n",
    "\n",
    "#### So what are we going to do?\n",
    "\n",
    "For this assignment, we are going to make some sample data and do an OLS type regression but, most importantly, we want to pay attention to some of the statistics that are generated from the model. It's all well and good to make a linear regression model, but it is **very** important that we know how good that model is!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1 Generating random numbers - a quick aside\n",
    "\n",
    "To generate our sample data we need to use some random numbers. It is important that we understand what a Random Number Generator, RNG for short, is and what we can and cannot do with it.\n",
    "\n",
    "An RNG is an algorithm for generating random numbers. If you thought about that statement at all, it might trouble you just a bit. By definition an algorithm is a method, a reproducible method, that can be used to accomplish a task. In fact, it is even worse than that. Most RNGs take a `seed` value, a value that initializes the RNG. If you seed an RNG with the same value, you get the same sequence. \n",
    "\n",
    "The `random` package provides access to RNG functions. Three approaches that we might find useful are (there are many more, look at <a href=\"https://docs.python.org/3/library/random.html\"> the docs </a>):\n",
    "* `random.seed()`. If you want to seed the RNG, you provide an argument. If not, a system dependent (often time-based) seed is used.\n",
    "* `random.int(start, stop)` : generate an **integer** between start and stop **inclusive**\n",
    "* `random.uniform(start, stop)`: generate a **float** between start and stop **inclusive**\n",
    "* `random.choice([sequence])` : yield one of the values from the sequence (not the range, the actual values).\n",
    "\n",
    "In NumPy we can create random numbers in similar ways doing things like `np.random.<METHOD>` (where `<METHOD>` is the method we want to use for generating the random numbers or making random selections), but we can also do things a little differently and first create a random number generator **object** much like we have done when we wrote our own class or used objects from other packages. Below you can find some examples creating random numbers with NumPy.\n",
    "\n",
    "**Let's quickly review these methods** (you should be a bit familiar with generating random numbers from 201 as well)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a random number generator\n",
    "rng = np.random.default_rng(seed = 123) \n",
    "# Create an array of 10 random integers from the domain [1, 5). 1 is included while 5 isn't\n",
    "ary1 = rng.integers(1, 5, size = 10)\n",
    "ary2 = rng.integers(1, 5, size = 10)\n",
    "rng2 = np.random.default_rng(seed = 123)\n",
    "ary3 = rng2.integers(low = 1, high = 5, size = 10)\n",
    "\n",
    "# Create an array of 10 random floats from a uniform distribution in the domain [0, 1). 0 is included while 1 isn't\n",
    "f_ar1 = rng.uniform(low = 0.0, high = 1.0, size = 10)\n",
    "\n",
    "print(ary1)\n",
    "print(ary2)\n",
    "print(ary3)\n",
    "print(f_ar1)\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`ary1` and `ary2` use the same call and generate different sequences. But if we recreate the RNG with the same seed, we get the same sequence. `ary1` and `ary3` are the same.\n",
    "\n",
    "What an RNG can provide is randomness of the sequence. That is, using statistical and other measures, predicting what the next number will be given an existing sequence should appear random. It is the sequence, which number comes next, that an RNG tries to randomize. \n",
    "\n",
    "It can be convenient to get the same sequence given the same seed, for testing purposes as an example. But the sense of random from an RNG is the randomness of the sequence, not the specific numbers generated.\n",
    "\n",
    "In the next code cell we will create random numbers from a normal distribution (aka Gaussian) and create and histogram plot from them"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create an array of 1000 random floats from a normal distribution with mean 0, and standard deviation 0.1\n",
    "\n",
    "def gaussian_dist(x, mu, sigma):\n",
    "    \"\"\"Calculate the Gaussian probability density function.\n",
    "    \n",
    "    Parameters:\n",
    "    ----------\n",
    "    x : numpy.ndarray\n",
    "        Range of values\n",
    "    \n",
    "    mu: float\n",
    "        Center of the Gaussian PDF.\n",
    "    \n",
    "    sigma: float\n",
    "        Standard deviation of the Gaussian PDF.\n",
    "    \n",
    "    Return\n",
    "    ------\n",
    "    f : numpy.ndarray\n",
    "        Gaussian PDF\n",
    "    \n",
    "    \"\"\"\n",
    "    f =  np.exp( - (x - mu)**2 / (2 * sigma**2)  )\n",
    "    f /= (sigma * np.sqrt(2 * np.pi)) \n",
    "                \n",
    "    return f\n",
    "                      \n",
    "mu, sigma = 0, 0.1 # mean and standard deviation\n",
    "s = rng.normal(mu, sigma, 1000)\n",
    "\n",
    "count, bins, _ = plt.hist(s, 30, density=True)\n",
    "plt.plot(bins, gaussian_dist(bins, mu, sigma),linewidth=2, color='r')\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's generate some data. \n",
    "\n",
    "\n",
    "&#9989; **Do This** - Using `numpy` and `random` as appropriate, let's make the following arrays:\n",
    "* `x_ary` with float values from 0 to 10 by 0.5 (you can use `arange`)\n",
    "* `y_ary` that has values that are based on `x_ary` linearly. Let's use $2x + 3$ as our linear relationship. \n",
    "* `y_noisy` is created by adding random noise to the values in `y_ary`. Do it uniformly from [-1 to 1]\n",
    "* `y_fixed` Use the `choice` method of your RNG to add either 5 or -5 to each value in `y_ary`\n",
    "\n",
    "**Note**: this can be done with just NumPy's modules/methods, if you which to avoid using the built-in Python `random` module."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# put your code here\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&#9989; **Do This** - Plot `x_ary` vs the other three array and see what you have. Here's an example that uses the matplotlib `subplots` functionality to make a stacked column of three plots that used `sharey=True` to ensure the y-axis values use the same ranges as it's already aligned in x by default:\n",
    "\n",
    "<img src=\"https://i.ibb.co/G33V3hg/example-plot.png\" alt=\"example-plot\" border=\"0\" width=350px>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 Doing the fit\n",
    "\n",
    "Let's now make an OLS fit for these three sets: (`x_ary`, `y_ary`), (`x_ary`, `y_noisy`), (`x_ary`, `y_fixed`). \n",
    "\n",
    "&#9989; Before you try to find the fits, think about it for a minute and answer below. **Which do you think will have the best fit? Which the worst fit? Why?**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font size=+3>&#9998;</font> Do this - Erase this and put your answer here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Although we already imported this at the top of the notebook, we're going to do the imports here again for `statsmodels` so you can see it in context. As a common practice though, it's still good to put all important imports at the top of the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import statsmodels.api as sm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you look at [the statsmodels documentation for OLS](https://www.statsmodels.org/stable/examples/notebooks/generated/ols.html) you should see that the standard order of operations is to:\n",
    "* create the model\n",
    "* fit the model\n",
    "* look at the results. \n",
    "\n",
    "Let's do the \"perfect fit\" pair above as an example and then you'll do the other two."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_perfect = sm.OLS(y_ary, x_ary)  # make the model\n",
    "results = model_perfect.fit()         # run the OLS fit\n",
    "print(results.params)                 # print intercept and slope"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Well that's not right!** We should expect two values: slope and intercept. Let's take a look at the predicted values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(results.predict())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### So what happened?\n",
    "So what we actually got was a slope back with the assumption that the intercept was 0. But how to get the correct intercept? The `statsmodels` OLS assumes that every column in the x values can be used as a parameter for regression, called **multiple regression** (which we'll get to a little later). To get an intercept, we need to add a column of constant value, essentially turning the linear equation into:\n",
    "\n",
    "$$ y = (A*x) + (B*constant) $$\n",
    "\n",
    "Conveniently, statsmodels has a function just for that called (wait for it) `add_constant`. It adds a constant column to either a DataFrame or as a NumPy array. Let's do that and try the fit again:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_with_cnst = sm.add_constant(x_ary)\n",
    "print(x_with_cnst)\n",
    "model = sm.OLS(y_ary, x_with_cnst)\n",
    "results = model.fit()\n",
    "print(\"Intercept and slope are:\", results.params)\n",
    "print(results.predict())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**That's the perfect fit we originally expected!** We can also print a large number of statistics using another built-in method, `summary()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(results.summary() )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### What does all of this mean?\n",
    "\n",
    "There is a lot of information there but the top 2 entries of 2nd column is information that tell us about the **quality of the fit**. The R-squared value has a range of 0 to 1.0, with 1.0 being perfect. As expected, ours is perfect. \n",
    "\n",
    "_We'll talk about Adjusted R-squared later in multiple regression but it also ranges from 0-1.0 and, in this example, is also perfect._"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also make plots of the fit using `plot_regress_exog`, which provides a comparison of the real data and the modeled data, as well as the residual plot. \n",
    "_There are two additional plots that `plot_regress_exog` produces -- the partial regression plot and the CCPR plot. We will discuss those later._\n",
    "\n",
    "You should notice that the residuals for this fit are extremely small, this is because of our perfect! As a result, the residuals are numerically equivalent to *zero*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = plt.figure(figsize=(15,8))\n",
    "fig = sm.graphics.plot_regress_exog(results, \"x1\", fig=fig)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.4 Fit the other two data sets\n",
    "\n",
    "Now that you've followed along with the example that used our perfect fit, you're going to try to fit the other two \"y\" arrays.\n",
    "\n",
    "&#9989; **Do This** - Perform linear model fits for the `y_noisy` and `y_fixed` pairs. **Print**: the fit parameters, the predictions of the model, and the model summaries as we have already done. Also use `plot_regress_exog` to produce plots of the model fits.\n",
    "\n",
    "**Note**: Make sure you use the new `x_with_cnst` array we created previously to ensure that you get back both a slope and a y-intercept in your best fit parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# fit the y_noisy values here\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# fit the y_fixed code here\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&#9989; **Do This**: Between the `noisy` and `fixed` data sets, which has a better linear fit? How can you tell?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font size=+3>&#9998;</font> Do this - Erase this and put your answer here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Follow-up Questions\n",
    "\n",
    "Copy and paste the following questions into the appropriate box in the assignment survey include below and answer them there. (Note: You'll have to fill out the assignment number and go to the \"NEXT\" section of the survey to paste in these questions.)\n",
    "\n",
    "1. When we think about doing regression modeling, what do we mean by \"residuals\"?\n",
    "\n",
    "2. What is the value that provides some form of measurement of the \"goodness of fit\" between our model and our data?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----\n",
    "# Assignment Wrap-up\n",
    "\n",
    "Please fill out the form that appears when you run the code below.  **You must completely fill this out in order to receive credit for the assignment!**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.display import HTML\n",
    "HTML(\n",
    "\"\"\"\n",
    "<iframe \n",
    "\tsrc=\"https://cmse.msu.edu/cmse202-pc-survey\" \n",
    "\twidth=\"800px\" \n",
    "\theight=\"600px\" \n",
    "\tframeborder=\"0\" \n",
    "\tmarginheight=\"0\" \n",
    "\tmarginwidth=\"0\">\n",
    "\tLoading...\n",
    "</iframe>\n",
    "\"\"\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---------\n",
    "### Congratulations, you're done with your pre-class assignment!\n",
    "\n",
    "Now, you just need to submit this assignment by uploading it to the course <a href=\"https://d2l.msu.edu/\">Desire2Learn</a> web page for the appropriate pre-class submission folder (Don't forget to add your name in the first cell)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&#169; Copyright 2024,  Department of Computational Mathematics, Science and Engineering at Michigan State University\n",
    "\n",
    "<!-- 9/16/2024 -->"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}