---
title: "What is regression?"
description: "Make computers learn to predict outcomes."
prereq: "Python"
icon: ""
draft: false
weight: 1
---

# What is regression?

Regression is a technique to model the relationship between features (independent variables) and a prediction (dependent variable). It helps us understand how the value of the dependent variable changes based on the values of the independent variables. When applied properly, it lets us predict values 😯!

<!-- TODO: Add diagram to replace image below. -->
||
|:--:|
|Regression|

Regression is a powerful technique that lets us find a line or a curve that fits the data we have. By doing regression, we can create or reuse `mathematical models` that show how the dependent and independent variables are related. [Mathematical models](https://www.youtube.com/watch?v=xHtsuOB-TPw) are useful because they help us understand a system and make predictions based on the system's variables. Sometimes the mathematical model we choose may not match the data well, so we have to look for a better model that can capture the patterns we see.

## Example 1: How do we know if it is going to rain?

Whenever you use your phone, watch a newscast, or just ask the internet "What is the chance of rain today?", meteorologists are using mathematical models to predict the possibility of rain. You may wonder: what `variables` are taken into consideration when running those models? Is the smell of rain enough?

{{% expand "**What variables should you consider in order to know if it is going to rain?**" %}}

- Temperature → Is it hot, is it cold, or is it in the sweet spot?
- Altitude → Depending on how high you are relative to the ground, places can become cooler or warmer.
- Location → Where you are matters. (e.g.: forest, beach, mountain range, desert, etc.)
- Humidity → Do you feel that the air is heavy? Is there enough water in the air for it to rain?
- Time of the Year → What month is it? What season are we currently in?
- And many, many more!

There are many more variables to consider whenever we think there is a chance of rain. At the end of the day, the "**chance of rain**" is the dependent variable, while the properties mentioned above are the independent variables. The chance of rain depends on the values of those other variables.
{{% /expand %}}

### Exercise 1

Suppose that someone wants to know the type of shape they have based on the area of the shape.

1. Is that even possible? Can someone know what shape they have based on the area?
{{% expand "**Click to show answer**" %}}
***No***. As you can quickly guess, there is no relation between the **area** and the type of shape a figure may have.

For example, a square of side length 3 has an area of 9.

<h3>
\[
3 \cdot 3 = 9
\]
</h3>

But so does a triangle with a width of 6 and a height of 3.

<h3>
\[
(6 \cdot 3)/2 = 9
\]
</h3>

{{% /expand %}}
<br> | ||

2. If the area is not a good variable for identifying the type of shape, what relationship can we use to know the type of shape?
{{% expand "**Click to show answer**" %}}

***The number of **sides** it has! As the number of sides increases, you can tell the type of shape.***

Notice that we have built a linear relationship that can be represented as X = Y.

X → **the number of sides**
Y → **type of shape**

There you go, you created your first machine learning model!
{{% /expand %}}
<br>
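
The sides-to-shape relationship above can be sketched as a tiny lookup "model" in Python. This is only an illustrative sketch; the shape names and the `shape_for` helper are our own, not part of any library:

```python
# A minimal "model" mapping the independent variable (number of sides)
# to the dependent variable (type of shape).
SHAPE_BY_SIDES = {
    3: "triangle",
    4: "quadrilateral",
    5: "pentagon",
    6: "hexagon",
}

def shape_for(sides):
    """Predict the type of shape from its number of sides."""
    return SHAPE_BY_SIDES.get(sides, "unknown")

print(shape_for(3))  # → triangle
print(shape_for(7))  # → unknown
```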

A model, in the case of our previous example, is just a function that establishes a relationship between our dependent variable and independent variables. For example, functions such as lines (y = ax + b), parabolas (y = a(x - h)^2), cubic curves (y = ax^3 + bx^2 + cx + d), and many more can be used as models.

### How to make sure your model fits the data?

When we perform analysis using a machine learning model that already exists, we need to make sure we select an appropriate model, one that reasonably represents our data. In the image below, you can see that the data points resemble a parabola. It is very likely that the model we need is a parabolic model, something like this...

<h2>
\[
y = a(x - h)^2
\]
</h2>

When you use a parabolic model, you need to know where the vertex of the parabola will land in order to predict where a new point will be. You could run a model that is too simple and end up like the "Under-fitting" image, or one with far too many parameters and end up like the "Over-fitting" image. We want to set it just right so that our model "fits" the data and new data points can be represented using the model.

<!-- TODO: Add diagram to replace image below. -->
||
|:--:|
|Image 1: Under-Fitting and Over-Fitting|

You always need to make sure that the model you choose fits the data you are working with. Otherwise, you might run into one of two issues:

- **Under-Fitting**

  This happens when the model is unable to capture the relationship in the data you have been given. It often happens when there is not enough data to use, or when the model is too simple.

- **Over-Fitting**

  This happens when the model tries to accommodate every possible value in your data, even the ones that don't represent anything. By doing this, the model may be chasing values that are **outliers** and do not represent the reality of things. For example, a shape that has 1 or 2 sides doesn't make sense.
---
title: "Simple Linear Regression"
description: "Make computers learn to predict outcomes."
prereq: "Python"
icon: ""
draft: false
weight: 2
---

# What is simple linear regression?

Simple linear regression aims to find a correlation between two variables and derive a mathematical equation that explains the relationship between a dependent and an independent variable. With simple linear regression, in general, we want to answer two questions:

1. Is there a **relationship** between the variables we have?

You can determine the relationship between income and spending, experience and salary, or humidity and temperature. But, as a counter-example, there is NO relationship between the height of a student and their exam scores.

2. Can we **forecast / predict** values with this?

With regression, we can train the model and find out whether we can predict values with some certainty. Can we use what we know about the relationship to predict new values?

Example: What will be the temperature tomorrow? How much will my bakery sell this year compared to last year? How much will my salary be if I have 5 years of experience?

# Variable roles

In simple linear regression, variables take one of two roles.

1. **Dependent Variable**

The variable whose value we want to predict or forecast. We call it **dependent** because its value depends on something else. We will call this variable **y**.

2. **Independent Variable**

The variable we can control or change in order to affect the dependent variable. We will call this variable **x**.

Example: If an apple costs $1.00 and you buy 10 of them, the total cost will be $10.00. The dependent variable here is the `total cost`, while the independent variable is the number of apples you want to buy.
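
The apple example can be sketched in a couple of lines of Python. The `total_cost` function is our own illustration; the $1.00 price comes from the example above:

```python
PRICE_PER_APPLE = 1.00  # price from the example above

def total_cost(apples):
    """Dependent variable: total cost. Independent variable: number of apples."""
    return PRICE_PER_APPLE * apples

print(total_cost(10))  # → 10.0
```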

# The Simple Linear Equation Mathematical Model

When we use simple linear regression, we call it **linear** because, well... the mathematical model represents a straight line in a 2D plane. Let's think about it for a second.

What is the math equation for a straight line?

{{% expand "**Click to show answer**" %}}

This equation may seem very familiar to you: it's the general equation for a straight line.

<h1>
\[
y = ax + b
\]
</h1>

- **x** is the independent variable.
- **y** is the dependent variable.
- **a** is the slope of the line.
- **b** is the intercept, or the value of **y** when **x = 0**.

Following this equation, we will elaborate on how simple linear regression models calculate and predict new values.

{{% /expand %}}
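
The straight-line equation translates directly into code. A minimal sketch with made-up values a = 2 and b = 1, just to show each part of the equation:

```python
def predict(x, a, b):
    """Straight-line model: y = a*x + b."""
    return a * x + b

# Illustrative slope a = 2 and intercept b = 1:
print(predict(0, a=2, b=1))  # → 1  (the intercept: the value of y when x = 0)
print(predict(3, a=2, b=1))  # → 7
```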

# Real World Examples

In the real world, data is not always linear and can behave differently from what we expect. At first glance, it may seem that data has no relation at all. In the case of simple linear regression, what you need to look for is data that somewhat follows a linear pattern.

Suppose that you work as a `Data Analyst` for the human resources department of a company that has over 10,000 employees. Your boss wants to know if the years of experience of an employee have anything to do with the amount of money they earn. Of course, since you are a `Data Analyst`, you can check the employee database and quickly verify the following:

1. What is their current salary?
2. How many years of experience does the person have?

Assume that you are able to get data from 30 random employees, which looks something like this:

|Employee ID|Years of Experience|Salary|
|:--:|:--:|:--:|
|1|1.1|39343|
|2|1.3|46205|
|3|1.5|37731|
|4|2.0|43525|
|5|2.2|39891|
|6|2.9|56642|
|7|3.0|60150|
|8|3.2|54445|
|...|...|...|
|26|9.0|105582|
|27|9.5|116969|
|28|9.6|112635|
|29|10.3|122391|
|30|10.5|121872|
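
One quick way to check whether a sample "somewhat follows a linear pattern" is to compute the Pearson correlation coefficient. A sketch using only the rows visible in the table above (the `pearson` helper is written by hand here; values close to 1 suggest a strong linear relationship):

```python
import math

# The visible rows of the table above (the "..." rows are omitted).
years = [1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 9.0, 9.5, 9.6, 10.3, 10.5]
salary = [39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445,
          105582, 116969, 112635, 122391, 121872]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(round(pearson(years, salary), 3))  # close to 1 → strongly linear
```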

After you check the table, you plot all these values in a 2D scatter plot and get an image like so.

||
|:--:|
|Scatter Plot: Years of Experience vs Salary.|

As you can see, the dots _somewhat_ resemble a line. Let's go ahead and draw an imaginary line and see if we can pass through all the dots.

||
|:--:|
|Scatter Plot: Years of Experience vs Salary with line.|

As you can see, the line doesn't pass through **ALL** the dots, but it gets somewhat close. What does this mean? Why are the dots close to our imaginary line in some cases and far away in others?

So far we know that:

1. The data follows a somewhat **linear** pattern.
2. The data has 2 important variables, **SALARY** and **YEARS OF EXPERIENCE**. This means that we can start to **model** our data like a linear equation.

**Question:** We know that **SALARY** and **YEARS OF EXPERIENCE** are our variables, but which one is the dependent and which one is the independent variable?
{{% expand "**Click to show answer**" %}}

- **YEARS OF EXPERIENCE (XP)** is our independent variable.
- **SALARY** is our dependent variable.

If we plug these into our linear equation, you will get something like this.

<h1>
\[
SALARY = a(XP) + b
\]
</h1>

With an equation like this we are saying: "The years of experience have a direct effect on the salary of an employee".

{{% /expand %}}

# The Possibility of Errors

As we mentioned before, data is not always consistent and can behave in different ways. What this means is that our linear equation needs to account for a possible error. But how do we represent that error in the equation? How can that error be visualized in the scatter plot?

Let's assume that all the employees chosen in the table above are from San Antonio, TX, but the HR department accidentally added employees from the city of Seattle, WA into the data set you chose. [The cost of living in Seattle is 28.6% higher than San Antonio](https://www.numbeo.com/cost-of-living/compare_cities.jsp?country1=United+States&country2=United+States&city1=Seattle%2C+WA&city2=San+Antonio%2C+TX&tracking=getDispatchComparison). This would explain why some data points in the plot are farther away from the imaginary line we have traced. These are considered errors in our data.

<!-- TODO: Add diagram to replace the figure below.-->
||
|:--:|
|Error Lines for Simple Linear Regression|

In our linear equation, let's add that error with the Greek letter **ε**.

<h1>
\[
SALARY = a(XP) + b + ε
\]
</h1>

**ε** is the possible error our data can have. What `simple linear regression` aims to do is draw an imaginary line that minimizes this error across the data points. This error term is often ignored in practice, but the important thing is that our linear equation accounts for it, and we can write the equation in a way that matches our example.
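
The "error" for each employee is the vertical distance between the real salary and the line's prediction. A sketch with a few points from the table and two candidate lines (the round numbers a = 9450 and b = 25800 are illustrative, not fitted values): a line that fits well should leave a smaller sum of squared errors than a line that fits poorly.

```python
# A few (years of experience, salary) points from the table above.
data = [(1.1, 39343), (2.9, 56642), (9.0, 105582), (10.5, 121872)]

def sum_squared_errors(a, b):
    """Sum of squared residuals ε = salary - (a*xp + b) for the line y = a*x + b."""
    return sum((salary - (a * xp + b)) ** 2 for xp, salary in data)

good_line = sum_squared_errors(9450, 25800)  # roughly the right slope
bad_line = sum_squared_errors(5000, 25800)   # a deliberately worse slope

print(good_line < bad_line)  # → True: the better line leaves smaller errors
```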

# Exercise 1: Playing with Scikit-learn

Scikit-learn is a machine learning library that will help us analyze data and use its built-in simple linear regression model to predict values. In the Replit window below, you can run the program `02-e1.py`, which uses a data set of employees alongside their years of experience. The program will plot a sample of 30 employees out of all the employees within the company:

<iframe height="500px" width="100%" src="https://replit.com/@nuevofoundation/LinearRegression-ConsoleApp#src/02-e1.py" scrolling="no" frameborder="no" allowtransparency="true" allowfullscreen="true" sandbox="allow-forms allow-pointer-lock allow-popups allow-same-origin allow-scripts allow-modals"></iframe>
# Exercise 2: Finding the Slope and Intercept

Before we go any further, let's analyze our equation once again. We know that our equation has been updated like so:

<h1>
\[
SALARY = a(XP) + b + ε
\]
</h1>

We have been able to determine the values of **x** and **y**, but what about **a** and **b**? Let us recall what each of the missing values means:

- **a** is the slope or coefficient of the line. The **slope** represents the estimated change in the dependent variable (the **SALARY**) for each one-unit increase in the independent variable.
- **b** is the intercept, or the value of **y** when **x = 0**. In our plot, the intercept is the value of the **SALARY** when the **YEARS OF EXPERIENCE** is 0.

Hold on a second. If I join the company with no experience, my salary will be 0? That doesn't sound right. Let's go ahead and figure out what the actual value is.

Using scikit-learn, we can use the linear regression model to find the values of **a** and **b**. In the Replit window below, let's analyze the code.

First, we need to import the data from the CSV file:

```python
import pandas as pd

# Importing dataset
dataset = pd.read_csv("Experience_vs_Salary.csv")
x = dataset.iloc[:, :-1].values  # Get all the values from "Experience"
y = dataset.iloc[:, 1].values    # Get all the values from "Salary"
```

Then we make an instance of the `LinearRegression` model class and **fit** the model to the data. The `fit` function will analyze the values from our CSV file and find the **slope** and **intercept** values.

```python
from sklearn import linear_model

model = linear_model.LinearRegression()
model.fit(x, y)
```

<iframe height="500px" width="100%" src="https://replit.com/@nuevofoundation/LinearRegression-ConsoleApp#src/02-e2.py" scrolling="no" frameborder="no" allowtransparency="true" allowfullscreen="true" sandbox="allow-forms allow-pointer-lock allow-popups allow-same-origin allow-scripts allow-modals"></iframe>

As you can see, the code has returned the values for the **coefficient** and **intercept** of our linear equation. Let's update our linear equation with them.

<h3>
\[
Intercept = 25792.20
\]
\[
Coefficient = 9449.96
\]
\[
SALARY = 9449.96(XP) + 25792.20 + ε
\]
</h3>

We know that the model gave us an intercept of **25792.20**. What this means is that an employee with NO experience would have a salary of $25,792.20. But what does the 9449.96 mean? It means that, for every additional year of experience, the salary of an employee increases by $9,449.96. But wait a moment, how can we make sure these are the correct values? Are we confident that these are indeed the correct values? If we grab another 30 random employees and check their salaries, are we going to get the same values?
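
Under the hood, scikit-learn's `fit` is doing ordinary least squares. A sketch of the same computation by hand, using only the visible rows of the table (so the slope and intercept only approximate the 30-employee result above), followed by a prediction with the fitted equation:

```python
# Visible rows of the table above (the "..." rows are omitted, so this
# hand-computed fit only approximates scikit-learn's 30-employee result).
xp = [1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 9.0, 9.5, 9.6, 10.3, 10.5]
salary = [39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445,
          105582, 116969, 112635, 122391, 121872]

# Ordinary least squares: a = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², b = ȳ - a·x̄
n = len(xp)
mean_x = sum(xp) / n
mean_y = sum(salary) / n
a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xp, salary))
     / sum((x - mean_x) ** 2 for x in xp))
b = mean_y - a * mean_x
print(round(a, 2), round(b, 2))  # slope and intercept for this subset

# Using the full model's fitted values, the predicted salary
# for an employee with 5 years of experience:
print(round(9449.96 * 5 + 25792.20, 2))  # → 73042.0
```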