Simple Linear Regression - Revised (#473)
* fixing quick sort section

* Update content/english/debugging/04-identify-the-problem3.md

Co-authored-by: Oliver Zhang <[email protected]>

* fixing link

* initial commit

* more updates

* Finishing regression explanaition

* working on simple linear regression

* working on confidence intervals

* 1st draft done

* 1st draft done

* updating  last section

* updating index

* - adding image to explain fitting
- updating working and story telling
- addressing comments of previous commits

* addressing feedback comments 1

* addressing feedback

* addressing feedback

* addressing more feedback

* Update exercise to calculate R-Squared in linear regression

* addressing feedback

---------

Co-authored-by: groberto <[email protected]>
Co-authored-by: Oliver Zhang <[email protected]>
Co-authored-by: Roberto Guzman <[email protected]>
4 people authored Sep 20, 2024
1 parent c1537de commit 8729210
Showing 30 changed files with 56,684 additions and 0 deletions.
100 changes: 100 additions & 0 deletions content/english/ml-machine-learning/01-regression.md
---
title: "What is regression?"
description: "Make computers learn to predict outcomes."
prereq: "Python"
icon: ""
draft: false
weight: 1
---

# What is regression?
Regression is a technique for modeling the relationship between features (independent variables) and a prediction (the dependent variable). It helps us understand how the value of the dependent variable changes as the value of the independent variable changes. When applied properly, it even lets us predict values 😯!

<!-- TODO: Add diagram to replace image below. -->
|![Regression](../resources/regression.png)|
|:--:|
|Regression|

Regression is a powerful technique that lets us find a line or a curve that fits the data we have. By doing regression, we can create or reuse `mathematical models` that show how the dependent and independent variables are related. [Mathematical models](https://www.youtube.com/watch?v=xHtsuOB-TPw) are useful because they help us understand a system and make predictions based on the system's variables. Sometimes, the mathematical model we choose may not match the data well, so we have to look for a better model that can capture the patterns we see.

## Example 1: How do we know if it is going to rain?
Whenever you use your phone, watch a newscast, or just ask the internet "What is the chance of rain today?", meteorologists are using mathematical models to predict the possibility of rain. You may wonder: what `variables` are taken into consideration when running those models? Is the smell of rain enough?

{{% expand "**What variables should you consider in order to know if it is going to rain?**"%}}

- Temperature &rarr; Is it hot, is it cold, or is it in the sweet spot?
- Altitude &rarr; Depending on how high you are relative to the ground, places can become cooler or warmer.
- Location &rarr; Where you are matters. (e.g.: forest, beach, mountain range, desert, etc.)
- Humidity &rarr; Do you feel that the air is heavy? Is there enough water in the air for it to rain?
- Time of the Year &rarr; What month is it? What season are we currently in?
- And many, many more!

There are many more variables to consider whenever we think there is a chance of rain. At the end of the day, the "**chance of rain**" is the dependent variable, while the other properties mentioned above are the independent variables. The chance of rain depends on the values of those other variables.
{{% /expand %}}

### Exercise 1
Suppose that someone wants to know the type of shapes they have based on the area of the shape.

1. Is that even possible? Can someone know what shape they have based on the area?
{{% expand "**Click to show answer**" %}}
***No***. As you can quickly guess, there is no relation between the **area** of a figure and its type of shape.

For example, a square of side length 3 has an area of 9.

<h3>
\[
3 \cdot 3 = 9
\]
</h3>

But so does a triangle with a width of 6 and a height of 3.
<h3>
\[
(6 \cdot 3)/2 = 9
\]
</h3>

{{% /expand %}}
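The two area computations above can be checked with a quick sketch (a small illustration, not part of the exercise itself):

```python
# Two different shapes can share the same area.
square_area = 3 * 3          # square with side length 3
triangle_area = (6 * 3) / 2  # triangle with width 6 and height 3

print(square_area, triangle_area)  # both areas come out to 9
```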
<br>

2. If the area is not a good variable for determining the type of shape, what relationship can we use instead to know what type of shape we have?
{{% expand "**Click to show answer**" %}}

***The number of **sides** it has!*** Once you know the number of sides, you can know the type of shape.

Notice that we have built a simple relationship that maps X to Y:

X &rarr; **the number of sides**
Y &rarr; **the type of shape**

There you go, you created your first machine learning model!
{{% /expand %}}
<br>

A model, in the case of our previous example, is just a function built to establish a relationship between our dependent variable and independent variables. For example, functions such as lines (y = ax + b), parabolas (y = a(x - h)^2), cubic curves (y = ax^3 + bx^2 + cx + d), and many more can be used as models.
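As a rough sketch, these model families can be written as plain Python functions (the names and parameters here are illustrative):

```python
# Three candidate model families, written as plain Python functions.
def line(x, a, b):
    return a * x + b

def parabola(x, a, h):
    return a * (x - h) ** 2

def cubic(x, a, b, c, d):
    return a * x**3 + b * x**2 + c * x + d

# Each one maps an independent variable x to a predicted dependent value y.
print(line(2, a=3, b=1))  # 3*2 + 1 = 7
```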

### How do you make sure your model fits the data?

When we perform analysis using an existing machine learning model, we need to ensure that we select an appropriate model, one that reasonably represents our data. In the image below you can see that the data points follow a parabola. It is very likely that the model we need is a parabolic one, something like this:

<h2>
\[
y = a(x -h)^2
\]
</h2>

When you use a parabolic model, you need to know where the vertex of the parabola will land in order to predict where a new point will be. You could run a model that is too simple and end up like the "Under-fitting" image, or one with way too many bends and end up like the "Over-fitting" image. We want it just right, so that our model "fits" the data and new data points can be represented by it.

<!-- TODO: Add diagram to replace image below. -->
|![Fitting Data Example](../resources/fitting-data-example.png)|
|:--:|
|Image 1: Under-Fitting and Over-Fitting|

You always need to make sure that the model you choose to represent your data fits what you are working on. Otherwise, you might have one of two issues:
- **Under-Fitting**

This happens when your model is unable to find a relationship in the data you have been given. It often occurs when there is not enough data to use.

- **Over-Fitting**

This happens when you try to accommodate every possible value in your data, even the ones that don't represent anything. By doing this you might be fitting values that are **outliers** and do not represent the reality of things. For example, a shape with 1 or 2 sides doesn't make sense.
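One way to see under-fitting in action is to fit models of different complexity to curved data. A minimal sketch using NumPy (an illustration, not the lesson's own code): a straight line forced through parabolic data leaves a large error, while a parabola fits almost exactly.

```python
import numpy as np

# Points sampled from a parabola y = (x - 2)^2.
x = np.array([0, 1, 2, 3, 4], dtype=float)
y = (x - 2) ** 2

# A degree-1 (straight line) model under-fits this curved data...
underfit = np.polyfit(x, y, deg=1)
line_error = np.sum((np.polyval(underfit, x) - y) ** 2)

# ...while a degree-2 (parabolic) model captures it almost exactly.
goodfit = np.polyfit(x, y, deg=2)
parabola_error = np.sum((np.polyval(goodfit, x) - y) ** 2)

print(line_error, parabola_error)  # the parabola's error is near zero
```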
206 changes: 206 additions & 0 deletions content/english/ml-machine-learning/02-simple-linear-regression.md
---
title: "Simple Linear Regression"
description: "Make computers learn to predict outcomes."
prereq: "Python"
icon: ""
draft: false
weight: 2
---

# What is simple linear regression?

Simple linear regression aims to find a correlation between two variables and derive a mathematical equation that explains the relationship between a dependent and an independent variable. With simple linear regression, we generally want to answer two questions:

1. Is there a **relationship** between the variables we have?

You can determine the relationship between income and spending, experience and salary, or humidity and temperature. But, as an example, there is NO relationship between the height of a student and their exam scores.

2. Can we **forecast / predict** values with this?

With regression, we can train a model and find out whether we can predict values with confidence. Can we use what we know about the relationship to predict new values?

Example: What will be the temperature tomorrow? How much will my bakery sell this year compared to last year? How much will my salary be if I have 5 years of experience?

# Variable roles

In simple linear regression, variables take one of two roles.

1. **Dependent Variable**

The variable whose value we want to predict or forecast. We call it **dependent** because its value depends on something else. We will call this variable **y**.

2. **Independent Variable**

This is the variable which we can control or change in order to affect the dependent variable. We will call this variable **x**.

Example: If an apple costs $1.00 and you buy 10 of them, the total cost will be $10.00. The dependent variable here is the `total cost`, while the independent variable is the number of apples you buy.
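A tiny sketch of this example (illustrative names, assuming the $1.00 price above):

```python
# Total cost (dependent) as a function of apples bought (independent).
APPLE_PRICE = 1.00

def total_cost(apples):
    return APPLE_PRICE * apples

print(total_cost(10))  # 10.0 — buying 10 apples costs $10.00
```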

# The Simple Linear Equation Mathematical Model

When we use simple linear regression we call it **linear** because, well... the mathematical model represents a straight line in a 2D plane. Let's think about it for a second.

What is the math equation for a straight line?

{{% expand "**Click to show answer**" %}}

This equation may seem very familiar to you. That's because it is the general equation for a straight line.

<h1>
\[
y = ax + b
\]
</h1>

- **x** is the independent variable.
- **y** is the dependent variable.
- **a** is the slope of the line.
- **b** is the intercept or the value of **y** when **x = 0**.

Following this equation we will elaborate on how simple linear regression math models calculate and predict new values.

{{% /expand %}}

# Real-World Examples

In the real world, data is sometimes not linear and behaves differently from what we expect. At first glance, it may seem that the data has no relation at all. In the case of simple linear regression, what you need to look for is data that somewhat follows a linear pattern.

Suppose that you work as a `Data Analyst` in the human resources department of a company that has over 10,000 employees. Your boss wants to know if the years of experience of an employee have anything to do with the amount of money they earn. Of course, since you are a `Data Analyst`, you can check the employee database and quickly verify the following:

1. What is their current salary?
2. How many years of experience does the person have?

Assume that you are able to get data from 30 random employees which looks something like this:

|Employee ID|Years of Experience|Salary|
|:--:|:--:|:--:|
|1|1.1|39343|
|2|1.3|46205|
|3|1.5|37731|
|4|2.0|43525|
|5|2.2|39891|
|6|2.9|56642|
|7|3.0|60150|
|8|3.2|54445|
|...|...|...|
|26|9.0|105582|
|27|9.5|116969|
|28|9.6|112635|
|29|10.3|122391|
|30|10.5|121872|

After you check the table, you plot all these values in a 2D scatter plot and get an image like so.

|![Years of Experience vs Salary](../resources/Years_vs_Salary.png)|
|:--:|
|Scatter Plot: Years of Experience vs Salary.|

As you can see, the dots _somewhat_ resemble a line. Let's go ahead and draw an imaginary line and see if we can pass through all the dots.

|![Years of Experience vs Salary with Trendline](../resources/Years_vs_Salary_with_trendline.png)|
|:--:|
|Scatter Plot: Years of Experience vs Salary with line.|
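A minimal sketch of how such a scatter plot with a trend line could be drawn with matplotlib and NumPy, using a few rows from the table above (an illustration, not the lesson's actual plotting code):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

# A few (years of experience, salary) points from the table above.
years = np.array([1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2])
salary = np.array([39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445])

# Least-squares straight line through the points.
slope, intercept = np.polyfit(years, salary, deg=1)

plt.scatter(years, salary, label="employees")
plt.plot(years, slope * years + intercept, label="trend line")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.legend()
plt.savefig("years_vs_salary.png")
```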

As you can see, the line doesn't pass through **ALL** the dots, but it gets somewhat close. What does this mean? Why are some dots close to our imaginary line while others are far away?

So far we know that:

1. The data follows a somewhat **linear** pattern.
2. The data has 2 important variables: **SALARY** and **YEARS OF EXPERIENCE**. This means that we can start to **model** our data with a linear equation.

**Question:** We know that **SALARY** and **YEARS OF EXPERIENCE** are our variables but which one is the dependent and which one is the independent variable?
{{% expand "**Click to show answer**" %}}

- **YEARS OF EXPERIENCE (XP)** is our independent variable.
- **SALARY** is our dependent variable.

If we plug these into our linear equation you will get something like this.

<h1>
\[
SALARY = a(XP) + b
\]
</h1>

With an equation like this we are saying: "The years of experience have a direct effect on the salary of an employee".

{{% /expand %}}

# The Possibility of Errors

As we mentioned before, data is not always consistent and can behave in different ways. What this means is that our linear equation needs to account for a possible error. But how do we represent that error in the equation? How can that error be visualized in the scatter plot?

Let's assume that all the employees chosen in the table above are from San Antonio, TX, but the HR department accidentally added employees from Seattle, WA into the data set. [The cost of living in Seattle is 28.6% higher than in San Antonio](https://www.numbeo.com/cost-of-living/compare_cities.jsp?country1=United+States&country2=United+States&city1=Seattle%2C+WA&city2=San+Antonio%2C+TX&tracking=getDispatchComparison). This would explain why some data points in the plot are farther away from the imaginary line we have traced. These deviations are the errors in our data.

<!-- TODO: Add diagram to replace the figure below.-->
|![Error Lines for Simple Linear Regression](../resources/error-lines.svg)|
|:--:|
|Error Lines for Simple Linear Regression|

In our linear equation, let's represent that error with the Greek letter **ε**.

<h1>
\[
SALARY = a(XP) + b + ε
\]
</h1>

**ε** is the possible error our data can have. What `simple linear regression` aims to do is draw the line that minimizes this error across the data points. The error term is often ignored when making predictions, but the important thing is that our linear equation accounts for it, and we can now write the equation in a way that is familiar to our example.
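A small sketch of what "minimizing the error" means: for a candidate line y = ax + b, we can measure the sum of squared differences between the line and the data points (an illustrative helper, not the lesson's code):

```python
# Sum of squared errors of the data relative to a candidate line
# y = a*x + b. Simple linear regression picks a and b so that this
# quantity is as small as possible.
def sum_squared_error(xs, ys, a, b):
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]  # these points lie exactly on y = 2x + 1

print(sum_squared_error(xs, ys, a=2.0, b=1.0))  # 0.0 — a perfect fit
print(sum_squared_error(xs, ys, a=1.0, b=1.0))  # 14.0 — a worse line
```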

# Exercise 1: Playing with Scikit-learn

Scikit-learn is a machine-learning library that will help us analyze data and use its built-in linear regression model to predict values. In the Replit window below, you can run the program `02-e1.py`, which uses a data set of employees alongside their years of experience. The program will plot a sample of 30 employees from the company:

<iframe height="500px" width="100%" src="https://replit.com/@nuevofoundation/LinearRegression-ConsoleApp#src/02-e1.py" scrolling="no" frameborder="no" allowtransparency="true" allowfullscreen="true" sandbox="allow-forms allow-pointer-lock allow-popups allow-same-origin allow-scripts allow-modals"></iframe>

# Exercise 2: Finding the Slope and Intercept

Before we go any further, let's analyze our equation once again. We know that our equation has been updated like so:

<h1>
\[
SALARY = a(XP) + b + ε
\]
</h1>

We have been able to determine the values of **x** and **y**, but what about **a** and **b**? Let's recall what each of the missing values means:

- **a** is the slope or coefficient of the line. The **slope** represents the estimated change in the dependent variable (the **SALARY**) for each one-unit increase in the independent variable.
- **b** is the intercept, or the value of **y** when **x = 0**. In our data set, this is the value of the **SALARY** when the **YEARS OF EXPERIENCE** is 0.

Hold on a second. If I join the company with no experience, my salary will be $0? That doesn't sound right. Let's go ahead and figure out what the actual value is.

Using scikit-learn, we can use the linear regression model to find the values of **a** and **b**. In the Replit window below, let's analyze the code.

First, we need to import the data from the CSV file:

```python
# Importing the dataset
import pandas as pd

dataset = pd.read_csv("Experience_vs_Salary.csv")
x = dataset.iloc[:, :-1].values  # Get all the values from "Experience"
y = dataset.iloc[:, 1].values    # Get all the values from "Salary"
```

Then we make an instance of the `LinearRegression` model class and **fit** it to the data. The `fit` function analyzes the values from our CSV file and finds the **slope** and **intercept** values.

```python
from sklearn import linear_model

model = linear_model.LinearRegression()
model.fit(x, y)
```
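After fitting, the learned slope and intercept are available on the model's `coef_` and `intercept_` attributes. A self-contained sketch with made-up data (not the lesson's CSV):

```python
import numpy as np
from sklearn import linear_model

# Tiny made-up dataset lying exactly on SALARY = 9000 * XP + 26000.
x = np.array([[1.0], [2.0], [3.0], [4.0]])  # years of experience
y = 9000 * x.ravel() + 26000                # salary

model = linear_model.LinearRegression()
model.fit(x, y)

print(model.coef_[0])    # slope a     -> 9000.0
print(model.intercept_)  # intercept b -> 26000.0
```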

<iframe height="500px" width="100%" src="https://replit.com/@nuevofoundation/LinearRegression-ConsoleApp#src/02-e2.py" scrolling="no" frameborder="no" allowtransparency="true" allowfullscreen="true" sandbox="allow-forms allow-pointer-lock allow-popups allow-same-origin allow-scripts allow-modals"></iframe>

As you can see, the code has returned the values for the **coefficient** and **intercept** of our linear equation. Let's update our equation with them.

<h3>
\[
Intercept = 25792.20
\]
\[
Coefficient= 9449.96
\]
\[
SALARY = 9449.96(XP) + 25792.20 + ε
\]
</h3>

We know that the model gave us an intercept of **25792.20**. This means that an employee with NO experience would have a salary of $25,792.20. But what does the 9449.96 mean? It means that for every additional year of experience, an employee's salary increases by $9,449.96. But wait a moment, how can we be sure these are the correct values? Do we have confidence that they are? If we grab another 30 random employees and check their salaries, will we get the same values?
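The fitted equation can now answer questions like the "5 years of experience" example from earlier. A sketch that simply plugs into the equation above, ignoring the error term ε:

```python
# Plugging the fitted values into SALARY = a * XP + b
# (ignoring the error term) to predict a salary.
A = 9449.96   # coefficient (slope) found by the model
B = 25792.20  # intercept found by the model

def predict_salary(years_of_experience):
    return A * years_of_experience + B

print(predict_salary(0))  # 25792.2 — salary with no experience
print(predict_salary(5))  # roughly 73042 — salary with 5 years of experience
```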