When we say we want the "best" line, what do we actually mean? Think about it this way: for each student in our data, our line makes a prediction, and we can measure how far off that prediction is from the real score.
For example, if:
- A student studied 3 hours and got 82%
- Our line predicts 81% for 3 hours of study
- The error (or difference) is 1%
### The Squared Error
We care about every error, whether we predicted too high or too low, and we don't want overestimates and underestimates to cancel each other out. That's why we square each difference. In math terms, for each student $i$:
$$\text{error}_i = \bigl(y_i - f(x_i)\bigr)^2$$

$$\text{error}_i = \bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2$$
Where:
- $y_i$ is the actual score
- $f(x_i) = \beta_0 + \beta_1 x_i$ is our predicted score
- Squaring (the exponent 2) makes every error positive
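To see this on actual numbers, here is a minimal Python sketch that reuses the 3-hour student from the example above (the values and variable names are just for illustration):

```python
# One student from the example: 3 hours of study, actual score 82%,
# and a line that predicts 81% for 3 hours of study.
actual_score = 82.0
predicted_score = 81.0

squared_error = (actual_score - predicted_score) ** 2  # (82 - 81)^2
print(squared_error)  # 1.0
```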
### The Total Error
To find the best line, we want to minimize the average of all these squared errors. We write this as:
$$R = \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2$$
Don't let this formula scare you! It just means:
- Take each prediction error
- Square it
- Add up all the squared errors
- Take the average (the extra factor of $\tfrac{1}{2}$ is just a convention that makes the calculus cleaner later; it doesn't change which line is best)
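Putting those steps into code, here is a minimal Python sketch of $R$; the study-hours and score data, and the candidate values of $\beta_0$ and $\beta_1$, are made up for illustration:

```python
def total_error(beta0, beta1, hours, scores):
    """Compute R for the line y = beta0 + beta1 * x on the given data."""
    n = len(hours)
    total = 0.0
    for x, y in zip(hours, scores):
        predicted = beta0 + beta1 * x   # our line's prediction f(x)
        total += (y - predicted) ** 2   # squared error for this student
    return total / (2 * n)              # the 1/(2n) factor from the formula

# Made-up data: hours studied and the scores those students got.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [60.0, 68.0, 82.0, 85.0, 95.0]

# Try two candidate lines: the one with the smaller R fits better.
print(total_error(50.0, 9.0, hours, scores))
print(total_error(20.0, 2.0, hours, scores))
```

Different choices of `beta0` and `beta1` give different values of R, and the "best" line is simply the choice that makes R as small as possible.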
To find the best values for $\beta_0$ and $\beta_1$, we need to find where this error $R$ is smallest.
We usually use gradient descent to do that: start from a guess and repeatedly nudge $\beta_0$ and $\beta_1$ in the direction that decreases $R$.
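Here is a minimal Python sketch of that idea, under the assumption that we use plain (batch) gradient descent on $R$; the data, learning rate, and step count are made up for illustration:

```python
def gradient_descent(hours, scores, learning_rate=0.01, steps=10_000):
    """Find beta0, beta1 by repeatedly stepping downhill on R."""
    n = len(hours)
    beta0, beta1 = 0.0, 0.0
    for _ in range(steps):
        # Gradients of R = (1/2n) * sum((y - (beta0 + beta1*x))^2)
        grad0 = -sum(y - (beta0 + beta1 * x) for x, y in zip(hours, scores)) / n
        grad1 = -sum((y - (beta0 + beta1 * x)) * x for x, y in zip(hours, scores)) / n
        # Nudge each parameter a small step against its gradient.
        beta0 -= learning_rate * grad0
        beta1 -= learning_rate * grad1
    return beta0, beta1

# Same made-up data as before.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [60.0, 68.0, 82.0, 85.0, 95.0]

beta0, beta1 = gradient_descent(hours, scores)
print(beta0, beta1)  # approaches the intercept and slope of the best-fit line
```

Each iteration moves `beta0` and `beta1` a little in the direction that lowers $R$, so after enough steps the line settles close to the one with the smallest total error.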