Exploring Linear Regression
Instructions:
Play with this applet to see how we can choose a line to fit some data.
- Green points control a line of your choosing.
- Small points represent data: blue if above your line; red if below. You can move these, too.
- Red and blue segments connect each data point to the corresponding point on your line (the one with the same x-coordinate).
- Deviations (positive for blue points, negative for red) are shown by default, but you can uncheck the Deviations check box to hide them.
- You can check the Regression Line check box to show the least squares best fit line.
- Total deviation just adds up all the deviations. Can you position the green line so that it doesn't represent the data well, yet its "total deviation" is very close to zero? (Positive and negative deviations cancel each other out, which is why we don't use total deviation; see the code sketch after this list.)
- Total absolute deviation adds up the absolute values of the deviations, so opposite deviations don't cancel each other out. This is one way to solve the problem that total deviation has.
- Total squared deviation squares all deviations and adds up those numbers. This is another way to avoid the problem that total deviation has. If you minimize this score, you've found the "least squares" regression line.
- Of the three scoring methods, the first one is tempting but useless.
- The second and third have merits, but (as far as I know) the third one, least squares, is far more popular.
- I guess this is because the sum of squared deviations is a quadratic polynomial in the line's slope and intercept. That can be minimized by algebra (completing the square) or calculus (the partial derivatives are linear, so setting them to zero gives two easy equations; this route is sketched at the end).
- By comparison, the sum of several absolute value expressions is harder to analyze at first glance. Perhaps it's not too bad, though. Is there some advantage to a piecewise linear badness function? Are there other advantages to the best fit line determined by least absolute deviation?
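
Here's a minimal sketch of the three scores in code. This is my own illustration, not the applet's internals; the five-point dataset and the candidate line are made up. Notice how the flat line y = 5 fits the rising data badly yet earns a total deviation of essentially zero, while the other two scores rightly punish it. The `least_squares` helper uses the standard closed-form formulas, which are derived after the code.

```python
# A minimal sketch (not the applet's actual code) of the three scoring
# methods, using a made-up five-point dataset for illustration.

def deviations(points, m, b):
    """Signed deviations: each data y minus the line's y at the same x."""
    return [y - (m * x + b) for x, y in points]

def least_squares(points):
    """Closed-form slope and intercept minimizing total squared deviation."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return m, (sy - m * sx) / n

data = [(0, 1.2), (1, 2.6), (2, 5.3), (3, 6.8), (4, 9.1)]

# A deliberately bad candidate line: flat, but balanced so that the
# points above it and the points below it cancel each other out.
devs = deviations(data, m=0.0, b=5.0)
print("total deviation:         ", round(sum(devs), 2))             # ~0 anyway!
print("total absolute deviation:", round(sum(abs(d) for d in devs), 2))
print("total squared deviation: ", round(sum(d * d for d in devs), 2))

m, b = least_squares(data)
print(f"least squares line: y = {m:.2f}x + {b:.2f}")  # y = 2.00x + 1.00 here
```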
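
And here is where those closed-form formulas come from, sketched via the calculus route: write the total squared deviation of the line y = mx + b from the data points (x_i, y_i) as a function of m and b.

```latex
% Total squared deviation of the line y = m x + b from data (x_i, y_i):
S(m, b) = \sum_{i=1}^{n} \bigl( y_i - (m x_i + b) \bigr)^2

% S is a quadratic polynomial in m and b, so set both partials to zero:
\frac{\partial S}{\partial m} = -2 \sum_{i=1}^{n} x_i \bigl( y_i - m x_i - b \bigr) = 0
\qquad
\frac{\partial S}{\partial b} = -2 \sum_{i=1}^{n} \bigl( y_i - m x_i - b \bigr) = 0

% Solving these two linear equations gives the least squares line
% (with \bar{x}, \bar{y} the means of the x- and y-values):
m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \bigl( \sum x_i \bigr)^2}
\qquad
b = \bar{y} - m \, \bar{x}
```

Because S is a sum of squares it is convex, so this critical point really is the minimum: the least squares regression line that the Regression Line check box reveals.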