This example allows students to explore
three methods for measuring how well a linear model fits a set of data
points. The Data Analysis and Probability Standard calls for students to
explore how residuals (the differences between predicted and observed
values) may be used to measure the "goodness of fit" of a linear model.
Two of the methods use residuals and the third uses the
shortest distance between a data point and the line given by the model.
Understanding the Least-Squares Regression Line with a Visual Model: Measuring Error in a Linear Model (applet)
To introduce the idea of a measure of fit, in the first three tasks a
line is given and the students explore the effects that six data points
have on three measures of error. However, it rarely happens that the
model is known and the data are not. Generally, we know the data and
need to find a linear model.
The additional tasks provide an opportunity to suggest and evaluate a
variety of linear models and methods for a particular set of data.
- In this task a linear equation is used to model a set of data. By modifying the data points, explore how each of three methods—distance squared, absolute value, and shortest distance—measures how well the model approximates the data. How do individual data points contribute to the error? How do these contributions differ among the three methods of measuring the "goodness of fit"?
- How do the three methods compare when one of the points is far from the line and the rest of the points are quite close?
- For at least four different sets of data points, record the error measured by the absolute-value and shortest-distance methods. Be sure to use data sets that are quite different from one another in the number of points that are close to and far from the line. What relationships do you notice among the errors? (Hint: For each data set, try doing some arithmetic with the errors measured by the two methods.)
How to Use the Interactive Figure
This interactive figure
allows the user to manipulate both the points and the line to observe the effects
on the sum of residuals shown at the bottom of the figure.
Modify Line: Checking this
box allows the user to modify the placement of the line in two ways: (a)
by dragging the line, the user rotates the line around the y-intercept,
thus changing the slope of the line without changing the y-intercept;
and (b) by dragging the point at the y-intercept, the user translates
the line without changing its slope. To disable this capability, click on the
checked box to clear the check.
Modify Points: Checking this
box allows the user to modify the placement of the points. Once the box is checked,
the user can change the position of any red point by clicking on that point
and dragging it to a different position.
Squares: This button allows
the user to choose the square of the vertical distance between each point and
the line, or "distance squared," as the mode of calculation for the equation
at the bottom of the figure.
Absolute Value: This button
allows the user to choose the "absolute value" of the vertical distance between
each point and the line as the mode of calculation for the equation at the bottom
of the figure.
Shortest Distance: This button
allows the user to choose the perpendicular distance, or "shortest distance,"
between each point and the line as the mode of calculation for the equation
at the bottom of the figure.
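The three modes of calculation described above can be sketched in a few lines of Python. This is only an illustration, not the applet's actual code: the function name `error_measures` and the six sample points are hypothetical, and the perpendicular distance is computed with the standard formula that divides the vertical distance by the square root of 1 + slope².

```python
import math

def error_measures(points, slope, intercept):
    """Total error of the line y = slope * x + intercept under each method."""
    squares = 0.0
    absolute = 0.0
    shortest = 0.0
    for x, y in points:
        # Vertical distance between the data point and the line
        vertical = abs(y - (slope * x + intercept))
        squares += vertical ** 2          # "distance squared"
        absolute += vertical              # "absolute value"
        # Perpendicular ("shortest") distance from the point to the line
        shortest += vertical / math.sqrt(1 + slope ** 2)
    return {"squares": squares, "absolute": absolute, "shortest": shortest}

# Hypothetical data set: six points measured against the line y = x + 1
points = [(1, 2), (2, 4), (3, 3), (4, 6), (5, 5), (6, 8)]
errors = error_measures(points, slope=1.0, intercept=1.0)
print(errors)
```

Changing one point at a time and rerunning the sketch mirrors dragging a red point in the figure and watching the total at the bottom change.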
- For the given data set and for each of the measures of error, find a line (a linear model) for which the error is as small as possible. Try various slopes and various y-intercepts before you settle on your line of "best fit." For each method, record the equation of the line.
- Change the data set so that all the points except one lie in a line. Again find a line of best fit for each of the methods.
- Change the data set so that the points follow a curve, and find the line of best fit for each of the methods. Is there only one line of best fit for each case?
- Change the data set so that the points appear to follow no particular pattern, and find the line of best fit. Is there only one line of best fit for each case?
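The trial-and-error search in these tasks can be imitated with a coarse grid search over slopes and y-intercepts. The sketch below is a rough stand-in for dragging the line by hand, not a method taken from the figure itself. The helper names `total_error` and `grid_search`, the grid spacing of 0.1, and the data set (five points on y = 2x + 1 plus one point far from that line) are all assumptions made for illustration.

```python
import math

def total_error(points, slope, intercept, method):
    """Total error of y = slope * x + intercept under the chosen method."""
    total = 0.0
    for x, y in points:
        vertical = abs(y - (slope * x + intercept))
        if method == "squares":
            total += vertical ** 2
        elif method == "absolute":
            total += vertical
        elif method == "shortest":
            total += vertical / math.sqrt(1 + slope ** 2)
        else:
            raise ValueError(f"unknown method: {method}")
    return total

def grid_search(points, method, slopes, intercepts):
    """Return (error, slope, intercept) for the best line found on the grid."""
    best = None
    for m in slopes:
        for b in intercepts:
            err = total_error(points, m, b, method)
            if best is None or err < best[0]:
                best = (err, m, b)
    return best

# Hypothetical data: five points on y = 2x + 1 and one point far from it
points = [(1, 3), (2, 5), (3, 7), (4, 9), (5, 11), (6, 20)]
slopes = [i / 10 for i in range(-50, 51)]        # -5.0 to 5.0 in steps of 0.1
intercepts = [i / 10 for i in range(-100, 101)]  # -10.0 to 10.0 in steps of 0.1

for method in ("squares", "absolute", "shortest"):
    err, m, b = grid_search(points, method, slopes, intercepts)
    print(f"{method}: y = {m}x + {b}, error = {err:.3f}")
```

On this data set the absolute-value search settles on the line through the five collinear points, while the squares search is pulled toward the distant point. A grid search only finds the best line among the candidates it tries, so a finer grid or a wider range may be needed for other data.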
Students should have experience graphing data generated by linear situations and writing equations for the lines that pass through such data points. Finding equations is relatively straightforward when the data all lie on a line. When the data are only approximately linear, however, no line will fit the data exactly and students must decide from among many possible linear models. This situation often arises when data come from real contexts and a model is desired from which predictions can be made. Before students engage with these interactive examples, they should be given a set of data that is somewhat, but not exactly, linear and asked to plot a line that they think fits the data well. They should be asked to defend their choice of linear model. Some might argue that their line is a good fit because it "passes through" many of the points. Others might argue that fitting well means that it is "closest to the most points" or that it is "in the middle of the points." Students could be asked to define statements such as "closest to the most points" numerically and to quantify their reasoning in other ways so that the effectiveness of two proposed models can be compared.
Given a set of bivariate data, graphing calculators or spreadsheets may be used to find the least-squares regression line for the data set. This investigation may be used to help students develop an understanding that there is more than one way to define the "line of best fit" and to help them develop meaning for the approach they are most likely to encounter: the method of "least squares." The least-squares regression line minimizes the sum of the squares of the residuals, a criterion not often suggested by students. The interactive figure above provides a visual model for the sum of the squares and prepares students to approximate a least-squares regression line in subsequent examples.
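For readers who want to see what a calculator or spreadsheet does behind the scenes, the least-squares line can also be computed directly from the standard closed-form formulas for slope and intercept. The sketch below is a minimal illustration; the function name `least_squares_line` and the sample data are hypothetical.

```python
def least_squares_line(points):
    """Slope and intercept of the line minimizing the sum of squared residuals."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sums of squared deviations in x and of cross-products of deviations
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("all x-values are equal; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through (mean_x, mean_y)
    return slope, intercept

# Hypothetical data set of six approximately linear points
points = [(1, 2), (2, 4), (3, 3), (4, 6), (5, 5), (6, 8)]
slope, intercept = least_squares_line(points)
print(f"y = {slope:.3f}x + {intercept:.3f}")
```

Comparing this line with one found by eye gives students a way to check how close their own choice comes to the least-squares criterion.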
Take Time to Reflect
- What do you learn about how well the linear model shown represents the data as points move farther from the line vertically? As they move parallel to the line? As they move horizontally across the line?
- How do outliers, points that are significantly distant from the other data points, affect the selection and evaluation of a linear model? How does this effect vary among the methods for determining the line of best fit?
- For a fixed data set, determine the ratio of the minimum sum of squares to the sum of squares when the line is horizontal and passes through the average of the y-values. What could this ratio tell you about the set of data?
- What other criteria might be used to assess the goodness of fit of a linear model?
- The absolute value and shortest distance methods should give the same line of best fit. Why? How would you characterize the difference between the two methods?
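The ratio described in the third reflection question can be computed as follows. This is a sketch under stated assumptions: the function name `fit_ratio` and the sample data are hypothetical, and the minimum sum of squares is taken from the least-squares line found with the standard closed-form formulas. Algebraically, this ratio equals 1 - r², where r is the correlation coefficient of the data.

```python
def fit_ratio(points):
    """Minimum sum of squares divided by the sum of squares about y = mean(y)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    # Sum of squares for the horizontal line through the average y-value
    syy = sum((y - mean_y) ** 2 for _, y in points)
    if sxx == 0 or syy == 0:
        raise ValueError("x-values or y-values are all equal; ratio undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squares for the least-squares line
    min_squares = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    return min_squares / syy

# Hypothetical data set of six approximately linear points
points = [(1, 2), (2, 4), (3, 3), (4, 6), (5, 5), (6, 8)]
print(f"ratio = {fit_ratio(points):.3f}")
```

A ratio near 0 means the best line leaves little of the variation in the y-values unaccounted for. A ratio near 1 means the best line does hardly better than the horizontal line through the average.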