Chapter 3: Describing Relationships (first spread)



Chapter 12 ATE: More About Regression

Alternate Examples and Activities

[Page 738]

Alternate Activity: Does seat location matter?

Many people believe that students learn better if they sit closer to the front of the classroom. Does sitting closer cause higher achievement, or do better students simply choose to sit in the front? To investigate, an AP Statistics teacher randomly assigned students to seat locations in his classroom for a particular chapter and recorded the test score for each student at the end of the chapter. The explanatory variable in this experiment is which row the students were assigned (row 1 is closest to the front and row 7 is the farthest away). Here are the results, including a scatterplot and least-squares regression line:

Row 1: 76, 77, 94, 99

Row 2: 83, 85, 74, 79

Row 3: 90, 88, 68, 78

Row 4: 94, 72, 101, 70, 79

Row 5: 76, 65, 90, 67, 96

Row 6: 88, 79, 90, 83

Row 7: 79, 76, 77, 63

[pic]

1. Interpret the slope of the least-squares regression line in this context. (For each additional row a student sits from the front, the predicted score will decrease by 1.12 points.)

2. Explain why it was important to randomly assign the students to seats rather than letting each student choose his or her own seat. (If students are allowed to choose their own seats, the most conscientious students might choose to sit in the front, making it appear that sitting in the front causes higher scores. If we randomly assign the seats, the ability level of students should be approximately the same in each row.)

3. Does the negative slope provide convincing evidence that sitting closer causes higher achievement, or is it plausible that the association is due to the chance variation in the random assignment?

Using 30 note cards, label each one with one of the student scores.

a. Shuffle the 30 cards and divide them into 7 piles (4 cards each for rows 1, 2, 3, 6, and 7 and 5 cards each for rows 4 and 5).

b. Calculate the slope of the least-squares regression line using the re-randomized data (x = row number and y = test score).

c. Repeat this process over and over, graphing the approximate randomization distribution of the slope on a dotplot.

d. Was the observed slope of –1.12 unusual, or is it possible to get a slope this small due to the chance variation in random assignment? What conclusion should you make based on this study?

Here are the results of 100 trials of this simulation.

[pic]

As you can see, a slope of –1.12 is not very unusual, meaning that a slope this small could have resulted from the chance variation in random assignment. This study does not provide convincing evidence that sitting closer causes higher achievement.
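The card-shuffling simulation in steps a through d can also be run in software. The following is a minimal sketch in Python, not the tool used to make the dotplot above; the function names, the seed, and the choice of 1000 trials are assumptions made for illustration. It keeps the row sizes fixed, shuffles the 30 scores, and records the slope each time.

```python
import random

# Test scores by assigned row, copied from the table above (row 1 = front).
scores_by_row = {
    1: [76, 77, 94, 99],
    2: [83, 85, 74, 79],
    3: [90, 88, 68, 78],
    4: [94, 72, 101, 70, 79],
    5: [76, 65, 90, 67, 96],
    6: [88, 79, 90, 83],
    7: [79, 76, 77, 63],
}


def ls_slope(x, y):
    """Slope of the least-squares regression line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


# One row number per student, in the same order as the scores.
rows = [row for row, vals in scores_by_row.items() for _ in vals]
scores = [v for vals in scores_by_row.values() for v in vals]

observed_slope = ls_slope(rows, scores)


def randomization_slopes(trials, seed=1):
    """Re-randomize the scores to seats `trials` times; return the slopes."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    shuffled = scores[:]
    slopes = []
    for _ in range(trials):
        rng.shuffle(shuffled)  # same as shuffling the 30 note cards
        slopes.append(ls_slope(rows, shuffled))
    return slopes


slopes = randomization_slopes(trials=1000)
share_as_small = sum(s <= observed_slope for s in slopes) / len(slopes)

print(f"observed slope: {observed_slope:.4f}")
print(f"share of shuffles with slope <= observed: {share_as_small:.3f}")
```

If a sizable share of the shuffled slopes is at or below the observed slope, the observed slope is not unusual under random assignment alone, which is the reasoning used in part d.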

[Page 744]

Alternate Example: Does seat location matter?

We used Fathom to carry out a least-squares regression analysis for the “Does seat location matter?” Alternate Activity from page 738. A scatterplot, residual plot, histogram and Normal probability plot of the residuals are shown below.

[pic]

[pic]

[pic]

Problem: Check whether the conditions for performing inference about the regression model are met.

Solution:

• Linear The scatterplot shows a weak linear relationship. The residual plot does not show any obvious left-over patterns indicating that this condition has been violated.

• Independent Students were randomly assigned to seats and were monitored for cheating, so knowing the score for one student should give no additional information about another student’s score.

• Normal The histogram is roughly unimodal and symmetric and the Normal probability plot is roughly linear.

• Equal variance Although there is a different amount of variability in each row, the differences aren’t very large and there is no systematic pattern, such as increasing variability as x increases.

• Random The students were assigned to seats at random.

Because there are no serious violations of the conditions, we should be safe performing inference about the regression model in this setting.

[Page 745]

Alternate Example: Does seat location matter?

Here is computer output for the least-squares regression analysis on the seating chart data from the previous Alternate Activity.

Regression Analysis: Score versus Row

Predictor Coef SE Coef T P

Constant 85.706 4.239 20.22 0.000

Row -1.1171 0.9472 -1.18 0.248

S = 10.0673 R-Sq = 4.7% R-Sq(adj) = 1.3%

Problem:

(a) State the equation of the least-squares regression line. Define any variables you use.

(b) Interpret the slope, y intercept (if possible), and standard deviation of the residuals.

Solution:

(a) ŷ = 85.706 – 1.1171x, where ŷ = predicted score and x = row number.

(b) Slope: For each additional row from the front of the class, the test score is predicted to go down by 1.1171 points, on average. y-intercept: A value of x = 0 does not make sense in this context since no students sit closer than the first row. Standard deviation of the residuals: When we use the least-squares regression line to predict test score from a student’s row number, we will be off by about 10.0673 points, on average.

[Page 748]

Alternate Example: Does seat location matter?

Earlier, we used Minitab to analyze the results of an experiment designed to see if sitting closer to the front of a classroom causes higher achievement. We checked the conditions for inference earlier. Here is a scatterplot of the data and some output from the analysis.

[pic]

Regression Analysis: Score versus Row

Predictor Coef SE Coef T P

Constant 85.706 4.239 20.22 0.000

Row -1.1171 0.9472 -1.18 0.248

S = 10.0673 R-Sq = 4.7% R-Sq(adj) = 1.3%

Problem:

(a) Identify the standard error of the slope SEb from the computer output. Interpret this value in context.

(b) Calculate the 95% confidence interval for the true slope. Show your work.

(c) Interpret the interval from part (b) in context.

(d) Based on your interval, is there convincing evidence that seat location affects scores?

Solution:

(a) SEb = 0.9472. If we repeated the random assignment many times, the slope of the estimated regression line would typically vary by about 0.9472 from the slope of the true regression line for predicting test score from row number.

(b) Because n = 30, df = 30 – 2 = 28 and t* = 2.048. The 95% confidence interval is:

–1.1171 ± 2.048(0.9472) = –1.1171 ± 1.9399 = (–3.0570, 0.8228)

(c) We are 95% confident that the interval from –3.0570 to 0.8228 captures the slope of the true regression line relating a student’s test score y and the student’s row number x.

(d) Because the interval of plausible slopes includes 0, we do not have convincing evidence that there is an association between test score and row number.
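As a check on parts (a) and (b), the slope, its standard error, and the interval can be recomputed from the raw seating data. This is a sketch under stated assumptions: the function name is hypothetical, and the critical value t* = 2.048 is taken from the solution above rather than computed.

```python
import math

# Seating chart data from the Alternate Activity (row 1 = front).
scores_by_row = {
    1: [76, 77, 94, 99],
    2: [83, 85, 74, 79],
    3: [90, 88, 68, 78],
    4: [94, 72, 101, 70, 79],
    5: [76, 65, 90, 67, 96],
    6: [88, 79, 90, 83],
    7: [79, 76, 77, 63],
}
x = [row for row, vals in scores_by_row.items() for _ in vals]
y = [v for vals in scores_by_row.values() for v in vals]


def slope_and_standard_error(x, y):
    """Return (slope b, standard error of b) for the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))  # standard deviation of the residuals
    return b, s / math.sqrt(sxx)


T_STAR = 2.048  # t critical value for 95% confidence, df = 28 (from the solution)

b, se_b = slope_and_standard_error(x, y)
margin = T_STAR * se_b
interval = (b - margin, b + margin)

print(f"b = {b:.4f}, SE_b = {se_b:.4f}")
print(f"95% CI for the slope: ({interval[0]:.4f}, {interval[1]:.4f})")
```

Because the interval contains 0, the code reaches the same conclusion as part (d).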

[Page 749]

Alternate Example: Fresh flowers?

For their second-semester project, two AP Statistics students decided to investigate the effect of sugar on the life of cut flowers. They went to the local grocery store and randomly selected 12 carnations. All the carnations seemed equally healthy when they were selected. When they got home, the students prepared 12 identical vases with exactly the same amount of water in each vase. They put one tablespoon of sugar in 3 vases, two tablespoons of sugar in 3 vases, and three tablespoons of sugar in 3 vases. In the remaining 3 vases, they put no sugar. After the vases were prepared and placed in the same location, the students randomly assigned one flower to each vase and observed how many hours each flower continued to look fresh. Here are the data:

Sugar (tbs) Freshness (hours)

0 168

0 180

0 192

1 192

1 204

1 204

2 204

2 210

2 210

3 222

3 228

3 234

Minitab output from a least-squares regression analysis for these data is shown below.

[pic]

[pic]

[pic]

Regression Analysis: Freshness (hours) versus Sugar (tbs)

Predictor Coef SE Coef T P

Constant 181.200 3.635 49.84 0.000

Sugar (tbs) 15.200 1.943 7.82 0.000

S = 7.52596 R-Sq = 86.0% R-Sq(adj) = 84.5%

Problem:

(a) Construct and interpret a 99% confidence interval for the slope of the true regression line.

(b) Would you feel confident predicting the hours of freshness if 10 tablespoons of sugar are used? Explain.

Solution:

(a) State: We want to estimate the slope β of the true regression line relating hours of freshness y to amount of sugar x at the 99% confidence level.

Plan: If conditions are met, we will use a t interval for the slope to estimate β.

• Linear The scatterplot shows a linear pattern and there is no obvious leftover curvature in the residual plot, even though all of the residuals are negative for x = 2.

• Independent All of the flowers were in different vases, so knowing the hours of freshness for one flower shouldn’t provide additional information about the hours of freshness for other flowers.

• Normal The histogram of residuals does not show any skewness or outliers.

• Equal variance Although the variability of the residuals is not constant at each value of x, there is no systematic pattern such as increasing variation as x increases.

• Random Flowers were randomly assigned to the treatments.

Do: Using df = 12 – 2 = 10, the t critical value is t* = 3.169. Thus, the 99% confidence interval is 15.2 ± 3.169(1.943) = 15.2 ± 6.16 = (9.04, 21.36).

Conclude: We are 99% confident that the interval from 9.04 to 21.36 captures the slope of the true regression line relating hours of freshness y to amount of sugar x.

(b) No, this would be an extrapolation. We have no idea if the linear form will continue beyond x = 3 tablespoons. At some point there will be more sugar than the flower can handle.
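The residual pattern mentioned in the Linear and Equal variance checks, and the extrapolation in part (b), can be examined directly from the data. The sketch below is illustrative only; the helper names are hypothetical. It fits the line, groups the residuals by sugar amount, and flags a prediction at 10 tablespoons as outside the observed range of x.

```python
sugar = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]           # tablespoons
hours = [168, 180, 192, 192, 204, 204, 204, 210, 210,  # hours of freshness
         222, 228, 234]


def fit_line(x, y):
    """Return (intercept a, slope b) of the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b


a, b = fit_line(sugar, hours)

# Residual = observed - predicted, grouped by amount of sugar.
residuals_by_sugar = {}
for xi, yi in zip(sugar, hours):
    residuals_by_sugar.setdefault(xi, []).append(yi - (a + b * xi))

for amount, res in residuals_by_sugar.items():
    print(f"{amount} tbs: residuals {[round(r, 1) for r in res]}")


def predict(x_new):
    """Predicted hours of freshness from the fitted line."""
    return a + b * x_new


x_new = 10
is_extrapolation = not (min(sugar) <= x_new <= max(sugar))
print(f"prediction at {x_new} tbs: {predict(x_new):.1f} hours "
      f"(extrapolation: {is_extrapolation})")
```

The line happily returns a number for 10 tablespoons, but nothing in the data supports the linear form that far beyond x = 3.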

[Page 752]

Alternate Example: Tipping at a buffet

Do customers who stay longer at buffets give larger tips? Charlotte, an AP Statistics student who worked at an Asian buffet, decided to investigate this question for her second-semester project. While she was doing her job as a hostess, she obtained a random sample of receipts, which included the length of time (in minutes) the party was in the restaurant and the amount of the tip (in dollars). Do these data provide convincing evidence that customers who stay longer give larger tips? Here are the data:

|Time (minutes) |Tip (dollars) |

|23 |5.00 |

|39 |2.75 |

|44 |7.75 |

|55 |5.00 |

|61 |7.00 |

|65 |8.88 |

|67 |9.01 |

|70 |5.00 |

|74 |7.29 |

|85 |7.50 |

|90 |6.00 |

|99 |6.50 |

Problem:

(a) Here is a scatterplot of the data with the least-squares regression line added. Describe what this graph tells you about the relationship between the two variables.

[pic]

Minitab output from a linear regression analysis on these data is shown below.

Regression Analysis: Tip (dollars) versus Time (minutes)

Predictor Coef SE Coef T P

Constant 4.535 1.657 2.74 0.021

Time (minutes) 0.03013 0.02448 1.23 0.247

S = 1.77931 R-Sq = 13.2% R-Sq(adj) = 4.5%

[pic]

[pic]

(b) What is the equation of the least-squares regression line for predicting the amount of the tip from the length of the stay? Define any variables you use.

(c) Interpret the slope and y intercept of the least-squares regression line in context.

(d) Carry out an appropriate test to answer Charlotte’s question.

Solution:

(a) There is a weak, positive, linear association between length of stay and tip amount. People who stay longer tend to give larger tips.

(b) The equation of the least-squares regression line is ŷ = 4.535 + 0.030x, where ŷ = predicted tip amount (in dollars) and x = length of stay (in minutes).

(c) Slope: For each additional minute that a party stays at the buffet, the least-squares regression line predicts an increase in the tip of $0.030 (3 cents). y intercept: The model predicts that a party that stays at a buffet for 0 minutes will leave a tip of $4.535. This seems pretty unreasonable and is an extrapolation since we have no data around x = 0.

(d) State: We want to perform a test of

H0: β = 0

Ha: β > 0

where β is the true slope of the population regression line relating length of stay x to tip amount y. We will use α = 0.05.

Plan: If the conditions are met, we will do a t test for the slope β.

• Linear The scatterplot shows a weak, positive, linear relationship between length of stay and tip amount. The residual plot looks randomly scattered about the residual = 0 line.

• Independent Knowing one tip amount shouldn’t provide additional information about other tip amounts. Since we are sampling without replacement, we must assume that there are more than 10(12) = 120 receipts in the population from which these receipts were selected.

• Normal The Normal probability plot looks roughly linear.

• Equal variance The residual plot shows a fairly equal amount of scatter around the residual = 0 line.

• Random The receipts were randomly selected.

With no obvious violations, we proceed to inference.

Do: From the computer output:

• Test statistic t = 1.23

• P-value Since the P-value in the computer output is for a two-sided test, we must cut it in half for a one-sided test. P = 0.247/2 = 0.1235 using df = 12 – 2 = 10.

Conclude: Since the P-value is larger than 0.05, we fail to reject H0. We do not have convincing evidence that parties who stay longer at buffets leave larger tips.
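The test statistic and one-sided P-value in the Do step can be reproduced from the receipt data. The sketch below is an illustration, not Minitab's method: it uses only the Python standard library, so the upper-tail probability of the t distribution is approximated by Simpson's rule on the t density. The function names and the number of integration steps are assumptions.

```python
import math

minutes = [23, 39, 44, 55, 61, 65, 67, 70, 74, 85, 90, 99]
tips = [5.00, 2.75, 7.75, 5.00, 7.00, 8.88, 9.01, 5.00, 7.29, 7.50, 6.00, 6.50]


def slope_t_statistic(x, y):
    """Return (t, df) for testing H0: slope = 0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)
    return b / se_b, n - 2


def t_pdf(t, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def upper_tail_p(t, df, steps=2000):
    """Approximate P(T > t): 0.5 minus the Simpson's-rule area from 0 to t."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


t_stat, df = slope_t_statistic(minutes, tips)
p_one_sided = upper_tail_p(t_stat, df)

print(f"t = {t_stat:.2f}, df = {df}, one-sided P = {p_one_sided:.4f}")
```

The one-sided value should agree closely with half of the two-sided P-value in the computer output, which is the shortcut used in the solution.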

[Page 755]

Alternate Example: Used Hondas

A random sample of 11 used Honda CR-Vs from the 2002-2006 model years was selected from the inventory at . The number of miles driven and the advertised price were recorded for each CR-V. A 95% confidence interval for the slope of the true least-squares regression line for predicting advertised price from number of miles (in thousands) driven is

(–122.3, –50.1).

Problem: Based on this interval, what conclusion should we draw from a test of H0: β = 0 versus Ha: β ≠ 0 at the 0.05 significance level?

Solution: Since 0 is not in the interval of plausible slopes, we reject H0. There is convincing evidence of a linear relationship between the number of miles driven and the advertised price.

[Page 766]

Alternate Example: More from gapminder

• Go to and select the “GAPMINDER WORLD” tab. A scatterplot of life expectancy vs. log(income per person) appears.

• To show students the graph on page 766, choose “lin” from the menu in the lower right corner (currently it should say “log”).

• To look at relationships between other variables, click on the variable name on either axis and a menu with many other variables will pop up. Consider looking at the following relationships that show non-linear patterns. Transform the variables by taking logarithms to see if the relationships can become linear.

o Income per person y versus Children per woman x

o Income per person y versus Under-5 mortality rate x

o Income per person y versus Literacy rate (adult total) x

• To see how these relationships change over time, move the slider at the bottom of the graph to 1800 and press “Play.”

[Page 768]

Alternate Example: Child mortality and income

What does a country’s under-5 child mortality rate (per 1000 live births) tell us about the income per person for residents of that country? Here are the data and a scatterplot for a random sample of 14 countries in 2009 (data from ):

|Country |Under-5 mortality rate (per 1000 live births) |Income per person |

|Switzerland |4.4 |38003.9 |

|Timor-Leste |56.4 |2475.68 |

|Uganda |127.5 |1202.53 |

|Ghana |68.5 |1382.95 |

|Peru |21.3 |7858.97 |

|Cambodia |87.5 |1830.97 |

|Suriname |26.3 |8199.03 |

|Armenia |21.6 |4523.44 |

|Sweden |2.8 |32021 |

|Niger |160.3 |643.39 |

|Serbia |7.1 |10005.2 |

|Kenya |84 |1493.53 |

|Fiji |17.6 |4016.2 |

|Grenada |14.5 |8826.9 |

[pic]

The scatterplot shows a strong negative association that is clearly non-linear. Because the horizontal and vertical axes look like asymptotes, perhaps there is a reciprocal relationship between the variables, such as [pic]. If we find the reciprocal of child mortality rate and graph income versus 1/(child mortality rate), we find a linear relationship!

[pic]

Likewise, we could also calculate the reciprocal of the income per person and graph 1/(income per person) versus child mortality rate. Again we get a linear relationship!

[pic]

[Page 770]

Alternate Example: Child mortality and income

Here is Minitab output from separate regression analyses of the two sets of transformed data.

Transformation 1: (1/Under5, Income)

Predictor Coef SE Coef T P

Constant 869 1615 0.54 0.601

RecipUnder5 104868 13074 8.02 0.000

S = 4797.65 R-Sq = 84.3% R-Sq(adj) = 83.0%

[pic]

Transformation 2: (Under5, 1/Income)

Predictor Coef SE Coef T P

Constant -0.00000585 0.00004849 -0.12 0.906

Under5_Mortality_Rate 0.00000829 0.00000070 11.80 0.000

S = 0.000125175 R-Sq = 92.1% R-Sq(adj) = 91.4%

[pic]

Problem: Do the following for both transformations.

(a) Give the equation of the least-squares regression line. Define any variables you use.

(b) Predict the income per person for Turkey, which had a child mortality rate of 20.3.

(c) Interpret the value of r^2 in context.

Solution:

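(a) Transformation 1: predicted income = 869 + 104868(1/Under5)

Transformation 2: predicted 1/income = –0.00000585 + 0.00000829(Under5)

(b) Transformation 1: predicted income = 869 + 104868(1/20.3) = $6034.91

Transformation 2: predicted 1/income = –0.00000585 + 0.00000829(20.3) = 0.000162437

predicted income = 1/0.000162437 = $6156.23

(c) Transformation 1: 84.3% of the variation in income is accounted for by the least-squares regression line using x = 1/(under-5 mortality rate).

Transformation 2: 92.1% of the variation in 1/(income) is accounted for by the least-squares regression line using x = under-5 mortality rate.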
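The two predictions in part (b) differ only in where the reciprocal is applied. The sketch below uses the coefficients reported in the Minitab output rather than refitting the data; the function names are hypothetical. In Transformation 1 the reciprocal is applied to the explanatory variable before predicting; in Transformation 2 the line predicts 1/income, so the result must be inverted to get back to income.

```python
def predict_income_transformation_1(under5):
    """Transformation 1: income = 869 + 104868 * (1/Under5)."""
    return 869 + 104868 * (1 / under5)


def predict_income_transformation_2(under5):
    """Transformation 2: 1/income = -0.00000585 + 0.00000829 * Under5."""
    predicted_reciprocal = -0.00000585 + 0.00000829 * under5
    return 1 / predicted_reciprocal  # undo the reciprocal on the response


turkey_under5 = 20.3  # under-5 mortality rate given in the problem

income_1 = predict_income_transformation_1(turkey_under5)
income_2 = predict_income_transformation_2(turkey_under5)

print(f"Transformation 1: ${income_1:.2f}")
print(f"Transformation 2: ${income_2:.2f}")
```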

[Page 772]

Alternate Example: Dating with carbon-14

When scientists estimate the age of fossils, they often measure the amount of the radioactive isotope carbon-14 in the remains. Carbon-14 has a half-life of approximately 5730 years, meaning that in 5730 years the amount of carbon-14 in organic matter will be half of what it was originally. Suppose that the organic matter in a fossil originally had 1000 mg of carbon-14. After 5730 years, the amount of carbon-14 should be 1000(1/2)^1 = 500 mg. To find y = amount of carbon-14 remaining after x years, use the function y = 1000(1/2)^(x/5730). Here is a graph of this function for the first 50,000 years.

[pic]
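The decay function can be evaluated at a few ages to see the halving pattern numerically. This sketch assumes the function y = 1000(1/2)^(x/5730), which follows from the 1000 mg starting amount and the 5730-year half-life stated above; the function names are hypothetical, and the inverse function is added only to illustrate how an age estimate would follow from a measured amount.

```python
import math

HALF_LIFE_YEARS = 5730  # approximate half-life of carbon-14 (from the text)
INITIAL_MG = 1000       # assumed original amount of carbon-14 (from the text)


def carbon14_remaining(years):
    """Amount of carbon-14 (mg) remaining after the given number of years."""
    return INITIAL_MG * 0.5 ** (years / HALF_LIFE_YEARS)


def years_elapsed(remaining_mg):
    """Invert the decay function: years needed to decay to remaining_mg."""
    return HALF_LIFE_YEARS * math.log2(INITIAL_MG / remaining_mg)


for years in (0, 5730, 11460, 50000):
    print(f"{years:>6} years: {carbon14_remaining(years):8.2f} mg")

print(f"250 mg remaining corresponds to about {years_elapsed(250):.0f} years")
```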

[Page 773]

Alternate Example: More M&M’s

A student opened a bag of M&M’s, dumped them out, and ate all the ones with the M on top. When he finished, he put the remaining 30 M&M’s back in the bag and repeated the same process over and over until all the M&M’s were gone. Here is a table and scatterplot showing the number of M&M’s remaining at the end of each “course”.

|Course |M&M’s remaining |

|1 |30 |

|2 |13 |

|3 |10 |

|4 |3 |

|5 |2 |

|6 |1 |

|7 |0 |

[pic]

Since the number of M&M’s should be cut in half after each course, an exponential model should describe the relationship between the variables.

Problem:

(a) A scatterplot of the natural log of the number of M&M’s remaining versus course number is shown below. The last observation in the table is not included since ln(0) is undefined. Explain why it would be reasonable to use an exponential model to describe the relationship between the number of M&M’s remaining and the course number.

[pic]

(b) Minitab output from a linear regression analysis on the transformed data is shown below. Give the equation of the least-squares regression line, defining any variables you use.

Regression Analysis: LnRemaining versus Course

Predictor Coef SE Coef T P

Constant 4.0593 0.1852 21.92 0.000

Course -0.68073 0.04755 -14.32 0.000

S = 0.198897 R-Sq = 98.1% R-Sq(adj) = 97.6%

(c) Use your model from part (b) to predict the original number of M&M’s in the bag.

(d) A residual plot of the linear regression in part (b) is shown below. Discuss what this graph tells you about the appropriateness of the model.

[pic]

Solution:

(a) If there is an exponential relationship between two variables x and y, we expect a scatterplot of (x, ln y) to be roughly linear. Since the scatterplot of ln(remaining) versus course number is roughly linear, an exponential model seems appropriate here.

(b) ln(ŷ) = 4.0593 – 0.68073x, where ln(ŷ) = the predicted value of the natural log of the number of M&M’s remaining and x = course number.

(c) To estimate the original number of M&M’s, we need to predict the amount remaining after course 0. ln(ŷ) = 4.0593 – 0.68073(0) = 4.0593. Thus, ŷ = e^4.0593 = 57.93 M&M’s.

(d) Since there is no obvious leftover curvature in the residual plot, the model is appropriate for these data.
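The fit in part (b) and the back-transformed prediction in part (c) can be reproduced from the table. The following is a sketch with hypothetical names: it regresses ln(remaining) on course number for courses 1 through 6 (course 7 is dropped because ln(0) is undefined) and then exponentiates. The exponentiated slope is included as an extra check on the claim that about half of the M&M's should remain after each course.

```python
import math

courses = [1, 2, 3, 4, 5, 6]
remaining = [30, 13, 10, 3, 2, 1]  # course 7 (0 remaining) is omitted
ln_remaining = [math.log(r) for r in remaining]


def fit_line(x, y):
    """Return (intercept a, slope b) of the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b


a, b = fit_line(courses, ln_remaining)

# Back-transform: ln(y) = a + b*x  is the same as  y = e^a * (e^b)^x.
predicted_original = math.exp(a)  # predicted count at course 0
decay_factor = math.exp(b)        # multiplier applied at each course

print(f"ln(remaining) = {a:.4f} + ({b:.5f})(course)")
print(f"predicted original number of M&M's: {predicted_original:.2f}")
print(f"estimated fraction remaining after each course: {decay_factor:.3f}")
```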

[Page 778]

Alternate Example: Child mortality and income

In an earlier alternate example, we transformed the variables Under-5 child mortality and Income per person using the reciprocal function for a random sample of 14 countries. We can also try transforming these variables using the natural logarithm function. Here is a scatterplot of the data.

[pic]

Problem: The graphs below show the results of two different transformations of the data.

[pic]

[pic]

(a) Explain why a power model would provide a more appropriate description of the relationship between income per person and under-5 mortality rate than an exponential model.

(b) Minitab output for a linear regression analysis using y = ln(Income) and x = ln(Under5) is shown below. Give the equation of the least-squares regression line, defining any variables you use.

Regression Analysis: ln(Income) versus ln(Under5)

Predictor Coef SE Coef T P

Constant 11.4950 0.2680 42.90 0.000

ln(Under5) -0.93740 0.07579 -12.37 0.000

S = 0.343501 R-Sq = 92.7% R-Sq(adj) = 92.1%

(c) Use your model to predict the income per person for Turkey, with an under-5 mortality rate of 20.3.

(d) A residual plot for the linear regression in part (b) is shown below. What does the plot tell you about the appropriateness of the power model?

[pic]

Solution:

(a) The scatterplot of ln(Income) versus ln(Under5) is more linear than the scatterplot of ln(Income) versus Under5.

(b) ln(ŷ) = 11.4950 – 0.93740 ln(x), where ln(ŷ) is the predicted value of ln(income) and x = Under5 mortality rate.

(c) ln(ŷ) = 11.4950 – 0.93740 ln(20.3) = 8.6728, so ŷ = e^8.6728 = $5842.

(d) Since there is no leftover curvature in the residual plot, the power model is appropriate.
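The prediction in part (c) can be checked with a few lines of code. This sketch uses the coefficients reported in the Minitab output rather than refitting the data, and the function names are hypothetical. It also writes the same model in power form, income = e^a (Under5)^b, to show why a line on the log-log scale corresponds to a power model.

```python
import math

INTERCEPT = 11.4950  # Constant, from the Minitab output
SLOPE = -0.93740     # coefficient of ln(Under5), from the Minitab output


def predict_income(under5):
    """Predict ln(income) on the log-log scale, then exponentiate."""
    predicted_ln_income = INTERCEPT + SLOPE * math.log(under5)
    return math.exp(predicted_ln_income)


def predict_income_power_form(under5):
    """Same model written as a power function of Under5."""
    return math.exp(INTERCEPT) * under5 ** SLOPE


turkey = predict_income(20.3)
print(f"predicted income per person for Turkey: ${turkey:.0f}")
```

Both functions return the same value for any positive mortality rate, since exponentiating a + b ln(x) gives e^a times x^b.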
