diff --git a/24-inf-model-slr.qmd b/24-inf-model-slr.qmd
index e9e7c349..ae6a8e62 100644
--- a/24-inf-model-slr.qmd
+++ b/24-inf-model-slr.qmd
@@ -947,7 +947,7 @@ When fitting a least squares line, we generally require the following:
- **Nearly normal residuals.** Generally, the residuals should be nearly normal.
When this condition is found to be unreasonable, it is often because of outliers or concerns about influential points, which we'll talk about more in @sec-outliers-in-regression.
An example of a residual that would be potentially concerning is shown in the second panel of @fig-whatCanGoWrongWithLinearModel, where one observation is clearly much further from the regression line than the others. Outliers should be treated extremely carefully. Do not automatically remove an outlier if it truly belongs in the dataset. However, be honest about its impact on the analysis. A strategy for dealing with outliers is to present two analyses: one with the outlier and one without the outlier.
- Additionally, a type of violation of normality happens when the positive residuals are smaller in magnitude than the negative residuals (or vice versa). That is, when the residuals are not symmetrically distributed around the line $y=0.$
+Additionally, a type of violation of normality happens when the positive residuals are smaller in magnitude than the negative residuals (or vice versa). That is, when the residuals are not symmetrically distributed around the line $y=0.$
- **Constant or equal variability.** The variability of points around the least squares line remains roughly constant.
An example of non-constant variability is shown in the third panel of @fig-whatCanGoWrongWithLinearModel, which represents the most common pattern observed when this condition fails: the variability of $y$ is larger when $x$ is larger.
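The non-constant-variance failure mode is easy to generate and check numerically. As a standalone sketch (illustrative Python with simulated data; this is not code from the text, which is R-based, and the variable names and seed are arbitrary), the example below creates data whose noise grows with $x$ and confirms that the residual spread differs between small and large $x$:

```python
# Sketch (not from the text): simulate a violation of the
# constant-variance condition and check residual spread numerically.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 2 + 3 * x + rng.normal(0, 0.5 * x)   # noise standard deviation grows with x

# least squares fit (np.polyfit returns highest-degree coefficient first)
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)

# compare residual spread on the lower and upper halves of the x range
low = resid[x < 5].std()
high = resid[x >= 5].std()
print(f"SD of residuals, small x: {low:.2f}; large x: {high:.2f}")
```

In practice this check is usually done visually with a residuals-versus-fitted plot, but the numeric comparison makes the fan-shaped pattern concrete: the residual SD for large $x$ is several times that for small $x$.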
@@ -1180,8 +1180,6 @@ If there are large deviations, we will be unable to trust the calculated p-value
The linearity condition is among the most important if your goal is to understand a linear model between $x$ and $y$.
For example, the value of the slope will not be at all meaningful if the true relationship between $x$ and $y$ is quadratic, as in @fig-notGoodAtAllForALinearModel.
Not only should we be cautious about the inference, but the model *itself* is also not an accurate portrayal of the relationship between the variables.
-
-In @sec-inf-model-mlr we discuss model modifications that can often lead to an excellent fit of strong relationships other than linear ones.
However, an extended discussion on the different methods for modeling functional forms other than linear is outside the scope of this text.
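The warning about a quadratic relationship can be made concrete with a small simulation. In this sketch (illustrative Python, not the text's R code; the data are simulated and the settings are arbitrary), $y$ depends strongly on $x$, yet the fitted slope and the correlation are both close to zero:

```python
# Sketch (not from the text): when the true relationship is quadratic,
# the least squares slope can be near zero even though x strongly
# determines y.
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(-3, 3, 500)
y = x**2 + rng.normal(0, 0.3, 500)   # true relationship: quadratic, not linear

slope, intercept = np.polyfit(x, y, 1)
print(f"fitted slope: {slope:.3f}")  # near zero: the line misses the curve

# the correlation is also near zero despite a strong nonlinear relationship
r = np.corrcoef(x, y)[0, 1]
print(f"correlation: {r:.3f}")
```

A near-zero slope here does not mean "no relationship"; it means the linear summary is the wrong summary, which is exactly why the slope (and any inference about it) is not meaningful under this kind of violation.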
**The importance of independence**
@@ -1207,7 +1205,7 @@ You should consider the "bell" of the normal distribution as sitting on top of t
The normality condition is less important than linearity or independence for a few reasons.
First, the linear model fit with least squares will still be an unbiased estimate of the true population model.
However, the distribution of the estimate will be unknown.
-Fortunately the Central Limit Theorem tells us that most of the analyses (e.g., SEs, p-values, confidence intervals) done using the mathematical model (with the $t$-distribution) will still hold (even if the data are not normally distributed around the line) as long as the sample size is large enough.
+Fortunately, the Central Limit Theorem (described in @sec-one-mean-math) tells us that most of the analyses (e.g., SEs, p-values, confidence intervals) done using the mathematical model (with the $t$-distribution) will still hold (even if the data are not normally distributed around the line) as long as the sample size is large enough.
One analysis method that *does* require normality, regardless of sample size, is creating intervals which predict the response of individual outcomes at a given $x$ value, using the linear model.
One additional reason to worry slightly less about normality is that neither the randomization test nor the bootstrapping procedures require the data to be normal around the line.
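The Central Limit Theorem claim can be illustrated with a simulation. In the sketch below (illustrative Python, not from the text; the sample size, number of repetitions, and error distribution are arbitrary choices), the errors around the line are strongly skewed, yet the sampling distribution of the least squares slope is centered on the true slope and roughly symmetric:

```python
# Sketch (not from the text): the sampling distribution of the least
# squares slope is approximately normal for a reasonably large sample
# even when the errors around the line are strongly skewed.
import numpy as np

rng = np.random.default_rng(3)
n, reps = 100, 2000
slopes = []
for _ in range(reps):
    x = rng.uniform(0, 10, n)
    eps = rng.exponential(2, n) - 2      # right-skewed errors with mean zero
    y = 1 + 0.5 * x + eps                # true slope is 0.5
    slopes.append(np.polyfit(x, y, 1)[0])
slopes = np.asarray(slopes)

print(f"mean of slope estimates:   {slopes.mean():.3f}")
print(f"median of slope estimates: {np.median(slopes):.3f}")
```

The mean and median of the simulated slopes nearly coincide at the true value of 0.5, consistent with an approximately symmetric, bell-shaped sampling distribution despite the skewed errors.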
@@ -1229,7 +1227,7 @@ In particular, random effects models, repeated measures, and interaction are all
When the technical conditions hold, the extensions to the linear model can provide important insight into the data and research question at hand.
We will discuss some of the extended modeling and associated inference in @sec-inf-model-mlr and @sec-inf-model-logistic.
Many of the techniques used to deal with technical condition violations are outside the scope of this text, but they are typically taught in the statistics courses that immediately follow this one.
-If you are working with linear models or curious to learn more, we recommend that you continue learning about statistical methods applicable to a larger class of datasets.
+If you are working with linear models or are curious to learn more, we recommend that you continue learning about statistical methods applicable to a larger class of datasets.
\clearpage
diff --git a/_freeze/24-inf-model-slr/execute-results/html.json b/_freeze/24-inf-model-slr/execute-results/html.json
index b852b8f3..8939e750 100644
--- a/_freeze/24-inf-model-slr/execute-results/html.json
+++ b/_freeze/24-inf-model-slr/execute-results/html.json
@@ -1,8 +1,8 @@
{
- "hash": "6c3a88d0e70db8d77b45dac287a23b86",
+ "hash": "5e3f8cd695d0b3b44e310bc579c42dab",
"result": {
"engine": "knitr",
- "markdown": "# Inference for linear regression with a single predictor {#sec-inf-model-slr}\n\n\n\n\n\n\\chaptermark{Inference for regression with a single predictor}\n\n::: {.chapterintro data-latex=\"\"}\nWe now bring together ideas of inferential analyses with the descriptive models seen in [Chapter -@sec-model-slr].\nIn particular, we will use the least squares regression line to test whether there is a relationship between two continuous variables.\nAdditionally, we will build confidence intervals which quantify the slope of the linear regression line.\nThe setting is now focused on predicting a numeric response variable (for linear models) or a binary response variable (for logistic models), we continue to ask questions about the variability of the model from sample to sample.\nThe sampling variability will inform the conclusions about the population that can be drawn.\n\nMany of the inferential ideas are remarkably similar to those covered in previous chapters.\nThe technical conditions for linear models are typically assessed graphically, although independence of observations continues to be of utmost importance.\n\nWe encourage the reader to think broadly about the models at hand without putting too much dependence on the exact p-values that are reported from the statistical software.\nInference on models with multiple explanatory variables can suffer from data snooping which result in false positive claims.\nWe provide some guidance and hope the reader will further their statistical learning after working through the material in this text.\n:::\n\n\n\n\n\n## Case study: Sandwich store\n\n### Observed data\n\nWe start the chapter with a hypothetical example describing the linear relationship between dollars spent advertising for a chain sandwich restaurant and monthly revenue.\nThe hypothetical example serves the purpose of illustrating how a linear model varies from sample to sample.\nBecause we have made up the example and the data (and the entire 
population), we can take many many samples from the population to visualize the variability.\nNote that in real life, we always have exactly one sample (that is, one dataset), and through the inference process, we imagine what might have happened had we taken a different sample.\nThe change from sample to sample leads to an understanding of how the single observed dataset is different from the population of values, which is typically the fundamental goal of inference.\n\nConsider the following hypothetical population of all of the sandwich stores of a particular chain seen in @fig-sandpop.\nIn this made-up world, the CEO actually has all the relevant data, which is why they can plot it here.\nThe CEO is omniscient and can write down the population model which describes the true population relationship between the advertising dollars and revenue.\nThere appears to be a linear relationship between advertising dollars and revenue (both in \\$1,000).\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-sandpop fig-alt='Scatterplot with advertising amount on the x-axis and revenue on the y-axis. A linear model is superimposed. The points show a reasonably strong and positive linear trend.' 
width=90%}\n:::\n:::\n\n\nYou may remember from @sec-model-slr that the population model is: $$y = \\beta_0 + \\beta_1 x + \\varepsilon.$$\n\nAgain, the omniscient CEO (with the full population information) can write down the true population model as: $$\\texttt{expected revenue} = 11.23 + 4.8 \\times \\texttt{advertising}.$$\n\n### Variability of the statistic\n\nUnfortunately, in our scenario, the CEO is not willing to part with the full set of data, but they will allow potential franchise buyers to see a small sample of the data in order to help the potential buyer decide whether set up a new franchise.\nThe CEO is willing to give each potential franchise buyer a random sample of data from 20 stores.\n\nAs with any numerical characteristic which describes a subset of the population, the estimated slope of a sample will vary from sample to sample.\nConsider the linear model which describes revenue (in \\$1,000) based on advertising dollars (in \\$1,000).\n\nThe least squares regression model uses the data to find a sample linear fit: $$\\hat{y} = b_0 + b_1 x.$$\n\nA random sample of 20 stores shows a different least square regression line depending on which observations are selected.\nA subset of size 20 stores shows a similar positive trend between advertising and revenue (to what we saw in @fig-sandpop which described the population) despite having fewer observations on the plot.\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-sand-samp1 fig-alt='For a random sample of 20 stores, scatterplot with advertising amount on the x-axis and\nrevenue on the y-axis. A linear model is superimposed. The points show a reasonably strong\nand positive linear trend.' width=90%}\n:::\n:::\n\n\nA second sample of size 20 also shows a positive trend!\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-sand-samp2 fig-alt='For a random sample of 20 stores, scatterplot with advertising amount on the x-axis and revenue on the y-axis. 
A linear model is superimposed. The points show a reasonably strong and positive linear trend.' width=90%}\n:::\n:::\n\n\nBut the lines are slightly different!\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-sand-samp12 fig-alt='For two different random samples, superimposed onto the same plot, scatterplot with advertising amount on the x-axis and revenue on the y-axis. Two linear models are plotted to demonstrate that the lines are very similar, yet they are not the same.' width=90%}\n:::\n:::\n\n\nThat is, there is **variability** in the regression line from sample to sample.\nThe concept of the sampling variability is something you've seen before, but in this lesson, you will focus on the variability of the line often measured through the variability of a single statistic: **the slope of the line**.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-slopes fig-alt='An x-y coordinate system with least squares regression lines from many random samples of size 20 (no points are plotted). The lines vary around the true population line. On the x-axis is advertising amount; on the y-axis is revenue.' width=90%}\n:::\n:::\n\n\nYou might notice in @fig-slopes that the $\\hat{y}$ values given by the lines are much more consistent in the middle of the dataset than at the ends.\nThe reason is that the data itself anchors the lines in such a way that the line must pass through the center of the data cloud.\nThe effect of the fan-shaped lines is that predicted revenue for advertising close to \\$4,000 will be much more precise than the revenue predictions made for \\$1,000 or \\$7,000 of advertising.\n\nThe distribution of slopes (for samples of size $n=20$) can be seen in a histogram, as in @fig-sand20lm.\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-sand20lm fig-alt='Histogram of the slope values from many random samples of size 20. The slope estimates vary from about 2.1 to 8. The histogram is reasonably bell-shaped.' 
width=90%}\n:::\n:::\n\n\nRecall, the example described in this introduction is hypothetical.\nThat is, we created an entire population in order demonstrate how the slope of a line would vary from sample to sample.\nThe tools in this textbook are designed to evaluate only one single sample of data.\nWith actual studies, we do not have repeated samples, so we are not able to use repeated samples to visualize the variability in slopes.\nWe have seen variability in samples throughout this text, so it should not come as a surprise that different samples will produce different linear models.\nHowever, it is nice to visually consider the linear models produced by different slopes.\nAdditionally, as with measuring the variability of previous statistics (e.g., $\\overline{X}_1 - \\overline{X}_2$ or $\\hat{p}_1 - \\hat{p}_2$), the histogram of the sample statistics can provide information related to inferential considerations.\n\nIn the following sections, the distribution (i.e., histogram) of $b_1$ (the estimated slope coefficient) will be constructed in the same three ways that, by now, may be familiar to you.\nFirst (in @sec-randslope), the distribution of $b_1$ when $\\beta_1 = 0$ is constructed by randomizing (permuting) the response variable.\nNext (in @sec-bootbeta1), we can bootstrap the data by taking random samples of size $n$ from the original dataset.\nAnd last (in @sec-mathslope), we use mathematical tools to describe the variability using the $t$-distribution that was first encountered in @sec-one-mean-math.\n\n## Randomization test for the slope {#sec-randslope}\n\nConsider data on 100 randomly selected births gathered originally from the US Department of Health and Human Services.\nSome of the variables are plotted in @fig-babyweight.\n\nThe scientific research interest at hand will be in determining the linear relationship between weight of baby at birth (in lbs) and number of weeks of gestation.\nThe dataset is quite rich and deserves exploring, but for 
this example, we will focus only on the weight of the baby.\n\n::: {.data data-latex=\"\"}\nThe [`births14`](http://openintrostat.github.io/openintro/reference/births14.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\nWe will work with a random sample of 100 observations from these data.\n:::\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-babyweight fig-alt='Four different scatterplots, all with weight of baby on the y-axis. On the x-axis are weight gained by mother, mother\\'s age, number of hospital visits, and weeks gestation. Weeks gestation and weight of baby show the strongest linear relationship (which is positive).' width=90%}\n:::\n:::\n\n\nAs you have seen previously, statistical inference typically relies on setting a null hypothesis which is hoped to be subsequently rejected.\nIn the linear model setting, we might hope to have a linear relationship between `weeks` and `weight` in settings where `weeks` gestation is known and `weight` of baby needs to be predicted.\n\nThe relevant hypotheses for the linear model setting can be written in terms of the population slope parameter.\nHere the population refers to a larger population of births in the US.\n\n- $H_0: \\beta_1= 0$, there is no linear relationship between `weight` and `weeks`.\n- $H_A: \\beta_1 \\ne 0$, there is some linear relationship between `weight` and `weeks`.\n\nRecall that for the randomization test, we permute one variable to eliminate any existing relationship between the variables.\nThat is, we set the null hypothesis to be true, and we measure the natural variability in the data due to sampling but **not** due to variables being correlated.\n@fig-permweightScatter shows the observed data and a scatterplot of one permutation of the `weight` variable.\nThe careful observer can see that each of the observed values for `weight` (and for `weeks`) exist in both the original data plot as well as the permuted `weight` plot, but the 
`weight` and `weeks` gestation are no longer matched for a given birth.\nThat is, each `weight` value is randomly assigned to a new `weeks` gestation.\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-permweightScatter fig-alt='Two scatterplots, both with length of gestation on the x-axis and weight of baby on the y-axis. The left panel is the original data. The right panel is data where the weight of the baby has been permuted across the observations.' width=90%}\n:::\n:::\n\n\nBy repeatedly permuting the response variable, any pattern in the linear model that is observed is due only to random chance (and not an underlying relationship).\nThe randomization test compares the slopes calculated from the permuted response variable with the observed slope.\nIf the observed slope is inconsistent with the slopes from permuting, we can conclude that there is some underlying relationship (and that the slope is not merely due to random chance).\n\n### Observed data\n\nWe will continue to use the births data to investigate the linear relationship between `weight` and `weeks` gestation.\nNote that the least squares model (see @sec-model-slr) describing the relationship is given in @tbl-ls-births.\nThe columns in @tbl-ls-births are further described in @sec-mathslope.\n\n\n::: {#tbl-ls-births .cell tbl-cap='The least squares estimates of the intercept and slope are given in the estimate column. The observed slope is 0.335.'}\n::: {.cell-output-display}\n`````{=html}\n
<table>\n<thead>\n<tr><th>term</th><th>estimate</th><th>std.error</th><th>statistic</th><th>p.value</th></tr>\n</thead>\n<tbody>\n<tr><td>(Intercept)</td><td>-5.72</td><td>1.61</td><td>-3.54</td><td>6e-04</td></tr>\n<tr><td>weeks</td><td>0.34</td><td>0.04</td><td>8.07</td><td>&lt;0.0001</td></tr>\n</tbody>\n</table>
\n\n`````\n:::\n:::\n\n\n### Variability of the statistic\n\nAfter permuting the data, the least squares estimate of the line can be computed.\nRepeated permutations and slope calculations describe the variability in the line (i.e., in the slope) due only to the natural variability and not due to a relationship between `weight` and `weeks` gestation.\n@fig-permweekslm shows two different permutations of `weight` and the resulting linear models.\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-permweekslm fig-alt='Two scatterplots, both with length of gestation on the x-axis and weight of baby on the y-axis. Each plot includes data where the weight of the baby has been permuted across the observations. The two different permutations produce slightly different least squares regression lines.' width=90%}\n:::\n:::\n\n\nAs you can see, sometimes the slope of the permuted data is positive, sometimes it is negative.\nBecause the randomization happens under the condition of no underlying relationship (because the response variable is completely mixed with the explanatory variable), we expect to see the center of the randomized slope distribution to be zero.\n\n### Observed statistic vs. null statistics\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-nulldistBirths fig-alt='Histogram of slopes describing the linear model from permuted weight regressed on weeks gestation. The permuted slopes range from -0.15 to +0.15 and are nowhere near the observed slope value of 0.335.' 
width=90%}\n:::\n:::\n\n\n\nAs we can see from @fig-nulldistBirths, a slope estimate as extreme as the observed slope estimate (the red line) never happened in many repeated permutations of the `weight` variable.\nThat is, if indeed there were no linear relationship between `weight` and `weeks`, the natural variability of the slopes would produce estimates between approximately -0.15 and +0.15.\nWe reject the null hypothesis.\nTherefore, we believe that the slope observed on the original data is not just due to natural variability and indeed, there is a linear relationship between `weight` of baby and `weeks` gestation for births in the US.\n\n## Bootstrap confidence interval for the slope {#sec-bootbeta1}\n\nAs we have seen in previous chapters, we can use bootstrapping to estimate the sampling distribution of the statistic of interest (here, the slope) without the null assumption of no relationship (which was the condition in the randomization test).\nBecause interest is now in creating a CI, there is no null hypothesis, so there won't be any reason to permute either of the variables.\n\n\n\n\n\n### Observed data\n\nReturning to the births data, we may want to consider the relationship between `mage` (mother's age) and `weight`.\nIs `mage` a good predictor of `weight`?\nAnd if so, what is the relationship?\nThat is, what is the slope that models average `weight` of baby as a function of `mage` (mother's age)?\nThe linear model regressing `weight` on `mage` is provided in @tbl-ls-births-mage.\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-magePlot fig-alt='Scatterplot with mother\\'s age on the x-axis and baby\\'s weight on the y-axis. A linear model is superimposed. The points show a weak positive linear trend.' width=90%}\n:::\n:::\n\n::: {#tbl-ls-births-mage .cell tbl-cap='The least squares estimates of the intercept and slope are given in the estimate column. 
The observed slope is 0.036.'}\n::: {.cell-output-display}\n`````{=html}\n<table>\n<thead>\n<tr><th>term</th><th>estimate</th><th>std.error</th><th>statistic</th><th>p.value</th></tr>\n</thead>\n<tbody>\n<tr><td>(Intercept)</td><td>6.23</td><td>0.71</td><td>8.79</td><td>&lt;0.0001</td></tr>\n<tr><td>mage</td><td>0.04</td><td>0.02</td><td>1.50</td><td>0.1362</td></tr>\n</tbody>\n</table>
\n\n`````\n:::\n:::\n\n\n### Variability of the statistic\n\nBecause the focus here is *not* on a null distribution, we sample with replacement $n = 100$ observations from the original dataset.\nRecall that with bootstrapping the resample always has the same number of observations as the original dataset in order to mimic the process of taking a sample from the population.\nWhen sampling in the linear model case, consider each observation to be a single dot.\nIf the dot is resampled, both the `weight` and the `mage` measurement are observed.\nThe measurements are linked to the dot (i.e., to the birth in the sample).\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-birth2BS fig-alt='Two scatterplots, both with mother\\'s age on the x-axis and baby\\'s weight on the y-axis. The left plot is the original data. The right plot is the bootstrapped data. Comparing the bootstrapped points to the original points, we can see that some observations were sampled more than once, and some observations were not selected for the bootstrap sample at all.' width=90%}\n:::\n:::\n\n\n@fig-birth2BS shows the original data as compared with a single bootstrap sample, resulting in (slightly) different linear models.\nThe red circles represent points in the original data which were not included in the bootstrap sample.\nThe blue circles represent a point that was repeatedly resampled (and is therefore darker) in the bootstrap sample.\nThe green circles represent a particular structure to the data which is observed in both the original and bootstrap samples.\nBy repeatedly resampling, we can see dozens of bootstrapped slopes on the same plot in @fig-birthBS.\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-birthBS fig-alt='An x-y coordinate system with least squares regression lines from many bootstrap samples (no points are plotted). The lines vary around the observed population line. 
On the x-axis is mother\\'s age; on the y-axis is baby\\'s weight' width=90%}\n:::\n:::\n\n\nRecall that in order to create a confidence interval for the slope, we need to find the range of values that the statistic (here the slope) takes on from different bootstrap samples.\n@fig-mageBSslopes is a histogram of the relevant bootstrapped slopes.\nWe can see that a 95% bootstrap percentile interval for the true population slope is given by (-0.01, 0.081).\nWe are 95% confident that for the model describing the population of births, described by mother's age and `weight` of baby, a one unit increase in `mage` (in years) will be associated with an increase in predicted average baby `weight` of between -0.01 and 0.081 pounds (notice that the CI overlaps zero, so the true relationship *might* be null!).\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-mageBSslopes fig-alt='Histogram of the slopes computed from many bootstrapped samples. The bootstrap samples range from -0.05 (with the 2.5 percentile at -0.01) to +0.1 (with the 97.5 percentile at 0.081). The bootstrapped slopes form a histogram that is reasonably symmetric and bell-shaped.' 
width=90%}\n:::\n:::\n\n\n\n::: {.workedexample data-latex=\"\"}\nUsing @fig-mageBSslopes, calculate the bootstrap estimate for the standard error of the slope.\nUsing the bootstrap standard error, find a 95% bootstrap SE confidence interval for the true population slope, and interpret the interval in context.\n\n------------------------------------------------------------------------\n\nNotice that most of the bootstrapped slopes fall between -0.01 and +0.08 (a range of 0.09).\nUsing the empirical rule (that with bell-shaped distributions, most observations are within two standard errors of the center), the standard error of the slopes is approximately 0.0225.\nThe normal cutoff for a 95% confidence interval is $z^\\star = 1.96$ which leads to a confidence interval of $b_1 \\pm 1.96 \\cdot SE \\rightarrow 0.036 \\pm 1.96 \\cdot 0.0225 \\rightarrow (-0.0081, 0.0801).$ The bootstrap SE confidence interval is almost identical to the bootstrap percentile interval.\nIn context, we are 95% confident that for the model describing the population of births, described by mother's age and `weight` of baby, a one unit increase in `mage` (in years) will be associated with an increase in predicted average baby `weight` of between -0.0081 and 0.0801 pounds\n:::\n\n## Mathematical model for testing the slope {#sec-mathslope}\n\nWhen certain technical conditions apply, it is convenient to use mathematical approximations to test and estimate the slope parameter.\nThe approximations will build on the t-distribution which was described in @sec-inference-one-mean.\nThe mathematical model is often correct and is usually easy to implement computationally.\nThe validity of the technical conditions will be considered in detail in @sec-tech-cond-linmod.\n\nIn this section, we discuss uncertainty in the estimates of the slope and y-intercept for a regression line.\nJust as we identified standard errors for point estimates in previous chapters, we start by discussing standard errors for the 
slope and y-intercept estimates.\n\n### Observed data\n\n**Midterm elections and unemployment**\n\nElections for members of the United States House of Representatives occur every two years, coinciding every four years with the U.S.\nPresidential election.\nThe set of House elections occurring during the middle of a Presidential term are called midterm elections.\nIn America's two-party system (the vast majority of House members through history have been either Republicans or Democrats), one political theory suggests the higher the unemployment rate, the worse the President's party will do in the midterm elections.\nIn 2020 there were 232 Democrats, 198 Republicans, and 1 Libertarian in the House.\n\nTo assess the validity of the claim related to unemployment and voting patterns, we can compile historical data and look for a connection.\nWe consider every midterm election from 1898 to 2018, with the exception of those elections during the Great Depression.\nThe House of Representatives is made up of 435 voting members.\n\n::: {.data data-latex=\"\"}\nThe [`midterms_house`](http://openintrostat.github.io/openintro/reference/midterms_house.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n@fig-unemploymentAndChangeInHouse shows these data and the least-squares regression line:\n\n$$\n\\begin{aligned}\n&\\texttt{percent change in House seats for President's party} \\\\\n&\\qquad\\qquad= -7.36 - 0.89 \\times \\texttt{(unemployment rate)}\n\\end{aligned}\n$$\n\nWe consider the percent change in the number of seats of the President's party (e.g., percent change in the number of seats for Republicans in 2018) against the unemployment rate.\n\nExamining the data, there are no clear deviations from linearity or substantial outliers (see @sec-resids for a discussion on using residuals to visualize how well a linear model fits the data).\nWhile the data are collected sequentially, a separate analysis was used to check 
for any apparent correlation between successive observations; no such correlation was found.\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-unemploymentAndChangeInHouse fig-alt='Scatterplot with percent unemployed on the x-axis and percent change in House seats for the President\\'s party on the y-axis. Each point represents a different President\\'s midterm and is colored according to their political party (Democrat or Republican). The relationship is moderate and negative.' width=90%}\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nThe data for the Great Depression (1934 and 1938) were removed because the unemployment rate was 21% and 18%, respectively.\nDo you agree that they should be removed for this investigation?\nWhy or why not?[^24-inf-model-slr-1]\n:::\n\n[^24-inf-model-slr-1]: The answer to this question relies on the idea that statistical data analysis is somewhat of an art.\n That is, in many situations, there is no \"right\" answer.\n As you do more and more analyses on your own, you will come to recognize the nuanced understanding which is needed for a particular dataset.\n In terms of the Great Depression, we will provide two contrasting considerations.\n Each of these points would have very high leverage on any least-squares regression line, and years with such high unemployment may not help us understand what would happen in other years where the unemployment is only modestly high.\n On the other hand, the Depression years are exceptional cases, and we would be discarding important information if we exclude them from a final analysis.\n\nThere is a negative slope in the line shown in @fig-unemploymentAndChangeInHouse.\nHowever, this slope (and the y-intercept) are only estimates of the parameter values.\nWe might wonder, is this convincing evidence that the \"true\" linear model has a negative slope?\nThat is, do the data provide strong evidence that the political theory is accurate, where the unemployment rate is a useful predictor of 
the midterm election?\nWe can frame this investigation into a statistical hypothesis test:\n\n- $H_0$: $\\beta_1 = 0$. The true linear model has slope zero.\n- $H_A$: $\\beta_1 \\neq 0$. The true linear model has a slope different than zero. The unemployment rate is predictive of whether the President's party wins or loses seats in the House of Representatives.\n\nWe would reject $H_0$ in favor of $H_A$ if the data provide strong evidence that the true slope parameter is different than zero.\nTo assess the hypotheses, we identify a standard error for the estimate, compute an appropriate test statistic, and identify the p-value.\n\n### Variability of the statistic\n\nJust like other point estimates we have seen before, we can compute a standard error and test statistic for $b_1$.\nWe will generally label the test statistic using a $T$, since it follows the $t$-distribution.\n\nWe will rely on statistical software to compute the standard error and leave the explanation of how this standard error is determined to a second or third statistics course.\n@tbl-midtermUnempRegTable shows software output for the least squares regression line in @fig-unemploymentAndChangeInHouse.\nThe row labeled `unemp` includes all relevant information about the slope estimate (i.e., the coefficient of the unemployment variable, the related SE, the T statistic, and the corresponding p-value).\n\n\n\n\n::: {#tbl-midtermUnempRegTable .cell tbl-cap='Output from statistical software for the regression line modeling the midterm election losses for the President\\'s party as a response to unemployment.'}\n::: {.cell-output-display}\n`````{=html}\n<table>\n<thead>\n<tr><th>term</th><th>estimate</th><th>std.error</th><th>statistic</th><th>p.value</th></tr>\n</thead>\n<tbody>\n<tr><td>(Intercept)</td><td>-7.36</td><td>5.16</td><td>-1.43</td><td>0.16</td></tr>\n<tr><td>unemp</td><td>-0.89</td><td>0.83</td><td>-1.07</td><td>0.30</td></tr>\n</tbody>\n</table>
\n\n`````\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nWhat do the first and second columns of @tbl-midtermUnempRegTable represent?\n\n------------------------------------------------------------------------\n\nThe entries in the first column represent the least squares estimates, $b_0$ and $b_1$, and the values in the second column correspond to the standard errors of each estimate.\nUsing the estimates, we could write the equation for the least squares regression line as\n\n$$ \\hat{y} = -7.36 - 0.89 x $$\n\nwhere $\\hat{y}$ in this case represents the predicted change in the number of seats for the President's party, and $x$ represents the unemployment rate.\n:::\n\nWe previously used a $t$-test statistic for hypothesis testing in the context of numerical data.\nRegression is very similar.\nIn the hypotheses we consider, the null value for the slope is 0, so we can compute the test statistic using the T score formula:\n\n$$\nT \\ = \\ \\frac{\\text{estimate} - \\text{null value}}{\\text{SE}} = \\ \\frac{-0.89 - 0}{0.835} = \\ -1.07\n$$\n\nThe T score we calculated corresponds to the third column of @tbl-midtermUnempRegTable.\n\n::: {.workedexample data-latex=\"\"}\nUse @tbl-midtermUnempRegTable to determine the p-value for the hypothesis test.\n\n------------------------------------------------------------------------\n\nThe last column of the table gives the p-value for the two-sided hypothesis test for the coefficient of the unemployment rate: 0.2961. That is, the data do not provide convincing evidence that a higher unemployment rate has any correspondence with smaller or larger losses for the President's party in the House of Representatives in midterm elections. If there were no linear relationship between the two variables (i.e., if $\\beta_1 = 0$), then we would expect to see linear models as or more extreme than the observed model roughly 30% of the time.\n:::\n\n### Observed statistic vs. 
null statistics\n\nAs the final step in a mathematical hypothesis test for the slope, we use the information provided to make a conclusion about whether the data could have come from a population where the true slope was zero (i.e., $\\beta_1 = 0$).\nBefore evaluating the formal hypothesis claim, sometimes it is important to check your intuition.\nBased on everything we have seen in the examples above describing the variability of a line from sample to sample, ask yourself if the linear relationship given by the data could have come from a population in which the slope was truly zero.\n\n::: {.workedexample data-latex=\"\"}\nExamine @fig-elmhurstScatterWLine, which relates the Elmhurst College aid and student family income.\nAre you convinced that the slope is discernibly different from zero?\nThat is, do you think a formal hypothesis test would reject the claim that the true slope of the line should be zero?\n\n------------------------------------------------------------------------\n\nWhile the relationship between the variables is not perfect, there is an evident decreasing trend in the data.\nSuch a distinct trend suggests that the hypothesis test will reject the null claim that the slope is zero.\n:::\n\n::: {.data data-latex=\"\"}\nThe [`elmhurst`](http://openintrostat.github.io/openintro/reference/elmhurst.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\nThe tools in this section help you go beyond a visual interpretation of the linear relationship toward a formal mathematical claim about whether the slope estimate is meaningfully different from 0 to suggest that the true population slope is different from 0.\n\n\n::: {#tbl-rOutputForIncomeAidLSRLineInInferenceSection .cell tbl-cap='Summary of least squares fit for the Elmhurst College data, where we are predicting the gift aid by the university based on the family income of students.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n term | \n 
estimate | \n std.error | \n statistic | \n p.value | \n
\n \n\n \n (Intercept) | \n 24319.33 | \n 1291.45 | \n 18.83 | \n <0.0001 | \n
\n \n family_income | \n -0.04 | \n 0.01 | \n -3.98 | \n 2e-04 | \n
\n\n
\n\n`````\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\n@tbl-rOutputForIncomeAidLSRLineInInferenceSection shows statistical software output from fitting the least squares regression line shown in @fig-elmhurstScatterWLine.\nUse the output to formally evaluate the following hypotheses.[^24-inf-model-slr-2]\n\n- $H_0$: The true coefficient for family income is zero.\n- $H_A$: The true coefficient for family income is not zero.\n:::\n\n[^24-inf-model-slr-2]: We look in the second row corresponding to the family income variable.\n We see the point estimate of the slope of the line is -0.0431, the standard error of this estimate is 0.0108, and the $t$-test statistic is $T = -3.98$.\n The p-value corresponds exactly to the two-sided test we are interested in: 0.0002.\n The p-value is so small that we reject the null hypothesis and conclude that family income and financial aid at Elmhurst College for freshman entering in the year 2011 are negatively correlated and the true slope parameter is indeed less than 0, just as we believed in our analysis of @fig-elmhurstScatterWLine.\n\n::: {.important data-latex=\"\"}\n**Inference for regression.**\n\nWe usually rely on statistical software to identify point estimates, standard errors, test statistics, and p-values in practice.\nHowever, be aware that software will not generally check whether the method is appropriate, meaning we must still verify conditions are met.\nSee @sec-tech-cond-linmod.\n:::\n\n\\clearpage\n\n## Mathematical model, interval for the slope\n\nSimilar to how we can conduct a hypothesis test for a model coefficient using regression output, we can also construct confidence intervals for the slope and intercept coefficients.\n\n::: {.important data-latex=\"\"}\n**Confidence intervals for coefficients.**\n\nConfidence intervals for model coefficients (e.g., the intercept or the slope) can be computed using the $t$-distribution:\n\n$$ b_i \\ \\pm\\ t_{df}^{\\star} \\times SE_{b_{i}} $$\n\nwhere 
$t_{df}^{\\star}$ is the appropriate $t^{\\star}$ cutoff corresponding to the confidence level with the model's degrees of freedom, $df = n - 2$.\n:::\n\n::: {.workedexample data-latex=\"\"}\nCompute the 95% confidence interval for the slope coefficient using the regression output from @tbl-rOutputForIncomeAidLSRLineInInferenceSection.\n\n------------------------------------------------------------------------\n\nThe point estimate is -0.0431 and the standard error is $SE = 0.0108$.\nWhen constructing a confidence interval for a model coefficient, we generally use a $t$-distribution.\nThe degrees of freedom for the distribution are noted in the regression output, $df = 48$, allowing us to identify $t_{48}^{\\star} = 2.01$ for use in the confidence interval.\n\nWe can now construct the confidence interval in the usual way:\n\n$$\n\\begin{aligned}\n\\text{point estimate} &\\pm t_{48}^{\\star} \\times SE \\\\\n-0.0431 &\\pm 2.01 \\times 0.0108 \\\\\n(-0.0648 &, -0.0214)\n\\end{aligned}\n$$\n\nWe are 95% confident that for an additional one unit (i.e., \\$1000 increase) in family income, the university's gift aid is predicted to decrease on average by \\$21.40 to \\$64.80.\n:::\n\nOn the topic of intervals in this book, we have focused exclusively on confidence intervals for model parameters.\nHowever, there are other types of intervals that may be of interest (and are outside the scope of this book), including prediction intervals for a response value and confidence intervals for a mean response value in the context of regression.\n\n\\clearpage\n\n## Checking model conditions {#sec-tech-cond-linmod}\n\nIn the previous sections, we used randomization and bootstrapping to perform inference when the mathematical model was not valid due to violations of the technical conditions.\nIn this section, we'll provide details for when the mathematical model is appropriate and a discussion of technical conditions needed for the randomization and bootstrapping procedures.\nRecall from 
@sec-resids that residual plots can be used to visualize how well a linear model fits the data.\n\n\n\n\n\n### What are the technical conditions for the mathematical model?\n\nWhen fitting a least squares line, we generally require the following:\n\n- **Linearity.** The data should show a linear trend.\n If there is a nonlinear trend (e.g., first panel of @fig-whatCanGoWrongWithLinearModel) an advanced regression method from another book or later course should be applied.\n\n- **Independent observations.** Be cautious about applying regression to data that are sequential observations in time such as a stock price each day.\n Such data may have an underlying structure that should be considered in a different type of model and analysis.\n An example of a dataset where successive observations are not independent is shown in the fourth panel of @fig-whatCanGoWrongWithLinearModel.\n There are also other instances where correlations within the data are important, which is further discussed in @sec-inf-model-mlr.\n\n- **Nearly normal residuals.** Generally, the residuals should be nearly normal.\n When this condition is found to be unreasonable, it is often because of outliers or concerns about influential points, which we'll talk about more in @sec-outliers-in-regression.\n An example of a residual that would be potentially concerning is shown in the second panel of @fig-whatCanGoWrongWithLinearModel, where one observation is clearly much further from the regression line than the others. Outliers should be treated extremely carefully. Do not automatically remove an outlier if it truly belongs in the dataset. However, be honest about its impact on the analysis. A strategy for dealing with outliers is to present two analyses: one with the outlier and one without the outlier.\n Additionally, a type of violation of normality happens when the positive residuals are smaller in magnitude than the negative residuals (or vice versa). 
That is, when the residuals are not symmetrically distributed around the line $y=0.$ \n\n- **Constant or equal variability.** The variability of points around the least squares line remains roughly constant.\n An example of non-constant variability is shown in the third panel of @fig-whatCanGoWrongWithLinearModel, which represents the most common pattern observed when this condition fails: the variability of $y$ is larger when $x$ is larger.\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-whatCanGoWrongWithLinearModel fig-alt='A grid of 2 by 4 scatterplots with fabricated data. The top row of plots contains original x-y data plots with a least squares regression line. The bottom row of plots is a series of residual plot with predicted value on the x-axis and residual on the y-axis. The first column of plots gives an example of points that have a quadratic relationship instead of a linear relationship. The second column of plots gives an example where a single outlying point does not fit the linear model. The third column of points gives an example where the points have increasing variability as the value of x increases. The last column of points gives an example where the points are correlated with one another, possibly as part of a time series.' 
width=100%}\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nShould we have concerns about applying least squares regression to the Elmhurst data in @fig-elmhurstScatterW2Lines?[^24-inf-model-slr-3]\n:::\n\n[^24-inf-model-slr-3]: The trend appears to be linear, the data fall around the line with no obvious outliers, the variance is roughly constant.\n The data do not come from a time series or other obvious violation to independence.\n Least squares regression can be applied to these data.\n\nThe technical conditions are often remembered using the **LINE** mnemonic.\nThe linearity, normality, and equality of variance conditions usually can be assessed through residual plots, as seen in @fig-whatCanGoWrongWithLinearModel.\nA careful consideration of the experimental design should be undertaken to confirm that the observed values are indeed independent.\n\n- L: **linear** model\n- I: **independent** observations\n- N: points are **normally** distributed around the line\n- E: **equal** variability around the line for all values of the explanatory variable\n\n### Why do we need technical conditions?\n\nAs with other inferential techniques we have covered in this text, if the technical conditions above do not hold, then it is not possible to make concluding claims about the population.\nThat is, without the technical conditions, the T score will not have the assumed t-distribution.\nThat said, it is almost always impossible to check the conditions precisely, so we look for large deviations from the conditions.\nIf there are large deviations, we will be unable to trust the calculated p-value or the endpoints of the resulting confidence interval.\n\n**The model based on Linearity**\n\nThe linearity condition is among the most important if your goal is to understand a linear model between $x$ and $y$.\nFor example, the value of the slope will not be at all meaningful if the true relationship between $x$ and $y$ is quadratic, as in 
@fig-notGoodAtAllForALinearModel.\nNot only should we be cautious about the inference, but the model *itself* is also not an accurate portrayal of the relationship between the variables.\n\nIn @sec-inf-model-mlr we discuss model modifications that can often lead to an excellent fit of strong relationships other than linear ones.\nHowever, an extended discussion on the different methods for modeling functional forms other than linear is outside the scope of this text.\n\n**The importance of Independence**\n\nThe technical condition describing the independence of the observations is often the most crucial but also the most difficult to diagnose.\nIt is also extremely difficult to gather a dataset which is a true random sample from the population of interest.\n(Note: a true randomized experiment from a fixed set of individuals is much easier to implement, and indeed, randomized experiments are done in most medical studies these days.)\n\nDependent observations can bias results in ways that produce fundamentally flawed analyses.\nThat is, if you hang out at the gym measuring height and weight, your linear model is surely not a representation of all students at your university.\nAt best it is a model describing students who use the gym (but also who are willing to talk to you, that use the gym at the times you were there measuring, etc.).\n\nIn lieu of trying to answer whether your observations are a true random sample, you might instead focus on whether you believe your observations are representative of a population of interest.\nHumans are notoriously bad at implementing random procedures, so you should be wary of any process that used human intuition to balance the data with respect to, for example, the demographics of the individuals in the sample.\n\n\\clearpage\n\n**Some thoughts on Normality**\n\nThe normality condition requires that points vary symmetrically around the line, spreading out in a bell-shaped fashion.\nYou should consider the \"bell\" of the normal 
distribution as sitting on top of the line (coming off the paper in a 3-D sense) so as to indicate that the points are dense close to the line and disperse gradually as they get farther from the line.\n\nThe normality condition is less important than linearity or independence for a few reasons.\nFirst, the linear model fit with least squares will still be an unbiased estimate of the true population model.\nHowever, the distribution of the estimate will be unknown.\nFortunately, the Central Limit Theorem tells us that most of the analyses (e.g., SEs, p-values, confidence intervals) done using the mathematical model (with the $t$-distribution) will still hold (even if the data are not normally distributed around the line) as long as the sample size is large enough.\nOne analysis method that *does* require normality, regardless of sample size, is creating intervals which predict the response of individual outcomes at a given $x$ value, using the linear model.\nOne additional reason to worry slightly less about normality is that neither the randomization test nor the bootstrapping procedures require the data to be normal around the line.\n\n**Equal variability for prediction in particular**\n\nAs with normality, the equal variability condition (that points are spread out in similar ways around the line for all values of $x$) will not cause problems for the estimate of the linear model.\nThat said, the **inference** on the model (e.g., computing p-values) will be incorrect if the variability around the line is extremely heterogeneous.\nData that exhibit non-equal variance across the range of $x$-values can lead to a seriously mis-estimated variability of the slope, which will have consequences for the inference results (i.e., hypothesis tests and confidence intervals).\n\nIn many cases, the inference results for both a randomization test and a bootstrap confidence interval are also robust to the equal variability condition, so they provide the analyst a set 
of methods to use when the data are heteroskedastic (that is, exhibit unequal variability around the regression line).\nAlthough randomization tests and bootstrapping allow us to analyze data using fewer conditions, some technical conditions are required for all methods described in this text (e.g., independent observations).\nWhen the equal variability condition is violated and a mathematical analysis (e.g., p-value from T score) is needed, there are other existing methods (outside the scope of this text) which can handle the unequal variance (e.g., weighted least squares analysis).\n\n### What if all the technical conditions are met?\n\nWhen the technical conditions are met, the least squares regression model and its inference are provided by virtually all statistical software.\nIn addition to being ubiquitous, the least squares regression model (and related inference) has the advantage that the linear model has important extensions (which are not trivial to implement with bootstrapping and randomization tests).\nIn particular, random effects models, repeated measures, and interaction are all linear model extensions which require the above technical conditions.\nWhen the technical conditions hold, the extensions to the linear model can provide important insight into the data and research question at hand.\nWe will discuss some of the extended modeling and associated inference in @sec-inf-model-mlr and @sec-inf-model-logistic.\nMany of the techniques used to deal with technical condition violations are outside the scope of this text, but they are often taught in the statistics course that follows this one.\nIf you are working with linear models or curious to learn more, we recommend that you continue learning about statistical methods applicable to a larger class of datasets.\n\n\\clearpage\n\n## Chapter review {#sec-chp24-review}\n\n### Summary\n\nRecall that early in the text we presented graphical techniques which communicated relationships 
across multiple variables.\nWe also used modeling to formalize the relationships.\nMany chapters were dedicated to inferential methods which allowed claims about the population to be made based on samples of data.\nNot only did we present the mathematical model for each of the inferential techniques, but when appropriate, we also presented bootstrapping and permutation methods.\n\nHere in @sec-inf-model-slr we brought all of those ideas together by considering inferential claims on linear models through randomization tests, bootstrapping, and mathematical modeling.\nWe continue to emphasize the importance of experimental design in making conclusions about research claims.\nIn particular, recall that variability can come from different sources (e.g., random sampling vs. random allocation, see @fig-randsampValloc).\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n bootstrap CI for the slope | \n randomization test for the slope | \n technical conditions linear regression | \n
\n \n inference with single predictor regression | \n t-distribution for slope | \n variability of the slope | \n
\n\n
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {#sec-chp24-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-24].\n\n::: {.exercises data-latex=\"\"}\n1. **Body measurements, randomization test.** Researchers studying anthropometry collected body and skeletal diameter measurements, as well as age, weight, height and sex for 507 physically active individuals.\n A linear model is built to predict height based on shoulder girth (circumference of shoulders measured over deltoid muscles), both measured in centimeters.[^_24-ex-inf-model-slr-1]\n [@Heinz:2003]\n\n Below are two items.\n The first is the standard linear model output for predicting height from shoulder girth.\n The second is a histogram of slopes from 1,000 randomized datasets (1,000 times, `hgt` was permuted and regressed against `sho_gi`).\n The red vertical line is drawn at the observed slope value which was produced in the linear model output.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n term | \n estimate | \n std.error | \n statistic | \n p.value | \n
\n \n \n \n (Intercept) | \n 105.832 | \n 3.27 | \n 32.3 | \n <0.0001 | \n
\n \n sho_gi | \n 0.604 | \n 0.03 | \n 20.0 | \n <0.0001 | \n
\n \n
\n \n `````\n :::\n \n ::: {.cell-output-display}\n {width=90%}\n :::\n :::\n\n a. What are the null and alternative hypotheses for evaluating whether the slope of the model predicting height from shoulder girth is different than 0?\n\n b. Using the histogram which describes the distribution of slopes when the null hypothesis is true, find the p-value and conclude the hypothesis test in the context of the problem (use words like shoulder girth and height).\n\n c. Is the conclusion based on the histogram of randomized slopes consistent with the conclusion which would have been obtained using the mathematical model?\n Explain.\n\n \\clearpage\n\n2. **Body measurements, mathematical test.** The scatterplot and least squares summary below show the relationship between weight measured in kilograms and height measured in centimeters of 507 physically active individuals.\n [@Heinz:2003]\n\n ::: {.cell}\n ::: {.cell-output-display}\n {width=70%}\n :::\n \n ::: {.cell-output-display}\n `````{=html}\n \n \n \n term | \n estimate | \n std.error | \n statistic | \n p.value | \n
\n \n \n \n (Intercept) | \n -105.01 | \n 7.54 | \n -13.9 | \n <0.0001 | \n
\n \n hgt | \n 1.02 | \n 0.04 | \n 23.1 | \n <0.0001 | \n
\n \n
\n \n `````\n :::\n :::\n\n a. Describe the relationship between height and weight.\n\n b. Write the equation of the regression line.\n Interpret the slope and intercept in context.\n\n c. Do the data provide convincing evidence that the true slope parameter is different than 0?\n State the null and alternative hypotheses, report the p-value (using a mathematical model), and state your conclusion.\n\n d. The correlation coefficient for height and weight is 0.72.\n Calculate $R^2$ and interpret it.\n\n3. **Body measurements, bootstrap percentile interval.** In order to estimate the slope of the model predicting height based on shoulder girth (circumference of shoulders measured over deltoid muscles), 1,000 bootstrap samples are taken from a dataset of body measurements from 507 people.\n A linear model predicting height from shoulder girth is fit to each bootstrap sample, and the slope is estimated.\n A histogram of these slopes is shown below.\n [@Heinz:2003]\n\n ::: {.cell}\n ::: {.cell-output-display}\n {width=90%}\n :::\n :::\n\n a. Using the bootstrap percentile method and the histogram above, find a 98% confidence interval for the slope parameter.\n\n b. Interpret the confidence interval in the context of the problem.\n\n \\clearpage\n\n4. **Body measurements, standard error bootstrap interval.** A linear model is built to predict height based on shoulder girth (circumference of shoulders measured over deltoid muscles), both measured in centimeters.\n [@Heinz:2003]\n\n Below are two items.\n The first is the standard linear model output for predicting height from shoulder girth.\n The second is the bootstrap distribution of the slope statistic from 1,000 different bootstrap samples of the data.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n term | \n estimate | \n std.error | \n statistic | \n p.value | \n
\n \n \n \n (Intercept) | \n 105.832 | \n 3.27 | \n 32.3 | \n <0.0001 | \n
\n \n sho_gi | \n 0.604 | \n 0.03 | \n 20.0 | \n <0.0001 | \n
\n \n
\n \n `````\n :::\n \n ::: {.cell-output-display}\n {width=90%}\n :::\n :::\n\n a. Using the histogram, approximate the standard error of the slope statistic (that is, quantify the variability of the slope statistic from sample to sample).\n\n b. Find a 98% bootstrap SE confidence interval for the slope parameter.\n\n c. Interpret the confidence interval in the context of the problem.\n\n \\clearpage\n\n5. **Murders and poverty, randomization test.** The following regression output is for predicting annual murders per million (`annual_murders_per_mil`) from percentage living in poverty (`perc_pov`) in a random sample of 20 metropolitan areas.\n\n Below are two items.\n The first is the standard linear model output for predicting annual murders per million from percentage living in poverty for metropolitan areas.\n The second is a histogram of slopes from 1000 randomized datasets (1000 times, `annual_murders_per_mil` was permuted and regressed against `perc_pov`).\n The red vertical line is drawn at the observed slope value which was produced in the linear model output.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n term | \n estimate | \n std.error | \n statistic | \n p.value | \n
\n \n \n \n (Intercept) | \n -29.90 | \n 7.79 | \n -3.84 | \n 0.0012 | \n
\n \n perc_pov | \n 2.56 | \n 0.39 | \n 6.56 | \n <0.0001 | \n
\n \n
\n \n `````\n :::\n \n ::: {.cell-output-display}\n {width=90%}\n :::\n :::\n\n a. What are the null and alternative hypotheses for evaluating whether the slope of the model for predicting annual murder rate from poverty percentage is different than 0?\n\n b. Using the histogram which describes the distribution of slopes when the null hypothesis is true, find the p-value and conclude the hypothesis test in the context of the problem (use words like murder rate and poverty).\n\n c. Is the conclusion based on the histogram of randomized slopes consistent with the conclusion which would have been obtained using the mathematical model?\n Explain.\n\n \\clearpage\n\n6. **Murders and poverty, mathematical test.** The table below shows the output of a linear model predicting annual murders per million (`annual_murders_per_mil`) from percentage living in poverty (`perc_pov`) in a random sample of 20 metropolitan areas.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n term | \n estimate | \n std.error | \n statistic | \n p.value | \n
\n \n \n \n (Intercept) | \n -29.90 | \n 7.79 | \n -3.84 | \n 0.0012 | \n
\n \n perc_pov | \n 2.56 | \n 0.39 | \n 6.56 | \n <0.0001 | \n
\n \n
\n \n `````\n :::\n :::\n\n a. What are the hypotheses for evaluating whether the slope of the model predicting annual murder rate from poverty percentage is different than 0?\n\n b. State the conclusion of the hypothesis test from part (a) in context of the data.\n What does this say about whether poverty percentage is a useful predictor of annual murder rate?\n\n c. Calculate a 95% confidence interval for the slope of poverty percentage, and interpret it in context of the data.\n\n d. Do your results from the hypothesis test and the confidence interval agree?\n Explain.\n\n7. **Murders and poverty, bootstrap percentile interval.** Data on annual murders per million (`annual_murders_per_mil`) and percentage living in poverty (`perc_pov`) is collected from a random sample of 20 metropolitan areas.\n Using these data we want to estimate the slope of the model predicting `annual_murders_per_mil` from `perc_pov`.\n We take 1,000 bootstrap samples of the data and fit a linear model predicting `annual_murders_per_mil` from `perc_pov` to each bootstrap sample.\n A histogram of these slopes is shown below.\n\n ::: {.cell}\n ::: {.cell-output-display}\n {width=90%}\n :::\n :::\n\n a. Using the percentile bootstrap method and the histogram above, find a 90% confidence interval for the slope parameter.\n\n b. Interpret the confidence interval in the context of the problem.\n\n \\clearpage\n\n8. 
**Murders and poverty, standard error bootstrap interval.** A linear model is built to predict annual murders per million (`annual_murders_per_mil`) from percentage living in poverty (`perc_pov`) in a random sample of 20 metropolitan areas.\n\n Below are two items.\n The first is the standard linear model output for predicting annual murders per million from percentage living in poverty for metropolitan areas.\n The second is the bootstrap distribution of the slope statistic from 1000 different bootstrap samples of the data.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n term | \n estimate | \n std.error | \n statistic | \n p.value | \n
\n \n \n \n (Intercept) | \n -29.90 | \n 7.79 | \n -3.84 | \n 0.0012 | \n
\n \n perc_pov | \n 2.56 | \n 0.39 | \n 6.56 | \n <0.0001 | \n
\n \n
\n \n `````\n :::\n \n ::: {.cell-output-display}\n {width=90%}\n :::\n :::\n\n a. Using the histogram, approximate the standard error of the slope statistic (that is, quantify the variability of the slope statistic from sample to sample).\n\n b. Find a 90% bootstrap SE confidence interval for the slope parameter.\n\n c. Interpret the confidence interval in the context of the problem.\n\n \\clearpage\n\n9. **Baby's weight and father's age, randomization test.** US Department of Health and Human Services, Centers for Disease Control and Prevention collect information on births recorded in the country.\n The data used here are a random sample of 1000 births from 2014.\n Here, we study the relationship between the father's age and the weight of the baby.[^_24-ex-inf-model-slr-2]\n [@data:births14]\n\n Below are two items.\n The first is the standard linear model output for predicting baby's weight (in pounds) from father's age (in years).\n The second is a histogram of slopes from 1000 randomized datasets (1000 times, `weight` was permuted and regressed against `fage`).\n The red vertical line is drawn at the observed slope value which was produced in the linear model output.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n term | \n estimate | \n std.error | \n statistic | \n p.value | \n
\n \n \n \n (Intercept) | \n 7.101 | \n 0.199 | \n 35.674 | \n <0.0001 | \n
\n \n fage | \n 0.005 | \n 0.006 | \n 0.757 | \n 0.4495 | \n
\n \n
\n \n `````\n :::\n \n ::: {.cell-output-display}\n {width=90%}\n :::\n :::\n\n a. What are the null and alternative hypotheses for evaluating whether the slope of the model for predicting baby's weight from father's age is different than 0?\n\n b. Using the histogram which describes the distribution of slopes when the null hypothesis is true, find the p-value and conclude the hypothesis test in the context of the problem (use words like father's age and weight of baby).\n What does the conclusion of your test say about whether the father's age is a useful predictor of baby's weight?\n\n c. Is the conclusion based on the histogram of randomized slopes consistent with the conclusion which would have been obtained using the mathematical model?\n Explain.\n\n \\clearpage\n\n10. **Baby's weight and father's age, mathematical test.** Is the father's age useful in predicting the baby's weight?\n The scatterplot and least squares summary below show the relationship between baby's weight (measured in pounds) and father's age for a random sample of babies.\n [@data:births14]\n\n ::: {.cell}\n ::: {.cell-output-display}\n {width=70%}\n :::\n \n ::: {.cell-output-display}\n `````{=html}\n \n \n \n term | \n estimate | \n std.error | \n statistic | \n p.value | \n
\n \n \n \n (Intercept) | \n 7.1042 | \n 0.1936 | \n 36.698 | \n <0.0001 | \n
\n \n fage | \n 0.0047 | \n 0.0061 | \n 0.779 | \n 0.4359 | \n
\n \n
\n \n `````\n :::\n :::\n\n a. What is the predicted weight of a baby whose father is 30 years old?\n\n b. Do the data provide convincing evidence that the model for predicting baby weights from father's age has a slope different than 0?\n State the null and alternative hypotheses, report the p-value (using a mathematical model), and state your conclusion.\n\n c. Based on your conclusion, is father's age a useful predictor of baby's weight?\n\n11. **Baby's weight and father's age, bootstrap percentile interval.** US Department of Health and Human Services, Centers for Disease Control and Prevention collect information on births recorded in the country.\n The data used here are a random sample of 1000 births from 2014.\n Here, we study the relationship between the father's age and the weight of the baby.\n Below is the bootstrap distribution of the slope statistic from 1,000 different bootstrap samples of the data.\n [@data:births14]\n\n ::: {.cell}\n ::: {.cell-output-display}\n {width=90%}\n :::\n :::\n\n a. Using the bootstrap percentile method and the histogram above, find a 95% confidence interval for the slope parameter.\n\n b. Interpret the confidence interval in the context of the problem.\n\n \\clearpage\n\n12. 
**Baby's weight and father's age, standard error bootstrap interval.** US Department of Health and Human Services, Centers for Disease Control and Prevention collect information on births recorded in the country.\n The data used here are a random sample of 1000 births from 2014.\n Here, we study the relationship between the father's age and the weight of the baby.\n [@data:births14]\n\n Below are two items.\n The first is the standard linear model output for predicting baby's weight (in pounds) from father's age (in years).\n The second is the bootstrap distribution of the slope statistic from 1000 different bootstrap samples of the data.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n term | \n estimate | \n std.error | \n statistic | \n p.value | \n
\n \n \n \n (Intercept) | \n 7.101 | \n 0.199 | \n 35.674 | \n <0.0001 | \n
\n \n fage | \n 0.005 | \n 0.006 | \n 0.757 | \n 0.4495 | \n
\n \n
\n \n `````\n :::\n \n ::: {.cell-output-display}\n {width=90%}\n :::\n :::\n\n a. Using the histogram, approximate the standard error of the slope statistic (that is, quantify the variability of the slope statistic from sample to sample).\n\n b. Find a 95% bootstrap SE confidence interval for the slope parameter.\n\n c. Interpret the confidence interval in the context of the problem.\n\n13. **I heart cats.** Researchers collected data on heart and body weights of 144 domestic adult cats.\n The table below shows the output of a linear model predicting heart weight (measured in grams) from body weight (measured in kilograms) of these cats.[^_24-ex-inf-model-slr-3]\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n term | \n estimate | \n std.error | \n statistic | \n p.value | \n
\n \n \n \n (Intercept) | \n -0.357 | \n 0.692 | \n -0.515 | \n 0.6072 | \n
\n \n Bwt | \n 4.034 | \n 0.250 | \n 16.119 | \n <0.0001 | \n
\n \n
\n \n `````\n :::\n :::\n\n a. What are the hypotheses for evaluating whether body weight is positively associated with heart weight in cats?\n\n b. State the conclusion of the hypothesis test from part (a) in context of the data.\n\n c. Calculate a 95% confidence interval for the slope of body weight, and interpret it in context of the data.\n\n d. Do your results from the hypothesis test and the confidence interval agree?\n Explain.\n\n \\clearpage\n\n14. **Beer and blood alcohol content.** Many people believe that weight, drinking habits, and many other factors are much more important in predicting blood alcohol content (BAC) than simply considering the number of drinks a person consumed.\n Here we examine data from sixteen student volunteers at Ohio State University who each drank a randomly assigned number of cans of beer.\n These students were evenly divided between men and women, and they differed in weight and drinking habits.\n Thirty minutes later, a police officer measured their blood alcohol content (BAC) in grams of alcohol per deciliter of blood.\n The scatterplot and regression table summarize the findings.\n [^_24-ex-inf-model-slr-4] [@Malkevitc+Lesser:2008]\n\n ::: {.cell}\n ::: {.cell-output-display}\n {width=90%}\n :::\n \n ::: {.cell-output-display}\n `````{=html}\n \n \n \n term | \n estimate | \n std.error | \n statistic | \n p.value | \n
\n \n \n \n (Intercept) | \n -0.0127 | \n 0.0126 | \n -1.00 | \n 0.332 | \n
\n \n beers | \n 0.0180 | \n 0.0024 | \n 7.48 | \n <0.0001 | \n
\n \n
\n \n `````\n :::\n :::\n\n a. Describe the relationship between the number of cans of beer and BAC.\n\n b. Write the equation of the regression line.\n Interpret the slope and intercept in context.\n\n c. Do the data provide convincing evidence that drinking more cans of beer is associated with an increase in blood alcohol?\n State the null and alternative hypotheses, report the p-value, and state your conclusion.\n\n d. The correlation coefficient for number of cans of beer and BAC is 0.89.\n Calculate $R^2$ and interpret it in context.\n\n e. Suppose we visit a bar, ask people how many drinks they have had, and take their BAC.\n Do you think the relationship between number of drinks and BAC would be as strong as the relationship found in the Ohio State study?\n\n \\clearpage\n\n15. **Urban homeowners, conditions.** The scatterplot below shows the percent of families who own their home vs. the percent of the population living in urban areas.\n [@data:urbanOwner] There are 52 observations, each corresponding to a state in the US.\n Puerto Rico and District of Columbia are also included.\n\n ::: {.cell}\n ::: {.cell-output-display}\n {width=50%}\n :::\n :::\n\n a. For these data, $R^2$ is 29.16%.\n What is the value of the correlation coefficient?\n How can you tell if it is positive or negative?\n\n b. 
Examine the residual plot.\n What do you observe?\n Is a simple least squares fit appropriate for these data?\n Which of the LINE conditions are met or not met?\n\n[^_24-ex-inf-model-slr-1]: The [`bdims`](http://openintrostat.github.io/openintro/reference/bdims.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n[^_24-ex-inf-model-slr-2]: The [`births14`](http://openintrostat.github.io/openintro/reference/births14.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n[^_24-ex-inf-model-slr-3]: The [`cats`](https://stat.ethz.ch/R-manual/R-patched/library/MASS/html/cats.html) data used in this exercise can be found in the [**MASS**](https://cran.r-project.org/web/packages/MASS/index.html) R package.\n\n[^_24-ex-inf-model-slr-4]: The [`bac`](http://openintrostat.github.io/openintro/reference/bac.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n\n\n:::\n",
+ "markdown": "# Inference for linear regression with a single predictor {#sec-inf-model-slr}\n\n\n\n\n\n\\chaptermark{Inference for regression with a single predictor}\n\n::: {.chapterintro data-latex=\"\"}\nWe now bring together ideas of inferential analyses with the descriptive models seen in [Chapter -@sec-model-slr].\nIn particular, we will use the least squares regression line to test whether there is a relationship between two continuous variables.\nAdditionally, we will build confidence intervals which quantify the slope of the linear regression line.\nWhile the setting is now focused on predicting a numeric response variable (for linear models) or a binary response variable (for logistic models), we continue to ask questions about the variability of the model from sample to sample.\nThe sampling variability will inform the conclusions about the population that can be drawn.\n\nMany of the inferential ideas are remarkably similar to those covered in previous chapters.\nThe technical conditions for linear models are typically assessed graphically, although independence of observations continues to be of utmost importance.\n\nWe encourage the reader to think broadly about the models at hand without putting too much dependence on the exact p-values that are reported from the statistical software.\nInference on models with multiple explanatory variables can suffer from data snooping, which results in false positive claims.\nWe provide some guidance and hope the reader will further their statistical learning after working through the material in this text.\n:::\n\n\n\n\n\n## Case study: Sandwich store\n\n### Observed data\n\nWe start the chapter with a hypothetical example describing the linear relationship between dollars spent advertising for a chain sandwich restaurant and monthly revenue.\nThe hypothetical example serves the purpose of illustrating how a linear model varies from sample to sample.\nBecause we have made up the example and the data (and the entire 
population), we can take many many samples from the population to visualize the variability.\nNote that in real life, we always have exactly one sample (that is, one dataset), and through the inference process, we imagine what might have happened had we taken a different sample.\nThe change from sample to sample leads to an understanding of how the single observed dataset is different from the population of values, which is typically the fundamental goal of inference.\n\nConsider the following hypothetical population of all of the sandwich stores of a particular chain seen in @fig-sandpop.\nIn this made-up world, the CEO actually has all the relevant data, which is why they can plot it here.\nThe CEO is omniscient and can write down the population model which describes the true population relationship between the advertising dollars and revenue.\nThere appears to be a linear relationship between advertising dollars and revenue (both in \\$1,000).\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-sandpop fig-alt='Scatterplot with advertising amount on the x-axis and revenue on the y-axis. A linear model is superimposed. The points show a reasonably strong and positive linear trend.' 
width=90%}\n:::\n:::\n\n\nYou may remember from @sec-model-slr that the population model is: $$y = \\beta_0 + \\beta_1 x + \\varepsilon.$$\n\nAgain, the omniscient CEO (with the full population information) can write down the true population model as: $$\\texttt{expected revenue} = 11.23 + 4.8 \\times \\texttt{advertising}.$$\n\n### Variability of the statistic\n\nUnfortunately, in our scenario, the CEO is not willing to part with the full set of data, but they will allow potential franchise buyers to see a small sample of the data in order to help the potential buyer decide whether to set up a new franchise.\nThe CEO is willing to give each potential franchise buyer a random sample of data from 20 stores.\n\nAs with any numerical characteristic which describes a subset of the population, the estimated slope of a sample will vary from sample to sample.\nConsider the linear model which describes revenue (in \\$1,000) based on advertising dollars (in \\$1,000).\n\nThe least squares regression model uses the data to find a sample linear fit: $$\\hat{y} = b_0 + b_1 x.$$\n\nA random sample of 20 stores shows a different least squares regression line depending on which observations are selected.\nA subset of 20 stores shows a positive trend between advertising and revenue similar to what we saw in @fig-sandpop (which described the population), despite having fewer observations on the plot.\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-sand-samp1 fig-alt='For a random sample of 20 stores, scatterplot with advertising amount on the x-axis and\nrevenue on the y-axis. A linear model is superimposed. The points show a reasonably strong\nand positive linear trend.' width=90%}\n:::\n:::\n\n\nA second sample of size 20 also shows a positive trend!\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-sand-samp2 fig-alt='For a random sample of 20 stores, scatterplot with advertising amount on the x-axis and revenue on the y-axis. 
A linear model is superimposed. The points show a reasonably strong and positive linear trend.' width=90%}\n:::\n:::\n\n\nBut the lines are slightly different!\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-sand-samp12 fig-alt='For two different random samples, superimposed onto the same plot, scatterplot with advertising amount on the x-axis and revenue on the y-axis. Two linear models are plotted to demonstrate that the lines are very similar, yet they are not the same.' width=90%}\n:::\n:::\n\n\nThat is, there is **variability** in the regression line from sample to sample.\nThe concept of the sampling variability is something you've seen before, but in this lesson, you will focus on the variability of the line often measured through the variability of a single statistic: **the slope of the line**.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-slopes fig-alt='An x-y coordinate system with least squares regression lines from many random samples of size 20 (no points are plotted). The lines vary around the true population line. On the x-axis is advertising amount; on the y-axis is revenue.' width=90%}\n:::\n:::\n\n\nYou might notice in @fig-slopes that the $\\hat{y}$ values given by the lines are much more consistent in the middle of the dataset than at the ends.\nThe reason is that the data itself anchors the lines in such a way that the line must pass through the center of the data cloud.\nThe effect of the fan-shaped lines is that predicted revenue for advertising close to \\$4,000 will be much more precise than the revenue predictions made for \\$1,000 or \\$7,000 of advertising.\n\nThe distribution of slopes (for samples of size $n=20$) can be seen in a histogram, as in @fig-sand20lm.\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-sand20lm fig-alt='Histogram of the slope values from many random samples of size 20. The slope estimates vary from about 2.1 to 8. The histogram is reasonably bell-shaped.' 
width=90%}\n:::\n:::\n\n\nRecall, the example described in this introduction is hypothetical.\nThat is, we created an entire population in order to demonstrate how the slope of a line would vary from sample to sample.\nThe tools in this textbook are designed to evaluate only a single sample of data.\nWith actual studies, we do not have repeated samples, so we are not able to use repeated samples to visualize the variability in slopes.\nWe have seen variability in samples throughout this text, so it should not come as a surprise that different samples will produce different linear models.\nHowever, it is nice to visually consider the linear models produced by different samples.\nAdditionally, as with measuring the variability of previous statistics (e.g., $\\overline{X}_1 - \\overline{X}_2$ or $\\hat{p}_1 - \\hat{p}_2$), the histogram of the sample statistics can provide information related to inferential considerations.\n\nIn the following sections, the distribution (i.e., histogram) of $b_1$ (the estimated slope coefficient) will be constructed in the same three ways that, by now, may be familiar to you.\nFirst (in @sec-randslope), the distribution of $b_1$ when $\\beta_1 = 0$ is constructed by randomizing (permuting) the response variable.\nNext (in @sec-bootbeta1), we can bootstrap the data by taking random samples of size $n$ from the original dataset.\nAnd last (in @sec-mathslope), we use mathematical tools to describe the variability using the $t$-distribution that was first encountered in @sec-one-mean-math.\n\n## Randomization test for the slope {#sec-randslope}\n\nConsider data on 100 randomly selected births gathered originally from the US Department of Health and Human Services.\nSome of the variables are plotted in @fig-babyweight.\n\nThe scientific research interest at hand will be in determining the linear relationship between weight of baby at birth (in lbs) and number of weeks of gestation.\nThe dataset is quite rich and deserves exploring, but for 
this example, we will focus only on the weight of the baby.\n\n::: {.data data-latex=\"\"}\nThe [`births14`](http://openintrostat.github.io/openintro/reference/births14.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\nWe will work with a random sample of 100 observations from these data.\n:::\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-babyweight fig-alt='Four different scatterplots, all with weight of baby on the y-axis. On the x-axis are weight gained by mother, mother\\'s age, number of hospital visits, and weeks gestation. Weeks gestation and weight of baby show the strongest linear relationship (which is positive).' width=90%}\n:::\n:::\n\n\nAs you have seen previously, statistical inference typically relies on setting a null hypothesis which is hoped to be subsequently rejected.\nIn the linear model setting, we might hope to have a linear relationship between `weeks` and `weight` in settings where `weeks` gestation is known and `weight` of baby needs to be predicted.\n\nThe relevant hypotheses for the linear model setting can be written in terms of the population slope parameter.\nHere the population refers to a larger population of births in the US.\n\n- $H_0: \\beta_1= 0$, there is no linear relationship between `weight` and `weeks`.\n- $H_A: \\beta_1 \\ne 0$, there is some linear relationship between `weight` and `weeks`.\n\nRecall that for the randomization test, we permute one variable to eliminate any existing relationship between the variables.\nThat is, we set the null hypothesis to be true, and we measure the natural variability in the data due to sampling but **not** due to variables being correlated.\n@fig-permweightScatter shows the observed data and a scatterplot of one permutation of the `weight` variable.\nThe careful observer can see that each of the observed values for `weight` (and for `weeks`) exist in both the original data plot as well as the permuted `weight` plot, but the 
`weight` and `weeks` gestation are no longer matched for a given birth.\nThat is, each `weight` value is randomly assigned to a new `weeks` gestation.\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-permweightScatter fig-alt='Two scatterplots, both with length of gestation on the x-axis and weight of baby on the y-axis. The left panel is the original data. The right panel is data where the weight of the baby has been permuted across the observations.' width=90%}\n:::\n:::\n\n\nBy repeatedly permuting the response variable, any pattern in the linear model that is observed is due only to random chance (and not an underlying relationship).\nThe randomization test compares the slopes calculated from the permuted response variable with the observed slope.\nIf the observed slope is inconsistent with the slopes from permuting, we can conclude that there is some underlying relationship (and that the slope is not merely due to random chance).\n\n### Observed data\n\nWe will continue to use the births data to investigate the linear relationship between `weight` and `weeks` gestation.\nNote that the least squares model (see @sec-model-slr) describing the relationship is given in @tbl-ls-births.\nThe columns in @tbl-ls-births are further described in @sec-mathslope.\n\n\n::: {#tbl-ls-births .cell tbl-cap='The least squares estimates of the intercept and slope are given in the estimate column. The observed slope is 0.335.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n term | \n estimate | \n std.error | \n statistic | \n p.value | \n
\n \n\n \n (Intercept) | \n -5.72 | \n 1.61 | \n -3.54 | \n 6e-04 | \n
\n \n weeks | \n 0.34 | \n 0.04 | \n 8.07 | \n <0.0001 | \n
\n\n
\n\n`````\n:::\n:::\n\n\n### Variability of the statistic\n\nAfter permuting the data, the least squares estimate of the line can be computed.\nRepeated permutations and slope calculations describe the variability in the line (i.e., in the slope) due only to the natural variability and not due to a relationship between `weight` and `weeks` gestation.\n@fig-permweekslm shows two different permutations of `weight` and the resulting linear models.\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-permweekslm fig-alt='Two scatterplots, both with length of gestation on the x-axis and weight of baby on the y-axis. Each plot includes data where the weight of the baby has been permuted across the observations. The two different permutations produce slightly different least squares regression lines.' width=90%}\n:::\n:::\n\n\nAs you can see, sometimes the slope of the permuted data is positive, sometimes it is negative.\nBecause the randomization happens under the condition of no underlying relationship (because the response variable is completely mixed with the explanatory variable), we expect to see the center of the randomized slope distribution to be zero.\n\n### Observed statistic vs. null statistics\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-nulldistBirths fig-alt='Histogram of slopes describing the linear model from permuted weight regressed on weeks gestation. The permuted slopes range from -0.15 to +0.15 and are nowhere near the observed slope value of 0.335.' 
width=90%}\n:::\n:::\n\n\n\nAs we can see from @fig-nulldistBirths, a slope estimate as extreme as the observed slope estimate (the red line) never happened in many repeated permutations of the `weight` variable.\nThat is, if indeed there were no linear relationship between `weight` and `weeks`, the natural variability of the slopes would produce estimates between approximately -0.15 and +0.15.\nWe reject the null hypothesis.\nTherefore, we believe that the slope observed on the original data is not just due to natural variability and indeed, there is a linear relationship between `weight` of baby and `weeks` gestation for births in the US.\n\n## Bootstrap confidence interval for the slope {#sec-bootbeta1}\n\nAs we have seen in previous chapters, we can use bootstrapping to estimate the sampling distribution of the statistic of interest (here, the slope) without the null assumption of no relationship (which was the condition in the randomization test).\nBecause interest is now in creating a CI, there is no null hypothesis, so there won't be any reason to permute either of the variables.\n\n\n\n\n\n### Observed data\n\nReturning to the births data, we may want to consider the relationship between `mage` (mother's age) and `weight`.\nIs `mage` a good predictor of `weight`?\nAnd if so, what is the relationship?\nThat is, what is the slope that models average `weight` of baby as a function of `mage` (mother's age)?\nThe linear model regressing `weight` on `mage` is provided in @tbl-ls-births-mage.\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-magePlot fig-alt='Scatterplot with mother\\'s age on the x-axis and baby\\'s weight on the y-axis. A linear model is superimposed. The points show a weak positive linear trend.' width=90%}\n:::\n:::\n\n::: {#tbl-ls-births-mage .cell tbl-cap='The least squares estimates of the intercept and slope are given in the estimate column. 
The observed slope is 0.036.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n term | \n estimate | \n std.error | \n statistic | \n p.value | \n
\n \n\n \n (Intercept) | \n 6.23 | \n 0.71 | \n 8.79 | \n <0.0001 | \n
\n \n mage | \n 0.04 | \n 0.02 | \n 1.50 | \n 0.1362 | \n
\n\n
\n\n`````\n:::\n:::\n\n\n### Variability of the statistic\n\nBecause the focus here is *not* on a null distribution, we sample with replacement $n = 100$ observations from the original dataset.\nRecall that with bootstrapping the resample always has the same number of observations as the original dataset in order to mimic the process of taking a sample from the population.\nWhen sampling in the linear model case, consider each observation to be a single dot.\nIf the dot is resampled, both the `weight` and the `mage` measurement are observed.\nThe measurements are linked to the dot (i.e., to the birth in the sample).\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-birth2BS fig-alt='Two scatterplots, both with mother\\'s age on the x-axis and baby\\'s weight on the y-axis. The left plot is the original data. The right plot is the bootstrapped data. Comparing the bootstrapped points to the original points, we can see that some observations were sampled more than once, and some observations were not selected for the bootstrap sample at all.' width=90%}\n:::\n:::\n\n\n@fig-birth2BS shows the original data as compared with a single bootstrap sample, resulting in (slightly) different linear models.\nThe red circles represent points in the original data which were not included in the bootstrap sample.\nThe blue circles represent a point that was repeatedly resampled (and is therefore darker) in the bootstrap sample.\nThe green circles represent a particular structure to the data which is observed in both the original and bootstrap samples.\nBy repeatedly resampling, we can see dozens of bootstrapped slopes on the same plot in @fig-birthBS.\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-birthBS fig-alt='An x-y coordinate system with least squares regression lines from many bootstrap samples (no points are plotted). The lines vary around the observed population line. 
On the x-axis is mother\\'s age; on the y-axis is baby\\'s weight' width=90%}\n:::\n:::\n\n\nRecall that in order to create a confidence interval for the slope, we need to find the range of values that the statistic (here the slope) takes on from different bootstrap samples.\n@fig-mageBSslopes is a histogram of the relevant bootstrapped slopes.\nWe can see that a 95% bootstrap percentile interval for the true population slope is given by (-0.01, 0.081).\nWe are 95% confident that for the model describing the population of births, described by mother's age and `weight` of baby, a one unit increase in `mage` (in years) will be associated with an increase in predicted average baby `weight` of between -0.01 and 0.081 pounds (notice that the CI overlaps zero, so the true relationship *might* be null!).\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-mageBSslopes fig-alt='Histogram of the slopes computed from many bootstrapped samples. The bootstrap samples range from -0.05 (with the 2.5 percentile at -0.01) to +0.1 (with the 97.5 percentile at 0.081). The bootstrapped slopes form a histogram that is reasonably symmetric and bell-shaped.' 
width=90%}\n:::\n:::\n\n\n\n::: {.workedexample data-latex=\"\"}\nUsing @fig-mageBSslopes, calculate the bootstrap estimate for the standard error of the slope.\nUsing the bootstrap standard error, find a 95% bootstrap SE confidence interval for the true population slope, and interpret the interval in context.\n\n------------------------------------------------------------------------\n\nNotice that most of the bootstrapped slopes fall between -0.01 and +0.08 (a range of 0.09).\nUsing the empirical rule (that with bell-shaped distributions, most observations are within two standard errors of the center), the standard error of the slopes is approximately 0.0225.\nThe normal cutoff for a 95% confidence interval is $z^\\star = 1.96$, which leads to a confidence interval of $b_1 \\pm 1.96 \\cdot SE \\rightarrow 0.036 \\pm 1.96 \\cdot 0.0225 \\rightarrow (-0.0081, 0.0801).$ The bootstrap SE confidence interval is almost identical to the bootstrap percentile interval.\nIn context, we are 95% confident that for the model describing the population of births, described by mother's age and `weight` of baby, a one unit increase in `mage` (in years) will be associated with an increase in predicted average baby `weight` of between -0.0081 and 0.0801 pounds.\n:::\n\n## Mathematical model for testing the slope {#sec-mathslope}\n\nWhen certain technical conditions apply, it is convenient to use mathematical approximations to test and estimate the slope parameter.\nThe approximations will build on the t-distribution which was described in @sec-inference-one-mean.\nThe mathematical model is often correct and is usually easy to implement computationally.\nThe validity of the technical conditions will be considered in detail in @sec-tech-cond-linmod.\n\nIn this section, we discuss uncertainty in the estimates of the slope and y-intercept for a regression line.\nJust as we identified standard errors for point estimates in previous chapters, we start by discussing standard errors for the 
slope and y-intercept estimates.\n\n### Observed data\n\n**Midterm elections and unemployment**\n\nElections for members of the United States House of Representatives occur every two years, coinciding every four years with the U.S.\nPresidential election.\nThe set of House elections occurring during the middle of a Presidential term are called midterm elections.\nIn America's two-party system (the vast majority of House members through history have been either Republicans or Democrats), one political theory suggests the higher the unemployment rate, the worse the President's party will do in the midterm elections.\nIn 2020 there were 232 Democrats, 198 Republicans, and 1 Libertarian in the House.\n\nTo assess the validity of the claim related to unemployment and voting patterns, we can compile historical data and look for a connection.\nWe consider every midterm election from 1898 to 2018, with the exception of those elections during the Great Depression.\nThe House of Representatives is made up of 435 voting members.\n\n::: {.data data-latex=\"\"}\nThe [`midterms_house`](http://openintrostat.github.io/openintro/reference/midterms_house.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\n@fig-unemploymentAndChangeInHouse shows these data and the least-squares regression line:\n\n$$\n\\begin{aligned}\n&\\texttt{percent change in House seats for President's party} \\\\\n&\\qquad\\qquad= -7.36 - 0.89 \\times \\texttt{(unemployment rate)}\n\\end{aligned}\n$$\n\nWe consider the percent change in the number of seats of the President's party (e.g., percent change in the number of seats for Republicans in 2018) against the unemployment rate.\n\nExamining the data, there are no clear deviations from linearity or substantial outliers (see @sec-resids for a discussion on using residuals to visualize how well a linear model fits the data).\nWhile the data are collected sequentially, a separate analysis was used to check 
for any apparent correlation between successive observations; no such correlation was found.\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-unemploymentAndChangeInHouse fig-alt='Scatterplot with percent unemployed on the x-axis and percent change in House seats for the President\\'s party on the y-axis. Each point represents a different President\\'s midterm and is colored according to their political party (Democrat or Republican). The relationship is moderate and negative.' width=90%}\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nThe data for the Great Depression (1934 and 1938) were removed because the unemployment rate was 21% and 18%, respectively.\nDo you agree that they should be removed for this investigation?\nWhy or why not?[^24-inf-model-slr-1]\n:::\n\n[^24-inf-model-slr-1]: The answer to this question relies on the idea that statistical data analysis is somewhat of an art.\n That is, in many situations, there is no \"right\" answer.\n As you do more and more analyses on your own, you will come to recognize the nuanced understanding which is needed for a particular dataset.\n In terms of the Great Depression, we will provide two contrasting considerations.\n Each of these points would have very high leverage on any least-squares regression line, and years with such high unemployment may not help us understand what would happen in other years where the unemployment is only modestly high.\n On the other hand, the Depression years are exceptional cases, and we would be discarding important information if we exclude them from a final analysis.\n\nThere is a negative slope in the line shown in @fig-unemploymentAndChangeInHouse.\nHowever, this slope (and the y-intercept) are only estimates of the parameter values.\nWe might wonder, is this convincing evidence that the \"true\" linear model has a negative slope?\nThat is, do the data provide strong evidence that the political theory is accurate, where the unemployment rate is a useful predictor of 
the midterm election?\nWe can frame this investigation as a statistical hypothesis test:\n\n- $H_0$: $\\beta_1 = 0$. The true linear model has slope zero.\n- $H_A$: $\\beta_1 \\neq 0$. The true linear model has a slope different than zero. The unemployment rate is predictive of whether the President's party wins or loses seats in the House of Representatives.\n\nWe would reject $H_0$ in favor of $H_A$ if the data provide strong evidence that the true slope parameter is different than zero.\nTo assess the hypotheses, we identify a standard error for the estimate, compute an appropriate test statistic, and identify the p-value.\n\n### Variability of the statistic\n\nJust like other point estimates we have seen before, we can compute a standard error and test statistic for $b_1$.\nWe will generally label the test statistic using a $T$, since it follows the $t$-distribution.\n\nWe will rely on statistical software to compute the standard error and leave the explanation of how this standard error is determined to a second or third statistics course.\n@tbl-midtermUnempRegTable shows software output for the least squares regression line in @fig-unemploymentAndChangeInHouse.\nThe row labeled `unemp` includes all relevant information about the slope estimate (i.e., the coefficient of the unemployment variable, the related SE, the T statistic, and the corresponding p-value).\n\n\n\n\n::: {#tbl-midtermUnempRegTable .cell tbl-cap='Output from statistical software for the regression line modeling the midterm election losses for the President\\'s party as a response to unemployment.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n term | \n estimate | \n std.error | \n statistic | \n p.value | \n
\n \n\n \n (Intercept) | \n -7.36 | \n 5.16 | \n -1.43 | \n 0.16 | \n
\n \n unemp | \n -0.89 | \n 0.83 | \n -1.07 | \n 0.30 | \n
\n\n
\n\n`````\n:::\n:::\n\n\n::: {.workedexample data-latex=\"\"}\nWhat do the first and second columns of @tbl-midtermUnempRegTable represent?\n\n------------------------------------------------------------------------\n\nThe entries in the first column represent the least squares estimates, $b_0$ and $b_1$, and the values in the second column correspond to the standard errors of each estimate.\nUsing the estimates, we could write the equation for the least squares regression line as\n\n$$ \\hat{y} = -7.36 - 0.89 x $$\n\nwhere $\\hat{y}$ in this case represents the predicted change in the number of seats for the president's party, and $x$ represents the unemployment rate.\n:::\n\nWe previously used a $t$-test statistic for hypothesis testing in the context of numerical data.\nRegression is very similar.\nIn the hypotheses we consider, the null value for the slope is 0, so we can compute the test statistic using the T score formula:\n\n$$\nT \\ = \\ \\frac{\\text{estimate} - \\text{null value}}{\\text{SE}} = \\ \\frac{-0.89 - 0}{0.835} = \\ -1.07\n$$\n\nThe T score we calculated corresponds to the third column of @tbl-midtermUnempRegTable.\n\n::: {.workedexample data-latex=\"\"}\nUse @tbl-midtermUnempRegTable to determine the p-value for the hypothesis test.\n\n------------------------------------------------------------------------\n\nThe last column of the table gives the p-value for the two-sided hypothesis test for the coefficient of the unemployment rate: 0.2961.\nThat is, the data do not provide convincing evidence that a higher unemployment rate has any correspondence with smaller or larger losses for the President's party in the House of Representatives in midterm elections.\nIf there were no linear relationship between the two variables (i.e., if $\\beta_1 = 0)$, then we would expect to see linear models as or more extreme than the observed model roughly 30% of the time.\n:::\n\n### Observed statistic vs. 
null statistics\n\nAs the final step in a mathematical hypothesis test for the slope, we use the information provided to make a conclusion about whether the data could have come from a population where the true slope was zero (i.e., $\\beta_1 = 0$).\nBefore evaluating the formal hypothesis claim, sometimes it is important to check your intuition.\nBased on everything we have seen in the examples above describing the variability of a line from sample to sample, ask yourself if the linear relationship given by the data could have come from a population in which the slope was truly zero.\n\n::: {.workedexample data-latex=\"\"}\nExamine @fig-elmhurstScatterWLine, which relates the Elmhurst College aid and student family income.\nAre you convinced that the slope is discernibly different from zero?\nThat is, do you think a formal hypothesis test would reject the claim that the true slope of the line should be zero?\n\n------------------------------------------------------------------------\n\nWhile the relationship between the variables is not perfect, there is an evident decreasing trend in the data.\nSuch a distinct trend suggests that the hypothesis test will reject the null claim that the slope is zero.\n:::\n\n::: {.data data-latex=\"\"}\nThe [`elmhurst`](http://openintrostat.github.io/openintro/reference/elmhurst.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.\n:::\n\nThe tools in this section help you go beyond a visual interpretation of the linear relationship toward a formal mathematical claim about whether the slope estimate is meaningfully different from 0 to suggest that the true population slope is different from 0.\n\n\n::: {#tbl-rOutputForIncomeAidLSRLineInInferenceSection .cell tbl-cap='Summary of least squares fit for the Elmhurst College data, where we are predicting the gift aid by the university based on the family income of students.'}\n::: {.cell-output-display}\n`````{=html}\n\n \n \n term | \n 
estimate | \n std.error | \n statistic | \n p.value | \n
\n \n\n \n (Intercept) | \n 24319.33 | \n 1291.45 | \n 18.83 | \n <0.0001 | \n
\n \n family_income | \n -0.04 | \n 0.01 | \n -3.98 | \n 2e-04 | \n
\n\n
\n\n`````\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\n@tbl-rOutputForIncomeAidLSRLineInInferenceSection shows statistical software output from fitting the least squares regression line shown in @fig-elmhurstScatterWLine.\nUse the output to formally evaluate the following hypotheses.[^24-inf-model-slr-2]\n\n- $H_0$: The true coefficient for family income is zero.\n- $H_A$: The true coefficient for family income is not zero.\n:::\n\n[^24-inf-model-slr-2]: We look in the second row corresponding to the family income variable.\n    We see the point estimate of the slope of the line is -0.0431, the standard error of this estimate is 0.0108, and the $t$-test statistic is $T = -3.98$.\n    The p-value corresponds exactly to the two-sided test we are interested in: 0.0002.\n    The p-value is so small that we reject the null hypothesis and conclude that family income and financial aid at Elmhurst College for freshmen entering in the year 2011 are negatively correlated and the true slope parameter is indeed less than 0, just as we believed in our analysis of @fig-elmhurstScatterWLine.\n\n::: {.important data-latex=\"\"}\n**Inference for regression.**\n\nWe usually rely on statistical software to identify point estimates, standard errors, test statistics, and p-values in practice.\nHowever, be aware that software will not generally check whether the method is appropriate, meaning we must still verify conditions are met.\nSee @sec-tech-cond-linmod.\n:::\n\n\\clearpage\n\n## Mathematical model, interval for the slope\n\nSimilar to how we can conduct a hypothesis test for a model coefficient using regression output, we can also construct confidence intervals for the slope and intercept coefficients.\n\n::: {.important data-latex=\"\"}\n**Confidence intervals for coefficients.**\n\nConfidence intervals for model coefficients (e.g., the intercept or the slope) can be computed using the $t$-distribution:\n\n$$ b_i \\ \\pm\\ t_{df}^{\\star} \\times SE_{b_{i}} $$\n\nwhere 
$t_{df}^{\\star}$ is the appropriate $t^{\\star}$ cutoff corresponding to the confidence level with the model's degrees of freedom, $df = n - 2$.\n:::\n\n::: {.workedexample data-latex=\"\"}\nCompute the 95% confidence interval for the family income coefficient using the regression output from @tbl-rOutputForIncomeAidLSRLineInInferenceSection.\n\n------------------------------------------------------------------------\n\nThe point estimate is -0.0431 and the standard error is $SE = 0.0108$.\nWhen constructing a confidence interval for a model coefficient, we generally use a $t$-distribution.\nThe degrees of freedom for the distribution are noted in the regression output, $df = 48$, allowing us to identify $t_{48}^{\\star} = 2.01$ for use in the confidence interval.\n\nWe can now construct the confidence interval in the usual way:\n\n$$\n\\begin{aligned}\n\\text{point estimate} &\\pm t_{48}^{\\star} \\times SE \\\\\n-0.0431 &\\pm 2.01 \\times 0.0108 \\\\\n(-0.0648 &, -0.0214)\n\\end{aligned}\n$$\n\nWe are 95% confident that for an additional one unit (i.e., a \\$1000 increase) in family income, the university's gift aid is predicted to decrease on average by \\$21.40 to \\$64.80.\n:::\n\nOn the topic of intervals in this book, we have focused exclusively on confidence intervals for model parameters.\nHowever, there are other types of intervals that may be of interest (and are outside the scope of this book), including prediction intervals for a response value and confidence intervals for a mean response value in the context of regression.\n\n\\clearpage\n\n## Checking model conditions {#sec-tech-cond-linmod}\n\nIn the previous sections, we used randomization and bootstrapping to perform inference when the mathematical model was not valid due to violations of the technical conditions.\nIn this section, we'll provide details for when the mathematical model is appropriate and a discussion of technical conditions needed for the randomization and bootstrapping procedures.\nRecall from 
@sec-resids that residual plots can be used to visualize how well a linear model fits the data.\n\n\n\n\n\n### What are the technical conditions for the mathematical model?\n\nWhen fitting a least squares line, we generally require the following:\n\n- **Linearity.** The data should show a linear trend.\n If there is a nonlinear trend (e.g., first panel of @fig-whatCanGoWrongWithLinearModel) an advanced regression method from another book or later course should be applied.\n\n- **Independent observations.** Be cautious about applying regression to data that are sequential observations in time such as a stock price each day.\n Such data may have an underlying structure that should be considered in a different type of model and analysis.\n An example of a dataset where successive observations are not independent is shown in the fourth panel of @fig-whatCanGoWrongWithLinearModel.\n There are also other instances where correlations within the data are important, which is further discussed in @sec-inf-model-mlr.\n\n- **Nearly normal residuals.** Generally, the residuals should be nearly normal.\n When this condition is found to be unreasonable, it is often because of outliers or concerns about influential points, which we'll talk about more in @sec-outliers-in-regression.\n An example of a residual that would be potentially concerning is shown in the second panel of @fig-whatCanGoWrongWithLinearModel, where one observation is clearly much further from the regression line than the others. Outliers should be treated extremely carefully. Do not automatically remove an outlier if it truly belongs in the dataset. However, be honest about its impact on the analysis. A strategy for dealing with outliers is to present two analyses: one with the outlier and one without the outlier.\nAdditionally, a type of violation of normality happens when the positive residuals are smaller in magnitude than the negative residuals (or vice versa). 
That is, when the residuals are not symmetrically distributed around the line $y=0.$ \n\n- **Constant or equal variability.** The variability of points around the least squares line remains roughly constant.\n An example of non-constant variability is shown in the third panel of @fig-whatCanGoWrongWithLinearModel, which represents the most common pattern observed when this condition fails: the variability of $y$ is larger when $x$ is larger.\n\n\n::: {.cell}\n::: {.cell-output-display}\n{#fig-whatCanGoWrongWithLinearModel fig-alt='A grid of 2 by 4 scatterplots with fabricated data. The top row of plots contains original x-y data plots with a least squares regression line. The bottom row of plots is a series of residual plot with predicted value on the x-axis and residual on the y-axis. The first column of plots gives an example of points that have a quadratic relationship instead of a linear relationship. The second column of plots gives an example where a single outlying point does not fit the linear model. The third column of points gives an example where the points have increasing variability as the value of x increases. The last column of points gives an example where the points are correlated with one another, possibly as part of a time series.' 
width=100%}\n:::\n:::\n\n\n::: {.guidedpractice data-latex=\"\"}\nShould we have concerns about applying least squares regression to the Elmhurst data in @fig-elmhurstScatterW2Lines?[^24-inf-model-slr-3]\n:::\n\n[^24-inf-model-slr-3]: The trend appears to be linear, the data fall around the line with no obvious outliers, the variance is roughly constant.\n The data do not come from a time series or other obvious violation to independence.\n Least squares regression can be applied to these data.\n\nThe technical conditions are often remembered using the **LINE** mnemonic.\nThe linearity, normality, and equality of variance conditions usually can be assessed through residual plots, as seen in @fig-whatCanGoWrongWithLinearModel.\nA careful consideration of the experimental design should be undertaken to confirm that the observed values are indeed independent.\n\n- L: **linear** model\n- I: **independent** observations\n- N: points are **normally** distributed around the line\n- E: **equal** variability around the line for all values of the explanatory variable\n\n### Why do we need technical conditions?\n\nAs with other inferential techniques we have covered in this text, if the technical conditions above do not hold, then it is not possible to make concluding claims about the population.\nThat is, without the technical conditions, the T score will not have the assumed t-distribution.\nThat said, it is almost always impossible to check the conditions precisely, so we look for large deviations from the conditions.\nIf there are large deviations, we will be unable to trust the calculated p-value or the endpoints of the resulting confidence interval.\n\n**The model based on Linearity**\n\nThe linearity condition is among the most important if your goal is to understand a linear model between $x$ and $y$.\nFor example, the value of the slope will not be at all meaningful if the true relationship between $x$ and $y$ is quadratic, as in 
@fig-notGoodAtAllForALinearModel.\nNot only should we be cautious about the inference, but the model *itself* is also not an accurate portrayal of the relationship between the variables.\nHowever, an extended discussion on the different methods for modeling functional forms other than linear is outside the scope of this text.\n\n**The importance of Independence**\n\nThe technical condition describing the independence of the observations is often the most crucial but also the most difficult to diagnose.\nIt is also extremely difficult to gather a dataset which is a true random sample from the population of interest.\n(Note: a true randomized experiment from a fixed set of individuals is much easier to implement, and indeed, randomized experiments are done in most medical studies these days.)\n\nDependent observations can bias results in ways that produce fundamentally flawed analyses.\nFor example, if you hang out at the gym measuring height and weight, your linear model is surely not a representation of all students at your university.\nAt best it is a model describing students who use the gym (but also who are willing to talk to you, who use the gym at the times you were there measuring, etc.).\n\nIn lieu of trying to answer whether your observations are a true random sample, you might instead focus on whether you believe your observations are representative of a population of interest.\nHumans are notoriously bad at implementing random procedures, so you should be wary of any process that used human intuition to balance the data with respect to, for example, the demographics of the individuals in the sample.\n\n\\clearpage\n\n**Some thoughts on Normality**\n\nThe normality condition requires that points vary symmetrically around the line, spreading out in a bell-shaped fashion.\nYou should consider the \"bell\" of the normal distribution as sitting on top of the line (coming off the paper in a 3-D sense) so as to indicate that the points are dense close to the line 
and disperse gradually as they get farther from the line.\n\nThe normality condition is less important than linearity or independence for a few reasons.\nFirst, the linear model fit with least squares will still be an unbiased estimate of the true population model.\nHowever, the distribution of the estimate will be unknown.\nFortunately, the Central Limit Theorem (described in @sec-one-mean-math) tells us that most of the analyses (e.g., SEs, p-values, confidence intervals) done using the mathematical model (with the $t$-distribution) will still hold (even if the data are not normally distributed around the line) as long as the sample size is large enough.\nOne analysis method that *does* require normality, regardless of sample size, is creating intervals which predict the response of individual outcomes at a given $x$ value, using the linear model.\nOne additional reason to worry slightly less about normality is that neither the randomization test nor the bootstrapping procedures require the data to be normal around the line.\n\n**Equal variability for prediction in particular**\n\nAs with normality, the equal variability condition (that points are spread out in similar ways around the line for all values of $x$) will not cause problems for the estimate of the linear model.\nThat said, the **inference** on the model (e.g., computing p-values) will be incorrect if the variability around the line is extremely heterogeneous.\nWhen the data exhibit non-equal variance across the range of $x$-values, the variability of the slope can be seriously mis-estimated, which has consequences for the inference results (i.e., hypothesis tests and confidence intervals).\n\nIn many cases, the inference results for both a randomization test and a bootstrap confidence interval are also robust to the equal variability condition, so they provide the analyst a set of methods to use when the data are heteroskedastic (that is, exhibit unequal variability around the 
regression line).\nAlthough randomization tests and bootstrapping allow us to analyze data using fewer conditions, some technical conditions are required for all methods described in this text (e.g., independent observations).\nWhen the equal variability condition is violated and a mathematical analysis (e.g., p-value from T score) is needed, there are other existing methods (outside the scope of this text) which can handle the unequal variance (e.g., weighted least squares analysis).\n\n### What if all the technical conditions are met?\n\nWhen the technical conditions are met, the least squares regression model and its inference are provided by virtually all statistical software.\nIn addition to being ubiquitous, the least squares regression model (and related inference) has important extensions (which are not trivial to implement with bootstrapping and randomization tests).\nIn particular, random effects models, repeated measures, and interaction terms are all linear model extensions which require the above technical conditions.\nWhen the technical conditions hold, the extensions to the linear model can provide important insight into the data and research question at hand.\nWe will discuss some of the extended modeling and associated inference in @sec-inf-model-mlr and @sec-inf-model-logistic.\nMany of the techniques used to deal with technical condition violations are outside the scope of this text, but they are taught in universities in the very next class after this one.\nIf you are working with linear models or are curious to learn more, we recommend that you continue learning about statistical methods applicable to a larger class of datasets.\n\n\\clearpage\n\n## Chapter review {#sec-chp24-review}\n\n### Summary\n\nRecall that early in the text we presented graphical techniques which communicated relationships across multiple variables.\nWe also used modeling to formalize the relationships.\nMany chapters were 
dedicated to inferential methods which allowed claims about the population to be made based on samples of data.\nNot only did we present the mathematical model for each of the inferential techniques, but when appropriate, we also presented bootstrapping and permutation methods.\n\nHere in @sec-inf-model-slr we brought all of those ideas together by considering inferential claims on linear models through randomization tests, bootstrapping, and mathematical modeling.\nWe continue to emphasize the importance of experimental design in making conclusions about research claims.\nIn particular, recall that variability can come from different sources (e.g., random sampling vs. random allocation, see @fig-randsampValloc).\n\n### Terms\n\nWe introduced the following terms in the chapter.\nIf you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.\nWe are purposefully presenting them in alphabetical order, instead of in order of appearance, so they will be a little more challenging to locate.\nHowever, you should be able to easily spot them as **bolded text**.\n\n\n::: {.cell}\n::: {.cell-output-display}\n`````{=html}\n\n\n \n bootstrap CI for the slope | \n randomization test for the slope | \n technical conditions linear regression | \n
\n \n inference with single predictor regression | \n t-distribution for slope | \n variability of the slope | \n
\n\n
\n\n`````\n:::\n:::\n\n\n\\clearpage\n\n## Exercises {#sec-chp24-exercises}\n\nAnswers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-24].\n\n::: {.exercises data-latex=\"\"}\n1. **Body measurements, randomization test.** Researchers studying anthropometry collected body and skeletal diameter measurements, as well as age, weight, height and sex for 507 physically active individuals.\n A linear model is built to predict height based on shoulder girth (circumference of shoulders measured over deltoid muscles), both measured in centimeters.[^_24-ex-inf-model-slr-1]\n [@Heinz:2003]\n\n Below are two items.\n The first is the standard linear model output for predicting height from shoulder girth.\n The second is a histogram of slopes from 1,000 randomized datasets (1,000 times, `hgt` was permuted and regressed against `sho_gi`).\n The red vertical line is drawn at the observed slope value which was produced in the linear model output.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n term | \n estimate | \n std.error | \n statistic | \n p.value | \n
\n \n \n \n (Intercept) | \n 105.832 | \n 3.27 | \n 32.3 | \n <0.0001 | \n
\n \n sho_gi | \n 0.604 | \n 0.03 | \n 20.0 | \n <0.0001 | \n
\n \n
\n \n `````\n :::\n \n ::: {.cell-output-display}\n {width=90%}\n :::\n :::\n\n a. What are the null and alternative hypotheses for evaluating whether the slope of the model predicting height from shoulder girth is different than 0?\n\n b. Using the histogram which describes the distribution of slopes when the null hypothesis is true, find the p-value and conclude the hypothesis test in the context of the problem (use words like shoulder girth and height).\n\n c. Is the conclusion based on the histogram of randomized slopes consistent with the conclusion which would have been obtained using the mathematical model?\n Explain.\n\n2. **Baby's weight and father's age, randomization test.** US Department of Health and Human Services, Centers for Disease Control and Prevention collect information on births recorded in the country.\n The data used here are a random sample of 1000 births from 2014.\n Here, we study the relationship between the father's age and the weight of the baby.[^_24-ex-inf-model-slr-2]\n [@data:births14]\n\n Below are two items.\n The first is the standard linear model output for predicting baby's weight (in pounds) from father's age (in years).\n The second is a histogram of slopes from 1000 randomized datasets (1000 times, `weight` was permuted and regressed against `fage`).\n The red vertical line is drawn at the observed slope value which was produced in the linear model output.\n\n ::: {.cell}\n ::: {.cell-output-display}\n `````{=html}\n \n \n \n term | \n estimate | \n std.error | \n statistic | \n p.value | \n
\n \n \n \n (Intercept) | \n 7.101 | \n 0.199 | \n 35.674 | \n <0.0001 | \n
\n \n fage | \n 0.005 | \n 0.006 | \n 0.757 | \n 0.4495 | \n
\n \n
\n \n `````\n :::\n \n ::: {.cell-output-display}\n {width=90%}\n :::\n :::\n\n a. What are the null and alternative hypotheses for evaluating whether the slope of the model for predicting baby's weight from father's age is different than 0?\n\n b. Using the histogram which describes the distribution of slopes when the null hypothesis is true, find the p-value and conclude the hypothesis test in the context of the problem (use words like father's age and weight of baby).\n What does the conclusion of your test say about whether the father's age is a useful predictor of baby's weight?\n\n c. Is the conclusion based on the histogram of randomized slopes consistent with the conclusion which would have been obtained using the mathematical model?\n Explain.\n\n3. **Body measurements, mathematical test.** The scatterplot and least squares summary below show the relationship between weight measured in kilograms and height measured in centimeters of 507 physically active individuals.\n [@Heinz:2003]\n\n ::: {.cell}\n ::: {.cell-output-display}\n {width=90%}\n :::\n \n ::: {.cell-output-display}\n `````{=html}\n \n \n \n term | \n estimate | \n std.error | \n statistic | \n p.value | \n
\n \n \n \n (Intercept) | \n -105.01 | \n 7.54 | \n -13.9 | \n <0.0001 | \n
\n \n hgt | \n 1.02 | \n 0.04 | \n 23.1 | \n <0.0001 | \n
\n \n
\n \n `````\n :::\n :::\n\n a. Describe the relationship between height and weight.\n\n b. Write the equation of the regression line.\n Interpret the slope and intercept in context.\n\n c. Do the data provide convincing evidence that the true slope parameter is different than 0?\n State the null and alternative hypotheses, report the p-value (using a mathematical model), and state your conclusion.\n\n d. The correlation coefficient for height and weight is 0.72.\n Calculate $R^2$ and interpret it in context.\n\n4. **Baby's weight and father's age, mathematical test.** Is the father's age useful in predicting the baby's weight?\n The scatterplot and least squares summary below show the relationship between baby's weight (measured in pounds) and father's age for a random sample of babies.\n [@data:births14]\n\n ::: {.cell}\n ::: {.cell-output-display}\n {width=90%}\n :::\n \n ::: {.cell-output-display}\n `````{=html}\n \n \n \n term | \n estimate | \n std.error | \n statistic | \n p.value | \n
\n \n \n \n (Intercept) | \n 7.1042 | \n 0.1936 | \n 36.698 | \n <0.0001 | \n
\n \n fage | \n 0.0047 | \n 0.0061 | \n 0.779 | \n 0.4359 | \n
\n \n
\n \n `````\n :::\n :::\n\n a. What is the predicted weight of a baby whose father is 30 years old?\n\n b. Do the data provide convincing evidence that the model for predicting baby weights from father's age has a slope different than 0?\n State the null and alternative hypotheses, report the p-value (using a mathematical model), and state your conclusion.\n\n c. Based on your conclusion, is father's age a useful predictor of baby's weight?\n\n5. **Body measurements, bootstrap percentile interval.** In order to estimate the slope of the model predicting height based on shoulder girth (circumference of shoulders measured over deltoid muscles), 1,000 bootstrap samples are taken from a dataset of body measurements from 507 people.\n A linear model predicting height based on shoulder girth is fit to each bootstrap sample, and the slope is estimated.\n A histogram of these slopes is shown below.\n [@Heinz:2003]\n\n ::: {.cell}\n ::: {.cell-output-display}\n {width=90%}\n :::\n :::\n\n a. Using the bootstrap percentile method and the histogram above, find a 98% confidence interval for the slope parameter.\n\n b. Interpret the confidence interval in the context of the problem.\n\n6. **Baby's weight and father's age, bootstrap percentile interval.** US Department of Health and Human Services, Centers for Disease Control and Prevention collect information on births recorded in the country.\n The data used here are a random sample of 1000 births from 2014.\n Here, we study the relationship between the father's age and the weight of the baby.\n Below is the bootstrap distribution of the slope statistic from 1,000 different bootstrap samples of the data.\n [@data:births14]\n\n ::: {.cell}\n ::: {.cell-output-display}\n {width=90%}\n :::\n :::\n\n a. Using the bootstrap percentile method and the histogram above, find a 95% confidence interval for the slope parameter.\n\n b. Interpret the confidence interval in the context of the problem.\n\n7. 
**Body measurements, standard error bootstrap interval.** A linear model is built to predict height based on shoulder girth (circumference of shoulders measured over deltoid muscles), both measured in centimeters.
    [@Heinz:2003]

    Below are two items.
    The first is the standard linear model output for predicting height from shoulder girth.
    The second is the bootstrap distribution of the slope statistic from 1,000 different bootstrap samples of the data.

    ::: {.cell}
    ::: {.cell-output-display}
    | term        | estimate | std.error | statistic | p.value |
    |:------------|---------:|----------:|----------:|--------:|
    | (Intercept) |  105.832 |      3.27 |      32.3 | <0.0001 |
    | sho_gi      |    0.604 |      0.03 |      20.0 | <0.0001 |
    :::

    ::: {.cell-output-display}
    {width=90%}
    :::
    :::

    a. Using the histogram, approximate the standard error of the slope statistic (that is, quantify the variability of the slope statistic from sample to sample).

    b. Find a 98% bootstrap SE confidence interval for the slope parameter.

    c. Interpret the confidence interval in the context of the problem.

8.  **Baby's weight and father's age, standard error bootstrap interval.** The US Department of Health and Human Services' Centers for Disease Control and Prevention collects information on births recorded in the country.
    The data used here are a random sample of 1,000 births from 2014.
    Here, we study the relationship between the father's age and the weight of the baby.
    [@data:births14]

    Below are two items.
    The first is the standard linear model output for predicting baby's weight (in pounds) from father's age (in years).
    The second is the bootstrap distribution of the slope statistic from 1,000 different bootstrap samples of the data.

    ::: {.cell}
    ::: {.cell-output-display}
    | term        | estimate | std.error | statistic | p.value |
    |:------------|---------:|----------:|----------:|--------:|
    | (Intercept) |    7.101 |     0.199 |    35.674 | <0.0001 |
    | fage        |    0.005 |     0.006 |     0.757 |  0.4495 |
    :::

    ::: {.cell-output-display}
    {width=90%}
    :::
    :::

    a. Using the histogram, approximate the standard error of the slope statistic (that is, quantify the variability of the slope statistic from sample to sample).

    b. Find a 95% bootstrap SE confidence interval for the slope parameter.

    c. Interpret the confidence interval in the context of the problem.

9.  **Body measurements, conditions.** The scatterplot below shows the residuals (on the y-axis) from the linear model of weight vs. height from a dataset of body measurements from 507 physically active individuals.
    The x-axis is the height of the individuals, in cm.
    [@Heinz:2003]

    ::: {.cell}
    ::: {.cell-output-display}
    {width=100%}
    :::
    :::

    a. For these data, $R^2$ is 51.84%.
       What is the value of the correlation coefficient?
       How can you tell if it is positive or negative?
       \[Hint: you may need to look at a previous exercise.\]

    b. Examine the residual plot.
       What do you observe?
       Is a simple least squares fit appropriate for these data?
       Which of the LINE conditions are met or not met?

10. **Baby's weight and father's age, conditions.** The scatterplot below shows the residuals (on the y-axis) from the linear model of baby's weight (measured in pounds) vs. father's age for a random sample of babies.
    Father's age is on the x-axis.
    [@data:births14]

    ::: {.cell}
    ::: {.cell-output-display}
    {width=100%}
    :::
    :::

    a. For these data, $R^2$ is 0.09%.
       What is the value of the correlation coefficient?
       How can you tell if it is positive or negative?
       \[Hint: you may need to look at a previous exercise.\]

    b. Examine the residual plot.
       What do you observe?
       Is a simple least squares fit appropriate for these data?
       Which of the LINE conditions are met or not met?
11. **Murders and poverty, randomization test.** The following regression output is for predicting annual murders per million (`annual_murders_per_mil`) from percentage living in poverty (`perc_pov`) in a random sample of 20 metropolitan areas.

    Below are two items.
    The first is the standard linear model output for predicting annual murders per million from percentage living in poverty for metropolitan areas.
    The second is a histogram of slopes from 1,000 randomized datasets (1,000 times, `annual_murders_per_mil` was permuted and regressed against `perc_pov`).
    The red vertical line is drawn at the observed slope value which was produced in the linear model output.

    ::: {.cell}
    ::: {.cell-output-display}
    | term        | estimate | std.error | statistic | p.value |
    |:------------|---------:|----------:|----------:|--------:|
    | (Intercept) |   -29.90 |      7.79 |     -3.84 |  0.0012 |
    | perc_pov    |     2.56 |      0.39 |      6.56 | <0.0001 |
    :::

    ::: {.cell-output-display}
    {width=90%}
    :::
    :::

    a. What are the null and alternative hypotheses for evaluating whether the slope of the model for predicting annual murder rate from poverty percentage is different than 0?

    b. Using the histogram which describes the distribution of slopes when the null hypothesis is true, find the p-value and conclude the hypothesis test in the context of the problem (use words like murder rate and poverty).

    c. Is the conclusion based on the histogram of randomized slopes consistent with the conclusion which would have been obtained using the mathematical model?
       Explain.

12. **Murders and poverty, mathematical test.** The table below shows the output of a linear model predicting annual murders per million (`annual_murders_per_mil`) from percentage living in poverty (`perc_pov`) in a random sample of 20 metropolitan areas.

    ::: {.cell}
    ::: {.cell-output-display}
    | term        | estimate | std.error | statistic | p.value |
    |:------------|---------:|----------:|----------:|--------:|
    | (Intercept) |   -29.90 |      7.79 |     -3.84 |  0.0012 |
    | perc_pov    |     2.56 |      0.39 |      6.56 | <0.0001 |
    :::
    :::

    a. What are the hypotheses for evaluating whether the slope of the model predicting annual murder rate from poverty percentage is different than 0?

    b. State the conclusion of the hypothesis test from part (a) in context of the data.
       What does this say about whether poverty percentage is a useful predictor of annual murder rate?

    c. Calculate a 95% confidence interval for the slope of poverty percentage, and interpret it in context of the data.

    d. Do your results from the hypothesis test and the confidence interval agree?
       Explain.

13. **Murders and poverty, bootstrap percentile interval.** Data on annual murders per million (`annual_murders_per_mil`) and percentage living in poverty (`perc_pov`) are collected from a random sample of 20 metropolitan areas.
    Using these data we want to estimate the slope of the model predicting `annual_murders_per_mil` from `perc_pov`.
    We take 1,000 bootstrap samples of the data and fit a linear model predicting `annual_murders_per_mil` from `perc_pov` to each bootstrap sample.
    A histogram of these slopes is shown below.

    ::: {.cell}
    ::: {.cell-output-display}
    {width=90%}
    :::
    :::

    a. Using the percentile bootstrap method and the histogram above, find a 90% confidence interval for the slope parameter.

    b. Interpret the confidence interval in the context of the problem.
14. **Murders and poverty, standard error bootstrap interval.** A linear model is built to predict annual murders per million (`annual_murders_per_mil`) from percentage living in poverty (`perc_pov`) in a random sample of 20 metropolitan areas.

    Below are two items.
    The first is the standard linear model output for predicting annual murders per million from percentage living in poverty for metropolitan areas.
    The second is the bootstrap distribution of the slope statistic from 1,000 different bootstrap samples of the data.

    ::: {.cell}
    ::: {.cell-output-display}
    | term        | estimate | std.error | statistic | p.value |
    |:------------|---------:|----------:|----------:|--------:|
    | (Intercept) |   -29.90 |      7.79 |     -3.84 |  0.0012 |
    | perc_pov    |     2.56 |      0.39 |      6.56 | <0.0001 |
    :::

    ::: {.cell-output-display}
    {width=90%}
    :::
    :::

    a. Using the histogram, approximate the standard error of the slope statistic (that is, quantify the variability of the slope statistic from sample to sample).

    b. Find a 90% bootstrap SE confidence interval for the slope parameter.

    c. Interpret the confidence interval in the context of the problem.

15. **Murders and poverty, conditions.** The scatterplot below shows the annual murders per million vs. percentage living in poverty in a random sample of 20 metropolitan areas.
    The second figure plots residuals on the y-axis and percent living in poverty on the x-axis.

    ::: {.cell}
    ::: {.cell-output-display}
    {width=100%}
    :::

    ::: {.cell-output-display}
    {width=100%}
    :::
    :::

    a. For these data, $R^2$ is 70.56%.
       What is the value of the correlation coefficient?
       How can you tell if it is positive or negative?

    b. Examine the residual plot.
       What do you observe?
       Is a simple least squares fit appropriate for the data?
       Which of the LINE conditions are met or not met?

16. **I heart cats.** Researchers collected data on heart and body weights of 144 domestic adult cats.
    The table below shows the output of a linear model predicting heart weight (measured in grams) from body weight (measured in kilograms) of these cats.[^_24-ex-inf-model-slr-3]

    ::: {.cell}
    ::: {.cell-output-display}
    | term        | estimate | std.error | statistic | p.value |
    |:------------|---------:|----------:|----------:|--------:|
    | (Intercept) |   -0.357 |     0.692 |    -0.515 |  0.6072 |
    | Bwt         |    4.034 |     0.250 |    16.119 | <0.0001 |
    :::
    :::

    a. What are the hypotheses for evaluating whether body weight is positively associated with heart weight in cats?

    b. State the conclusion of the hypothesis test from part (a) in context of the data.

    c. Calculate a 95% confidence interval for the slope of body weight, and interpret it in context of the data.

    d. Do your results from the hypothesis test and the confidence interval agree?
       Explain.

17. **Beer and blood alcohol content.** Many people believe that weight, drinking habits, and many other factors are much more important in predicting blood alcohol content (BAC) than simply considering the number of drinks a person consumed.
    Here we examine data from sixteen student volunteers at Ohio State University who each drank a randomly assigned number of cans of beer.
    These students were evenly divided between men and women, and they differed in weight and drinking habits.
    Thirty minutes later, a police officer measured their blood alcohol content in grams of alcohol per deciliter of blood.
    The scatterplot and regression table summarize the findings.[^_24-ex-inf-model-slr-4]
    [@Malkevitc+Lesser:2008]

    ::: {.cell}
    ::: {.cell-output-display}
    {width=90%}
    :::

    ::: {.cell-output-display}
    | term        | estimate | std.error | statistic | p.value |
    |:------------|---------:|----------:|----------:|--------:|
    | (Intercept) |  -0.0127 |    0.0126 |     -1.00 |   0.332 |
    | beers       |   0.0180 |    0.0024 |      7.48 | <0.0001 |
    :::
    :::

    a. Describe the relationship between the number of cans of beer and BAC.

    b. Write the equation of the regression line.
       Interpret the slope and intercept in context.

    c. Do the data provide convincing evidence that drinking more cans of beer is associated with an increase in blood alcohol?
       State the null and alternative hypotheses, report the p-value, and state your conclusion.

    d. The correlation coefficient for number of cans of beer and BAC is 0.89.
       Calculate $R^2$ and interpret it in context.

    e. Suppose we visit a bar, ask people how many drinks they have had, and also take their BAC.
       Do you think the relationship between number of drinks and BAC would be as strong as the relationship found in the Ohio State study?

18. **Urban homeowners, conditions.** The scatterplot below shows the percent of families who own their home vs. the percent of the population living in urban areas.
    [@data:urbanOwner] There are 52 observations: one for each US state, plus Puerto Rico and the District of Columbia.
    The second figure plots residuals on the y-axis and percent of the population living in urban areas on the x-axis.

    ::: {.cell}
    ::: {.cell-output-display}
    {width=100%}
    :::

    ::: {.cell-output-display}
    {width=100%}
    :::
    :::

    a. For these data, $R^2$ is 29.16%.
       What is the value of the correlation coefficient?
       How can you tell if it is positive or negative?

    b. Examine the residual plot.
       What do you observe?
       Is a simple least squares fit appropriate for the data?
       Which of the LINE conditions are met or not met?

[^_24-ex-inf-model-slr-1]: The [`bdims`](http://openintrostat.github.io/openintro/reference/bdims.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.

[^_24-ex-inf-model-slr-2]: The [`births14`](http://openintrostat.github.io/openintro/reference/births14.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.

[^_24-ex-inf-model-slr-3]: The [`cats`](https://stat.ethz.ch/R-manual/R-patched/library/MASS/html/cats.html) data used in this exercise can be found in the [**MASS**](https://cran.r-project.org/web/packages/MASS/index.html) R package.

[^_24-ex-inf-model-slr-4]: The [`bac`](http://openintrostat.github.io/openintro/reference/bac.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.

:::
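A brief reminder for the standard error bootstrap interval exercises, following the chapter's notation ($b_1$ for the observed slope, $SE_{boot}$ for the standard error read off the bootstrap histogram): the interval is always the point estimate plus or minus a standard normal critical value times the bootstrap SE,

$$
b_1 \pm z^{\star} \times SE_{boot},
\qquad z^{\star}_{90\%} \approx 1.65, \quad
z^{\star}_{95\%} \approx 1.96, \quad
z^{\star}_{98\%} \approx 2.33.
$$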
diff --git a/_freeze/24-inf-model-slr/figure-html/fig-birth2BS-1.png b/_freeze/24-inf-model-slr/figure-html/fig-birth2BS-1.png
index d8cae071..180c6451 100644
Binary files a/_freeze/24-inf-model-slr/figure-html/fig-birth2BS-1.png and b/_freeze/24-inf-model-slr/figure-html/fig-birth2BS-1.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/fig-magePlot-1.png b/_freeze/24-inf-model-slr/figure-html/fig-magePlot-1.png
index cde14c9f..3096e64f 100644
Binary files a/_freeze/24-inf-model-slr/figure-html/fig-magePlot-1.png and b/_freeze/24-inf-model-slr/figure-html/fig-magePlot-1.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/fig-nulldistBirths-1.png b/_freeze/24-inf-model-slr/figure-html/fig-nulldistBirths-1.png
index 0273aa8d..22b1bae8 100644
Binary files a/_freeze/24-inf-model-slr/figure-html/fig-nulldistBirths-1.png and b/_freeze/24-inf-model-slr/figure-html/fig-nulldistBirths-1.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/fig-permweekslm-1.png b/_freeze/24-inf-model-slr/figure-html/fig-permweekslm-1.png
index 6a4d97b2..caca696c 100644
Binary files a/_freeze/24-inf-model-slr/figure-html/fig-permweekslm-1.png and b/_freeze/24-inf-model-slr/figure-html/fig-permweekslm-1.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/fig-permweightScatter-1.png b/_freeze/24-inf-model-slr/figure-html/fig-permweightScatter-1.png
index 49d1fc13..6efaafb0 100644
Binary files a/_freeze/24-inf-model-slr/figure-html/fig-permweightScatter-1.png and b/_freeze/24-inf-model-slr/figure-html/fig-permweightScatter-1.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/fig-sand-samp1-1.png b/_freeze/24-inf-model-slr/figure-html/fig-sand-samp1-1.png
index 75d84a2b..c03803d3 100644
Binary files a/_freeze/24-inf-model-slr/figure-html/fig-sand-samp1-1.png and b/_freeze/24-inf-model-slr/figure-html/fig-sand-samp1-1.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/fig-sand-samp12-1.png b/_freeze/24-inf-model-slr/figure-html/fig-sand-samp12-1.png
index e8f8ffc1..7feaafd8 100644
Binary files a/_freeze/24-inf-model-slr/figure-html/fig-sand-samp12-1.png and b/_freeze/24-inf-model-slr/figure-html/fig-sand-samp12-1.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/fig-sand-samp2-1.png b/_freeze/24-inf-model-slr/figure-html/fig-sand-samp2-1.png
index 64474bb0..fe93f80f 100644
Binary files a/_freeze/24-inf-model-slr/figure-html/fig-sand-samp2-1.png and b/_freeze/24-inf-model-slr/figure-html/fig-sand-samp2-1.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/fig-sandpop-1.png b/_freeze/24-inf-model-slr/figure-html/fig-sandpop-1.png
index abceeb72..3e855b36 100644
Binary files a/_freeze/24-inf-model-slr/figure-html/fig-sandpop-1.png and b/_freeze/24-inf-model-slr/figure-html/fig-sandpop-1.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/fig-slopes-1.png b/_freeze/24-inf-model-slr/figure-html/fig-slopes-1.png
index b2bfc3ec..db8f4c0d 100644
Binary files a/_freeze/24-inf-model-slr/figure-html/fig-slopes-1.png and b/_freeze/24-inf-model-slr/figure-html/fig-slopes-1.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/fig-unemploymentAndChangeInHouse-1.png b/_freeze/24-inf-model-slr/figure-html/fig-unemploymentAndChangeInHouse-1.png
index b25b3bf8..8f207153 100644
Binary files a/_freeze/24-inf-model-slr/figure-html/fig-unemploymentAndChangeInHouse-1.png and b/_freeze/24-inf-model-slr/figure-html/fig-unemploymentAndChangeInHouse-1.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-32-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-32-1.png
index 0b173d7b..a1e3c2cd 100644
Binary files a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-32-1.png and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-32-1.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-33-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-33-1.png
index 73e318d2..0b173d7b 100644
Binary files a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-33-1.png and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-33-1.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-34-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-34-1.png
index 73e318d2..1b4a4849 100644
Binary files a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-34-1.png and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-34-1.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-35-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-35-1.png
index 22d1eb75..0eade406 100644
Binary files a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-35-1.png and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-35-1.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-36-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-36-1.png
new file mode 100644
index 00000000..856e4da0
Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-36-1.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-37-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-37-1.png
index 12062b29..73e318d2 100644
Binary files a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-37-1.png and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-37-1.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-38-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-38-1.png
index fdb721d9..856e4da0 100644
Binary files a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-38-1.png and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-38-1.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-39-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-39-1.png
index a1e3c2cd..4777d8fa 100644
Binary files a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-39-1.png and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-39-1.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-39-2.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-39-2.png
new file mode 100644
index 00000000..90d0a2b5
Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-39-2.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-40-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-40-1.png
index 1b4a4849..49b22297 100644
Binary files a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-40-1.png and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-40-1.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-40-2.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-40-2.png
new file mode 100644
index 00000000..90d0a2b5
Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-40-2.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-41-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-41-1.png
index 856e4da0..22d1eb75 100644
Binary files a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-41-1.png and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-41-1.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-43-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-43-1.png
new file mode 100644
index 00000000..12062b29
Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-43-1.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-44-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-44-1.png
index 232730ea..fdb721d9 100644
Binary files a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-44-1.png and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-44-1.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-45-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-45-1.png
index 82f2d028..53e639ff 100644
Binary files a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-45-1.png and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-45-1.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-45-2.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-45-2.png
new file mode 100644
index 00000000..4a3b0b1d
Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-45-2.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-47-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-47-1.png
new file mode 100644
index 00000000..232730ea
Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-47-1.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-48-1.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-48-1.png
new file mode 100644
index 00000000..f0a615fc
Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-48-1.png differ
diff --git a/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-48-2.png b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-48-2.png
new file mode 100644
index 00000000..deb83482
Binary files /dev/null and b/_freeze/24-inf-model-slr/figure-html/unnamed-chunk-48-2.png differ
diff --git a/exercises/_24-ex-inf-model-slr.qmd b/exercises/_24-ex-inf-model-slr.qmd
index 2230d36f..379e43e7 100644
--- a/exercises/_24-ex-inf-model-slr.qmd
+++ b/exercises/_24-ex-inf-model-slr.qmd
@@ -1,7 +1,7 @@
-1. **Body measurements, randomization test.**
-Researchers studying anthropometry collected body and skeletal diameter measurements, as well as age, weight, height and sex for 507 physically active individuals.
-A linear model is built to predict height based on shoulder girth (circumference of shoulders measured over deltoid muscles), both measured in centimeters.^[The [`bdims`](http://openintrostat.github.io/openintro/reference/bdims.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.] [@Heinz:2003]
-
+1. **Body measurements, randomization test.** Researchers studying anthropometry collected body and skeletal diameter measurements, as well as age, weight, height and sex for 507 physically active individuals.
+ A linear model is built to predict height based on shoulder girth (circumference of shoulders measured over deltoid muscles), both measured in centimeters.[^_24-ex-inf-model-slr-1]
+ [@Heinz:2003]
+
Below are two items.
The first is the standard linear model output for predicting height from shoulder girth.
The second is a histogram of slopes from 1,000 randomized datasets (1,000 times, `hgt` was permuted and regressed against `sho_gi`).
@@ -13,7 +13,7 @@ A linear model is built to predict height based on shoulder girth (circumference
library(infer)
library(broom)
library(kableExtra)
-
+
lm(hgt ~ sho_gi, data = bdims) %>%
tidy() %>%
mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) %>%
@@ -40,31 +40,80 @@ A linear model is built to predict height based on shoulder girth (circumference
geom_vline(xintercept = 0.604, color = IMSCOL["red", "full"], size = 1)
```
- a. What are the null and alternative hypotheses for evaluating whether the slope of the model predicting height from shoulder girth is differen than 0.
-
- b. Using the histogram which describes the distribution of slopes when the null hypothesis is true, find the p-value and conclude the hypothesis test in the context of the problem (use words like shoulder girth and height).
-
- c. Is the conclusion based on the histogram of randomized slopes consistent with the conclusion which would have been obtained using the mathematical model? Explain.
+ a. What are the null and alternative hypotheses for evaluating whether the slope of the model predicting height from shoulder girth is different than 0?
+
+ b. Using the histogram which describes the distribution of slopes when the null hypothesis is true, find the p-value and conclude the hypothesis test in the context of the problem (use words like shoulder girth and height).
+
+ c. Is the conclusion based on the histogram of randomized slopes consistent with the conclusion which would have been obtained using the mathematical model?
+ Explain.
+
+2. **Baby's weight and father's age, randomization test.** The US Department of Health and Human Services' Centers for Disease Control and Prevention collects information on births recorded in the country.
+ The data used here are a random sample of 1000 births from 2014.
+ Here, we study the relationship between the father's age and the weight of the baby.[^_24-ex-inf-model-slr-2]
+ [@data:births14]
-1. **Body measurements, mathematical test.**
-The scatterplot and least squares summary below show the relationship between weight measured in kilograms and height measured in centimeters of 507 physically active individuals. [@Heinz:2003]
+ Below are two items.
+ The first is the standard linear model output for predicting baby's weight (in pounds) from father's age (in years).
+ The second is a histogram of slopes from 1000 randomized datasets (1000 times, `weight` was permuted and regressed against `fage`).
+ The red vertical line is drawn at the observed slope value which was produced in the linear model output.
```{r}
- library(openintro)
library(tidyverse)
- library(kableExtra)
+ library(openintro)
+ library(infer)
library(broom)
-
+
+ births14 %>%
+ drop_na() %>%
+ lm(weight ~ fage, data = .) %>%
+ tidy() %>%
+ mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) %>%
+ kbl(linesep = "", booktabs = TRUE, align = "lrrrr", digits = 3) %>%
+ kable_styling(bootstrap_options = c("striped", "condensed"),
+ latex_options = "HOLD_position",
+ full_width = FALSE) %>%
+ column_spec(1, width = "10em", monospace = TRUE) %>%
+ column_spec(2:5, width = "5em")
+
+ set.seed(47)
+ births14 %>%
+ drop_na() %>%
+ specify(weight ~ fage) %>%
+ hypothesize(null = "independence") %>%
+ generate(reps = 1000, type = "permute") %>%
+ calculate(stat = "slope") %>%
+ ggplot(aes(x = stat)) +
+ geom_histogram(binwidth = 0.0025, fill = IMSCOL["green", "full"]) +
+ labs(
+ title = "1,000 randomized slopes",
+ x = "Slope from randomly permuted data",
+ y = "Count"
+ ) +
+ geom_vline(xintercept = 0.005, color = IMSCOL["red", "full"], size = 1)
+ ```
+
+ a. What are the null and alternative hypotheses for evaluating whether the slope of the model for predicting baby's weight from father's age is different than 0?
+
+ b. Using the histogram which describes the distribution of slopes when the null hypothesis is true, find the p-value and conclude the hypothesis test in the context of the problem (use words like father's age and weight of baby).
+ What does the conclusion of your test say about whether the father's age is a useful predictor of baby's weight?
+
+ c. Is the conclusion based on the histogram of randomized slopes consistent with the conclusion which would have been obtained using the mathematical model?
+ Explain.
+
+3. **Body measurements, mathematical test.** The scatterplot and least squares summary below show the relationship between weight measured in kilograms and height measured in centimeters of 507 physically active individuals.
+ [@Heinz:2003]
+
+ ```{r}
ggplot(bdims, aes(x = hgt, y = wgt)) +
geom_point() +
labs(
x = "Height (cm)",
y = "Weight (kg)"
)
-
+
m_wgt_hgt <- lm(wgt ~ hgt, data = bdims)
r_wgt_hgt <- round(cor(bdims$wgt, bdims$hgt), 2)
-
+
tidy(m_wgt_hgt) %>%
mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) %>%
kbl(linesep = "", booktabs = TRUE, align = "lrrrr", digits = 2) %>%
@@ -77,22 +126,52 @@ The scatterplot and least squares summary below show the relationship between we
a. Describe the relationship between height and weight.
- b. Write the equation of the regression line. Interpret the slope and intercept in context.
+ b. Write the equation of the regression line.
+ Interpret the slope and intercept in context.
- c. Do the data provide convincing evidence that the true slope parameter is different than 0? State the null and alternative hypotheses, report the p-value (using a mathematical model), and state your conclusion.
+ c. Do the data provide convincing evidence that the true slope parameter is different than 0?
+ State the null and alternative hypotheses, report the p-value (using a mathematical model), and state your conclusion.
- d. The correlation coefficient for height and weight is `r r_wgt_hgt`. Calculate $R^2$ and interpret it in context.
+ d. The correlation coefficient for height and weight is `r r_wgt_hgt`.
+ Calculate $R^2$ and interpret it in context.
-1. **Body measurements, bootstrap percentile interval.**
-In order to estimate the slope of the model predicting height based on shoulder girth (circumference of shoulders measured over deltoid muscles), 1,000 bootstrap samples are taken from a dataset of body measurements from 507 people.
-A linear model predicting height based on shoulder girth is fit to each bootstrap sample, and the slope is estimated.
-A histogram of these slopes is shown below. [@Heinz:2003]
+4. **Baby's weight and father's age, mathematical test.** Is the father's age useful in predicting the baby's weight?
+ The scatterplot and least squares summary below show the relationship between baby's weight (measured in pounds) and father's age for a random sample of babies.
+ [@data:births14]
+
+ ```{r}
+ ggplot(births14, aes(x = fage, y = weight)) +
+ geom_point() +
+ labs(
+ x = "Father's age",
+ y = "Weight (lbs)"
+ )
+
+ m_weight_fage <- lm(weight ~ fage, data = births14)
+
+ tidy(m_weight_fage) %>%
+ mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) %>%
+ kbl(linesep = "", booktabs = TRUE, align = "lrrrr", digits = 4) %>%
+ kable_styling(bootstrap_options = c("striped", "condensed"),
+ latex_options = "HOLD_position",
+ full_width = FALSE) %>%
+ column_spec(1, width = "10em", monospace = TRUE) %>%
+ column_spec(2:5, width = "5em")
+ ```
+
+ a. What is the predicted weight of a baby whose father is 30 years old?
+
+ b. Do the data provide convincing evidence that the model for predicting baby weights from father's age has a slope different than 0?
+ State the null and alternative hypotheses, report the p-value (using a mathematical model), and state your conclusion.
+
+ c. Based on your conclusion, is father's age a useful predictor of baby's weight?
+
+5. **Body measurements, bootstrap percentile interval.** In order to estimate the slope of the model predicting height based on shoulder girth (circumference of shoulders measured over deltoid muscles), 1,000 bootstrap samples are taken from a dataset of body measurements from 507 people.
+ A linear model predicting height based on shoulder girth is fit to each bootstrap sample, and the slope is estimated.
+ A histogram of these slopes is shown below.
+ [@Heinz:2003]
```{r}
- library(tidyverse)
- library(openintro)
- library(infer)
-
set.seed(47)
bdims %>%
specify(hgt ~ sho_gi) %>%
@@ -107,12 +186,42 @@ A histogram of these slopes is shown below. [@Heinz:2003]
)
```
- a. Using the bootstrap percentile method and the histogram above, find a 98% confidence interval for the slope parameter.
-
- b. Interpret the confidence interval in the context of the problem.
+ a. Using the bootstrap percentile method and the histogram above, find a 98% confidence interval for the slope parameter.
+
+ b. Interpret the confidence interval in the context of the problem.
+
+6. **Baby's weight and father's age, bootstrap percentile interval.** The US Department of Health and Human Services' Centers for Disease Control and Prevention collects information on births recorded in the country.
+ The data used here are a random sample of 1000 births from 2014.
+ Here, we study the relationship between the father's age and the weight of the baby.
+ Below is the bootstrap distribution of the slope statistic from 1,000 different bootstrap samples of the data.
+ [@data:births14]
-1. **Body measurements, standard error bootstrap interval.**
-A linear model is built to predict height based on shoulder girth (circumference of shoulders measured over deltoid muscles), both measured in centimeters. [@Heinz:2003]
+ ```{r}
+ library(tidyverse)
+ library(openintro)
+ library(infer)
+
+ set.seed(47)
+ births14 %>%
+ drop_na() %>%
+ specify(weight ~ fage) %>%
+ generate(reps = 1000, type = "bootstrap") %>%
+ calculate(stat = "slope") %>%
+ ggplot(aes(x = stat)) +
+ geom_histogram(binwidth = 0.005, fill = IMSCOL["green", "full"]) +
+ labs(
+ title = "1,000 bootstrapped slopes",
+ x = "Slope from bootstrapped data",
+ y = "Count"
+ )
+ ```
+
+ a. Using the bootstrap percentile method and the histogram above, find a 95% confidence interval for the slope parameter.
+
+ b. Interpret the confidence interval in the context of the problem.
+
+7. **Body measurements, standard error bootstrap interval.** A linear model is built to predict height based on shoulder girth (circumference of shoulders measured over deltoid muscles), both measured in centimeters.
+ [@Heinz:2003]
    Below are two items.
    The first is the standard linear model output for predicting height from shoulder girth.
    The second is the bootstrap distribution of the slope statistic from 1,000 different bootstrap samples of the data.

    ```{r}
    library(openintro)
    library(infer)
    library(broom)

    lm(hgt ~ sho_gi, data = bdims) %>%
      tidy() %>%
      mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) %>%
      kbl(linesep = "", booktabs = TRUE, align = "lrrrr", digits = 3) %>%
      kable_styling(bootstrap_options = c("striped", "condensed"),
                    latex_options = "HOLD_position",
                    full_width = FALSE) %>%
      column_spec(1, width = "10em", monospace = TRUE) %>%
      column_spec(2:5, width = "5em")

    set.seed(47)
    bdims %>%
      specify(hgt ~ sho_gi) %>%
      generate(reps = 1000, type = "bootstrap") %>%
      calculate(stat = "slope") %>%
      ggplot(aes(x = stat)) +
      geom_histogram(fill = IMSCOL["green", "full"]) +
      labs(
        title = "1,000 bootstrapped slopes",
        x = "Slope from bootstrapped data",
        y = "Count"
      )
    ```
    a. Using the histogram, approximate the standard error of the slope statistic (that is, quantify the variability of the slope statistic from sample to sample).

    b. Find a 98% bootstrap SE confidence interval for the slope parameter.

    c. Interpret the confidence interval in the context of the problem.
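The bootstrap SE method referenced in these parts can also be sketched generically. The Python snippet below (an illustration outside the book's R code, with synthetic values standing in for the bootstrapped slopes) shows the arithmetic: the SE is the standard deviation of the bootstrapped statistics, and the interval is the point estimate plus or minus $z^\star \times SE$, with $z^\star \approx 2.33$ for 98% confidence.

```python
import random
import statistics

# Synthetic stand-in for bootstrapped slopes (illustration only,
# not the bdims data from the exercise).
random.seed(47)
boot_slopes = [random.gauss(1.1, 0.03) for _ in range(1000)]

# Bootstrap SE = standard deviation of the bootstrapped statistics.
se = statistics.stdev(boot_slopes)

point_est = 1.1   # hypothetical observed slope from the original sample
z_star = 2.33     # approximate z* for 98% confidence
ci = (point_est - z_star * se, point_est + z_star * se)
```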
8.  **Baby's weight and father's age, standard error bootstrap interval.** The US Department of Health and Human Services' Centers for Disease Control and Prevention collects information on births recorded in the country.
    The data used here are a random sample of 1,000 births from 2014.
    Here, we study the relationship between the father's age and the weight of the baby.
    [@data:births14]
    Below are two items.
    The first is the standard linear model output for predicting baby's weight (in pounds) from father's age (in years).
    The second is the bootstrap distribution of the slope statistic from 1,000 different bootstrap samples of the data.
    ```{r}
    library(tidyverse)
    library(openintro)
    library(infer)
    library(broom)

    births14 %>%
      drop_na() %>%
      lm(weight ~ fage, data = .) %>%
      tidy() %>%
      mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) %>%
      kbl(linesep = "", booktabs = TRUE, align = "lrrrr", digits = 3) %>%
      kable_styling(bootstrap_options = c("striped", "condensed"),
                    latex_options = "HOLD_position",
                    full_width = FALSE) %>%
      column_spec(1, width = "10em", monospace = TRUE) %>%
      column_spec(2:5, width = "5em")

    set.seed(47)
    births14 %>%
      drop_na() %>%
      specify(weight ~ fage) %>%
      generate(reps = 1000, type = "bootstrap") %>%
      calculate(stat = "slope") %>%
      ggplot(aes(x = stat)) +
      geom_histogram(binwidth = 0.005, fill = IMSCOL["green", "full"]) +
      labs(
        title = "1,000 bootstrapped slopes",
        x = "Slope from bootstrapped data",
        y = "Count"
      )
    ```
    a. Using the histogram, approximate the standard error of the slope statistic (that is, quantify the variability of the slope statistic from sample to sample).

    b. Find a 95% bootstrap SE confidence interval for the slope parameter.

    c. Interpret the confidence interval in the context of the problem.

9.  **Body measurements, conditions.** The scatterplot below shows the residuals (on the y-axis) from the linear model of weight vs. height from a dataset of body measurements from 507 physically active individuals.
    The x-axis is the height of the individuals, in cm.
    [@Heinz:2003]
    ```{r}
    #| out-width: 100%
    #| fig-asp: 0.4
    m_wgt_hgt <- lm(wgt ~ hgt, data = bdims)
    rsq_uo <- round(cor(bdims$wgt, bdims$hgt), 2)^2*100

    m_wgt_hgt |>
      augment() |>
      ggplot(aes(x = hgt, y = .resid)) +
      geom_point(col = IMSCOL["blue", "full"]) +
      geom_smooth(method = "lm", col = IMSCOL["gray", "full"], lty = 2, se = FALSE) +
      xlab("Height (cm)") +
      ylab("residuals")
    ```
    a. For these data, $R^2$ is `r rsq_uo`%.
       What is the value of the correlation coefficient?
       How can you tell if it is positive or negative?
       \[Hint: you may need to look at a previous exercise.\]

    b. Examine the residual plot.
       What do you observe?
       Is a simple least squares fit appropriate for these data?
       Which of the LINE conditions are met or not met?

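The link between $R^2$ and the correlation coefficient that part (a) points at can be written down in one line. The sketch below is a generic Python illustration with hypothetical numbers (not the exercise's data): $r = \pm\sqrt{R^2}$, with the sign inherited from the slope of the fitted line.

```python
import math

# Hypothetical values, not from the exercise data.
r_squared = 0.72   # R^2 expressed as a proportion (i.e., 72%)
slope = 1.02       # hypothetical fitted slope; its sign determines the sign of r

# r = +/- sqrt(R^2), taking the sign of the slope.
r = math.copysign(math.sqrt(r_squared), slope)
```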
10. **Baby's weight and father's age, conditions.** The scatterplot below shows the residuals (on the y-axis) from the linear model of baby's weight (measured in pounds) vs. father's age for a random sample of babies.
    Father's age is on the x-axis.
    [@data:births14]
    ```{r}
    #| out-width: 100%
    #| fig-asp: 0.4
    m_weight_fage_lm <- lm(weight ~ fage, data = births14)
    rsq_uo <- round(cor(births14$weight, births14$fage, use = "pairwise.complete"), 2)^2*100

    m_weight_fage_lm |>
      augment() |>
      ggplot(aes(x = fage, y = .resid)) +
      geom_point(col = IMSCOL["blue", "full"]) +
      geom_smooth(method = "lm", col = IMSCOL["gray", "full"], lty = 2, se = FALSE) +
      xlab("Father's age") +
      ylab("residuals")
    ```
    a. For these data, $R^2$ is `r rsq_uo`%.
       What is the value of the correlation coefficient?
       How can you tell if it is positive or negative?
       \[Hint: you may need to look at a previous exercise.\]

    b. Examine the residual plot.
       What do you observe?
       Is a simple least squares fit appropriate for these data?
       Which of the LINE conditions are met or not met?

11. **Murders and poverty, randomization test.** The following regression output is for predicting annual murders per million (`annual_murders_per_mil`) from percentage living in poverty (`perc_pov`) in a random sample of 20 metropolitan areas.
    Below are two items.
    The first is the standard linear model output for predicting annual murders per million from percentage living in poverty for metropolitan areas.
    The second is a histogram of slopes from 1,000 randomized datasets (1,000 times, `annual_murders_per_mil` was permuted and regressed against `perc_pov`).
    The red vertical line is drawn at the observed slope value produced in the linear model output.
    ```{r}
    lm(annual_murders_per_mil ~ perc_pov, data = murders) %>%
      tidy() %>%
      mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) %>%
      kbl(linesep = "", booktabs = TRUE, align = "lrrrr", digits = 3) %>%
      kable_styling(bootstrap_options = c("striped", "condensed"),
                    latex_options = "HOLD_position",
                    full_width = FALSE) %>%
      column_spec(1, width = "10em", monospace = TRUE) %>%
      column_spec(2:5, width = "5em")

    set.seed(47)
    murders %>%
      specify(annual_murders_per_mil ~ perc_pov) %>%
      hypothesize(null = "independence") %>%
      generate(reps = 1000, type = "permute") %>%
      calculate(stat = "slope") %>%
      ggplot(aes(x = stat)) +
      geom_histogram(binwidth = 0.2, fill = IMSCOL["green", "full"]) +
      labs(
        title = "1,000 randomized slopes",
        x = "Slope from randomly permuted data",
        y = "Count"
      ) +
      geom_vline(xintercept = 2.559, color = IMSCOL["red", "full"], size = 1)
    ```
    a. What are the null and alternative hypotheses for evaluating whether the slope of the model for predicting annual murder rate from poverty percentage is different than 0?

    b. Using the histogram which describes the distribution of slopes when the null hypothesis is true, find the p-value and conclude the hypothesis test in the context of the problem (use words like murder rate and poverty).

    c. Is the conclusion based on the histogram of randomized slopes consistent with the conclusion which would have been obtained using the mathematical model?
       Explain.
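The p-value logic behind a randomization test on a slope can be sketched generically: collect the slopes from the permuted datasets, then find the share that are at least as extreme as the observed slope. The Python snippet below is an illustration outside the book's R code; the null slopes are synthetic, while 2.559 is the observed slope marked in the histogram above.

```python
import random

# Synthetic stand-in for 1,000 slopes from permuted (null) datasets.
random.seed(47)
null_slopes = [random.gauss(0, 0.8) for _ in range(1000)]

observed = 2.559   # observed slope from the regression output

# Two-sided p-value: proportion of null slopes as extreme as observed.
p_value = sum(abs(s) >= abs(observed) for s in null_slopes) / len(null_slopes)
```

If no permuted slope reaches the observed value, the estimated p-value is reported as less than 1 over the number of permutations rather than exactly zero.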
12. **Murders and poverty, mathematical test.** The table below shows the output of a linear model predicting annual murders per million (`annual_murders_per_mil`) from percentage living in poverty (`perc_pov`) in a random sample of 20 metropolitan areas.

    ```{r}
    lm(annual_murders_per_mil ~ perc_pov, data = murders) %>%
      tidy() %>%
      mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) %>%
      kbl(linesep = "", booktabs = TRUE, align = "lrrrr", digits = 4) %>%
      kable_styling(bootstrap_options = c("striped", "condensed"),
                    latex_options = "HOLD_position",
                    full_width = FALSE) %>%
      column_spec(1, width = "10em", monospace = TRUE) %>%
      column_spec(2:5, width = "5em")
    ```
    a. What are the hypotheses for evaluating whether the slope of the model predicting annual murder rate from poverty percentage is different than 0?

    b. State the conclusion of the hypothesis test from part (a) in context of the data.
       What does this say about whether poverty percentage is a useful predictor of annual murder rate?

    c. Calculate a 95% confidence interval for the slope of poverty percentage, and interpret it in context of the data.

    d. Do your results from the hypothesis test and the confidence interval agree?
       Explain.

13. **Murders and poverty, bootstrap percentile interval.** Data on annual murders per million (`annual_murders_per_mil`) and percentage living in poverty (`perc_pov`) are collected from a random sample of 20 metropolitan areas.
    Using these data we want to estimate the slope of the model predicting `annual_murders_per_mil` from `perc_pov`.
    We take 1,000 bootstrap samples of the data and fit a linear model predicting `annual_murders_per_mil` from `perc_pov` to each bootstrap sample.
    A histogram of these slopes is shown below.
    ```{r}
    set.seed(470)
    murders %>%
      specify(annual_murders_per_mil ~ perc_pov) %>%
      generate(reps = 1000, type = "bootstrap") %>%
      calculate(stat = "slope") %>%
      ggplot(aes(x = stat)) +
      geom_histogram(binwidth = 0.2, fill = IMSCOL["green", "full"]) +
      labs(
        title = "1,000 bootstrapped slopes",
        x = "Slope from bootstrapped data",
        y = "Count"
      )
    ```

    a. Using the bootstrap percentile method and the histogram above, find a 90% confidence interval for the slope parameter.
    b. Interpret the confidence interval in the context of the problem.

14. **Murders and poverty, standard error bootstrap interval.** A linear model is built to predict annual murders per million (`annual_murders_per_mil`) from percentage living in poverty (`perc_pov`) in a random sample of 20 metropolitan areas.
    Below are two items.
    The first is the standard linear model output for predicting annual murders per million from percentage living in poverty for metropolitan areas.
    The second is the bootstrap distribution of the slope statistic from 1,000 different bootstrap samples of the data.
    ```{r}
    lm(annual_murders_per_mil ~ perc_pov, data = murders) %>%
      tidy() %>%
      mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) %>%
      kbl(linesep = "", booktabs = TRUE, align = "lrrrr", digits = 3) %>%
      kable_styling(bootstrap_options = c("striped", "condensed"),
                    latex_options = "HOLD_position",
                    full_width = FALSE) %>%
      column_spec(1, width = "10em", monospace = TRUE) %>%
      column_spec(2:5, width = "5em")

    set.seed(470)
    murders %>%
      specify(annual_murders_per_mil ~ perc_pov) %>%
      generate(reps = 1000, type = "bootstrap") %>%
      calculate(stat = "slope") %>%
      ggplot(aes(x = stat)) +
      geom_histogram(binwidth = 0.2, fill = IMSCOL["green", "full"]) +
      labs(
        title = "1,000 bootstrapped slopes",
        x = "Slope from bootstrapped data",
        y = "Count"
      )
    ```
    a. Using the histogram, approximate the standard error of the slope statistic (that is, quantify the variability of the slope statistic from sample to sample).

    b. Find a 90% bootstrap SE confidence interval for the slope parameter.

    c. Interpret the confidence interval in the context of the problem.

15. **Murders and poverty, conditions.** The scatterplot below shows the annual murders per million vs. percentage living in poverty in a random sample of 20 metropolitan areas.
    The second figure plots residuals on the y-axis and percent living in poverty on the x-axis.

    ```{r}
    #| out-width: 100%
    #| fig-asp: 0.4
    murders_lm <- lm(annual_murders_per_mil ~ perc_pov, data = murders)
    rsq_uo <- round(cor(murders$perc_pov, murders$annual_murders_per_mil), 2)^2*100

    murders_lm |>
      augment() |>
      ggplot(aes(y = annual_murders_per_mil, x = perc_pov)) +
      geom_point(col = IMSCOL["blue", "full"]) +
      geom_smooth(method = "lm", col = IMSCOL["gray", "full"], lwd = 1, se = FALSE) +
      xlab("% Living in poverty") +
      ylab("Annual murders\nper million")

    murders_lm |>
      augment() |>
      ggplot(aes(x = perc_pov, y = .resid)) +
      geom_point(col = IMSCOL["blue", "full"]) +
      geom_smooth(method = "lm", col = IMSCOL["gray", "full"], lty = 2, se = FALSE) +
      xlab("% Living in poverty") +
      ylab("residuals")
    ```

    a. For these data, $R^2$ is `r rsq_uo`%.
       What is the value of the correlation coefficient?
       How can you tell if it is positive or negative?
    b. Examine the residual plot.
       What do you observe?
       Is a simple least squares fit appropriate for the data?
       Which of the LINE conditions are met or not met?

16. **I heart cats.** Researchers collected data on heart and body weights of 144 domestic adult cats.
    The table below shows the output of a linear model predicting heart weight (measured in grams) from body weight (measured in kilograms) of these cats.[^_24-ex-inf-model-slr-3]

    ```{r}
    library(MASS)
    library(kableExtra)
    library(broom)

    m_cat <- lm(Hwt ~ Bwt, data = cats)

    tidy(m_cat) %>%
      mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) %>%
      kbl(linesep = "", booktabs = TRUE, align = "lrrrr", digits = 4) %>%
      kable_styling(bootstrap_options = c("striped", "condensed"),
                    latex_options = "HOLD_position",
                    full_width = FALSE) %>%
      column_spec(1, width = "10em", monospace = TRUE) %>%
      column_spec(2:5, width = "5em")
    ```
    a. What are the hypotheses for evaluating whether body weight is positively associated with heart weight in cats?

    b. State the conclusion of the hypothesis test from part (a) in context of the data.

    c. Calculate a 95% confidence interval for the slope of body weight, and interpret it in context of the data.

    d. Do your results from the hypothesis test and the confidence interval agree?
       Explain.

17. **Beer and blood alcohol content.** Many people believe that weight, drinking habits, and many other factors are much more important in predicting blood alcohol content (BAC) than simply considering the number of drinks a person consumed.
    Here we examine data from sixteen student volunteers at Ohio State University who each drank a randomly assigned number of cans of beer.
    These students were evenly divided between men and women, and they differed in weight and drinking habits.
    Thirty minutes later, a police officer measured their blood alcohol content (BAC) in grams of alcohol per deciliter of blood.
    The scatterplot and regression table summarize the findings.[^_24-ex-inf-model-slr-4] [@Malkevitc+Lesser:2008]
    ```{r}
    library(openintro)
    library(tidyverse)
    library(kableExtra)
    library(broom)

    ggplot(bac, aes(x = beers, y = bac)) +
      geom_point(size = 2) +
      labs(
        x = "Cans of beer",
        y = "BAC (grams / deciliter)"
      )

    m_bac <- lm(bac ~ beers, data = bac)
    r_bac <- round(cor(bac$bac, bac$beers), 2)

    tidy(m_bac) %>%
      mutate(p.value = ifelse(p.value < 0.0001, "<0.0001", round(p.value, 4))) %>%
      kbl(linesep = "", booktabs = TRUE, align = "lrrrr", digits = 4) %>%
      kable_styling(bootstrap_options = c("striped", "condensed"),
                    latex_options = "HOLD_position",
                    full_width = FALSE) %>%
      column_spec(1, width = "10em", monospace = TRUE) %>%
      column_spec(2:5, width = "5em")
    ```
    a. Describe the relationship between the number of cans of beer and BAC.

    b. Write the equation of the regression line.
       Interpret the slope and intercept in context.

    c. Do the data provide convincing evidence that drinking more cans of beer is associated with an increase in blood alcohol?
       State the null and alternative hypotheses, report the p-value, and state your conclusion.

    d. The correlation coefficient for number of cans of beer and BAC is `r r_bac`.
       Calculate $R^2$ and interpret it in context.

    e. Suppose we visit a bar, ask people how many drinks they have had, and also take their BAC.
       Do you think the relationship between number of drinks and BAC would be as strong as the relationship found in the Ohio State study?
18. **Urban homeowners, conditions.** The scatterplot below shows the percent of families who own their home vs. the percent of the population living in urban areas.
    [@data:urbanOwner] There are 52 observations, each corresponding to a state in the US.
    Puerto Rico and District of Columbia are also included.
    The second figure plots residuals on the y-axis and percent of the population living in urban areas on the x-axis.

    ```{r}
    #| out-width: 100%
    #| fig-asp: 0.4
    urban_owner_lm <- lm(pct_owner_occupied ~ poppct_urban, data = urban_owner)
    rsq_uo <- round(cor(urban_owner$poppct_urban, urban_owner$pct_owner_occupied), 2)^2*100

    urban_owner_lm |>
      augment() |>
      ggplot(aes(y = pct_owner_occupied, x = poppct_urban)) +
      geom_point(col = IMSCOL["blue", "full"]) +
      geom_smooth(method = "lm", col = IMSCOL["gray", "full"], lwd = 1, se = FALSE) +
      xlab("% Urban population") +
      ylab("% Who own home")

    urban_owner_lm |>
      augment() |>
      ggplot(aes(x = poppct_urban, y = .resid)) +
      geom_point(col = IMSCOL["blue", "full"]) +
      geom_smooth(method = "lm", col = IMSCOL["gray", "full"], lty = 2, se = FALSE) +
      xlab("% Urban population") +
      ylab("residuals")
    ```
    a. For these data, $R^2$ is `r rsq_uo`%.
       What is the value of the correlation coefficient?
       How can you tell if it is positive or negative?

    b. Examine the residual plot.
       What do you observe?
       Is a simple least squares fit appropriate for the data?
       Which of the LINE conditions are met or not met?

[^_24-ex-inf-model-slr-1]: The [`bdims`](http://openintrostat.github.io/openintro/reference/bdims.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.

[^_24-ex-inf-model-slr-2]: The [`births14`](http://openintrostat.github.io/openintro/reference/births14.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.

[^_24-ex-inf-model-slr-3]: The [`cats`](https://stat.ethz.ch/R-manual/R-patched/library/MASS/html/cats.html) data used in this exercise can be found in the [**MASS**](https://cran.r-project.org/web/packages/MASS/index.html) R package.

[^_24-ex-inf-model-slr-4]: The [`bac`](http://openintrostat.github.io/openintro/reference/bac.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.