chapter 4 edits...
hardin47 committed Feb 2, 2024
1 parent b01d94a commit 94d5340
Showing 2 changed files with 61 additions and 42 deletions.
72 changes: 40 additions & 32 deletions exercises/_04-ex-explore-categorical.qmd
@@ -1,4 +1,4 @@
1. **Antibiotic use in children.** The bar plot and the pie chart below show the distribution of pre-existing medical conditions of children involved in a study on the optimal duration of antibiotic use in treatment of tracheitis, which is an upper respiratory infection.[^antibiotic_use_children_q-1]
1. **Antibiotic use in children.** The bar plot and the pie chart below show the distribution of pre-existing medical conditions of children involved in a study on the optimal duration of antibiotic use in treatment of tracheitis, which is an upper respiratory infection.[^_04-ex-explore-categorical-1]

```{r}
#| out-width: 100%
@@ -28,10 +28,8 @@
c. Which graph would you prefer to use for displaying these categorical data?
[^antibiotic_use_children_q-1]: The [`antibiotics`](http://openintrostat.github.io/openintro/reference/antibiotics.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
1. **Views on immigration.** Nine-hundred and ten (910) randomly sampled registered voters from Tampa, FL were asked if they thought workers who have illegally entered the US should be (i) allowed to keep their jobs and apply for US citizenship, (ii) allowed to keep their jobs as temporary guest workers but not allowed to apply for US citizenship, or (iii) lose their jobs and have to leave the country.
The results of the survey by political ideology are shown below.[^immigration_contingency_table_q-1]
2. **Views on immigration.** Nine-hundred and ten (910) randomly sampled registered voters from Tampa, FL were asked if they thought workers who have illegally entered the US should be (i) allowed to keep their jobs and apply for US citizenship, (ii) allowed to keep their jobs as temporary guest workers but not allowed to apply for US citizenship, or (iii) lose their jobs and have to leave the country.
The results of the survey by political ideology are shown below.[^_04-ex-explore-categorical-2]
```{r}
immigration |>
@@ -61,9 +59,7 @@
f. Conjecture other possible variables that might explain the potential relationship between these two variables.
[^immigration_contingency_table_q-1]: The [`immigration`](http://openintrostat.github.io/openintro/reference/immigration.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
1. **Black Lives Matter.** A Washington Post-Schar School poll conducted in the United States in June 2020, among a random national sample of 1,006 adults, asked respondents whether they support or oppose protests following George Floyd's killing that have taken place in cities across the US.
3. **Black Lives Matter.** A Washington Post-Schar School poll conducted in the United States in June 2020, among a random national sample of 1,006 adults, asked respondents whether they support or oppose protests following George Floyd's killing that have taken place in cities across the US.
The survey also collected information on the age of the respondents.
[@survey:blmWaPoScar:2020] The results are summarized in the stacked bar plot below.
@@ -108,7 +104,7 @@
b. Conjecture other possible variables that might explain the potential association between these two variables.
1. **Raise taxes.** A random sample of registered voters nationally were asked whether they think it's better to raise taxes on the rich or raise taxes on the poor.
4. **Raise taxes.** A random sample of registered voters nationally were asked whether they think it's better to raise taxes on the rich or raise taxes on the poor.
The survey also collected information on the political party affiliation of the respondents.
[@survey:raiseTaxes:2015]
@@ -147,11 +143,11 @@
b. Conjecture other possible variables that might explain the potential association between these two variables.
1. **Heart transplant data display.** The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan.
5. **Heart transplant data display.** The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan.
Each patient entering the program was officially designated a heart transplant candidate, meaning that he was gravely ill and might benefit from a new heart.
Patients were randomly assigned into treatment and control groups.
Patients in the treatment group received a transplant, and those in the control group did not.
The visualization below displays two different versions of the data.[^heart_transplant_display_q-1]
The visualization below displays two different versions of the data.[^_04-ex-explore-categorical-3]
[@Turnbull+Brown+Hu:1974]
```{r}
@@ -186,9 +182,7 @@
c. For the Heart Transplant Study which of those aspects would be more important to display?
That is, which bar plot would be better as a data visualization?
[^heart_transplant_display_q-1]: The [`heart_transplant`](http://openintrostat.github.io/openintro/reference/heart_transplant.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
1. **Shipping holiday gifts data display.** A local news survey asked 500 randomly sampled Los Angeles residents which shipping carrier they prefer to use for shipping holiday gifts.
6. **Shipping holiday gifts data display.** A local news survey asked 500 randomly sampled Los Angeles residents which shipping carrier they prefer to use for shipping holiday gifts.
The table below shows the distribution of responses by age group as well as the expected counts for each cell (shown in italics).
```{r}
@@ -238,7 +232,7 @@
d. FedEx would like to reach out to grow their market share so as to balance the age demographics of FedEx users.
To what age group should FedEx market?
1. **Meat consumption and Life Expectancy.** In data collected for @You:2022, total meat intake is associated with life expectancy (at birth) in 175 countries.
7. **Meat consumption and Life Expectancy.** In data collected for @You:2022, total meat intake is associated with life expectancy (at birth) in 175 countries.
Meat intake is measured in kg per capita per year (averaged over 2011 to 2013).
The two ridge plots show an association between income and meat consumption (higher income countries tend to eat more meat) and an association between income and life expectancy (higher income countries have higher life expectancy).
@@ -286,24 +280,28 @@
That is, can you tell if countries with low meat consumption have low life expectancy?
Explain.
b. Let's assume that you had a plot comparing meat consumption and life expectancy and they **do** seem associated.
b. Let's assume that you had a plot comparing meat consumption and life expectancy, and they **do** seem associated.
Your friend says that the plot shows that high meat consumption leads to a longer life.
You correctly say, no, we can't tell if there is a causal relationship because the relationship is confounded by income level.
Explain what you mean.
c. How can you investigate the relationship between meat consumption and life expectancy in the presence of confounding variables (like income)?
1. **Florence Nightingale.**
Florence Nightingale was a nurse in the Crimean War and an early statistician. In her notes, she opined "In comparing the deaths of one hospital with those of another, any statistics are justly considered absolutely valueless which do not give the ages, the sexes and the diseases of all the cases." [@nightingale:1859]
8. **Florence Nightingale.** Florence Nightingale was a nurse in the Crimean War and an early statistician.
In her notes, she opined "In comparing the deaths of one hospital with those of another, any statistics are justly considered absolutely valueless which do not give the ages, the sexes and the diseases of all the cases." [@nightingale:1859]
a. Nightingale describes three confounding variables to consider when comparing death rates across hospitals. What are they? Describe what makes each variable potentially confounding.
a. Nightingale describes three confounding variables to consider when comparing death rates across hospitals.
What are they?
Describe what makes each variable potentially confounding.
b. Provide two additional potential confounding variables for this situation. Check to make sure that the variables are associated with both the explanatory variable (hospital) and the response variable (death).
b. Provide two additional potential confounding variables for this situation.
Check to make sure that the variables are associated with both the explanatory variable (hospital) and the response variable (death).
c. Why does Nightingale say that the statistics are "valueless" if given without being broken down by age, sex and disease? Explain.
c. Why does Nightingale say that the statistics are "valueless" if given without being broken down by age, sex and disease?
Explain.
1. **On-time arrivals.** Consider all of the flights out of New York City in 2013 that flew into San Francisco (SFO), Los Angeles (LAX), or Puerto Rico (BQN) on the following two airlines: JetBlue (B6) or United Airlines (UA).
Below are the tabulated counts for the number of flights `on time` and `delayed` for each airline into each city.[^flights_cat_sp_q-1]
9. **On-time arrivals.** Consider all of the flights out of New York City in 2013 that flew into San Francisco (SFO), Los Angeles (LAX), or Puerto Rico (BQN) on the following two airlines: JetBlue (B6) or United Airlines (UA).
Below are the tabulated counts for the number of flights `on time` and `delayed` for each airline into each city.[^_04-ex-explore-categorical-4]
```{r}
flights |>
@@ -312,7 +310,11 @@ Florence Nightingale was a nurse in the Crimean War and an early statistician.
drop_na(arr_delay) |>
mutate(status = ifelse(arr_delay <= 0, "on time", "delayed")) |>
group_by(dest, carrier, status) |>
summarize(count = n())
summarize(count = n()) |>
kbl(linesep = "", booktabs = TRUE) |>
kable_styling(bootstrap_options = c("striped", "condensed"),
latex_options = "HOLD_position",
full_width = FALSE)
```
a. What percent of all JetBlue flights were delayed?
@@ -322,13 +324,9 @@ Florence Nightingale was a nurse in the Crimean War and an early statistician.
b. For each of the three airports, find the percent of delayed flights for each of JetBlue and United (you should have 6 numbers).
c. United has a higher proportion of delayed flights for each of the three cities, yet JetBlue has a higher proportion of delayed flights overall.
Explain, using the data counts provided, how the seeming paradox could happen.[^flights_cat_sp_q-2]
Explain, using the data counts provided, how the seeming paradox could happen.[^_04-ex-explore-categorical-5]
[^flights_cat_sp_q-1]: The `flights` data used in this exercise can be found in the [**nycflights13**](https://github.com/tidyverse/nycflights13) R package.
[^flights_cat_sp_q-2]: The conundrum is known as Simpson's Paradox and is explored in @sec-data-applications.
1. **US House of Representatives.** The US House of Representatives is dominated by two political parties: Democrats and Republicans.
10. **US House of Representatives.** The US House of Representatives is dominated by two political parties: Democrats and Republicans.
Democrats are thought to be the more liberal party and Republicans are considered to be the more conservative party.
However, within each party there is an internal spectrum of liberal to conservative.
For example, conservative Democrats and liberal Republicans would be labeled moderate.
@@ -344,6 +342,16 @@ Florence Nightingale was a nurse in the Crimean War and an early statistician.
Explain.
d. In what settings would you report the outcome of the change in House membership to be more conservative?
And in what settings would you report the outcome of the change in House membership to be more liberal?[^congress_q-1]
And in what settings would you report the outcome of the change in House membership to be more liberal?[^_04-ex-explore-categorical-6]
[^_04-ex-explore-categorical-1]: The [`antibiotics`](http://openintrostat.github.io/openintro/reference/antibiotics.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
[^_04-ex-explore-categorical-2]: The [`immigration`](http://openintrostat.github.io/openintro/reference/immigration.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
[^_04-ex-explore-categorical-3]: The [`heart_transplant`](http://openintrostat.github.io/openintro/reference/heart_transplant.html) data used in this exercise can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
[^_04-ex-explore-categorical-4]: The `flights` data used in this exercise can be found in the [**nycflights13**](https://github.com/tidyverse/nycflights13) R package.
[^_04-ex-explore-categorical-5]: The conundrum is known as Simpson's Paradox and is explored in @sec-data-applications.
[^congress_q-1]: The conundrum is known as Simpson's Paradox and is explored in @sec-data-applications.
[^_04-ex-explore-categorical-6]: The conundrum is known as Simpson's Paradox and is explored in @sec-data-applications.
31 changes: 21 additions & 10 deletions exercises/_04-sa-explore-categorical.qmd
@@ -1,14 +1,25 @@
1. \(a) We see the order of the categories and the relative frequencies in the bar plot. (b) There are no features that are apparent in the pie chart but not in the bar plot. (c) We usually prefer to use a bar plot as we can also see the relative frequencies of the categories in this graph.
\addtocounter{enumi}{1}
1. \(a\) We see the order of the categories and the relative frequencies in the bar plot. (b) There are no features that are apparent in the pie chart but not in the bar plot. (c) We usually prefer to use a bar plot as we can also see the relative frequencies of the categories in this graph.

1. \(a) The horizontal locations at which the age groups break into the various opinion levels differ, which indicates that likelihood of supporting protests varies by age group. Two variables may be associated. (b) Answers may vary. Political ideology/leaning and education level.
\addtocounter{enumi}{1}
\addtocounter{enumi}{1}

1. (a) Number of participants in each group. (b) Proportion of survival. (c) The standardized bar plot should be displayed as a way to visualize the survival improvement in the treatment versus the control group.
\addtocounter{enumi}{1}
2. \(a\) The horizontal locations at which the age groups break into the various opinion levels differ, which indicates that likelihood of supporting protests varies by age group. Two variables may be associated. (b) Answers may vary. Political ideology/leaning and education level.

1. (a) The ridge plots do not tell us about the relationship between meat consumption and life expectancy. While it is true that the high income group of countries has the highest meat consumption and the highest life expectancy, we can't, for example, differentiate meat consumption across the low and middle income groups (so as to connect to life expectancy). Additionally, we don't know anything about the relationship between meat consumption and life expectancy *within* an income group. (b) When a relationship is confounded we cannot determine the causal mechanism. We don't know if the longer life expectancy is due to meat consumption or due to higher income (which comes with many other life-extending practices). (c) In order to investigate a specific confounding variable, first break the data into categories according to that confounding variable (here, income). Then look at the relationship of interest (here meat consumption and life expectancy) separately for each of the levels of the confounding variable (income).
\addtocounter{enumi}{1}
\addtocounter{enumi}{1}

1. (a) 41% of the JetBlue flights are delayed. 40.7% of the United Airlines flights are delayed. (b) For SFO: JetBlue had 39.7% delayed, United had 40% delayed (United had more delayed flights). For LAX: JetBlue had 40.1% delayed, United had 41% delayed (United had more delayed flights). For BQN: JetBlue had 45.7% delayed, United had 48.8% delayed (United had more delayed flights). (c) Note that JetBlue had substantially more flights than United out of BQN (where there was a high delay percentage). United had substantially more flights than JetBlue out of SFO and LAX, both of which had low delay percentages. So JetBlue's overall percentage delay is bumped up due to the BQN flights, and United's overall percentage delay is bumped down due to the SFO and LAX flights.
\addtocounter{enumi}{1}
3.

(a) Number of participants in each group. (b) Proportion of survival. (c) The standardized bar plot should be displayed as a way to visualize the survival improvement in the treatment versus the control group.

\addtocounter{enumi}{1}

4.

(a) The ridge plots do not tell us about the relationship between meat consumption and life expectancy. While it is true that the high income group of countries has the highest meat consumption and the highest life expectancy, we can't, for example, differentiate meat consumption across the low and middle income groups (so as to connect to life expectancy). Additionally, we don't know anything about the relationship between meat consumption and life expectancy *within* an income group. (b) When a relationship is confounded we cannot determine the causal mechanism. We don't know if the longer life expectancy is due to meat consumption or due to higher income (which comes with many other life-extending practices). (c) In order to investigate a specific confounding variable, first break the data into categories according to that confounding variable (here, income). Then look at the relationship of interest (here meat consumption and life expectancy) separately for each of the levels of the confounding variable (income).
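
The stratification idea in part (c) can be sketched in a few lines. This is only an illustration, written in Python rather than the R used by the exercises: the `records` values are made-up numbers chosen to make the point visible, not the data from the cited study, and the helper names (`pearson`, `within_group_correlations`, `pooled_correlation`) are hypothetical.

```python
from collections import defaultdict
from math import sqrt

# HYPOTHETICAL records: (income group, meat intake in kg per capita per year,
# life expectancy in years). These are illustrative values, not the study data.
records = [
    ("low", 10, 60), ("low", 20, 62), ("low", 30, 60),
    ("middle", 40, 70), ("middle", 50, 72), ("middle", 60, 70),
    ("high", 80, 80), ("high", 90, 82), ("high", 100, 80),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def pooled_correlation(rows):
    """Correlation of meat intake and life expectancy ignoring income."""
    return pearson([r[1] for r in rows], [r[2] for r in rows])


def within_group_correlations(rows):
    """Correlation of meat intake and life expectancy inside each income group."""
    groups = defaultdict(list)
    for income, meat, life in rows:
        groups[income].append((meat, life))
    return {
        income: pearson([m for m, _ in pairs], [l for _, l in pairs])
        for income, pairs in groups.items()
    }


print("pooled:", round(pooled_correlation(records), 3))
for income, r in within_group_correlations(records).items():
    print(income, round(r, 3))
```

With these made-up values the pooled association is strong while the association inside each income group vanishes, which is the pattern a confounded relationship can produce. Real data would rarely be this clean.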

\addtocounter{enumi}{1}

5.

(a) 41% of the JetBlue flights are delayed. 40.7% of the United Airlines flights are delayed. (b) For SFO: JetBlue had 39.7% delayed, United had 40% delayed (United had more delayed flights). For LAX: JetBlue had 40.1% delayed, United had 41% delayed (United had more delayed flights). For BQN: JetBlue had 45.7% delayed, United had 48.8% delayed (United had more delayed flights). (c) Note that JetBlue had substantially more flights than United out of BQN (where there was a high delay percentage). United had substantially more flights than JetBlue out of SFO and LAX, both of which had low delay percentages. So JetBlue's overall percentage delay is bumped up due to the BQN flights, and United's overall percentage delay is bumped down due to the SFO and LAX flights.
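
The argument in part (c) amounts to saying that each airline's overall delay rate is a weighted average of its per-airport rates, with the flight counts as weights. A minimal sketch of that arithmetic follows, in Python rather than the R used by the exercises. The per-airport rates are the ones quoted in part (b); the flight counts are hypothetical round numbers chosen only to mimic the imbalance described in part (c), not the actual counts from the `flights` data, and `overall_rate` is an illustrative helper name.

```python
# Per-airport delay rates as quoted in part (b) of the solution, as fractions.
delay_rate = {
    "B6": {"SFO": 0.397, "LAX": 0.401, "BQN": 0.457},
    "UA": {"SFO": 0.400, "LAX": 0.410, "BQN": 0.488},
}

# HYPOTHETICAL flight counts (assumption, not the real tabulated counts):
# JetBlue flies relatively more into BQN, United far more into SFO and LAX.
n_flights = {
    "B6": {"SFO": 1000, "LAX": 1600, "BQN": 600},
    "UA": {"SFO": 6800, "LAX": 5800, "BQN": 300},
}


def overall_rate(carrier):
    """Overall delay rate: per-airport rates weighted by number of flights."""
    total = sum(n_flights[carrier].values())
    delayed = sum(
        delay_rate[carrier][airport] * n
        for airport, n in n_flights[carrier].items()
    )
    return delayed / total


for carrier in ("B6", "UA"):
    print(carrier, round(100 * overall_rate(carrier), 1), "% delayed overall")
```

Under these assumed counts United is worse at every airport, yet JetBlue comes out worse overall, because a larger share of JetBlue's flights go to the airport with the highest delay rate.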

\addtocounter{enumi}{1}
