Commit

change seed
LucyMcGowan committed Mar 12, 2024
1 parent 3f0cfc0 commit 0f80fca
Showing 2 changed files with 8 additions and 8 deletions.
@@ -1,7 +1,7 @@
{
-"hash": "d1177155bac4bb97379bc825b019a6e1",
+"hash": "ed66b179d5e23f0ed356462b4a178e31",
"result": {
"markdown": "---\ntitle: \"A set.seed() + ggplot2 adventure\"\nauthor: \"Lucy D'Agostino McGowan\"\ndate: '2018-01-22'\nslug: a-set-seed-ggplot2-adventure\ncategories: [rstats, ggplot2]\ntags: [rstats, ggplot2]\ndescription: \"Recently I tweeted a small piece of advice re: when to set a seed in your script. Jenny pointed out that this may be blog post-worthy, so here we are!\"\n---\n\n\nRecently I tweeted a small piece of advice re: when to set a seed in your script: \n\n> tip: always `set.seed()` AFTER loading ggplot2\n\n[Jenny](http://twitter.com/JennyBryan) pointed out that this may be blog post-worthy, so here we are! \n\n# Background\n\n::: column-margin\n![](https://media1.giphy.com/media/JUh0yTz4h931K/giphy.gif)\n:::\n\nA little bit about where this tweet came from: I was out to lunch with my friend [Jonathan](https://twitter.com/JJ_Chipman) -- over our scrumptious Thai food, he mentioned he was having some simulation trouble. In particular, some of his simulations were breaking, but when he tried to run the script locally, everything seemed fine. He was using the same seed in both instances, and in fact was running the exact same script -- what could be causing this discrepancy! I recalled that ggplot2 would generate a random message about 10% the time, and asked whether he thought loading this package could somehow be affecting his simulation results. After some further investigation 🕵️‍♂️, it looked like indeed this may be the culprit!\n\n# Why did this happen?\n\nThe `set.seed()` function sets the starting number used to generate a sequence of random numbers -- it ensures that you get the same result if you start with that same seed each time you run the same process. 
For example, if I use the `sample()` function immediately after setting a seed, I will always get the same sample.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(1)\nsample(3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1 2 3\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(1)\nsample(3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1 2 3\n```\n\n\n:::\n:::\n\n\nIf I run `sample()` twice after setting a seed, however, I would not expect them to be the same. I'd expect the first result to match those above, and the second to be different.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(1)\nsample(3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1 2 3\n```\n\n\n:::\n\n```{.r .cell-code}\nsample(3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1 2 3\n```\n\n\n:::\n:::\n\n\nThe second is different because I have already performed one random process, so now my starting point prior to running the latter `sample()` function is no longer `1`. \n\nThere is a small function in ggplot2 that runs when the library is loaded for the first time. \n\n::: column-marign\nNote: This is a little different than this code looks now, it has been updated since this discussion began!\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code}\n.onAttach <- function(...) {\n if (!interactive() || stats::runif(1) > 0.1) return()\n\n tips <- c(\n \"RStudio Community is a great place to get help: https://community.rstudio.com/c/tidyverse.\",\n \"Find out what's changed in ggplot2 at https://github.com/tidyverse/ggplot2/releases.\",\n \"Use suppressPackageStartupMessages() to eliminate package startup messages.\",\n \"Need help? Try Stackoverflow: https://stackoverflow.com/tags/ggplot2.\",\n \"Need help getting started? Try the cookbook for R: http://www.cookbook-r.com/Graphs/\",\n \"Want to understand how all the pieces fit together? 
See the R for Data Science book: http://r4ds.had.co.nz/\"\n )\n\n tip <- sample(tips, 1)\n packageStartupMessage(paste(strwrap(tip), collapse = \"\\n\"))\n}\n```\n:::\n\n\nThis function has a `sample()` call, which will move the starting place of your random sequence of numbers. The main piece that caused Jonathan some 😫 is the `!interactive()` logic, which only runs the remainder of the code if the session is interactive. Another thing that can cause a bit of confusion is this `.onAttach()` function is only run the _first_ time the library is loaded, so if I run what looks like the exact same code twice during the same session, I can get different results. For example,\n\n::: column-margin\nNote: If you are quite clever, you will notice that this is *not* an interactive document and therefore neither the first nor the second chunk should run `.onAttach()` -- this is true! I've run them interactively and included the output for demonstration purposes 🙊.\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(1)\nlibrary(ggplot2)\nsample(3)\n```\n:::\n\n<pre><code>## [1] 2 3 1</code></pre>\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(1)\nlibrary(ggplot2)\nsample(3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1 2 3\n```\n\n\n:::\n:::\n\n\nNotice the second chunk gave us the same result as above, but the first chunk was different. That is because the first chunk runs the `.onAttach()` function between when I set my seed and when I drew my sample. \n\n# Reproducibility crisis?\n\n::: column-margin\n![](https://metrouk2.files.wordpress.com/2017/04/phew.gif?w=620&h=392&crop=1)\n:::\nIt turns out this isn't cause for concern re: reproducibility, thanks to Jim Hester's new patch to ggplot2 _phew!_ Prior to this update, non-interactive scripts will be reproducible (if always run non-interactively), however interactive scripts can cause some issues, as shown above. 
In general, I like to provide future Lucy with as much assistance as possible, so I will likely avoid setting seeds prior to loading packages on the off chance that it will make my debugging trickier in the future.\n\n# What should I do?\n\n* For this particular issue, [Jim Hester](http://twitter.com/jimhester_) has already patched the development version of [ggplot2](https://github.com/tidyverse/ggplot2/pull/2409) to preserve your seed 🌱 if you set it before loading ggplot2 (he did it with a slick function, `withr::with_preserve_seed`, love it!), so you can go [download the development version](https://github.com/tidyverse/ggplot2) and set your seed wherever you please. \n* In general, it seems somewhat prudent to set your seeds after loading packages 📦, as it can be tricky to know exactly what is going on under the hood. The wise R-Lady [Steph Locke](https://twitter.com/SteffLocke) advised in a conversation on this topic to generally try to set seeds as close to the random component as possible to avoid any confusion -- this seems like easy and good advice to follow 👯!\n\n# What have we learned?\n\n* I've learned a lot just from thinking through where to set different parts of my code and how that can affect things downstream \n* We now know a bit more about how seeds work! \n* We've learned about the `withr::with_preserve_seed()` function 🎉 \n* We've seen the potential consequences of changing the global state in a package -- Jenny recently [added this as an issue to discuss](https://github.com/hadley/r-pkgs/issues/447) in a future version of r-pkgs, which she eloquently summarizes as \"don't touch things that don't belong to you and if you have to, you need to be super careful to wipe all your sticky fingerprints off everything\" \n* The #rstats community is so helpful and responsive! A small debugging situation led to lots of helpful advice and a quick fix from Jim 👷",
"markdown": "---\ntitle: \"A set.seed() + ggplot2 adventure\"\nauthor: \"Lucy D'Agostino McGowan\"\ndate: '2018-01-22'\nslug: a-set-seed-ggplot2-adventure\ncategories: [rstats, ggplot2]\ntags: [rstats, ggplot2]\ndescription: \"Recently I tweeted a small piece of advice re: when to set a seed in your script. Jenny pointed out that this may be blog post-worthy, so here we are!\"\n---\n\n\nRecently I tweeted a small piece of advice re: when to set a seed in your script: \n\n> tip: always `set.seed()` AFTER loading ggplot2\n\n[Jenny](http://twitter.com/JennyBryan) pointed out that this may be blog post-worthy, so here we are! \n\n# Background\n\n::: column-margin\n![](https://media1.giphy.com/media/JUh0yTz4h931K/giphy.gif)\n:::\n\nA little bit about where this tweet came from: I was out to lunch with my friend [Jonathan](https://twitter.com/JJ_Chipman) -- over our scrumptious Thai food, he mentioned he was having some simulation trouble. In particular, some of his simulations were breaking, but when he tried to run the script locally, everything seemed fine. He was using the same seed in both instances, and in fact was running the exact same script -- what could be causing this discrepancy! I recalled that ggplot2 would generate a random message about 10% the time, and asked whether he thought loading this package could somehow be affecting his simulation results. After some further investigation 🕵️‍♂️, it looked like indeed this may be the culprit!\n\n# Why did this happen?\n\nThe `set.seed()` function sets the starting number used to generate a sequence of random numbers -- it ensures that you get the same result if you start with that same seed each time you run the same process. 
For example, if I use the `sample()` function immediately after setting a seed, I will always get the same sample.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(7)\nsample(3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 2 1 3\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(7)\nsample(3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 2 1 3\n```\n\n\n:::\n:::\n\n\nIf I run `sample()` twice after setting a seed, however, I would not expect them to be the same. I'd expect the first result to match those above, and the second to be different.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(7)\nsample(3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 2 1 3\n```\n\n\n:::\n\n```{.r .cell-code}\nsample(3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3 2 1\n```\n\n\n:::\n:::\n\n\nThe second is different because I have already performed one random process, so now my starting point prior to running the latter `sample()` function is no longer `1`. \n\nThere is a small function in ggplot2 that runs when the library is loaded for the first time. \n\n::: column-marign\nNote: This is a little different than this code looks now, it has been updated since this discussion began!\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code}\n.onAttach <- function(...) {\n if (!interactive() || stats::runif(1) > 0.1) return()\n\n tips <- c(\n \"RStudio Community is a great place to get help: https://community.rstudio.com/c/tidyverse.\",\n \"Find out what's changed in ggplot2 at https://github.com/tidyverse/ggplot2/releases.\",\n \"Use suppressPackageStartupMessages() to eliminate package startup messages.\",\n \"Need help? Try Stackoverflow: https://stackoverflow.com/tags/ggplot2.\",\n \"Need help getting started? Try the cookbook for R: http://www.cookbook-r.com/Graphs/\",\n \"Want to understand how all the pieces fit together? 
See the R for Data Science book: http://r4ds.had.co.nz/\"\n )\n\n tip <- sample(tips, 1)\n packageStartupMessage(paste(strwrap(tip), collapse = \"\\n\"))\n}\n```\n:::\n\n\nThis function has a `sample()` call, which will move the starting place of your random sequence of numbers. The main piece that caused Jonathan some 😫 is the `!interactive()` logic, which only runs the remainder of the code if the session is interactive. Another thing that can cause a bit of confusion is this `.onAttach()` function is only run the _first_ time the library is loaded, so if I run what looks like the exact same code twice during the same session, I can get different results. For example,\n\n::: column-margin\nNote: If you are quite clever, you will notice that this is *not* an interactive document and therefore neither the first nor the second chunk should run `.onAttach()` -- this is true! I've run them interactively and included the output for demonstration purposes 🙊.\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(7)\nlibrary(ggplot2)\nsample(3)\n```\n:::\n\n<pre><code>[1] 3 2 1</code></pre>\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(7)\nlibrary(ggplot2)\nsample(3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 2 1 3\n```\n\n\n:::\n:::\n\n\nNotice the second chunk gave us the same result as above, but the first chunk was different. That is because the first chunk runs the `.onAttach()` function between when I set my seed and when I drew my sample. \n\n# Reproducibility crisis?\n\n::: column-margin\n![](https://metrouk2.files.wordpress.com/2017/04/phew.gif?w=620&h=392&crop=1)\n:::\nIt turns out this isn't cause for concern re: reproducibility, thanks to Jim Hester's new patch to ggplot2 _phew!_ Prior to this update, non-interactive scripts will be reproducible (if always run non-interactively), however interactive scripts can cause some issues, as shown above. 
In general, I like to provide future Lucy with as much assistance as possible, so I will likely avoid setting seeds prior to loading packages on the off chance that it will make my debugging trickier in the future.\n\n# What should I do?\n\n* For this particular issue, [Jim Hester](http://twitter.com/jimhester_) has already patched the development version of [ggplot2](https://github.com/tidyverse/ggplot2/pull/2409) to preserve your seed 🌱 if you set it before loading ggplot2 (he did it with a slick function, `withr::with_preserve_seed`, love it!), so you can go [download the development version](https://github.com/tidyverse/ggplot2) and set your seed wherever you please. \n* In general, it seems somewhat prudent to set your seeds after loading packages 📦, as it can be tricky to know exactly what is going on under the hood. The wise R-Lady [Steph Locke](https://twitter.com/SteffLocke) advised in a conversation on this topic to generally try to set seeds as close to the random component as possible to avoid any confusion -- this seems like easy and good advice to follow 👯!\n\n# What have we learned?\n\n* I've learned a lot just from thinking through where to set different parts of my code and how that can affect things downstream \n* We now know a bit more about how seeds work! \n* We've learned about the `withr::with_preserve_seed()` function 🎉 \n* We've seen the potential consequences of changing the global state in a package -- Jenny recently [added this as an issue to discuss](https://github.com/hadley/r-pkgs/issues/447) in a future version of r-pkgs, which she eloquently summarizes as \"don't touch things that don't belong to you and if you have to, you need to be super careful to wipe all your sticky fingerprints off everything\" \n* The #rstats community is so helpful and responsive! A small debugging situation led to lots of helpful advice and a quick fix from Jim 👷",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
12 changes: 6 additions & 6 deletions posts/2018-01-22-a-set-seed-ggplot2-adventure/index.qmd
@@ -27,19 +27,19 @@ A little bit about where this tweet came from: I was out to lunch with my friend
The `set.seed()` function sets the starting number used to generate a sequence of random numbers -- it ensures that you get the same result if you start with that same seed each time you run the same process. For example, if I use the `sample()` function immediately after setting a seed, I will always get the same sample.

```{r}
-set.seed(1)
+set.seed(7)
sample(3)
```

```{r}
-set.seed(1)
+set.seed(7)
sample(3)
```

If I run `sample()` twice after setting a seed, however, I would not expect them to be the same. I'd expect the first result to match those above, and the second to be different.

```{r}
-set.seed(1)
+set.seed(7)
sample(3)
sample(3)
```
@@ -77,15 +77,15 @@ Note: If you are quite clever, you will notice that this is *not* an interactive
:::

```{r, eval = FALSE}
-set.seed(1)
+set.seed(7)
library(ggplot2)
sample(3)
```
-<pre><code>## [1] 2 3 1</code></pre>
+<pre><code>[1] 3 2 1</code></pre>


```{r}
-set.seed(1)
+set.seed(7)
library(ggplot2)
sample(3)
```
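
The behaviour the changed chunks are meant to demonstrate can be checked without relying on any particular printed output. The following is a minimal R sketch, not code from the post or from ggplot2: `startup_hook()` is a hypothetical stand-in for a package hook that consumes random numbers (such as the old `.onAttach()` quoted in the post), and the `withr` check only runs if that package happens to be installed.

```r
# Same seed, same draw: two calls made right after set.seed(7) must agree.
set.seed(7)
first <- sample(3)
set.seed(7)
second <- sample(3)
stopifnot(identical(first, second))

# Hypothetical stand-in for a startup hook that draws a random number.
startup_hook <- function() invisible(stats::runif(1))

# Any draw between set.seed() and sample() advances the RNG state.
set.seed(7)
state_clean <- .Random.seed
startup_hook()
state_after <- .Random.seed
stopifnot(!identical(state_clean, state_after))

# withr::with_preserve_seed() restores the RNG state after running its code,
# so the next draw matches the one made directly after set.seed(7).
if (requireNamespace("withr", quietly = TRUE)) {
  set.seed(7)
  withr::with_preserve_seed(startup_hook())
  third <- sample(3)
  stopifnot(identical(first, third))
}
```

Comparing `.Random.seed` instead of the sampled values keeps the check independent of the specific numbers `sample()` returns, which can differ between R versions.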
