index.qmd

---
title: "Just Enough Python"
subtitle: "for R users to write advanced Snakemake workflows"
date: 2022-11-14
date-modified: 2022-11-21
aliases: 
    - just-enough-python.html
engine: knitr
image: /images/python-snakemake-r-logos-horiz.png
---

```{r}
#| eval: true
#| echo: false
#\ message: false

# forces renv to recognize these packages as dependencies
library(reticulate) # for python
library(downlit) # for code linking
library(xml2) # for code linking
```


Snakemake is a workflow management tool implemented as extra magic on top of Python. 
If you want to take your Snakemake workflows from beginner to advanced, 
learning a little Python goes a long way.

This is just enough Python to understand and apply some cool tricks in your Snakemake workflows.

![](/images/python-snakemake-r-logos-horiz.png){ width=300px fig-align="center" fig-alt="The Python, R, and Snakemake logos linked together."}^[The [Python logo](https://commons.wikimedia.org/wiki/File:Python-logo-notext.svg) by the Python Software Foundation is licensed under GNU GPL v2. The [Snakemake logo](https://commons.wikimedia.org/wiki/File:Snakemake_logo_dark.png) by Johannes Köster is  licensed under CC BY-SA 4.0. The [R logo](https://commons.wikimedia.org/wiki/File:R_logo.svg) by Hadley Wickham and others at RStudio is licensed under CC BY-SA 4.0.]

#####  prereqs

- **Basic Snakemake knowledge.** You've gone through [the official tutorial](https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html). 
  Maybe you've even built your own snakemake workflow.
- Minimal or **no python knowledge**, aside from what you may have gleaned from 
  the snakemake docs.
- **Basic R knowledge.** Tidyverse experience is helpful but not required.


## variables

Here are the basic variable types in Python and the equivalent type in R:

### basic types

| Python  | R         |
|---------|-----------|
| string  | character |
| integer | integer   |
| float   | numeric   |
| boolean | logical   |
| None    | NULL      |

```{python py-types}
#| filename: "Python"

# string
name = "Kelly"
favorite_color = 'green'
# float
height_inches = 64.0
# int
num_plants = 14
# bool
likes_cats = True
# none
invalid = None
```

```{r r-types}
#| filename: "R"

# character
name <- "Kelly"
favorite_color <- 'green'
# numeric
height_inches <- 64
# integer
num_plants <- 14L
# logical
likes_cats <- TRUE
# null
invalid = NULL
```

A key difference is that in R, all of the above objects are **vectors** of length 1.

#### type casting

You can convert variables to different types:

```{python type-cast}
#| filename: "Python"
int(32.0)
float(32)
str(32)
bool(0)
bool(1)
bool(None)
```

The equivalent R code:

```{R}
#| filename: "R"
as.integer(32)
as.numeric(32L)
as.character(32)
as.logical(0)
as.logical(1)
as.logical(NULL)
```

What similarities and differences do you notice when type-casting in Python vs. R?

#### Naming variables

The biggest difference in variable names between R and Python: 
no dots are allowed in Python variable names. 
All other rules from R apply. Python variable names must start 
with an alphabetic character and they can contain alphanumeric characters 
and underscores.

Just like in R, it is a good idea not to name a variable the same thing as a 
reserved keyword or global function.

```{python}
#| filename: "Python keywords"
help("keywords")
```

#### formatting strings

You can add strings together to build new ones like this:

```{python}
#| filename: "Python"
firstname = "Marie"
lastname = "Curie"
middle_init = 'S'
prefix = "Madame"

fullname = prefix + " " + firstname + " " + middle_init + ". " + lastname
print(fullname)
```

The print function takes any number of arguments and will insert a space between
items by default:

```{python}
#| filename: "Python"
print(prefix, firstname, middle_init + ".", lastname)
```

But there's a better way!

**f-Strings** in Python work just like `glue()` in R, where curly braces in the 
string are interpreted as code. They're much easier to read & write:

```{python}
#| filename: "Python"
f"{prefix} {firstname} {middle_init}. {lastname}"
```

This is equivalent to:

```{R}
#| filename: "R"
firstname <- "Marie"
lastname <- "Curie"
middle_init <- 'S'
prefix <- "Madame"
glue::glue("{prefix} {firstname} {middle_init}. {lastname}")
```


### container types

Here are a few of the most useful container types and their equivalents in R:

| Python             | R              |
|--------------------|----------------|
| list               | vector (all same type) or list (multiple types) |
| dictionary         | named list     |
| DataFrame (pandas) | data.frame (base) or tibble (tidyverse) |

#### lists

Python lists are most similar to lists in R. Python lists can also be compared
to vectors in R, but unlike vectors, lists can contain objects of different types.

```{python py-lists}
#| filename: "Python"
fruit = ['apple','banana','grape']
my_list = ['pizza', 'fries', 3, -2.5, 'donuts']
fruit + my_list
fruit.append('mango')
fruit
```

##### Indexing

Indexing is 0-based in Python. You can index lists and strings. 

```{python index}
#| filename: "Python"
numbers = [1,2,3,4]
len(numbers) # length
numbers[0] # first item
numbers[3] # last item
numbers[-1] # last item
```


You might find 0-based indexing unintuitive if you're used to working in 
languages like R that use 1-based indexing, but this is 
"not mad, just differently sane!"^[Greg Wilson quoting Terry Pratchett in 
_The Tidynomicon_: <https://tidynomicon.github.io/tidynomicon/>]

Here's an illustration that shows how 0-based & 1-based indexing systems compare
in the context of DNA coordinates^[from biostars: <https://www.biostars.org/p/84686/>]:

[![DNA indexing comparison](https://i.postimg.cc/yxxngrS5/single-nucleotide-or-variant.jpg)](https://www.biostars.org/p/84686/)

```{python}
#| filename: "Python"
dna = "TACGTCA"
print(dna[0])
print(dna[6])
```

We can try to write equivalent code in R...

```{r}
#| filename: "R"
#| error: true
dna <- "TACGTCA"
print(dna[1])
print(dna[7])
```

But this doesn't work! 
In the above R code, `dna` is a vector of length 1 with "TACGTCA" as the item at index 1.
You can't grab individual characters out of it.
We have to break up the string into a vector with every character as a separate item in R.

```{r}
#| filename: "R"
dna <- c('T', 'A', 'C', 'G', 'T', 'C', 'A')
print(dna[[1]])
print(dna[[7]])
```


#### dictionaries

Dictionaries consist of key-value pairs where you use the key to access the value.
Keys can be strings, integers, floats, or booleans. Values can be anything!

```{python}
#| filename: "Python"
foods = {'vegetable': ['carrot', 'eggplant'], 
         'fruit': 'cherry'}
foods['fruit']
foods['vegetable']
foods['vegetable'][0]
```

Dictionaries can be nested:

```{python}
#| filename: "Python"
hyperparams = {'glmnet': {'alpha': [0, 0.5, 1],
                          'lambda': [0.01, 0.1, 0, 1]
                          },
                'rf': {'mtry': [128, 256, 512]}
               }
hyperparams['glmnet']['alpha'][1]
```

Get a list of keys in the dictionary with:

```{python}
#| filename: "Python"
hyperparams.keys()
```

##### Snakemake config

Say you have a Snakemake workflow for training machine learning models.
You want to be able to specify different datasets, ML methods, and random seeds, 
so you use f-Strings to fill in these variables:

```{python}
#| filename: "Snakefile"
#| eval: false
dataset = "OTU"
method = "rf"
seed = 1001

rule train_ml_model:
    input:
        R="workflow/scripts/train_ml.R",
        csv=f"data/{dataset}.csv"
    output:
        model=f"results/{dataset}/runs/{method}_{seed}_model.Rds"
    script:
        "scripts/train_ml.R"
```

You can improve this by creating a configuration file in YAML format:

```{R}
#| filename: "config/otu.yml"
#| eval: false
dataset: 'OTU'
method: 'rf'
seed: 1001
```

Then specify your config file at the top of your Snakefile.
Snakemake parses your YAML config files to a dictionary called `config`.

```{python}
#| filename: "Snakefile"
#| eval: false
configfile: "config/otu.yml"

dataset = config["dataset"]
method = config['rf']
seed = config['seed']

rule train_ml_model:
    input:
        R="workflow/scripts/train_ml.R",
        csv=f"data/{dataset}.csv"
    output:
        model=f"results/{dataset}/runs/{method}_{seed}_model.Rds"
    script:
        "scripts/train_ml.R"
```


When you run this workflow, it uses `config/otu.yml` by default:

```{bash}
#| filename: sh
#| eval: false
snakemake -n
```

```{bash}
#| eval: false
rule train_ml_model:
    input: workflow/scripts/train_ml.R, data/OTU.csv
    output: results/OTU/runs/rf_1001_model.Rds
    jobid: 0
    reason: Missing output files: results/OTU/runs/rf_1001_model.Rds
    resources: tmpdir=/var/folders/9n/glrhxtfx453gl68sc1gzq0mc0000gr/T

Job stats:
job               count    min threads    max threads
--------------  -------  -------------  -------------
train_ml_model        1              1              1
total                 1              1              1
```


Now you can change the values of these variables by editing the config file,
or by creating multiple config files and specifying them on the command line:

```{R}
#| filename: "config/genus.yml"
#| eval: false
dataset: 'genus'
method: 'glmnet'
seed: 1001
```


```{bash}
#| filename: sh
#| eval: false
snakemake -n --configfile config/genus.yml
```

```{bash}
#| eval: false
rule train_ml_model:
    input: workflow/scripts/train_ml.R, data/genus.csv
    output: results/genus/runs/glmnet_1001_model.Rds
    jobid: 0
    reason: Missing output files: results/genus/runs/glmnet_1001_model.Rds
    resources: tmpdir=/var/folders/9n/glrhxtfx453gl68sc1gzq0mc0000gr/T

Job stats:
job               count    min threads    max threads
--------------  -------  -------------  -------------
train_ml_model        1              1              1
total                 1              1              1
```


See this example in context [here](https://github.com/SchlossLab/just-enough-python/blob/main/Snakemake_demo/workflow/Snakefile).

## conditionals

Think of these as functions that return a boolean.

| operator | meaning                  | same in R |
|----------|--------------------------|-----------|
| `==`     | equal                    | ✅        |
| `!=`     | not equal                | ✅        |
| `<`      | less than                | ✅        |
| `>`      | greater than             | ✅        |
| `<=`     | less than or equal to    | ✅        |
| `>=`     | greater than or equal to | ✅        |
| or       | or (inclusive)           | `|`       |
| and      | and                      | `&`       |
| in       | in                       | `%in%`    |
| not      | not                      | `!`       |

```{python}
3 > 2
True or False
```

### if-else statements

No curly braces here! But indentation does matter.

```{python}
#| filename: "Python"
if len(my_list) < 10:
    list_description = "This list is short"
elif len(my_list) < 15:
    list_description = "This list is medium length"
else:
    list_description = "This list is long!"
print(list_description)
```

#### inline if-else statement

When you create a variable with an if-else statement, 
you can do it all on one line:

```{python}
list_description = "short" if len(my_list) < 10 else "not short"
```

This is equivalent to:

```{python}
if len(my_list) < 10:
    list_description = "short"
else:
    list_description = "not short"
```

You can't make it a one-liner if you need to use `elif`.

#### nested if-else blocks

```{python}
#| filename: "Python"
num = 1
if num >= 0:
    if num == 0:
        print("zero")
    else:
        print("positive")
else:
    print("negative")
```


## for loops

If a variable has a length (i.e. you can run `len(myvar)`), then you can 
**iterate** over it in a for loop.

```{python}
#| filename: "Python"
postdocs = ['Allison Mason', 'Courtney Armour', 'Sarah Lucas']
for person in postdocs:
    print(f"{person}, PhD")
```

### range

Use `range()` to loop over a range of numbers. If you give `range()` two arguments,
range starts counting from the first argument and stops _just before_ the second argument,
stepping by 1.
In other words, the stop argument is _exclusive_.

```{python}
#| filename: "Python"
num_bottles = 3
for i in range(1, num_bottles+1):
    print(i, " bottles of pop on the wall")
```

If you give range just one argument, it starts counting from zero and uses
the argument as the stop value:
```{python}
#| filename: "Python"
num_bottles = 3
for i in range(num_bottles+1):
    print(i, " bottles of pop on the wall")
```

<!-- made it to here in code club #1 -->

You can also use `range()` to create lists:

```{python}
#| filename: "Python"
mylist = list(range(3))
mylist
```

You can use range with three arguments: `start`, `stop`, and `step`:

```{python}
#| filename: "Python"
list(range(2, 11, 2))
```

Give the `step` argument a negative value and make `start` greater than `stop` to move backwards:
```{python}
#| filename: "Python"
list(range(4, -7, -2))
```

You might think it's weird that range stops just _before_ the stop value.
But it works nicely when you give it the length of the list -- now you have the list indices!

```{python}
#| filename: "Python"
numbers = [1,2,3,4]
for i in range(len(numbers)):
    print(f"index {i} is {numbers[i]}")
```


### iterate over dictionaries

```{python}
#| filename: "Python"
problems = {1: "naming things", 2: "understanding binary", 3: "off-by-one errors"}
print('There are only 10 hard problems in computer science:')
for num in problems:
    print(f"\t {num} {problems[num]}")
```


### list comprehensions

When you want to make a for loop that creates a new list, dictionary, or set,
use a **comprehension**. I'm only going to cover list comprehensions below,
but you can learn about dictionary and set comprehensions 
[here](just-a-little-more-python.qmd#more-comprehensions).

To build a list with a for loop, you can do this:
```{python list-comp-for}
#| filename: "Python"
squares = []
for i in range(4):
    squares.append(i**2)
squares
```

but a **list comprehension** is much sleeker and performs faster:
```{python list-comp}
#| filename: "Python"
squares = [i**2 for i in range(4)]
```

This is roughly equivalent to using *apply functions in R:

```{r apply}
#| filename: "R"
sapply(0:3, function(x) {x**2})
```

You can continue nesting `for x in y` statements to build complex lists like this:

```{python}
#| filename: "Python"
[f"{x}{y}{z}" for x in range(1,3) for y in range(2,4) for z in range(3,5)]
```

#### Snakemake expand

Let's say you have a Snakemake workflow with a rule that should run multiple times
with different parameters. For example, you may want to train machine learning models
with different ML methods, datasets, and random seeds. 
You define these as [wildcards](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards) in a rule to train each ML model:

```{python}
#| filename: "Snakefile"
#| eval: false

rule train_ml_model:
    input:
        R="workflow/scripts/train_ml.R",
        csv="data/{dataset}.csv"
    output:
        model="results/{dataset}/runs/{method}_{seed}_model.Rds"
    script:
        "scripts/train_ml.R"
```

Notice that we're _not_ using **f-Strings** within the rule because we want 
Snakemake to recognize these as **wildcards** and fill in different values for
them with each run of the `train_ml_model` rule.

To tell Snakemake what values to assign for these wildcards, you can create 
lists with the values you want and write a target rule at the top of your Snakefile.

Here, you could use an f-String and a list comprehension to create the list of target files.

```{python}
#| filename: "Python"
datasets = ['OTU', 'genus']
methods = ['rf', 'glmnet', 'svmRadial']
seeds = range(1000, 1005)
[f"results/{dataset}/runs/{method}_{seed}_model.Rds" for dataset in datasets for method in methods for seed in seeds]
```

Here's what your Snakefile looks like now.

```{python}
#| filename: "Snakefile"
#| eval: false

datasets = ['OTU', 'genus']
methods = ['rf', 'glmnet', 'svmRadial']
seeds = range(1000, 1005)

rule targets:
    input:
        [f"results/{dataset}/runs/{method}_{seed}_model.Rds" for dataset in datasets for method in methods for seed in seeds]

rule train_ml_model:
    input:
        R="workflow/scripts/train_ml.R",
        csv=f"data/{dataset}.csv"
    output:
        model="results/{dataset}/runs/{method}_{seed}_model.Rds"
    script:
        "scripts/train_ml.R"
```

The list comprehension gets harder to read the more `for` statements you add to it.
So Snakemake provides a function `expand()` to clean this up:

```{python}
#| filename: "Snakefile"
#| eval: false

datasets = ['OTU', 'genus']
methods = ['rf', 'glmnet', 'svmRadial']
seeds = range(1000, 1005)

rule targets:
    input:
        expand("results/{dataset}/runs/{method}_{seed}_model.Rds", 
                dataset = datasets, method = methods, seed = seeds)

rule train_ml_model:
    input:
        R="workflow/scripts/train_ml.R"
    output:
        model="results/{dataset}/runs/{method}_{seed}_model.Rds"
    script:
        "scripts/train_ml.R"
```

Using `expand()` creates the exact same list as the list comprehension, but it's
much easier to read and add variables to.


## functions

You can define your own functions
in R: 

```{R r-function}
#| filename: "R"
square <- function(x) {
    return(x^2)
}

square(2)
```

and in Python:

```{python py-function}
#| filename: "Python"
def square(x):
    return x**2

square(2)
```


### anonymous functions

Sometimes you want to write a simple function that you only use once.
They're so inconsequential you don't even want to give them a name.

We've already used one in R^[Read more about anonymous functions in Hadley Wickham's book _Advanced R_: <http://adv-r.had.co.nz/Functional-programming.html#anonymous-functions>] inside `sapply()` above:

```{R r-anon}
#| filename: "R"
function(x) { x**2 }
```


In Python you use the `lambda` keyword:

```{python py-anon}
#| filename: "Python"
lambda x: x**2
```

Before the colon, list the arguments of the function.
After the colon, compute the value that the function should return.

#### lambda in Snakemake

When writing Snakemake workflows, lambda functions are useful for defining input 
files based on the wildcards in output files.
Consider a workflow where you have several rules that plot figures for a manuscript.
When you initially conduct the analysis, you don't know how the figures will be ordered
in the manuscript.
Once you begin drafting the manuscript, you decide that your diversity plot will
be figure 1 and your error rates plot will be figure 2.
You also decide to convert the figures to a different format, so the conversion
step seems like a good opportunity to rename the figures.
Initially you write this workflow:

```{python}
#| filename: "Snakefile"
#| eval: false

rule targets:
    input:
        "paper/paper.pdf"

rule convert_figure_1:
    input:
        tiff='figures/diversity.tiff'
    output:
        png="paper/figure_1.png"
    shell:
        """
        convert {input.tiff} {output.png}
        """

rule convert_figure_2:
    input:
        tiff='figures/error_rates.tiff'
    output:
        png="paper/figure_2.png"
    shell:
        """
        convert {input.tiff} {output.png}
        """

rule render_paper:
    input:
        Rmd="paper/paper.Rmd",
        R="workflow/scripts/render_rmd.R",
        figures=['paper/figure_1.png', 'paper/figure_2.png']
    output:
        pdf="paper/paper.pdf"
    script:
        "scripts/render_rmd.R"
```

The rules `convert_figure_1` and `convert_figure_2` are a bit repetitive; they
only differ by the input filenames and the figure numbers in the output filenames.
Maybe this isn't so bad with only two figures, but you might actually have 5-10 
figures for a full scientific paper. We can reduce the repetitive code with
a few tricks:

1. Create a dictionary `figures_dict` that maps the figure numbers to the descriptive figure file names.
1. Use a single rule to convert figures called `convert_tiff_to_png`, using a lambda function to get the input figure filenames based on the final figure numbers.


```{python}
#| filename: "Snakefile"
#| eval: false
figures_dict = {'1': 'diversity', '2': 'error_rates'}

rule targets:
    input:
        "paper/paper.pdf"

rule convert_tiff_to_png:
    input:
        tiff=lambda wildcards: f"figures/{figures_dict[wildcards.fig_num]}.tiff"
    output:
        png="paper/figure_{fig_num}.png"
    shell:
        """
        convert {input.tiff} {output.png}
        """

rule render_paper:
    input:
        Rmd="paper/paper.Rmd",
        R="workflow/scripts/render_rmd.R",
        figures=['paper/figure_1.png', 'paper/figure_2.png']
    output:
        pdf="paper/paper.pdf"
    script:
        "scripts/render_rmd.R"
```


This lambda function is equivalent to:

```{python}
def get_fig_name_from_num(wildcards):
    return figures_dict[wildcards.fig_num]
```

This works because Snakemake allows you to define a function that takes the 
rule's wildcards and returns a list of input filenames, rather than literally
listing the input filenames as before.
This greatly reduces the repetitiveness of the code and makes it easier to maintain.

We can improve this Snakefile even further by replacing the list of figures in
`render_paper` with a call to `expand()`:

```{python}
#| filename: "Snakefile"
#| eval: false
figures_dict = {'1': 'diversity', '2': 'error_rates'}

rule targets:
    input:
        "paper/paper.pdf"

rule convert_tiff_to_png:
    input:
        tiff=lambda wildcards: f"figures/{figures_dict[wildcards.fig_num]}.tiff"
    output:
        png="paper/figure_{fig_num}.png"
    shell:
        """
        convert {input.tiff} {output.png}
        """

rule render_paper:
    input:
        Rmd="paper/paper.Rmd",
        R="workflow/scripts/render_rmd.R",
        figures=expand(rules.convert_tiff_to_png.output.png, 
                       fig_num = figures_dict.keys())
    output:
        pdf="paper/paper.pdf"
    script:
        "scripts/render_rmd.R"
```


You can take a look at this full example in context [here](https://github.com/kelly-sovacool/figure-convert-reprex).
Also see the Snakemake docs for more about [input functions](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#snakefiles-input-functions)
and [params functions](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#non-file-parameters-for-rules).


# recap

**Snakemake concepts covered**

- Use [**f-Strings**](#formatting-strings) for human-readable string formatting.
- [**Config files**](#snakemake-config) are imported as dictionaries and allow you to change workflow parameters without modifying code.
- Snakemake's [**expand()**](#snakemake-expand) function is a readable way to build lists just like list comprehensions.
- [**Lambda functions**](#lambda-in-snakemake) help define Snakemake input files based on wildcards. You can also use them to define params based on wildcards and/or output files.

# resources {.appendix}

- [The Python docs](https://docs.python.org/3/)
- [The Snakemake docs](https://snakemake.readthedocs.io/)
<!--
- [Just A Little More Python](just-a-little-more-python.qmd): topics I wish I had time to cover in this tutorial.
-->