chapter5.Rmd

--- 
courseTitle       : Introduction to R v2
chapterTitle      : Data frames
description       : Most datasets you'll be working with will be stored as a data frame. By the end of this chapter, I'll be able to create a data frame, select interesting parts of a data frame and order a data frame according to certain variables. 
framework : datamind 
mode: selfcontained 
--- 

## What's a data frame?

You may remember from the chapter about matrices that all the elements you put in a matrix should be of the same type. Back then, your dataset on Star Wars only contained numeric elements. 

However, when doing a market research survey you often have questions such as:
- 'Are your married?' or a 'yes/no' questions (= boolean data type)
- 'How old are you?' (= numeric data type)
- 'What is your opinion on this product?' or other 'open- ended' questions(= character data type)
- ...

The output when doing the above survey on a number of respondents, is a dataset of different data types. You will often find yourself working with datasets containing different data types instead of only one. 

A data frame has the variables of a dataset as columns and the observations as rows. This will be a familiar concept for those coming from different statistical software packages such as SAS or SPSS.

*** =instructions

1. Click Submit Answer and have a look at the console printing sample data frame that built-in as an example in R.

*** =hint
Just click Run!

*** =sample_code
```{r eval=FALSE}
mtcars # Built-in R dataset stored in a data frame
```

*** =solution
```{r eval=FALSE}
# Just click the Run button
```

*** =sct
```{r eval=FALSE}
DM.result <- TRUE
```

*** =pre_exercise_code
```{r eval=FALSE}

```


---

## Quick, have a look at your dataset

Wow, that's a lot of cars! 

Working with large datasets is not uncommon in data analysis. When working with (extremely) large datasets and data frames, the first important task you, the data analyst, have is to develop a clear understanding of its structure and main elements. Therefore, it is often usefull to show only a small part of the entire dataset. This instead of printing the entire dataset every time you need it. 

So how to do this in R?
- The function `head()` enables you to show the first observations of a data frame (or any R object you pass to it).
- Unoriginal, the function `tail()` prints out the last observations in your dataset.

As you will see, both `head()` and `tail()` print the column names at the top. This top line is called the 'header' and contains the name of the different variables in your dataset.

*** =instructions

1. Give an overview of the mtcars data set by printing its first observations and the corresponding header.

*** =hint
Use the `head()` function and ask the head of `mtcars`.

*** =sample_code
```{r eval=FALSE}
# Have a quick look at your data
```

*** =solution
```{r eval=FALSE}
head(mtcars)
```

*** =sct
```{r eval=FALSE}
DM.result <- code_test("head(mtcars)")
```

*** =pre_exercise_code
```{r eval=FALSE}

```


---

## Have a look at the structure

Another, often used, method to get a rapid overview of your data is the function `str()`. The function `str()` shows you the structure of your dataset. For a data frame it tells you:

- The total number of observations (e.g. 32 car types)
- The total number of variables (e.g. 11 car features)
- A full list of the variables names (e.g. mpg, cyl ... )
- The data type of each variable (e.g. num for car features)
- The first observations

Applying the `str()` function will often be the first thing you do when receiving a new dataset or data frame. It is a great way to get more insight in your dataset before diving into the real analysis!

*** =instructions

1. Investigate the structure of the `mtcars`. Make sure you see the same numbers, variables and data types as mentioned above.

*** =hint
Use the `str()` function with `mtcars` as input!

*** =sample_code
```{r eval=FALSE}
# Investigate the structure of the mtcars dataset to get started!
```

*** =solution
```{r eval=FALSE}
# Investigate the structure of the mtcars dataset to get started!
str(mtcars)
```

*** =sct
```{r eval=FALSE}
DM.result <- code_test("str(mtcars)")
```

*** =pre_exercise_code
```{r eval=FALSE}

```


---

## Creating a data frame 

Since using built-in datasets is not even half the fun of creating your own datasets, the rest of this chapter is based on your personally developed dataset. So put your jet pack on, because it is time for some good old fashioned space exploration! 

As a first goal, you want to construct a data frame that describes the main characteristics of 8 planets in our solar system. According to your good friend Buzz the main features of a planet are:
- The type of planet (Terrestrial or Gass Giant)
- The planet's diameter relative to the diameter of the earth
- The planet's rotation across the sun relative to that of earth
- If the planet has rings or not (TRUE or FALSE)

After doing some high-quality research on [wikipedia](http://en.wikipedia.org/wiki/Planet), you feel confident enough to create the necessary vectors: planets, type, diameter, rotation and rings. (see editor, note that an element of one vector is linked to the element of another vector based on its position) 

You construct a data frame with the `data.frame()` function, giving it the vectors as input that should become the different columns of that data frame. Therefore, it is important that each vector used to construct a data frame has an equal length. But do not forget, it is possible (and likely) they have different types of data.

*** =instructions
- Use the function `data.frame()` to construct a data frame. Call this variable `planets.df`.

*** =hint
The `data.frame(col1,col2,col3,...)` function takes as arguments the vectors that will become the columns of the data frame. The columns in this case are: planet, diameter, rotation and rings.

*** =sample_code
```{r eval=FALSE}
planets    <- c("Mercury","Venus","Earth","Mars","Jupiter","Saturn","Uranus","Neptune");
type  <- c("Terrestrial planet","Terrestrial planet","Terrestrial planet","Terrestrial planet","Gass giant","Gass giant","Gass giant","Gass giant")
diameter <- c(0.382,0.949,1,0.532,11.209,9.449,4.007,3.883); 
rotation <- c(58.64,-243.02,1,1.03,0.41,0.43,-0.72,0.67);  
rings  <- c(FALSE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,TRUE);


# Create the data frame:
```

*** =solution
```{r eval=FALSE}
planets    <- c("Mercury","Venus","Earth","Mars","Jupiter","Saturn","Uranus","Neptune");
type  <- c("Terrestrial planet","Terrestrial planet","Terrestrial planet","Terrestrial planet","Gass giant","Gass giant","Gass giant","Gass giant")
diameter <- c(0.382,0.949,1,0.532,11.209,9.449,4.007,3.883); 
rotation <- c(58.64,-243.02,1,1.03,0.41,0.43,-0.72,0.67);  
rings  <- c(FALSE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,TRUE);

planets.df <- data.frame(planets,type,diameter,rotation,rings)

str(planets.df)
```

*** =sct
```{r eval=FALSE}
names <- "planets.df" 
values <- list("data.frame(planets,type,diameter,rotation,rings)" )
DM.result <- closed_test(names,values)
```

*** =pre_exercise_code
```{r eval=FALSE}

```

---

## Creating a data frame (2)

Make sure you have 8 observations and 5 variables.

*** =instructions

1. Use of the function `str()` to investigate the structure of the new `planets.df` variable.

*** =hint
This is easy, no hints ;-)!

*** =sample_code
```{r eval=FALSE}
# Check the structure of planets.df
```

*** =solution
```{r eval=FALSE}
# Check the structure of planet.df
str(planets.df)
```

*** =sct
```{r eval=FALSE}
DM.result <- code_test("str(planets.df)")
```

*** =pre_exercise_code
```{r eval=FALSE}
planets    <- c("Mercury","Venus","Earth","Mars","Jupiter","Saturn","Uranus","Neptune");
type  <- c("Terrestrial planet","Terrestrial planet","Terrestrial planet","Terrestrial planet","Gass giant","Gass giant","Gass giant","Gass giant")
diameter <- c(0.382,0.949,1,0.532,11.209,9.449,4.007,3.883); 
rotation <- c(58.64,-243.02,1,1.03,0.41,0.43,-0.72,0.67);  
rings  <- c(FALSE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,TRUE);

planets.df <- data.frame(type,planets,diameter,rotation,rings)
```


---

## Selection of data frame elements

Similar to vectors and matrices, you select elements from a data frame with the help of square brackets `[ ]`. By using a comma, you can indicate what to select from the rows and the columns respectively. For example:

- `my.data.frame[1,2]` selects from the first row in `my.data.frame`, the second element.
- `my.data.frame[1:3,2:4]` selects rows 1,2,3 and columns 2,3,4 in `my.data.frame`.

Sometimes you want to select all elements of a row or column. In this case, you just don't put anything in front or behind the comma: e.g. `my.data.frame[1,]` selects all elements of the first row. Let's now apply this technique on planets.df!

*** =instructions

1. Create `closest.planets.df`. Here, put all the data you have on the first three planets.
2. Create `furthest.planets.df`. Here, put all the data you have on the last three planets.

*** =hint
`planets.df[1:3,]` will select alle elements of the first three rows.

*** =sample_code
```{r eval=FALSE}
# planets.df from the previous exercise is pre-loaded
closest.planets.df <- 

furthest.planets.df <- 

# Have a look:
closest.planets.df
furthest.planets.df
```

*** =solution
```{r eval=FALSE}
# planets.df from the previous exercise is pre-loaded
closest.planets.df <- planets.df[1:3,]

furthest.planets.df <- planets.df[6:8,]

# Have a look:
closest.planets.df
furthest.planets.df
```

*** =sct
```{r eval=FALSE}
names  <- c("planets.df","closest.planets.df","furthest.planets.df")
values <- c("planets.df","planets.df[1:3,]","planets.df[6:8,]")
DM.result <- closed_test(names,values)
```

*** =pre_exercise_code
```{r eval=FALSE}
planets    <- c("Mercury","Venus","Earth","Mars","Jupiter","Saturn","Uranus","Neptune");
type  <- c("Terrestrial planet","Terrestrial planet","Terrestrial planet","Terrestrial planet","Gass giant","Gass giant","Gass giant","Gass giant")
diameter <- c(0.382,0.949,1,0.532,11.209,9.449,4.007,3.883); 
rotation <- c(58.64,-243.02,1,1.03,0.41,0.43,-0.72,0.67);  
rings  <- c(FALSE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,TRUE);

planets.df <- data.frame(type,planets,diameter,rotation,rings)
```


---

## Selection of data frame elements (2)

Instead of using numerics to select elements of a data frame, you can also use variable names to select columns of a data frame. 

Maybe you want to select for the first three rows, only the elements of the variable 'type'. One way to do this, is by typing `planets.df[1:3,1]`. A possible disadvantage of this approach is that you have to know (or look up) the position of the variable 'type', which gets hard if you have a lot of variables. It is often easier just to make use of the variable name `"type"`:

`planets.df[1:3,"type"]`.

*** =instructions

1. Select for the last six rows only the diameter and assign this selection to `furthest.planets.diameter`.

*** =hint
You select elements of a data frame conveniently with the square brackets. Select `3:8` for the rows, and `"diameter"` for the columns.

*** =sample_code
```{r eval=FALSE}
# planets.df from the previous exercise is pre-loaded: 
furthest.planets.diameter <- 

# Show me the 
furthest.planets.diameter
```

*** =solution
```{r eval=FALSE}
# planets.df from the previous exercise is pre-loaded
furthest.planets.diameter <- planets.df[3:8,"diameter"]
furthest.planets.diameter
```

*** =sct
```{r eval=FALSE}
names  <- c("planets.df","furthest.planets.diameter")
values <- c("planets.df","planets.df[3:8,'diameter']")
DM.result <- closed_test(names,values)
```

*** =pre_exercise_code
```{r eval=FALSE}
planets    <- c("Mercury","Venus","Earth","Mars","Jupiter","Saturn","Uranus","Neptune");
type  <- c("Terrestrial planet","Terrestrial planet","Terrestrial planet","Terrestrial planet","Gass giant","Gass giant","Gass giant","Gass giant")
diameter <- c(0.382,0.949,1,0.532,11.209,9.449,4.007,3.883); 
rotation <- c(58.64,-243.02,1,1.03,0.41,0.43,-0.72,0.67);  
rings  <- c(FALSE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,TRUE);

planets.df <- data.frame(type,planets,diameter,rotation,rings)
```

---
## Only planets with rings

You'll often want to select an entire column, i.e. 1 specific variable from a data frame. For example: Suppose you want to select all elements of the variable `rings`. As seen before, one way to do this is making use of `planets.df[,5]` or `planets.df[,"rings"]`. 

However, there is a short-cut. Use the `$` -sign to tell R it only has to look up all the elements of the variable behind the sign: `data.frame.name$variable.name`

*** =instructions

1. Make use of the `$` -sign to create the `rings.vector` that contains all elements of the 'rings' variable in the planets.df data frame.

*** =hint
`data.frame.name$variable.name` is the most convenient way to select a variable from a data frame. In this case the data frame of choice is `planets.df` and the variable of choice is `rings`.

*** =sample_code
```{r eval=FALSE}
# Create the rings.vector
rings.vector <- 

# Show me the 
rings.vector
```

*** =solution
```{r eval=FALSE}
# Create the rings.vector
rings.vector <- planets.df$rings
```

*** =sct
```{r eval=FALSE}
names  <- c("planets.df","rings.vector")
values <- c("planets.df","planets.df$rings")
DM.result <- closed_test(names,values)
```

*** =pre_exercise_code
```{r eval=FALSE}
planets    <- c("Mercury","Venus","Earth","Mars","Jupiter","Saturn","Uranus","Neptune");
type  <- c("Terrestrial planet","Terrestrial planet","Terrestrial planet","Terrestrial planet","Gass giant","Gass giant","Gass giant","Gass giant")
diameter <- c(0.382,0.949,1,0.532,11.209,9.449,4.007,3.883); 
rotation <- c(58.64,-243.02,1,1.03,0.41,0.43,-0.72,0.67);  
rings  <- c(FALSE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,TRUE);

planets.df <- data.frame(type,planets,diameter,rotation,rings)
```


---

## Only planets with rings (2)

From highschool, you remember some planets in our solar system have rings and others don't. But due to other priorities at that time (read: puberty) you can't recall their names, let alone their rotation speed, etc. 


Could R help you out? (Spoiler alert: of course it can!) 

Mmm, type `rings.vector` in the console and you get: FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE 

This means that the first 4 observations (or planets) do not have a ring (FALSE), but the other 4 do (TRUE). However, you don't get a nice overview of the names of these planets, their diameter, etc. As a next step, use `rings.vector` to select from `planets.df` all the data (i.e. all columns) on the four planets with rings.

*** =instructions

1. Assign to `planets.with.rings.df` all data in the planets.df dataset for the planets with rings (i.e. where `rings.vector` is TRUE).

*** =hint
Select elements from `planets.df` using the square brackets. The `rings.vector` contains boolean values and R will select only those rows/columns were the vector element is TRUE. In this case you want to select rows based on `rings.vector` and select all the columns.

*** =sample_code
```{r eval=FALSE}
# Note that the planets.df dataset and rings.vector is pre-loaded in your workspace

# Select the information on planets with rings:
rings.vector <- planets.df$rings
planets.with.rings.df <- 

# Show me the
planets.with.rings.df
```

*** =solution
```{r eval=FALSE}
# Note that the planets.df dataset is pre-loaded in your workspace

# Select the information on planets with rings:
rings.vector <- planets.df$rings
planets.with.rings.df <- planets.df[rings.vector,]

# Show me the
planets.with.rings.df
```

*** =sct
```{r eval=FALSE}
names  <- c("planets.df","rings.vector","planets.with.rings.df")
values <- c("planets.df","planets.df$rings", "planets.df[rings.vector,]")
DM.result <- closed_test(names,values)
```

*** =pre_exercise_code
```{r eval=FALSE}
planets    <- c("Mercury","Venus","Earth","Mars","Jupiter","Saturn","Uranus","Neptune");
type  <- c("Terrestrial planet","Terrestrial planet","Terrestrial planet","Terrestrial planet","Gass giant","Gass giant","Gass giant","Gass giant")
diameter <- c(0.382,0.949,1,0.532,11.209,9.449,4.007,3.883); 
rotation <- c(58.64,-243.02,1,1.03,0.41,0.43,-0.72,0.67);  
rings  <- c(FALSE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,TRUE);

planets.df <- data.frame(type,planets,diameter,rotation,rings)
rings.vector <- planets.df$rings
```


---

## Only planets with rings but shorter

So what exactly did you learn in the past exercises? You selected a subset from a data frame ( `planets.df` ), based on whether or not a certain condition was true (rings or no rings), and you managed to pull out all relevant data. Pretty awesome! By now, NASA is probably already flirting with your CV ;-). 

Now, let's move up one level and use the function `subset()`. You should see the `subset()` function as a short-cut to do exactly the same as what you just did. 

`subset(data.frame.name, subset = some.condition)` 

The first argument of `subset()` specifies the dataset from which you want a subset. With the second argument, you give R the necessary information and conditions to select the correct subset. 
For example: `subset(planets.df, subset=(planets.df$rings == TRUE))` 

R will give you exactly the same result as you got in the previous exercise. But this time with 1 line of code!

*** =instructions

1. Create a dataframe `small.planets.df` with planets that have a diameter smaller than earth (i.e smaller than 1, since diameter is a relative measure of the planets diameter).

*** =hint
Sorry, you're on your own here.

*** =sample_code
```{r eval=FALSE}
# Planets smaller than earth:
small.planets.df  <- # Your code here!
small.planets.df
```

*** =solution
```{r eval=FALSE}
# Planets smaller than earth:
small.planets.df  <- subset(planets.df, subset = planets.df$diameter < 1)
small.planets.df
```

*** =sct
```{r eval=FALSE}
names  <- c("planets.df","small.planets.df")
values  <-  c("planets.df","subset(planets.df, subset = planets.df$diameter < 1)")
DM.result   <- closed_test(names,values)
```

*** =pre_exercise_code
```{r eval=FALSE}
planets    <- c("Mercury","Venus","Earth","Mars","Jupiter","Saturn","Uranus","Neptune");
type  <- c("Terrestrial planet","Terrestrial planet","Terrestrial planet","Terrestrial planet","Gass giant","Gass giant","Gass giant","Gass giant")
diameter <- c(0.382,0.949,1,0.532,11.209,9.449,4.007,3.883); 
rotation <- c(58.64,-243.02,1,1.03,0.41,0.43,-0.72,0.67);  
rings  <- c(FALSE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,TRUE);

planets.df <- data.frame(type,planets,diameter,rotation,rings)
```


---

## Sorting

Making and creating rankings, is one of mankinds favorite affairs. These rankings can be useful (best universities in the world), entertaining (most influencial moviestars) or pointless (best 007 look-a-like). Up to you for which purpose you want to use your R skills ;-). 

In data analysis you will sort your data according to a certain variable in the dataset. In R, this is done with the help of the function `order()`. 

`order()` is a function that, when applied on a variable, gives you in return the position of each element. Let's look at the vector `a`: `a <- c(100,9,101)`. Now `order(a)` returns 2,1,3. Since 100 is the second largest element of the vector, 9 is the smallest element and 101 is the largest element.
    
Subsequently, `a[order(a)]` will give you the ordered vector 9,100,101, since it first picks the second element of a, then the first, then the last. Got it? If you're not sure, use the console to play with the `order()` function.` 

*** =instructions

1. Click Submit Answer when you are ready.

*** =hint
Just play with the `oder()` function in the console!

*** =sample_code
```{r eval=FALSE}

```

*** =solution
```{r eval=FALSE}
# Just play around with the order function in the console to see how it works
# Some examples:
order(1:10)
order(2:11)
order(c(5,4,6,7))
```

*** =sct
```{r eval=FALSE}
DM.result <- TRUE
```

*** =pre_exercise_code
```{r eval=FALSE}

```


---

## Sorting your data frame

Alright, now that you understand the `order()` function, let's do something useful shall we? You'd like to rearrange your data frame such that it starts with the largest planet and ends with the smallest one (i.e. sort on the column diameter).

*** =instructions

1. Assign to the variable positions the desired ordering for the new data frame you will create in the next step. You can use the `order()` function for that, with the additional argument `decreasing=TRUE`.
2. Now create the data frame `largest.first.df`, wich contains the same information as `planets.df`, but with the planets in decreasing order of magnitude.

*** =hint
`order(planets.df$diameter, decreasing=TRUE)` will give you the ordering of the variable diameter from largest to smallest. This is wat you should assign to positions. Use the variable positions then to select from the data `frame planets.df` !

*** =sample_code
```{r eval=FALSE}
#NOTE: The data frame planets.df has been pre-loaded

# What is the correct ordering based on the planets.df$diameter variable?
positions <-  

# Create new "ordered" data frame:
largest.first.df <-

# Show me the
largest.first.df
```

*** =solution
```{r eval=FALSE}
# What is the correct ordering based on the planets.df$diameter variable?
positions <-  order(planets.df$diameter, decreasing=TRUE)

# Create new "ordered" data frame:
largest.first.df <- planets.df[positions,]

# Have a look:
largest.first.df
```

*** =sct
```{r eval=FALSE}
names  <- c("planets.df","positions","largest.first.df")
values <- c("planets.df"," order(planets.df$diameter, decreasing=TRUE)"," planets.df[positions,]" )
DM.result <- closed_test(names,values)
```

*** =pre_exercise_code
```{r eval=FALSE}
planets    <- c("Mercury","Venus","Earth","Mars","Jupiter","Saturn","Uranus","Neptune");
type  <- c("Terrestrial planet","Terrestrial planet","Terrestrial planet","Terrestrial planet","Gass giant","Gass giant","Gass giant","Gass giant")
diameter <- c(0.382,0.949,1,0.532,11.209,9.449,4.007,3.883); 
rotation <- c(58.64,-243.02,1,1.03,0.41,0.43,-0.72,0.67);  
rings  <- c(FALSE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,TRUE);

planets.df <- data.frame(type,planets,diameter,rotation,rings)
```