Skip to content

Latest commit

 

History

History
355 lines (273 loc) · 11.7 KB

R6DS-vignette.md

File metadata and controls

355 lines (273 loc) · 11.7 KB

R6DS stands for R6 class based Data Structures. The package provides reference classes implementing some useful data stuctures. They are:

Why do we need the reference class?

The importance of the data structures in algorithm implementation and design patterns has been mentioned many many times in textbooks. In this section, I would like to explain why we need to combine the idea of the reference class and the data structures in R.

The reference class

The first question is: what is the reference class?

As an R user, you should know that, in R, everything is an object. This includes the variables like vector, list, data.frame, and etc., and even all the functions (closures).

Almost all the variables (if the functions are also variables, yes, they are, because the function name can be changed to another function) are assigned or passed by value (this is termed pass-by-value).

For example, consider the simple assignment:

x <- 0
y <- x

R will copy the value of x to y, and they are supposed to be two different variables which occupy different memory pieces. However, in practice, the new memory will not be allocated for y immediately when running y <- x. R will do a fake pass-by-reference (that we will explain later), and later new memory will be allocated for y(or x), once the value of x(or y) is changed. So anyway, x to y are supposed to be in different places in the memory.

Now consider the pass:

# define a function which changes the value of the passed variable,
# and then the new one
func <- function(val){ val <- -1; return(val) }

# remember that x = 0
z <- func(x)
x
#> [1] 0
z
#> [1] -1

R will copy the variable x to a new variable val, change the value of val, and then return it by value! (copy the local val to the new variable z, and then kill val when the function exits) So the function actually copies twice in both passing and returning.

It should be noticed that, if the value of val is not going to be changed inside the function func, that is, func is defined to be a immutable function like, for example,

# define a function which changes the value of the passed variable,
# and then the new one
func <- funcion(val){ print(val) }

then R will pass val by reference (that we will explain later). But anyway, you are changing another variable val instead of x inside the function.

The idea of pass-by-reference comes from a good dream that we hope when we do something like

func <- function(val){ val <- -1 }
func(x)

the global x will be changed…

I will explain why it is a good dream later. But now let’s focus on the pass-by-reference.

The pass-by-reference in assignment implies that, in the assignment action, for example,

y <- x

the variable y will be an alias (the term is a “reference”) of x, which means that both x and y will share the same memory. Once you change the value of either of the two, the other will be changed automatically.

Now consider the function

func <- function(val){ val <- -1 }
func(x)

If the function is pass-by-reference, then argument val will share the same memory of x when you run func(x), and the value of the global x will be changed after calling the function.

So when we do pass-by-reference, no new memory is allocated, the original variable will have an alias. When you change the value of the alias, the original one will also be changed.

The reference class follows the same rule, that is, any object or instance of this class will always be passed-by-reference.

Data structures

R is becoming a more and more sophisticated language. R users program by using R to make achievements in data sciences. Now the question is that “Are you really happy with only pass-by-value?” or “whether only the pass-by-value is sufficient?”

The answer is NO, because we are doing much more than expected in data sciences, and some ideas from the data structures are now needed, but the implementation of these ideas needs the pass-by-reference!

So the dream is that “we hope that we can input something into some function and the function will change its value!”

Or, for some people, “we really miss the pointers or references in Fortran, C, C++, and etc.”

Even R says no itself due to the fact the Reference Class following the S4 class has been implemented. And then the much more efficient R6 class was implemented and available in the R6 package. I employ the R6 class to implement the data structures.

Suppose that you want to design and implement some algorithm. The most efficient solution is, for example, to use the recursion, in which you pass a variable into some function, the function will call itself and pass the same variable into it. The corresponding algorithm requires that the variable will be changed inside the recursion. See for example, the traverse algorithm in the binary-search-tree, in which, if we want to have a copy of these traversed elements, then it is desirable to pass-by-reference a container (vector, list, data.frame and etc.) into the recursive traverse function. Once an element in the tree is reached, its value will be copied into the container.

Now you see that, the data structures and the algorithms do require the pass-by-reference feature!

The R6 class is truly the reference class

In order to investigate the R6 class, especially to confirm that it is doing pass-by-reference, consider the example below

library(R6)

RClass <- R6Class("RClass", portable = FALSE, class = FALSE)

RClass$set("private", ".val", 0)

RClass$set("public", "initialize", function(val=0){ .val <<- val })

RClass$set("active", "Val", function(){ return(.val) })

RClass$set("public", "Set", function(newval){ .val <<- newval })

RClass$set("public", "finalize", function(){ print(paste("obj",.val,"deleted!")) })

We define a new R6 type reference class RClass. It is has a finalize function which will be run when the system removes the instance of the class (free or collect the memory allocated for it). We see that the finalize will print a message show that the instance is being deleted. We do the memory garbage collection manually by using the gc function, because R is “lazy”…

We test if the memory will be collected when we run gc.

tmp1 = RClass$new()
rm(tmp1)
#> [1] "obj 0 deleted!"

gc()
#>           used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
#> Ncells  531038 28.4    1196111 63.9         NA   642280 34.4
#> Vcells 1041578  8.0    8388608 64.0      16384  1769699 13.6

Yes.

Then we define the first function to check if the memory of the local variable will be freed when the function exits.

ftmp <- function(){ tmp <- RClass$new() }
ftmp()

gc()
#> [1] "obj 0 deleted!"
#>           used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
#> Ncells  533384 28.5    1196111 63.9         NA   947574 50.7
#> Vcells 1048527  8.0    8388608 64.0      16384  1769699 13.6

For sure, it will.

We override the same function to take a variable and print it.

ftmp <- function(tmp){ tmp$Set(1) }

Our experiment is designed as follows. If the instance of the RClass class is passed-by-value into the function ftmp, then i) the global tmp1 will not be changed; ii) the new variable tmp inside the function will be removed when the function exits, which means that we will see some message saying “obj 1 deleted!”, because the value of tmp is set to one. Now let’s check

tmp1 = RClass$new()
ftmp(tmp1)

gc()
#>           used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
#> Ncells  533430 28.5    1196111 63.9         NA   947574 50.7
#> Vcells 1048703  8.1    8388608 64.0      16384  1769699 13.6
tmp1$Val
#> [1] 1

We see that no memory space is freed, which means that inside the function, no new variable was created. The value of the global tmp1 has been changed to one successfully.

So our conclusion is that R6 is capable!

The binary search tree example

Here is one example of the package R6DS. The binary search tree is quite efficient in sorting, searching and traversing its elements. The time complexity can be O(log**n) if the tree is well structured.

When building the binary search tree, the “<” and “=” operators should be defined and equip to the generator instance of the RBST class. We just compare the numbers for simplicity.

We consider the tree as follows

Graph source: Binary Search Tree @ WIKIPEDIA.

We try the package:

# we define the "<" and "="
lessthan <- function(x, y) return(x < y)
equal <- function(x, y) return(x == y)
# then we initialize the tree
bst <- RBST$new(lessthan=lessthan, equal=equal)

# the nodes in a vector
nodes <- c(8, 3, 10, 1, 6, 14, 4, 7, 13)
# we add the nodes or elements
bst$insert(collapse = as.list(nodes))

OK, the tree is now built. When you read the manual of the RBST class, you will see that we can do traversal by calling the traverse function of the class. Each node or element can be manipulated by using the function callback that is input into traverse. But it is a really good dream that we can pass something by reference into the recursive traverse and do something inside the recursion.

Now the dream comes true. We do:

# create an empty container to hold the elements
# we choose the data structure queue
container <- RQueue$new()
container$size
#> [1] 0

# then we define the callback function which takes two arguments
# node: the node or element in the tree
# queue: additional argument hopefully pass-by-reference
callback <- function(item, queue) queue$enqueue(item)
# note that item is the value of the node but not the node!

# traverse-in-order
bst$traverse(mode="in", callback=callback, container)
# it should be a sorted list of the elements in the tree
# 1  3  4  6  7  8 10 13 14
unlist(container$toList)
#> [1]  1  3  4  6  7  8 10 13 14

# empty the container
container <- RQueue$new()
container$size
#> [1] 0

# traverse-pre-order
bst$traverse(mode="pre", callback=callback, container)
# it should be
# 8  3  1  6  4  7 10 14 13
unlist(container$toList)
#> [1]  8  3  1  6  4  7 10 14 13

# empty the container
container <- RQueue$new()
container$size
#> [1] 0

# traverse-post-order
bst$traverse(mode="post", callback=callback, container)
# it should be 
# 1  4  7  6  3 13 14 10  8
unlist(container$toList)
#> [1]  1  4  7  6  3 13 14 10  8

You can pass any instance of some reference class into the function to make the manipulation of the data much more flexible. And you should see how important the package can be…

But of course, the example is somewhat “stupid” as you can solve the same problem by using some global variable. However, global variable should not be preferred in many other cases, for example: when you want to share your code with others, and preferrably the function offers the only interface (they do not need to know the names of the global variables and create them manually)