R functions to bind together lists into a data frame: yet another plug for the tidyverse

When working with large datasets in R, I often have to split data manipulation jobs up into a large number of small tasks which I run in parallel on my university’s computing cluster. Once the tasks have all finished running, I load the output into my R workspace. Each task’s output is usually a single data frame, and I define a list with each element being one of the data frames. The next step is to somehow bind together the list of data frames into a single data frame. The data frames all have the same number of columns and the same column names, so they can easily be put together with R’s rbind() function.
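As a rough illustration of the loading step described above, here is a minimal sketch. It assumes each task saved its data frame with saveRDS() into a single directory; the directory name, file names, and toy data are all made up for the example and are not my actual cluster setup.

```r
# Hypothetical setup: every task wrote its result with saveRDS() into one
# directory. The directory and file names below are invented for this sketch.
out_dir <- file.path(tempdir(), "task_output")
dir.create(out_dir, showWarnings = FALSE)

# Simulate three finished tasks so the example runs on its own.
for (i in 1:3) {
  task_result <- data.frame(task = i, value = c(i, i * 10))
  saveRDS(task_result, file.path(out_dir, sprintf("task_%03d.rds", i)))
}

# Find the output files and read each one into an element of a list.
task_files <- list.files(out_dir, pattern = "\\.rds$", full.names = TRUE)
task_list <- lapply(task_files, readRDS)

length(task_list)
```

The result, task_list, has the same shape as the list of data frames used in the rest of this post.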

The “naive” way of doing this would be to put rbind() into a for-loop. To demonstrate, I first build a list by copying a dummy data frame 10,000 times.

foo <- data.frame(a = rnorm(100), 
                  b = runif(100), 
                  c = rlnorm(100), 
                  d = rlogis(100))
foolist <- replicate(n = 10000, foo, simplify = FALSE)

Binding these data frames together in a for-loop is really not recommended. It is extremely slow and uses a large amount of RAM, because each call to rbind() copies the entire, steadily lengthening data frame into a new object on every trip through the loop. Here is how you would do it, though (I added the system.time() wrapper to show how long it takes):

system.time({
  foo_loop <- foolist[[1]]
  for (i in 2:length(foolist)) {
    foo_loop <- rbind(foo_loop, foolist[[i]])
  }
})

This takes 594 seconds on my machine.
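A back-of-the-envelope calculation helps show why the loop scales so badly. The sketch below is a simplified model that only counts rows written and ignores everything else R does internally; the variable names are mine. It assumes that each trip through the loop writes out a complete new copy of the accumulated data frame.

```r
n_frames <- 10000      # length of foolist in the example above
rows_per_frame <- 100  # rows in each copy of foo

# At iteration i, rbind() builds a new data frame with i * rows_per_frame rows.
rows_written_loop <- sum(as.numeric(2:n_frames)) * rows_per_frame

# Writing the final result just once would take this many rows.
rows_final <- n_frames * rows_per_frame

# How many times more rows the loop writes than the final result contains.
rows_written_loop / rows_final
```

Under this model the loop writes roughly 5,000 times as many rows as the final data frame contains, and the amount of work grows with the square of the list length.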

You can use a functional programming trick in R to make this more efficient. The function do.call() takes a function (or the name of a function as a string) as its first argument and a list as its second. It builds and evaluates a call to that function, using the elements of the list as the arguments. As a toy example, do.call('c', list(1, 2, 3)) is equivalent to c(1, 2, 3). Here is how you would use do.call() to bind the list of data frames together.

system.time({
  foo_do_call <- do.call('rbind', foolist)
})

This takes 184 seconds on my machine, and also uses less memory.
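The toy example from the do.call() explanation can be checked directly. This is a small sketch with made-up data, using only base R; the two tiny data frames are there just to show the same pattern with rbind().

```r
# do.call() spreads the list elements out as separate arguments.
toy_args <- list(1, 2, 3)
via_do_call <- do.call("c", toy_args)
direct <- c(1, 2, 3)
identical(via_do_call, direct)

# The same pattern with rbind() on a short list of data frames.
small_list <- list(
  data.frame(a = 1:2, b = c("x", "y")),
  data.frame(a = 3:4, b = c("z", "w"))
)
stacked <- do.call(rbind, small_list)
nrow(stacked)
```

The second call is the same as writing rbind(small_list[[1]], small_list[[2]]) by hand, which is not practical for a list with thousands of elements.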

However, Hadley Wickham’s tidyverse family of packages, and in particular its workhorse package dplyr, provides an even more efficient solution. The bind_rows() function from dplyr accepts a list of data frames as its argument and returns the same output as the preceding base-R function calls.

library(dplyr)  # bind_rows() comes from dplyr

system.time({
  foo_bindrows <- bind_rows(foolist)
})

This takes only 0.03 seconds on my machine. That’s over 6000 times faster than do.call()!

You can confirm that these three methods return the same output by checking with identical(foo_loop, foo_do_call) and identical(foo_loop, foo_bindrows).
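If you do not want to wait ten minutes for the loop, the same check can be run on a much smaller list. This is a sketch with made-up data and my own variable names. The dplyr comparison only runs if dplyr is installed, and I have not included the expected output, so run it to see the result on your own setup.

```r
set.seed(1)
small <- data.frame(a = rnorm(5), b = runif(5))
small_list <- replicate(n = 20, small, simplify = FALSE)

# Method 1: rbind() in a for-loop.
small_loop <- small_list[[1]]
for (i in 2:length(small_list)) {
  small_loop <- rbind(small_loop, small_list[[i]])
}

# Method 2: do.call().
small_do_call <- do.call(rbind, small_list)
identical(small_loop, small_do_call)

# Method 3: bind_rows(), only if dplyr is available.
if (requireNamespace("dplyr", quietly = TRUE)) {
  small_bindrows <- dplyr::bind_rows(small_list)
  print(identical(small_loop, small_bindrows))
}
```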

Overall, I would recommend using the tidyverse function here. It’s better both in terms of speed and memory usage.

One thought on “R functions to bind together lists into a data frame: yet another plug for the tidyverse”

  1. I understood the first sentence, and part of the second. Then the confusion sets in. I think it would be just the same if you wrote in German. So you are a very intelligent guy, I have to admit. Question: why are you so bad at cribbage, then?
