dplyr and memory


Andrew Hoerner

Jan 19, 2014, 9:32:05 PM1/19/14
to manip...@googlegroups.com

Dear folks—

I have been putting together some code to do various sorts of analysis on 52 years of US income distribution data (initially Current Population Survey data). My full data set is about 14 GB as a CSV (or 13 GB as an SQLite DB, or 20 GB as 731 individual column files), but a lot of that is missing values for variables that only exist for a small number of years. Despite this, a data frame containing even a minimally useful set of variables is big enough that many operations that make a full copy of it fail for lack of memory. (In my experience you always need four times, usually six times, and sometimes ten times as much free memory as the size of your data frame to avoid snags in operations that copy and modify it.)

One of my purposes in doing this is to write code that makes this sort of analysis available to a broader spectrum of users. The imagined user for whom I am writing is a research staffer at a small public-interest nonprofit. So one of my goals is to write code that can be run on ordinary desktops or laptops, by people who are not professional programmers.  To this end I have been writing my code in steps that leave the DF in place, with strategies like working with a permutation rather than doing a sort, and producing intermediate results in vectors that I can work with a few at a time.
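
For example, here is roughly the kind of thing I mean by working with a permutation rather than sorting (a toy stand-in for my data; the real extract is far too big to copy):

```{r}
df <- data.frame(year = rep(1962:2013, each = 10), income = rlnorm(520, 10, 1))

# instead of sorting the whole data frame (df_sorted <- df[order(df$income), ]),
# keep the permutation and index into single columns only as needed
ord <- order(df$income)
head(df$income[ord])             # incomes in ascending order
median(df$income[ord][1:100])    # a statistic on the 100 lowest incomes
```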

I have been drawn to dplyr by the prospect of doing more of my analysis on objects that are either out on disk or in alter-in-place structures like data.table. I taught myself R for the specific purpose of doing this project, and I have not had much experience with either databases or languages with a more modify-by-reference style; I am a lawyer and maybe 3/5 of an economist by original training, so this has been a bit of a leap for me. (In retrospect I'd have to say that this was less than ideal as a new learner's first R project.) I was sort of hoping to ease my way into SQL and data.table both by working through the dplyr command set.

I have been sitting here trying to sketch out a translation of my existing code, which mainly uses base R functions, into dplyr terms, and I find I am handicapped by not really understanding which of the dplyr functions need to hold two or more complete in-memory copies of the tbl_df or data frame object that they work on.

These are questions that ultimately I am going to have to answer by trial, but I am hoping I can get at least a little information to avoid spending days on blind alleys. I would be very grateful for answers to even one of the following four questions:

1.      Do any of the five verbs or group_by need _both_ their input and their output to be in memory? I am basically asking if I can avoid memory limits in chained operations by writing every other product to disk.

2.      Which if any of the five verbs and group_by can operate on an object in memory in place, i.e. placing the output in the same memory as the input? (Or nearly the same in the case of select and mutate?) And would this require anything in the syntax besides assigning the output to the same variable as the input?

3.      Are any of the five verbs or group_by already known to be more memory-intensive than the others, or less memory-intensive?

4.      Would I be correct in assuming that nested operations (i.e. f(g(x))) using the five verbs or group_by on a single data frame will generally require more memory than the maximum of what the functions require individually? Or, to frame this concern differently, would I be right in thinking that I am less likely to run into memory problems if I run my analysis one step at a time rather than doing several steps at once, or are there some economies of joint operation? (The sketch below shows the step-at-a-time workflow I have in mind.)
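
To make question 4 concrete, the step-at-a-time workflow I have in mind looks roughly like this (`cps`, `income`, and `top1` are placeholders for my actual objects, so this is only a sketch):

```{r}
library(dplyr)

# one step at a time, writing intermediate results to disk and freeing
# memory in between, instead of nesting or chaining several verbs at once
by_year <- group_by(cps, year)
saveRDS(by_year, "by_year.rds")
rm(cps); gc()                    # release the original before the next step

by_year <- readRDS("by_year.rds")
shares  <- summarise(by_year, top_share = sum(income[top1]) / sum(income))
```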

Lastly, I'd like to say that I have yet to run into any instance of R code that I could not run because it was too slow. I've had a number of things take all night, but that is not a big problem for me. On the other hand, I am constantly running into code that I literally cannot run because of the memory demands it makes. I mention this because I wanted something a little different from what I got with the baseball benchmark vignette. I am less of a statistician and much less of a programmer than most of your users, and I am intentionally working on an older computer, so I recognize that my needs may be atypical. But I don't think they are unique. So anyway, I would love to see memory as well as speed benchmarks.

Thanks for reading a long complicated question. 

I hope that I may be able to contribute answers to some of these questions for which the answers are not already known.

Warmest regards, andrewH

Hadley Wickham

Jan 20, 2014, 10:29:57 AM1/20/14
to Andrew Hoerner, manipulatr
Hi Andrew,

Making sure that dplyr makes as few copies as possible is still a work
in progress, but I can give you a general outline, point you to some
useful tools, and tell you where improvements are planned.

We'll start by making a local copy of the internal dplyr function
`dfloc()`. This function is very useful for helping us understand
how the memory behind a data frame works.

```{r}
library(dplyr)
dfloc <- dplyr:::dfloc
```

(dfloc will eventually be exported from dplyr once we've thought it
through a bit more.)

`dfloc()` tells us the address of each vector in the data frame.

```{r}
dfloc(iris)
```

If these addresses change between operations then we know R has made a
copy. It's important to think about data frames as collections of
columns rather than as monolithic objects, because for many operations
we can reuse existing columns and not use any extra memory.

In base R, a surprising number of operations make copies of the
individual vectors. For example, when you extract two columns from a
data frame, their contents are actually copied. There's no reason to
do this!

```{r}
# Copies the first two columns
dfloc(iris[1:2])

dfloc(iris)
# Copies all the columns!
iris$blah <- 1
dfloc(iris)
```

(This is something that may improve in R 3.1.0 due to some work by
Michael Lawrence)
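
If you want to watch this happening for yourself, base R's `tracemem()` is
another handy tool (a minimal sketch; note that `tracemem()` needs a build of
R with memory profiling enabled, which the CRAN binaries have):

```{r}
y <- iris$Sepal.Length   # y shares its memory with the iris column
tracemem(y)
y[1] <- 0                # copy-on-modify: R duplicates the vector and tracemem reports it
untracemem(y)
```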

The goal of dplyr is to avoid making copies when not needed:

```{r}
dfloc(iris)
dfloc(group_by(iris, Species))
dfloc(mutate(iris, area = Sepal.Length * Sepal.Width))
dfloc(select(iris, 1:3))
```

Currently, `group_by()` doesn't make a copy, but `mutate()` and
`select()` do, so we'll fix that for the next version. Once we've done
that, any sequence of `mutate()`, `select()` and `group_by()` will
only need to occupy a little extra memory (i.e. for the indices and
new variables). Saving interim results will not have any effect on
memory usage.
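
For example, assigning an interim result to a name doesn't copy the columns;
you can check this with `dfloc()`:

```{r}
by_species <- group_by(iris, Species)   # save the interim result
dfloc(iris)
dfloc(by_species)  # same addresses as iris: the columns are shared, not copied
```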

Obviously there's no way around `summarise()` making a copy, but that's
usually not a big deal since you're reducing the size of the data so
much. `arrange()` also has to make a copy, but you can generally avoid
using it, since ordering doesn't affect most statistical operations. If
ordering is important (e.g. for computing a cumulative mean), dplyr
provides ways to avoid copying the whole data frame and instead reorder
just the columns you need (see the windowing vignette for more
details).
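
For example, a cumulative mean of `Sepal.Length` ordered by `Sepal.Width` only
needs an ordered copy of the one column, not of the whole data frame (a base-R
sketch of the idea; the windowing vignette covers the dplyr way):

```{r}
ord <- order(iris$Sepal.Width)          # a permutation, not a sorted copy of iris
len_ordered <- iris$Sepal.Length[ord]   # reorder just the column you need
cum_mean <- cumsum(len_ordered) / seq_along(len_ordered)
```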

Altogether, this means that dplyr lets you work with data frames with
very little extra overhead. Eventually, dplyr will never create a
complete copy of the data frame unless you're sorting it, and will
provide tools so that you never need to sort it. This should mean that
you can keep using data frames, and don't need to switch to a more
complex object with reference semantics (like a data table).

Hadley

--
http://had.co.nz/

Arun

Jan 20, 2014, 1:50:30 PM1/20/14
to manip...@googlegroups.com
@Hadley,

You wrote: "Currently, `group_by()` doesn't make a copy, but `mutate()` and
`select()` do, so we'll fix that for the next version."

Presumably you mean for data frames? Because for a data.table object this still makes a copy:

> x <- tbl_dt(data.table(x=1:5, y=6:10))
> x
Source:     local data table [5 x 2]

  x  y
1 1  6
2 2  7
3 3  8
4 4  9
5 5 10
> tracemem(x)
[1] "<0x7ff1b6a90350>"
> group_by(x, x)
tracemem[0x7ff1b6a90350 -> 0x7ff1b7643628]: structure grouped_dt regroup.tbl_dt regroup group_by 
tracemem[0x7ff1b7643628 -> 0x7ff1b7643510]: structure grouped_dt regroup.tbl_dt regroup group_by 
Source: local data table [5 x 2]
Groups: x

  x  y
1 1  6
2 2  7
3 3  8
4 4  9
5 5 10

Hadley Wickham

Jan 20, 2014, 1:54:41 PM1/20/14
to Arun, manipulatr
On Mon, Jan 20, 2014 at 12:50 PM, Arun <arago...@gmail.com> wrote:
> @Hadley,
>
> You wrote: "Currently, `group_by()` doesn't make a copy, but `mutate()` and
> `select()` do, so we'll fix that for the next version."
>
> Presumably you mean for data frames? Because for a data.table object this
> still makes a copy.

Yes, that's probably because I'm setting the key, which I now believe
I shouldn't do.

Hadley

--
http://had.co.nz/

Arunkumar Srinivasan

Jan 20, 2014, 2:01:00 PM1/20/14
to Hadley Wickham, manipulatr
Not sure why you believe you shouldn't do that.
Are you planning not to allow a) binary-search-based subsets, b) joins on data.tables, and c) faster aggregation due to grouping values being in contiguous memory locations?

Arun

Hadley Wickham

Jan 20, 2014, 2:04:02 PM1/20/14
to Arunkumar Srinivasan, manipulatr
From our long discussion on another thread, my impression was that I
shouldn't be modifying the keys of a data table in a way that the user
doesn't know about (and it's not needed for most grouped operations).
I'm happy to do whatever you think is best because I find the
semantics confusing.

Let's keep any further discussion off list, because I don't think it's
relevant to most dplyr users.

Hadley
--
http://had.co.nz/

Andrew Hoerner

Jan 21, 2014, 7:14:50 PM1/21/14
to manip...@googlegroups.com, Andrew Hoerner

Thanks, Hadley!  That is extremely helpful.

I am aware that this is a little presumptuous, and probably has implications way beyond my ken, but I offer a small suggestion. For any function that copies the data frame or produces new rows or columns, how about an argument, near or at the end of the argument list and defaulting to FALSE, that when TRUE deletes the original object if it is in memory? This would give your users the option of making your verbs somewhat more sqldf-like if desired by those of us who are more memory-constrained, but leave them unchanged by default. (I am not sure whether it would make more sense to apply that treatment to unchanged columns or rows for functions that change columns or rows in place.)
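
In the meantime, what I have been doing to approximate this looks roughly like
the following (`df`, `income`, and `cpi` are placeholders for my real objects):

```{r}
df <- mutate(df, real_income = income / cpi)   # overwrite the original binding
gc()                                           # free the memory held by the replaced copy
```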

Again, many thanks.  --andrewH

P.S. I agree that I don't need to follow your discussion with Arun, but I would be interested in whatever answer you work out, as I am still trying to make up my mind about data frames (or tbl_dfs) versus data.table objects.

Hadley Wickham

Jan 21, 2014, 8:10:30 PM1/21/14
to Andrew Hoerner, manipulatr
> I am aware that this is a little presumptuous, and probably has implications
> way beyond my ken, but I offer a small suggestion. For any function that
> copies the data frame or produces new rows or columns, how about an argument,
> near or at the end of the argument list and defaulting to FALSE, that when
> TRUE deletes the original object if it is in memory? This would give your
> users the option of making your verbs somewhat more sqldf-like if desired by
> those of us who are more memory-constrained, but leave them unchanged by
> default. (I am not sure whether it would make more sense to apply that
> treatment to unchanged columns or rows for functions that change columns or
> rows in place.)

I think eventually there'll be something like tbl_lazy() which
wouldn't do any operations until explicitly asked (just like tbl_sql).
But to be any more efficient than the eager equivalent, it would need
something like a compiler to figure out an efficient execution path.
That's currently beyond my skills.

Other developments may make memory limitations on your personal
computer less of a problem - it's possible to "rent" a computer from
Amazon with 145 GB of memory for $3.50 / hour. If that were easy to
access from RStudio, many currently hard problems would become much
easier.

Hadley


--
http://had.co.nz/