Dear folks—
I have been putting together some code to do various sorts of analysis on 52 years of US income distribution data (initially Current Population Survey data). My full data set is about 14 GB as a CSV (13 GB as an SQLite DB, or 20 GB as 731 individual column files), but a lot of that is missing values for variables that only exist for a small number of years. Even so, a data frame containing a minimally useful set of variables is big enough that many operations that make a full copy of the data frame fail because they run out of memory. (My experience is that operations that copy and change a data frame need at least four times, usually six times, and sometimes ten times as much free memory as the data frame itself to avoid snags.)
One of my purposes in doing this is to write code that makes this sort of analysis available to a broader spectrum of users. The imagined user for whom I am writing is a research staffer at a small public-interest nonprofit. So one of my goals is to write code that can be run on ordinary desktops or laptops, by people who are not professional programmers. To this end I have been writing my code in steps that leave the DF in place, with strategies like working with a permutation rather than doing a sort, and producing intermediate results in vectors that I can work with a few at a time.
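To make the permutation idea concrete, here is a rough sketch of the sort of thing I mean, with a data frame df and an income column hhincome standing in for my real data:

    ## Sketch only: df and hhincome are stand-ins for my real data.
    ## order() returns a small integer permutation vector, so the large
    ## data frame itself is never reordered or copied.
    ord <- order(df$hhincome)

    ## Work through the index rather than sorting the whole data frame:
    n     <- nrow(df)
    top10 <- ord[ceiling(0.9 * n):n]   # row numbers of the top income decile
    mean(df$hhincome[top10])           # a summary computed via the index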
I have been drawn to dplyr by the prospect of doing more of my analysis on objects that are either out on disk or in alter-in-place structures like data.table. I taught myself R specifically for this project, and I have not had much experience with either databases or languages with a more modify-by-reference style; I am a lawyer and maybe 3/5 of an economist by original training, so this has been a bit of a leap for me. (In retrospect, I'd have to say this was less than ideal as a new learner's first R project.) I was sort of hoping to ease my way into both SQL and data.table by working through the dplyr command set.
I have been sitting here trying to sketch out a translation of my existing code, which mainly uses base R functions, into dplyr terms, and I find I am handicapped by not really understanding which of the dplyr functions need to hold two or more complete in-memory copies of the tbl_df or data frame they work on.
These are questions that ultimately I am going to have to answer by trial, but I am hoping I can get at least a little information to avoid spending days on blind alleys. I would be very grateful for answers to even one of the following four questions:
1. Do any of the five verbs or group_by need _both_ their input and their output to be in memory? I am basically asking if I can avoid memory limits in chained operations by writing every other product to disk.
2. Which, if any, of the five verbs and group_by can operate on an in-memory object in place, i.e. put the output in the same memory as the input (or nearly the same, in the case of select and mutate)? And would this require anything in the syntax besides assigning the output to the same variable as the input?
3. Are any of the five verbs or group_by already known to be more memory-intensive than the others, or less memory-intensive?
4. Would I be correct in assuming that nested operations (i.e. f(g(x))) using the five verbs or group_by on a single data frame will generally require more memory than the maximum required by the functions individually? Or, to frame the concern differently, would I be right in thinking that I am less likely to run into memory problems if I run an analysis one step at a time rather than doing several steps at once (see the sketch below), or are there some economies of joint operation?
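Just to show what I mean by running things one step at a time with intermediate products parked on disk, here is a sketch; the column names (year, inc) and the file name are made up:

    library(dplyr)

    ## Nested, all-at-once version:
    out <- summarise(group_by(filter(df, year >= 1990), year),
                     med_inc = median(inc))

    ## Step-at-a-time version, writing an intermediate product to disk so
    ## memory can be given back before the next verb runs:
    step1 <- filter(df, year >= 1990)
    saveRDS(step1, "step1.rds")          # made-up file name
    rm(df); gc()                         # release the original for now
    step2 <- group_by(step1, year)
    out   <- summarise(step2, med_inc = median(inc))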
Lastly, I'd like to say that I have yet to run into any instance of R code that I could not run because it was too slow. I've had a number of things take all night, but that is not a big problem for me. On the other hand, I am constantly running into code that I literally cannot run because of the memory demands it makes. I mention this because I wanted something a little different from what I got with the baseball benchmark vignette. I am less of a statistician and much less of a programmer than most of your users, and am intentionally working on an older computer, so I recognize that my needs may be atypical. But I don't think they are unique. So anyway, I would love to see memory as well as speed benchmarks.
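For what it is worth, the sort of thing I have in mind need not be any more elaborate than base R's gc() and object.size(), along these lines (some_step is just a stand-in for one analysis step):

    gc(reset = TRUE)                        # reset the "max used" counters
    result <- some_step(df)                 # stand-in for one analysis step
    gc()                                    # peak memory shows up in "max used"
    print(object.size(result), units = "Mb")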
Thanks for reading a long complicated question.
I hope that I may be able to contribute answers to some of these questions for which the answers are not already known.
Warmest regards, andrewH
Thanks, Hadley! That is extremely helpful.
I am aware that this is a little presumptuous, and probably has implications way beyond my ken, but I offer a small suggestion. For any function that copies the DF or produces new rows or columns, how about an argument, at or near the end of the argument list and defaulting to FALSE, that when TRUE deletes the original object if it is in memory? This would let those of us who are more memory-constrained make your verbs somewhat more sqldf-like, while leaving them unchanged by default. (I am not sure whether it would make more sense to apply that treatment to the unchanged columns or rows for functions that change columns or rows in place.)
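Purely as an illustration of what I have in mind (the argument name below is made up; nothing like it exists in dplyr today), the call might look like the commented-out line, standing in for the rm()/gc() dance I otherwise do by hand:

    library(dplyr)

    ## Hypothetical; no such argument exists in dplyr -- this is the proposal:
    # big <- mutate(big, real_inc = inc / cpi, drop_input = TRUE)

    ## Roughly what I do by hand today to get the same effect (inc and cpi
    ## are made-up column names):
    tmp <- mutate(big, real_inc = inc / cpi)
    rm(big)            # delete the original once the new object exists
    invisible(gc())    # and hand its memory back
    big <- tmp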
Again, many thanks. --andrewH
P.S. I agree that I don't need to follow your discussion with Arun, but I would be interested in whatever answer you work out, as I am still trying to make up my mind about data frames (or tbl_df's) versus data.table objects.