DataFrame and Memory Limitations


Michael Smith

Aug 5, 2014, 10:42:58 AM
to julia...@googlegroups.com
All,

Are there currently any solutions in Julia to handle larger-than-memory
datasets in a similar way you do in a DataFrame?

The reason I'm asking is that R has the limitation that you need to fit
all your data into memory. SAS, on the other hand (while being quite
different), does not have this limitation.

In the age of "big data" this can be quite an advantage.

Of course, you can "patch" this situation, e.g. in R you can use the ff
or bigmemory packages, or use SQL.

But my point is that it is bolted on, and you have to spend extra mental
cycles switching between, say, data.frame and ff instead of focusing on
the data problem at hand. This is a clear advantage of SAS, where you
don't have to do that. So I'm wondering how this is handled in Julia.

Thanks,

M

P.S.: I do not intend to start a flame war, e.g. whether R or SAS or
Julia is better. I'm just interested to find out whether such a solution
exists in Julia (I haven't found any, but maybe I overlooked something).
And if no such solution exists, given that Julia is still young,
evolving, and malleable (in a positive sense), it might make sense to
think about it.

Harlan Harris

Aug 5, 2014, 10:48:48 AM
to julia...@googlegroups.com
Not currently, but it's been talked about for as long as there have been DataFrames in Julia. See these issues, and the references therein, for a start:

https://github.com/JuliaStats/DataFrames.jl/issues/25
https://github.com/JuliaStats/DataFrames.jl/issues/26

Also look around the package and issue lists for DataStreams (which I believe is not currently functional), which is a related effort.





Ramesh Fernando

Aug 5, 2014, 11:04:38 AM
to julia...@googlegroups.com
Hi, I don't know Julia, but in R you don't actually need to load all your data into memory; like SAS, you can read off disk. Revolution Analytics' proprietary R (which I believe works with Hortonworks/Cloudera, Hadoop, and YARN; I don't know whether there is a Julia package for YARN, and I know little of Hadoop and YARN myself and am not really interested in Java, so I'd suggest contacting someone at Hortonworks or Revolution) was demonstrated at the R user group here in Ottawa, Canada. Revolution R's other proprietary methods, as well as the bigmemory package (http://cran.r-project.org/web/packages/bigmemory/index.html and http://www.bigmemory.org/), can also handle larger-than-memory data. Here is a discussion on large data sets.
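
As a point of comparison on the Julia side, base Julia can memory-map a file-backed array, which is roughly the trick bigmemory uses for R. A minimal sketch, assuming a hypothetical raw binary file of Float64 values with made-up dimensions:

    # Rough Julia analogue of bigmemory's file-backed matrices.
    # "data.bin" and its dimensions are hypothetical; the file is assumed to
    # hold raw Float64 values in column-major order.
    using Mmap

    nrows, ncols = 10_000_000, 20
    io = open("data.bin", "r")
    A = Mmap.mmap(io, Matrix{Float64}, (nrows, ncols))  # pages are loaded on demand

    col_means = sum(A, dims=1) ./ nrows  # no need to hold the whole matrix in RAM
    close(io)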



John Myles White

Aug 6, 2014, 12:23:05 AM
to julia...@googlegroups.com
At some point, we need to create some additional data tools for working with data sets that do not fit in memory. Harlan’s list touches on a lot of the best strategies for doing that in a way that would smoothly integrate with the rest of the language.

 — John

Michael Smith

Aug 6, 2014, 9:29:51 AM
to julia...@googlegroups.com
Thanks everybody. It should come as no surprise to me that the Julia
community is already working on this. Awesome.

One minor point that I have not seen discussed in the issues is a
reference to the plyrmr package, which is essentially plyr/dplyr for
Hadoop.

https://github.com/RevolutionAnalytics/RHadoop/wiki/user%3Eplyrmr%3EHome

Maybe it's possible to pillage some ideas from there.

M



John Myles White

Aug 6, 2014, 10:52:29 AM
to julia...@googlegroups.com
Isn’t Hive already “plyr for Hadoop”?

— John


Randy Zwitch

Aug 6, 2014, 1:00:57 PM
to julia...@googlegroups.com
Yes, and FWIW, I use Julia and Hive all the time via ODBC.jl. I just don't do it within the context of a "dataframe".
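
For anyone curious what that looks like, here is a minimal, illustrative sketch. The DSN name and the query are placeholders, and the ODBC.jl API has changed over the years, so treat the exact calls as an assumption rather than a reference:

    # Push the heavy lifting to Hive/Hadoop and pull back only a small summary.
    # "HiveDSN" and the query are hypothetical.
    using ODBC, DataFrames
    import DBInterface   # separate package; ODBC.jl implements its interface

    conn = ODBC.Connection("HiveDSN")
    df = DBInterface.execute(conn,
            "SELECT region, count(*) AS n FROM events GROUP BY region") |> DataFrame
    DBInterface.close!(conn)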

Michael Smith

Aug 8, 2014, 10:20:40 PM
to julia...@googlegroups.com
Hive is great, but it's SQL-like, whereas plyrmr is R, which is closer
to Julia; because of that difference, plyrmr might be worth keeping in
mind. Also, Antonio Piccolboni (the person behind much of this) has done
an excellent job integrating R with Hadoop (mostly MapReduce), so maybe
there's something to be learned there.

On a different note, since Spark seems to be the new big thing and is
going to make much of MapReduce obsolete (based on what I hear from
people in the industry), it might be important to keep in mind that the
industry is moving towards Spark (and Tez) and away from MapReduce.
(This is not to say that Hadoop will be deprecated, since Spark and Tez
integrate well with Hadoop; just that MapReduce, which is part of
Hadoop, will mainly be used to keep old projects running, while new
development will mainly target Spark and Tez.)

Anyway, this is leading us towards _really_ big data. In contrast, what
I had in mind was something like what PyTables does for Python (i.e.
sort of _intermediate_ big data, not really big data), but with better
integration with DataFrame. I think HDF5 (not HDFS) has already been
discussed in one of the issues, so things look fine, although I haven't
seen PyTables mentioned explicitly in the GitHub issues Harlan pointed
to; since PyTables provides an abstraction on top of HDF5, it might also
be worth mining for ideas.
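
To make the HDF5 idea concrete: HDF5.jl can already read slices of an on-disk dataset without loading the whole thing, which is the core trick PyTables builds on. A rough sketch; the file name and dataset name are made up:

    # Process an on-disk HDF5 dataset slice by slice; only each slice is materialized.
    # "measurements.h5" and the dataset name "values" are hypothetical.
    using HDF5

    h5open("measurements.h5", "r") do file
        dset = file["values"]               # lazy handle, nothing read yet
        nrows = size(dset)[1]
        step = 1_000_000                    # rows per slice
        total = 0.0
        for start in 1:step:nrows
            stop = min(start + step - 1, nrows)
            chunk = dset[start:stop, :]     # reads only this hyperslab from disk
            total += sum(chunk)
        end
        println("total = ", total)
    end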

Anyway, that's my core dump, hope it helps.

Cheers,
M



John Myles White

Aug 13, 2014, 5:42:45 PM
to julia...@googlegroups.com
If somebody wants to build that, it would be awesome.

 -- John

On Aug 13, 2014, at 2:42 PM, Ariel Katz <arika...@gmail.com> wrote:

What about something like Continuum's Blaze, where there is a single table and array interface to a variety of data formats and computational engines?

Calculations are separated from the data, so backends can be swapped out and driven by a common syntax.
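
In Julia terms, that "swap the backend, keep the syntax" separation maps naturally onto multiple dispatch. A toy sketch of the idea; every type and function below is invented for illustration and is not an existing package:

    # Toy illustration of a Blaze-like split between operations and backends.
    abstract type AbstractBackend end

    struct InMemoryBackend <: AbstractBackend
        data::Vector{Float64}
    end

    struct FileBackend <: AbstractBackend
        path::String            # text file with one number per line
    end

    # One generic operation, one method per backend.
    total(b::InMemoryBackend) = sum(b.data)

    function total(b::FileBackend)
        s = 0.0
        for line in eachline(b.path)   # stream the file, never hold it all in memory
            s += parse(Float64, line)
        end
        return s
    end

    # The calling code is identical either way:
    # total(InMemoryBackend(rand(100)))
    # total(FileBackend("numbers.txt"))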




Ariel Katz

Aug 13, 2014, 5:53:17 PM
to julia...@googlegroups.com
Considering the resources and time invested in Blaze so far, that would certainly be a significant undertaking.

They already provided Julia bindings for Bokeh; I wonder if they would consider doing the same for Blaze...

John Myles White

Aug 13, 2014, 5:56:36 PM
to julia...@googlegroups.com
I don't think Continuum had anything to do with the Julia bindings for Bokeh. I believe that project is largely the work of a single volunteer: https://github.com/samuelcolvin/Bokeh.jl/graphs/contributors

 -- John

Ariel Katz

Aug 14, 2014, 7:24:18 PM
to julia...@googlegroups.com
Ah ok.

Juan

Jul 19, 2016, 5:36:32 AM
to julia-stats, my.r...@gmail.com
I agree, we need a platform able to work transparently with data of any size, at least bigger than memory, and ideally distributed across several computers.

Solutions such as Spark are not complete; they only offer basic functionality on which to build something else.

We don't just need to be able to get some summaries, as we do with databases; we need to be able to do all operations on big data, operations such as multiplying two big matrices, fitting mixed-effects models, running MCMC, etc.

bigmemory and ff allow you to do some simple things, but they cannot be used by other packages such as lme4.

Juan

Sep 28, 2016, 6:48:09 PM
to julia-stats
Yes, but you can only do simple things such as summaries, or use the functions implemented in those special packages. You can do linear regression so far, but you can't do more complex things such as mixed-effects regression, or use Stan or any other generic Bayesian package.
The same goes for Spark: you can only use predefined functions, very simple ones, or create your own by hand, but it would be very difficult to program something like lme4 from scratch.

Milan Bouchet-Valat

Sep 29, 2016, 4:33:21 AM
to julia...@googlegroups.com
We're not completely there yet, but with Query.jl and
StructuredQueries.jl, combined with the JuliaDB/JuliaData packages, one
should be able to work on out-of-memory data sets as (or more)
efficiently than e.g. SAS. The high-level API is the same whether you
work on a DataFrame or on an external database.

There's also OnlineStats.jl for computing statistics without loading
the full data set in memory at once.
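
For instance, OnlineStats.jl updates its estimators one observation or one chunk at a time, so arbitrarily large inputs can be streamed through in constant memory. A minimal sketch, with random chunks standing in for data read off disk:

    # Streaming statistics with OnlineStats: memory use does not grow with the data.
    using OnlineStats

    m = Mean()
    v = Variance()
    for _ in 1:100                 # pretend each chunk was read from disk or a socket
        chunk = randn(10_000)
        fit!(m, chunk)             # update the running statistics in place
        fit!(v, chunk)
    end
    value(m), value(v)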


Regards



dalinkman

Sep 29, 2016, 10:53:15 AM
to julia-stats
What about using a tuple of distributed vectors/arrays as a table subclass, or using Dagger for an out-of-core lazy array?

Then it can be loaded into a distributed array for linear algebra. 
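
Something in that direction can already be mocked up with DistributedArrays.jl: a named tuple of distributed vectors behaves like a crude column-partitioned table. A toy sketch, not an existing table type:

    # Toy column-partitioned "table" built from distributed vectors.
    using Distributed
    addprocs(4)
    @everywhere using DistributedArrays

    # Each column is a DArray partitioned across the workers.
    tbl = (x = distribute(rand(1_000_000)),
           y = distribute(rand(1_000_000)))

    # Column-wise reductions run where the data lives.
    sum(tbl.x), sum(tbl.y)

    # Gather a column back into a local Array when it fits, e.g. for dense
    # linear algebra:
    # X = Array(tbl.x)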

David Anthoff

Sep 29, 2016, 5:59:49 PM
to julia...@googlegroups.com
Yes, at least in theory it should be possible to e.g. load a very large CSV file with CSV.jl, transform it with Query.jl, and then feed it into OnlineStats.jl. I think the architecture of all three packages should be such that this could work with a dataset that is larger than memory. In practice I don't think anyone has tried, and I'm sure we would run into things that need fixing, but I can't think of any basic design decision in these packages that would prevent this kind of thing in principle.
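
The shape of that pipeline might look roughly like the following; plain iteration stands in for the Query.jl operators to keep the sketch self-contained, and the file name and column names are made up:

    # Stream a large CSV row by row into an OnlineStats accumulator.
    # "big.csv" and its columns "group" and "value" are hypothetical.
    using CSV, OnlineStats

    m = Mean()
    for row in CSV.Rows("big.csv")          # row-at-a-time, nothing fully materialized
        row.group == "A" || continue        # the "transform/filter" step
        fit!(m, parse(Float64, row.value))  # the "aggregate" step, constant memory
    end
    value(m)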

There is a general question of the core interop type for these things. Right now things like regression packages mostly expect a DataFrame, but we could imagine a world where these packages expected a more generic type. I think there are a bunch of potential options out there: both DataStreams and Query define their own streaming interfaces for tabular data (in the case of Query it is just a normal Julia iterator that returns NamedTuple elements). DataStreams in addition defines a column-based interface that might be much faster when the dataset actually fits into memory (pure speculation on my end). I think there are also a bunch of attempts out there to define something like an abstract table structure, but I'm not sure to what extent they would enable a streaming data story.

David Anthoff

Sep 29, 2016, 6:03:14 PM
to julia...@googlegroups.com

Microsoft at some point had DryadLINQ, which allowed one to run LINQ queries on a distributed cluster. Given that Query.jl is modeled very much after LINQ, I'm sure one could write a query provider for Query.jl that did something similar, i.e. maybe a front-end for Dagger.jl that translates Query.jl queries into Dagger computations. Having said that, I have not looked into any of these in detail, so what I just wrote might be completely off. It would certainly be a fun project for someone, though!

 


Tom Breloff

Sep 29, 2016, 6:03:40 PM
to julia...@googlegroups.com
I remember Stefan talking about a built-in "record" type on the horizon (like named tuples, but core to the language).  Does anyone know about progress there?



Milan Bouchet-Valat

Sep 30, 2016, 3:31:41 AM
to julia...@googlegroups.com
I think that's https://github.com/JuliaLang/julia/pull/16580
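
For reference, named tuples did eventually land as a built-in type, so a lightweight "record" now looks like this; nothing below is specific to that PR or to DataFrames:

    # A named tuple as a lightweight, built-in record type.
    row = (name = "Ann", age = 31, score = 4.5)
    row.name                        # field access by name -> "Ann"

    # A vector of such records is a simple row-oriented table.
    rows = [(name = "Ann", age = 31), (name = "Bo", age = 27)]
    sum(r.age for r in rows)        # -> 58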


Regards

David Anthoff

Sep 30, 2016, 12:34:40 PM
to julia...@googlegroups.com
See also the discussion at https://github.com/JuliaLang/julia/issues/8470. Best, David