Importing / Exporting DataFrames between Daru & R


Athitya Kumar

May 22, 2017, 12:52:11 PM
to Sameer Deshmukh, Victor Shepelev, Lokesh Sharma, Shekhar Prasad Rajak, SciRuby Mailing List
Hey all.

I'm planning to add `from_rdata` and `to_rdata` methods (the RDS versions are easy to include too) to the daru-io gem. Both methods can be implemented in Ruby via gems that wrap R, such as Rinruby, RSRuby and Rserve-ruby-client. All of these gems work by evaluating R commands such as `load()` and `save()` from Ruby. The main criteria for choosing between them are:

(1) Conversion for Lists / DataFrames (R) to and from Array / Hash Objects (Ruby).
(2) Time required for parsing, conversion and writing  

The trade-offs between the 3 gems are explained very well in this README. In short, RSRuby is the fastest gem (supports only Linux and Mac OS), Rinruby is the most robust (supports all OSes) yet the slowest, and Rserve-client is in between the two.

Currently, Daru supports Vector and DataFrame objects, so if we have a robust conversion of R Lists into Arrays / Hashes, converting them into `Daru::DataFrame` objects is easy. Even the (fastest) RSRuby gem performs this conversion.
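
As a rough sketch of that last step: assume the R wrapper hands back a data frame as a Ruby Hash of column name => Array, with length-1 columns collapsed to bare scalars. That shape is an assumption made for illustration and should be checked against what the chosen gem actually returns. The helper names `normalize_columns` and `columns_to_rows` are hypothetical and not part of Daru or RSRuby.

```ruby
# Sketch only: tidy up a Hash that is ASSUMED to have the shape
#   { "column name" => Array or scalar }
# so that it can be passed to a data frame constructor.
def normalize_columns(r_list)
  columns = r_list.each_with_object({}) do |(name, values), acc|
    # Wrap bare scalars so every column is an Array.
    acc[name.to_s] = values.is_a?(Array) ? values : [values]
  end

  lengths = columns.values.map(&:length).uniq
  raise ArgumentError, "columns differ in length: #{lengths.inspect}" if lengths.size > 1

  columns
end

# Row-wise view (Array of Hashes) of the same data, in case that shape
# is more convenient for the caller.
def columns_to_rows(columns)
  names = columns.keys
  columns.values.transpose.map { |row| names.zip(row).to_h }
end

columns = normalize_columns("id" => [1, 2, 3], "score" => [0.5, 0.7, 0.9])
rows    = columns_to_rows(columns)

# With the daru gem installed, a Hash of columns can be passed directly:
#   require "daru"
#   df = Daru::DataFrame.new(columns)
```

The normalisation is kept separate from the DataFrame construction so it can be tested without R or Daru installed.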

The main issue lies in writing back to RData files, i.e., converting Ruby Arrays / Hashes back into R objects with eval and saving them to RData files. I don't (yet) see a direct way to assign Ruby Arrays to R Lists with RSRuby / Rserve-ruby-client, other than looping through the whole Ruby array and calling `eval` for each element to build up the R List.

However, Rinruby does provide an assign method to 'copy' Ruby 1D arrays into R 1D Lists (2D Arrays throw an 'Unsupported Data Type' error, but we can definitely work around that with 1D arrays).
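
One way the 1D workaround could look, sketched without touching R at all: split the frame into one 1D array per column, then let R glue the columns back together. The helper `rds_export_plan` and the `daru_*` variable names are hypothetical. The sketch also assumes that the assign method accepts arrays of strings as well as numbers, and that the output path needs no special escaping inside an R string literal.

```ruby
# Sketch only: turn a Hash of columns into
#   * a list of [r_variable_name, 1D array] pairs to assign, and
#   * the R statements that rebuild and save the data frame.
def rds_export_plan(columns, path)
  names = columns.keys.map(&:to_s)

  # One temporary R variable per column, each holding a plain 1D array.
  assignments = columns.values.each_with_index.map do |values, i|
    ["daru_col_#{i + 1}", values]
  end
  vars = assignments.map(&:first)

  # Column names travel as one more 1D array, so arbitrary names survive.
  assignments << ["daru_names", names]

  r_code = [
    "daru_df <- data.frame(#{vars.join(', ')}, stringsAsFactors = FALSE)",
    "names(daru_df) <- daru_names",
    "saveRDS(daru_df, file = #{path.inspect})"
  ]

  { assignments: assignments, r_code: r_code }
end

plan = rds_export_plan({ id: [1, 2], name: %w[x y] }, "out.rds")

# With Rinruby and R installed, the plan could then be replayed roughly as:
#   require "rinruby"
#   plan[:assignments].each { |var, values| R.assign(var, values) }
#   plan[:r_code].each { |line| R.eval(line) }
```

Swapping `saveRDS(...)` for `save(daru_df, file = ...)` would target an RData file instead.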

Please share your opinions regarding which gem(s) we can go ahead with. We can even go ahead with one gem (say, RSRuby) for importer and another gem (say, Rinruby) for exporter.

Regards,
Athitya Kumar

Victor Shepelev

May 23, 2017, 10:32:54 AM
to Athitya Kumar, Sameer Deshmukh, Lokesh Sharma, Shekhar Prasad Rajak, SciRuby Mailing List
My 5c for this question would be pretty extremist, so to speak.

1. I believe importing is MUCH more important than exporting. Imagine the Rubyist guy working in a team of several R guys. They typically send each other just Rds or Rdata files of their studies. If our Rubyist guy could read those files, he is good. He can send just CSV back, or any other "common" format, R is good with them and R guys are used to reading any format they receive data in.

2. I believe the "ideal" goal is to read Rds (at least, as they are "one R object", while Rdata is "whole large project") in an R-less environment. Otherwise (if you have R) it is simpler to open the file in R and resave it as CSV or something. The task is not super-hard, but not super-easy either. The format is simple, yet poorly documented.
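
To give a feel for what an R-less reader would start with, here is a minimal sketch that only inspects the header of an Rds stream and does not decode any R objects. It relies on details of R's serialization format as commonly described: files are gzip-compressed by default, the stream opens with a two-byte format marker (`X\n` for XDR), and three big-endian 32-bit integers follow, holding the serialization version and two packed R version numbers. Treat those details as assumptions to verify against the R Internals manual; the helper names are hypothetical.

```ruby
require "zlib"

# Two-byte markers at the start of a serialized R stream (assumed).
RDS_FORMATS = { "X\n" => :xdr, "A\n" => :ascii, "B\n" => :native_binary }.freeze

# R packs versions as major * 65536 + minor * 256 + patch.
def decode_r_version(packed)
  [packed >> 16, (packed >> 8) & 0xFF, packed & 0xFF].join(".")
end

# Sketch only: read the header of an Rds byte string.
def rds_header(bytes)
  bytes = bytes.b
  # Gzip streams start with the magic bytes 0x1F 0x8B.
  bytes = Zlib.gunzip(bytes) if bytes.start_with?("\x1F\x8B".b)

  format = RDS_FORMATS[bytes[0, 2]]
  raise ArgumentError, "unrecognised serialization format" unless format
  # Only the XDR layout is decoded further in this sketch.
  return { format: format } unless format == :xdr

  version, written_by, min_reader = bytes[2, 12].unpack("N3")
  {
    format: format,
    serialization_version: version,
    written_by: decode_r_version(written_by),
    min_reader: decode_r_version(min_reader)
  }
end

# Synthetic header for illustration (not a real file): version 2,
# written by R 3.4.0, readable from R 2.3.0.
fake_rds = Zlib.gzip("X\n".b + [2, 0x030400, 0x020300].pack("N3"))
header   = rds_header(fake_rds)
```

Everything after the header (the tagged object tree) is where the real work of a from-scratch reader would lie.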

Though, as I can see, Python guys go with rpy2 (like RSRuby) or PypeR (like RinRuby).
While Perl guys suddenly have full-featured R reading library from scratch: https://github.com/cubranic/Statistics-R-IO

So, you can ignore my point (2), but consider point (1), at least.

V.

Athitya Kumar

May 24, 2017, 2:07:51 PM
to SciRuby Mailing List, Sameer Deshmukh, Lokesh Sharma, Shekhar Prasad Rajak, Victor Shepelev
Hello Victor.

I understand, but aren't formats like RData & RDS still preferred, as they're binary formats which are lighter and faster than formats like CSV? I think this makes both import and export equally important for a Rubyist working with a team of R developers.

Regarding (2), it is sad to note that all the above gems have R as a requirement, so importing / exporting RData / RDS data into Daru DataFrames might not be possible in an R-less environment.

Continuing from the discussion today, I think we can go ahead with RSRuby for import and Rinruby for export, as both require only R, whereas Ruby-rserve-client requires Rserve as well.

I'm considering RSRuby for import, as it's the fastest of these 3 gems and is still able to parse RData files into R lists as an Array of Hashes that can directly be used to create a Daru::DataFrame.

Rinruby has "assign" method, which makes it possible to directly "create" the R variables from Ruby, and write into RData / RDS files - making it suitable for using with export. However, the trade-off here is that this is the slowest of the 3 gems.

Do share your opinions regarding this. :)

Regards,
Athitya Kumar

Pjotr Prins

May 24, 2017, 2:21:57 PM
to sciru...@googlegroups.com, Sameer Deshmukh, Lokesh Sharma, Shekhar Prasad Rajak, Victor Shepelev
On Wed, May 24, 2017 at 11:37:10PM +0530, Athitya Kumar wrote:
>
> Hello Victor.
> I understand, but aren't formats like RData & RDS still preferred as
> they're binary formats which are lighter and faster as compared to
> formats like csv?

I would disagree with that. Compressed CSV tends to win because of
less IO - IO is usually the bottleneck. We have RData because the R
people have it that way. Reading RData is known to be slow.

It is good to support both formats, of course.
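
A minimal sketch of the compressed-CSV route using only Ruby's standard library (`csv` and `zlib`), working on in-memory strings so it runs without touching disk. The helper names are hypothetical. Note that a CSV round trip brings every value back as a String, so type information would have to be restored separately.

```ruby
require "csv"
require "zlib"

# Sketch only: serialise a Hash of equal-length columns to gzipped CSV bytes.
def columns_to_gzipped_csv(columns)
  csv = CSV.generate do |out|
    out << columns.keys
    # transpose turns the column arrays into row arrays.
    columns.values.transpose.each { |row| out << row }
  end
  Zlib.gzip(csv)
end

# Read the bytes back into a Hash of column name => Array of Strings.
def gzipped_csv_to_columns(bytes)
  text  = Zlib.gunzip(bytes).force_encoding(Encoding::UTF_8)
  table = CSV.parse(text, headers: true)
  table.headers.to_h { |name| [name, table[name]] }
end

original  = { "id" => [1, 2, 3], "city" => ["Pune", "Kyiv", "Delft"] }
packed    = columns_to_gzipped_csv(original)
roundtrip = gzipped_csv_to_columns(packed)

# To hand the result to someone else:
#   File.binwrite("data.csv.gz", packed)
```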

> I think this makes both import and export equally
> important for a Rubyist to work with a Team of R developers.
> Regarding (2), it is sad to note that all the above gems have R as a
> requirement, and importing / exporting RData / RDS data into Daru
> DataFrames might not be possible in a R-less environment.
> Continuing from the discussion today, I think we can go ahead with
> RSRuby for import and Rinruby for export as both have requirement of
> just R, whereas Ruby-rserve-client requires Rserve as well.

I was there at the birth of RSRuby. Alex did a great job :).

I think it is fine to have R as a dependency when dealing with RData.
The only thing is that it will probably rule out direct JRuby support. How
about making the RData transformer a standalone tool, so it can be
called from both MRI and JRuby and even other languages?

People try to avoid dependencies between libraries and tools. But it
is actually a solved problem.

> I'm considering of choosing RSRuby for import, as it's the fastest of
> these 3 gems and is still able to parse RData files to provide R lists
> as Array of Hashes that can directly be used to create Daru::DataFrame.
> Rinruby has "assign" method, which makes it possible to directly
> "create" the R variables from Ruby, and write into RData / RDS files -
> making it suitable for using with export. However, the trade-off here
> is that this is the slowest of the 3 gems.
> Do share your opinions regarding this. :)

Shared. +1.

Pj.

Victor Shepelev

May 26, 2017, 7:43:04 AM
to Athitya Kumar, SciRuby Mailing List, Sameer Deshmukh, Lokesh Sharma, Shekhar Prasad Rajak
OK, let's go the proposed way.
Let the first goal be importer only, done with RSRuby, and then... We'll see how it works.

BTW, as some "finalize preparations" work, it would be nice to start filling daru-io repository with test data, downloaded from open data sources, in various formats you plan to support.

V.

Sameer Deshmukh

May 26, 2017, 7:50:34 AM
to Victor Shepelev, Athitya Kumar, SciRuby Mailing List, Lokesh Sharma, Shekhar Prasad Rajak
+1 to filling up the repo with test data.

Regards,
Sameer Deshmukh