A fast and natural interface to R from Perl


Zakariyya Mughal

Dec 23, 2014, 9:19:22 PM
to Perldl, The Quantified Onion
Hi everyone,

I have (finally) uploaded modules for working with the R interpreter
with Perl. The CPAN links are below, but to get a taste of what the API
looks like, check out my blog post <http://enetdown.org/dot-plan/posts/2014/12/24/a_fast_and_natural_interface_to_R_from_Perl/>.

- Statistics::NiceR <http://p3rl.org/Statistics::NiceR>
- Data::Frame <http://p3rl.org/Data::Frame>

I'd love to have feedback on how to improve them.

Regards and happy hacking,
- Zaki Mughal

Zakariyya Mughal

Dec 24, 2014, 6:31:33 PM
to Chris Marshall, Perldl, The Quantified Onion
On 2014-12-24 at 09:55:46 -0500, Chris Marshall wrote:
> Very cool! Thanks for expanding the space of perl and PDL computation! In
> your work, did you determine anything PDL3 would need to do a better job to
> support using R from perl?
>

Sure, there were a couple of things that would have been nice to have:

For Data::Frame,

- It's a small thing, but a way to plug into the stringification for
PDL subclasses would make implementing subclasses easier. Right
now, PDL's `string` method is a bit of a black box because it
stringifies all the elements at once, so I had to write my own
`string1d` function [^stringifiable].

- Make a hash-based PDL the default. While using the `initialize` function
combined with `FOREIGNBUILDARGS` is an easy way to get PDL working
with Moo[se], it is extra code [^moo-hash-pdl].

- It might be useful to have annotations on all functions that do not
change the values of elements. I need this for enum-like data,
where I want the levels (the possible values of the enum) to be copied
over to new enum-like PDLs. For now, I wrap the following methods:

around qw(slice uniq dice) => sub { ... };

but I'm not sure if that covers everything [^around-enum].

My thoughts on this: perhaps the PDL class has too many methods by
default. There should be a way to pare that down using roles, but
deciding what goes in each role does not seem straightforward to me at
this time.
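
To make the method-wrapping point above concrete, here is a minimal core-Perl sketch of the idea. `ToyPiddle` and `ToyFactor` are hypothetical stand-ins (not PDL or PDL::Factor), and the hand-rolled loop only imitates what Moo's `around` does; it is an illustration under those assumptions, not the code in Data::Frame.

```perl
use strict;
use warnings;

# Hypothetical stand-in for PDL: value-preserving methods return new objects.
package ToyPiddle {
    sub new {
        my ($class, @elements) = @_;
        return bless { elements => [@elements] }, $class;
    }
    sub elements { return @{ $_[0]{elements} } }
    sub slice {
        my ($self, $from, $to) = @_;
        return ref($self)->new( ( $self->elements )[ $from .. $to ] );
    }
    sub uniq {
        my ($self) = @_;
        my %seen;
        return ref($self)->new( grep { !$seen{$_}++ } $self->elements );
    }
}

# Hypothetical enum-like subclass: elements are indices into a list of levels.
package ToyFactor {
    our @ISA = ('ToyPiddle');

    sub levels {
        my $self = shift;
        $self->{levels} = shift if @_;
        return $self->{levels};
    }
    sub labels {
        my ($self) = @_;
        return map { $self->levels->[$_] } $self->elements;
    }

    # Hand-rolled imitation of `around qw(slice uniq) => sub { ... }`:
    # call the inherited method, then copy the levels onto the result.
    for my $name (qw(slice uniq)) {
        my $orig = ToyPiddle->can($name);
        no strict 'refs';
        *{ __PACKAGE__ . "::$name" } = sub {
            my $self   = shift;
            my $result = $self->$orig(@_);
            $result->levels( $self->levels );
            return $result;
        };
    }
}

package main;

my $factor = ToyFactor->new( 0, 1, 1, 2 );
$factor->levels( [qw(low mid high)] );

print join( ',', $factor->slice( 1, 2 )->labels ), "\n";    # mid,mid
print join( ',', $factor->uniq->labels ),          "\n";    # low,mid,high
```

Any value-preserving method left out of the wrapped list would return an object without levels, which is why an annotation marking such methods would help.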

For Statistics::NiceR,

- R stores its data inside a SEXP C structure. You can reach inside and
get at the data by using accessor macros that return a pointer to the
underlying memory, like:

SEXP r_sexp_integer, r_sexp_real;
INTEGER(r_sexp_integer)[ idx ] /* access the int32_t value at idx */

REAL(r_sexp_real)[ idx ] /* access the double value at idx */

Currently, I'm just using memcpy() to get the R data into a PDL. I
haven't used pdl_wrap() on the R data yet, but I plan to soon. What
I'm wondering is: can I change the way PDL allocates data so that
it creates R's SEXP C structure in the background, perhaps
limited to a scope? This might be YAGNI, but it might have
implications for things like GPU support. Instead of having to
explicitly create GPU arrays all the time, there should be a way of
indicating that a piece of code will be using a different allocator
than usual.

- Speaking of different allocation types, it might be useful to look at
how other tools extend their built-in types. I'll give some R
examples:

- R's bigmemory <http://cran.r-project.org/web/packages/bigmemory/index.html>,
<http://www.stat.yale.edu/~mjk56/temp/bigmemory-vignette.pdf>,
<http://2013.hpcs.ca/wp-content/uploads/2013/07/HPCS2013-Parallel-Work-with-R.pdf>.

Not only does this support mmap'ed files (like PDL::IO::{FastRaw,FlexRaw}),
but they also have associated packages that have specialised
versions of things like linear regression (in biglm) and k-means
clustering (in biganalytics).

- R's GMP <http://cran.r-project.org/web/packages/gmp/index.html>.

It's a wrapper for the GMP library for big integers/rationals, but
it also lets you create matrices of big numbers which can be used
for solving a system of equations (solve.bigz).
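
As a rough sketch of the scoped-allocator idea from the first point above: Perl's `local` can swap an allocator for the duration of a block and restore it afterwards. Everything here (`alloc_default`, `alloc_r_sexp`, `zeroes`, `with_allocator`) is a hypothetical name made up for illustration. Real PDL allocation happens in C, so this only shows the shape such an interface might take, not an existing PDL API.

```perl
use strict;
use warnings;

# Hypothetical allocators: each returns a tagged buffer so the backend is visible.
sub alloc_default {
    my ($n) = @_;
    return { backend => 'perl', data => [ (0) x $n ] };
}

sub alloc_r_sexp {
    my ($n) = @_;
    # A real version would ask R for a SEXP; this one only simulates it.
    return { backend => 'R SEXP (simulated)', data => [ (0) x $n ] };
}

# The allocator that constructors use unless a scope overrides it.
our $ALLOCATOR = \&alloc_default;

# Hypothetical constructor that never names a backend itself.
sub zeroes {
    my ($n) = @_;
    return $ALLOCATOR->($n);
}

# Run $code with a different allocator; `local` restores the previous one
# when this sub returns, even if $code dies.
sub with_allocator {
    my ($alloc, $code) = @_;
    local $ALLOCATOR = $alloc;
    return $code->();
}

my $inside  = with_allocator( \&alloc_r_sexp, sub { zeroes(3) } );
my $outside = zeroes(3);

print "$inside->{backend}\n";     # R SEXP (simulated)
print "$outside->{backend}\n";    # perl
```

The code inside the block calls `zeroes` as usual and never mentions the backend, which is the property that would matter for GPU arrays as well.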


[^stringifiable]: Role that lets elements stringify themselves
<https://github.com/zmughal/p5-Data-Frame/blob/master/lib/PDL/Role/Stringifiable.pm>.

[^moo-hash-pdl]: <https://github.com/zmughal/p5-Data-Frame/blob/master/lib/PDL/Factor.pm> has the following code:

use Moo;
extends 'PDL';
around new => sub {
my $orig = shift;
my ($class, @args) = @_;
# snip...
unshift @args, _data => $enum;
my $self = $orig->($class, @args);
# snip...
};

sub FOREIGNBUILDARGS {
my ($self, %args) = @_;
( $args{_data} );
}

sub initialize {
bless { PDL => PDL::null() }, shift;
}

[^around-enum]: <https://github.com/zmughal/p5-Data-Frame/blob/master/lib/PDL/Role/Enumerable.pm#L46>.

Cheers,
- Zaki Mughal


> --Chris
> > _______________________________________________
> > Perldl mailing list
> > Per...@jach.hawaii.edu
> > http://mailman.jach.hawaii.edu/mailman/listinfo/perldl
> >