Writing a faster version of base::duplicated.array(x, MARGIN = 1)

Peter Hickey

May 10, 2014, 11:38:11 PM
to r-an...@googlegroups.com
First of all, I hope this is an appropriate question for the group. I am already using Rcpp in a package I am developing. My previous use cases were simple matrix-traversal operations that run 100-1000x faster than the equivalent R code, which makes me very happy :) But now I've struck a problem that I think could be solved via C++ but is a little beyond my abilities. I hope that someone is willing to help me.

Thank you for your help,
Pete

# Aim
For completeness, my aim is to define a `duplicated` S4-method for my class that extends the GRanges class defined in the Bioconductor package, GenomicRanges. Details aside, I want to find duplicated rows of an (integer) matrix in R. This matrix, x, contains a small number of columns (4-10, typically) and many rows (> 1,000,000).

I'd love suggestions on how to write a fast routine that gives identical results to base::duplicated.array(x, MARGIN = 1). I have a kludge that is very fast (0.2 seconds vs. 40 seconds for the base version), but it only works when the matrix has 3-4 columns. I can't see a simple way to generalise this fast kludge to matrices of arbitrary dimension. I also tried writing a function that used the Rcpp sugar function `duplicated`, but that only works for vectors and not matrices (correct?).

# Reproducible example
Please see https://gist.github.com/PeteHaitch/75d6f7fd0566767e1e80 for an example dataset, the fast kludge and some basic benchmarking.

# Notes on my fast kludge that works for matrices with 3-4 columns.
In the above example, I make use of the function `GenomicRanges:::.duplicated.GenomicRanges` to write a (very fast) kludge that works when the matrix contains 3-4 columns (but won't work when there are more than 4 columns).

`GenomicRanges:::.duplicated.GenomicRanges` finds duplicate "quads", which can be thought of as rows of a 4-column matrix (although the matrix isn't explicitly formed, I think). `GenomicRanges:::.duplicated.GenomicRanges` calls the function `IRanges:::duplicatedIntegerQuads`, which in turn calls custom C routines, the default being `Integer_selfmatch4_hash` (https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/S4Vectors/src/int_utils.c). I don't fully understand how this C routine works. But, from what I can tell, it hashes each "quad" to identify duplicates. Unfortunately, the C-level routine is hard-coded to work with "quads" and I can't see a simple way to generalise it to work with an arbitrary number of vectors, i.e. a matrix with an arbitrary number of columns.
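
To make the generalisation I'm after concrete, here is roughly the kind of routine I have in mind, sketched with Rcpp. This is only my guess at the approach, not the IRanges code, and it uses an ordered std::set of whole rows rather than a real hash table, so it is O(n log n) rather than O(n):

library(Rcpp)

cppFunction('
LogicalVector duplicatedRows(IntegerMatrix x) {
  // Mark row i TRUE if an identical row occurred earlier in the matrix.
  int nr = x.nrow();
  LogicalVector out(nr);
  std::set< std::vector<int> > seen;
  for (int i = 0; i < nr; i++) {
    IntegerMatrix::Row r = x(i, _);
    std::vector<int> row(r.begin(), r.end());
    out[i] = !seen.insert(row).second;  // insert fails => row already seen
  }
  return out;
}', includes = "#include <set>\n#include <vector>")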

There is also a related function `IRanges:::duplicatedIntegerPairs`, and a set of associated C-level routines, which suggests to me that the approach taken in the C code isn't readily generalisable to an arbitrary number of "columns". Otherwise, I would expect both `IRanges:::duplicatedIntegerPairs` and `IRanges:::duplicatedIntegerQuads` to be special cases of the same routine.

# Some notes on software used in the example
Details on accessing the Bioconductor SVN server are available from http://master.bioconductor.org/developers/how-to/source-control/. For read-only access, use the username readonly and password readonly.

The example uses methods from the `GenomicRanges` and `IRanges` Bioconductor packages. Many of the methods currently defined in the `IRanges` package are being moved to the `S4Vectors` package (https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/S4Vectors/) and I've linked to the current versions in the `S4Vectors` package rather than those in the current release (although the code is basically the same).

Gábor Csárdi

May 10, 2014, 11:55:11 PM
to Peter Hickey, r-an...@googlegroups.com
No detailed solution, just a general idea: you can hash the rows of the matrix, and if you find a good hash function for your data, then you only need to check the rows with identical hashes.

E.g. simply summing the rows of your test matrix gave me 8000 collisions, and it took 0.033 seconds to compute the sums with rowSums(). A more sophisticated hash function might be better, though; it all depends on the data you have.
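
To make the collision check concrete (a rough sketch, with x your example matrix):

h <- rowSums(x)     # cheap per-row "hash"
sum(duplicated(h))  # number of rows sharing a hash with an earlier row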

Gabor



Peter Hickey

May 11, 2014, 12:09:02 AM
to r-an...@googlegroups.com, Peter Hickey
Thanks for your reply, Gabor. Does the hash function have to guarantee that there are no collisions (which seems hard) or are there general strategies for resolving collisions?

Hmm, perhaps I need to have a closer look at the base::duplicated.array method, because I assumed that it would be hashing the rows of the matrix, so I don't understand why it is so slow.

Gábor Csárdi

May 11, 2014, 12:21:38 AM
to Peter Hickey, r-an...@googlegroups.com
Well, the number of collisions depends on your data and the hash function, but you need to check for them anyway (in most applications, at least, where errors are not allowed).

A cheap solution is to go over the sorted hash values, extract the subsets of the original matrix that share a hash value, and call the regular duplicated() method on each subset. Even this is not trivial, but it is relatively easy.
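
In R the grouped check might look something like this (just a sketch, not tuned at all, with h holding one hash value per row of x):

dup <- logical(nrow(x))
for (g in split(seq_len(nrow(x)), h)) {
  if (length(g) > 1L)
    dup[g] <- duplicated(x[g, , drop = FALSE])
}

A row is a duplicate overall exactly when it is a duplicate within its own hash group, because identical rows always get identical hashes.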

AFAIR base::duplicated.array just collapses the rows into strings, and then it searches for duplicated strings. I guess the latter is fast, the former is not. 

Which brings up another idea, actually. If you can write fast code to collapse your rows to strings, then duplicated() on the strings should be fast. This works if your numbers are integers, but it is not a very good solution for real numbers (you likely lose precision and get very long strings). But for this I think you definitely need C++ code to create the strings.
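
For illustration, the row-to-string idea can at least be prototyped in plain R with a single vectorised paste() over the columns (a sketch, assuming x is an integer matrix):

key <- do.call(paste, c(as.data.frame(x), sep = "\r"))
dup <- duplicated(key)

as.data.frame(x) turns the columns into a list, so paste() runs once over whole columns rather than once per row, avoiding the per-row apply().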

R internally uses sophisticated storage for strings, so that identical strings are stored just once rather than multiple times, and it uses hashing for this. I guess duplicated() exploits this, but I am not sure.

Gabor

Peter Hickey

May 13, 2014, 12:34:33 AM
to r-an...@googlegroups.com, Peter Hickey
Thank you so much for your help, Gabor. I've taken your initial suggestion of computing a hash from the rowSums and followed it with a more careful check of the duplicate hashes. It is fast enough for my needs (< 1 second on a matrix with > 2,000,000 rows and 4-10 columns), particularly because my matrices generally have very few, if any, duplicate rows.

A basic R implementation is shown below, although my current version does some of this using Rcpp.

rowSumsHash <- function(x) {
  z <- rowSums(x)
  # Need both fromLast = FALSE and fromLast = TRUE so that *all* rows with a
  # duplicated hash are flagged, not just the later instances.
  d <- duplicated(z, fromLast = FALSE) | duplicated(z, fromLast = TRUE)
  # Resolve the hash collisions with the base::duplicated.array method, but
  # only on the (hopefully few) rows whose hashes collide.
  d[d] <- duplicated(x[d, , drop = FALSE])
  return(d)
}
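
For what it's worth, a quick sanity check against the base method (x is the example matrix):

stopifnot(identical(rowSumsHash(x), duplicated(x)))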

Thanks,
Pete