First of all, I hope this is an appropriate question for the group. I am already using Rcpp in a package I am developing. My previous use cases were simple matrix-traversal operations that are 100-1000x faster than the equivalent code in R - this makes me very happy :) But now I've struck a problem that I think could be solved via C++ integration but it's a little beyond my abilities. I hope that someone is willing to help me.
Thank you for your help,
Pete
# Aim
For completeness, my aim is to define a `duplicated` S4-method for my class that extends the GRanges class defined in the Bioconductor package, GenomicRanges. Details aside, basically I want to find duplicated rows of an (integer) matrix in R. This matrix, x, contains a small number of columns (4-10, typically) and many rows (> 1,000,000).
I'd love suggestions on how to write a fast routine that gives identical results to base::duplicated.array(x, MARGIN = 1). I have a kludge that is very fast (0.2 seconds vs. 40 seconds for the base version) but it only works when the matrix has 3-4 columns and not when there are more than 4 columns. I can't see a simple way to generalise this fast kludge to matrices of arbitrary dimension. I also tried writing a function that used the Rcpp sugar method, duplicated, but the Rcpp sugar method only works for vectors and not matrices (correct?).
# Reproducible example
# Notes on my fast kludge that works for matrices with 3-4 columns.
In the above example, I make use of the function `GenomicRanges:::.duplicated.GenomicRanges` to write a (very fast) kludge that works when the matrix contains 3-4 columns (but won't work when there are more than 4 columns).
`GenomicRanges:::.duplicated.GenomicRanges` finds duplicate "quads", which can be thought of as rows of a 4-column matrix (although the matrix isn't explicitly formed, I think). `GenomicRanges:::.duplicated.GenomicRanges` calls the function `IRanges:::duplicatedIntegerQuads`, which in turn calls custom C routines, the default being `Integer_selfmatch4_hash` (
https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/S4Vectors/src/int_utils.c). I don't fully understand how this C routine works. But, from what I can tell, it hashes each "quad" to identify duplicates. Unfortunately, the C-level routine is hard-coded to work with "quads" and I can't see a simple way to generalise it to work with an arbitrary number of vectors, i.e. a matrix with an arbitrary number of columns.
There is also a related function `IRanges:::duplicatedIntegerPairs`, and set of associated C-level routines, which suggests to me that the approach taken in the C code isn't readily generalisable to an arbitrary number of "columns". Otherwise, I would expect that both `IRanges:::duplicatedIntegerPairs` and `IRanges:::duplicatedIntegerQuads` would be special cases of the same routine.
# Some notes on software used in the example
The example uses methods from the `GenomicRanges` and `IRanges` Bioconductor packages. Many of the methods currently defined in the `IRanges` package are being moved to the `S4Vectors` package (
https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/S4Vectors/) and I've linked to the current versions in the `S4Vectors` package rather than those in the current release (although the code is basically the same).