
Potential bug in rmr2-2.2.1, regarding the combiner/reducer


Amaranto Law

Jul 13, 2013, 1:52:15 PM
to rha...@googlegroups.com, buehl
Hi all, I have encountered a problem in rmr2 version 2.2.1. The problem doesn't exist if I use rmr2 version 2.2.0.

I will illustrate the problem with the code below; you can also find the code in the attached file.

First of all, we have to set up some variables, which are critical to reproducing the problem. `n` controls the size of the input to the mapper. `w.size` controls the number of keys received by the reducer, and hence the keys returned from it as well. Later, you may have to increase `w.size` if the problem doesn't occur, as I found that `w.size` does matter.

library("rmr2")
## Settings.
n
<- 200              # n is related to the size of input for the mapper.
w
.size <- 10000       # w.size is relate to the number of key for the reducer.
## w.size <- 5000     # No error occurs.


Then we create the key/value pairs for the input. Here we just use a random sample.

## Create the key/value pairs.
word.set <- unique(sapply(1:w.size,
                          function(x) paste0(sample(letters,
                                                    sample(4:10, 1),
                                                    replace = TRUE),
                                             collapse = "")))
key <- as.character(sapply(1:n, rep, times = 500))
val <- unlist(lapply(1:n, function(x) {
    a <- sample(c(0, 0, 0, 0, 0, 1, 2), 500, replace = TRUE)
    names(a) <- sample(word.set, 500)
    return(a)
}))
test <- keyval(key, val)


Here is the mapreduce function. It is important to set `combine = TRUE`. If it is set to `combine = NULL`, no error occurs. But the point is that we would like to use the reducer as the combiner, and that is also where the potential bug is located.

## The functions needed.
mp <- function(input) {
    mapper <- function(keys, word.count) {
        words <- names(word.count)
        word.set <- sort(unique(words))
        return.keys <- return.vals <- NULL
        for (this.word in word.set) {
            doc.count <- length(which(words == this.word))
            return.keys <- c(return.keys, this.word)
            return.vals <- c(return.vals, doc.count)
        }
        return(keyval(return.keys, return.vals))
    }
    reducer <- function(term, freq) {
        stopifnot(length(term) == 1, all(is.finite(freq)))
        return(keyval(term, sum(freq)))
    }
    return(mapreduce(input = input,
                     map = mapper, reduce = reducer, combine = TRUE))
}

Finally, we run the mapreduce job on the Hadoop backend as well as the local backend.

## Run on the Hadoop backend.
rmr.options(backend = "hadoop")
DTDF <- mp(to.dfs(test))
anyDuplicated(from.dfs(DTDF)$key)
## Expected result is : [1] 0
## Actual result is   : [1] <non-zero integer>

## Run on the local backend.
rmr.options(backend = "local")
DTDF <- mp(to.dfs(test))
anyDuplicated(from.dfs(DTDF)$key)
## Expected result is : [1] 0
## Actual result is   : [1] 0


The results from the Hadoop backend and the local backend are not the same. If you get a result of [1] 0 on the Hadoop backend, please repeat the above steps with a larger `w.size`, say 20000, 50000, and so on.

To summarize, the bug occurs under the following conditions:
1) Using rmr2 version 2.2.1; version 2.2.0 doesn't have the bug described above.
2) Setting `w.size` to a sufficiently large value; small values of `w.size` don't cause the problem.
3) Using `combine = TRUE` in the mapreduce() function.

Thanks,
Keith


demo.R

Antonio Piccolboni

Jul 13, 2013, 3:25:45 PM
to RHadoop Google Group, buehl
Hi,
thanks for your report; I could easily take your code and run it. Unfortunately, I can't reproduce the problem on the same version of rmr2. The local backend won't help you troubleshoot this problem because the combiner is ignored in that setting. Any program that returns different values depending on whether the combiner is on or not is to be considered incorrect, as Hadoop doesn't guarantee the application of the combiner; it runs 0 or more times. This program doesn't look incorrect to me, only very inefficient, so it could still be a bug in rmr2. In this case it seems that when the input size is less than 10^4, which is also the default value for keyval.length, there is a single map call, hence the combiner is skipped. So your error happens when there are at least two map calls.

Here is what I would do. I would first try to repro the error at a much smaller scale, for instance set rmr.options(keyval.length = 3) and create an input of size 10. To repro the problem I guess you'd need the same key in multiple groups, and from what I know about the difference between releases, you may want the first and last key to be the same. Then add rmr.str calls to dump keys and values in the mapper and reducer, I'd say just before the last line of the mapper and as the first line of the reducer. The information will be in the stderr log, which, depending on your Hadoop configuration, may be conveniently in the console or in a log accessible from the web UI. This is how it looks for me:

Dotted pair list of 6
 $ : language (function() {     load("./rmr-local-env687e32205164") ...
 $ : language rmr2:::map.loop(map = map, keyval.reader = input.reader(), keyval.writer = if (is.null(reduce)) {     output.writer() ...
 $ : language as.keyval(map(keys(kv), values(kv)))
 $ : language is.keyval(x)
 $ : language map(keys(kv), values(kv))
 $ :length 2 rmr.str(return.keys)
  ..- attr(*, "srcref")=Class 'srcref'  atomic [1:8] 11 1 11 20 1 20 11 11
  .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x104190d10> 
return.keys
 chr [1:7993] "aagzlftcru" "aaiux" "aaiytmaktj" "aajqgi" ...

Please avoid using any randomness when creating test cases, like the sample() call. That makes the test case nondeterministic and may be the reason why I can't repro it. I will try to build my own test case based on the current guess about where the problem may lie. Thanks
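In code, the scaled-down, instrumented run described above would look roughly like the following sketch; the tiny input, the key names and the exact rmr.str placement here are only an illustration of the suggestion, not the actual test case.

library(rmr2)

## Shrink the chunk size so that more than one map call happens even on a tiny input.
rmr.options(keyval.length = 3)

## Small deterministic input; the first and last keys are deliberately equal,
## so one key should reach the reducer from more than one group.
small.test <- keyval(c("k1", "k2", "k3", "k4", "k1"),
                     c(1, 1, 2, 3, 5))

mapper <- function(k, v) {
    out <- keyval(k, v)
    rmr.str(keys(out))    # dump the keys emitted by this map call to stderr
    out
}

reducer <- function(k, vv) {
    rmr.str(k)            # dump the key received by each reduce call
    keyval(k, sum(vv))
}

job <- mapreduce(input = to.dfs(small.test),
                 map = mapper, reduce = reducer, combine = TRUE)
from.dfs(job)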

Antonio



Keith Law

Jul 13, 2013, 7:51:44 PM
to rha...@googlegroups.com, buehl, ant...@piccolboni.info
Hi Antonio,

Thank you for your clear instructions. I have reduced the size of the input and captured the same error again with another random sample. This time I have attached the sample, together with the function, in the RData file.

So, running the code below should reproduce the error.

library(rmr2)
load("test.RData")
rmr.options(keyval.length = 3)
result <- from.dfs(mp(to.dfs(test)))$key
anyDuplicated(result)


I get

[1] 50

while the expected result is

[1] 0
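A deterministic input of a similar, reduced shape could be built along the following lines; this is only a hypothetical reconstruction for illustration, not the exact attached sample that triggers the error.

## Hypothetical reconstruction only -- not the attached test.RData.
library(rmr2)
set.seed(1)                          # make the random sample reproducible
n <- 10
key <- as.character(sapply(1:n, rep, times = 5))
val <- unlist(lapply(1:n, function(x) {
    a <- sample(c(0, 0, 0, 0, 0, 1, 2), 5, replace = TRUE)
    names(a) <- sample(letters, 5)
    a
}))
test <- keyval(key, val)
save(test, mp, file = "test.RData")  # mp() as defined in the first post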


Thanks.
test.RData

Keith Law

Jul 13, 2013, 8:00:02 PM
to rha...@googlegroups.com, buehl
Following up on my previous post: please ignore the file attached there. I wrongly attached another file. The correct one is attached to this post.

Thanks.
test.RData

Antonio Piccolboni

Jul 13, 2013, 8:10:20 PM
to RHadoop Google Group, buehl

On Sat, Jul 13, 2013 at 4:51 PM, Keith Law <amaran...@gmail.com> wrote:
library(rmr2)
load("test.RData")
rmr.options(keyval.length = 3)
result <- from.dfs(mp(to.dfs(test)))$key
anyDuplicated(result)

I still get 0. I would suggest that you install quickcheck from https://github.com/RevolutionAnalytics/quickcheck/blob/master/build/quickcheck_1.0.tar.gz and run R CMD check to verify the compatibility of your platform with rmr2. By the way, what OS, R, and Hadoop versions are you running?


Antonio

Antonio Piccolboni

Jul 13, 2013, 8:12:05 PM
to RHadoop Google Group, buehl
Still 0 with the updated file.

Keith Law

Jul 13, 2013, 9:21:16 PM
to rha...@googlegroups.com, buehl, ant...@piccolboni.info

I tried on several machines with different OSes.

Machine 1
OS: Mac OS X 10.8.4 64-bit
R version 2.15.3 (2013-03-01)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
Hadoop version: Apache 1.1.2

Machine 2
OS: Ubuntu Saucy Salamander (development branch) 32-bit
R version 3.0.0 (2013-04-03)
Platform: i686-pc-linux-gnu (32-bit)
Hadoop version: Apache 1.0.4


Machine 3
OS: Ubuntu 12.04 LTS 32-bit
R version 3.0.1 (2013-05-16)
Platform: i686-pc-linux-gnu (32-bit)
Hadoop version: Apache 1.1.2 (also tried 1.0.4)

Antonio Piccolboni

Jul 13, 2013, 10:53:28 PM
to RHadoop Google Group, buehl
Two of those are 32-bit. We don't support that, and the serialization implementation is broken on 32 bits. I should say we don't support OS X either, but that's where I develop, so in this case my test was on OS X. I used CDH 4.2; I am not aware of any reason why it shouldn't work on Apache 1.1.2, but that's one difference. I am on R 3.0; that should not be a problem either, and the packages have been tested on 2.15.1 if I remember correctly. You could still run R CMD check on OS X and see if it fails. You could also run your test in an R --vanilla session to simplify your environment; if that's still broken, you could send me a new RData and I will test it in an R --vanilla session (the one you shared seemed to have unrelated packages loaded).


Antonio



Keith Law

Jul 13, 2013, 11:36:44 PM
to rha...@googlegroups.com, buehl, ant...@piccolboni.info
Just a quick follow-up to your previous email. Actually, I also ran the test on Machine 1 with 64-bit Linux, and it gives the same result. The output of R CMD check for Machine 1 on OS X is attached; it seems to be fine except for some LaTeX issues, which are not relevant.
log.txt

Keith Law

Jul 14, 2013, 12:35:23 AM
to rha...@googlegroups.com, buehl
I have attached the new test.RData. This time the magic number I got is [1] 29.

Hadoop: 
Apache Hadoop 1.1.2

Java    : 
java version "1.7.0_21"  
OpenJDK Runtime Environment (IcedTea 2.3.9) (7u21-2.3.9-0ubuntu0.12.10.1) 
OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)

R session:

R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_HK.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_HK.UTF-8        LC_COLLATE=en_HK.UTF-8    
 [5] LC_MONETARY=en_HK.UTF-8    LC_MESSAGES=en_HK.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_HK.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rmr2_2.2.1     reshape2_1.2.2 plyr_1.8       stringr_0.6.2  functional_0.4
[6] digest_0.6.3   bitops_1.0-5   RJSONIO_1.0-3  Rcpp_0.10.4  
test.RData

Keith Law

Jul 14, 2013, 3:45:03 AM
to rha...@googlegroups.com, buehl
As an update on this problem, I have installed CDH 4.3 with MRv1 on Machine 1, whose OS is Ubuntu 12.10 64-bit, with R 3.0.1. I still get the same result, [1] 29.


Hadoop version
=====================
Hadoop 2.0.0-cdh4.3.0
Subversion file:///var/lib/jenkins/workspace/generic-package-ubuntu64-12-04/CDH4.3.0-Packaging-Hadoop-2013-05-27_19-02-30/hadoop-2.0.0+1357-1.cdh4.3.0.p0.21~precise/src/hadoop-common-project/hadoop-common -r 48a9315b342ca16de92fcc5be95ae3650629155a
Compiled by jenkins on Mon May 27 19:45:27 PDT 2013
From source with checksum a4218d77f9b12df4e3e49ef96f9d357d
This command was run using /usr/lib/hadoop/hadoop-common-2.0.0-cdh4.3.0.jar


Java version
======================
java version "1.7.0_21"
OpenJDK Runtime Environment (IcedTea 2.3.9) (7u21-2.3.9-0ubuntu0.12.10.1)
OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)


R session info
======================
R version 3.0.1 (2013-05-16)

Antonio Piccolboni

Jul 14, 2013, 11:39:23 AM
to RHadoop Google Group, buehl
I can't repro it on CentOS/HDP1/R 3.0.1 either. Maybe someone else reading this has a few spare cycles to give it a run too, so that we can assess whether more people experience the bug. I don't have many good options to suggest. One is that you give me access to a system where you observe the error; I know this is problematic for many organizations. The other is that you debug this yourself, with my assistance of course.

The crucial functions are map.loop and reduce.loop. We want to examine what keys are written by the map loop, but even more importantly how they are processed in the reduce loop. Is it that for some reason the keys are not getting to the reduce loop in order? Or is the reduce.loop logic wrong in grouping them? Of course this is not an easy task if you are not familiar with the implementation of rmr2, but you may be able to figure it out without reading the whole thing, just this function and a few other bits. The way I debug when bugs are not reproducible on the local backend is to insert rmr.str statements strategically and then inspect stderr (console or log files, depending on the mode of your Hadoop instance).

One thing, just to turn every stone: did you take a look at the 29 repeating keys? What position do they have in the input? Thanks
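For that last question, something along these lines would do (a quick sketch, reusing the result key vector and the test input from the earlier snippets):

result <- from.dfs(mp(to.dfs(test)))$key
dup.keys <- unique(result[duplicated(result)])
dup.keys                              # the keys that come out duplicated
which(keys(test) %in% dup.keys)       # their positions in the input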


Antonio


Antonio Piccolboni

Jul 17, 2013, 1:16:41 PM
to rha...@googlegroups.com, buehl, ant...@piccolboni.info
To update everybody on this: Keith and his organization kindly made available a test system on which I could diagnose and fix the bug. Despite being difficult to reproduce, this was a nasty bug that can occasionally produce split groups in the reduce phase (that is, no record disappears, but keys that should be equal are treated as distinct). The root cause is that Java and R order the same values in different ways, and that affected the grouping algorithm under certain circumstances. There is a fix in master that's undergoing additional testing right now, and a hotfix release should follow soon.
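As an illustration of the kind of ordering discrepancy involved (not the exact keys from this bug): R's sort is locale-aware by default, while a plain byte-wise comparison, which is closer to how keys are typically compared on the Java side, can put the same strings in a different order. A quick way to see this in R, assuming the en_US.UTF-8 locale is installed:

x <- c("a", "B")

Sys.setlocale("LC_COLLATE", "C")             # byte-wise ordering
sort(x)                                      # "B" "a"

Sys.setlocale("LC_COLLATE", "en_US.UTF-8")   # locale-aware collation
sort(x)                                      # "a" "B"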


Antonio

Antonio Piccolboni

Jul 18, 2013, 11:47:06 PM
to rha...@googlegroups.com, buehl, ant...@piccolboni.info
To wrap this up, I am about to announce rmr2 2.2.2 in a new thread. Thanks for the feedback and help.


Antonio

Keith Law

Jul 23, 2013, 10:55:24 PM
to rha...@googlegroups.com
Hi Antonio,

Many thanks for your rapid response to the problem. I truly appreciate your time and effort. Now rmr2 works perfectly. Thanks!