Working with Large dataset: Mapreduce Streaming failed with error code 1


Ankit Sangwan

Apr 13, 2015, 5:45:50 AM4/13/15
to rha...@googlegroups.com
Hi all,

I have started the second phase of my work with RHadoop, which involves working with large datasets.
Previously, I had successfully run most of the examples available online (plus some new ones) on small datasets.

I then moved to a large dataset that is about 2 GB in the R workspace.
The dataset I am working on contains 1.5 million hotel reviews with author details (character, factor), hotel details (character, factor), and all aspect ratings (numeric, integer).

I tried to run a word_count function on Review_Content using a mapreduce job. There are two issues I have been facing.

1.
When I try with all the observations in reviews, i.e., 1.5 million, it throws an error as shown below:

reviews_dfs <- to.dfs(reviews, output = "/ankit/input/")
Error in writeBin(.Call("typedbytes_writer", objects, native, PACKAGE = "rmr2"),  : 
  long vectors not supported yet: connections.c:4089

Then I tried with fewer observations, and it was stored in HDFS successfully:
reviews_dfs <- to.dfs(reviews[1:1000000,], output = "/ankit/input/")
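
A possible workaround, in case the full frame is really needed on HDFS (a sketch, not taken from this thread; the chunk size and file names are illustrative, and it assumes the error comes from serializing the whole frame in one writeBin() call): write the data frame in row chunks, each chunk to its own file under the same input directory.

library(rmr2)
chunk.size <- 250000                                     # illustrative chunk size
starts <- seq(1, nrow(reviews), by = chunk.size)
for (i in seq_along(starts)) {
  rows <- starts[i]:min(starts[i] + chunk.size - 1, nrow(reviews))
  to.dfs(reviews[rows, ],
         output = sprintf("/ankit/input/part-%03d", i))  # hypothetical file names
}
# mapreduce() can then read the whole directory, e.g. input = "/ankit/input/"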

HDFS window looks like:
Name    Type    Size        Replication    Block Size    Modification Time    Permission    Owner      Group
input   file    376.54 MB   2              64 MB         2015-04-13 14:53     rw-r--r--     -------    supergroup


Any ideas/suggestions on why it is not working with all the observations?

And how come the Size column value (376.54 MB) is larger than the Block Size column value (64 MB)? Is there any problem here?
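
(Note, based on general HDFS behaviour rather than anything specific to this thread: the Block Size is just the unit a file is split into for storage, not a cap on file size, so a 376.54 MB file simply spans ceiling(376.54 / 64) = 6 blocks of 64 MB. A Size larger than the Block Size is normal.)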



2.
Word_Count function:
# Function to count the words in Review.Content
> library(plyr)   # ldply() used in the map function below
> word_count <- function(x) {
    count <- length(unlist(strsplit(tolower(x), "[^a-z]+")))
    return(count)
  }
> s.map <- function(keys, lines) {
    # NB: the column is referenced as Reviews.Content here but as Review.Content
    # in the comment above; the name must match the data frame.
    wordcount    <- ldply(lines$Reviews.Content, word_count)
    review_count <- cbind(lines$Reviews.Content, wordcount)
    return(keyval("R", review_count))
  }
> s.reduce <- function(key, line) {
    keyval(key, line)
  }
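
A quick sanity check of word_count outside Hadoop (the example string is only illustrative):

> word_count("Great location, friendly staff!")
[1] 4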

Then I ran the word_count function over the data in HDFS:

> joboutput <- mapreduce(input = "/ankit/input/", output = "/ankit/output/", map = s.map, reduce = s.reduce)
Warning: $HADOOP_HOME is deprecated.
packageJobJar: [/app/hadoop/tmp/hadoop-unjar2339796455373846126/] [] /tmp/streamjob5910864976750693382.jar tmpDir=null
15/04/13 14:24:05 INFO mapred.FileInputFormat: Total input paths to process : 1
15/04/13 14:24:05 INFO streaming.StreamJob: getLocalDirs(): [/app/hadoop/tmp/mapred/local]
15/04/13 14:24:05 INFO streaming.StreamJob: Running job: job_201504031257_0056
15/04/13 14:24:05 INFO streaming.StreamJob: To kill this job, run:
15/04/13 14:24:05 INFO streaming.StreamJob: /usr/local/hadoop-1.0.4/libexec/../bin/hadoop job -Dmapred.job.tracker=1466:54311 -kill job_201504031257_0056
15/04/13 14:24:05 INFO streaming.StreamJob: Tracking URL: http://1466:50030/jobdetails.jsp?jobid=job_201504031257_0056
15/04/13 14:24:06 INFO streaming.StreamJob: map 0% reduce 0%
15/04/13 14:25:38 INFO streaming.StreamJob: map 1% reduce 0%
[... repetitive progress lines omitted: between 14:25 and 14:30 the map percentage repeatedly climbs into the 10-33% range and then drops back to 0% ...]
15/04/13 14:30:03 INFO streaming.StreamJob: map 12% reduce 0%
15/04/13 14:30:11 INFO streaming.StreamJob: map 100% reduce 100%
15/04/13 14:30:11 INFO streaming.StreamJob: To kill this job, run:
15/04/13 14:30:11 INFO streaming.StreamJob: /usr/local/hadoop-1.0.4/libexec/../bin/hadoop job -Dmapred.job.tracker=1466:54311 -kill job_201504031257_0056
15/04/13 14:30:11 INFO streaming.StreamJob: Tracking URL: http://1466:50030/jobdetails.jsp?jobid=job_201504031257_0056
15/04/13 14:30:11 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201504031257_0056_m_000000
15/04/13 14:30:11 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, : hadoop streaming failed with error code 1

What is the reason for this error? Any help/suggestions?

Thanks in advance........
Ankit

Antonio Piccolboni

Apr 15, 2015, 5:28:36 PM4/15/15
to rha...@googlegroups.com


On Monday, April 13, 2015 at 2:45:50 AM UTC-7, Ankit Sangwan wrote:
Hi all,

I have started the second phase of my work with RHadoop, which involves working with large datasets.
Previously, I had successfully run most of the examples available online (plus some new ones) on small datasets.

I then moved to a large dataset that is about 2 GB in the R workspace.
The dataset I am working on contains 1.5 million hotel reviews with author details (character, factor), hotel details (character, factor), and all aspect ratings (numeric, integer).

I tried to run a word_count function on Review_Content using a mapreduce job. There are two issues I have been facing.

1.
When I try with all the observations in reviews, i.e., 1.5 million, it throws an error as shown below:

reviews_dfs <- to.dfs(reviews, output = "/ankit/input/")
Error in writeBin(.Call("typedbytes_writer", objects, native, PACKAGE = "rmr2"),  : 
  long vectors not supported yet: connections.c:4089

Then I tried with fewer observations, and it was stored in HDFS successfully:
reviews_dfs <- to.dfs(reviews[1:1000000,], output = "/ankit/input/")

Two comments. First, to.dfs is not meant to process big data. If it fits in memory, process it in memory, not on Hadoop. Second, I would like to fix this anyway because it's unexpected, but it has been reported before and I can't reproduce it: not with twice as many columns, not with four times as many. Of course I don't have the data, so I generated random data frames using the quickcheck package.
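
For illustration, a minimal in-memory version of the same word count (a sketch only; the column name Review.Content is assumed from the original post and may differ from the actual data frame):

# Plain R, no Hadoop; reasonable if the ~2 GB data frame fits comfortably in RAM
reviews$word_count <- vapply(
  strsplit(tolower(as.character(reviews$Review.Content)), "[^a-z]+"),
  length, integer(1))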

As explained in the past, I would keep each thread to one main subject, thanks. Also, please learn about debugging in rmr and about reporting bugs in general if you want to maximize the chances of getting help.
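
For example, rmr2's local backend runs the same map and reduce functions in a single R process, which typically surfaces the underlying R error directly instead of a generic "streaming failed with error code 1" (a sketch; the 1,000-row sample is illustrative):

library(rmr2)
rmr.options(backend = "local")      # run everything in this R session
small <- to.dfs(reviews[1:1000, ])  # a small sample is usually enough to reproduce a map error
res <- mapreduce(input = small, map = s.map, reduce = s.reduce)
from.dfs(res)
rmr.options(backend = "hadoop")     # switch back once the job works locally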


 


HDFS window looks like:
Name    Type    Size        Replication    Block Size    Modification Time    Permission    Owner      Group
input   file    376.54 MB   2              64 MB         2015-04-13 14:53     rw-r--r--     -------    supergroup


Any ideas/suggestions on why it is not working with all the observations?

I do not understand this question. 