Working with Large dataset: Mapreduce Streaming failed with error code 1


Ankit Sangwan

Apr 13, 2015, 5:45:50 AM4/13/15
to rha...@googlegroups.com
Hi all,

I have started the second phase of my work with RHadoop, which involves working with large datasets.
Previously, I had successfully run most of the examples available online (plus some new ones) on small datasets.

I then moved to a large dataset that is about 2 GB in the R workspace.
The dataset I am working on contains 1.5 million hotel reviews with author details (character, factor), hotel details (character, factor), and all aspect ratings (numeric, integer).

I tried to run a word_count function on Review_Content using a mapreduce job. There are two issues I have been facing.

1.
When I try with all the observations in reviews, i.e., 1.5 million, it throws an error as shown below:

reviews_dfs <- to.dfs(reviews, output = "/ankit/input/")
Error in writeBin(.Call("typedbytes_writer", objects, native, PACKAGE = "rmr2"),  : 
  long vectors not supported yet: connections.c:4089

Then I tried with fewer observations, and it was stored in HDFS successfully:
reviews_dfs <- to.dfs(reviews[1:1000000,], output = "/ankit/input/")
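
A possible workaround, in case the full frame is really needed on HDFS (a sketch, not taken from this thread; the chunk size and file names are illustrative, and it assumes the error comes from serializing the whole frame in one writeBin() call): write the data frame in row chunks, each chunk to its own file under the same input directory.

library(rmr2)
chunk.size <- 250000                                     # illustrative chunk size
starts <- seq(1, nrow(reviews), by = chunk.size)
for (i in seq_along(starts)) {
  rows <- starts[i]:min(starts[i] + chunk.size - 1, nrow(reviews))
  to.dfs(reviews[rows, ],
         output = sprintf("/ankit/input/part-%03d", i))  # hypothetical file names
}
# mapreduce() can then read the whole directory, e.g. input = "/ankit/input/"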

HDFS window looks like:
Name    Type    Size        Replication    Block Size    Modification Time    Permission    Owner      Group
input   file    376.54 MB   2              64 MB         2015-04-13 14:53     rw-r--r--     -------    supergroup


Any ideas/suggestions on why it is not working with all the observations?

And how come the Size column value (376.54 MB) is larger than the Block Size column value (64 MB)? Is there any problem here?
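
(Note, based on general HDFS behaviour rather than anything specific to this thread: the Block Size is just the unit a file is split into for storage, not a cap on file size, so a 376.54 MB file simply spans ceiling(376.54 / 64) = 6 blocks of 64 MB. A Size larger than the Block Size is normal.)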



2.
Word_Count function:
# Function to count the words in Review.Content
> library(plyr)   # ldply() used in the map function below
> word_count <- function(x) {
    count <- length(unlist(strsplit(tolower(x), "[^a-z]+")))
    return(count)
  }
> s.map <- function(keys, lines) {
    # NB: the column is referenced as Reviews.Content here but as Review.Content
    # in the comment above; the name must match the data frame.
    wordcount    <- ldply(lines$Reviews.Content, word_count)
    review_count <- cbind(lines$Reviews.Content, wordcount)
    return(keyval("R", review_count))
  }
> s.reduce <- function(key, line) {
    keyval(key, line)
  }
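
A quick sanity check of word_count outside Hadoop (the example string is only illustrative):

> word_count("Great location, friendly staff!")
[1] 4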

Then I ran the word_count function over the data in HDFS:

> joboutput <- mapreduce(input = "/ankit/input/", output = "/ankit/output/", map = s.map, reduce = s.reduce)
Warning: $HADOOP_HOME is deprecated.
packageJobJar: [/app/hadoop/tmp/hadoop-unjar2339796455373846126/] [] /tmp/streamjob5910864976750693382.jar tmpDir=null
15/04/13 14:24:05 INFO mapred.FileInputFormat: Total input paths to process : 1
15/04/13 14:24:05 INFO streaming.StreamJob: getLocalDirs(): [/app/hadoop/tmp/mapred/local]
15/04/13 14:24:05 INFO streaming.StreamJob: Running job: job_201504031257_0056
15/04/13 14:24:05 INFO streaming.StreamJob: To kill this job, run:
15/04/13 14:24:05 INFO streaming.StreamJob: /usr/local/hadoop-1.0.4/libexec/../bin/hadoop job -Dmapred.job.tracker=1466:54311 -kill job_201504031257_0056
15/04/13 14:24:05 INFO streaming.StreamJob: Tracking URL: http://1466:50030/jobdetails.jsp?jobid=job_201504031257_0056
15/04/13 14:24:06 INFO streaming.StreamJob: map 0% reduce 0%
15/04/13 14:25:38 INFO streaming.StreamJob: map 1% reduce 0%
[... repetitive progress lines omitted: between 14:25 and 14:30 the map percentage repeatedly climbs into the 10-33% range and then drops back to 0% ...]
15/04/13 14:30:03 INFO streaming.StreamJob: map 12% reduce 0%
15/04/13 14:30:11 INFO streaming.StreamJob: map 100% reduce 100%
15/04/13 14:30:11 INFO streaming.StreamJob: To kill this job, run:
15/04/13 14:30:11 INFO streaming.StreamJob: /usr/local/hadoop-1.0.4/libexec/../bin/hadoop job -Dmapred.job.tracker=1466:54311 -kill job_201504031257_0056
15/04/13 14:30:11 INFO streaming.StreamJob: Tracking URL: http://1466:50030/jobdetails.jsp?jobid=job_201504031257_0056
15/04/13 14:30:11 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201504031257_0056_m_000000
15/04/13 14:30:11 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, : hadoop streaming failed with error code 1

What is the reason for this error? Any help/suggestions?

Thanks in advance........
Ankit

Antonio Piccolboni

Apr 15, 2015, 5:28:36 PM4/15/15
to rha...@googlegroups.com


On Monday, April 13, 2015 at 2:45:50 AM UTC-7, Ankit Sangwan wrote:
Hi all,

I have started the second phase of my work with RHadoop, which involves working with large datasets.
Previously, I had successfully run most of the examples available online (plus some new ones) on small datasets.

I then moved to a large dataset that is about 2 GB in the R workspace.
The dataset I am working on contains 1.5 million hotel reviews with author details (character, factor), hotel details (character, factor), and all aspect ratings (numeric, integer).

I tried to run a word_count function on Review_Content using a mapreduce job. There are two issues I have been facing.

1.
When I try with all the observations in reviews, i.e., 1.5 million, it throws an error as shown below:

reviews_dfs <- to.dfs(reviews, output = "/ankit/input/")
Error in writeBin(.Call("typedbytes_writer", objects, native, PACKAGE = "rmr2"),  : 
  long vectors not supported yet: connections.c:4089

Then I tried with fewer observations, and it was stored in HDFS successfully:
reviews_dfs <- to.dfs(reviews[1:1000000,], output = "/ankit/input/")

Two comments. First, to.dfs is not meant to process big data. If it fits in memory, process it in memory, not on Hadoop. Second, I would like to fix this anyway because it's unexpected, but it has been reported before and I can't reproduce it: not with twice as many columns, not with four times as many. Of course I don't have the data, so I generated random data frames using the quickcheck package.
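
For illustration, a minimal in-memory version of the same word count (a sketch only; the column name Review.Content is assumed from the original post and may differ from the actual data frame):

# Plain R, no Hadoop; reasonable if the ~2 GB data frame fits comfortably in RAM
reviews$word_count <- vapply(
  strsplit(tolower(as.character(reviews$Review.Content)), "[^a-z]+"),
  length, integer(1))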

As explained in the past, I would keep each thread to one main subject, thanks. Also, please learn about debugging in rmr and about reporting bugs in general if you want to maximize the chances of getting help.
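
For example, rmr2's local backend runs the same map and reduce functions in a single R process, which typically surfaces the underlying R error directly instead of a generic "streaming failed with error code 1" (a sketch; the 1,000-row sample is illustrative):

library(rmr2)
rmr.options(backend = "local")      # run everything in this R session
small <- to.dfs(reviews[1:1000, ])  # a small sample is usually enough to reproduce a map error
res <- mapreduce(input = small, map = s.map, reduce = s.reduce)
from.dfs(res)
rmr.options(backend = "hadoop")     # switch back once the job works locally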


 


HDFS window looks like:
Name    Type    Size        Replication    Block Size    Modification Time    Permission    Owner      Group
input   file    376.54 MB   2              64 MB         2015-04-13 14:53     rw-r--r--     -------    supergroup


Any ideas/suggestions on why it is not working with all the observations?

I do not understand this question. 