Hi all,
I have started the second phase of my work with RHadoop, which involves working with large datasets.
Previously, I had successfully run most of the examples available online (plus some new ones) on small datasets.
I have now moved on to a larger dataset, about 2 GB as reported in the R workspace.
The dataset contains 1.5 million hotel reviews, with author details (character, factor), hotel details (character, factor), and all aspect ratings (numeric, integer).
I tried to run a word_count function on the Reviews.Content column via a mapreduce job, and I have been facing two issues.
1.
When I tried with all 1.5 million observations, to.dfs throws the error shown below:
reviews_dfs <- to.dfs(reviews, output = "/ankit/input/")
Error in writeBin(.Call("typedbytes_writer", objects, native, PACKAGE = "rmr2"), :
long vectors not supported yet: connections.c:4089
When I then tried with fewer observations, it stored successfully in HDFS:
reviews_dfs <- to.dfs(reviews[1:1000000,], output = "/ankit/input/")
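The only workaround I can think of so far is writing the data frame in chunks to separate HDFS paths, so that no single serialized object hits R's long-vector limit in writeBin. This is an untested sketch; the chunk size and part paths are placeholders I made up, not values I have verified:

```r
library(rmr2)

# Untested sketch: write the 1.5M-row data frame in chunks so that no
# single serialized object exceeds R's long-vector / writeBin limit.
# chunk.size and the "part-XX" paths are placeholders.
chunk.size <- 500000
n <- nrow(reviews)
starts <- seq(1, n, by = chunk.size)
for (i in seq_along(starts)) {
  rows <- starts[i]:min(starts[i] + chunk.size - 1, n)
  to.dfs(reviews[rows, ], output = sprintf("/ankit/input/part-%02d", i))
}
```

If that works, I assume a later mapreduce call could take the whole "/ankit/input/" directory as its input, but I have not confirmed this.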
The HDFS browser shows:
| Name  | Type | Size      | Replication | Block Size | Modification Time | Permission | Owner   | Group      |
|-------|------|-----------|-------------|------------|-------------------|------------|---------|------------|
| input | file | 376.54 MB | 2           | 64 MB      | 2015-04-13 14:53  | rw-r--r--  | ------- | supergroup |
Any ideas/suggestions on why it does not work with all the observations?
Also, how can the value in the Size column (376.54 MB) be larger than the value in the Block Size column (64 MB)? Is that a problem?
2.
The word_count mapreduce job itself fails. The functions are:
# Function to count the words in each Reviews.Content entry
> word_count <- function(x) {
count <- length(unlist(strsplit(tolower(x), "[^a-z]+")))
return(count)
}
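As a quick sanity check, the function behaves as I expect on a single review string (it splits on runs of non-letters):

```r
word_count("Great hotel, friendly staff!")
# [1] 4   ("great" "hotel" "friendly" "staff")
```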
> s.map <- function(keys, lines) {
    # count the words in each review; ldply requires the plyr package
    # to be loaded on every node running the mapper
    wordcount <- ldply(lines$Reviews.Content, word_count)
    review_count <- cbind(lines$Reviews.Content, wordcount)
    return(keyval("R", review_count))
  }
> s.reduce <- function(key, line) {
    # identity reduce: pass the key/value pairs through unchanged
    keyval(key, line)
  }
Then I ran the mapreduce job:
> joboutput <- mapreduce(input = "/ankit/input/", output = "/ankit/output/", map = s.map, reduce = s.reduce)
Warning: $HADOOP_HOME is deprecated.
packageJobJar: [/app/hadoop/tmp/hadoop-unjar2339796455373846126/] [] /tmp/streamjob5910864976750693382.jar tmpDir=null
15/04/13 14:24:05 INFO mapred.FileInputFormat: Total input paths to process : 1
15/04/13 14:24:05 INFO streaming.StreamJob: getLocalDirs(): [/app/hadoop/tmp/mapred/local]
15/04/13 14:24:05 INFO streaming.StreamJob: Running job: job_201504031257_0056
15/04/13 14:24:05 INFO streaming.StreamJob: To kill this job, run:
15/04/13 14:24:05 INFO streaming.StreamJob: /usr/local/hadoop-1.0.4/libexec/../bin/hadoop job -Dmapred.job.tracker=1466:54311 -kill job_201504031257_0056
15/04/13 14:24:05 INFO streaming.StreamJob: Tracking URL: http://1466:50030/jobdetails.jsp?jobid=job_201504031257_0056
15/04/13 14:24:06 INFO streaming.StreamJob: map 0% reduce 0%
15/04/13 14:25:38 INFO streaming.StreamJob: map 1% reduce 0%
[... progress lines trimmed: map climbs to about 15%, drops back to 0%, climbs to 33%, and resets to 0% several more times as the map attempt is retried ...]
15/04/13 14:30:03 INFO streaming.StreamJob: map 12% reduce 0%
15/04/13 14:30:11 INFO streaming.StreamJob: map 100% reduce 100%
15/04/13 14:30:11 INFO streaming.StreamJob: To kill this job, run:
15/04/13 14:30:11 INFO streaming.StreamJob: /usr/local/hadoop-1.0.4/libexec/../bin/hadoop job -Dmapred.job.tracker=1466:54311 -kill job_201504031257_0056
15/04/13 14:30:11 INFO streaming.StreamJob: Tracking URL: http://1466:50030/jobdetails.jsp?jobid=job_201504031257_0056
15/04/13 14:30:11 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201504031257_0056_m_000000
15/04/13 14:30:11 INFO streaming.StreamJob: killJob... Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
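Before posting this, I was also planning to re-run the same job on rmr2's local backend, since that usually shows the underlying R error directly instead of a generic streaming exit code. A sketch of what I intend to try (not yet run, and the 1:1000 subset is just an arbitrary small sample):

```r
library(rmr2)

# Run the same map/reduce in-process (no Hadoop), so any R-level error
# inside s.map/s.reduce is printed directly rather than being hidden
# behind a streaming failure.
rmr.options(backend = "local")
out <- mapreduce(input = to.dfs(reviews[1:1000, ]),
                 map = s.map, reduce = s.reduce)
from.dfs(out)
rmr.options(backend = "hadoop")  # switch back afterwards
```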
What is the reason for this error? Any help/suggestions would be appreciated.
Thanks in advance.
Ankit