MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/bin/../lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/lib/mahout/mahout-examples-0.9-cdh5.1.2-job.jar
14/09/22 14:23:08 WARN driver.MahoutDriver: No org.apache.mahout.clustering.conversion.InputDriver.props found on classpath, will use command-line arguments only
14/09/22 14:23:09 INFO client.RMProxy: Connecting to ResourceManager at sandy-quad-1.sslab.lan/192.168.35.75:8032
14/09/22 14:23:10 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
14/09/22 14:23:10 INFO input.FileInputFormat: Total input paths to process : 1
14/09/22 14:23:11 INFO mapreduce.JobSubmitter: number of splits:1
14/09/22 14:23:11 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1410945757266_2515
14/09/22 14:23:11 INFO impl.YarnClientImpl: Submitted application application_1410945757266_2515
14/09/22 14:23:11 INFO mapreduce.Job: The url to track the job: http://sandy-quad-1.sslab.lan:8088/proxy/application_1410945757266_2515/
14/09/22 14:23:11 INFO mapreduce.Job: Running job: job_1410945757266_2515
14/09/22 14:23:23 INFO mapreduce.Job: Job job_1410945757266_2515 running in uber mode : false
14/09/22 14:23:23 INFO mapreduce.Job: map 0% reduce 0%
14/09/22 14:23:30 INFO mapreduce.Job: Task Id : attempt_1410945757266_2515_m_000000_0, Status : FAILED
Error: java.lang.NumberFormatException: For input string: "\N"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1241)
at java.lang.Double.valueOf(Double.java:504)
at org.apache.mahout.clustering.conversion.InputMapper.map(InputMapper.java:48)
at org.apache.mahout.clustering.conversion.InputMapper.map(InputMapper.java:34)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
The job fails with a NumberFormatException at the end. Any suggestions for how to solve this are always welcome!
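For reference, the failure is easy to reproduce outside Hadoop. The stack trace shows that InputMapper.map (line 48) calls Double.valueOf on each input token, and Hive's NULL marker "\N" is not a valid double. A minimal sketch:

```java
// Minimal reproduction of the failure: Mahout's InputMapper calls
// Double.valueOf on each token of the input line. Hive writes NULL
// values as the literal two-character string "\N", which is not a
// valid double, so parsing throws NumberFormatException.
public class NullTokenRepro {
    public static boolean isParseableDouble(String token) {
        try {
            Double.valueOf(token);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isParseableDouble("3.14")); // true
        System.out.println(isParseableDouble("\\N"));  // false: Hive's NULL marker
    }
}
```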
14/09/22 15:52:31 WARN driver.MahoutDriver: No org.apache.mahout.clustering.conversion.InputDriver.props found on classpath, will use command-line arguments only
14/09/22 15:52:32 INFO client.RMProxy: Connecting to ResourceManager at sandy-quad-1.sslab.lan/192.168.35.75:8032
14/09/22 15:52:33 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
14/09/22 15:52:33 INFO input.FileInputFormat: Total input paths to process : 1
14/09/22 15:52:33 INFO mapreduce.JobSubmitter: number of splits:1
14/09/22 15:52:34 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1410945757266_2530
14/09/22 15:52:34 INFO impl.YarnClientImpl: Submitted application application_1410945757266_2530
14/09/22 15:52:34 INFO mapreduce.Job: The url to track the job: http://sandy-quad-1.sslab.lan:8088/proxy/application_1410945757266_2530/
14/09/22 15:52:34 INFO mapreduce.Job: Running job: job_1410945757266_2530
14/09/22 15:52:46 INFO mapreduce.Job: Job job_1410945757266_2530 running in uber mode : false
14/09/22 15:52:46 INFO mapreduce.Job: map 0% reduce 0%
14/09/22 15:52:53 INFO mapreduce.Job: Task Id : attempt_1410945757266_2530_m_000000_0, Status : FAILED
Error: java.lang.NumberFormatException: For input string: "\N"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1241)
at java.lang.Double.valueOf(Double.java:504)
at org.apache.mahout.clustering.conversion.InputMapper.map(InputMapper.java:48)
at org.apache.mahout.clustering.conversion.InputMapper.map(InputMapper.java:34)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Error: java.lang.NumberFormatException: For input string: "\N"
This has something to do with Hive using the string constant "\N" to encode NULL values. It is, however, not completely clear to me how Hive is used in step 3, because the command that gets executed in step 3 is a Mahout command:
q20 Step 3/6: Generating sparse vectors
Command mahout org.apache.mahout.clustering.conversion.InputDriver -i /user/bart/benchmarks/bigbench/temp/q20_hive_RUN_QUERY_0_temp -o /user/bart/benchmarks/bigbench/temp/q20_hive_RUN_QUERY_0_temp/Vec -v org.apache.mahout.math.RandomAccessSparseVector
tmp output: /user/bart/benchmarks/bigbench/temp/q20_hive_RUN_QUERY_0_temp/Vec
Customer segmentation for return analysis: customers are separated along the following dimensions: return frequency, return order ratio (total number of orders partially or fully returned versus the total number of orders), return item ratio (total number of items returned versus the number of items purchased), and return amount ratio (total monetary amount of items returned versus the amount purchased). Consider the store returns during a given year for the computation.
...
To clarify your question on how Hive and Mahout interact for query q20:
Q20's goal is customer segmentation for return analysis: customers are separated along the following dimensions: return frequency, return order ratio (total number of orders partially or fully returned versus the total number of orders), return item ratio (total number of items returned versus the number of items purchased), and return amount ratio (total monetary amount of items returned versus the amount purchased). Consider the store returns during a given year for the computation. All your data (the benchmark data) is stored as Hive tables.
In step 1 we prepare temporary directories.
In step 2 we execute a Hive query: we use Hive to do the basic aggregation of customers and compute the order/return item/amount ratios. The query result is stored as an external text table.
In step 3 we transform the result from step 2 into sparse vectors, processable by Mahout, using Mahout's clustering.conversion.InputDriver. This input driver reads the result written by the Hive query.
In step 4 we perform a k-means clustering on the sparse vectors.
In step 5 we dump the k-means clustering result and stream it to HDFS into the query's results directory.
In step 6 we clean up temporary Hive table(s) and temporary working directories.
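The hand-off in step 3 can be sketched roughly as follows, under the assumption that each line of the Hive text output is a row of whitespace-separated doubles (the class and method names here are hypothetical, not Mahout source):

```java
// Rough sketch of what the step-3 conversion amounts to: each line of
// the Hive text output is split on whitespace and every token is parsed
// as a double, producing a numeric vector for k-means in step 4.
import java.util.Arrays;

public class LineToVector {
    public static double[] parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        double[] vec = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            // This is where Hive's "\N" NULL marker blows up:
            // it is not a parseable double.
            vec[i] = Double.valueOf(tokens[i]);
        }
        return vec;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("0.5 0.25 0.1"))); // [0.5, 0.25, 0.1]
    }
}
```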
I hope this helps you understand the connection between Hive and Mahout and the logic behind q20.
Best regards,
Michael
Error: java.lang.NumberFormatException: For input string: "\N"
I found websites that mention that this has something to do with Hive using the string constant "\N" to encode NULL values. However, I didn't see Hive being used in step 3, only Mahout is used there, so that's why I was confused.
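One way to work around this, assuming you can pre-process the Hive text output between steps 2 and 3, is to drop rows that contain the "\N" marker so InputDriver only ever sees parseable doubles. (The class name and the drop-the-row policy below are my own choices, not part of BigBench; alternatively, the Hive query in step 2 could emit a default value instead of NULL.)

```java
// Hypothetical pre-processing step (not part of BigBench): filter out
// records that contain Hive's "\N" NULL marker before handing the text
// output to Mahout's InputDriver.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NullLineFilter {
    public static List<String> dropNullRecords(List<String> lines) {
        List<String> clean = new ArrayList<>();
        for (String line : lines) {
            boolean hasNull = false;
            for (String token : line.trim().split("\\s+")) {
                if (token.equals("\\N")) { // the literal two characters \N
                    hasNull = true;
                    break;
                }
            }
            if (!hasNull) {
                clean.add(line);
            }
        }
        return clean;
    }

    public static void main(String[] args) {
        List<String> in = Arrays.asList("1.0 2.0 3.0", "4.0 \\N 6.0");
        System.out.println(dropNullRecords(in)); // [1.0 2.0 3.0]
    }
}
```

Dropping the row loses the customer record; whether that is acceptable (versus substituting 0.0 for the missing ratio) depends on how the segmentation should treat customers with no returns.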