query 20


Bart Vandewoestyne

Sep 22, 2014, 8:41:25 AM
to big-...@googlegroups.com
Before I can run a benchmark with larger scale factor, I first want to solve the problem I have with query 20.  When I run this query, I see the following in my logs:

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/bin/../lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/lib/mahout/mahout-examples-0.9-cdh5.1.2-job.jar
14/09/22 14:23:08 WARN driver.MahoutDriver: No org.apache.mahout.clustering.conversion.InputDriver.props found on classpath, will use command-line arguments only
14/09/22 14:23:09 INFO client.RMProxy: Connecting to ResourceManager at sandy-quad-1.sslab.lan/192.168.35.75:8032
14/09/22 14:23:10 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
14/09/22 14:23:10 INFO input.FileInputFormat: Total input paths to process : 1
14/09/22 14:23:11 INFO mapreduce.JobSubmitter: number of splits:1
14/09/22 14:23:11 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1410945757266_2515
14/09/22 14:23:11 INFO impl.YarnClientImpl: Submitted application application_1410945757266_2515
14/09/22 14:23:11 INFO mapreduce.Job: The url to track the job: http://sandy-quad-1.sslab.lan:8088/proxy/application_1410945757266_2515/
14/09/22 14:23:11 INFO mapreduce.Job: Running job: job_1410945757266_2515
14/09/22 14:23:23 INFO mapreduce.Job: Job job_1410945757266_2515 running in uber mode : false
14/09/22 14:23:23 INFO mapreduce.Job:  map 0% reduce 0%
14/09/22 14:23:30 INFO mapreduce.Job: Task Id : attempt_1410945757266_2515_m_000000_0, Status : FAILED
Error: java.lang.NumberFormatException: For input string: "\N"
        at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1241)
        at java.lang.Double.valueOf(Double.java:504)
        at org.apache.mahout.clustering.conversion.InputMapper.map(InputMapper.java:48)
        at org.apache.mahout.clustering.conversion.InputMapper.map(InputMapper.java:34)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)

The full logfile is online at https://dl.dropboxusercontent.com/u/32340538/Big-Bench/q20_hive_RUN_QUERY_0.log

The lines

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.

and

14/09/22 14:23:08 WARN driver.MahoutDriver: No org.apache.mahout.clustering.conversion.InputDriver.props found on classpath, will use command-line arguments only

look suspicious to me.  I wonder if they are somehow related to the NumberFormatException at the end.

Any suggestions for how to solve this are always welcome!

Bart Vandewoestyne

Sep 22, 2014, 10:01:47 AM
to big-...@googlegroups.com
On Monday, September 22, 2014 2:41:25 PM UTC+2, Bart Vandewoestyne wrote:

Any suggestions for how to solve this are always welcome!

Some more information that might help in debugging this problem:

Apparently, query 20 consists of 6 steps:

Step 1/6: Prepare temp dir
Step 2/6: Executing hive queries
Step 3/6: Generating sparse vectors
Step 4/6: Calculating k-means
Step 5/6: Converting result and store in hdfs ${RESULT_DIR}/cluster.txt
Step 6/6: Clean up

In my run, things go wrong in step 3 'Generating sparse vectors'.  At that point, the content of the q20_hive_run_query_0_temp temporary table that was generated in step 2 looks as follows:

https://dl.dropboxusercontent.com/u/32340538/Big-Bench/q20_intermediate_query_result.csv

As you can see in the above exported .csv file, this table contains a lot of zero values.

The WARN and NumberFormatException I get look as follows:

14/09/22 15:52:31 WARN driver.MahoutDriver: No org.apache.mahout.clustering.conversion.InputDriver.props found on classpath, will use command-line arguments only
14/09/22 15:52:32 INFO client.RMProxy: Connecting to ResourceManager at sandy-quad-1.sslab.lan/192.168.35.75:8032
14/09/22 15:52:33 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
14/09/22 15:52:33 INFO input.FileInputFormat: Total input paths to process : 1
14/09/22 15:52:33 INFO mapreduce.JobSubmitter: number of splits:1
14/09/22 15:52:34 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1410945757266_2530
14/09/22 15:52:34 INFO impl.YarnClientImpl: Submitted application application_1410945757266_2530
14/09/22 15:52:34 INFO mapreduce.Job: The url to track the job: http://sandy-quad-1.sslab.lan:8088/proxy/application_1410945757266_2530/
14/09/22 15:52:34 INFO mapreduce.Job: Running job: job_1410945757266_2530
14/09/22 15:52:46 INFO mapreduce.Job: Job job_1410945757266_2530 running in uber mode : false
14/09/22 15:52:46 INFO mapreduce.Job:  map 0% reduce 0%
14/09/22 15:52:53 INFO mapreduce.Job: Task Id : attempt_1410945757266_2530_m_000000_0, Status : FAILED
Error: java.lang.NumberFormatException: For input string: "\N"
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1241)
    at java.lang.Double.valueOf(Double.java:504)
    at org.apache.mahout.clustering.conversion.InputMapper.map(InputMapper.java:48)
    at org.apache.mahout.clustering.conversion.InputMapper.map(InputMapper.java:34)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)

As far as a Google search tells me, the exception Error: java.lang.NumberFormatException: For input string: "\N" has something to do with Hive using the string constant "\N" to encode NULL values.  It is, however, not completely clear to me how Hive comes into play in step 3, because the command that gets executed there is a Mahout command:

q20 Step 3/6: Generating sparse vectors
Command mahout org.apache.mahout.clustering.conversion.InputDriver -i /user/bart/benchmarks/bigbench/temp/q20_hive_RUN_QUERY_0_temp -o /user/bart/benchmarks/bigbench/temp/q20_hive_RUN_QUERY_0_temp/Vec -v org.apache.mahout.math.RandomAccessSparseVector
tmp output: /user/bart/benchmarks/bigbench/temp/q20_hive_RUN_QUERY_0_temp/Vec
As I have never used Mahout before, I am quite stuck here... Any help on getting query 20 up and running is highly appreciated.
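The failing call can be reproduced outside Hadoop, which at least confirms the diagnosis: Hive writes SQL NULLs to its text output as the two-character string \N, and Double.valueOf rejects that string. The parseField helper below is purely hypothetical (it is not part of Mahout); it only sketches one possible workaround, mapping the NULL marker to 0.0 before parsing:

```java
// Minimal reproduction of the step-3 failure: Hive serializes SQL NULL
// to its text output as the literal two-character string "\N", which
// Double.valueOf cannot parse.
public class NullMarkerDemo {

    // Hive's default serialization of NULL in text tables.
    static final String HIVE_NULL = "\\N";

    // Hypothetical workaround (not part of Mahout): map Hive's NULL
    // marker to 0.0 instead of letting Double.valueOf throw.
    static double parseField(String token) {
        if (HIVE_NULL.equals(token)) {
            return 0.0;
        }
        return Double.parseDouble(token);
    }

    public static void main(String[] args) {
        boolean threw = false;
        try {
            Double.valueOf("\\N");   // what the mapper effectively does
        } catch (NumberFormatException e) {
            threw = true;            // ...and why the map task fails
        }
        System.out.println(threw);               // true
        System.out.println(parseField("\\N"));   // 0.0
        System.out.println(parseField("1.5"));   // 1.5
    }
}
```

A cleaner fix is probably to avoid producing NULLs in the step-2 Hive query in the first place, e.g. by coalescing the ratio columns to 0 before the result is written out.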

Michael Frank

Sep 22, 2014, 2:03:25 PM
to big-...@googlegroups.com
Please see:
https://groups.google.com/d/msg/big-bench/QCbCjcblM5Q/DexNg7e9MwIJ

To clarify your question on how Hive and Mahout interact for query q20:

Q20's goal is to do:

Customer segmentation for return analysis: Customers are separated along the following dimensions: return frequency, return order ratio (total number of orders partially or fully returned versus the total number of orders), return item ratio (total number of items returned versus the number of items purchased), and return amount ratio (total monetary amount of items returned versus the amount purchased). Consider the store returns during a given year for the computation.
 
All your data (the benchmark's data) is stored as Hive tables.
In step 1 we prepare temporary directories.
In step 2 we execute a Hive query.
 We use Hive to do the basic aggregation of customers and compute the order/return item/amount ratios.
 The query result is stored as an external text table.
In step 3
 we transform the result from step 2 into sparse vectors, processable by Mahout, using Mahout's clustering.conversion.InputDriver. This input driver reads the result written by the Hive query.
In step 4
 we perform a k-means clustering on the sparse vectors.
In step 5
  we dump the k-means clustering result and stream it to HDFS into the query's results directory.
In step 6
  we clean up temporary Hive table(s) and temporary working directories.
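To make step 3 concrete, here is a simplified stand-in for what the conversion does (plain Java, no Mahout dependency; toVector is only an illustration — the real InputMapper builds a RandomAccessSparseVector rather than a double[], and the exact delimiter handling may differ):

```java
import java.util.Arrays;

// Simplified stand-in for Mahout's clustering.conversion.InputMapper:
// each line of the Hive text output is split into tokens and every
// token is parsed as a double. A double[] stands in for the
// RandomAccessSparseVector that Mahout actually builds.
public class VectorizeLine {

    static double[] toVector(String line) {
        String[] tokens = line.trim().split("\\s+");
        double[] v = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            // This is the parse that fails when Hive has written a
            // NULL as the literal string "\N".
            v[i] = Double.parseDouble(tokens[i]);
        }
        return v;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(toVector("0.25 0.5 1.0")));
    }
}
```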


Hope this helps you understand the connection between Hive and Mahout and the logic behind q20.

best regards,
Michael

Bart Vandewoestyne

Sep 24, 2014, 3:13:46 AM
to big-...@googlegroups.com
On Monday, September 22, 2014 8:03:25 PM UTC+2, Michael Frank wrote:

To clarify your question on how Hive and Mahout interact for query q20:

Q20's goal is to do:

Customer segmentation for return analysis: Customers are separated along the following dimensions: return frequency, return order ratio (total number of orders partially or fully returned versus the total number of orders), return item ratio (total number of items returned versus the number of items purchased), and return amount ratio (total monetary amount of items returned versus the amount purchased). Consider the store returns during a given year for the computation.
 
All your data (the benchmark's data) is stored as Hive tables.
In step 1 we prepare temporary directories.
In step 2 we execute a Hive query.
 We use Hive to do the basic aggregation of customers and compute the order/return item/amount ratios.
 The query result is stored as an external text table.
In step 3
 we transform the result from step 2 into sparse vectors, processable by Mahout, using Mahout's clustering.conversion.InputDriver. This input driver reads the result written by the Hive query.
In step 4
 we perform a k-means clustering on the sparse vectors.
In step 5
  we dump the k-means clustering result and stream it to HDFS into the query's results directory.
In step 6
  we clean up temporary Hive table(s) and temporary working directories.


Hope this helps you understand the connection between Hive and Mahout and the logic behind q20.

best regards,
Michael

Michael,

Thanks for clarifying.  I understood the steps for q20, but I was confused about step 3.  When I did a Google search for the error message I got (Error: java.lang.NumberFormatException: For input string: "\N"), I found websites mentioning that this has something to do with Hive using the string constant "\N" to encode NULL values.  However, I didn't see Hive being used in step 3, only Mahout, so that's why I was confused.

Kind regards,
Bart