Hi All,
I run all my MR jobs on Amazon EMR. This is my test launch script/setup on EMR:
1) ./elastic-mapreduce --create --alive --num-instances 1 --instance-type m1.small --name 'onconv' --hive-interactive --hive-versions 0.11.0 --ami-version latest --hadoop-version 1.0.3
When the cluster is ready, I ssh into it (a single-node Hadoop cluster).
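(The ssh step, sketched with the same CLI; the jobflow ID below is a placeholder for the one printed by the --create command:)

```shell
# Connect to the master node of the cluster created above.
# j-XXXXXXXXXXXX is a placeholder jobflow ID.
./elastic-mapreduce --jobflow j-XXXXXXXXXXXX --ssh
```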
Data file preparation (using ProtobufMRExample):
% export HADOOP_CLASSPATH=/mnt/var/lib/hive_0110/downloaded_resources/elephant-bird-core-3.0.5.jar:/home/hadoop/lib/guava-13.0.1.jar:/home/hadoop/lib/protobuf-java-2.4.1.jar
% hadoop jar /mnt/var/lib/hive_0110/downloaded_resources/elephant-bird-examples-3.0.4.jar com.twitter.elephantbird.examples.ProtobufMRExample -libjars /mnt/var/lib/hive_0110/downloaded_resources/elephant-bird-core-3.0.5.jar,/home/hadoop/lib/guava-13.0.1.jar,/home/hadoop/lib/protobuf-java-2.4.1.jar -Dproto.test=lzoOut -Dproto.test.format=Block s3://<ROOT_DEV>/tmp/input/test1 s3://<ROOT_DEV>/tmp/output6
I was able to decompress the output by using -Dproto.test=lzoIn, which verified that the compression was fine. test1 has 5 lines; each line has 2 tab-separated columns: name<TAB>age
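(The verification run was essentially the same command with the direction flipped; sketched here with the same jars, and a made-up check output path for illustration:)

```shell
# Read the Block-format .lzo back out as plain text to confirm the round trip.
# Jars and input path mirror the lzoOut run above; the "check1" output
# directory is only an illustrative placeholder.
hadoop jar /mnt/var/lib/hive_0110/downloaded_resources/elephant-bird-examples-3.0.4.jar \
  com.twitter.elephantbird.examples.ProtobufMRExample \
  -libjars /mnt/var/lib/hive_0110/downloaded_resources/elephant-bird-core-3.0.5.jar,/home/hadoop/lib/guava-13.0.1.jar,/home/hadoop/lib/protobuf-java-2.4.1.jar \
  -Dproto.test=lzoIn -Dproto.test.format=Block \
  s3://<ROOT_DEV>/tmp/output6 s3://<ROOT_DEV>/tmp/check1
```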
With part-m-00000.lzo in place under tmp/output6, these were my Hive commands:
add jar s3://<ROOT_DEV>/lib/elephant-bird/elephant-bird-core-3.0.5.jar;
add jar s3://<ROOT_DEV>/lib/elephant-bird/elephant-bird-hive-3.0.5.jar;
add jar s3://<ROOT_DEV>/lib/elephant-bird/guava-13.0.1.jar;
add jar s3://<ROOT_DEV>/lib/elephant-bird/protobuf-java-2.4.1.jar;
add jar s3://<ROOT_DEV>/lib/elephant-bird/elephant-bird-examples-3.0.4.jar;
drop table test1;
create external table test1
partitioned by (dt string)
row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
with serdeproperties (
"serialization.class"="com.twitter.elephantbird.examples.proto.Examples$Age")
stored as
inputformat "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat"
outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
ALTER TABLE test1 ADD IF NOT EXISTS PARTITION (dt='output6')
LOCATION 's3://<ROOT_DEV>/tmp';
There were no errors or warnings, so I took that as a good sign.
hive> describe test1;
OK
name string from deserializer
age int from deserializer
dt string None
# Partition Information
# col_name data_type comment
dt string None
Time taken: 0.62 seconds, Fetched: 8 row(s)
The schema looks good, though I don't quite understand why dt is listed twice.
hive> ALTER TABLE test1 ADD IF NOT EXISTS PARTITION (dt='output6')
> LOCATION 's3://<ROOT_DEV>/tmp';
OK
Time taken: 1.057 seconds
hive> select count(*) from test1;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Kill Command = /home/hadoop/bin/hadoop job -kill job_201309022344_0002
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 1
2013-09-02 23:56:34,464 Stage-1 map = 0%, reduce = 0%
2013-09-02 23:56:41,606 Stage-1 map = 0%, reduce = 100%, Cumulative CPU 1.06 sec
2013-09-02 23:56:42,647 Stage-1 map = 0%, reduce = 100%, Cumulative CPU 1.06 sec
2013-09-02 23:56:43,716 Stage-1 map = 0%, reduce = 100%, Cumulative CPU 1.06 sec
2013-09-02 23:56:44,788 Stage-1 map = 0%, reduce = 100%, Cumulative CPU 1.06 sec
2013-09-02 23:56:45,795 Stage-1 map = 0%, reduce = 100%, Cumulative CPU 1.06 sec
2013-09-02 23:56:46,826 Stage-1 map = 0%, reduce = 100%, Cumulative CPU 1.06 sec
2013-09-02 23:56:47,858 Stage-1 map = 0%, reduce = 100%, Cumulative CPU 1.06 sec
2013-09-02 23:56:48,891 Stage-1 map = 0%, reduce = 100%, Cumulative CPU 1.06 sec
2013-09-02 23:56:49,899 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.06 sec
MapReduce Total cumulative CPU time: 1 seconds 60 msec
Ended Job = job_201309022344_0002
Counters:
MapReduce Jobs Launched:
Job 0: Reduce: 1 Cumulative CPU: 1.06 sec HDFS Read: 0 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 60 msec
OK
0
Time taken: 53.766 seconds, Fetched: 1 row(s)
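(To rule out a missing file, I can list the output directory directly; a quick sketch with the same placeholder root as above:)

```shell
# Confirm the MR output file (part-m-00000.lzo) exists under output6.
hadoop fs -ls 's3://<ROOT_DEV>/tmp/output6/'
```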
I also ran 'select name, age from test1;' but the job returned nothing. Perusing the EMR log files in S3 revealed zero errors. However, I'm seeing this:
Task TASKID="task_201309022344_0007_m_000001" TASK_TYPE="SETUP" TASK_STATUS="SUCCESS" FINISH_TIME="1378167778191" COUNTERS="{(FileSystemCounters)(FileSystemCounters)[(FILE_BYTES_WRITTEN)(FILE_BYTES_WRITTEN)(61239)]}{(org\.apache\.hadoop\.mapred\.Task$Counter)(Map-Reduce Framework)[(PHYSICAL_MEMORY_BYTES)(Physical memory \\(bytes\\) snapshot)(43864064)][(SPILLED_RECORDS)(Spilled Records)(0)][(CPU_MILLISECONDS)(CPU time spent \\(ms\\))(100)][(COMMITTED_HEAP_BYTES)(Total committed heap usage \\(bytes\\))(16252928)][(VIRTUAL_MEMORY_BYTES)(Virtual memory \\(bytes\\) snapshot)(442163200)]}"
which I suspect was trying to tell me something.
Why is the count zero? Any pointers or suggestions would be deeply appreciated.
Thanks in advance,
dave