Hi,
I want to explore Common Crawl (CC) data using Hive. I tried two different approaches, but both failed. I should mention that I am running everything on my local machine.
hive> add jar /home/git/ARCInputFormat/ArcInputFormat.jar;
hive> CREATE TABLE foo (a string) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" STORED AS INPUTFORMAT 'org.commoncrawl.hadoop.io.ARCInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
OK
Time taken: 3.285 seconds
hive> load data local inpath '/home/git/common_crawl_types/1262851187581_1.arc.gz' into table foo;
Copying data from file:/home/git/common_crawl_types/1262851187581_1.arc.gz
Copying file: file:/home/git/common_crawl_types/1262851187581_1.arc.gz
Loading data to table default.foo
OK
Time taken: 1.631 seconds
hive> select * from foo limit 1;
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: class org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: expects either BytesWritable or Text object!
Time taken: 0.208 seconds
In this case the deserializer fails.
Later I found that there is also data in sequence file format, so I tried:
create table seq (line STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS SEQUENCEFILE;
LOAD DATA LOCAL INPATH '/home/hduser/textData-00000' INTO TABLE seq;
Copying data from file:/home/hduser/textData-00000
Copying file: file:/home/hduser/textData-00000
Loading data to table default.seq
Failed with exception java.lang.RuntimeException: native snappy library not available
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
I found that Hadoop Common (I have the latest stable version) includes Snappy support by default. Nevertheless, I installed the snappy library and hadoop-snappy, but the error persists. Any idea what is wrong?
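For anyone trying to reproduce this, a minimal sketch of one way to check whether the native libraries are present where Hadoop would look for them. The function name check_native_libs and the directory passed to it are hypothetical; the actual native library directory depends on the Hadoop version and platform, and the check assumes Linux-style .so file names. As far as I understand, Snappy compression needs both libsnappy and Hadoop's own native library (libhadoop) to be loadable, so the sketch looks for both.

```shell
#!/bin/sh
# Hypothetical helper: report whether libsnappy and libhadoop shared
# objects are present in the given directory.
# Usage: check_native_libs /path/to/hadoop/native/lib/dir
check_native_libs() {
    dir="$1"
    status=0
    for lib in libsnappy libhadoop; do
        # ls exits non-zero when the glob matches no file
        if ls "$dir"/"$lib".so* >/dev/null 2>&1; then
            echo "$lib: found in $dir"
        else
            echo "$lib: MISSING in $dir"
            status=1
        fi
    done
    # Return 0 only if both libraries were found
    return "$status"
}
```

Finding the files only shows that they exist; the JVM still has to be able to load them (for example via java.library.path), which this sketch does not verify.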
Thank you in advance,
Dzidas