|Hive||Dzidorius Martinaitis||5/25/12 1:08 AM|
I want to explore Common Crawl (CC) data using Hive. I tried two different approaches, but both failed. Note that I'm running everything on my local machine.
First, I tried data in ARC format. For the ARC parser I used a simplified version of the CC one: https://github.com/noiano/ARCInputFormat
hive> add jar /home/git/ARCInputFormat/ArcInputFormat.jar;
hive> CREATE TABLE foo (a string) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" STORED AS INPUTFORMAT 'org.commoncrawl.hadoop.io.ARCInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
Time taken: 3.285 seconds
hive> load data local inpath '/home/git/common_crawl_types/1262851187581_1.arc.gz' into table foo;
Copying data from file:/home/git/common_crawl_types/1262851187581_1.arc.gz
Copying file: file:/home/git/common_crawl_types/1262851187581_1.arc.gz
Loading data to table default.foo
Time taken: 1.631 seconds
hive> select * from foo limit 1;
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: class org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: expects either BytesWritable or Text object!
Time taken: 0.208 seconds
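In this case the deserializer fails: judging by the message, LazySimpleSerDe received a value that is neither a BytesWritable nor a Text object.
Later I found that there is also data in sequence file format, so I tried: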
create table seq (line STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS SEQUENCEFILE;
LOAD DATA LOCAL INPATH '/home/hduser/textData-00000' INTO TABLE seq;
Copying data from file:/home/hduser/textData-00000
Copying file: file:/home/hduser/textData-00000
Loading data to table default.seq
Failed with exception java.lang.RuntimeException: native snappy library not available
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
I found that Hadoop Common (I have the latest stable version) includes Snappy support by default. Nevertheless, I installed the snappy library and hadoop-snappy, but the error persists. Any idea what is wrong?
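One way to narrow this down is to check whether a JVM on this machine can load the native snappy library at all. The following is only a minimal sketch, assuming the error comes from a failed native library lookup; the class name NativeLibCheck is hypothetical and this is not Hadoop's own check. Hadoop may also need its own native library to be loadable, which this sketch does not test.

```java
public class NativeLibCheck {

    // Returns true if this JVM can resolve and load the named native library
    // from java.library.path, false if the lookup fails.
    static boolean canLoad(String name) {
        try {
            System.loadLibrary(name);
            return true;
        } catch (UnsatisfiedLinkError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Directories the JVM searches for native libraries.
        System.out.println("java.library.path = "
                + System.getProperty("java.library.path"));
        // Platform-specific file name the JVM looks for (e.g. libsnappy.so on Linux).
        System.out.println("expected file name = "
                + System.mapLibraryName("snappy"));
        System.out.println("snappy loadable = " + canLoad("snappy"));
    }
}
```

If this prints false, the directory containing the snappy shared library is probably not on java.library.path for that JVM. Running it with the same -Djava.library.path value that Hive and Hadoop use would make the result comparable.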
Thank you in advance,
|Re: Hive||Jakob||5/25/12 5:42 PM|
A Hive serde for this format doesn't exist but would be pretty easy to
write. I've been meaning to but haven't had a chance.
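
To give an idea of the row-mapping part of such a serde, here is a rough sketch in plain Java, assuming "this format" means ARC and that each record starts with an ARC v1 header line of five space-separated fields (URL, IP address, archive date, content type, length). The class name, column names, and sample header are made up for illustration, and no Hive classes are used; a real serde would wrap logic like this in Hive's SerDe interfaces.

```java
public class ArcHeaderToRow {

    // Illustrative column names following the ARC v1 URL-record layout.
    static final String[] COLUMNS = {
        "url", "ip_address", "archive_date", "content_type", "archive_length"
    };

    // Splits one ARC record header line into its five fields.
    // Assumes no field contains an embedded space.
    static String[] parseHeader(String headerLine) {
        String[] fields = headerLine.trim().split(" ");
        if (fields.length != COLUMNS.length) {
            throw new IllegalArgumentException(
                "expected " + COLUMNS.length + " fields, got " + fields.length);
        }
        return fields;
    }

    // Joins the fields with tabs, matching a table declared with
    // FIELDS TERMINATED BY '\t'.
    static String toDelimitedRow(String headerLine) {
        return String.join("\t", parseHeader(headerLine));
    }

    public static void main(String[] args) {
        // Made-up sample header, not taken from a real crawl file.
        String sample = "http://www.example.com/ 192.0.2.1 20100114083147 text/html 1234";
        String[] fields = parseHeader(sample);
        for (int i = 0; i < COLUMNS.length; i++) {
            System.out.println(COLUMNS[i] + " = " + fields[i]);
        }
    }
}
```

Emitting a tab-delimited Text value per record would also avoid the "expects either BytesWritable or Text object" error, since that is what LazySimpleSerDe says it accepts.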