Hive

Dzidorius Martinaitis

May 25, 2012, 4:08:49 AM
to common...@googlegroups.com
Hi,

I want to explore Common Crawl (CC) data using Hive. I tried two different approaches, and both failed. Note that I'm running everything on my local machine.

First, I tried data in ARC format. As the ARC parser I used a simplified version of the CC one: https://github.com/noiano/ARCInputFormat


hive> add jar /home/git/ARCInputFormat/ArcInputFormat.jar;
hive> CREATE TABLE foo (a string) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" STORED AS INPUTFORMAT 'org.commoncrawl.hadoop.io.ARCInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
OK
Time taken: 3.285 seconds
hive> load data local inpath '/home/git/common_crawl_types/1262851187581_1.arc.gz' into table foo;
Copying data from file:/home/git/common_crawl_types/1262851187581_1.arc.gz
Copying file: file:/home/git/common_crawl_types/1262851187581_1.arc.gz
Loading data to table default.foo
OK
Time taken: 1.631 seconds
hive> select * from foo limit 1;
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: class org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: expects either BytesWritable or Text object!
Time taken: 0.208 seconds

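In this case the deserializer fails: LazySimpleSerDe, which Hive uses for ROW FORMAT DELIMITED tables, receives a value that is neither BytesWritable nor Text.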

Later I found that the data is also available as sequence files, so I tried:

create table seq (line STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS SEQUENCEFILE;
LOAD DATA LOCAL INPATH '/home/hduser/textData-00000' INTO TABLE seq;
Copying data from file:/home/hduser/textData-00000
Copying file: file:/home/hduser/textData-00000
Loading data to table default.seq
Failed with exception java.lang.RuntimeException: native snappy library not available
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask

I found that Hadoop Common (I have the latest stable version) includes Snappy support by default. Nevertheless, I installed the snappy library and hadoop-snappy, but the error persists. Any idea what is wrong?

Thank you in advance,
Dzidas
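
One way to narrow down the "native snappy library not available" error above is to check whether a JVM can load the native libraries at all. The following is only a minimal sketch, not part of Hadoop or Hive: the class name NativeLibCheck is hypothetical, and it assumes that the Snappy codec depends on both the Hadoop native library ("hadoop") and "snappy" being resolvable through java.library.path. The result is only meaningful if the check runs with the same java.library.path setting that the Hive/Hadoop process uses.

```java
import java.io.File;

// Hypothetical diagnostic helper; not part of Hadoop or Hive.
class NativeLibCheck {

    /** Returns true if this JVM can resolve and load the named native library. */
    static boolean canLoad(String name) {
        try {
            System.loadLibrary(name);
            return true;
        } catch (UnsatisfiedLinkError | SecurityException e) {
            // Library not found on java.library.path, or loading not permitted.
            return false;
        }
    }

    public static void main(String[] args) {
        // Directories the JVM searches when resolving native libraries.
        String path = System.getProperty("java.library.path", "");
        for (String dir : path.split(File.pathSeparator)) {
            System.out.println("search dir: " + dir);
        }
        // Assumption: these are the library names the Snappy codec needs.
        for (String lib : new String[] {"hadoop", "snappy"}) {
            System.out.println(System.mapLibraryName(lib) + " loadable: " + canLoad(lib));
        }
    }
}
```

If either library is reported as not loadable, the directory that contains it is probably missing from java.library.path for that JVM.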

Jakob Homan

May 25, 2012, 8:42:59 PM
to common...@googlegroups.com
A Hive serde for this format doesn't exist but would be pretty easy to
write. I've been meaning to but haven't had a chance.
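
To give a rough idea of the row-splitting part such a serde would need, here is a minimal sketch in plain Java. It is not a Hive SerDe: a real one would implement Hive's Deserializer interface and would have to accept whatever Writable type the ARC input format emits. The class name ArcHeaderParser and the column names are hypothetical, and the sketch assumes the ARC v1 record header layout (URL, IP address, archive date, content type, length, separated by single spaces) with no embedded spaces in any field.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper showing only the header-to-columns step of a serde.
class ArcHeaderParser {

    // Assumed column names, in ARC v1 header order.
    static final List<String> COLUMNS =
            Arrays.asList("url", "ip", "archive_date", "content_type", "length");

    /** Splits one ARC v1 record header line into its five fields. */
    static String[] parse(String headerLine) {
        String[] fields = headerLine.trim().split(" ");
        if (fields.length != COLUMNS.size()) {
            throw new IllegalArgumentException(
                    "expected " + COLUMNS.size() + " fields, got " + fields.length);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Example header line in the assumed layout.
        String header =
                "http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202";
        String[] fields = parse(header);
        for (int i = 0; i < fields.length; i++) {
            System.out.println(COLUMNS.get(i) + " = " + fields[i]);
        }
    }
}
```

Real crawl data may contain malformed headers, so an actual serde would probably return NULL columns for such rows rather than throw.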