Hive

Dzidorius Martinaitis

May 25, 2012, 4:08:49 AM

to common...@googlegroups.com

Hi,

I want to explore Common Crawl (CC) data using Hive. I tried two different approaches, and both failed. Note that I'm running everything on my local machine.

First, I tried data in ARC format. As the ARC parser I used a simplified version of the CC one: https://github.com/noiano/ARCInputFormat

hive> add jar /home/git/ARCInputFormat/ArcInputFormat.jar;

hive> CREATE TABLE foo (a string) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" STORED AS INPUTFORMAT 'org.commoncrawl.hadoop.io.ARCInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

Time taken: 3.285 seconds

hive> load data local inpath '/home/git/common_crawl_types/1262851187581_1.arc.gz' into table foo;

Copying data from file:/home/git/common_crawl_types/1262851187581_1.arc.gz

Copying file: file:/home/git/common_crawl_types/1262851187581_1.arc.gz

Loading data to table default.foo

Time taken: 1.631 seconds

hive> select * from foo limit 1;

Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: class org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: expects either BytesWritable or Text object!

Time taken: 0.208 seconds

In this case the deserializer fails: LazySimpleSerDe accepts only BytesWritable or Text objects.

Later I found that there is also data in sequence file format, so I tried:

create table seq (line STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS SEQUENCEFILE;

LOAD DATA LOCAL INPATH '/home/hduser/textData-00000' INTO TABLE seq;

Copying data from file:/home/hduser/textData-00000

Copying file: file:/home/hduser/textData-00000

Loading data to table default.seq

Failed with exception java.lang.RuntimeException: native snappy library not available

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask

I found that Hadoop Common (I have the latest stable version) includes Snappy by default. Nevertheless, I installed the Snappy library and hadoop-snappy, but the error persists. Any idea what is wrong?
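One way to confirm which codec the file actually requires is to read its header directly. The following is a minimal Python sketch, not part of Hadoop or Hive. The function name `read_sequencefile_header` is hypothetical, and the sketch assumes the standard SequenceFile header layout for version 5 or later: the `SEQ` magic, a version byte, the key and value class names as length-prefixed strings, two boolean flags, and the codec class name if the file is compressed.

```python
import struct


def _read_exact(stream, n):
    """Read exactly n bytes or fail loudly."""
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("unexpected end of file in SequenceFile header")
    return data


def _read_vint(stream):
    """Decode a Hadoop variable-length integer (as written by WritableUtils)."""
    first = struct.unpack("b", _read_exact(stream, 1))[0]
    if first >= -112:
        return first  # small values are stored in a single byte
    negative = first < -120
    extra = -(first + 120) if negative else -(first + 112)
    value = int.from_bytes(_read_exact(stream, extra), "big")
    return ~value if negative else value


def _read_text(stream):
    """Decode a length-prefixed UTF-8 string."""
    length = _read_vint(stream)
    if length < 0:
        raise ValueError("negative string length in SequenceFile header")
    return _read_exact(stream, length).decode("utf-8")


def read_sequencefile_header(stream):
    """Return key/value classes and compression info from a binary stream."""
    if _read_exact(stream, 3) != b"SEQ":
        raise ValueError("not a SequenceFile (missing SEQ magic)")
    version = _read_exact(stream, 1)[0]
    if version < 5:
        # Assumption of this sketch: older header layouts are not handled.
        raise ValueError("unsupported SequenceFile version: %d" % version)
    key_class = _read_text(stream)
    value_class = _read_text(stream)
    compressed = _read_exact(stream, 1) != b"\x00"
    block_compressed = _read_exact(stream, 1) != b"\x00"
    codec = _read_text(stream) if compressed else None
    return {
        "version": version,
        "key_class": key_class,
        "value_class": value_class,
        "compressed": compressed,
        "block_compressed": block_compressed,
        "codec": codec,
    }
```

Calling `read_sequencefile_header(open('/home/hduser/textData-00000', 'rb'))` returns a dict whose `codec` entry names the compression class. If that entry is `org.apache.hadoop.io.compress.SnappyCodec`, the file itself needs Snappy, so the problem lies in how Hive or Hadoop locates the native library rather than in the table definition.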

Thank you in advance,

Dzidas

Jakob Homan

May 25, 2012, 8:42:59 PM

to common...@googlegroups.com

A Hive serde for this format doesn't exist but would be pretty easy to
write. I've been meaning to but haven't had a chance.
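Until such a SerDe exists, one possible workaround is to flatten the ARC records into tab-separated text that a plain TEXTFILE table can read. The Python sketch below is an illustration under stated assumptions, not the SerDe described above. The names `iter_arc_records` and `arc_to_tsv` are hypothetical, and the sketch assumes ARC version 1 record headers: one line with exactly five space-separated fields (URL, IP address, archive date, content type, length), followed by that many bytes of content.

```python
import gzip


def iter_arc_records(stream):
    """Yield one dict per ARC record from a binary stream of decompressed ARC data."""
    while True:
        header = stream.readline()
        if not header:
            return  # end of file
        header = header.strip()
        if not header:
            continue  # blank separator line between records
        fields = header.split(b" ")
        if len(fields) != 5:
            # Assumption of this sketch: version 1 headers with no embedded spaces.
            raise ValueError("unexpected ARC header: %r" % header)
        url, ip, date, content_type, length = fields
        body = stream.read(int(length))
        yield {
            "url": url.decode("utf-8", errors="replace"),
            "ip": ip.decode("ascii", errors="replace"),
            "date": date.decode("ascii", errors="replace"),
            "content_type": content_type.decode("ascii", errors="replace"),
            "length": int(length),
            "body": body,
        }


def arc_to_tsv(arc_gz_path, tsv_path):
    """Write one tab-separated metadata line per record of a gzipped ARC file."""
    with gzip.open(arc_gz_path, "rb") as src, \
            open(tsv_path, "w", encoding="utf-8") as dst:
        for rec in iter_arc_records(src):
            row = [rec["url"], rec["date"], rec["content_type"], str(rec["length"])]
            dst.write("\t".join(row) + "\n")
```

The resulting file could then be loaded into a table declared with matching string columns and `FIELDS TERMINATED BY '\t'`. The sketch keeps only metadata columns; writing page bodies to the same file would require escaping tabs and newlines first.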
