Re: reading behemoth documents from HDFS


DigitalPebble

Jan 21, 2013, 4:47:03 AM
to digita...@googlegroups.com
Hi Kiran

The behemoth script makes your life easier when calling the commands (./behemoth reader). Calling the CorpusReader without parameters should give you the following usage message.

usage: CorpusReader
 -a,--displayAnnotations   display annotations in output
 -c,--displayContent       display binary content in output
 -h,--help                 print this message
 -i,--input <arg>          input Behemoth corpus
 -m,--displayMetadata      display metadata in output
 -t,--displayText          display text in output

Alternatively see https://github.com/DigitalPebble/behemoth/wiki/Core-module for a description of the core commands. 

Assuming that the previous step worked successfully, you should at least see all the URLs and mimetypes for the documents extracted from the archives. Apart from the binary content (the first 200 chars of which can be displayed with the -c param), the Behemoth docs will have no text, annotations or metadata at this stage; you will have to use the Tika module afterwards to extract them. Maybe check the counters of the WARCConverterJob to make sure that you do get some output?
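If you'd rather inspect the corpus programmatically than via the reader, something along these lines should work. This is only a sketch: it assumes the corpus is the usual SequenceFile of Text keys and BehemothDocument values and that BehemothDocument exposes getUrl()/getContentType() accessors; adapt the path to your cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import com.digitalpebble.behemoth.BehemothDocument;

// Sketch: iterate over one part file of a Behemoth corpus and print the URL
// and mime type of each document (getUrl()/getContentType() are assumptions).
public class CorpusPeek {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path part = new Path("hdfs://LucidN1:50001/output_gz/part-00000");
        FileSystem fs = FileSystem.get(part.toUri(), conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
        Text key = new Text();
        BehemothDocument doc = new BehemothDocument();
        while (reader.next(key, doc)) {
            // before the Tika step only the URL, mime type and binary content are set
            System.out.println(doc.getUrl() + "\t" + doc.getContentType());
        }
        reader.close();
    }
}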

HTH

Julien




On 21 January 2013 07:10, kiran <chittur...@gmail.com> wrote:
Hi,

I am working on using Behemoth for extracting WARC files. I am following the tutorial and I have run the map-reduce job, but I have a problem reading the Behemoth document.

I am pasting here the commands that I have tried (I have actually tried extracting both the warc and warc.gz formats).

/opt/hadoop-1.0.4/bin/hadoop jar /opt/behemoth/io/target/behemoth-io-1.1-SNAPSHOT-job.jar  com.digitalpebble.behemoth.io.warc.WARCConverterJob hdfs://LucidN1:50001/input/virginiaEarthquake/ARCHIVEIT-2821-WEEKLY-WYYNUE-20110906022146-00007-crawling205.us.archive.org-6680.warc.gz hdfs://LucidN1:50001/output_gz

/opt/hadoop-1.0.4/bin/hadoop jar /opt/behemoth/io/target/behemoth-io-1.1-SNAPSHOT-job.jar  com.digitalpebble.behemoth.io.warc.WARCConverterJob hdfs://LucidN1:50001/input/virginiaEarthquake_warc/ARCHIVEIT-2821-WEEKLY-WYYNUE-20110906022146-00007-crawling205.us.archive.org-6680.warc hdfs://LucidN1:50001/output/

The job completes with these commands, and I am using the command below to list the files and read the Behemoth document (I am not sure if it's the right way).

/opt/hadoop-1.0.4/bin/hadoop jar /opt/behemoth/core/target/behemoth-core-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.util.CorpusReader -i hdfs://LucidN1:50001/output_gz/part-*

Can you please point me to the right way to read the Behemoth document?

Thanks,
Kiran.





--
 
Open Source Solutions for Text Engineering
 
http://digitalpebble.blogspot.com
http://www.digitalpebble.com

kiran

Jan 21, 2013, 11:48:58 AM
to digita...@googlegroups.com, jul...@digitalpebble.com
Hi Julien,

Thanks for your reply. 

I have tried to read with the behemoth reader using these commands:

./behemoth reader -i hdfs://LucidN1:50001/output_gz1/part-* -c

/opt/hadoop-1.0.4/bin/hadoop jar /opt/behemoth/core/target/behemoth-core-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.util.CorpusReader -i hdfs://LucidN1:50001/output_gz1/part-* -c

but both commands give me this exception:
Exception in thread "main" java.lang.NullPointerException
at com.digitalpebble.behemoth.util.CorpusReader.run(CorpusReader.java:102)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at com.digitalpebble.behemoth.util.CorpusReader.main(CorpusReader.java:46)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Do you think I am doing this the correct way?

The map-reduce job completed successfully. I am also copying here the command I ran to convert the WARC file to a Behemoth corpus:

hadoop@LucidN1:/opt/behemoth$ /opt/hadoop-1.0.4/bin/hadoop jar /opt/behemoth/io/target/behemoth-io-1.1-SNAPSHOT-job.jar  com.digitalpebble.behemoth.io.warc.WARCConverterJob hdfs://LucidN1:50001/input/virginiaEarthquake/ARCHIVEIT-2821-WEEKLY-WYYNUE-20110906022146-00007-crawling205.us.archive.org-6680.warc.gz hdfs://LucidN1:50001/output_gz1
Warning: $HADOOP_HOME is deprecated.
13/01/21 16:33:20 INFO mapred.FileInputFormat: Total input paths to process : 1
13/01/21 16:33:20 INFO mapred.JobClient: Running job: job_201301182219_0280
13/01/21 16:33:21 INFO mapred.JobClient:  map 0% reduce 0%
13/01/21 16:33:44 INFO mapred.JobClient:  map 100% reduce 0%
13/01/21 16:33:52 INFO mapred.JobClient: Job complete: job_201301182219_0280
13/01/21 16:33:52 INFO mapred.JobClient: Counters: 20
13/01/21 16:33:52 INFO mapred.JobClient:   Job Counters 
13/01/21 16:33:52 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=21715
13/01/21 16:33:52 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/01/21 16:33:52 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/01/21 16:33:52 INFO mapred.JobClient:     Launched map tasks=1
13/01/21 16:33:52 INFO mapred.JobClient:     Data-local map tasks=1
13/01/21 16:33:52 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
13/01/21 16:33:52 INFO mapred.JobClient:   File Input Format Counters 
13/01/21 16:33:52 INFO mapred.JobClient:     Bytes Read=0
13/01/21 16:33:52 INFO mapred.JobClient:   File Output Format Counters 
13/01/21 16:33:52 INFO mapred.JobClient:     Bytes Written=139
13/01/21 16:33:52 INFO mapred.JobClient:   FileSystemCounters
13/01/21 16:33:52 INFO mapred.JobClient:     HDFS_BYTES_READ=13057974
13/01/21 16:33:52 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=23267
13/01/21 16:33:52 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=139
13/01/21 16:33:52 INFO mapred.JobClient:   Map-Reduce Framework
13/01/21 16:33:52 INFO mapred.JobClient:     Map input records=0
13/01/21 16:33:52 INFO mapred.JobClient:     Physical memory (bytes) snapshot=79859712
13/01/21 16:33:52 INFO mapred.JobClient:     Spilled Records=0
13/01/21 16:33:52 INFO mapred.JobClient:     CPU time spent (ms)=4410
13/01/21 16:33:52 INFO mapred.JobClient:     Total committed heap usage (bytes)=77070336
13/01/21 16:33:52 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2900570112
13/01/21 16:33:52 INFO mapred.JobClient:     Map input bytes=0
13/01/21 16:33:52 INFO mapred.JobClient:     Map output records=0
13/01/21 16:33:52 INFO mapred.JobClient:     SPLIT_RAW_BYTES=188
13/01/21 16:33:52 INFO warc.WARCConverterJob: Conversion: done

Thank you,
Kiran. 

kiran

Jan 21, 2013, 12:09:33 PM
to digita...@googlegroups.com, jul...@digitalpebble.com, kir...@vt.edu
Sorry, it looks like I was not giving the full command (changing from part-* to part-00000 fixed it), and that's why I got the NullPointerException.

The commands below give some info, but they do not display all the URLs and mime-types.
 /opt/hadoop-1.0.4/bin/hadoop jar /opt/behemoth/core/target/behemoth-core-1.1-SNAPSHOT-job.jar com.digitalpebble.behemoth.util.CorpusReader -i hdfs://LucidN1:50001/output_gz2/part-00000 -c
Output: 

13/01/21 17:02:19 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/01/21 17:02:19 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/01/21 17:02:19 INFO compress.CodecPool: Got brand-new decompressor
13/01/21 17:02:19 INFO compress.CodecPool: Got brand-new decompressor
13/01/21 17:02:19 INFO compress.CodecPool: Got brand-new decompressor
13/01/21 17:02:19 INFO compress.CodecPool: Got brand-new decompressor

I have tried with two WARC files and the same thing happens. The behemoth reader command I used was:
./behemoth reader -i hdfs://LucidN1:50001/output_gz2/part-00000 -c

Assuming the above command is the right way, it also gives me the same information. 

Am I missing something here? Please let me know your suggestions.

Regards,
Kiran.



DigitalPebble

Jan 21, 2013, 2:25:38 PM
to digita...@googlegroups.com
Hi Kiran

You shouldn't need to specify part-00000 in the input; the directory should be sufficient, e.g. hdfs://LucidN1:50001/output_gz2/

I am not sure the WARC import succeeded - if you look at the logs it says

13/01/21 16:33:52 INFO mapred.JobClient:     Map output records=0

which indicates that nothing has been produced, which would explain why the reader displays nothing. Look at the size of the files in output_gz2; I am pretty sure they are not bigger than a few bytes.
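hadoop fs -ls on the output directory will show you the sizes, or, if you prefer, something like this from Java (just a sketch):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: print the name and size of every file the conversion job wrote,
// to confirm whether any records actually made it into the output.
public class OutputSizeCheck {
    public static void main(String[] args) throws Exception {
        Path out = new Path("hdfs://LucidN1:50001/output_gz2");
        FileSystem fs = FileSystem.get(out.toUri(), new Configuration());
        for (FileStatus status : fs.listStatus(out)) {
            System.out.println(status.getPath().getName() + "\t" + status.getLen() + " bytes");
        }
    }
}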

Could you send me the file you are using as input to behe...@digitalpebble.com? I can try and have a look at it. The arc importer has not been used very much.

Alternatively see https://github.com/DigitalPebble/behemoth-commoncrawl to convert arc files into Behemoth documents (this code is being fixed as we speak)

Thanks

Julien 



kiran chitturi

Jan 21, 2013, 2:43:54 PM
to digita...@googlegroups.com
Hi Julien,

I have sent the warc file to the email address you have given. 

I would love to work with the Common Crawl files, but those are all ARC files and most of the files that we have are WARC files.

We have lots of files in WARC format and it would be great if Behemoth could help in processing those documents. I have sent more details in the email to behe...@digitalpebble.com.

Thanks,
Kiran.


Kiran Chitturi

DigitalPebble

Jan 21, 2013, 4:03:02 PM
to digita...@googlegroups.com
Hi Kiran

You are right that ARC != WARC. I tried the samples you sent as well as the examples from http://archive.org/details/ExampleArcAndWarcFiles, and in both cases the conversion to BehemothDocs fails. This part of the code in the IO module was borrowed from Lemur, I think.

We could try and debug the code, but it might also be easier to rely on a third-party library like Cloud9, which seems to be under the Apache license.

Julien 

kiran chitturi

Jan 21, 2013, 4:52:21 PM
to digita...@googlegroups.com
Yes, Julien. I was going through their website (Lemur) [0], and their code is written specifically for the WARC files in the data set that they created. Though organizations follow the WARC standard, I think everyone writes files in their own format.

I am looking at their website and the two Java files (WarcRecord and WarcHTMLResponseRecord), trying to feed in the WARC file I have, debug where the two formats differ, and see how the code can be changed.

The Cloud9 library looks quite good, but I am not sure if they have support for WARC files. It looks like they are also using the Lemur project code (http://lintool.github.com/Cloud9/docs/content/clue.html).

Thanks,
Kiran.

DigitalPebble

Jan 22, 2013, 3:55:50 PM
to digita...@googlegroups.com
Hi Kiran

I just found a similar discrepancy between the ARCs generated by Common Crawl and the ones found on http://archive.org/details/ExampleArcAndWarcFiles. The brand new code from the CC module can't read these files :-( So much for compatibility...

Let us know if you find a way of getting the Lemur code to work on standard WARC files; it would indeed be good to get that working.

Thanks

Julien

kiran chitturi

Jan 23, 2013, 10:21:27 AM
to digita...@googlegroups.com
Hi Julien,


From what I have read, Common Crawl has data in the ARC format. The code for that might work for the Internet Archive ARC format too, but there might be slight differences in the formats and headers. I still have not figured out what the differences between these formats are, but I am looking into that as well.

Second, I take back my previous comment on Cloud9. They are not using the same code from the Lemur project.

They have written their own code based on the Lemur project code and, as you said, it is licensed under Apache. They have added additional functionality to the code, and it is present here (https://github.com/lintool/Cloud9/tree/master/src/dist/edu/umd/cloud9/collection/clue).

Right now, I am checking whether what their code does for ClueWeb WARC files can be applied to other WARC files. Hopefully a few patches to the code will make it work for general WARC files. If that works, we can create a new library that can process Internet Archive WARC files.

Kiran.

kiran chitturi

Jan 26, 2013, 5:13:11 PM
to digita...@googlegroups.com
Hi Julien,

It's not a problem with the Lemur code. I think it might be a problem somewhere within Behemoth, or with how the job jars are getting built.

I have taken the 5 Java files from (https://github.com/DigitalPebble/behemoth/tree/master/io/src/main/java/com/digitalpebble/behemoth/io/warc) and tested them separately, adding behemoth-core and Nutch as dependencies. The job went through successfully, writing the Behemoth document to HDFS, and I was able to read the Behemoth document using the behemoth reader (./behemoth reader $FILENAME).

I am not exactly sure what the problem is with the behemoth-io-*-job.jar, but those 5 Java files worked separately for me. I thought earlier the problem might be with the different WARC formats, but after running the WARCConverterJob file separately, it worked.


The code worked for me for one WARC file, but it gave an error for the other 6 test WARC files, including the example WARC file (http://archive.org/details/ExampleArcAndWarcFiles): java.io.IOException: bad status line '20080430204825': For input string: "20080430204825".

I think this is thrown from the HttpResponse class. I will look into this more.


Regards,
Kiran.


kiran chitturi

Jan 26, 2013, 7:12:11 PM
to digita...@googlegroups.com
Also, line 262 at [1] is causing the job to fail. In my case, it has failed on robots.txt files and on text/dns records. Since exceptions are thrown for some records, the job fails and the other records present in the WARC do not get processed.

When I excluded the usage of HttpResponse, everything went fine with the WARC file, including the standard WARC files. I passed the content and binaryContent directly from the WARC record.

I am not sure if it's a good idea to exclude the usage of HttpResponse. Can we use [2] in case we get exceptions from bad records?
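To make the idea concrete, here is a rough sketch of what I mean: a single bad record would be counted and skipped instead of failing the whole job. The input types, counter names and the parseHttpResponse() helper are just placeholders, not the real WARCConverterJob code.

import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch of per-record error handling in an old-API Hadoop mapper.
public class SkipBadRecordsMapper extends MapReduceBase
        implements Mapper<Text, BytesWritable, Text, Text> {

    public void map(Text url, BytesWritable record,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        try {
            // placeholder for the HttpResponse parsing / document building
            Text doc = parseHttpResponse(record);
            output.collect(url, doc);
            reporter.incrCounter("WARCConverter", "CONVERTED", 1);
        } catch (IOException e) {
            // bad status line, robots.txt fetch, text/dns record, ...: skip it
            reporter.incrCounter("WARCConverter", "SKIPPED", 1);
        }
    }

    private Text parseHttpResponse(BytesWritable record) throws IOException {
        // hypothetical helper; the real job would build a BehemothDocument here
        return new Text(new String(record.getBytes(), 0, record.getLength(), "UTF-8"));
    }
}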

Please let me know your suggestions.

--
Kiran Chitturi

kiran chitturi

Jan 26, 2013, 10:58:13 PM
to digita...@googlegroups.com
Sorry for spamming, but I just realized this might simply be a version issue. They might have made changes in the files, and the Behemoth git repository might not have been updated since then.

I used the files from their website [1] and the code from your WARCConverterJob file to test, and it worked pretty well. As I said in my last email, HttpResponse can throw some errors.

Ooh, so much for different versions. I wish I had thought of this before working all the way through the code.

I hope this will work now.

Thanks,
Kiran.


--
Kiran Chitturi

DigitalPebble

Jan 27, 2013, 4:10:02 AM
to digita...@googlegroups.com
Hi Kiran

Thanks for looking into this. I have just pushed a commit to the repo which uses the code from Lemur's site and skips non-HTTP documents. The reason why HttpResponse was crashing was, well, that some of the input could be non-HTTP :-)
Seems to be working fine on the sample you sent me a few days ago.
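Roughly, the shape of the check (not the exact code in the commit) is to look at the start of the record payload and only treat it as an HTTP response if it begins with an HTTP status line:

import java.nio.charset.Charset;

// Sketch, not the actual commit: only records whose payload starts with an
// HTTP status line get parsed as HTTP responses; everything else (dns
// entries, metadata records, ...) is skipped upstream.
public class HttpRecordFilter {
    public static boolean looksLikeHttp(byte[] payload) {
        if (payload == null || payload.length < 5) {
            return false;
        }
        String head = new String(payload, 0, 5, Charset.forName("US-ASCII"));
        return head.startsWith("HTTP/");
    }
}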

Let me know how it goes and thanks for investigating the issue

Julien
