WEATGenerator Null Pointer when extracting from Nutch Crawl


John Hewitt

Jan 26, 2017, 6:22:22 PM
to Common Crawl
Hi,

I'm using Nutch to crawl portions of the web, the results of which are to be plugged into a cleaning pipeline that operates on WARC and WET files. 
Thus, I need to convert Nutch segments to WARC, and then WARC to WET.
I'm hitting multiple problems; specific questions are in bold.

I'm on Nutch 1.12, and the latest git repo for ia-hadoop-tools. 

There are two ways to convert Nutch segments to WARC:

CommonCrawlDump (not recommended). 
Using CommonCrawlDump, I get a large number of truncation messages, regardless of the -warcSize flag. I end up with .warc files that are truncated before the end of the HTML, leading to incorrect results.
The truncation message is as follows. *Does anyone use CommonCrawlDump? Is it known to work?*
http://www.express.co.uk/news/uk/... skipped. Content of size 204763 was truncated to 65536
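
To see how widespread the truncation is, the log can be scanned for messages of this shape. The following is a minimal sketch: the regular expression is modelled only on the single line quoted above, and the function name `find_truncated` is hypothetical. As an aside, 65536 appears to match the default of Nutch's `http.content.limit` property in this release line, which would explain why `-warcSize` has no effect; that is an assumption worth verifying against nutch-default.xml.

```python
import re
from typing import Iterable, List, Tuple

# Pattern modelled on the one log line quoted in this thread; the exact
# wording in other Nutch versions is an assumption.
TRUNCATED = re.compile(
    r"(?P<url>\S+) skipped\. Content of size (?P<size>\d+) "
    r"was truncated to (?P<limit>\d+)"
)


def find_truncated(lines: Iterable[str]) -> List[Tuple[str, int, int]]:
    """Return (url, original_size, truncated_size) for each truncation message."""
    hits = []
    for line in lines:
        match = TRUNCATED.search(line)
        if match:
            hits.append(
                (match.group("url"), int(match.group("size")), int(match.group("limit")))
            )
    return hits
```

Passing an open log file to `find_truncated` yields one tuple per affected URL, which makes it easy to check whether every truncated document was cut at the same limit.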


WARCExtractor (recommended by Nutch)
Using WARCExtractor, one large WARC file is generated, which seems to be well-formed. More on this in a minute.

From here, I try to use the ia-hadoop-tools WEATGenerator to extract WET files from the big WARC.
The following error is thrown:
17/01/26 18:08:10 ERROR jobs.WEATGenerator: Error processing file: file/<>/johnhew/part-00000.seg-00000.attempt-00000.warc
java.lang.NullPointerException
       at org.archive.extract.ExtractingResourceProducer.getNext(ExtractingResourceProducer.java:26)
       at org.archive.hadoop.jobs.WEATGenerator$WEATGeneratorMapper.map(WEATGenerator.java:108)
       at org.archive.hadoop.jobs.WEATGenerator$WEATGeneratorMapper.map(WEATGenerator.java:48)

There seems to be an issue with extracting from the WARC. Attached is 3.warc, a file which recreates this issue. *I have found no way to take a Nutch crawl and extract WET files. Is there a simple alternative?*

Thanks,
John
3.warc

Sebastian Nagel

Jan 27, 2017, 5:45:53 AM
to common...@googlegroups.com
Hi John,

Which repositories and versions of
ia-hadoop-tools
ia-web-commons / webarchive-commons
are you using? What are the commands to build and run the WARC to WET converter?

I wasn't able to reproduce the problem using the versions used at commoncrawl.org:

git clone g...@github.com:commoncrawl/ia-web-commons.git
cd ia-web-commons/
mvn -f pom-cdh5.xml install

git clone g...@github.com:commoncrawl/ia-hadoop-tools.git
cd ia-hadoop-tools/
mvn package

java -jar target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator batch-xxx .../warc/3.warc

The WET file is generated as .../wet/3.warc.wet.gz and the WAT file as .../wat/3.warc.wat.gz
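
One way to sanity-check both the input WARC and the generated WET/WAT files is to walk the records and count them by WARC-Type. The following is a minimal sketch using only the Python standard library, not part of ia-hadoop-tools. It assumes each record carries a Content-Length header and ignores folded header lines; the function names are hypothetical. A WET file would be expected to contain mostly `conversion` records and a WAT file mostly `metadata` records.

```python
import gzip
from collections import Counter


def open_maybe_gzip(path):
    """Open a WARC/WET/WAT file, transparently handling gzip compression."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    return gzip.open(path, "rb") if magic == b"\x1f\x8b" else open(path, "rb")


def count_record_types(path):
    """Count records per WARC-Type; raise ValueError on a malformed record."""
    counts = Counter()
    with open_maybe_gzip(path) as stream:
        while True:
            line = stream.readline()
            if not line:
                break  # end of file
            if not line.strip():
                continue  # blank separator lines between records
            if not line.startswith(b"WARC/"):
                raise ValueError(f"expected a WARC version line, got {line[:40]!r}")
            # Read the record headers up to the empty line.
            headers = {}
            while True:
                line = stream.readline()
                if line in (b"\r\n", b"\n", b""):
                    break
                name, _, value = line.partition(b":")
                headers[name.strip().lower()] = value.strip()
            if b"content-length" not in headers:
                raise ValueError("record without Content-Length")
            length = int(headers[b"content-length"])
            body = stream.read(length)
            if len(body) != length:
                raise ValueError("record body shorter than its Content-Length")
            record_type = headers.get(b"warc-type", b"?").decode("ascii", "replace")
            counts[record_type] += 1
    return counts
```

A `ValueError` on the input WARC would point to a record whose framing does not match its headers, which is one plausible cause of a failure during extraction, though not a confirmed one for the file in this thread.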

I've observed a couple of issues in ia-web-commons / webarchive-commons. They are fixed
in Common Crawl's fork on GitHub, and the patches have been sent upstream to iipc/webarchive-commons.

Best,
Sebastian


