generation of WET files

Bjarne Andersen

Apr 13, 2016, 11:35:58 AM
to Common Crawl
Hi all.
Is there a standalone tool that generates WET files from WARC files?
I found a 2014 thread about a branch of ia-hadoop-tools, but it ended without a solution, so I wonder what tool Common Crawl or others use for this purpose.

best
Bjarne Andersen

Sebastian Nagel

Apr 15, 2016, 11:23:12 AM
to Common Crawl
Hi Bjarne,

The ia-hadoop-tools are used to generate the Common Crawl WAT and WET files.
And yes, it should work as described in
   https://groups.google.com/forum/#!msg/common-crawl/EyzxmQrSvTw/_Nqi1vQInsQJ

If it does not, please send detailed error logs.

Best,
Sebastian

John Hewitt

May 17, 2016, 11:37:57 AM
to Common Crawl
Hi Sebastian,

The tools described in the thread you mention do not work. Specifically, the Aloisius fork of ia-hadoop-tools does not compile.
One problem is that it has dependencies that are not in the canonical (internet-archive) fork of ia-web-commons.
It declares org.archive as a dependency and looks for it in the Internet Archive's Maven repository, http://builds.archive.org/maven2/. However, none of the WET Java classes are present there.

Even when using the Aloisius fork of ia-web-commons to satisfy the dependency, Aloisius/ia-hadoop-tools is still broken.
To start us off, here's the result of 'mvn compile' on a newly cloned Aloisius/ia-hadoop-tools:

[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building ia-hadoop-tools 1.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ ia-hadoop-tools ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 1 resource
[INFO]
[INFO] --- maven-compiler-plugin:2.3.2:compile (default-compile) @ ia-hadoop-tools ---
[INFO] Compiling 129 source files to /mnt/castor/seas_home/j/johnhew/summer16/ia-hadoop-tools/target/classes
[INFO] -------------------------------------------------------------
[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] /mnt/castor/seas_home/j/johnhew/summer16/ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WEATGenerator.java:[29,26] error: cannot find symbol
[ERROR]  package org.archive.extract
/mnt/castor/seas_home/j/johnhew/summer16/ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WEATGenerator.java:[104,37] error: cannot find symbol
[INFO] 2 errors
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 10.846 s
[INFO] Finished at: 2016-05-17T11:35:00-04:00
[INFO] Final Memory: 33M/664M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.3.2:compile (default-compile) on project ia-hadoop-tools: Compilation failure: Compilation failure:
[ERROR] /mnt/castor/seas_home/j/johnhew/summer16/ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WEATGenerator.java:[29,26] error: cannot find symbol
[ERROR] package org.archive.extract
[ERROR] /mnt/castor/seas_home/j/johnhew/summer16/ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WEATGenerator.java:[104,37] error: cannot find symbol
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException                                          

Sebastian Nagel

May 17, 2016, 12:24:56 PM
to Common Crawl
Hi John,

I've successfully compiled everything, see below. Could you please send details about the tools you used (Java, Maven, etc.)?

If in doubt, it may help to delete all locally cached jars from archive.org in
   ~/.m2/repository/org/archive/
and try again. Maven will log which jars are fetched.

Sebastian

% git remote -v
origin  https://github.com/commoncrawl/ia-hadoop-tools.git (fetch)
origin  https://github.com/commoncrawl/ia-hadoop-tools.git (push)

% git pull origin
Current branch master is up to date.

% mvn install

[INFO] Scanning for projects...
[INFO]                                                                        
[INFO] ------------------------------------------------------------------------
[INFO] Building ia-hadoop-tools 1.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
Downloading: http://builds.archive.org:8080/maven2/org/archive/ia-web-commons/1.0-SNAPSHOT/maven-metadata.xml
Downloaded: http://builds.archive.org:8080/maven2/org/archive/ia-web-commons/1.0-SNAPSHOT/maven-metadata.xml (361 B at 0.3 KB/sec)
Downloading: http://builds.archive.org:8080/maven2/org/archive/ia-web-commons/1.0-SNAPSHOT/ia-web-commons-1.0-20131207.033010-102.pom
Downloaded: http://builds.archive.org:8080/maven2/org/archive/ia-web-commons/1.0-SNAPSHOT/ia-web-commons-1.0-20131207.033010-102.pom (7 KB at 9.4 KB/sec)
Downloading: http://builds.archive.org:8080/maven2/org/archive/access-control/access-control/0.1.0-SNAPSHOT/maven-metadata.xml
Downloaded: http://builds.archive.org:8080/maven2/org/archive/access-control/access-control/0.1.0-SNAPSHOT/maven-metadata.xml (378 B at 0.4 KB/sec)
Downloading: http://builds.archive.org:8080/maven2/org/archive/access-control/access-control/0.1.0-SNAPSHOT/access-control-0.1.0-20140702.073710-171.pom
Downloaded: http://builds.archive.org:8080/maven2/org/archive/access-control/access-control/0.1.0-SNAPSHOT/access-control-0.1.0-20140702.073710-171.pom (3 KB at 3.6 KB/sec)
Downloading: http://builds.archive.org:8080/maven2/org/archive/access-control/0.1.0-SNAPSHOT/maven-metadata.xml
Downloading: http://builds.archive.org:8080/maven2/org/archive/access-control/0.1.0-SNAPSHOT/maven-metadata.xml
Downloaded: http://builds.archive.org:8080/maven2/org/archive/access-control/0.1.0-SNAPSHOT/maven-metadata.xml (363 B at 0.6 KB/sec)
Downloaded: http://builds.archive.org:8080/maven2/org/archive/access-control/0.1.0-SNAPSHOT/maven-metadata.xml (363 B at 0.3 KB/sec)
Downloading: http://builds.archive.org:8080/maven2/org/archive/access-control/0.1.0-SNAPSHOT/access-control-0.1.0-20140702.073710-174.pom
Downloaded: http://builds.archive.org:8080/maven2/org/archive/access-control/0.1.0-SNAPSHOT/access-control-0.1.0-20140702.073710-174.pom (2 KB at 1.1 KB/sec)
Downloading: http://builds.archive.org:8080/maven2/org/archive/ia-web-commons/1.0-SNAPSHOT/ia-web-commons-1.0-20131207.033010-102.jar
Downloading: http://builds.archive.org:8080/maven2/org/htmlparser/htmlparser/1.6/htmlparser-1.6.jar
Downloading: http://builds.archive.org:8080/maven2/org/archive/access-control/access-control/0.1.0-SNAPSHOT/access-control-0.1.0-20140702.073710-171.jar
Downloaded: http://builds.archive.org:8080/maven2/org/archive/access-control/access-control/0.1.0-SNAPSHOT/access-control-0.1.0-20140702.073710-171.jar (35 KB at 22.2 KB/sec)
Downloaded: http://builds.archive.org:8080/maven2/org/htmlparser/htmlparser/1.6/htmlparser-1.6.jar (282 KB at 177.5 KB/sec)
Downloaded: http://builds.archive.org:8080/maven2/org/archive/ia-web-commons/1.0-SNAPSHOT/ia-web-commons-1.0-20131207.033010-102.jar (671 KB at 383.6 KB/sec)
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------

% $JAVA_HOME/bin/java -version
openjdk version "1.8.0_91"
OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-0ubuntu4~16.04.1-b14)
OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)

% mvn -version
Apache Maven 3.3.9
Maven home: /usr/share/maven
Java version: 1.8.0_91, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-8-openjdk-amd64/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.4.0-22-generic", arch: "amd64", family: "unix"

John Hewitt

May 17, 2016, 2:00:46 PM
to Common Crawl
Sebastian,

Thanks for the help. Following the same process:

%git pull origin
Already up-to-date

%mvn install
...
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ ia-hadoop-tools ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 1 resource
[INFO] 
[INFO] --- maven-compiler-plugin:2.3.2:compile (default-compile) @ ia-hadoop-tools ---
[INFO] Compiling 129 source files to /mnt/castor/seas_home/j/johnhew/summer16/ia-hadoop-tools/target/classes
[INFO] -------------------------------------------------------------
[ERROR] COMPILATION ERROR : 
[INFO] -------------------------------------------------------------
[ERROR] /mnt/castor/seas_home/j/johnhew/summer16/ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WEATGenerator.java:[29,26] error: cannot find symbol
[ERROR]  package org.archive.extract
/mnt/castor/seas_home/j/johnhew/summer16/ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WEATGenerator.java:[104,37] error: cannot find symbol
[INFO] 2 errors 
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 13.609 s
[INFO] Finished at: 2016-05-17T13:50:48-04:00
[INFO] Final Memory: 32M/435M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.3.2:compile (default-compile) on project ia-hadoop-tools: Compilation failure: Compilation failure:
[ERROR] /mnt/castor/seas_home/j/johnhew/summer16/ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WEATGenerator.java:[29,26] error: cannot find symbol
[ERROR] package org.archive.extract
[ERROR] /mnt/castor/seas_home/j/johnhew/summer16/ia-hadoop-tools/src/main/java/org/archive/hadoop/jobs/WEATGenerator.java:[104,37] error: cannot find symbol
...

%$JAVA_HOME/bin/java -version
java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)

%mvn -version

Apache Maven 3.3.3 
Maven home: /usr/share/java/maven
Java version: 1.8.0_91, vendor: Oracle Corporation
Java home: /usr/java/jdk1.8.0_91/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.16.7-35-desktop", arch: "amd64", family: "unix"

(Note that I tried the same on an Arch box with Linux "4.5.4-1-arch" and got the same output.)

-John

Sebastian Nagel

May 17, 2016, 4:33:31 PM
to common...@googlegroups.com
Hi John,

Sorry, my fault. After
   rm -rf ~/.m2/repository/org/archive/
I should have run
   mvn clean
because otherwise nothing is recompiled.

Now I'm able to reproduce the compilation problem.

The reason it succeeded on my laptop is trivial:
I had compiled ia-web-commons (forked from https://github.com/commoncrawl/ia-web-commons.git)
immediately before. I had to force Java 7 for that build:
   JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 mvn install
(I will try to get it to compile with Java 8.)

As a result, the right ia-web-commons jar, which includes the class
   org.archive.extract.WETExtractorOutput
happened to be in my local Maven cache.

Cheers,
Sebastian
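
Whether a cached jar actually contains the class the build needs can be checked by listing the jar's entries, since a jar is a zip archive. The following is a minimal Python sketch of that check, not part of ia-hadoop-tools or ia-web-commons; the helper names jar_has_class and find_jars_with_class are hypothetical, and the cache path is the ~/.m2 location mentioned in this thread.

```python
import zipfile
from pathlib import Path

# Class named in this thread; inside a jar it is stored as a path entry.
WET_CLASS = "org.archive.extract.WETExtractorOutput"


def jar_has_class(jar_path, class_name=WET_CLASS):
    """Return True if the jar (a zip archive) contains the given class."""
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


def find_jars_with_class(repo_dir, class_name=WET_CLASS):
    """Return all jars below repo_dir that contain the class (hypothetical helper)."""
    root = Path(repo_dir)
    if not root.is_dir():
        return []
    found = []
    for jar_path in sorted(root.rglob("*.jar")):
        try:
            if jar_has_class(jar_path, class_name):
                found.append(jar_path)
        except zipfile.BadZipFile:
            # Skip partially downloaded or corrupt jars.
            continue
    return found


if __name__ == "__main__":
    # Assumed default location of the local Maven cache for org.archive artifacts.
    cache = Path.home() / ".m2" / "repository" / "org" / "archive"
    for jar in find_jars_with_class(cache):
        print(jar)
```

If the script prints nothing, no cached org.archive jar contains the class, which would be consistent with the "cannot find symbol" error above.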

Sebastian Nagel

May 17, 2016, 5:13:17 PM
to Common Crawl

Hi John,

to install ia-web-commons with Java 8, just disable the failed unit test
   org.archive.util.ArchiveUtilsTest.testDoubleToString()
After ia-web-commons is installed locally, ia-hadoop-tools should compile.

Sebastian
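
The build order described in this thread (install ia-web-commons locally first, then build ia-hadoop-tools) can be sketched as a small Python script. This is only an illustration under assumptions: both repositories are already cloned side by side under one working directory, the failing test has already been disabled by hand (or a JDK is used on which it passes), and the function names build_plan and run_plan are hypothetical.

```python
import subprocess
from pathlib import Path

# Dependency first, then the project that needs it (order taken from this thread).
PROJECTS = ("ia-web-commons", "ia-hadoop-tools")


def build_plan(workdir):
    """Return (directory, command) steps for building both checkouts in order."""
    work = Path(workdir)
    # "clean" forces recompilation; "install" puts the jar into the local Maven cache.
    return [(work / name, ["mvn", "clean", "install"]) for name in PROJECTS]


def run_plan(steps):
    """Run each step in order and stop at the first failing build."""
    for cwd, argv in steps:
        subprocess.run(argv, cwd=cwd, check=True)
```

A call such as run_plan(build_plan("/path/to/checkouts")) would then run both builds; the path is a placeholder.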

John Hewitt

May 17, 2016, 5:24:55 PM
to Common Crawl
Sebastian,

Doing so worked for me. As an aside, the test seems to expect the doubleToString method to truncate rather than round.

Thanks for the help.

-John
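
To illustrate the difference between truncation and rounding mentioned above, here is a minimal Python sketch. It is not the ArchiveUtils implementation and does not use the values from the actual unit test (neither appears in this thread); the function format_double and the sample numbers are made up for illustration.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP


def format_double(value, digits, rounding):
    """Format value with a fixed number of fraction digits using the given rounding mode."""
    quantum = Decimal(1).scaleb(-digits)  # digits=2 gives a step of 0.01
    return str(Decimal(str(value)).quantize(quantum, rounding=rounding))


print(format_double(2.718, 2, ROUND_DOWN))     # truncation: 2.71
print(format_double(2.718, 2, ROUND_HALF_UP))  # rounding:   2.72
```

A test written against one behavior fails against the other whenever the first dropped digit is 5 or greater.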