Hi Vladimir,
the code to generate WAT and WET files can be found on GitHub. It originally comes from the IIPC and
the Internet Archive; our forks contain a few modifications which we try to push back upstream.
The steps to get the WAT/WET extractor running are:
git clone https://github.com/commoncrawl/ia-web-commons
cd ia-web-commons
mvn -f pom-cdh5.xml install
# could also use pom.xml
cd -
git clone https://github.com/commoncrawl/ia-hadoop-tools
cd ia-hadoop-tools
mvn package
java -jar ./target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator \
name_of_archive .../warc/warcfile.warc.gz
Note that the WARC file must be placed in a folder named warc/.
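To make the folder requirement concrete, here is a minimal sketch of staging an input file before running the extractor. The function name stage_warc and its two arguments are hypothetical and not part of ia-hadoop-tools; it only copies a WARC file into a warc/ subfolder of a working directory and prints the resulting path.

```shell
#!/bin/sh
# stage_warc is a hypothetical helper, not part of ia-hadoop-tools.
# It copies a WARC file into <base_dir>/warc/ and prints the staged path.
stage_warc() {
  src="$1"
  base_dir="$2"
  # Refuse to continue if the input file does not exist.
  [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }
  # Create the warc/ folder the extractor expects the input to live in.
  mkdir -p "$base_dir/warc" || return 1
  cp "$src" "$base_dir/warc/" || return 1
  printf '%s\n' "$base_dir/warc/$(basename "$src")"
}
```

The printed path can then be passed as the last argument of the WEATGenerator command above.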
It's also possible to run it on Hadoop:
hadoop jar target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator ...
> How is this transformation done? Is there any paper or commentary on how you extract plain text
> from your HTML files present in .warc archives?
> As far as I know, this is a big issue, because there is a solid chance of omitting interesting
> information while doing so.
We know that the extraction is not perfect; in particular, there are some issues with encoding
detection and conversion to proper Unicode. There are better tools to extract clean plain text,
links, and metadata; e.g., some prefer the Gumbo parser.
Of course, hints and help in improving the WAT/WET generation are always welcome!
Best,
Sebastian