WARC to WET transformation (HTML to plain text)

Vladimir Smatanik

Oct 27, 2016, 7:16:30 AM
to Common Crawl
Hello, 

do you have any information on how the transformation from HTML to plain text is done in the Common Crawl datasets?

I've noticed that WET files contain plain text, which is extracted from original full-page HTML from .warc files.

Usually it takes a lot of time and fine tuning to extract plain text from HTML, because you can lose a lot of interesting information while doing so.

Sebastian Nagel

Oct 27, 2016, 7:33:36 AM
to common...@googlegroups.com
Hi Vladimir,

the code to generate WAT and WET files can be found on GitHub. It's originally from IIPC and
the Internet Archive; our forks contain a few modifications which we try to push back upstream.

The steps to get the WAT/WET extractor running are:

git clone https://github.com/commoncrawl/ia-web-commons
cd ia-web-commons
mvn -f pom-cdh5.xml install
# could also use pom.xml
cd -

git clone https://github.com/commoncrawl/ia-hadoop-tools
cd ia-hadoop-tools
mvn package

java -jar ./target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator \
name_of_archive .../warc/warcfile.warc.gz

Note that the WARC file must be placed in a folder named warc/.
It's also possible to run it on Hadoop:

hadoop jar target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator ...
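The local invocation above, together with the warc/ folder requirement, can be sketched as a small Python helper. This is only an illustration: the function name weat_command is hypothetical, and reading "placed in a folder warc/" as "the immediate parent directory is named warc" is an assumption, not something confirmed by the tool's documentation.

```python
from pathlib import PurePosixPath

# Jar path as given in the steps above
JAR = "./target/ia-hadoop-tools-jar-with-dependencies.jar"


def weat_command(archive_name: str, warc_path: str) -> list[str]:
    """Build the local WEATGenerator command line (hypothetical helper).

    Assumption: the WARC file's immediate parent directory must be named "warc".
    """
    path = PurePosixPath(warc_path)
    if path.parent.name != "warc":
        raise ValueError(f"{warc_path!r} is not inside a folder named 'warc'")
    return ["java", "-jar", JAR, "WEATGenerator", archive_name, str(path)]
```

The returned list can be handed to subprocess.run once the jar has been built with mvn package.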

> How is this transformation done? Is there any paper or commentary on how you extract plain text
> from the HTML files present in .warc archives?
> As far as I know, this is a big issue, because there is a solid chance of omitting interesting
> information while doing so.

We know that the extraction is not perfect, esp. there are some issues regarding the encoding
detection and conversion to proper Unicode. There are better tools to extract clean plain text,
links, and metadata. E.g., some prefer the Gumbo parser.
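To make the encoding problem concrete, here is a minimal Python sketch of decoding page bytes to Unicode with fallbacks. It is not what the WAT/WET extractor does; the function name decode_html, the order of attempts (declared charset, then UTF-8, then windows-1252) and the regular expression are all assumptions chosen for illustration.

```python
import re


def decode_html(raw: bytes, content_type: str = "") -> str:
    """Decode page bytes, trying the declared charset first, then UTF-8."""
    candidates = []
    # Pick up a charset label from a Content-Type value, if one is present
    match = re.search(r'charset="?([\w.:-]+)', content_type, re.IGNORECASE)
    if match:
        candidates.append(match.group(1))
    candidates.append("utf-8")
    for name in candidates:
        try:
            return raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown charset label, or bytes that do not fit it
    # Last resort: windows-1252 with replacement characters, which never raises
    return raw.decode("windows-1252", errors="replace")
```

Real pages often declare a wrong charset or none at all, which is why a fixed fallback order like this one still produces garbled text in some cases.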

Of course, hints and help in improving the WAT/WET generation are always welcome!

Best,
Sebastian
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to
> common-crawl...@googlegroups.com <mailto:common-crawl...@googlegroups.com>.
> To post to this group, send email to common...@googlegroups.com
> <mailto:common...@googlegroups.com>.
> Visit this group at https://groups.google.com/group/common-crawl.
> For more options, visit https://groups.google.com/d/optout.

Ivan Habernal

Oct 27, 2016, 8:13:59 AM
to Common Crawl
Hi,

We've tackled plain text extraction (with boilerplate removal) and encoding identification on CommonCrawl data in C4Corpus-Tools: https://github.com/dkpro/dkpro-c4corpus - it might be relevant to your needs (but it depends on what you eventually want to do with the data).

Best,

Ivan

Vladimir Smatanik

Oct 28, 2016, 7:03:51 AM
to Common Crawl
Thanks everyone!

This is exactly what we were looking for.

Frank From Web

Apr 30, 2020, 12:51:06 AM
to Common Crawl
Hi Sebastian - 
Thank you for your answer. The code works for converting to WET files.
Have you ever tried to run it on Google Dataproc with GCS (Google Cloud Storage)?
It failed to write the WET files to GCS. Looking into the code, the mapper generates one WET file per WARC file and uses
FileSystem I/O to write the file. I wonder whether a recorderReader for the WARC input (which shuffles the WARC records in the files)
might resolve this issue.
I would appreciate any suggestions.
Thanks.

Best, 
Frank

Sebastian Nagel

Apr 30, 2020, 3:44:00 AM
to common...@googlegroups.com
Hi Frank,

> Have you ever tried to run it on Google Dataproc with GCS (Google Cloud Storage)?

No. Only on Cloudera CDH running on AWS. WARC files are read from S3 and WAT/WET files
are written to S3.

Have you had a look at the class "ProducerUtils"? Various filesystem schemes
are handled: "hdfs://", "s3a://", "http://", "file://" - but not "gs://".
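The effect of a fixed scheme list can be illustrated in a few lines of Python. This is not the ProducerUtils code (which is Java); the set below simply repeats the schemes named in this message, and the function name is_handled is hypothetical.

```python
from urllib.parse import urlsplit

# Schemes listed in the message above; not read from the ProducerUtils source
HANDLED_SCHEMES = {"hdfs", "s3a", "http", "file"}


def is_handled(uri: str) -> bool:
    """Return True if the URI's scheme is one of the listed schemes."""
    return urlsplit(uri).scheme.lower() in HANDLED_SCHEMES


for uri in ("s3a://bucket/warc/file.warc.gz", "gs://bucket/warc/file.warc.gz"):
    print(uri, "handled" if is_handled(uri) else "not handled")
```

Under this assumption a "gs://" path falls through to whatever the default branch does, which would be one place to look when output to GCS fails.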

> I wonder if there is a recorderReader for the warc input (which shuffles the warc record in the
> files) might resolve this issue.

What's a "recorderReader"? - or did you mean "RecordReader" [1]?

Best,
Sebastian

[1] https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/RecordReader.html



Frank From Web

Apr 30, 2020, 4:50:07 AM
to Common Crawl
Thank you for your response, Sebastian.
Yes, I meant RecordReader.
That makes GCS compatible with HDFS, so it should work. I will double-check whether anything is missing in the Dataproc configuration.

Thank you also for the information. I'm glad to know the code can read from S3, which saves the time needed to copy the WARC files.
I will take a look at ProducerUtils as well.


Best,

- Frank

Sebastian Nagel

Apr 30, 2020, 5:30:59 AM
to common...@googlegroups.com
Just in case it's only about a RecordReader implementation which consumes WARC files:
this implementation by Stephen Merity is quite concise and easy to understand:
https://github.com/commoncrawl/cc-warc-examples/blob/master/src/org/commoncrawl/warc/WARCFileRecordReader.java

... and also easy to use. See one of the examples:
https://github.com/commoncrawl/cc-warc-examples/tree/master/src/org/commoncrawl/examples/mapreduce

Frank From Web

Jun 10, 2020, 8:25:26 PM
to Common Crawl
Hi Sebastian, 

Does this code (converting the HTML to plain text) remove boilerplate, as C4Corpus-Tools does?
Thanks.

- Frank



Tom Morris

Jun 11, 2020, 6:23:44 PM
to common...@googlegroups.com
On Wed, Jun 10, 2020 at 8:25 PM Frank From Web <wuf...@gmail.com> wrote:
> Does this code (converting the HTML to plain text) remove boilerplate, as C4Corpus-Tools does?

No, it doesn't. It just strips HTML tags. If you're interested in the C4Corpus boilerplate removal, I'd suggest checking out some of the work that I did to enhance both the functionality and run-time performance, such as https://github.com/dkpro/dkpro-c4corpus/pull/28 . There's also a list of the various issues that I found with the boilerplate removal and other text processing, not all of which are fixed by my PRs, https://github.com/dkpro/dkpro-c4corpus/issues?q=is%3Aopen++author%3Atfmorris+
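To show what plain tag stripping without boilerplate removal looks like, here is a minimal sketch using Python's standard html.parser. It is not the extractor's code: the class name TagStripper, the helper strip_tags, and the choice to skip script and style contents are assumptions made for this example.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def strip_tags(html: str) -> str:
    """Return the text content of an HTML string, one text node per line."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = (
    "<html><body><nav>Home | About</nav><script>var x = 1;</script>"
    "<p>Main article text.</p><footer>Copyright 2020</footer></body></html>"
)
print(strip_tags(page))
```

The navigation and footer text survive alongside the article text, which is exactly the material a boilerplate remover would try to drop.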


Somewhat surprisingly, at least to me, the text extraction is done as part of building the WARC and the extracted text is written as a "metadata" record in the WARC where it is then fetched from later as part of the WET/WAT extraction process. Seems like it might be more efficient to write all three files at the same time instead of making multiple passes over the data. It also makes the WARCs bigger than really needed (unless I'm missing something).

Tom

 

Tom Morris

Jun 11, 2020, 6:38:45 PM
to common...@googlegroups.com
Oops! Ignore the part about the WARC->WET conversion. It's wrong. I got confused by the fact that the WET & WAT files are also in WARC format.
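Since WET and WAT files use the WARC record layout (a version line, named header fields, a blank line, then a payload of Content-Length bytes), a minimal reader can be sketched in Python. This is a simplified illustration, not a full WARC parser: the function name iter_warc_records and the in-memory sample record are made up for the example, and the reader assumes an already decompressed stream with well-formed records.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in a WARC-format byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separating records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# Hand-written sample in the shape of a WET "conversion" record
text = "Hello from a WET record.\n".encode("utf-8")
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: conversion\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: " + str(len(text)).encode() + b"\r\n"
    b"\r\n" + text + b"\r\n\r\n"
)

for headers, payload in iter_warc_records(io.BytesIO(sample)):
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

For a real .warc.wet.gz file the stream would come from gzip.open(path, "rb"), which reads through the concatenated gzip members.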

Tom