Including offset/length for WAT/WET entries in the CDX and parquet indexes


Colin Dellow

Jun 20, 2019, 7:22:33 AM
to Common Crawl
Hello list!

The CDX and Parquet indexes currently include offsets/lengths for WARC entries. This is super helpful, as it means many tasks can be performed for a fraction of the data transfer/processing costs you'd otherwise incur if you processed the complete set of WARC files. For many of my own tasks, I find the WET/WAT files sufficient. Indeed, Sebastian notes in this post <https://groups.google.com/forum/#!msg/common-crawl/BypZ51wplwA/4EYjUvW3AAAJ> that WET is the most popular format.

Would it be possible to include offsets/lengths for WAT/WET entries as well in the CDX and Parquet indexes? I think it would suffice to add offset-wat/length-wat and offset-wet/length-wet entries where applicable, since the corresponding filename-wet/filename-wat entries could be mechanically generated by the client based on the existing filename key. I imagine this would be fiddly - generating the CDX files would now need to incorporate data from the WET/WAT files, which would incur additional costs and either delay the availability of the CDX files, or require them to be published in two phases.
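
To illustrate the mechanical derivation I have in mind, a quick Python sketch (assuming the current naming convention of the crawl paths continues to hold):

    def derived_paths(warc_filename):
        # e.g. 'crawl-data/CC-MAIN-2019-26/segments/<segment>/warc/<name>.warc.gz'
        # maps to the same name under wet/ and wat/ with an extended suffix.
        wet = warc_filename.replace('/warc/', '/wet/').replace('.warc.gz', '.warc.wet.gz')
        wat = warc_filename.replace('/warc/', '/wat/').replace('.warc.gz', '.warc.wat.gz')
        return wet, wat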

A particular use case that I'm interested in is minimizing processing costs by operating only on English WET files for homepages, as tagged by the Common Crawl's language classifications. I've explored these approaches, but each has its own tradeoffs:

(1) process WET files, do my own language classification

Language classification is expensive relative to the processing I'm doing. A smaller con is that my classification may not match the Common Crawl's - this isn't as critical, but it adds some complexity when debugging why a pipeline running against WARCs produced different results from one running against WETs.

(2) process the CDX files to discover WARC entries, fetch that WARC entry, do my own HTML-to-text extraction

HTML-to-text extraction is also expensive, and it also produces differences relative to the CC's WET extraction. If the number of candidate entries is larger than, say, 10% of the crawl, doing individual lookups is less efficient than scanning the WARCs and reading the "metadata" capture (see the sketch after this list for what an individual lookup involves). However, in that case we have to unzip 100% of the entries to determine whether they should be processed, and gzip is actually quite expensive relative to the processing I'm doing. In theory this can be avoided by processing the CDX files first to build a list of which entries to process, but I haven't done that yet.

(3) process the CDX files to discover URLs that meet the criteria, then probe the approximate region of the WET file for a matching URL

This is fiddly and has similar performance concerns to (2), so I'm putting off doing it for the moment.

(4) compute an index of offsets for WET/WAT files that includes the WARC language classifications from the CDX files

Once I began looking at this, I figured I'd first check to see if it'd make sense / be reasonable for the Common Crawl itself to include this in the CDX files.
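
For reference, the individual lookups I mention in (2)-(4) boil down to a ranged GET of a single gzip member followed by a gunzip, roughly like this Python sketch (filename/offset/length would come from a CDX or columnar index row; error handling omitted):

    import gzip
    import requests

    def fetch_record(filename, offset, length):
        # Each WARC/WAT/WET record is its own gzip member, so a byte-range
        # request followed by a gunzip yields exactly one record.
        url = 'https://commoncrawl.s3.amazonaws.com/' + filename
        byte_range = 'bytes=%d-%d' % (offset, offset + length - 1)
        resp = requests.get(url, headers={'Range': byte_range})
        resp.raise_for_status()
        return gzip.decompress(resp.content).decode('utf-8', errors='replace')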

Thanks for any insight!

Karen Shaeffer

Jun 26, 2019, 3:04:59 PM
to Common Crawl
Hi Colin,
I'm interested in harvesting corpora for language modeling purposes. And, as you point out, there does not seem to be any direct mapping from the information in the CDX and Parquet indexes to the WET file data. Realistically, in the long run, my intuition is that the most useful method is to process the CDX indexes for language and WARC file offsets, and then extract text from the WARC files yourself, which lets you write code that aggregates the data to your exact criteria. For example, the WET files embody many design decisions that affect the extracted result; one is that punctuation and emojis are lost. Also, during processing it seems reasonable to filter specific URLs on other criteria, reducing downstream processing requirements. And one only needs to write this code once, with the ability to modify and extend the implementation over time as requirements evolve. And, of course, the same code can be used when scraping the web directly, where such an effort is justifiable.

Any helpful comments or suggestions always invited.

Sebastian Nagel

Jun 26, 2019, 5:20:30 PM
to common...@googlegroups.com
Hi Colin, hi Karen,

thanks for the suggestions and comments!

> Would it be possible to include offsets/lengths for WAT/WET entries as well in the CDX
> and Parquet indexes?

This has long been a wish from users. And of course, it's possible.
But there are two reasons why I've never started to implement it:

1. there are some challenges in actually implementing the WAT/WET indexing
(you've already mentioned them; more details below)

2. in the longer term I would like to move away from the
WAT and WET formats and provide the same data with more metadata
and annotations (language detection, boilerplate markup)
in a columnar format, to make it easier for users to filter
the data at scale and to cheaply cut out single columns (title, keywords).
I also hope to get better compression from a columnar format.

Of course, solution 2 means a lot of work, and in any case we'll keep the WAT/WET
files for a longer transition period. So I'll have a look at what I can do to get 1
implemented.

> generating the CDX files would now need to incorporate data from the WET/WAT files,
> which would incur additional costs and either delay the availability of the CDX files,
> or require them to be published in two phases.

The best solution would be to write the WAT/WET files together with the WARC and CDX files.
Otherwise, the release of the data would happen a few days later, because we would need
to wait for the WAT/WET files before we can write the URL indexes and compute the metrics
and statistics about the crawl.

So what's the challenge?
- in the past there had been issues causing the WAT/WET generator to crash or hang [1]
but these seem to be fixed now (nothing has happened in the last two years). The
point is: if the WAT/WET files are written together with the WARCs, we may lose data
if the WAT/WET generator code isn't 100% reliable.
- It should also be sufficiently fast, which is currently not the case: the Nutch fetcher
job writes about 140 WARC files per day and CPU core (this includes fetching, politeness +
robots.txt handling, language detection, and writing the WARCs). In comparison, the WAT/WET
writer processes about 240 WARCs per day and CPU core. This seems slow, but I need to
profile the code to get a picture of where the CPU time goes.
- the WAT/WET generator code [2] is not really well maintained - partly my fault: there
are a couple of fixes and improvements from CC's fork [3] still to be pushed upstream
- and it is based on and tied to the "htmlparser.org" library [4] (last updated 8 years ago)
(there have been many claims that the text extraction could be done better)
Karen, that probably also applies to your preference for the WARC records?


> (1) process WET files, do my own language classification
> Language classification is expensive relative to the processing I'm doing.

Yes, it definitely is. But wouldn't it be better to have the language annotations available in the
WET header (or as a separate column in a columnar format)?

> A particular use case that I'm interested in is minimizing processing costs by operating
> only on English WET files for homepages

Picking the 40-50% of English WET records one by one via offsets from the index could be even slower,
given that there is a certain overhead to picking WARC/WAT/WET records; see [5].
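
(Side note: for WARC records you can already get such a filtered record list from the columnar index, e.g. via Athena. A rough sketch - adjust the database/table names and the output location to your own setup, and note that content_languages may list more than one language per page:)

    import boto3

    query = """
    SELECT url, warc_filename, warc_record_offset, warc_record_length
    FROM "ccindex"."ccindex"
    WHERE crawl = 'CC-MAIN-2019-26'
      AND subset = 'warc'
      AND url_path = '/'              -- homepages only
      AND content_languages = 'eng'   -- pages detected as English only
    """

    athena = boto3.client('athena', region_name='us-east-1')
    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={'Database': 'ccindex'},
        ResultConfiguration={'OutputLocation': 's3://YOUR-BUCKET/athena-results/'})

Once WET/WAT offsets are in the index, the same kind of query would point you at the WET records directly.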


Best,
Sebastian


[1] https://github.com/commoncrawl/ia-web-commons/issues?q=is%3Aissue+is%3Aclosed
[2] https://github.com/iipc/webarchive-commons/
[3] https://github.com/commoncrawl/ia-web-commons/
[4] https://mvnrepository.com/artifact/org.htmlparser/htmlparser/2.1
[5] https://groups.google.com/d/msg/common-crawl/Umi8YBrerMk/nAZl6AAzDAAJ



Colin Dellow

Jun 27, 2019, 3:22:07 PM
to common...@googlegroups.com
On Wed, Jun 26, 2019, 8:05 PM Karen Shaeffer <klssh...@gmail.com> wrote:
> Hi Colin,
> I'm interested in harvesting corpora for language modeling purposes. And, as you point out, there does not seem to be any direct mapping from the information in the CDX and Parquet indexes to the WET file data. Realistically, in the long run, my intuition is that the most useful method is to process the CDX indexes for language and WARC file offsets, and then extract text from the WARC files yourself, which lets you write code that aggregates the data to your exact criteria. For example, the WET files embody many design decisions that affect the extracted result; one is that punctuation and emojis are lost.

Yes, definitely the most flexible, faithful and powerful option is to process the raw WARCs.

On the other hand, even with the limitations you identify, the WETs can be very useful. I've found them to be suitable for raw greps and for running Bayesian classifiers. For both of these use cases, the actual processing done by my code is quite light. In fact, much of the cost is in network transfer, gzip decompression time and string decoding.

Given that, I have found the limitations are an acceptable trade-off in order to process 1/7th of the data. It means one pass through a crawl can be as cheap as $1. But I'm greedy, I'd like it to be even cheaper. :)

> Also, during processing it seems reasonable to filter specific URLs on other criteria, reducing downstream processing requirements. And one only needs to write this code once, with the ability to modify and extend the implementation over time as requirements evolve.

This is definitely true! I'm exploring how to democratize access to the CC via an on-demand service. Pre-filtering the URLs is part of that, which is where my interest in being able to check the CDX (or other indexes) comes from.


Colin Dellow

Jun 27, 2019, 3:49:33 PM
to common...@googlegroups.com
On Wed, Jun 26, 2019, 10:20 PM Sebastian Nagel <seba...@commoncrawl.org> wrote:
> Hi Colin, hi Karen,
>
> thanks for the suggestions and comments!
>
> > Would it be possible to include offsets/lengths for WAT/WET entries as well in the CDX
> > and Parquet indexes?
>
> This has long been a wish from users. And of course, it's possible.
> But there are two reasons why I've never started to implement it:
>
>  1. there are some challenges in actually implementing the WAT/WET indexing
>     (you've already mentioned them; more details below)
>
>  2. in the longer term I would like to move away from the
>     WAT and WET formats and provide the same data with more metadata
>     and annotations (language detection, boilerplate markup)
>     in a columnar format, to make it easier for users to filter
>     the data at scale and to cheaply cut out single columns (title, keywords).
>     I also hope to get better compression from a columnar format.

Ah, interesting. Trying to do something specifically with annotating boilerplate sounds very useful (if a bit daunting, at least to me)!

I find a lot of value in the conciseness of the WET format. I don't have a lot of data points, so I don't know how much this generalizes, but I find it can sometimes be cheaper to scan the WETs for captures of interest, then individually process the underlying WARCs.

The WAT format is similarly useful, although the ratio is not as good (2x vs 7x, perhaps?). I can see its data being particularly amenable to being stored more efficiently and more usefully in a columnar form.

> Of course, solution 2 means a lot of work, and in any case we'll keep the WAT/WET
> files for a longer transition period. So I'll have a look at what I can do to get 1
> implemented.

That would be amazing! 

If there are suitably discrete tasks that you think I could contribute to, I'd be happy to try to assist.

> > generating the CDX files would now need to incorporate data from the WET/WAT files,
> > which would incur additional costs and either delay the availability of the CDX files,
> > or require them to be published in two phases.
>
> The best solution would be to write the WAT/WET files together with the WARC and CDX files.
> Otherwise, the release of the data would happen a few days later, because we would need
> to wait for the WAT/WET files before we can write the URL indexes and compute the metrics
> and statistics about the crawl.
>
> So what's the challenge?
> - in the past there had been issues causing the WAT/WET generator to crash or hang [1]
>   but these seem to be fixed now (nothing has happened in the last two years). The
>   point is: if the WAT/WET files are written together with the WARCs, we may lose data
>   if the WAT/WET generator code isn't 100% reliable.
> - It should also be sufficiently fast, which is currently not the case: the Nutch fetcher
>   job writes about 140 WARC files per day and CPU core (this includes fetching, politeness +
>   robots.txt handling, language detection, and writing the WARCs). In comparison, the WAT/WET
>   writer processes about 240 WARCs per day and CPU core. This seems slow, but I need to
>   profile the code to get a picture of where the CPU time goes.
> - the WAT/WET generator code [2] is not really well maintained - partly my fault: there
>   are a couple of fixes and improvements from CC's fork [3] still to be pushed upstream
> - and it is based on and tied to the "htmlparser.org" library [4] (last updated 8 years ago)
>   (there have been many claims that the text extraction could be done better)

I'm very fortunate in that my use case is robust against many of the challenges that other NLP tasks face -- the odd word pair smushed together, the retention of boilerplate, and so on are fine for me.

> Karen, that probably also applies to your preference for the WARC records?
>
> > (1) process WET files, do my own language classification
> > Language classification is expensive relative to the processing I'm doing.
>
> Yes, it definitely is. But wouldn't it be better to have the language annotations available in the
> WET header (or as a separate column in a columnar format)?

Embedded in the WET would be ideal for the case where you want to examine most URLs. The main downside is that you need to unzip each entry to inspect its metadata, and the zlib format is very, very slow to decompress. Worse, there is no way to decompress only the metadata and not the payload (...well, short of heuristics to skip to the next thing that looks like a gzip magic number)

A separate index, whether CDX or Parquet, is better in my view, because it can serve either the needle-in-a-haystack scenario (consult index, fetch and uncompress only responsive records) or the sift-the-haystack scenario (consult index, build map of responsive records, fetch all records but only uncompress responsive ones).
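
Concretely, the sift-the-haystack variant is roughly the following Python sketch, where responsive holds the (offset, length) pairs collected from the index for a single WET/WARC file:

    import gzip
    import requests

    def sift(file_url, responsive):
        # Fetch the whole file once, but gunzip only the records whose
        # (offset, length) we looked up in the index beforehand.
        data = requests.get(file_url).content
        for offset, length in sorted(responsive):
            yield gzip.decompress(data[offset:offset + length])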


> > A particular use case that I'm interested in is minimizing processing costs by operating
> > only on English WET files for homepages
>
> Picking the 40-50% of English WET records one by one via offsets from the index could be even slower,
> given that there is a certain overhead to picking WARC/WAT/WET records; see [5].

Yes, but I think when you consider homepages only, it drops from 40-50% to more like 5-10%, and the math becomes favourable again. This is not a huge deal, I think it'd provide at most a factor of 2 cost reduction. (But every bit helps!)


Max Jacobson

Jun 27, 2019, 4:03:02 PM
to Common Crawl
Hi Colin,

I'd be very interested to learn more about the infrastructure you are using to take a pass through one whole crawl's worth of WET for $1 - could you please give a few more details?  Are you using warcio?  Are you running locally or in Glue/Batch/Something else?  Are you first downloading the files or are you processing while they are in S3?

Many thanks,
Max


Karen Shaeffer

Jun 28, 2019, 7:19:04 PM
to Common Crawl
Hi Sebastian,
Thanks for your helpful comments. I appreciate that there are so many different ways to parse the fetches. From your perspective, I think the optimal criterion is to separate the application text from the presentation and document data, and so you want to keep all of the application text, including punctuation and emojis, for example -- they provide a lot of information for the application. My point of view is that no matter what criteria you use, the state of the text won't be optimal for many users. The issue is that it is easy to filter out punctuation if you don't want it; but if you do want it and it has already been removed, then you cannot use the data, and you have to work from the WARC files. So less filtering of the text is better for more users. Even so, I don't think there is a right or wrong way to do it from your point of view. It's an interesting problem that isn't too difficult using Python -- it just requires a lot of effort and resources.

best,
Karen

Karen Shaeffer

Jun 29, 2019, 1:54:30 AM
to Common Crawl
Actually, I see the WET files do have punctuation. It's just that there is so much unstructured text in those fetches that I didn't look closely enough.

Even so, I think I'll continue to process WARC files. For anyone coding in Python, processing WARC files with the warcio library and BeautifulSoup is a trivial task. The warcio ArchiveIterator provides a stream of record objects that one can use to filter the WARC file contents, for example on language, charset, and URL. Then, for those fetches that satisfy your criteria, feed record.content_stream().read() into BeautifulSoup. And you've got the equivalent of a WET file that only includes the fetches of interest to you, without the WARC file overhead. Trivial. Of course, there is quite a bit more processing to create a language corpus.
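
Roughly like this (a sketch; the URL filter and the HTML check are placeholders for whatever criteria you care about):

    from warcio.archiveiterator import ArchiveIterator
    from bs4 import BeautifulSoup

    def extract_text(warc_path, url_filter):
        # Stream a local .warc.gz, keep only HTML responses whose URL passes
        # url_filter, and yield (url, text) pairs extracted with BeautifulSoup.
        with open(warc_path, 'rb') as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != 'response':
                    continue
                url = record.rec_headers.get_header('WARC-Target-URI')
                ctype = record.http_headers.get_header('Content-Type') or ''
                if 'html' not in ctype or not url_filter(url):
                    continue
                soup = BeautifulSoup(record.content_stream().read(), 'html.parser')
                yield url, soup.get_text(separator=' ', strip=True)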

best,
Karen 

Colin Dellow

Jun 29, 2019, 4:44:22 PM
to common...@googlegroups.com
On Thu, Jun 27, 2019, 9:03 PM Max Jacobson <maxmja...@gmail.com> wrote:
> Hi Colin,
>
> I'd be very interested to learn more about the infrastructure you are using to take a pass through one whole crawl's worth of WET for $1 - could you please give a few more details?

Happy to! In exchange, I'd love to hear how you use the common crawl & what your costs are like, if you're willing to share.

> Are you using warcio?

I'm using an in-house framework written in Scala. I think it's similar in spirit to https://github.com/iipc/jwarc - it tries to minimize buffer allocations and copies.

> Are you running locally or in Glue/Batch/Something else?  Are you first downloading the files or are you processing while they are in S3?

It runs in EC2, on spot instances, in us-east-1. I find Amazon's managed services charge a high premium. In some cases it's worth it, but I think not for this workload. E.g., prices are something like the following for 1 hour of 4 CPUs with 16 GB of RAM:

$0.44/hr - Glue
$0.1296/hr - EMR m5.xlarge spot instance
$0.0816/hr - EC2 m5.xlarge spot instance

Another benefit of using raw EC2 is that you can use modern instance types, like the a1 family which is not yet supported on EMR.

Instead, I have a framework where a user provides a function that receives a WARC entry as input and can emit some number of lists of strings. The user indicates what class of records they wish to process (for example, matches URL x, in language y), and the framework runs their code on the responsive records. The output is saved in S3 and contains the user-emitted strings plus the URL of the underlying WET/WARC file and the offset of the entry within it.

In practical terms, after writing their function and compiling it into a self-contained uberjar, the user queues the job via a command line tool that:

- creates an SQS queue
- inserts messages describing the work to be done. Currently, a message enumerates some # of WET/WARC files. You might describe a crawl with 600 messages that list 100 WETs each, encompassing the 60,000 WETs in a given crawl. This can be extended to other approaches. For example, 300 messages each describing 1 parquet file to query as a precursor to fetching individual records. That isn't always faster than scanning the entire WET/WARC file, so some care is needed.
- publishes the user's uberjar to S3
- publishes a config file to S3 that describes the job's metadata (SQS queue name, function entry point, etc)

At this point, the job is ready to be processed. 

The user can launch as many spot instances as they'd like to process the job. I use instance user data to bootstrap the jar and config file from S3. The jar runs and polls for work from the SQS queue. When the SQS queue is empty, the app terminates, which causes the instance to terminate. Any failure during the bootstrap phase also terminates the instance.
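
The worker loop is essentially the following (sketched here in Python for brevity; the real thing is Scala, and all names are illustrative):

    import boto3

    def process_wet_file(path):
        # placeholder for the user-supplied map function applied to one file
        pass

    sqs = boto3.client('sqs', region_name='us-east-1')
    queue_url = sqs.get_queue_url(QueueName='my-cc-job')['QueueUrl']

    while True:
        resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        if 'Messages' not in resp:
            # queue drained (in practice, retry a few empty polls first);
            # exiting lets the instance shut itself down
            break
        msg = resp['Messages'][0]
        for path in msg['Body'].splitlines():   # one message lists N files
            process_wet_file(path)
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg['ReceiptHandle'])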

Progress is monitored coarsely via SQS metrics, or granularly via app-specific metrics exported via an HTTP endpoint. When all is done, there are some number of output files in S3.

It's also worth mentioning that this is a greatly constrained processing model - it's map, not map/reduce. Its outputs would fit nicely into a pipeline that used Glue or Spark, though.

Lastly, it's important that the actual code you run be efficient. For string searching, I use either Aho-Corasick (for fixed strings) or brics.dk's automaton matcher (for regexes). I'd also like to try Intel's hyperscan library, but haven't gotten around to it yet.
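
(For anyone doing the same in Python, the pyahocorasick package gives equivalent fixed-string matching, e.g.:)

    import ahocorasick  # pip install pyahocorasick

    patterns = ['common crawl', 'web archive']   # illustrative fixed strings
    automaton = ahocorasick.Automaton()
    for i, p in enumerate(patterns):
        automaton.add_word(p, (i, p))
    automaton.make_automaton()

    text = 'the common crawl is a public web archive'
    print([(end, word) for end, (_, word) in automaton.iter(text)])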

Hope that helps!

Tom Morris

Jul 2, 2019, 8:41:59 PM
to common...@googlegroups.com
On Wed, Jun 26, 2019 at 3:05 PM Karen Shaeffer <klssh...@gmail.com> wrote:

> I'm interested in harvesting corpora for language modeling purposes. And, as you point out, there does not seem to be any direct mapping from the information in the CDX and Parquet indexes to the WET file data. Realistically, in the long run, my intuition is that the most useful method is to process the CDX indexes for language and WARC file offsets, and then extract text from the WARC files yourself, which lets you write code that aggregates the data to your exact criteria. For example, the WET files embody many design decisions that affect the extracted result.

I would definitely recommend against using the WET files for assembling language modeling corpora.

And actually, I'm kind of surprised that they're so popular at all, since there's no boilerplate removal done. This makes them *very* noisy, and because all the structure is gone, there's no way to do the boilerplate removal oneself.

As an example, below is the first page of the first WET file in the most recent crawl with the non-boilerplate text highlighted. Admittedly it's from a site which is mostly boilerplate, but all pages in the crawl are going to be affected to a greater or lesser degree.

--------------

Air Jordan 10 “Bobcat” 310805-026 Sz. 13 With Box | Kixify Marketplace
Filters
Sort
Toggle navigation
Toggle navigation
Sell
Login
Signup
Air Jordan 10 “Bobcat” 310805-026 Sz. 13 With Box
Air jordan 10 “bobcat” 310805-026 sz. 13 with box
Want+9
$65
$65.00
Size: 13
There's only 1 left!
Estimated Delivery: Jun. 19 - Jun. 23
Pre-Owned
Men's
143 Views
Shipping
USA $15
Canada $40
International $50
Shipping
USA $15
Canada $40
International $50
A1sKicks
139 ITEMS 5061 Followers 99.38% Feedback
Follow Contact
Product Details
Size 13
Condition 8/10
With Box
Products Are exactly as pictured we describe products to the best of our ability. Every SHOE and apparel IS GUARANTEED 1000% AUTHENTIC!! We take as much detailed pictures as we can to show the product then we package them so we dont have extra pictures. Free shipping in U.S as with all our products listings. Shipping is 24-48 hour, TRACKING WILL BE ATTACHED TO PAYPAL. Shoes will be delivered 6-10 days (most make it 2-3) from the date bought.. BE PATIENT, asking about shipping won’t make it come faster:). We do our best to get it to you asap!! No trades, also each product posted is listed on multiple sites and sell fast typically so if you want them act fast. We typically sell lower than anyone else :). All products listed “used” may have wear such as paint chipping , bends, material rubbing off, scratches, smells such as smoke, or a persons prior wear, insole wear, replacement soles that can have wear, paint marks. All products listed “for parts not working” nay have cracks, missing parts, nothing is guaranteed to work and may have signs of use again we list what we see but We let you be the judge. Again there are no refunds Please leave a 5 star review , we will leave you one also :). No refunds. Thanks.
Style Code: 310805-026
more » « less
Shipping & Payment Policy
Free shipping
No Returns. All shoes are 1000% authentic. Paypal used for safety.
more » « less
Air Jordan 10 - Bobcats Air Jordan 1 Air Jordan 10 Air Jordan 13 Air Jordan
ABOUT / FAQ / TERMS / PRIVACY / CONTACT
BUYER PROTECTION / RELEASE DATES / SELL / REVIEWS
© 2019 Kixify.com
Newest
Top Sellers
Release Calendar
Gender / Age
Men
Women
Kids
Brands
Adidas
Air Jordan
Air Jordan 1
Air Jordan 2
Air Jordan 3
Air Jordan 4
Air Jordan 5
Air Jordan 6
Air Jordan 7
Air Jordan 8
Air Jordan 9
Air Jordan 10
Air Jordan 11
Air Jordan 12
Air Jordan 13
Air Jordan 14
Air Jordan 15
Air Jordan 16
Air Jordan 17
Air Jordan 18
Air Jordan 19
Air Jordan 20
Air Jordan 21
Air Jordan 22
Air Jordan 23
Other Jordans
Asics
Nike
Nike ACG
Nike Basketball
Nike KD
Nike Kobe
Nike Lebron
Other Nike Basketball
Nike Foamposite
Nike Running
Nike SB
Nike Sportswear
Nike Training
Other Nikes
New Balance
Reebok
Vans
Converse
Ewing Athletics
Fila
Li Ning
Puma
Radii
Saucony
Sperry
Supra
Timberland
Toms
Under Armour
Other Brands
×
Transaction Result
No transaction
×
Shipping Info
No transaction
Close

Sebastian Nagel

Jul 7, 2019, 10:00:48 AM
to common...@googlegroups.com
Hi Colin,

> If there are suitably discrete tasks that you think I could contribute to, I'd be happy
> to try to assist.

I've put the tasks into
https://github.com/commoncrawl/nutch/issues/9
The challenge is to get it sufficiently fast so that we can make sure not to delay the crawls or
even lose data. Second, we need to keep it extensible for the future,
so that we can get better WET and WAT extracts over time.

> The WAT format is similarly useful, although the ratio is not as good (2x vs 7x, perhaps?)

WATs are about 3x smaller than the corresponding WARCs.

But the content stored in WAT files (web page metadata and links) has been
*the example* for columnar file formats, see Google's Dremel paper
https://research.google.com/pubs/pub36632.html

Using WAT you still need to read a lot of data to get the values of a single
"column", e.g. page title or keywords.

Best,
Sebastian

Sebastian Nagel

Jul 7, 2019, 2:24:57 PM
to common...@googlegroups.com
Hi Tom, hi Colin

> And actually, I'm kind of surprised that they're so popular at all, since there's no
> boilerplate removal done. This makes them *very* noisy, and because all the structure is
> gone, there's no way to do the boilerplate removal oneself.

If you normalize the requested bytes by the size of each data set, or count the number of times a particular
data set or format is processed, then the popularity rank is (starting with the most popular):
- columnar index
- CDX index
- WET
- WARC
- WAT
Of course, WARC is the most popular in terms of
- total requested bytes (that's because WARC files are larger
than WET files by a factor of 6-7)
- total requests (because there is an index, users can request single WARC records)


> I find a lot of value in the conciseness of the WET format. I don't have a lot of data
> points, so I don't know how much this generalizes, but I find it can sometimes be
> cheaper to scan the WETs for captures of interest, then individually process the
> underlying WARCs.

Because of the smaller size. It makes a difference whether you
- scan/grep 7-8 TiB of WET files (already in UTF-8)
- or process 50 TiB of WARC files
- in addition, processing the WARC files requires you to
  * recode the character encoding (although 90% of web pages are UTF-8)
  * possibly parse the HTML
  * or resolve HTML entities

For a filter scan, the boilerplate probably isn't a big issue.

But yes, I agree, a format with the boilerplate removed, or marked so that
it is possible to skip it, is definitely more useful for language modeling.
Similarly, multilingual documents should be split or also annotated.

Best,
Sebastian