Hi Colin, hi Karen,
thanks for the suggestions and comments!
> Would it be possible to include offsets/lengths for WAT/WET entries as well in the CDX
> and Parquet indexes?
This has been a wish from users for a long time. And of course, it's possible.
But there are two reasons why I've never started to implement it:
1. there are some challenges in actually implementing the WAT/WET indexing
   (you've already mentioned one, more details below)
2. in the longer perspective I would like to move away from the
   WAT and WET formats and provide the same data, with more metadata
   and annotations (language detection, boilerplate markup),
   in a columnar format. That would make it easier for users to filter
   the data at scale and to cheaply cut out single columns (title,
   keywords); see the sketch after this list. I also hope to get
   better compression from a columnar format.
Of course, solution 2 means a lot of work, and in any case we'll keep the WAT/WET
files for a longer transition period. So I'll have a look at what I can do to get 1
implemented.
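
To illustrate what 2 would buy users, here is a minimal sketch using pyarrow.
The file name and the column names (url, language, title) are hypothetical,
no such dataset exists yet:

    # Read only three small columns, skipping the (large) extracted text
    # column entirely. File and column names are hypothetical placeholders.
    import pyarrow.parquet as pq

    table = pq.read_table('cc-extracted-text.parquet',
                          columns=['url', 'language', 'title'])
    english = table.to_pandas().query('language == "eng"')

Column pruning like this is exactly what's hard with the row-oriented WET
format, where you always have to read and decompress the full records.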
> generating the CDX files would now need to incorporate data from the WET/WAT files,
> which would incur additional costs and either delay the availability of the CDX files,
> or require them to be published in two phases.
The best solution would be to write the WAT/WET files together with the WARC and CDX files.
Otherwise, the release of the data would happen a few days later, because we'd need
to wait for the WAT/WET files before we can write the URL indexes and compute the metrics
and statistics about the crawl.
So what's the challenge?
- in the past there have been issues causing the WAT/WET generator to crash or hang [1],
  but these seem to be fixed now (nothing has happened during the last two years). The
  point is: if the WAT/WET files are written together with the WARCs, we may lose data
  if the WAT/WET generator code isn't 100% reliable.
- it should also be sufficiently fast, which is currently not the case: the Nutch fetcher
  job writes about 140 WARC files per day and CPU core (this includes fetching, politeness
  and robots.txt handling, language detection, and writing the WARCs). In comparison, the
  WAT/WET writer can process 240 WARCs per day and CPU core. This seems slow, but I need to
  profile the code to get a picture of where the CPU time goes.
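  (Back of the envelope: at 240 WARCs per day and core, keeping up with one fetcher
  core producing 140 WARCs per day would require 140/240 ≈ 0.6 additional cores,
  i.e. roughly 60% more CPU spent during the fetch phase.)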
- the WAT/WET generator code [2] is not really well maintained - partly my fault: there
  are a couple of fixes and improvements in CC's fork [3] that still need to be pushed
  upstream
- and it is based on and tied to the "htmlparser.org" library [4] (last updated 8 years
  ago); there have been many claims that the text extraction could be done better
Karen, that probably also applies to your preference for processing the WARC records directly?
> (1) process WET files, do my own language classification
> Language classification is expensive relative to the processing I'm doing.
Yes, it definitely is. But wouldn't it be better to have the language annotations available
in the WET header (or as a separate column in a columnar format)?
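
A minimal sketch of what that could look like on the consumer side, using the warcio
library; the header name "WARC-Identified-Content-Language" and the file name are
assumptions, no such annotation exists in the WET files yet:

    # Skip non-English WET records based on a hypothetical language header
    # instead of running a classifier. Header and file name are made up.
    from warcio.archiveiterator import ArchiveIterator

    with open('CC-MAIN-example.warc.wet.gz', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'conversion':
                continue
            lang = record.rec_headers.get_header('WARC-Identified-Content-Language')
            if lang != 'eng':
                continue  # cheap filter, no classification needed
            text = record.content_stream().read().decode('utf-8')
            # ... process the English text ...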
> A particular use case that I'm interested in is minimizing processing costs by operating
> only on English WET files for homepages
Picking the 40-50% of English WET records one by one via offsets from the index could even
be slower, given that there is a certain per-record overhead when fetching individual
WARC/WAT/WET records, see [5].
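
For reference, if WET offsets were in the index, picking a single record would boil down
to one HTTP Range request plus decompressing one gzip member, roughly like this (a sketch;
file name, offset, and length are placeholders that would come from the index):

    # Fetch one record via its (filename, offset, length) index entry.
    # Each record in our .gz files is a separate gzip member, so it can
    # be decompressed on its own. All values below are placeholders.
    import gzip
    import requests

    BASE = 'https://commoncrawl.s3.amazonaws.com/'
    filename = 'crawl-data/CC-MAIN-example.warc.wet.gz'  # placeholder
    offset, length = 1234, 5678                          # placeholders

    resp = requests.get(BASE + filename, headers={
        'Range': 'bytes=%d-%d' % (offset, offset + length - 1)})
    record = gzip.decompress(resp.content)

It's one HTTP round trip per record, and that per-record overhead is why fetching 40-50%
of all records individually can end up slower than a sequential scan.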
Best,
Sebastian
[1] https://github.com/commoncrawl/ia-web-commons/issues?q=is%3Aissue+is%3Aclosed
[2] https://github.com/iipc/webarchive-commons/
[3] https://github.com/commoncrawl/ia-web-commons/
[4] https://mvnrepository.com/artifact/org.htmlparser/htmlparser/2.1
[5] https://groups.google.com/d/msg/common-crawl/Umi8YBrerMk/nAZl6AAzDAAJ
On 6/26/19 9:04 PM, Karen Shaeffer wrote:
> Hi Colin,
> I'm interested in harvesting corpora for language modeling purposes. And, as you point out, there
> does not seem to be any direct mapping from the information in the CDX and Parquet indexes to the
> WET file data. Realistically, in the long run, my intuition tells me the most useful method is to
> process the CDX indexes for language and WARC file offsets, and then extract text from the WARC
> files, which lets one aggregate the data to satisfy exact criteria. For example, the WET files
> embed many design decisions that affect the extracted result; one is that punctuation and emojis
> are lost. Also, during processing, it seems reasonable to filter specific URLs on other criteria,
> reducing downstream processing requirements. And one only needs to write this code once, with the
> ability to modify and extend the implementation over time as requirements evolve. And, of course,
> this same code can be used when scraping the web directly, where such an effort is justifiable.
>
> Any helpful comments or suggestions are always invited.
>
>
> On Thursday, June 20, 2019 at 4:22:33 AM UTC-7, Colin Dellow wrote:
>
> Hello list!
>
> The CDX and parquet indexes currently include offsets/lengths for WARC entries. This is super
> helpful, as it means many tasks can be performed for a fraction of the data transfer/processing
> costs you'd otherwise incur if you processed the complete set of WARC files. For many of my own
> tasks, I find the WET/WAT files sufficient. Indeed, Sebastian notes in this post
> <https://groups.google.com/forum/#!msg/common-crawl/BypZ51wplwA/4EYjUvW3AAAJ> that WET is the