WAT File Index


Dave Lucas

May 1, 2021, 12:44:38 PM
to Common Crawl
Hi,

Thx for creating such an awesome resource.

I noticed that you have created The Common Crawl Index of WARC files.

Is it possible for you to also create an index of the WAT files?

The reason being there is some useful info in the WAT files, which is not in the WARC files.

Your assistance is greatly appreciated :-).

Sebastian Nagel

May 11, 2021, 12:38:04 PM
to common...@googlegroups.com
Hi,

sorry for the late reply...

> Is it possible for you to also create an index of the WAT files?

There's actually an open issue to address this [1] and, yes, we know it's a
long-standing wish, see this discussion [2].

> The reason being there is some useful info in the WAT files, which is not in the WARC files.

All information contained in the WAT file is also contained in the WARC file, but it needs
to be extracted from the HTML again.

> Your assistance is greatly appreciated :-).

Thanks, but I fear, right now I'm unable to bring this work forward.

Best,
Sebastian

[1] https://github.com/commoncrawl/nutch/issues/9
[2] https://groups.google.com/g/common-crawl/c/P0zmZbMGt_8/m/bciT8oeVBgAJ
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com
> <mailto:common-crawl...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/common-crawl/eac43f82-89c4-43e4-b42d-f86d77c9eb67n%40googlegroups.com
> <https://groups.google.com/d/msgid/common-crawl/eac43f82-89c4-43e4-b42d-f86d77c9eb67n%40googlegroups.com?utm_medium=email&utm_source=footer>.

Dave Lucas

May 17, 2021, 2:11:00 AM
to Common Crawl
Thx Sebastian, that makes sense.

I do agree that a more user friendly format would be a better solution.

For instance, I have created an Athena script to extract particular URLs of interest, a couple of million of them. Now I need to figure out a way to extract the HTML from those files.

It would be so much easier if I could just say: fetch these URLs and their HTML, page title, links, etc.

Please add that suggestion to the wishlist :-).

Regards
Dave

Sebastian Nagel

May 17, 2021, 9:22:00 AM
to common...@googlegroups.com
Hi Dave,

> A couple of million. Now I need to figure out a way to extract the html from those files.

See
https://github.com/commoncrawl/cc-pyspark/blob/master/cc_index_word_count.py
or, if you want to download the million records first and put them in WARC files:
https://github.com/commoncrawl/cc-index-table/#export-subsets-of-the-common-crawl-archives

Using Spark is just one way to parallelize the work. There's no need to use a Spark or Hadoop
cluster for a few million URLs. A single EC2 instance with a couple of workers in local mode
should be sufficient to fetch and process the data within a few hours. There was a discussion
regarding the performance of this kind of job, see:
https://groups.google.com/g/common-crawl/c/ItWeFtWPLjw
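
For readers who want to see what fetching a single record involves, here is a minimal sketch in plain Python, not the code from the repositories linked above. It assumes the Athena result provides the columns `warc_filename`, `warc_record_offset` and `warc_record_length` from the columnar index, and that the archives are reachable over HTTP at `https://data.commoncrawl.org/` (an assumption; adjust to the endpoint you use). The helper names `range_request`, `split_record` and `fetch_html` are made up for this illustration.

```python
import gzip
import urllib.request

# Assumption: public HTTP endpoint for the Common Crawl archives.
BASE_URL = "https://data.commoncrawl.org/"


def range_request(warc_filename, offset, length):
    """Build an HTTP request for exactly one gzipped WARC record."""
    end = offset + length - 1  # HTTP byte ranges are inclusive
    return urllib.request.Request(
        BASE_URL + warc_filename,
        headers={"Range": f"bytes={offset}-{end}"},
    )


def split_record(raw_gzip_member):
    """Split one gzipped response record into WARC headers, HTTP headers, payload.

    The payload may still end with the CRLFs that terminate the WARC record.
    """
    data = gzip.decompress(raw_gzip_member)
    warc_headers, _, rest = data.partition(b"\r\n\r\n")
    http_headers, _, payload = rest.partition(b"\r\n\r\n")
    return warc_headers, http_headers, payload


def fetch_html(warc_filename, offset, length):
    """Download one record and return its payload (the HTML, for HTML pages)."""
    request = range_request(warc_filename, offset, length)
    with urllib.request.urlopen(request) as response:
        return split_record(response.read())[2]
```

Calling `fetch_html` once per row of the Athena result, from a few worker processes or threads, is one way to do the local-mode processing described above. A dedicated WARC parser would be more robust than the simple header splitting used here.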


> It would be so much easier if I could just say, fetch these urls and their html,
> page title, links etc...

If I understand you right, you want a table which holds the URL, page metadata, links etc.
(basically, what's in the WAT records) as columns. Well, the HTML content itself is probably
too big to be stored efficiently in a Parquet file because one big column would negatively
impact the overall performance of the table by forcing the row groups to include too few rows.
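
A back-of-the-envelope sketch of that row-group effect, with made-up numbers: the 128 MiB row-group size, the 1 KiB of metadata per row and the 100 KiB of HTML per page are all illustrative assumptions, and compression is ignored.

```python
def rows_per_row_group(row_group_bytes, avg_row_bytes):
    """Rough number of rows that fit into one row group (ignores compression)."""
    return max(1, row_group_bytes // avg_row_bytes)


ROW_GROUP_BYTES = 128 * 1024 * 1024  # assumed row-group size
METADATA_BYTES = 1024                # assumed URL + metadata per row
HTML_BYTES = 100 * 1024              # assumed average HTML size per page

metadata_only = rows_per_row_group(ROW_GROUP_BYTES, METADATA_BYTES)
with_html = rows_per_row_group(ROW_GROUP_BYTES, METADATA_BYTES + HTML_BYTES)

print(f"rows per row group, metadata only: {metadata_only}")
print(f"rows per row group, with HTML:     {with_html}")
```

Under these assumptions a row group holds roughly a hundred times fewer rows once the HTML column is added, so a query that only needs the small columns has to touch far more row groups for the same number of rows.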

> Please add that suggestion to the wishlist :-).

Yes, we hear you!

Best,
Sebastian


Dave Lucas

May 31, 2021, 3:51:28 AM
to Common Crawl
Thx Sebastian