Hi Dave,
> A couple of million. Now I need to figure out a way to extract the html from those files.
See
https://github.com/commoncrawl/cc-pyspark/blob/master/cc_index_word_count.py
or if you want to download the million records first and put them in WARC files:
https://github.com/commoncrawl/cc-index-table/#export-subsets-of-the-common-crawl-archives
Using Spark is just one way to parallelize the work. There's no need for a Spark or Hadoop
cluster for a few million URLs. A single EC2 instance with a couple of workers in local mode
should be sufficient to fetch and process the data within a few hours. There was a discussion
about the performance of this kind of job, see:
https://groups.google.com/g/common-crawl/c/ItWeFtWPLjw
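For illustration, a minimal sketch of fetching a single HTML record by byte range, assuming you already have the filename, offset and length columns from your Athena/index query (the function names here are my own placeholders, not part of cc-pyspark):

```python
import gzip
import urllib.request

def byte_range(offset, length):
    # HTTP Range header value for one record: inclusive start and end byte.
    return f"bytes={offset}-{offset + length - 1}"

def fetch_record(filename, offset, length):
    # Each WARC record is stored as an independent gzip member, so a
    # ranged GET followed by gzip.decompress yields the complete record
    # (WARC headers, HTTP headers, and the HTML payload).
    url = "https://data.commoncrawl.org/" + filename
    req = urllib.request.Request(
        url, headers={"Range": byte_range(offset, length)})
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read()).decode("utf-8", "replace")
```

Run that over the couple of million (filename, offset, length) triples with a ThreadPoolExecutor and a handful of workers and you have roughly the "single instance, local mode" setup described above.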
> It would be so much easier if I could just say, fetch these urls and their html,
> page title, links etc...
If I understand you right, you want a table which holds the URL, page metadata, links etc.
(basically, what's in the WAT records) as columns. Well, the HTML content itself is probably
too big to be stored efficiently in a Parquet file because one big column would negatively
impact the overall performance of the table by forcing the row groups to include too few rows.
> Please add that suggestion to the wishlist :-).
Yes, we hear you!
Best,
Sebastian
On 5/17/21 8:11 AM, Dave Lucas wrote:
> Thx Sebastian, that makes sense.
>
> I do agree that a more user friendly format would be a better solution.
>
> For instance I have created an Athena script to extract particular urls of interest. A couple of million. Now I need to figure out a way to
> extract the html from those files.
>
> It would be so much easier if I could just say, fetch these urls and their html, page title, links etc...
>
> Please add that suggestion to the wishlist :-).
>
> Regards
> Dave
>
> On Tuesday, May 11, 2021 at 6:38:04 PM UTC+2 Sebastian Nagel wrote:
>
> Hi,
>
> sorry for the late reply...
>
> > Is it possible for you to also create an index of the WAT files?
>
> There's actually an issue to address this [1] and, yes, we know it's been a
> long-standing wish, see this discussion [2].
>
> > The reason being there is some useful info in the WAT files, which is not in the WARC files.
>
> All the information contained in the WAT files is also contained in the WARC files,
> but it needs to be extracted from the HTML again.
>
> > Your assistance is greatly appreciated :-).
>
> Thanks, but I fear, right now I'm unable to bring this work forward.
>
> Best,
> Sebastian
>
> [1] https://github.com/commoncrawl/nutch/issues/9