Suggested change to WET files : Language


Simon Burfield

May 17, 2020, 8:28:46 AM
to common...@googlegroups.com
Hi CommonCrawl

Thanks for doing an epic job. I have been using your files for building, learning, etc. for ages.

I am now downloading all the WET files as I want to process a large amount of English text.

Because the text corpus is very useful to a lot of projects, could I suggest that we add the Language field to it that is in the index of the urls.

This would really help people who don't use the index but download the entire WET corpus: they could then easily filter out what they wanted.

The only other way would be to scan the whole index (300 files) and then make a request for each page that is in the language they want.

Thanks

--
Simon Burfield
iOS/Android Developer + LEGO MINDSTORMS / Robotics Builder

Sebastian Nagel

May 19, 2020, 11:51:32 AM
to common...@googlegroups.com
Hi Simon,

> could I suggest that we add the Language field to it that is in the index of
> the urls.

Thanks, good idea! I'll have a look whether this can be done.

It's a little bit tricky: the WAT/WET writing code [1] works strictly record by record.
However, the detected language is stored in the WARC files in a metadata record, separate
from the response record but linked to it. The WARC file format [2] explicitly mentions
"discovered language" as one use case for the linked metadata records.

Maybe there is a hackish solution to this problem. I'll let you know.

Best,
Sebastian

[1] https://github.com/commoncrawl/ia-web-commons/
[2] https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/
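
To illustrate why a strictly record-by-record writer makes this tricky, here is a minimal sketch of pairing each response record with its linked metadata record. It is not the ia-web-commons code. The dict layout and the key names ("type", "id", "concurrent_to", "language") are hypothetical stand-ins; in real WARC files the link would be carried by a header such as WARC-Concurrent-To or WARC-Refers-To, and which one applies here is an assumption not confirmed in this thread.

```python
def attach_languages(records):
    """Yield (response, language) pairs from an iterable of record dicts.

    Each record is a hypothetical dict with the keys "type" and "id";
    metadata records additionally carry "concurrent_to" (the id of the
    response they describe) and "language".
    """
    pending = {}  # response id -> response still waiting for its metadata
    for rec in records:
        if rec["type"] == "response":
            # The language is not known yet, so the response must be buffered.
            pending[rec["id"]] = rec
        elif rec["type"] == "metadata":
            response = pending.pop(rec.get("concurrent_to"), None)
            if response is not None:
                yield response, rec.get("language")
    # Responses whose metadata never arrived are emitted without a language.
    for response in pending.values():
        yield response, None


records = [
    {"type": "response", "id": "r1", "url": "http://example.com/"},
    {"type": "metadata", "id": "m1", "concurrent_to": "r1", "language": "eng"},
    {"type": "response", "id": "r2", "url": "http://example.org/"},
]
paired = [(r["url"], lang) for r, lang in attach_languages(records)]
```

The point of the sketch is the `pending` buffer: a writer that sees one record at a time cannot emit the text of a response until the later metadata record has been read.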

Simon Burfield

May 20, 2020, 10:44:46 AM
to common...@googlegroups.com
Hi Sebastian

It could also end up reducing the usage of AWS: at the moment, people who need both the text and the language have to hit the servers more.

It would be great to do this. For the moment I have downloaded all the WET files and all of the index files and hope to link them up, then delete all of the non-English text.

Thanks



--
Simon Burfield
iOS/Android Developer + LEGO MINDSTORMS / Robotics Builder
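
The linking approach described above (scan the index for pages in one language, then keep only the matching WET text) could be sketched roughly as follows. This is an illustration under assumptions, not a tested pipeline: it assumes each index line has the form `urlkey timestamp {json}`, that the JSON carries a "url" field and a comma-separated "languages" field with the primary language first, and that the WET records have already been reduced to (url, text) pairs. The function names are made up for this example.

```python
import json


def urls_in_language(index_lines, lang="eng"):
    """Collect URLs whose first listed language equals `lang`."""
    urls = set()
    for line in index_lines:
        parts = line.rstrip("\n").split(" ", 2)
        if len(parts) != 3:
            continue  # not a "urlkey timestamp json" line
        fields = json.loads(parts[2])
        # Assumed field: "languages", e.g. "eng" or "deu,eng".
        langs = fields.get("languages", "").split(",")
        if langs[0] == lang:
            urls.add(fields["url"])
    return urls


def keep_wanted(wet_records, wanted_urls):
    """Keep only the (url, text) pairs whose URL is in `wanted_urls`."""
    return [(url, text) for url, text in wet_records if url in wanted_urls]


INDEX = [
    'com,example)/en 20200525000000 {"url": "http://example.com/en", "languages": "eng"}',
    'com,example)/de 20200525000001 {"url": "http://example.com/de", "languages": "deu,eng"}',
    'com,example)/x 20200525000002 {"url": "http://example.com/x"}',
]
WET = [
    ("http://example.com/en", "Hello world"),
    ("http://example.com/de", "Hallo Welt"),
]
wanted = urls_in_language(INDEX)
kept = keep_wanted(WET, wanted)
```

Holding every wanted URL in a set is only practical for a slice of the crawl; for the full index, a join keyed on the WET file name would likely scale better.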

Sebastian Nagel

Jun 10, 2020, 9:23:13 AM
to common...@googlegroups.com
Hi Simon,

the content language was added to the WET files starting with the latest crawl
(CC-MAIN-2020-24), please see the announcement for more information:
https://commoncrawl.org/2020/06/may-june-2020-crawl-archive-now-available/

Best,
Sebastian
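
With the language available in the WET records themselves, filtering can happen in a single pass over each file. The following is a minimal sketch using only the standard library. It assumes the language is exposed as a record header named `WARC-Identified-Content-Language` holding a comma-separated list of ISO-639-3 codes; please check the announcement for the exact header name and format. The function names and the sample data are made up for this example.

```python
import io

LANG_HEADER = "WARC-Identified-Content-Language"  # assumed header name


def iter_wet_records(stream):
    """Yield (headers, text) for each record in an uncompressed WET byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if not line or line in (b"\r\n", b"\n"):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        length = int(headers.get("Content-Length", 0))
        yield headers, stream.read(length).decode("utf-8", "replace")


def records_in_language(stream, lang="eng"):
    """Yield (url, text) for records whose first listed language is `lang`."""
    for headers, text in iter_wet_records(stream):
        langs = [code.strip() for code in headers.get(LANG_HEADER, "").split(",")]
        if langs[0] == lang:
            yield headers.get("WARC-Target-URI"), text


SAMPLE = (
    b"WARC/1.0\r\nWARC-Type: conversion\r\n"
    b"WARC-Target-URI: http://example.com/en\r\n"
    b"WARC-Identified-Content-Language: eng\r\n"
    b"Content-Length: 11\r\n\r\nHello world\r\n\r\n"
    b"WARC/1.0\r\nWARC-Type: conversion\r\n"
    b"WARC-Target-URI: http://example.com/de\r\n"
    b"WARC-Identified-Content-Language: deu\r\n"
    b"Content-Length: 10\r\n\r\nHallo Welt\r\n\r\n"
)
english = list(records_in_language(io.BytesIO(SAMPLE)))
```

For the compressed files, passing the result of `gzip.open(path, "rb")` as `stream` should work, since Python's gzip module reads concatenated gzip members as one stream.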



Simon Burfield

Jun 10, 2020, 9:43:14 AM
to common...@googlegroups.com
Hi Sebastian

I have just seen it, thank you so much!

Thanks
Simon



--
Simon Burfield
iOS/Android Developer + LEGO MINDSTORMS / Robotics Builder