Common Crawl Live Data and Schema.org


Vikash Rathee

Sep 8, 2019, 12:52:38 AM
to Common Crawl
Hi,

I have two questions that I couldn't find answered online, so I'm posting them here for the CC community.
  1. Does CC have any way to access the live data as soon as it's uploaded, instead of waiting for 30 days? If not live, maybe 1-2 days or a week later. For example, 56k WARC files were processed in Aug 19. Can't we access them as soon as they are uploaded to S3, in the same month or on the same date, instead of waiting for the complete crawl to be published and released (that is, on a rolling basis)?
  2. Do we have any fields in the Common Crawl index that can tell whether a crawled page has any structured schema, such as Organization, Event, Product, etc.?
    If not, can we request a feature: a column named SCHEMA whose value may be any of the following:
    • Just a flag: Yes or No
    • Or the count: 5 schemas found
    • Or the list: [Organization, Event, Product]
I am sure it would help everyone a lot. What are your thoughts?

Disclosure - I am the founder of Agenty (https://www.agenty.com/) and we are considering CC for an internal research project.

Thanks,
Vikash

Colin Dellow

Sep 8, 2019, 11:49:32 AM
to Common Crawl
Hi Vikash,

I'm not associated with the Common Crawl, but I can answer some of your questions.


On Sunday, 8 September 2019 00:52:38 UTC-4, Vikash Rathee wrote:
> 1. Does CC have any way to access the live data as soon as it's uploaded, instead of waiting for 30 days? If not live, maybe 1-2 days or a week later. For example, 56k WARC files were processed in Aug 19. Can't we access them as soon as they are uploaded to S3, in the same month or on the same date, instead of waiting for the complete crawl to be published and released (that is, on a rolling basis)?

Individual WARC files are uploaded as they are completed (reference: https://groups.google.com/d/msg/common-crawl/DTYMLkN78qc/vizoD2uRAwAJ). You'd need to write something that monitors the relevant S3 prefix for their appearance.

The WET and WAT files and the CDX and Parquet indexes are only generated at the end of the entire crawl, though.
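
A minimal sketch of such a monitor, in Python. It rests on several assumptions that are not confirmed in this thread: that `boto3` is installed and AWS credentials are configured, that the WARC files of a crawl sit under a `/warc/` path segment and end in `.warc.gz`, and that polling every 30 minutes is acceptable. The function names are made up for illustration.

```python
import time


def new_warc_keys(listed_keys, seen):
    """Return the WARC keys in a listing that were not reported before."""
    fresh = sorted(
        key
        for key in listed_keys
        # Assumed layout: .../segments/<id>/warc/<name>.warc.gz
        if "/warc/" in key and key.endswith(".warc.gz") and key not in seen
    )
    seen.update(fresh)
    return fresh


def list_keys(bucket, prefix):
    """Yield every object key under a prefix (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported here so new_warc_keys has no dependencies

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]


def watch(bucket, prefix, interval_seconds=1800):
    """Poll the prefix forever and print each newly uploaded WARC file."""
    seen = set()
    while True:
        for key in new_warc_keys(list_keys(bucket, prefix), seen):
            print(key)
        time.sleep(interval_seconds)
```

Calling `watch("commoncrawl", "<prefix of the running crawl>")` would then print new files as they show up. Re-listing the whole prefix on every poll is simple but gets slower as the crawl grows; passing `StartAfter` to the listing call would be one way to reduce that.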

> 2. Do we have any fields in the Common Crawl index that can tell whether a crawled page has any structured schema, such as Organization, Event, Product, etc.?
>    If not, can we request a feature: a column named SCHEMA whose value may be any of the following:
>    • Just a flag: Yes or No
>    • Or the count: 5 schemas found
>    • Or the list: [Organization, Event, Product]

Those fields don't currently exist in the index. Would the Web Data Commons (http://webdatacommons.org/) be sufficient? They periodically extract structured data from the Common Crawl, although much less often than the Common Crawl itself runs. It should be enough for you to validate whatever idea you have, though.

Hope this helps!
Colin 

Vikash Rathee

Sep 8, 2019, 4:24:51 PM
to common...@googlegroups.com
Thanks for your answers, Colin, these are helpful. I will read more about the reference for #1 this week.

For #2: Web Data Commons has not been updated since November 2018, so it won't fit our use case. We are looking for recent structured data, so I was wondering whether the index has any field that would let us drill down to the relevant pages only, instead of processing the entire crawl dump to find the pages with schema markup.

Thanks 
Vikash 

--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/common-crawl/30d59055-e309-41e8-9f85-65a7705e2d0b%40googlegroups.com.

Sebastian Nagel

Sep 9, 2019, 7:02:29 AM
to common...@googlegroups.com
Hi Vikash,

yes, it might be a good idea to mark pages containing linked data in the URL index.
Thanks for the suggestion!

Given that linked data is widely used (more than 30% of the pages in the last WDC extract
based on the Nov 2018 crawl, see [1]) and schema.org lists quite a lot of types/classes,
only a list of the types found seems useful. I will have a look at whether and how this could be done.

Best,
Sebastian


[1] http://webdatacommons.org/structureddata/2018-12/stats/stats.html



Vikash Rathee

Sep 13, 2019, 12:10:26 AM
to Common Crawl
Thanks Sebastian,

I would suggest keeping all the schema names instead of hard-coding which ones are important.
Using abbreviations would save some bytes, if you prefer it that way. For example:
  1. Event : EV
  2. Organization : OG
  3. JobPosting : JB
  4. Book : BK
  5. And so on.
Here is my gist in C# to give you a quick idea of how I look for schema names in valid JSON inside script elements with type=application/ld+json: https://gist.github.com/vickyrathee/dfc972b7e76acf11100fdf297bf6ce49

Or, if you prefer to keep it simple, here is a regex: https://rubular.com/r/eLnqW07O6jGenl

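// Requires Newtonsoft.Json (JArray, JObject); IHtmlDocument is assumed to come from AngleSharp.
using System;
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json.Linq;

private List<string> GetSchemaNames(IHtmlDocument htmlDocument)
{
    List<string> schemas = new List<string>();

    // Select the <script> elements that declare JSON-LD; Type is null when the attribute is missing
    var jsonScripts = htmlDocument.Scripts.Where(x => x.Type != null && x.Type.StartsWith("application/ld+json", StringComparison.InvariantCultureIgnoreCase));
    foreach (var script in jsonScripts)
    {
        if (script.Text.TrimStart().StartsWith("["))
        {
            // Parse the JSON Array
            JArray jsonArray = JArray.Parse(script.Text);
            foreach (var jsonObject in jsonArray)
            {
                if (jsonObject != null)
                {
                    // SelectToken returns null when the object has no "@type"
                    var schemaType = jsonObject.SelectToken("@type");
                    if (schemaType != null)
                    {
                        schemas.Add(schemaType.ToString());
                    }
                }
            }
        }
        else
        {
            // Parse the JSON Object
            var jsonObject = JObject.Parse(script.Text);
            if (jsonObject != null)
            {
                var schemaType = jsonObject.SelectToken("@type");
                if (schemaType != null)
                {
                    schemas.Add(schemaType.ToString());
                }
            }
        }
    }
    return schemas;
}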
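
For comparison, here is a rough Python sketch of the same idea using only the standard library. It is not a port of the gist: unlike the C# method it also descends into nested objects and `@graph` arrays, accepts `@type` given as a list, and skips script blocks that are not valid JSON. The class and function names are made up for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdScripts(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_jsonld = False

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type.startswith("application/ld+json"):
            self._in_jsonld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False


def collect_types(node, found):
    """Walk parsed JSON-LD and append every "@type" string, at any depth."""
    if isinstance(node, list):
        for item in node:
            collect_types(item, found)
    elif isinstance(node, dict):
        types = node.get("@type")
        if isinstance(types, str):
            found.append(types)
        elif isinstance(types, list):
            found.extend(t for t in types if isinstance(t, str))
        for key, value in node.items():
            if key != "@context":  # "@type" inside a context is not a schema name
                collect_types(value, found)


def schema_names(html):
    """Return the schema names found in the JSON-LD blocks of an HTML string."""
    parser = JsonLdScripts()
    parser.feed(html)
    parser.close()
    found = []
    for block in parser.blocks:
        try:
            collect_types(json.loads(block), found)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
    return found
```

This only covers JSON-LD; Microdata and RDFa markup would need a different extractor.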


Vikash Rathee

Sep 13, 2019, 10:18:15 AM
to Common Crawl
Hi Colin and Sebastian,

Following the reference Colin pointed to for #1, I don't see any new folder in s3://commoncrawl/crawl-data/ with crawl data for the current month.
Can you please point out where the WARC files are uploaded as soon as they complete, on the same date, as described in this message? https://groups.google.com/d/msg/common-crawl/DTYMLkN78qc/vizoD2uRAwAJ

Sebastian Nagel

Sep 13, 2019, 10:38:25 AM
to common...@googlegroups.com
Hi,

the September crawl is in preparation, and its WARC files will appear under
s3://commoncrawl/crawl-data/CC-MAIN-2019-39/
I expect the first WARC files to be uploaded this coming Sunday.
Then it will take about 2 weeks until everything is there.

Thanks for the code snippets to detect linked data. I'll have a look
soon!

Best,
Sebastian


Vikash Rathee

Sep 13, 2019, 10:56:19 AM
to Common Crawl

Hi Sebastian,

Does that mean the crawler is running but doesn't upload the files to S3 as soon as a batch completes, so there is some time lag between the CRAWL_DATE and the WARC_UPLOAD_DATE?

Or will the crawler only start running on Sunday? Ideally, if it's running in 56k batches every month, shouldn't it upload about 2k WARC files every day? If it starts uploading on Sunday, 15 days of this month will already have passed.

I just want to understand whether there is any way to access the WARC files on the same date the crawling happened.

Thanks,
Vikash


Sebastian Nagel

Sep 13, 2019, 11:12:17 AM
to common...@googlegroups.com
Hi Vikash,

right now, the batches are being prepared, which also includes steps like
duplicate removal. I expect that the batches will start to be fetched on
Sunday, which will take 8-10 days. Then the pages of each
batch are shuffled and written to WARC files. That's to ensure
that every WARC file is a sample on its own. There is a short
delay between crawling the data and uploading the WARC files: 30 min to 5 h.
That's how the main crawler operates: it's not running continuously.

Best,
Sebastian

Vikash Rathee

Nov 19, 2019, 12:51:08 AM
to Common Crawl
Hi Sebastian,

Any update about adding schema.org names/linked data in the URL index?
Please let me know when we can expect that to be available.

Thanks,
Vikash



Sebastian Nagel

Nov 19, 2019, 11:40:40 AM
to common...@googlegroups.com
Hi Vikash,

the only step I have done so far was to set up the WDC extraction framework
and run it over a set of WARC files from a recent crawl. This brought two results:

1. it's not only about embedded JSON-LD: Microdata is more common,
and RDFa is also still there. This is in accordance with the coarse numbers of the latest WDC
extract from Nov 2018 [1].

2. the variety of formats explains why the extraction is relatively
expensive. My measurement was 45 CPU milliseconds on average to extract the triples from a single
WARC response record (HTML). Note:
- the extraction has to be done for 2.5 - 3 billion records
- for comparison: our crawler spends only 15 ms on average to
download a single web page and pack it into a WARC record
> Please let me know when can we expect that to be available.

The performance issue answers the question:
At present, I cannot do the extraction of linked data in the crawler, which makes it impossible to
add this information to the URL/CDX index. The crawler must be lean and fast.

The only option would be to do the linked data extraction/scanning separately and create a dedicated
index or data set. But there is no concrete plan for that.
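
As an illustration of what a separate, cheaper scanning pass could look like, here is a Python sketch that only flags which of the three formats a page appears to contain, without extracting any triples. This is not the WDC framework and not something Common Crawl has announced; the marker patterns are heuristics chosen for this example and will produce some false positives and negatives.

```python
import re

# Heuristic byte-level markers for the three formats named above (assumptions).
MARKERS = {
    "json-ld": re.compile(rb"<script[^>]*\btype\s*=\s*[\"']?application/ld\+json", re.I),
    "microdata": re.compile(rb"\s(?:itemscope\b|itemtype\s*=)", re.I),
    "rdfa": re.compile(rb"\s(?:typeof|vocab)\s*=", re.I),
}


def detect_formats(html_bytes):
    """Return the names of the linked data formats whose markers occur in the page."""
    return {name for name, pattern in MARKERS.items() if pattern.search(html_bytes)}
```

For example, `detect_formats(b'<div itemscope itemtype="https://schema.org/Event">')` returns `{"microdata"}`. The result could be stored per URL in a dedicated data set, and the expensive extraction would then only need to run on the flagged pages.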
