Hi, I have two questions whose answers I couldn't find online, so I'm posting them here for the CC community.
- Does CC offer any way to access the crawl data as soon as it is uploaded, instead of waiting 30 days? If not live, then maybe 1-2 days or a week later. For example, 56k WARC files were processed in Aug 19. Can't we access them as soon as they are uploaded to S3, in the same month, instead of waiting for the complete crawl to be published and released (that is, on a rolling basis)?
- Is there any field in the Common Crawl index that tells whether a crawled page contains any structured schema, such as Organization, Event, Product, etc.?
If not, can we request a feature that adds a column named SCHEMA, whose value may be any of the following:
- Just a flag: Yes or No
- Or the count: 5 schemas found
- Or the list: [Organization, Event, Product]
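
To make the three options concrete, here is a minimal sketch of how each value could be derived once the schema type names of a page are known. The class and method names (SchemaColumn, Flag, Count, List) are hypothetical, made up only for illustration; nothing like this exists in the Common Crawl index today.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: these names are assumptions, not part of any
// Common Crawl tool or index schema.
public static class SchemaColumn
{
    // Option 1: a plain Yes/No flag.
    public static string Flag(IReadOnlyCollection<string> types) =>
        types.Count > 0 ? "Yes" : "No";

    // Option 2: the number of schema entries found on the page.
    public static int Count(IReadOnlyCollection<string> types) => types.Count;

    // Option 3: the distinct type names rendered as a bracketed list.
    public static string List(IEnumerable<string> types) =>
        "[" + string.Join(", ", types.Distinct()) + "]";
}
```

For a page with the three types from the question, SchemaColumn.List(new[] { "Organization", "Event", "Product" }) would give [Organization, Event, Product], and Flag on an empty list would give No.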
--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/common-crawl/30d59055-e309-41e8-9f85-65a7705e2d0b%40googlegroups.com.
application/ld+json - https://gist.github.com/vickyrathee/dfc972b7e76acf11100fdf297bf6ce49

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using AngleSharp.Html.Dom;
    using Newtonsoft.Json;
    using Newtonsoft.Json.Linq;

    // Collects the "@type" value of every JSON-LD block in a parsed HTML page.
    private List<string> GetSchemaNames(IHtmlDocument htmlDocument)
    {
        List<string> schemas = new List<string>();

        // Type is null for <script> elements without a type attribute.
        var jsonScripts = htmlDocument.Scripts.Where(x => x.Type != null &&
            x.Type.StartsWith("application/ld+json", StringComparison.OrdinalIgnoreCase));

        foreach (var script in jsonScripts)
        {
            JToken root;
            try
            {
                // JToken.Parse accepts both a single JSON object and a JSON array.
                root = JToken.Parse(script.Text);
            }
            catch (JsonException)
            {
                // Skip empty or malformed JSON-LD instead of failing the whole page.
                continue;
            }

            // Treat a single object as a one-element list.
            IEnumerable<JToken> items = root is JArray
                ? (IEnumerable<JToken>)root.Children()
                : new[] { root };

            foreach (var item in items)
            {
                // Only JSON objects carry "@type", and it may be missing.
                var schemaType = (item as JObject)?["@type"];
                if (schemaType != null)
                {
                    schemas.Add(schemaType.ToString());
                }
            }
        }
        return schemas;
    }
Hi Sebastian,
Does that mean the crawler is running but doesn't upload the files to S3 as soon as a batch completes, so there will be some time lag between the CRAWL_DATE and the WARC_UPLOAD_DATE?
Or will the crawler only start running on Sunday? If it produces 56k WARC files every month, shouldn't it ideally upload about 2k WARC files every day? If it uploads on Sunday, 15 days of this month will already have passed.
I just wanted to understand whether there is any way to access the WARC files on the same date the crawling happened.
Thanks,
Vikash