Hi Lawrence,
no, there is no index. One will come sooner or later, but likely with
some delay for updates (e.g. monthly).
> One site that stood out from Oct that I couldn't figure out how it
> got in was Nike.com had 74418 records in the months WARCs.
It was there until I blocked it.
> a) it was a new source?
So, in my opinion it isn't. But the site uses the same ways to announce
parts of its content as news sites do: news feeds and news sitemaps.
The problem is that actual news sites sell slots in their news feeds
and sitemaps and put advertisements there. The crawler follows these
links the same way as it follows links to news articles. Because of a
news sitemap auto-detection feature, thousands of "news" articles from
the advertised site may then be crawled.
This issue is to be addressed together with the upgrade. I'm not sure
what the solution will be: disabling the auto-detection or a strict
cross-submit verification. The latter isn't trivial because quite a few
sites delegate the assembling and hosting of feeds and sitemaps to
third-party domains.
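
To illustrate what a strict cross-submit check could look like, here is a
minimal sketch. It is not what the crawler does or will do; it assumes the
rule from the sitemaps.org protocol that a sitemap hosted elsewhere is
trusted only if the robots.txt of the page's host lists it in a "Sitemap:"
line. The function names and example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def sitemaps_in_robots(robots_txt):
    """Collect the URLs announced by 'Sitemap:' lines in a robots.txt body."""
    found = set()
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            found.add(value.strip())
    return found


def cross_submit_ok(sitemap_url, page_url, robots_txt_of_page_host):
    """Return True if sitemap_url may announce page_url.

    Assumption: same host is always fine; a different host is accepted
    only if the page host's robots.txt explicitly lists the sitemap.
    """
    if urlsplit(sitemap_url).hostname == urlsplit(page_url).hostname:
        return True
    return sitemap_url in sitemaps_in_robots(robots_txt_of_page_host)


# Hypothetical robots.txt of shop.example.com delegating to a feed host.
robots = "User-agent: *\nSitemap: https://feeds.example.net/shop-news.xml\n"
page = "https://shop.example.com/products/shoe"

delegated = cross_submit_ok(
    "https://feeds.example.net/shop-news.xml", page, robots)
advertised = cross_submit_ok(
    "https://news.example.org/sitemap-news.xml", page, robots)
```

The delegated feed host passes because the shop's own robots.txt names it,
while a link placed in an unrelated news sitemap does not. This is also
where the difficulty shows: sites that delegate hosting without listing the
sitemap in robots.txt would be rejected by such a check.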
> b) urls can be duplicated across WARC files?
Because the crawler uses only feeds and sitemaps, there should be only
very few duplicates in general. There should be no duplicated URLs at
all, with one exception: when the crawler crashed and was restarted,
there might be a few duplicated WARC records, one written immediately
before the crash and its duplicate soon after.
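
If you want to check for such duplicates yourself, a minimal sketch could
look like the following. It assumes you have already extracted pairs of
(WARC file name, target URL) from the records with a WARC reader of your
choice; the function name and the sample file names are hypothetical.

```python
from collections import defaultdict


def duplicated_urls(records):
    """Map each URL seen more than once to the WARC files containing it.

    records: iterable of (warc_filename, target_url) pairs.
    """
    seen = defaultdict(list)
    for warc_name, url in records:
        seen[url].append(warc_name)
    return {url: files for url, files in seen.items() if len(files) > 1}


# Hypothetical sample: one URL captured before and after a restart.
sample = [
    ("news-0001.warc.gz", "https://example.com/a"),
    ("news-0001.warc.gz", "https://example.com/b"),
    ("news-0002.warc.gz", "https://example.com/b"),
    ("news-0002.warc.gz", "https://example.com/c"),
]
dupes = duplicated_urls(sample)
```

For the sample above, only https://example.com/b is reported, together
with the two files it appears in.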
Best,
Sebastian