to Common Crawl
Hello, all!
I am trying to find content published on the web under Creative Commons licenses. Common Crawl has been very useful in our research phase, and now that we are ready to move forward we have some questions that I am sure folks here will be able to help us with.
Our strategy to find CC content is very simple: we analyze the WAT files from the previous 12 crawls looking for webpages that link back to any of our domains (creativecommons.org), and save the page URL (domain and path), the Creative Commons URL it links to (normally a license page), and the WARC location. This produces a list of 580,427,833 items that we then analyze to choose domains to integrate content from. For instance, metmuseum.org has 557,237 webpages that link to http://creativecommons.org/publicdomain/zero/1.0/
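For readers unfamiliar with WAT files, the scanning step described above can be sketched in a few lines of Python. This is a hypothetical illustration rather than our actual pipeline: the nesting (Envelope > Payload-Metadata > HTTP-Response-Metadata > HTML-Metadata > Links) follows the published WAT format, but the function name and the sample record are made up.

```python
import json

def extract_cc_links(wat_payload):
    """Return (page_url, cc_url) pairs for outgoing links into creativecommons.org.

    `wat_payload` is the JSON body of a single WAT metadata record.
    """
    record = json.loads(wat_payload)
    envelope = record.get("Envelope", {})
    # The URL of the captured page lives in the WARC header metadata.
    page_url = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
    # Outgoing links (when the payload was HTML) live under HTML-Metadata.
    links = (envelope.get("Payload-Metadata", {})
                     .get("HTTP-Response-Metadata", {})
                     .get("HTML-Metadata", {})
                     .get("Links", []))
    return [(page_url, link["url"]) for link in links
            if "creativecommons.org" in link.get("url", "")]

# Tiny synthetic WAT record for illustration:
sample = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "https://example.org/art/1"},
        "Payload-Metadata": {"HTTP-Response-Metadata": {"HTML-Metadata": {
            "Links": [
                {"path": "A@/href",
                 "url": "http://creativecommons.org/publicdomain/zero/1.0/"},
                {"path": "A@/href", "url": "https://example.org/about"},
            ]}}},
    }
})
print(extract_cc_links(sample))
```

In a real run you would iterate over gzipped WAT files with a WARC reader (e.g. warcio's ArchiveIterator) and apply a function like this to each metadata record's payload.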
Of those 557,237 metmuseum.org pages, only ~68,000 are unique URLs, which surprised us: we had assumed that the overlap between crawls within one year would be relatively low, but on average each webpage is repeated 8 times, and some URLs are repeated ~2,000 times. Is this due to the crawling strategy? Are those redirects?
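The per-URL repetition figures above come from a simple count over the extracted URL list (one entry per capture). A minimal sketch with made-up URLs and counts:

```python
from collections import Counter

# Sketch: given the URL column of the extracted list, measure how often
# each unique URL recurs across the 12 crawls. The URLs and multiplicities
# here are invented for illustration.
captures = (
    ["https://example.org/page-a"] * 8 +     # a typically repeated page
    ["https://example.org/page-b"] * 2000 +  # an outlier capture count
    ["https://example.org/page-c"]           # seen only once
)
counts = Counter(captures)
unique_urls = len(counts)
avg_repeats = len(captures) / unique_urls
print(unique_urls, counts.most_common(1)[0])
```

This is just the bookkeeping; it says nothing about *why* a URL recurs, which is the question below.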
Another question: how can we 'donate' URLs for you to crawl? Is there a process for this? Could someone request specific domains to be crawled?
Anyway, thanks a lot for this fantastic project!
--pv
Sebastian Nagel
Mar 15, 2018, 3:07:41 PM
> but there are some URLs that are repeated ~2,000
> times. Is this due to the crawling strategy? Are those redirects?
Interesting. That can happen, e.g. if a site is relaunched and many or all
"old" URLs are redirected to a single error page. All deduplication
is done post-crawl; its purpose is to keep us from fetching the same
duplicate again in later crawls.
> Another question I have is how can we 'donate' URLs for you to crawl? Is there a process to do this?
Yes, and thanks in advance. Please contact me off-list.
to pa...@creativecommons.org, common...@googlegroups.com
Hi Paola,
This is an old thread, but just to add to what Sebastian said: if you're looking at the historical crawl archives (Mar 2016 - Mar 2017, I'm guessing from your note), you'll find significant variability in the amount of duplication over time. This is an area in which the crawl has been improved significantly. You can tell whether the duplication was caused by redirects based on the HTTP response codes in the crawl, but my guess is that they're not the culprit.
Was this Creative Commons project ever written up anywhere? I'd be interested in reading the results. If you were to rerun the analysis, you'd probably cover a lot more ground, because the level of month-to-month duplication (i.e. recrawls) has also been significantly reduced in the last year.
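To follow up on the response-code point: the Common Crawl URL index returns one JSON object per capture, including a "status" field, so you can tally how many captures of a URL were 3xx redirects. A hypothetical sketch, working offline on synthetic index lines rather than a live query:

```python
import json
from collections import Counter

def redirect_share(cdx_lines):
    """Fraction of captures whose HTTP status is a 3xx redirect.

    Each element of `cdx_lines` is one JSON line as returned by the
    Common Crawl CDX index (output=json).
    """
    statuses = Counter(json.loads(line)["status"] for line in cdx_lines)
    redirects = sum(n for status, n in statuses.items()
                    if status.startswith("3"))
    return redirects / sum(statuses.values())

# Synthetic index lines for illustration; real lines carry more fields
# (timestamp, digest, WARC filename/offset, etc.).
sample_lines = [
    '{"url": "https://example.org/a", "status": "200"}',
    '{"url": "https://example.org/old", "status": "301"}',
    '{"url": "https://example.org/b", "status": "200"}',
    '{"url": "https://example.org/c", "status": "200"}',
]
print(redirect_share(sample_lines))  # 0.25
```

In practice you would fetch the lines from an index endpoint such as index.commoncrawl.org for the crawl of interest, then run a tally like this per URL.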