Finding content with a Creative Commons license


pa...@creativecommons.org

Mar 15, 2018, 1:54:46 PM
to Common Crawl
Hello, all! 

I am trying to find content published on the web under Creative Commons licenses, and Common Crawl has been very useful in our research phase. Now that we are ready to move forward, we have some questions that I am sure folks here will be able to help us with.

Our strategy to find CC content is very simple: we analyze the WAT files from the previous 12 crawls, looking for webpages that link back to any of our domains (creativecommons.org), and save the URL (domain and path), the Creative Commons URL (normally a license page), and the WARC location. This produces a list of 580,427,833 items that we then analyze to choose the domains whose content we want to integrate. For instance, metmuseum.org has 557,237 webpages that link to http://creativecommons.org/publicdomain/zero/1.0/
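
In case it helps anyone doing something similar, here is a minimal sketch of the extraction step. It uses the warcio library, and the field names follow the usual WAT JSON layout (Envelope -> Payload-Metadata -> HTTP-Response-Metadata -> HTML-Metadata -> Links), so please double-check them against a real WAT record:

import json
from warcio.archiveiterator import ArchiveIterator

def cc_links(wat_path):
    """Yield (page URL, creativecommons.org link) pairs from one WAT file."""
    with open(wat_path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'metadata':
                continue
            page_url = record.rec_headers.get_header('WARC-Target-URI')
            data = json.loads(record.content_stream().read())
            links = (data.get('Envelope', {})
                         .get('Payload-Metadata', {})
                         .get('HTTP-Response-Metadata', {})
                         .get('HTML-Metadata', {})
                         .get('Links', []))
            for link in links:
                href = link.get('url', '')
                if 'creativecommons.org' in href:
                    yield page_url, href

# Example: print every CC link found in one (gzipped) WAT file.
for page, license_url in cc_links('example.warc.wat.gz'):
    print(page, license_url)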

Of those 557,237 metmuseum.org pages, only ~68,000 are unique URLs, which surprised us because we assumed the overlap between crawls within one year would be relatively low. We see that, on average, each webpage is repeated 8 times, but there are some URLs that are repeated ~2,000 times. Is this due to the crawling strategy? Are those redirects?
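
For reference, the duplication figures above come from a simple frequency count over our extracted list, roughly like this (assuming a TSV with the page URL in the first column, which is just how we happen to store it):

from collections import Counter

counts = Counter()
with open('metmuseum_cc_links.tsv') as f:
    for line in f:
        url = line.split('\t', 1)[0]
        counts[url] += 1

print('unique URLs:', len(counts))
print('total occurrences:', sum(counts.values()))
print('most repeated:', counts.most_common(5))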

Another question I have is how can we 'donate' URLs for you to crawl? Is there a process to do this? Could someone request specific domains to be crawled?

Anyway, thanks a lot for this fantastic project! 

--pv

Sebastian Nagel

Mar 15, 2018, 3:07:41 PM
to common...@googlegroups.com
Hi Paola,

are you aware of this paper?
C4Corpus: Multilingual Web-Size Corpus with Free License
http://www.lrec-conf.org/proceedings/lrec2016/pdf/388_Paper.pdf
I'm sure you'll get an answer from Ivan Habernal, one of the authors.
He's following this list.

> but there are some URLs that are repeated ~2,000
> times. Is this due to the crawling strategy? Are those redirects?

Interesting. It may happen, e.g., if a site is relaunched and many/all
"old" URLs are redirected to a single error page. All deduplication
is done post-crawl; it only serves to avoid fetching the same
duplicate again in later crawls.

> Another question I have is how can we 'donate' URLs for you to crawl? Is there a process to do this?

Yes and thanks in advance. Please contact me off-list.


Thanks,
Sebastian

Tom Morris

Aug 26, 2018, 4:23:22 PM
to pa...@creativecommons.org, common...@googlegroups.com
Hi Paola,

This is an old thread, but just to add to what Sebastian said: if you're looking at the historical crawl archives (Mar 2016 - Mar 2017, I'm guessing from your note), you'll find significant variability in the amount of duplication over time. This is an area in which the crawl has been improved significantly. You can tell whether the duplication was caused by redirects by looking at the HTTP response codes in the crawl, but my guess is that they're not the culprit.
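
Something like the following against the public index at index.commoncrawl.org would show the status-code breakdown for a site; the crawl ID and URL pattern are only examples, and "status" is the field name the CDX API returns for the HTTP response code:

import json
from collections import Counter
import requests

def status_counts(url_pattern, crawl='CC-MAIN-2017-13'):
    """Tally HTTP status codes for all captures of url_pattern in one crawl."""
    resp = requests.get(
        'https://index.commoncrawl.org/%s-index' % crawl,
        params={'url': url_pattern, 'output': 'json'},
    )
    resp.raise_for_status()
    counts = Counter()
    for line in resp.text.splitlines():
        counts[json.loads(line).get('status')] += 1
    return counts

# A large share of 3xx codes would point to redirects as the cause.
print(status_counts('metmuseum.org/*'))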

Was this Creative Commons project ever written up anywhere? I'd be interested in reading the results. If you were to rerun the analysis, you'd probably cover a lot more ground, because the level of month-to-month duplication (i.e., recrawls) has also been reduced significantly in the last year.

Tom
