Datasheet, copyright, or privacy information for Common Crawl?


Alex Hanna

Mar 3, 2022, 8:29:12 AM
to Common Crawl
Hi all,

I'm working on a paper with a legal scholar on some of the legal dimensions of training and benchmark data used in machine learning, including large language models.

I'm wondering if you all have any kind of datasheet for the Common Crawl, or, barring that, any kind of copyright information or information regarding the privacy of the data subjects mentioned in the data.

Thanks,
- Alex

Pete Warden

Mar 3, 2022, 1:12:30 PM
to common...@googlegroups.com
As a follow-up, I've worked with Alex for several years and suggested she contact this group directly, since I'm not aware of this kind of documentation for CC's work, and I believe it would be useful now that the corpus has become an important data source for a lot of ML models. We would greatly appreciate any help on this, even if it's just to say that there hasn't been any work on these topics yet.

--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/common-crawl/dd1e69b4-a9b5-494a-a85b-7e6c8550288cn%40googlegroups.com.

Tom Morris

Mar 3, 2022, 10:52:37 PM
to common...@googlegroups.com
I've never seen such a thing, but they do have their Terms and Conditions, which say, among other things, that you're not allowed to mine personally identifiable information or do anything illegal with the data.  Of course, the vast majority (all?) of the data is also copyrighted by its authors, although some of it is released under various types of Creative Commons licenses, and there are corpora that filter on those licenses.

My advice to anyone who wants to use CC for large scale training applications is that they consult their own lawyers. Anyone else's opinion doesn't really matter. The advice their lawyers offer should account for both the business's risk tolerance and the risk profile of the intended uses.

Tom

Sebastian Nagel

Mar 5, 2022, 3:37:59 AM
to common...@googlegroups.com
Hi Alex, hi Pete, hi Tom,

> I've never seen such a thing

Correct. We do not have one. The best documentation of the data is
the discussions in this group, in combination with announcements / blog
posts on our site and some presentations given at conferences.

> this kind of documentation for CC's work, and I believe it would be
> useful now that the corpus has become an important data source for a
> lot of ML models.

Yes, no question. And we are aware that there are already data sheets
for data sets derived from Common Crawl, for example:
https://arxiv.org/pdf/2201.07311.pdf

One point: Common Crawl is an ongoing project with more than 80 crawl
data sets released so far. The methodology used to run the crawls
changes over time. This also applies to the data formats and the tools
(software) used to extract or process them. Some data may also change
over time, i.e. a newer version replaces the old one.

Do you have any advice on how to design such a multi-dimensional data
sheet (growing in time, multiple formats, changes over time)?


> ... barring that, any kind of copyright
> information or information regarding the privacy of the data
> subjects mentioned in the data.

I can only list what we do (or do not do) in order to conform with the
fair use principle and to prevent sensitive data from being
"leaked" into the crawl archives:
- sample pages: we do not aim to have a complete copy of any
web site
- respect the robots.txt rules
- follow only public links
- no cookies, no HTML forms, no execution of JavaScript
- no usage of (residential) proxies, no fake user-agent strings
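
To illustrate the robots.txt point in the list above: a minimal sketch, using Python's standard `urllib.robotparser`, of how a crawler might check a URL against a site's rules before fetching it. This is not Common Crawl's crawler code; the robots.txt content, the `ExampleBot` agent name, and the `is_allowed` helper are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as an iterable of lines,
    # so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/index.html",
        "https://example.com/private/data.html",
    ):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

A real crawler would fetch each site's robots.txt itself and identify with its true user-agent string, in line with the "no fake user-agent strings" point.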


I'm happy to continue the discussion and would be grateful for
any help in achieving good or better documentation of Common Crawl.
It's definitely a matter of time, not missing will.


Best,
Sebastian


On 3/4/22 04:52, Tom Morris wrote:
> I've never seen such a thing, but they do have their Terms and
> Conditions <https://commoncrawl.org/terms-of-use/>, which say, among

Alex Hanna

Mar 8, 2022, 11:38:14 AM
to common...@googlegroups.com
Thanks so much, Sebastian, this is helpful.


--
Alex Hanna, PhD
Director of Research, DAIR Institute
Co-Chair, Sociologists for Trans Justice (s4tj.com)
Book some time with me: https://calendly.com/dr-alex-hanna

Tiran Moyal

Mar 26, 2022, 11:24:13 AM
to common...@googlegroups.com
Can you recommend a developer who can help me with this? For pay, of course.
I want to pull specific URLs from the DB based on language and organized by keywords,
something like this:

language = german -- keyword -- title -- meta description -- link of site

Is this possible?


--


Tiran Digital Management Ltd., Company No. 515373405

tira...@gmail.com • Tel. • 050-9417028

https://www.yousell.co.il/ Article publishing on Israel's leading websites

  http://gridi.co.il Israel's design site
