Pls validate data usage as per Terms

ProjectML

Feb 26, 2021, 1:34:35 AM
to Common Crawl
Hi,

Can I use Common Crawl data for the following?

1. Get details for URLs
2. Feed them into a machine learning project (still to be done)
3. Derive a summary for each URL
4. Display the summaries on my (future) website, possibly with Google Ads or similar ads

Please confirm.

Regards,
Vijay

Alex Xue

Feb 26, 2021, 2:31:41 PM
to Common Crawl

What sorts of details for the URL are you interested in?

From https://commoncrawl.org/the-data/get-started/: if you look at the WARC examples, they contain metadata about each web page as well as the page's HTML, which you could use to derive the summary (I assume you mean a textual summary?). If you're only interested in the text, look at the WET examples instead, which contain just the text rather than the raw HTML. Hope this helps.

ProjectML

Feb 26, 2021, 10:34:31 PM
to Common Crawl
Thanks for the guidance on using WARC and WET.

Details to get for each URL:
- Title, description, and meta tags
- H1, H2, ... headings of the home page
- Text on the home page (only in the worst case)
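
The details listed above (title, meta tags, H1/H2 headings) could be pulled from the HTML of a page roughly as follows. This is only a minimal sketch using Python's standard library; `PageDetails` and `extract_details` are hypothetical names, and it assumes the HTML from a WARC record has already been read and decoded into a string.

```python
from html.parser import HTMLParser


class PageDetails(HTMLParser):
    """Collects the title, meta tags, and h1/h2 headings of one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}        # meta name/property -> content
        self.headings = []    # list of (tag, text) in document order
        self._current = None  # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            name = attrs.get("name") or attrs.get("property")
            content = attrs.get("content")
            if name and content is not None:
                self.meta[name.lower()] = content
        elif tag in ("title", "h1", "h2"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace inside the collected text.
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = text
            else:
                self.headings.append((tag, text))
            self._current = None


def extract_details(html):
    """Hypothetical helper: parse an HTML string and return the details as a dict."""
    parser = PageDetails()
    parser.feed(html)
    return {"title": parser.title, "meta": parser.meta, "headings": parser.headings}
```

Calling `extract_details(html)` returns a dict with `title`, `meta`, and `headings` keys. Real-world pages are often malformed, so a more tolerant HTML library may be a better fit in practice.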

Summary:
- How these websites are similar to each other

Regarding the textual summary:
- Yes, it can be generated, but I am not considering it as an immediate work item. (Another use case is the relation between keywords and URLs.)

The Terms of Use at https://commoncrawl.org/terms-of-use/ restrict the following usage, so please confirm whether the use case above is allowed:
  • Communicate for commercial solicitation

Regards,
Vijay

Tom Morris

Feb 27, 2021, 2:29:13 PM
to common...@googlegroups.com
On Fri, Feb 26, 2021 at 10:34 PM ProjectML <vijay...@gmail.com> wrote:

> On https://commoncrawl.org/terms-of-use/ , following usage is restricted. Hence, requesting to confirm, if above use case is allowed or not.
>   • Communicate for commercial solicitation

You should ask your lawyer. No one here can give you legal advice. My personal interpretation of that bullet is that it prohibits illegal spam, but you should get your lawyer's interpretation of the Terms of Use and of whether your application falls within them. Note that what falls within the scope of "Don't do anything illegal with Common Crawl" may vary from jurisdiction to jurisdiction, which is another good reason to get your lawyer to weigh in.

Tom 

Vijay Patel

Feb 27, 2021, 4:46:03 PM
to common...@googlegroups.com
Hi,

Thanks for sharing your interpretation. It is helpful.

Regards,
Vijay.


