Introducing Common Screens Project, based on data from Common Crawl


Bpm Tips

Nov 4, 2022, 7:58:28 AM
to Common Crawl
I would like to introduce all of you to the Common Screens project (https://commonscreens.com), which is based on data derived from Common Crawl. It supplements the Common Crawl project with screenshots of around 150 million websites and domains. The screenshots can be used for OCR, machine learning, domain profiling, categorization, etc.
You can hotlink the screenshot images, served from the AWS CloudFront CDN, directly into your project or website.
Common Screens is also hosted on AWS under the Amazon Web Services’ Open Data Sponsorships Program.

Check out the metadata section, which provides a CSV database of all major domain information.
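
For anyone who wants to work with the CSV metadata, here is a minimal Python sketch of how such a file could be summarized. The column names `domain` and `category` and the sample rows are hypothetical placeholders, since the actual schema is not listed in this post; swap in the real header names from the download.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data: the real column names and values are assumptions,
# not the published Common Screens schema.
SAMPLE_CSV = """domain,category
example.com,news
example.org,reference
example.net,news
"""


def count_by_column(csv_file, column):
    """Count how many rows share each value of the given column."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# io.StringIO stands in for an opened metadata file here.
counts = count_by_column(io.StringIO(SAMPLE_CSV), "category")
for value, n in counts.most_common():
    print(value, n)
```

For a real download, open the file with `open(path, newline="", encoding="utf-8")` and pass the file object in place of the `io.StringIO` stand-in.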

The uploads are in progress and will take 15 days to complete.

Sebastian Nagel

Nov 4, 2022, 10:28:02 AM
to common...@googlegroups.com
Hi,

great project - thanks for the notice!

Looking forward to having a look at the data, not only the screenshots
but also the metadata.

Best,
Sebastian

On 11/4/22 12:58, Bpm Tips wrote:
> I will like to introduce all of you to the https://commonscreens.com
> <https://commonscreens.com> Common Screens project which is based on
> data derived from Common Crawl, it supplements the Common Crawl project
> with screenshots of around 150 Million websites and domains, The screens
> can be used for OCR, Machine Learning, Domain profiling and
> categorization etc.
> You can hotlink AWS cloudfront CDN screenshot images directly into your
> project or website.
> Common Screens is also hosted on aws under the Amazon Web Services’ Open
> Data Sponsorships <https://aws.amazon.com/opendata/> Program.

Jason Duke

Nov 8, 2022, 3:58:36 AM
to common...@googlegroups.com
Amazing, thank you :D
-- 
Jason Duke