Using Common Crawl for a new project


Besnik Hajredini

10 Jan 2017, 00:49:28
to Common Crawl
Hi, 

This is my first post here, and I believe I will be around for a long time :) But for now I need some help from the more advanced users here.

1. Is there any way to run a Python script on the Common Crawl data sets to get URLs, and for each URL get the title, description, og: meta tags, etc.? I searched here and everywhere but could not find any example of how to do that.

2. If the above is not possible, what would be the best approach to run a Python script that grabs all URLs in the data set, then use my own crawler to fetch those meta tags? (I do believe the method above is possible, I just don't know how.)

3. Is it possible to separate each page from the WARC, from the beginning to the end of the page, save it as a named file, then continue?

Looking forward to receiving any suggestions or help from anyone.
Regards,
Besnik

Sebastian Nagel

11 Jan 2017, 04:11:31
to common...@googlegroups.com
Hi Besnik,

There are multiple ways to iterate over WARC or WAT records (the latter to get the URL, title and
meta tags). Have a look at

https://github.com/commoncrawl/cc-mrjob

It's easy to understand and good starting points are:
- server_analysis.py (for processing WAT files)
- tag_counter.py (WARC files)
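The per-record iteration those jobs rely on can also be sketched without mrjob: a decompressed WARC file is a series of records, each starting with a `WARC/1.0` version line, followed by headers, a blank line, and the payload. A minimal stdlib-only sketch (the sample records below are made up for illustration; real jobs should use a proper WARC library):

```python
# Sketch: split a decompressed WARC stream into records and hand each
# one to a process_record-style callback. Illustrative only.

SAMPLE_WARC = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: http://example.com/\r\n"
    "Content-Length: 25\r\n"
    "\r\n"
    "<html>hello, world</html>\r\n"
    "\r\n\r\n"
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: http://example.org/\r\n"
    "Content-Length: 19\r\n"
    "\r\n"
    "<html>second</html>\r\n"
    "\r\n\r\n"
)

def iter_warc_records(text):
    """Yield (headers, payload) for each record in a decompressed WARC string."""
    for chunk in text.split("WARC/1.0\r\n"):
        if not chunk.strip():
            continue
        head, _, payload = chunk.partition("\r\n\r\n")
        headers = {}
        for line in head.split("\r\n"):
            name, _, value = line.partition(": ")
            headers[name] = value
        yield headers, payload.rstrip("\r\n")

def process_record(headers, payload):
    # Implement whatever extraction you need here; this one just
    # returns the crawled URL of the record.
    return headers.get("WARC-Target-URI")

uris = [process_record(h, p) for h, p in iter_warc_records(SAMPLE_WARC)]
print(uris)  # -> ['http://example.com/', 'http://example.org/']
```

Writing each payload out to its own file at this point would answer question 3 directly.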

> 3. Is it possible to separate each page from the WARC, from the beginning to the end of page, and
> name that file, than continue ?

Every WARC/WAT/WET record is passed to the method process_record; you only have to implement what
to extract from each record.
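For question 1, the body of such a process_record implementation could pull the title, description and og: meta tags with the stdlib html.parser module. This is a sketch assuming the record payload is plain HTML; the sample page is made up:

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect <title>, description and og: meta tags from an HTML payload."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            a = dict(attrs)
            key = a.get("property") or a.get("name")
            if key and (key.startswith("og:") or key == "description"):
                self.meta[key] = a.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Made-up sample payload standing in for a WARC record body.
html_payload = """<html><head>
<title>Example page</title>
<meta name="description" content="A test page">
<meta property="og:title" content="Example OG title">
</head><body></body></html>"""

parser = MetaExtractor()
parser.feed(html_payload)
print(parser.title)  # -> Example page
print(parser.meta)   # -> {'description': 'A test page', 'og:title': 'Example OG title'}
```

Note the WAT files already contain this metadata as JSON, so for title/description/og: extraction it is usually cheaper to process WAT instead of parsing the HTML in the WARC yourself.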

Also have a look at
http://commoncrawl.org/the-data/get-started/

... and for further reading, some blog posts and projects that use Python to crunch the Common Crawl
data:

http://eliteinformatiker.de/2016/05/01/analyzing-the-commoncrawl-using-mapreduce
https://dmorgan.info/posts/common-crawl-python/
https://github.com/qadium-memex/CommonCrawlJob
http://engineeringblog.yelp.com/2015/03/analyzing-the-web-for-the-price-of-a-sandwich.html

Best,
Sebastian
