Hi,
This is my first post here, and I believe I will be around for a long time :) But right now I need some help from the more advanced users here.
1. Is there any way to run a Python script over the Common Crawl data sets to get the URLs, and for each URL its title, description, og: meta tags, etc.? I have searched here and elsewhere, but I could not find any example of how to do that...
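To make the question concrete, this is the kind of extraction I have in mind, using only Python's standard library on a sample HTML string standing in for one WARC response payload (the class and sample page below are just my own illustration, not Common Crawl tooling):

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect the <title> text and the content of <meta> tags keyed by
    their name= or property= attribute (which covers og: tags)."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}            # e.g. {"description": ..., "og:title": ...}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            if key:
                self.meta[key] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Hypothetical payload standing in for the body of one WARC response record.
sample_html = (
    "<html><head>"
    "<title>Example Page</title>"
    '<meta name="description" content="A demo page">'
    '<meta property="og:title" content="Example OG Title">'
    "</head><body>hi</body></html>"
)

parser = MetaExtractor()
parser.feed(sample_html)
print(parser.title)                # Example Page
print(parser.meta["description"])  # A demo page
print(parser.meta["og:title"])     # Example OG Title
```

Is this roughly the right idea, applied record by record to the WARC files?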
2. If the above is not possible, what would be the best approach to run a Python script that grabs all the URLs in a dataset, and then use my own crawler to fetch those meta tags? (Even though I believe the method above is possible, I just don't know how...)
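For this fallback, I imagine something like the sketch below: walking the WARC records and collecting each record's WARC-Target-URI header. The tiny in-memory record is my own stand-in for a real (usually per-record gzipped) .warc file, and a robust reader would handle more edge cases than this:

```python
import io

def iter_target_uris(stream):
    """Yield the WARC-Target-URI of every response record in an
    uncompressed WARC byte stream. Minimal sketch only."""
    while True:
        line = stream.readline()
        if not line:
            return
        if not line.startswith(b"WARC/"):
            continue                      # skip blank lines between records
        headers = {}
        while True:
            h = stream.readline().strip()
            if not h:
                break                     # blank line ends the header block
            k, _, v = h.partition(b":")
            headers[k.strip().lower()] = v.strip()
        stream.read(int(headers.get(b"content-length", 0)))  # skip payload
        if headers.get(b"warc-type") == b"response":
            uri = headers.get(b"warc-target-uri")
            if uri:
                yield uri.decode()

# Tiny in-memory example standing in for a real WARC file.
record = (b"WARC/1.0\r\n"
          b"WARC-Type: response\r\n"
          b"WARC-Target-URI: http://example.com/\r\n"
          b"Content-Length: 2\r\n"
          b"\r\nhi\r\n\r\n")
urls = list(iter_target_uris(io.BytesIO(record)))
print(urls)  # ['http://example.com/']
```

Would that scale to the full dataset, or is there an index of URLs I should use instead?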
3. Is it possible to separate each page from the WARC, from the beginning to the end of the page, save it under its own file name, and then continue with the next one?
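In other words, something like this sketch, which writes each record payload to its own numbered file (again on a made-up in-memory record, with a naming scheme I invented just for illustration):

```python
import io
import os
import tempfile

def split_warc(stream, out_dir):
    """Write the payload of each record in an uncompressed WARC byte
    stream to its own file, named by record index. Minimal sketch only."""
    paths = []
    index = 0
    while True:
        line = stream.readline()
        if not line:
            return paths
        if not line.startswith(b"WARC/"):
            continue                      # skip blank lines between records
        headers = {}
        while True:
            h = stream.readline().strip()
            if not h:
                break                     # blank line ends the header block
            k, _, v = h.partition(b":")
            headers[k.strip().lower()] = v.strip()
        payload = stream.read(int(headers.get(b"content-length", 0)))
        path = os.path.join(out_dir, f"page_{index:05d}.html")
        with open(path, "wb") as f:
            f.write(payload)
        paths.append(path)
        index += 1

record = (b"WARC/1.0\r\n"
          b"WARC-Type: response\r\n"
          b"WARC-Target-URI: http://example.com/\r\n"
          b"Content-Length: 5\r\n"
          b"\r\nhello\r\n\r\n")
with tempfile.TemporaryDirectory() as d:
    names = [os.path.basename(p) for p in split_warc(io.BytesIO(record), d)]
print(names)  # ['page_00000.html']
```

Or is there an existing library that already does this splitting properly?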
Looking forward to receiving any suggestions or help.
Regards,
Besnik