Re: Use Common Crawl to store in MySQL

61 views
Message has been deleted

Dusan Jovanovski

unread,
Dec 31, 2022, 6:52:15 AM12/31/22
to Common Crawl
I am looking for the same thing, but there's no option so far for supporting this.
You probably need to store this data in a CSV rather than in MySQL, and from there re-scan the possible information based on the unique URLs.

On Friday, December 30, 2022 at 8:06:34 PM UTC+1 Marius wrote:
Hey there,

do you have any guidance (tutorials, examples) on how to use Common Crawl to insert data into MySQL? Let's say I have a database that I want to populate with the help of Common Crawl.

That database has a table where I want to insert:
- the title of the website
- <meta name="description" content="">
- <hN> headings
- etc., basically extract some text that is either in HTML tags or in the header/footer

And, out of the full set of elements to scan and store, insert just the ones that can actually be found on a given website; for the rest, insert a "-".

Also, if there are domain.com/page1.html and domain.com/page2.html, I want those to be ignored; I want just the main domain inserted in the database.
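The "main domain only" filter above can be sketched with the standard library alone (a minimal sketch; the example URLs are illustrative):

```python
from urllib.parse import urlparse

def is_root_page(url: str) -> bool:
    """True only for the bare domain: an empty or "/" path and no query string."""
    p = urlparse(url)
    return p.path in ("", "/") and not p.query

print(is_root_page("https://domain.com/"))           # True
print(is_root_page("https://domain.com/page1.html")) # False
```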

Name: <title> 
Website: main URL
Phone: if there's one in the footer, grab it and place it in the table; if there's no phone number in the footer, insert a "-" in the table
Email: same, grab it from the header or footer if it's there
Description: <meta name="description" content="">

I hope it makes sense 😄
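One way to sketch the extraction described above, with "-" for anything missing (a stdlib-only sketch; the regexes and the sample HTML are illustrative assumptions, and a real pipeline would run a proper HTML parser over the WARC records):

```python
import re

def extract_fields(html: str) -> dict:
    """Pull title, meta description, email, and phone from raw HTML."""
    def first(pattern):
        m = re.search(pattern, html, re.I | re.S)
        return m.group(1).strip() if m else "-"

    title = first(r"<title[^>]*>(.*?)</title>")
    description = first(
        r'<meta\s+name=["\']description["\']\s+content=["\'](.*?)["\']')
    # Very rough email/phone patterns; real pages need sturdier handling.
    m = re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", html)
    email = m.group(0) if m else "-"
    m = re.search(r"\+?\d[\d\s()./-]{7,}\d", html)
    phone = m.group(0) if m else "-"
    return {"Name": title, "Description": description,
            "Email": email, "Phone": phone}

sample = """<html><head><title>Acme Inc</title>
<meta name="description" content="We sell widgets.">
</head><body><footer>Call +1 555 123 4567 or info@acme.example</footer>
</body></html>"""
print(extract_fields(sample))
```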
Message has been deleted

Dusan Jovanovski

unread,
Dec 31, 2022, 9:18:12 AM12/31/22
to Common Crawl
Use Athena to query the results you want, then download them as a CSV and import it into MySQL.
From MySQL you can start another process to scan the information you want from these sites.

Hope this helps.
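The import step can be sketched like this (the column names `url` and `title` and the table name `sites` are assumptions; adapt them to your Athena query, or skip the script entirely and use MySQL's LOAD DATA LOCAL INFILE on the CSV):

```python
import csv, io

def csv_to_inserts(csv_text: str, table: str = "sites") -> list:
    """Turn an Athena CSV export into MySQL INSERT statements.
    Assumed columns: url, title; empty values become "-"."""
    stmts = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        url = row["url"].replace("'", "''")            # escape single quotes
        title = (row.get("title") or "-").replace("'", "''")
        stmts.append(
            f"INSERT INTO {table} (url, title) VALUES ('{url}', '{title}');")
    return stmts

export = "url,title\nhttps://acme.example/,Acme Inc\nhttps://other.example/,\n"
for stmt in csv_to_inserts(export):
    print(stmt)
```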

On Saturday, December 31, 2022 at 1:14:47 PM UTC+1 Marius wrote:
That could work too. Do you have an example of how I could dump those tags that I need into a CSV?
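For the CSV-dumping step, a minimal stdlib-only sketch (the field names follow the table layout described earlier in the thread, and the sample record is illustrative):

```python
import csv

# One dict per site, produced by whatever extraction step you run.
records = [
    {"Name": "Acme Inc", "Website": "https://acme.example/",
     "Phone": "-", "Email": "info@acme.example", "Description": "Widgets."},
]

with open("sites.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f, fieldnames=["Name", "Website", "Phone", "Email", "Description"])
    writer.writeheader()
    writer.writerows(records)
```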