Get the text of a list of URLs

Jonathan Pagel

Jan 18, 2023, 10:59:08 AM
to Common Crawl
Hi everyone,
I'm kind of overwhelmed by Common Crawl.
I have a large (200k) list of URLs.
My goal is just to get the text from each specific URL.
I'm good with Python, but I didn't have much success with the comcrawl package.
Can you give me some hints on where to start? :)
Greetings :D

Sebastian Nagel

Jan 19, 2023, 6:22:00 AM
to common...@googlegroups.com
Hi Jonathan,

> with the comcrawl package.

I guess you mean [1]?


> I have a large (200k) list of URLs

Given the size of the list, you're better off using the columnar index.
I've described the general idea and procedure in [2].
But there are other ways to do this as well, maybe even
more efficient ones. See [3] for just one example. Which programming
language and platform do you want to use? Are the 200k URLs
200k web pages or 200k web sites (domains)? Do you want only the most
recent capture per web page, or also the history?

Best,
Sebastian

[1] https://github.com/michaelharms/comcrawl
[2] https://github.com/commoncrawl/cc-notebooks/blob/main/cc-index-table/bulk-url-lookups-by-table-joins.ipynb
[3] https://medium.com/@jaderd/one-click-to-download-exactly-the-web-pages-you-may-want-no-matter-how-many-they-are-d4834265a0a3
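
For readers following the columnar-index route, here is a minimal Python sketch of the step that comes after the lookup: fetching a single capture by byte range. It assumes the index lookup has already produced a `warc_filename`, `warc_record_offset` and `warc_record_length` for each URL, and that the data is read over the public HTTP endpoint `https://data.commoncrawl.org/`. The function names are made up for illustration and this is not the procedure from [2] or [3]. Note that the payload returned is the raw HTTP response body (typically HTML), so turning it into plain text would be a separate step with an HTML parser of your choice.

```python
import gzip
import urllib.request

# Assumption: public HTTP endpoint for Common Crawl data (S3 access is an alternative).
BASE_URL = "https://data.commoncrawl.org/"


def byte_range_header(offset, length):
    """Build an HTTP Range value covering exactly one record.

    HTTP byte ranges are inclusive on both ends, hence the "- 1".
    """
    return f"bytes={offset}-{offset + length - 1}"


def split_warc_record(raw_gzip):
    """Decompress one gzipped WARC record and split it into
    (WARC headers, HTTP headers, payload bytes).

    Each block is separated by an empty line (CRLF CRLF). The payload may
    still carry the CRLFs that terminate the record.
    """
    data = gzip.decompress(raw_gzip)
    warc_headers, _, rest = data.partition(b"\r\n\r\n")
    http_headers, _, body = rest.partition(b"\r\n\r\n")
    return (
        warc_headers.decode("utf-8", "replace"),
        http_headers.decode("utf-8", "replace"),
        body,
    )


def fetch_record(warc_filename, offset, length):
    """Hypothetical helper: download one record with a ranged GET request."""
    request = urllib.request.Request(
        BASE_URL + warc_filename,
        headers={"Range": byte_range_header(offset, length)},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return split_warc_record(response.read())
```

With 200k URLs you would call something like `fetch_record` once per index row, ideally in parallel and with retries, since each call is a small independent request.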