Can you obtain specific parts (urls) from requesting the single WET file?


trader...@gmail.com

unread,
Jan 29, 2019, 2:07:54 PM
to Common Crawl
For my NLP purposes I need the text belonging to mostly just one URL within a whole WET file. Loading a single file with gzip in Python takes a long time and a lot of memory, which is a bit unsatisfactory. I have several thousand WET files, and from each one I just need to grab the text for a single URL.

I wonder why requesting a WET file doesn't send back JSON, so that you could pick out just the record for a given WARC-Target-URI.

Or is this already possible? I'm new to Common Crawl.


Thanks!

jay patel

unread,
Jan 29, 2019, 10:23:31 PM
to Common Crawl
We tend to work directly with WARC files and do our own boilerplate removal, so I have no idea about WET files specifically. But check out the thread I started, where Sebastian and Tom explained how to use the offset and length from the index to download only the WARC record you need, so there's no need to download a whole ~1 GB file when you only want one URL.
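The offset/length approach looks roughly like this in Python. This is just a sketch, not code from that thread: the index endpoint (a 2019 crawl), the data endpoint, and the example URL are my own assumptions, so substitute the crawl and URL you actually need.

```python
# Sketch: look a URL up in the Common Crawl CDX index, then issue an
# HTTP Range request so only that one record is downloaded.
# ASSUMPTIONS: the CC-MAIN-2019-04 index, the data endpoint, and the
# example URL below are placeholders, not values from this thread.
import gzip
import json
import urllib.parse
import urllib.request

INDEX = "https://index.commoncrawl.org/CC-MAIN-2019-04-index"  # example crawl
DATA = "https://data.commoncrawl.org/"                          # data endpoint

def byte_range(offset, length):
    """Build the Range header value for one record (inclusive bounds)."""
    return "bytes=%d-%d" % (offset, offset + length - 1)

def fetch_record(url):
    """Query the index for `url`, then download only its WARC record."""
    query = "%s?url=%s&output=json" % (INDEX, urllib.parse.quote(url))
    with urllib.request.urlopen(query) as resp:
        # One JSON object per line; take the first hit.
        hit = json.loads(resp.read().splitlines()[0])
    req = urllib.request.Request(
        DATA + hit["filename"],
        headers={"Range": byte_range(int(hit["offset"]), int(hit["length"]))},
    )
    with urllib.request.urlopen(req) as resp:
        # Each record is its own gzip member, so it decompresses alone.
        return gzip.decompress(resp.read()).decode("utf-8", "replace")

if __name__ == "__main__":
    print(fetch_record("commoncrawl.org")[:300])
```

The key point is that every record in a `.warc.gz` (or `.wet.gz`) file is an independent gzip member, which is what makes decompressing a single ranged slice possible.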

Also check this out for reading WET files; maybe the scripts there will give you an idea of how to proceed.

Thanks,

Jay.