warc data


has....@gmail.com

May 25, 2017, 2:47:52 AM
to Common Crawl
Hi,

I have downloaded one WARC file from those listed in warc.paths. It contains the raw data of around 56K crawled webpages.
My problem is that many of these webpages are in other languages, such as Chinese and Spanish.
I only want the WARC data for English webpages. How can I get them?
Is there a filter or some other way to do this?
Help is appreciated.

Thanks,
Moid