Hi,
I have downloaded one WARC part file listed in warc.paths. It contains the raw data of around 56K crawled webpages.
My problem is that many of these webpages are in other languages, such as Chinese and Spanish.
I only want the WARC data for English webpages. How can I get just those?
Is there a filter for this, or some other way to do it?
Any help is appreciated.
Thanks,
Moid