How to crawl Indian newspaper sites

Debamitro Chakraborti

Dec 9, 2013, 9:07:24 AM

to data...@googlegroups.com

Is there any way to crawl the back issues of prominent Indian newspapers such as The Hindu, TOI, Indian Express, Hindustan Times, etc.?
I was part of a team that needed to analyse news reports from a specific time frame. We hacked together a TOI crawler (which still has limitations) and were working on a crawler for The Hindu -- would love to know about something simpler that is already available.

Debamitro

Meera

Dec 9, 2013, 9:18:32 AM

to data...@googlegroups.com

See if newsrack.in fits your needs. It relies on RSS feeds, though, but it can be programmed, which makes it more powerful than Google News.

Regds, Meera

~ Bangalore's own interactive newsmagazine at www.citizenmatters.in ~

--
For more details about this list
http://datameet.org/discussions/
---
You received this message because you are subscribed to the Google Groups "datameet" group.
To unsubscribe from this group and stop receiving emails from it, send an email to datameet+u...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Debamitro Chakraborti

Dec 9, 2013, 12:54:36 PM

to data...@googlegroups.com

I know of newsrack (in fact I created the NREGA topic on the site long long ago) but what I am looking for is a crawler of past records which I can use for my own research. Maybe the code behind newsrack can be reused to build such a crawler -- but I didn't see it anywhere on the site.

Anyway, thanks.

Debamitro

Arvind Batra

Dec 9, 2013, 1:06:34 PM

to data...@googlegroups.com

Hi Debamitro,

A couple of months ago, a few friends and I built a media monitoring tool to track what traditional media was writing about the Aam Aadmi Party. Our work can be seen here: http://aap.mediatrack.in

As part of the process, we wrote a crawler that crawls The Hindu, TOI, HT and three other sources. To keep the scale manageable, we crawl to a depth of 2, starting from the daily site map page of each of these news sites. I can share our crawler code. We also have the last two months of crawl data from these sources, and we will be happy to share that as well.
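A depth-limited crawl of the kind described above could look roughly like the sketch below. This is not the code referred to in this post, only a minimal illustration using the Python standard library. The names `crawl`, `fetch` and `LinkParser` are made up for the example, and the starting URL is whatever site map page you choose.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch(url):
    """Download a page and return its HTML as text (illustrative default fetcher)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_depth=2, fetch_page=fetch):
    """Breadth-first crawl from start_url, following links up to max_depth hops."""
    pages = {}                       # url -> HTML text
    queue = deque([(start_url, 0)])  # (url, depth) pairs still to visit
    seen = {start_url}
    while queue:
        url, depth = queue.popleft()
        try:
            html = fetch_page(url)
        except OSError:
            continue                 # skip pages that fail to download
        pages[url] = html
        if depth == max_depth:
            continue                 # do not expand links beyond the depth limit
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before de-duplicating.
            link = urldefrag(urljoin(url, href)).url
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

Note that this sketch follows links to any host. A real crawler would probably restrict itself to the newspaper's own domain, respect robots.txt, and pause between requests.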

Please do let me know if you are interested.

Thanks,

arvind

Gora Mohanty

Dec 10, 2013, 1:11:03 AM

to data...@googlegroups.com

The newsrack.in recommendation is a good one, but Newsrack
is intended as much more than a simple crawler. If a crawler is
what you need, you should look into something like Nutch
( http://nutch.apache.org/ ). If you prefer to write your own for
simple, non-generic needs, we have happily used Scrapy in
the Python world ( http://scrapy.org/ ).

Regards,
Gora

Debamitro Chakraborti

Dec 10, 2013, 4:55:19 AM

to data...@googlegroups.com

Hi Arvind,

Yes, I am interested. What we need is historical data, so we need to crawl archives. Unfortunately, not all Indian newspapers have good archive pages. The Hindu has systematic URLs for its archive, but it creates the links for each day dynamically using JavaScript. TOI has static archive pages and is the easiest to crawl. We couldn't find archives for HT. Our application needs the dates of the reports, so it is best to crawl date-wise. I'll contact you separately.
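When an archive uses systematic per-day URLs, a date-wise crawl can start from a generated list of (date, URL) pairs, so every report stays tied to its date. The sketch below assumes a purely hypothetical URL pattern (`ARCHIVE_TEMPLATE`); the real layout of each newspaper's archive has to be checked and substituted.

```python
from datetime import date, timedelta

# Hypothetical URL pattern; replace with the real archive layout of the target site.
ARCHIVE_TEMPLATE = "https://archive.example.com/{year}/{month:02d}/{day:02d}/"


def archive_urls(start, end, template=ARCHIVE_TEMPLATE):
    """Yield (date, url) pairs for every day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current, template.format(
            year=current.year, month=current.month, day=current.day
        )
        current += timedelta(days=1)


# Example: one week of archive pages, each paired with its date.
for day, url in archive_urls(date(2013, 12, 1), date(2013, 12, 7)):
    print(day.isoformat(), url)
```

Keeping the date next to each URL means the crawl output can be stored by day without having to parse dates back out of the article pages.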

Debamitro

Debamitro Chakraborti

Dec 10, 2013, 4:57:14 AM

to data...@googlegroups.com

Hi Gora,

The problem with using Scrapy (or simply BeautifulSoup) is that some newspapers generate content dynamically using JavaScript. The possible solutions we found were PhantomJS or Goose (a Python library). If Nutch can handle content generated through JavaScript (which it doesn't appear to), then we'll use it.
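To illustrate the problem, here is a small sketch using only Python's standard `html.parser`. It extracts `<a href>` links from raw HTML, as a static parser would, and is run on two made-up pages: one with the link in the markup and one where a script builds the link at load time. Both page snippets are invented for the example and are not taken from any newspaper site.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects href values of <a> tags present in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def static_links(html):
    """Return the links visible to a parser that does not execute JavaScript."""
    parser = AnchorCollector()
    parser.feed(html)
    return parser.hrefs


# Made-up page with the link written directly in the markup.
STATIC_PAGE = '<ul><li><a href="/2013/12/09/story1.html">Story 1</a></li></ul>'

# Made-up page where the same link only exists after the script has run.
DYNAMIC_PAGE = """
<ul id="stories"></ul>
<script>
  var item = document.createElement("li");
  item.innerHTML = '<a href="/2013/12/09/story1.html">Story 1</a>';
  document.getElementById("stories").appendChild(item);
</script>
"""

print(static_links(STATIC_PAGE))   # the link is found
print(static_links(DYNAMIC_PAGE))  # nothing is found: script text is not parsed as markup
```

The second page yields no links because the parser treats the script body as plain text. Getting at such content requires something that actually executes the script, which is where a headless browser comes in.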

Debamitro

Gora Mohanty

Dec 10, 2013, 9:28:38 AM

to data...@googlegroups.com

Hi,

That problem is common to whatever crawler you use. Nutch will extract links from JavaScript, but that's it. I would use Rhino, and a custom HtmlParser plugin for Nutch. This is admittedly non-trivial, but I know of no open source tool that already does this.
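For illustration only, extracting links from JavaScript without executing it might amount to scanning the script source for string literals that look like URLs, as in the sketch below. This is not Nutch's implementation; the regular expression and the function name `links_from_script` are assumptions made up for the example.

```python
import re

# Quoted string literals that start with http://, https:// or a leading slash.
URL_LITERAL = re.compile(r"""["']((?:https?://|/)[^"'\s<>]+)["']""")


def links_from_script(source):
    """Return URL-like string literals found in JavaScript source, in order, de-duplicated."""
    found = []
    for match in URL_LITERAL.finditer(source):
        url = match.group(1)
        if url not in found:
            found.append(url)
    return found
```

A scan like this finds only URLs that appear literally in the script. Links assembled at run time (for example by concatenating a date into a path) are missed, which is why executing the script is needed for the harder cases.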

Regards,
Gora

Amit Tiwari

Oct 28, 2016, 4:47:00 AM

to datameet, arvin...@gmail.com

Hi Arvind, could you please share the crawler code? I also want to design a similar crawler for newspaper websites.
