news sites scraper

ditesh.kumar

Jan 1, 2012, 12:04:43 AM
to sinar-...@googlegroups.com
At our last hacking session with yc and sweemeng, here is a scraper for news sites:


Right now it scrapes:

* The Star Online
* The Malaysian Insider
* Free Malaysia Kini
* Malay Mail
* Utusan Malaysia (have a look at their html [hint: search for <html> tag])
* Merdeka Review (Malay language edition)

Any other sites to scrape?

The scraper extracts relevant content from the news articles and tags it (so that it can be used in yc's Rails MP app).
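
To make the "extracts relevant content" step concrete, here is a minimal sketch in Python using only the standard library. It is not the code from the hacking session: `ParagraphExtractor` and `extract_article` are hypothetical names, and treating the text of `<p>` elements as the article body is an assumption that real news sites will often break.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of every <p> element, skipping scripts and styles."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._skip = 0   # nesting depth inside <script>/<style>
        self._buf = []   # text pieces of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "p":
            # A new <p> implicitly closes an unclosed previous one.
            self._flush()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._in_p and not self._skip:
            self._buf.append(data)

    def _flush(self):
        # Join the pieces and collapse runs of whitespace.
        text = " ".join("".join(self._buf).split())
        if text:
            self.paragraphs.append(text)
        self._buf = []
        self._in_p = False


def extract_article(html):
    """Return the paragraph text of an HTML page, one blank line apart."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep a trailing paragraph that was never closed
    return "\n\n".join(parser.paragraphs)
```

Each site would probably need its own rule for locating the article container before something like this is useful, since navigation and comment sections also use `<p>` tags.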

Next step: expose a web API for apps.

It would be good to have a host for the scraper (low CPU and bandwidth requirements) and the API.

Happy hacking.

Ditesh

Sian Lerk Lau

Jul 16, 2012, 1:44:53 AM
to sinar-...@googlegroups.com
Hi, I'm into scraping. If everyone is OK with it, I can continue to work on this scraper.

sweemeng ng

Jul 16, 2012, 1:47:20 AM
to sinar-...@googlegroups.com
GREAT!!!! Just fork the repo and start working; buzz us here if you need help.

Reminding myself to start running the scraper on our server
--
Just a random living organic computer code generator

sweemeng ng

Jul 16, 2012, 1:55:09 AM
to sinar-...@googlegroups.com
Reminding the team to see what is missing from this scraper. I think it is complete, need to see how

Khairil Yusof

Jul 16, 2012, 3:59:07 AM
to sinar-...@googlegroups.com
On Sun, 2012-01-01 at 13:04 +0800, ditesh.kumar wrote:

>
> Right now it scrapes:
>
>
> * The Star Online
> * The Malaysian Insider
> * Free Malaysia Kini
> * Malay Mail
> * Utusan Malaysia (have a look at their html [hint: search for <html>
> tag])
> * Merdeka Review (Malay language edition)

For more middle-ground views we should have these too:

+ Sun Daily
+ Sinar Harian



kiawin

Jul 16, 2012, 8:53:26 PM
to sinar-...@googlegroups.com
Sure, will code it tonight :)
--
to be or not to be? http://blog.kiawin.com

Khairil Yusof

Jul 16, 2012, 9:21:17 PM
to sinar-...@googlegroups.com
Awesome,

Kiawin, can I put your name and contact email as the new maintainer of
this project on the main website?

sweemeng ng

Jul 16, 2012, 9:28:42 PM
to sinar-...@googlegroups.com
Hi Kiawin 

Two things here:
It would also be good to output the news as JSON, XML, or any other format, starting with the source (i.e. the URL) and then the news.

The original proposal was also to tag each news item with a person of interest, using the Kratos database.
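
A rough sketch of both points in Python, under loud assumptions: the field names (`source`, `title`, `body`, `persons`) are made up for illustration, and the `PERSONS` dictionary is only a hypothetical stand-in for a lookup against the Kratos database, whose actual schema is not described in this thread.

```python
import json
import re

# Hypothetical stand-in for the Kratos database: person id -> names to match.
PERSONS = {
    "person-1": ["Example Name", "E. Name"],
    "person-2": ["Another Person"],
}


def tag_persons(text, persons=PERSONS):
    """Return the sorted ids of persons whose name appears in the text."""
    found = set()
    for person_id, names in persons.items():
        for name in names:
            # Whole-word, case-insensitive match on the literal name.
            pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.add(person_id)
                break
    return sorted(found)


def to_record(url, title, body):
    """Build one news record: source URL first, then the news, then tags."""
    return {
        "source": url,
        "title": title,
        "body": body,
        "persons": tag_persons(title + "\n" + body),
    }


if __name__ == "__main__":
    record = to_record(
        "http://example.com/news/1",
        "Example Name speaks in parliament",
        "Another Person responded later.",
    )
    # Python dicts keep insertion order, so "source" is emitted first.
    print(json.dumps(record, indent=2, ensure_ascii=False))
```

Plain name matching like this will miss nicknames and titles and can produce false positives for common names, so a real tagger would likely need an alias list per person.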

I will put it in a proper ticket or Pivotal Tracker later, after work.

Thanks for working on this

kiawin

Jul 16, 2012, 10:07:05 PM
to sinar-...@googlegroups.com
Wokey, I will add the new sources first, then work on the output later (sorry if I didn't get it right; I think you mean creating an API for the scraped news, right? :))

sweemeng ng

Jul 16, 2012, 10:21:35 PM
to sinar-...@googlegroups.com
Details will come later, need to work now

Sian Lerk Lau

Jul 18, 2012, 12:29:46 PM
to sinar-...@googlegroups.com
Sorry guys, need some tips as I'm new to GitHub.

When I'm done with my changes I just send a pull request to Sinar?

Thanks :)

Khairil Yusof

Jul 18, 2012, 1:09:59 PM
to sinar-...@googlegroups.com
You commit locally first. Don't forget to add all the files changed.
If you're used to SVN, note that you have to do this explicitly.

Then: git push to push it to the main repo.