Best practice for parsing an entry's web page and not its own content

adam

Jan 28, 2013, 2:52:38 AM

to feed...@googlegroups.com

I'm going to use Feedzirra to parse ingredient information from lots of cooking websites. My plan is to subscribe to their respective RSS feeds. Unfortunately, the entries only contain snippets, and the actual ingredient info is back on each entry's corresponding web page.

So my original idea was just to use Feedzirra to give me the latest entries for various feeds. I'd then extract each entry's source URL (the URL of the original web page) and pass that on to my own library to get the info I wanted.
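
To make that first idea concrete, here is a rough sketch of what I have in mind. It assumes only that each entry responds to `url`; the class name `IngredientPipeline` and the two callables `fetch_page` and `extract_ingredients` are made-up placeholders for my own library, not anything Feedzirra provides. The demo at the bottom uses stubbed pages instead of real HTTP requests.

```ruby
# Sketch only: feed parsing and page scraping stay separate steps.
# IngredientPipeline, fetch_page and extract_ingredients are hypothetical
# names for my own code, not part of Feedzirra.
class IngredientPipeline
  # fetch_page:          callable, takes a URL string, returns the page HTML
  # extract_ingredients: callable, takes HTML, returns an array of strings
  def initialize(fetch_page:, extract_ingredients:)
    @fetch_page = fetch_page
    @extract_ingredients = extract_ingredients
  end

  # entries: any objects that respond to #url
  # Returns a hash mapping each entry URL to its extracted ingredients.
  def run(entries)
    entries.each_with_object({}) do |entry, results|
      # Skip entries that carry no link back to the original page.
      next if entry.url.nil? || entry.url.empty?

      html = @fetch_page.call(entry.url)
      results[entry.url] = @extract_ingredients.call(html)
    end
  end
end

# Demo with stand-ins: a Struct instead of real feed entries, and a hash
# of canned HTML instead of real HTTP requests.
DemoEntry = Struct.new(:url)
demo_pages = {
  "http://example.com/soup" =>
    '<li class="ingredient">salt</li><li class="ingredient">2 carrots</li>'
}

pipeline = IngredientPipeline.new(
  fetch_page: ->(url) { demo_pages.fetch(url) },
  # A regex is only good enough for this demo; real pages need an HTML parser.
  extract_ingredients: lambda do |html|
    html.scan(%r{<li class="ingredient">(.*?)</li>}).flatten
  end
)

puts pipeline.run([DemoEntry.new("http://example.com/soup")]).inspect
```

In real use, the entries would come from whatever Feedzirra returns for a feed, and `fetch_page` would do the actual download.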

But I noticed this in the README:

The final feature of Feedzirra is the ability to define custom parsing classes. In truth, Feedzirra could be used to parse much more than feeds. Microformats, page scraping, and almost anything else are fair game.

So now I'm wondering if I should just extend the RSS parser that comes with Feedzirra and have its parse method also fetch (via curl) the URL in question and do the relevant parsing of the downloaded HTML.

Is this in line with what Paul was talking about, or should this functionality be separate from feedzirra?
