grabbing meta-data from a web page

Matthew Terenzio

Nov 21, 2011, 6:30:20 PM
to Meta Meta Project
One of the first things my project needed was the ability to look at a web page and extract the headline and body of a news article. I didn't want navigation text to end up in the article comparison.

(Actually, detecting whether the page is indeed an article and not another type of web page should come first. Any ideas on that?)

So I had a Python CGI script somewhere that used either pyreadability or decruft to extract the headline and body. I had also done one in PHP with phpreadability.

It is probably better to first detect whether the page is marked up with RDFa, hNews, or another semantic format and extract with a tool built for that format, then fall back on these non-semantic tools only as a second resort.
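
A rough sketch of that detect-then-fall-back idea, using only the Python standard library. The names MarkupDetector and choose_extractor are made up for illustration, and the attribute and class lists are my assumption of what is enough to spot RDFa and hNews; a real page may need stricter checks. It only decides which kind of extractor to try and does not do the extraction itself.

```python
from html.parser import HTMLParser

# Attributes that suggest RDFa markup (assumed to be a sufficient hint).
RDFA_ATTRS = {"property", "typeof", "vocab", "about", "resource"}
# Root class names used by hNews and the hAtom microformat it builds on.
HNEWS_CLASSES = {"hnews", "hentry"}


class MarkupDetector(HTMLParser):
    """Hypothetical helper: notes whether a page carries hNews or RDFa markup."""

    def __init__(self):
        super().__init__()
        self.has_hnews = False
        self.has_rdfa = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names; value is None when absent.
        for name, value in attrs:
            if name == "class" and value:
                if HNEWS_CLASSES & set(value.split()):
                    self.has_hnews = True
            elif name in RDFA_ATTRS:
                self.has_rdfa = True


def choose_extractor(html):
    """Return which kind of extractor to try first for this page."""
    detector = MarkupDetector()
    detector.feed(html)
    if detector.has_hnews:
        return "hnews"
    if detector.has_rdfa:
        return "rdfa"
    # No semantic markup found: fall back on a readability-style tool.
    return "readability"
```

The caller would then hand the page to whatever tool matches the returned label, with pyreadability or decruft covering the "readability" case.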