to Meta Meta Project
One of the first things my project needed was the ability to look at a
web page and extract the headline and body of a news article. I didn't
want navigation text to end up in the article comparison.
(Actually, detecting whether the page is an article at all, and not another type of web page, should come first. Any ideas on that?)
So I had a Python CGI script somewhere that uses either pyreadability or decruft to extract that. I had also done one in PHP with phpreadability.
It's probably better to detect whether the page is marked up with RDFa or
hNews (or other), extract with a tool for those, and then fall back on
these non-semantic tools as a second resort.
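
A rough sketch of that detect-then-fall-back idea, using only the Python standard library. The names here (MarkupDetector, choose_extractor, and the returned labels) are made up for illustration. It assumes hNews pages carry the "hnews" or "hentry" class and that RDFa pages use the "property", "typeof" or "vocab" attributes. It only decides which kind of extractor to try; the actual extraction is left to whatever tool gets plugged in.

```python
from html.parser import HTMLParser

# Assumed markers: hNews/hAtom root class names and common RDFa attributes.
HNEWS_CLASSES = {"hnews", "hentry"}
RDFA_ATTRS = {"property", "typeof", "vocab"}


class MarkupDetector(HTMLParser):
    """Records whether a page carries hNews classes or RDFa attributes."""

    def __init__(self):
        super().__init__()
        self.has_hnews = False
        self.has_rdfa = False

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may list several space-separated names.
                if HNEWS_CLASSES & set(value.split()):
                    self.has_hnews = True
            elif name in RDFA_ATTRS:
                # Note: Open Graph <meta property="..."> tags also match here.
                self.has_rdfa = True


def choose_extractor(html):
    """Return 'hnews', 'rdfa', or 'readability' (the non-semantic fallback)."""
    detector = MarkupDetector()
    detector.feed(html)
    if detector.has_hnews:
        return "hnews"
    if detector.has_rdfa:
        return "rdfa"
    return "readability"
```

The returned label would then pick the tool: a microformat or RDFa parser for the first two, and something like pyreadability or decruft for the last.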