ANC: tmlrss.tcl - process RSS newsfeeds for tclhttpd

David Gravereaux

Mar 1, 2007, 4:26:02 PM

tmlrss will process RSS/RDF (0.90, 0.91, 0.92, 0.93, 0.94, 1.0, 2.0, Atom 0.3,
Atom 1.0, podcast) newsfeeds into simple HTML 4.01 tables. Just drop it into your
custom directory and call it from your template files. It makes an extra effort to
make sure it generates *legal* HTML such as replacing block-level elements and
fixing improper encoding errors.

See it running @ http://www.pobox.com/~davygrvy/news.tml

Get the code:
http://www.pobox.com/~davygrvy/tclstuff/tmlrss.tcl

Get the example source:
http://www.pobox.com/~davygrvy/tclstuff/news.tml.txt

--
I liked things better when I didn't understand them.
-- Calvin

David Gravereaux

Mar 1, 2007, 5:56:08 PM

David Gravereaux wrote:
> It makes an extra effort to
> make sure it generates *legal* HTML such as ... fixing improper encoding errors.

Just for fun, I thought I'd explain this part, because I think it's such an
interesting problem. Many newsfeeds are themselves collections of other feeds
that come from all kinds of sources. Thus any errors become additive.

When one gets the feed, which is XML, over HTTP, the encoding is sometimes
converted in transit (per the MIME Content-Type header) and sometimes left to the
XML parser (TDOM in my case). When TDOM's Expat parser reads the XML declaration,
it makes a large mistake by doing the translation the declaration names. As a
performance enhancement, TDOM subverts the Tcl_Obj interface and goes right to
the internal representation, so the declaration has to be removed: without one,
Expat assumes utf-8, which in this case is correct. That leaves me to do
[encoding convertto ...] manually and remove the declaration before passing the
text to TDOM. Which is just fine by me, as Tcl is very capable at encoding
conversion.
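
A minimal sketch of that pre-processing step in plain Tcl. The proc name
prepFeed is hypothetical and this is not the code from tmlrss.tcl. It assumes
the feed body arrives as undecoded bytes, and it uses [encoding convertfrom] to
turn those bytes into a Tcl string, which is an assumption about the direction
of the conversion. Only a few charset spellings are translated, and anything
Tcl doesn't recognise falls back to utf-8.

```tcl
# Hypothetical helper, not the code from tmlrss.tcl.
# Takes the raw (undecoded) bytes of a feed and returns a Tcl string
# with the XML declaration removed, ready to hand to an XML parser.
proc prepFeed {bytes} {
    set declRe {^\s*<\?xml\s[^>]*\?>\s*}
    set encRe  {^\s*<\?xml\s[^>]*encoding\s*=\s*["']([A-Za-z0-9._-]+)["']}

    # XML without a declared encoding is treated as utf-8 here.
    set charset utf-8
    if {[regexp -nocase $encRe $bytes -> name]} {
        set charset [string tolower $name]
    }

    # Tcl spells some charset names differently from the IANA registry.
    set tclName [string map {iso-8859- iso8859- windows- cp us-ascii ascii} $charset]
    if {[lsearch -exact [encoding names] $tclName] < 0} {
        # Assumption: an unknown name falls back to utf-8.
        set tclName utf-8
    }

    # Decode the bytes ourselves, then drop the declaration so the
    # parser does not try to apply the same translation a second time.
    set text [encoding convertfrom $tclName $bytes]
    regsub $declRe $text {} text
    return $text
}
```

The result is an ordinary Tcl string, so whatever parses it next no longer
needs to know which charset the feed was served in.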

Well, that was the first issue. The second was the big lies about the content in
the XML files. Setting aside that most early RSS formats can't describe what
format their <content> elements are really in, I found this bugger of a problem:

A common thing I found was either entities or actual characters in the
range of &#130; through &#159; when the XML file itself claimed to be in
iso-8859-1 (or whatever after decoding). Characters in that range are not
defined for iso-8859-1. The problem is discussed @
http://www.cs.tut.fi/~jkorpela/www/windows-chars.html

So I assume those chars were meant to be in cp1252 and move them to their
correct unicode rep. The great example would be &#151;, which is supposed to be
\u2014 (em dash). I verified this with TDOM's html parser, which spits back
&mdash; when I ask for the document back as html. And life is good :)
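
A minimal sketch of that remapping in plain Tcl. The proc name fixCp1252 is
hypothetical and this is not the code from tmlrss.tcl. It assumes the text is
already a decoded Tcl string, covers the whole 128 through 159 block (slightly
wider than the range mentioned above), and handles decimal references only,
not hex forms such as &#x97;.

```tcl
# Hypothetical helper, not the code from tmlrss.tcl.
# Maps code points 128..159, given either as literal characters or as
# decimal references such as &#151;, to the characters that cp1252
# assigns to the same byte values.
proc fixCp1252 {text} {
    set map {}
    for {set code 128} {$code <= 159} {incr code} {
        set wrong [format %c $code]
        # One byte in, one Unicode character out (byte 0x97 -> \u2014).
        set right [encoding convertfrom cp1252 [binary format c $code]]
        # A few byte values have no cp1252 meaning; leave those alone.
        if {$right eq $wrong} continue
        lappend map $wrong $right "&#$code;" $right
    }
    return [string map $map $text]
}
```

For example, [fixCp1252 "a weeklong ban &#151; not the boldest"] swaps the
reference for a real em dash, while references outside the block, such as
&#233;, pass through unchanged.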

Well, that's my story.. What a mess the world is in.

--
Why waste time learning, when ignorance is instantaneous?
-- Calvin

David Gravereaux

Mar 1, 2007, 7:03:21 PM

David Gravereaux wrote:
> The great example would be &#151;, which is supposed to be
> \u2014 (em dash). I verified this with TDOM's html parser, which spits back
> &mdash; when I ask for the document back as html. And life is good :)

Even the big guys can't get that one straight:
http://news.yahoo.com/s/ap/20070301/ap_on_en_ot/ignoring_paris_hilton

See the 'em dash' in the text (maybe as an empty square even) in this spot:

"It was only meant to be a weeklong ban — not the boldest of journalistic
initiatives,"

If you look at the page source, the charset is claimed as UTF-8 and the server
served it as UTF-8, yet there's a &#151; entity in there, and code point 151
(U+0097) is a C1 control character in Unicode, not a printable glyph! See

Now you go, Firefox, as you display it for me as \u2014! Lucky me, I think? Or
is Firefox perpetuating the problem by supporting misinterpretations?

--
As a math atheist, I think I should be excused from this.
--- Calvin, to Hobbes

David Gravereaux

Mar 2, 2007, 4:31:57 AM

David Gravereaux wrote:

> Now you go, Firefox, as you display it for me as \u2014! Lucky me, I think? Or
> is Firefox perpetuating the problem by supporting misinterpretations?

Post a comment to https://bugzilla.mozilla.org/show_bug.cgi?id=372325
Thanks.

--
"My ethicator machine must've had a built-in moral compromise
spectral release phantasmatron! I'm a genius!" --- Calvin
