See it running @ http://www.pobox.com/~davygrvy/news.tml
Get the code:
http://www.pobox.com/~davygrvy/tclstuff/tmlrss.tcl
Get the example source:
http://www.pobox.com/~davygrvy/tclstuff/news.tml.txt
--
I liked things better when I didn't understand them.
-- Calvin
Just for fun, I thought I'd explain this part, because I think it's such an
interesting problem. Many newsfeeds are themselves collections of other feeds
that come from all kinds of sources. Thus any errors become additive.
When you fetch a feed, which is XML, over HTTP, the encoding is sometimes
declared in transit (the MIME Content-Type header) and sometimes left to the XML
parser (TDOM in my case). When TDOM's Expat parser reads the XML declaration, it
makes a big mistake by applying the translation named there. As a performance
enhancement, TDOM bypasses the Tcl_Obj interface and goes straight to the
internal representation, so the declaration has to be removed; without one,
Expat assumes utf-8, which in this case is correct. That leaves me to do the
[encoding convertto ...] manually and strip the declaration before passing the
text to TDOM. Which is just fine by me, as Tcl is very capable at encoding
conversion.
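For anyone who wants to try the same thing, here is a minimal sketch of that step. It is not the code from tmlrss.tcl: the proc name prepareFeed and the fallback argument are made up for illustration, and I am assuming the bytes straight off the wire need [encoding convertfrom] to become a Tcl string. Adjust the conversion to whatever your fetch code actually hands you.

```tcl
# Hypothetical helper (not from tmlrss.tcl): decode raw feed bytes using the
# encoding named in the XML declaration, then drop the declaration so the
# parser does not try to apply that encoding a second time.
proc prepareFeed {rawBytes {fallback utf-8}} {
    set enc $fallback
    # Pull encoding="..." out of a leading <?xml ... ?> declaration, if any.
    if {[regexp {^\s*<\?xml[^>]*encoding\s*=\s*["']([^"']+)["'][^>]*\?>} \
            $rawBytes -> name]} {
        set name [string tolower $name]
        # Map common IANA names to Tcl's names (assumed mapping, not complete):
        # iso-8859-1 -> iso8859-1, windows-1252 -> cp1252.
        regsub {^iso-} $name iso name
        regsub {^windows-} $name cp name
        if {[lsearch -exact [encoding names] $name] >= 0} {
            set enc $name
        }
    }
    # Decode the bytes into a Tcl string.
    set text [encoding convertfrom $enc $rawBytes]
    # Remove the declaration; what is left goes to the parser.
    regsub {^\s*<\?xml[^>]*\?>\s*} $text {} text
    return $text
}
```

If the HTTP Content-Type header names a charset, that should probably win over the declaration; this sketch only looks at the declaration.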
Well, that was the first issue. The second was the big lies about the content in
the XML files. Ignoring that most early RSS formats can't describe what format
their <content> elements are really in, I found this bugger of a problem:
One of the common things I found was either entities or actual characters in the
range of &#130; through &#159; when the XML file itself claimed to be in
iso-8859-1 (or whatever after decoding). Characters in that range are not
defined for iso-8859-1. The problem is discussed @
http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
So I assume those chars were meant to be in cp1252 and move them to their
correct unicode rep. The great example would be &#151;, which is supposed to be
\u2014 (em dash); after verifying using TDOM's html parser, it spits back
&mdash; when I ask for the document back as html. And life is good :)
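A rough sketch of that remapping, for the curious. Again, this is not lifted from tmlrss.tcl: the proc names are invented, and I am assuming the fix is applied to the decoded text before it reaches the parser. It leans on Tcl's own cp1252 table to find the intended character for each code from 128 through 159, and handles both the raw characters and decimal entities of the form &#NNN;.

```tcl
# Hypothetical helper: build a [string map] table that sends each code point
# 128..159, and its decimal entity, to the character cp1252 has at that byte.
proc cp1252Map {} {
    set map {}
    for {set code 128} {$code <= 159} {incr code} {
        # Ask Tcl's cp1252 table what this byte was meant to be.
        set fixed [encoding convertfrom cp1252 [binary format c $code]]
        # The raw character, e.g. \u0097.
        lappend map [format %c $code] $fixed
        # The decimal entity, e.g. &#151;.
        lappend map "&#$code;" $fixed
    }
    return $map
}

# Hypothetical helper: apply the table to already-decoded feed text.
proc fixWindowsChars {text} {
    return [string map [cp1252Map] $text]
}
```

Usage would be something like [fixWindowsChars $feedText] right before parsing. Hex entities (&#x97;) and zero-padded ones (&#0151;) are not covered here and would need extra entries.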
Well, that's my story.. What a mess the world is in.
--
Why waste time learning, when ignorance is instantaneous?
-- Calvin
Even the big guys can't get that one straight:
http://news.yahoo.com/s/ap/20070301/ap_on_en_ot/ignoring_paris_hilton
See the 'em dash' in the text (maybe as an empty square even) in this spot:
"It was only meant to be a weeklong ban — not the boldest of journalistic
initiatives,"
If you look at the page source, the charset is claimed as UTF-8 and the server
served it as UTF-8, yet there's a &#151; entity in there, and code point 151
(U+0097) is a control character in Unicode, not a glyph! See
Now you go, FireFox, as you display it for me as \u2014! Lucky me, I think? Or
is FireFox perpetuating the problem by supporting mis-interpretations?
--
As a math atheist, I think I should be excused from this.
--- Calvin, to Hobbes
> Now you go, FireFox, as you display it for me as \u2014! Lucky me, I think? Or
> is FireFox perpetuating the problem by supporting mis-interpretations?
Post a comment to https://bugzilla.mozilla.org/show_bug.cgi?id=372325
Thanks.
--
"My ethicator machine must've had a built-in moral compromise
spectral release phantasmatron! I'm a genius!" --- Calvin