--
You received this message because you are subscribed to the Google Groups "NewsRob User Group" group.
To post to this group, send email to new...@googlegroups.com.
To unsubscribe from this group, send email to newsrob+u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/newsrob?hl=en.
Thanks a lot for taking care of this. Even though my problem is only
partially solved, I've bought the Pro version of NewsRob, just to
show my appreciation for your work [..]
(Actually, Android Market seems to be the only
place where one would like to pay more than one actually has to;
the price for your app seems ridiculously low.)
Your explanation is crystal clear and now I understand why this
happens, but I'm not sure this is the correct behavior.
For all feeds where I get this issue, the pattern is the same: when the
content of an article is not saved for offline reading and is
accessed online, there's no problem and all characters are displayed
correctly.
The problem only occurs when the content is downloaded for
offline reading, because such content is processed before it is saved.
I am (or rather, my script is) responsible for converting some of
the feeds into a format better suited for offline reading, for
instance the one I gave as an example. The RSS link is
http://rss.paloch.net/hiking.sk/rss.php. The only processing my script
does is to replace the links to the articles with links to another page, which
processes the original and outputs a simplified version better suited
for viewing on my phone. It leaves the original encoding untouched
(even that double meta with 'utf-8' comes from the original). Based on
what you've written, I updated the script (PHP) so that it sends
the reformatted page with
header( 'Content-Type: text/html; charset=utf-8' );
and only then
echo( $xhtml );
which contains the complete reformatted xhtml string. And voila, the
problem's fixed.
However, then I checked some other feeds which I take directly from
the original, these are e.g.:
http://rss.sme.sk/rss/rss.asp?sek=smeonline
http://servis.idnes.cz/rss.asp
http://www.lidovky.cz/export/rss.asp?c=ln_lidovky
http://dnevnik.bg
They are all feeds for the online versions of the largest dailies of the
Czech Republic, Slovakia and Bulgaria (none of them ISO-8859-1). And if I
choose the same settings as before, i.e. "Articles + Images + Web Page"
and "Display Web Page", the characters always get mangled. So do they
all serve their feeds, with tens of thousands of subscribers, incorrectly?
Well, I do not know the standards, but no desktop or mobile RSS reader
I've ever tried has had a problem with any of these pages, including
the ones converted by my scripts. Neither does NewsRob, unless it
saves the pages for offline reading.
It seems redundant to me to send the encoding first as a header and
then once again in the page itself.
Every script I've ever written that processed web pages only
looked at the encoding as set in <meta http-equiv="Content-Type"
content="text/html; charset=XXXX" />. It might be standard practice to
send the encoding in a header first and then again in the page itself,
but in all these cases the browsers seem to take into account only the
encoding specified in the page itself.
Every time I work with the contents of a web page, I first load it into
a string (it doesn't matter at all what the default encoding is at that
moment). Then, if needed, I get the page encoding from the meta tag
(all characters in the meta tag are ASCII, which looks the same in any
ASCII-compatible encoding) and perform an iconv conversion, specifying
both the input encoding (from the page itself) and the output encoding.
That's in PHP, anyway. It might be more complicated in Android Java.
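The same idea could be sketched in plain Java roughly as follows. This is only an illustration of the approach described above, not NewsRob's actual code: the class name MetaCharsetSniffer, the 4096-byte sniffing window and the loose regex are all assumptions made for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds the charset declared in a <meta> tag of a raw page.
final class MetaCharsetSniffer {
    // Deliberately loose: any "charset=NAME" inside a <meta ...> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_\\-:.]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a meta tag, or null if none is found or supported. */
    static Charset sniff(byte[] page) {
        // ISO-8859-1 maps each byte to exactly one char, so the ASCII markup
        // can be searched without knowing the real encoding yet.
        int len = Math.min(page.length, 4096); // assumed window size
        String head = new String(page, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            return null; // malformed or unsupported charset name
        }
    }

    /** Decodes the page with the sniffed charset, or with the caller's fallback. */
    static String decode(byte[] page, Charset fallback) {
        Charset cs = sniff(page);
        return new String(page, cs != null ? cs : fallback);
    }

    public static void main(String[] args) {
        byte[] page = ("<meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=utf-8\" /><p>\u010d</p>")
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(sniff(page));
    }
}
```

The sketch only works for ASCII-compatible encodings; a UTF-16 page, for example, would need a byte-order-mark check first.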
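String charsetName = null;
// Look for a charset parameter in the Content-Type response header.
// getContentType() may return null when the server sends no Content-Type.
Header contentType = response.getEntity().getContentType();
if (contentType != null) {
    for (HeaderElement he : contentType.getElements()) {
        NameValuePair nvp = he.getParameterByName("charset");
        if (nvp != null) {
            charsetName = nvp.getValue();
            break;
        }
    }
}
[...]
// Default to ISO-8859-1 when the header declares no usable charset.
Charset charset = Charset.forName("ISO-8859-1");
if (charsetName != null) {
    try {
        charset = Charset.forName(charsetName);
    } catch (IllegalArgumentException e) {
        // unknown or malformed charset name: stick with the default
    }
}
// charset is never null at this point, so no platform-default reader is needed.
InputStreamReader isr = new InputStreamReader(is, charset);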
Thanks a lot for getting back to me so quickly.
I did a bit of investigation and checked a lot of sites. Most of them, including those I listed in my previous post, declare no charset in the header at all. Mostly, the response is just "Content-Type: text/html" (you can check here: http://test.paloch.net/headers.php?path=zpravy.idnes.cz, just put any URL after the path parameter).
I found this in the W3C HTML 4 recommendation (http://www.w3.org/TR/html4/charset.html#h-5.2.2):
The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-8859-1 as a default character encoding when the "charset" parameter is absent from the "Content-Type" header field. In practice, this recommendation has proved useless because some servers don't allow a "charset" parameter to be sent, and others may not be configured to send the parameter. Therefore, user agents must not assume any default value for the "charset" parameter.
It seems most servers don't send the charset parameter in this header because they don't have to. Note the "user agents must not assume any default value".
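To illustrate what "no default" could look like in code, here is a minimal sketch that reads the charset parameter from a raw Content-Type header value and returns null when none is declared, so the caller can go on to look at the meta tag instead. The class and method names are made up for this example, and plain string splitting is used instead of the HttpClient classes to keep it self-contained.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper: extracts the charset from a Content-Type header value.
final class ContentTypeCharset {
    /** Returns the declared charset, or null when the header declares none. */
    static Charset fromHeader(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return null; // malformed or unsupported name: treat as undeclared
                }
            }
        }
        return null; // no charset parameter: do not assume ISO-8859-1
    }

    public static void main(String[] args) {
        System.out.println(fromHeader("text/html"));
        System.out.println(fromHeader("text/html; charset=utf-8"));
    }
}
```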
As I wrote before, I know neither Dalvik nor Java, so I don't know how they handle strings. I don't even know how PHP handles strings internally, but I suppose that when I use file_get_contents, it simply loads the page into memory as a byte string, without any default encoding attached to it. PHP only takes the charset into account when it needs to, as most operations are charset independent.
I looked into the files NewsRob caches on the SD card and it seems the only processing you do is replacing the original links with cached links. So what if you always load the string as, say, ISO-8859-1 (does any charset need to be specified at all?) and apply no charset conversion to it at all? Then replace the links (with a regex?) and save the result as ISO-8859-1 (again, does any charset need to be specified?). When displaying the page, you feed WebKit whatever has been saved and, as you say, WebKit works out the correct encoding by itself (from the meta tag).
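In Java this round trip might look something like the sketch below. It relies on the fact that ISO-8859-1 maps every byte value to exactly one char and back, so bytes outside the replaced links pass through unchanged. The class name and the literal (non-regex) replacement are assumptions for the example, both URLs are assumed to be plain ASCII, and the page is assumed to be in an ASCII-compatible encoding (not UTF-16).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: rewrites links without decoding the page's real encoding.
final class ByteSafeLinkRewriter {
    /** Replaces every occurrence of fromUrl with toUrl; all other bytes stay as they were. */
    static byte[] rewrite(byte[] page, String fromUrl, String toUrl) {
        // Treat the page as opaque bytes: ISO-8859-1 turns bytes 0x00-0xFF
        // into chars U+0000-U+00FF one-to-one, and back again on encoding.
        String raw = new String(page, StandardCharsets.ISO_8859_1);
        String rewritten = raw.replace(fromUrl, toUrl);
        return rewritten.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] page = "<a href=\"http://example.com/a\">\u010d\u0416</a>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] out = rewrite(page, "http://example.com/a", "file:///cache/a.html");
        System.out.println(new String(out, StandardCharsets.UTF_8));
    }
}
```

The saved bytes are still in the page's original encoding, so whatever displays them later can pick the charset from the meta tag as before.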
David