Full web page download broken (bad charset)

Caroig

Feb 6, 2011, 5:12:04 PM
to NewsRob User Group
Hello, of all the RSS readers I have tried, I found NewsRob the best
one in every way. However, the functionality I'm most interested in
seems broken. There are a couple of RSS feeds which point to an
already simplified web page, so I don't need the "Simplified Web Page"
by Google. I set the synchronization type to "Articles + Images + Web
Page". When I access the article's Web Page in NewsRob, it's
unreadable, as the non-ASCII characters are displayed all wrong. When
I refresh the page from within the reader, the characters get
displayed correctly, but then I don't have the offline version
anymore. It seems the offline page gets double UTF-8 encoded.

How to replicate: add e.g. this RSS to your Google Reader:
http://rss.paloch.net/hiking.sk/rss.php, go to Manage Feed and select
"Articles + Images + Web Page" and "Display Web Page". After the feed
has been synced, simply open any article; what you get is a page whose
UTF-8 characters are displayed as if they were ANSI.

Thank you

Mariano Kamp

Feb 6, 2011, 6:38:22 PM
to new...@googlegroups.com
David,

you didn't provide a screenshot to show what went wrong, so the following lines are based on your textual description and my speculation about what went wrong.


The short story:

Are you in touch with the publisher? I think it would be more fruitful and definitely quicker if the publisher fixed the site and let their web server return the same content type as is specified/used in the actual content.

I could also try to find an implementation that takes care of the issue, but that's not likely to happen anytime soon, as it would be a major change.


The long story:

Ok, here's what I think happens.

mkamp$ curl -I http://rss.paloch.net/hiking.sk/?id=1772
HTTP/1.1 200 OK
Server: nginx
Date: Sun, 06 Feb 2011 23:09:09 GMT
Content-Type: text/html
Connection: keep-alive
Vary: User-Agent,Accept-Encoding

The server doesn't specify the encoding explicitly, which, according to the standard, means that the default is to be used: ISO-8859-1.

NewsRob now expects ISO-8859-1 and tries to decode the content using ISO-8859-1, then stores it in UTF-8, which is the common format for all content that NewsRob stores. Unfortunately, the server later on changes its mind and now would prefer UTF-8:


...
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> 
<meta name="robots" content="follow,noindex" /> 
<meta name="author" content="Michal Mikulas" /> 
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> 
<meta http-equiv="Expire" content="now" /> 
<link rel="shortcut icon" href="http://hiking.sk/favicon.ico" /> 
<style type="text/css" media="screen, print, projection"> 
...

For good measure the page even specifies this twice. But it could be worse: at least the server says the same thing both times; well, at least the most recent two times.
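
To illustrate the suspected failure mode, here is a minimal Java sketch (a hypothetical, standalone example, not NewsRob code; the sample string is made up) of what happens when UTF-8 bytes get decoded as ISO-8859-1 and the result is stored as UTF-8:

    import java.nio.charset.StandardCharsets;

    public class DoubleEncodingDemo {
        public static void main(String[] args) {
            // The page is actually UTF-8 encoded ...
            byte[] pageBytes = "Kráľova hoľa".getBytes(StandardCharsets.UTF_8);

            // ... but the HTTP header implies ISO-8859-1, so every UTF-8 byte
            // gets decoded as a separate Latin-1 character (mojibake):
            String misdecoded = new String(pageBytes, StandardCharsets.ISO_8859_1);

            // Re-encoding that string as UTF-8 bakes the damage in for good.
            byte[] stored = misdecoded.getBytes(StandardCharsets.UTF_8);

            System.out.println(misdecoded); // prints "KrÃ¡Ä¾ova hoÄ¾a"
        }
    }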

But anyway, what should NewsRob do with this? Maybe parse the document and, when it encounters such a change of mind from a server configured that way, discard what was downloaded, go back to the server, re-request the document, and this time ignore what the web server says and use the content type found in the attempt before. Maybe implement a raw buffer facility to eliminate the extra request.
Either way, this extra parsing/checking would then need to be done for 100% of the sites, even though it likely makes a difference for less than one percent of them.

Whatever way you look at it, there is significant work to be done in NewsRob, and every user would likely take a performance/battery hit to deal with this issue.

I hope you can see my reasoning as to why it is likely more efficient to deal with the issue at the source.

I hope this made sense and you were able to follow along.

Best,
Mariano




Caroig

Feb 7, 2011, 4:35:39 PM
to NewsRob User Group
Hi Mariano,

thanks a lot for taking care of this. Even though my problem is only
partially solved, I've bought the Pro version of NewsRob, just to
show appreciation for your work (actually, Android Market seems to be
the only place where one would like to pay more than they actually
have to; the price for your app seems ridiculously low.)

Your explanation is crystal clear and now I understand why this
happens, but I'm not sure this is the correct behavior. Because for
all feeds where I get this issue the problem is as follows: when the
content of an article is not saved for offline reading and is
accessed online, there's no problem and all characters are displayed
correctly. The problem only occurs when the content is downloaded for
offline reading, because such content is processed before saving.

I am responsible (or rather my script is) for converting some of the
feeds into a format better suited for offline reading, for instance
the one I gave as an example. The RSS link is
http://rss.paloch.net/hiking.sk/rss.php; the only processing my script
does is replace the links to the articles with links to another page,
which processes the original and outputs a simplified version better
suited for viewing on my phone. It leaves the original encoding
untouched (even that doubled meta with 'utf-8' comes from the
original). Based on what you've written, I tried to update the (PHP)
script which sends the reformatted page with

header( 'Content-Type: text/html; charset=utf-8' );

and only then

echo( $xhtml );

which contains the complete reformatted XHTML string. And voila, the
problem's fixed.

However, I then checked some other feeds which I take directly from
the original; these are e.g.:

http://rss.sme.sk/rss/rss.asp?sek=smeonline
http://servis.idnes.cz/rss.asp
http://www.lidovky.cz/export/rss.asp?c=ln_lidovky
http://dnevnik.bg

They are all feeds for the online versions of the largest dailies of
the Czech Republic, Slovakia and Bulgaria (all non-ISO-8859-1). And if
I choose the same settings as before, i.e. "Articles + Images + Web
Page" and "Display Web Page", the characters always get mangled. So do
they all serve their feeds to tens of thousands of subscribers wrong?

Well, I don't know the standards, but no desktop or mobile RSS reader
I've ever tried has had a problem with any of these pages, including
the ones converted by my scripts; neither does NewsRob, unless it
saves the pages for offline reading. It seems redundant to me to send
the encoding first as a header and then once again in the page itself.
Every script I've ever written that processed web pages only analyzed
the encoding as set in <meta http-equiv="Content-Type"
content="text/html; charset=XXXX" />. It might be the standard to send
the encoding first in a header and then again in the page itself, but
in all these cases the browsers seem to take into account only the
encoding specified in the page itself.

Every time I work with the contents of a webpage, I first load it into
a string (it doesn't matter at all what the default encoding is at
that moment). Then, if needed, I get the page encoding from the meta
tag (all chars in the meta are ASCII, the same in any encoding) and
perform an iconv conversion specifying both input (from the page
itself) and output encoding. That's PHP, anyway. It might be more
complicated in Android Java.

Thanks a lot for your help,

David


Mariano Kamp

Feb 8, 2011, 6:41:38 AM
to new...@googlegroups.com
> thanks a lot for taking care of this. Even though my problem is only
> partially solved, I've bought the Pro version of NewsRob, just to
> show appreciation for your work [..]
Thanks.
 
> (actually, Android Market seems to be the only place where one would
> like to pay more than they actually have to; the price for your app
> seems ridiculously low.)
The economics and expectations still puzzle me too ;)
 
> Your explanation is crystal clear and now I understand why this
> happens, but I'm not sure this is the correct behavior.
No, I don't think it is. I guess it is legal to change the encoding in the document itself. I just don't know how best to deal with that in an efficient way, so that the user isn't hit by a bandwidth, battery, sync time and garbage collection penalty, and I am not hit by the extra development time; at least not at the moment.
 
> Because for all feeds where I get this issue the problem is as
> follows: when the content of an article is not saved for offline
> reading and is accessed online, there's no problem and all characters
> are displayed correctly.
That is easily explained. When the pages are not available offline I hand the URLs off to WebView/WebKit, and it is able to handle a change of heart.
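
For illustration, a sketch of the two paths using Android's WebView API (hypothetical code; whether NewsRob uses exactly these calls is my assumption, and the names are made up):

    import android.webkit.WebView;

    class ArticleRenderer {
        // Hypothetical helper; NewsRob's actual code may differ.
        void showArticle(WebView webView, String articleUrl, String cachedHtml, boolean offline) {
            if (!offline) {
                // Online: WebView fetches the bytes itself and can re-sniff
                // the charset when it encounters the <meta> declaration.
                webView.loadUrl(articleUrl);
            } else {
                // Offline: the bytes were already decoded to a String at sync
                // time, so a wrong charset guess is baked in before WebView
                // ever sees the content.
                webView.loadDataWithBaseURL(articleUrl, cachedHtml, "text/html", "utf-8", null);
            }
        }
    }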

> The problem only occurs when the content is downloaded for offline
> reading, because such content is processed before saving.
Yes, see above.
 
> I am responsible (or rather my script is) for converting some of the
> feeds into a format better suited for offline reading, for instance
> the one I gave as an example. The RSS link is
> http://rss.paloch.net/hiking.sk/rss.php; the only processing my script
> does is replace the links to the articles with links to another page,
> which processes the original and outputs a simplified version better
> suited for viewing on my phone. It leaves the original encoding
> untouched (even that doubled meta with 'utf-8' comes from the
> original). Based on what you've written, I tried to update the (PHP)
> script which sends the reformatted page with
>
>        header( 'Content-Type: text/html; charset=utf-8' );
>
> and only then
>
>        echo( $xhtml );
>
> which contains the complete reformatted XHTML string. And voila, the
> problem's fixed.
That's good, but I think it's approaching the issue from the wrong side. The problem is that the HTTP server returns a different encoding than the content itself. In the case you mentioned, I think it would be slightly more correct to configure the HTTP server to return UTF-8.

> However, I then checked some other feeds which I take directly from
> the original; these are e.g.:
>
> http://rss.sme.sk/rss/rss.asp?sek=smeonline
> http://servis.idnes.cz/rss.asp
> http://www.lidovky.cz/export/rss.asp?c=ln_lidovky
> http://dnevnik.bg
Sorry, I don't have the time to check them. If you consider the detailed results to be of the essence, could I ask you to please provide the same information that I provided for your previous example?
 
> They are all feeds for the online versions of the largest dailies of
> the Czech Republic,
Will be going there, Marienstadt, on vacation in two weeks ;)
 
> Slovakia and Bulgaria (all non-ISO-8859-1). And if I choose the same
> settings as before, i.e. "Articles + Images + Web Page" and "Display
> Web Page", the characters always get mangled. So do they all serve
> their feeds to tens of thousands of subscribers wrong?
I can't tell without the detailed results whether those cases suffer from the same cause.

No, I didn't say they do it wrong, or at least I didn't intend to say that, but what they return is inconsistent, NewsRob cannot handle that, and I don't see a reason for the inconsistency to be kept. I don't believe that they change the encoding from article to article, so they should just set the encoding on their web server to the same encoding they use in their pages.

In the future there may be a solution from within NewsRob, but I don't see this happening within, say, the next six months. I may change my mind further down in our discussion, but currently I don't see that happening.

> Well, I don't know the standards, but no desktop or mobile RSS reader
Ditto. At least not all of it.
 
> I've ever tried has had a problem with any of these pages, including
> the ones converted by my scripts; neither does NewsRob, unless it
> saves the pages for offline reading.
As explained above.
 
> It seems redundant to me to send the encoding first as a header and
> then once again in the page itself.
Actually this happens here too: ISO-8859-1 (as the implicit default) and then UTF-8.

FWIW, from my point of view, if I had to pick just one method I would use the HTTP header, because it's more efficient to know what to expect ahead of time than to probe afterwards.

> Every script I've ever written that processed web pages only analyzed
> the encoding as set in <meta http-equiv="Content-Type"
> content="text/html; charset=XXXX" />. It might be the standard to send
> the encoding first in a header and then again in the page itself, but
> in all these cases the browsers seem to take into account only the
> encoding specified in the page itself.
Could you please try that and set the header from within your script to the actual encoding?
 
> Every time I work with the contents of a webpage, I first load it into
> a string (it doesn't matter at all what the default encoding is at
> that moment). Then, if needed, I get the page encoding from the meta
> tag (all chars in the meta are ASCII, the same in any encoding) and
> perform an iconv conversion specifying both input (from the page
> itself) and output encoding. That's PHP, anyway. It might be more
> complicated in Android Java.
Yes, as I said in my last mail, this is one way to handle it.
However, I don't think I should load it into a string, because at that point it is unclear what encoding is used, and the encoding is necessary for the interpretation of the data. At least that's what I think. It may work by accident, but that's it.
I would need to load it into a byte buffer, then probe it with ASCII (is there any spec that says what to do here? If you can find out, please share), check for the meta tag, then create another String with the detected encoding, or in the absence of that with the encoding from the HTTP headers, or in the absence of that with ISO-8859-1.
Besides this being extra work, the drawbacks are that I would need more memory, as I possibly have to load the whole document into memory (with more work I could try to stop after the </head>, but how do I decode that, etc.?) and then decode it, so that I have the document at least twice in memory; and it makes streaming (not loading the whole document at once) much harder.
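
A rough sketch of that probing approach, under the assumptions above (standalone hypothetical code, not NewsRob's; httpCharsetName would come from the Content-Type header as in the snippet further below):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class CharsetProbe {

        private static final Pattern META_CHARSET =
                Pattern.compile("charset=[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

        // Buffer the raw bytes first; they cannot be interpreted before the
        // encoding is known.
        static byte[] readFully(InputStream is) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            for (int n; (n = is.read(buf)) != -1; )
                bos.write(buf, 0, n);
            return bos.toByteArray();
        }

        // httpCharsetName is the charset from the HTTP header, or null.
        static String decode(InputStream is, String httpCharsetName) throws IOException {
            byte[] raw = readFully(is);

            // Probe with ASCII: the <meta> declaration itself is plain ASCII
            // in virtually every encoding in use on the web.
            String probe = new String(raw, "US-ASCII");
            Matcher m = META_CHARSET.matcher(probe);

            String charsetName = m.find() ? m.group(1)           // 1. <meta> tag
                    : httpCharsetName != null ? httpCharsetName  // 2. HTTP header
                    : "ISO-8859-1";                              // 3. HTTP default

            return new String(raw, charsetName);
        }
    }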

I have no idea how an actual browser handles this, but I suspect something like that. Then again, the browser developer is actually in the HTML/HTTP business, dealing with all the inconsistencies that are at the heart of HTML/HTTP. I don't see the same priorities for NewsRob.

If you find something in the actual spec that is pretty clear on how to deal with this explicitly and efficiently, I will reconsider.

FWIW NewsRob works like this:

    String charsetName = null;

    // Look for a "charset" parameter in the Content-Type header.
    for (HeaderElement he : response.getEntity().getContentType().getElements()) {
        NameValuePair nvp = he.getParameterByName("charset");
        if (nvp != null) {
            charsetName = nvp.getValue();
            break;
        }
    }

    [...]

    // Fall back to ISO-8859-1 when the header names no (valid) charset.
    Charset charset = Charset.forName("ISO-8859-1");
    if (charsetName != null)
        try {
            charset = Charset.forName(charsetName);
        } catch (Exception e) {
            // stick with the default
        }

    InputStreamReader isr = charset != null ? new InputStreamReader(is, charset) : new InputStreamReader(is);

David Paloch

Feb 8, 2011, 7:59:27 AM
to new...@googlegroups.com

Hello Mariano,

thanks a lot for getting back to me so quickly.

I did a bit of investigating and checked a lot of sites. Most of them, including those I included in the previous post, declare no charset in the header at all. Mostly the response is "Content-Type: text/html" (you can check here: http://test.paloch.net/headers.php?path=zpravy.idnes.cz; just put any URL after the path parameter).

I found this on the W3C recommendation page (http://www.w3.org/TR/html4/charset.html#h-5.2.2):

The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-8859-1 as a default character encoding when the "charset" parameter is absent from the "Content-Type" header field. In practice, this recommendation has proved useless because some servers don't allow a "charset" parameter to be sent, and others may not be configured to send the parameter. Therefore, user agents must not assume any default value for the "charset" parameter.

It seems most servers don't send this header because they don't have to. NB "user agents must not assume any default value".

As I wrote before, I know neither Dalvik nor Java, so I don't know how they handle strings. I don't even know how PHP handles strings internally, but I suppose that when I use file_get_contents, it simply loads the content into memory as a byte string, without any default encoding attached to it. And PHP only takes the charset into account when it needs to, as most operations are charset independent.

I looked into the files NewsRob caches on the SD card and it seems the only processing you do is replacing the original links with cached links. So what if you always load the string as, say, ISO-8859-1 (does any charset need to be specified at all?) and apply no charset conversion to it at all? Then replace the links (with a regex?) and save it as ISO-8859-1 (again, does any charset need to be specified?). And when displaying the page, you feed WebKit whatever has been saved and, as you say, WebKit works the correct encoding out by itself (from the meta tag).
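
To illustrate why that should be safe: ISO-8859-1 assigns a character to every byte value 0x00-0xFF, so decoding and re-encoding through it round-trips any byte sequence losslessly. A minimal Java demonstration (hypothetical example, not from the thread):

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class Latin1RoundTrip {
        public static void main(String[] args) {
            // UTF-8 bytes of a non-ASCII string (any byte sequence would do).
            byte[] original = "Kráľova hoľa".getBytes(StandardCharsets.UTF_8);

            // ISO-8859-1 decodes every byte, so decoding and re-encoding
            // reproduces the original bytes exactly.
            String transparent = new String(original, StandardCharsets.ISO_8859_1);
            byte[] roundTrip = transparent.getBytes(StandardCharsets.ISO_8859_1);

            System.out.println(Arrays.equals(original, roundTrip)); // true
        }
    }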

David




--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
