How to deal with different encodings? (Latin-1 strings end up as 'NULL'.)


Max Romantschuk

Mar 1, 2010, 7:24:43 AM
to FireConsole
Hi everyone,

I'm working on a project which has a Latin-1 backend (historical
reasons), but we're converting the frontend to UTF-8. I'm not sure if
this issue is something which should be reported in Zend's issue
tracker, but I thought I'd start here. (I'm using the Zend stack, with
the data going through the Zend_Wildfire_Protocol_JsonStream class,
ultimately ending up in Zend_Json.)

Basically, I was trying to debug some backend data (which is latin-1)
with FirePHP. Surprisingly, all strings containing latin-1 data were
reported as 'NULL'. After some debugging the reason was obvious:
json_encode, which is the function ultimately handling the encoding,
was breaking the output. (As documented in the PHP docs: "This
function only works with UTF-8 encoded data. ")
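
To make the failure mode concrete, here is a minimal sketch in Python rather than the PHP stack described above. The sample bytes and the helper name `is_valid_utf8` are assumptions for illustration. The Latin-1 byte for "é" is 0xE9, which on its own is not a valid UTF-8 sequence, so a strict UTF-8 consumer rejects the whole string.

```python
# Hypothetical sample data: the word "café" in two encodings.
latin1_bytes = b"caf\xe9"       # "café" stored as Latin-1
utf8_bytes = b"caf\xc3\xa9"     # the same text stored as UTF-8


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(latin1_bytes))  # Latin-1 input is rejected
print(is_valid_utf8(utf8_bytes))    # UTF-8 input is accepted
```

Pure ASCII passes this check in either case, which is why only strings with accented letters are affected.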

The issue here is that when debugging something you most certainly
don't want your data to be mangled like this. When I'm using FirePHP
to debug latin-1 data I would expect it to get through as unharmed as
possible, perhaps with my broken latin-1 letters omitted/replaced, but
certainly not as null. That is very misleading when trying to output
some random data structure or object for debugging.
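
The "omitted/replaced rather than null" behaviour could look roughly like the following Python sketch. It is not how FirePHP or Zend_Json work; `to_debug_json` is a hypothetical helper, and the sample bytes are an assumption. Bytes that are not valid UTF-8 are replaced with U+FFFD so the rest of the string stays visible.

```python
import json


def to_debug_json(raw: bytes) -> str:
    """JSON-encode a byte string, replacing bytes that are not valid UTF-8."""
    # errors="replace" substitutes U+FFFD for each undecodable byte sequence
    text = raw.decode("utf-8", errors="replace")
    return json.dumps(text)


# Hypothetical sample: "Räksmörgås" stored as Latin-1.
sample = b"R\xe4ksm\xf6rg\xe5s"
print(to_debug_json(sample))
```

The output is lossy, but the ASCII letters survive and the damaged positions are clearly marked instead of the whole value collapsing to null.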

I'm not sure what the best way to fix this issue would be, but I can't
really convert my data for debugging. The whole point of outputting my
data for debugging is to see what's going on in the backend, which is
in latin-1. In any case, this would affect any site which isn't UTF-8
all the way.

I realize this is a backend issue in the stack I'm using, but it's
also a pretty major issue when doing some fairly rudimentary
debugging. Thoughts, suggestions? :)

Regards,
Max Romantschuk

Christoph Dorn

Mar 1, 2010, 4:51:19 PM
to firec...@googlegroups.com

Handling different encodings properly has definitely been a piecemeal
approach in the past. I have not spent any time on an end-to-end
approach for dealing with all encodings.

This may be a good opportunity for your feedback to help flesh out a
good design for FireConsole.

Firefox (and thus FireConsole) can handle a large variety of encodings.
The challenge is getting the data there in a non-destructive way.

We need a way to encode the payload using ASCII characters to ensure it
is valid for transport in HTTP headers. I think base64 could do the
trick, but I am not sure if that works for all encodings.
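
As a rough sketch of the base64 idea, in Python and with hypothetical helper names (`encode_for_header`, `decode_from_header`): base64 operates on raw bytes, so the result is ASCII-only regardless of the encoding the bytes were written in, and decoding returns the original bytes unchanged.

```python
import base64


def encode_for_header(raw: bytes) -> str:
    """Base64-encode arbitrary bytes into an ASCII-only string."""
    return base64.b64encode(raw).decode("ascii")


def decode_from_header(value: str) -> bytes:
    """Recover the original bytes on the receiving side."""
    return base64.b64decode(value)


# Hypothetical sample: "Grüße" stored as Latin-1.
payload = b"Gr\xfc\xdfe"
header_value = encode_for_header(payload)

assert header_value.isascii()                       # safe for an HTTP header
assert decode_from_header(header_value) == payload  # lossless round trip
```

The trade-off is size: base64 output is roughly a third larger than its input.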

If json_encode trashes your data we may need to roll our own json
encoder or base64 encode string values before json encoding them.

Any suggestions?

Christoph

Max Romantschuk

Mar 2, 2010, 2:57:35 AM
to FireConsole
> Firefox (and thus FireConsole) can handle a large variety of encodings.
> The challenge is getting the data there in a non-destructive way.

Indeed. We also don't have the luxury of any metadata to rely on when
choosing the display encoding. For example, in my case the page is
UTF-8 and the data is latin-1. I guess there are three options for
display:

1. Use whatever the webpage uses, and show invalid characters as
broken.
2. Try to auto-detect the encoding. Probably fails horribly outside of
ASCII, UTF-8 and latin-1.
3. Allow the user to choose the display encoding for the console and/
or variable being shown.
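
The three options above can be sketched in Python as follows. This is only an illustration: `decode_for_display` and `guess_encoding` are hypothetical names, the detection is deliberately naive, and the sample bytes are an assumption.

```python
def decode_for_display(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes for display; undecodable bytes become U+FFFD."""
    return raw.decode(encoding, errors="replace")


def guess_encoding(raw: bytes) -> str:
    """Naive detection: try ASCII, then UTF-8, else fall back to Latin-1."""
    for candidate in ("ascii", "utf-8"):
        try:
            raw.decode(candidate)
            return candidate
        except UnicodeDecodeError:
            continue
    # Every byte value is defined in Latin-1, so this fallback never fails,
    # which also means it cannot tell Latin-1 apart from other 8-bit encodings.
    return "latin-1"


data = b"\xc5bo"         # "Åbo" stored as Latin-1
page_encoding = "utf-8"  # assumption: the page itself is UTF-8

option_1 = decode_for_display(data, page_encoding)         # broken char shown as U+FFFD
option_2 = decode_for_display(data, guess_encoding(data))  # naive guess picks latin-1
option_3 = decode_for_display(data, "iso-8859-15")         # user-selected encoding
```

The fallback in `guess_encoding` shows why option 2 is fragile: any single-byte encoding would "succeed" as Latin-1, just with the wrong characters.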


> We need a way to encode the payload using ASCII characters to ensure it
> is valid for transport in HTTP headers. I think base64 could do the
> trick, but I am not sure if that works for all encodings.

Since base64 encodes binary data to ASCII it should work fine for any
encoding.

From a compatibility standpoint it would make sense to have the client
support strings as they are now, and use some kind of indicator to
inform the client that the data is a base64 payload instead of a bare
string. This way old backends would still work with newer clients, and
new backends could be configurable in how they deal with non-ascii
strings.
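
A minimal sketch of such an indicator, in Python. The marker key `__base64__` and both function names are made up for illustration and are not part of any existing Wildfire or FireConsole protocol. Valid UTF-8 strings go out as bare strings, as they do now; anything else is sent as a tagged base64 object, and the client accepts both forms.

```python
import base64
import json

B64_MARKER = "__base64__"  # hypothetical indicator key, not a real protocol field


def wrap_string(raw: bytes):
    """Return a bare string for valid UTF-8, else a tagged base64 object."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return {B64_MARKER: base64.b64encode(raw).decode("ascii")}


def unwrap_string(value) -> bytes:
    """Client side: accept bare strings (old backends) and tagged payloads."""
    if isinstance(value, dict) and B64_MARKER in value:
        return base64.b64decode(value[B64_MARKER])
    return value.encode("utf-8")


for raw in (b"plain ascii", b"caf\xe9"):
    wire = json.dumps(wrap_string(raw))  # what would travel to the client
    assert unwrap_string(json.loads(wire)) == raw
```

One caveat with this shape: a genuine data structure that happens to contain the marker key would be misread, so a real design would need a marker that cannot collide with user data.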


> If json_encode trashes your data we may need to roll our own json
> encoder or base64 encode string values before json encoding them.

According to the PHP docs json_encode (by design) only supports UTF-8.
I'm guessing the only option is to write a dedicated encoder.


That's about all I can come up with for now... :)

Max

Christoph Dorn

Mar 2, 2010, 3:19:25 AM
to firec...@googlegroups.com
Max Romantschuk wrote:
> 1. Use whatever the webpage uses, and show invalid characters as
> broken.
> 2. Try to auto-detect the encoding. Probably fails horribly outside of
> ASCII, UTF-8 and latin-1.
> 3. Allow the user to choose the display encoding for the console and/
> or variable being shown.

How about (3) with (1) as the default? I can put a drop-down into the
variable viewer. Any suggestions on a list of display encodings we
should offer?


> From a compatibility standpoint it would make sense to have the client
> support strings as they are now, and use some kind of indicator to
> inform the client that the data is a base64 payload instead of a bare
> string. This way old backends would still work with newer clients, and
> new backends could be configurable in how they deal with non-ascii
> strings.

Right, it will be backwards compatible.


> According to the PHP docs json_encode (by design) only supports UTF-8.
> I'm guessing the only option is to write a dedicated encoder.

I'll include a custom encoder in the next build.

Christoph

Max Romantschuk

Mar 2, 2010, 5:54:07 AM
to firec...@googlegroups.com
> Max Romantschuk wrote:
>> 1. Use whatever the webpage uses, and show invalid characters as
>> broken.
>> 3. Allow the user to choose the display encoding for the console and/
>> or variable being shown.

Christoph Dorn wrote:
> How about (3) with (1) as the default. I can put a drop-down into the
> variable viewer. Any suggestions on a list of display encodings we
> should offer?

Firefox's approach under 'View/Character Encoding' makes sense I think,
but duplicating that is probably an unpleasant endeavor. I've never
dealt with extending Firefox myself, so I don't know really.

UTF-8, ISO-8859-1 and ISO-8859-15 (the latter being the version of
latin-1 with the euro sign) cover most of Europe. But that still leaves
Cyrillic, Greek, and all the Asian scripts.

If it's possible to query the browser about what encodings are available
perhaps there could be a list of the five last used encodings plus a
'more...'-option?
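
The "five last used plus more..." list could be modelled along these lines. This is a Python sketch of the bookkeeping only, with a hypothetical class name; it says nothing about how the browser would be queried for its supported encodings.

```python
from typing import List


class RecentEncodings:
    """Keep the most recently used encodings, newest first."""

    def __init__(self, limit: int = 5) -> None:
        self.limit = limit
        self._items: List[str] = []

    def use(self, name: str) -> None:
        """Record that an encoding was selected, moving it to the front."""
        name = name.lower()
        if name in self._items:
            self._items.remove(name)
        self._items.insert(0, name)
        del self._items[self.limit:]  # drop the oldest entries beyond the limit

    def menu(self) -> List[str]:
        """Entries for the drop-down: recent encodings plus a catch-all."""
        return self._items + ["more..."]


recent = RecentEncodings()
for choice in ("utf-8", "iso-8859-1", "UTF-8"):
    recent.use(choice)
print(recent.menu())
```

Selecting an encoding that is already in the list moves it to the top rather than duplicating it.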


>> According to the PHP docs json_encode (by design) only supports UTF-8.
>> I'm guessing the only option is to write a dedicated encoder.
>
> I'll include a custom encoder in the next build.

I'd be happy to help test. Since the backend is in the Zend stack for
me, will I be able to patch these changes in somehow? I'm using a local dev
env so I do have full control of the code.

.max


--
Max Romantschuk
m...@romantschuk.fi
http://max.romantschuk.fi/

Christoph Dorn

Mar 2, 2010, 11:53:28 PM
to firec...@googlegroups.com
Max Romantschuk wrote:
> If it's possible to query the browser about what encodings are available
> perhaps there could be a list of the five last used encodings plus a
> 'more...'-option?

I am pretty sure I can get a list of supported encodings from the
browser. Great idea.


>> I'll include a custom encoder in the next build.
>
> I'd be happy to help test. Since the backend is in the Zend stack for
> me, will I be able patch these changes in somehow? I'm using a local dev
> env so I do have full control of the code.

Yeah. Should be no problem.

I have created an issue here:

http://github.com/cadorn/fireconsole/issues#issue/8

Christoph

