decoding ISO-8859-1 text

Manlio Perillo

Jan 22, 2016, 4:58:42 PM
to golang-nuts
Hi.

I need to decode some ISO-8859-1 text, but it seems that the encoding is not available in golang.org/x/text/encoding/charmap.
What is the reason?

https://encoding.spec.whatwg.org/ does not have ISO-8859-1, but says that windows-1252 is equivalent to it.

Wikipedia says that:
The popular Windows-1252 character set adds all the missing characters provided by ISO/IEC 8859-15, plus a number of typographic symbols, by replacing the rarely used C1 controls in the range 128 to 159 (hex 80 to 9F). It is very common to mislabel text data with the charset label ISO-8859-1, even though the data is really Windows-1252 encoded. Many web browsers and e-mail clients will interpret ISO-8859-1 control codes as Windows-1252 characters in order to accommodate such mislabeling but it is not standard behaviour and care should be taken to avoid generating these characters in ISO-8859-1 labeled content.

I suspect that WHATWG considers ISO-8859-1 as an alias for Windows-1252 for this reason. x/text should not follow WHATWG, IMHO.


Thanks  Manlio

Nigel Tao

Jan 22, 2016, 5:56:38 PM
to Manlio Perillo, golang-nuts
On Sat, Jan 23, 2016 at 8:58 AM, Manlio Perillo
<manlio....@gmail.com> wrote:
> What is the reason?
>
> ...
>
> I suspect that WHATWG considers ISO-8859-1 as an alias for Windows-1252 for
> this reason. x/text should not follow WHATWG, IMHO.

Yes, the reason is that WHATWG says to do so. There is some sense that
the wonderful thing about standards is that there are so many of them
to choose from, but we did have to choose one, and we chose WHATWG,
since billions of people use web browsers and e-mail clients, and e.g.
Go is often used to write web clients and servers.

IIUC, ISO-8859-1 is a subset of Windows-1252, so if you're only
decoding and not encoding, and decoding assuming valid input, then
your decoded output should be just fine anyway.

Even if it isn't, it should be easy to write your own Encoding
implementation with the exact semantics you want.

Andy Balholm

Jan 22, 2016, 6:46:40 PM
to Nigel Tao, Manlio Perillo, golang-nuts
Nigel Tao wrote:

> Even if it isn't, it should be easy to write your own Encoding
> implementation with the exact semantics you want.

…especially since ISO-8859-1 is a subset of Unicode. If b is an ISO-8859-1 byte, rune(b) is the corresponding Unicode value.
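A minimal Go sketch of that observation, using only the standard library. The function name decodeLatin1 is made up for illustration and is not part of golang.org/x/text; the sketch relies only on the fact that each ISO-8859-1 byte value equals its Unicode code point.

```go
package main

import "fmt"

// decodeLatin1 (hypothetical helper) converts ISO-8859-1 bytes to a
// UTF-8 Go string. Each byte value is also the Unicode code point of
// the character, so widening byte to rune is the whole conversion.
func decodeLatin1(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	// "Grüße" in ISO-8859-1: ü is byte 0xFC and ß is byte 0xDF.
	in := []byte{'G', 'r', 0xFC, 0xDF, 'e'}
	fmt.Println(decodeLatin1(in))
}
```

Note that bytes 0x80 to 0x9F come out as the C1 control characters U+0080 to U+009F, which is exactly where this differs from a Windows-1252 decoder.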

Uli Kunitz

Jan 22, 2016, 6:58:12 PM
to golang-nuts, manlio....@gmail.com
I volunteer to provide an ISO-8859-1 implementation for x/text/encoding/charmap, because it is a fully standardized character set and Go would be the only encoding library I know that doesn't support it. The encoding has been the default on Linux systems at least for German users for a long time, so there are still files around in that encoding. There are even some older file format specifications requiring it.

Is x/text/encoding using the same process for CL code reviews as the Go sources?

Uli Kunitz

Jan 22, 2016, 7:34:21 PM
to golang-nuts, manlio....@gmail.com
While I still have to test it, it is actually quite easy to extend golang.org/x/text/encoding/charmap/maketables.go using the ICU ucm file. The new tables.go looks promising. I will add a test tomorrow and provide the CL.

Manlio Perillo

Jan 22, 2016, 7:50:51 PM
to Nigel Tao, golang-nuts
On Fri, Jan 22, 2016 at 11:56 PM, Nigel Tao <nige...@golang.org> wrote:
> On Sat, Jan 23, 2016 at 8:58 AM, Manlio Perillo
> <manlio....@gmail.com> wrote:
>> What is the reason?
>>
>> ...
>>
>> I suspect that WHATWG considers ISO-8859-1 as an alias for Windows-1252 for
>> this reason. x/text should not follow WHATWG, IMHO.
>
> Yes, the reason is that WHATWG says to do so. There is some sense that
> the wonderful thing about standards is that there are so many of them
> to choose from, but we did have to choose one, and we chose WHATWG,
> since billions of people use web browsers and e-mail clients, and e.g.
> Go is often used to write web clients and servers.
>

This is simply wrong, IMHO.
If you want to use Windows-1252 for compatibility reasons, as
suggested by WHATWG, then it is your application's responsibility to
define an encoding mapping that considers ISO-8859-1 an alias for
Windows-1252.

> IIUC, ISO-8859-1 is a subset of Windows-1252, so if you're only
> decoding and not encoding, and decoding assuming valid input, then
> your decoded output should be just fine anyway.
>
> Even if it isn't, it should be easy to write your own Encoding
> implementation with the exact semantics you want.

It can be easy, but I don't see why it should not be available from
the text package. I was surprised to find all the ISO-8859 encodings
except latin1.


Thanks Manlio

andrey mirtchovski

Jan 22, 2016, 8:48:15 PM
to Manlio Perillo, Nigel Tao, golang-nuts
May I suggest Roger Peppe's wonderful "github.com/rogpeppe/go-charset/charset"?

I've been using it for a project for at least three years and it's
proved nothing but wonderful.

all one needs to do (in my case) is import:

import "github.com/rogpeppe/go-charset/charset"
import _ "github.com/rogpeppe/go-charset/data"

then:

p := xml.NewDecoder(resp.Body)
p.CharsetReader = charset.NewReader

https://godoc.org/github.com/rogpeppe/go-charset/charset

Nigel Tao

Jan 22, 2016, 9:10:23 PM
to Manlio Perillo, Marcel van Lohuizen, golang-nuts
On Sat, Jan 23, 2016 at 11:50 AM, Manlio Perillo
<manlio....@gmail.com> wrote:
> This is simply wrong, IMHO.

Well, you and me repeating "IMHO I'm right and you're wrong" ad
infinitum probably isn't going to change anyone's mind. :-)

I'll ask mpvl, co-author of the golang.org/x/text packages, for a third opinion.

Nigel Tao

Jan 22, 2016, 9:16:48 PM
to andrey mirtchovski, Manlio Perillo, golang-nuts
On Sat, Jan 23, 2016 at 12:47 PM, andrey mirtchovski
<mirtc...@gmail.com> wrote:
> May I suggest Roger Peppe's wonderful "github.com/rogpeppe/go-charset/charset"?

Roger Peppe himself recommends using the golang.org/x/text/encoding
packages instead of github.com/rogpeppe/go-charset. See
https://groups.google.com/d/msg/golang-nuts/FHmzYmM5r5Y/3hTRqPKjCwAJ

For example, preliminary benchmarks suggest that
golang.org/x/text/encoding is faster and allocates less memory:
https://groups.google.com/d/msg/golang-dev/UfT00vJBW8Y/iQKwDM5PSzcJ

Nigel Tao

Jan 22, 2016, 9:28:29 PM
to Uli Kunitz, golang-nuts, Manlio Perillo
On Sat, Jan 23, 2016 at 10:58 AM, Uli Kunitz <uli.k...@gmail.com> wrote:
> Is x/text/encoding using the same process for CL code reviews as the Go
> sources?

Yes, same process, although personally, I'd wait until we first reach
consensus on what to do.

Uli Kunitz

Jan 23, 2016, 4:23:18 AM
to golang-nuts, uli.k...@gmail.com, manlio....@gmail.com
Nigel, many thanks for your answer. I found in the log files that you are the original author of the x/text/encoding package. Many thanks for that work: character set handling is a no-brainer for most developers today, but for folks like me who had to write code before Unicode and UTF-8 became widespread, it was and still is a major headache, particularly when your native language cannot be written correctly in ASCII. Since ISO-8859-1 includes the German umlauts äöü as well as the SZ character ß, we used it as the default character set for UNIX systems and in databases. There is even a legacy Financial Transaction Services / Homebanking Computer Interface (FinTS/HBCI) specification, still in use in Germany today (FINTS 3.0), that declares a subset of ISO-8859-1 as mandatory.

It appears that WHATWG intended to make things work for Windows users, accepting non-conformance with international standards. Windows-1252 is a superset of ASCII and ISO-8859-1, but the two are not identical, even though the WHATWG specification treats them as such. While this works for decoding files and compensates for wrong labeling, it breaks when text is generated and then processed by decoders that are compliant with the actual international standards. I remember a user-acceptance test issue where the use of the Windows-1252 EURO sign € in an ISO-8859-1-labeled text broke a decoder in one of our backend systems, putting the production date at risk.
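The encoding side of this problem can be sketched in Go with the standard library alone. The helper name encodeLatin1 is hypothetical and not an x/text API; it only assumes that ISO-8859-1 covers exactly the code points U+0000 to U+00FF, so the EURO sign (U+20AC) has to be rejected instead of being silently written as the Windows-1252 byte 0x80.

```go
package main

import "fmt"

// encodeLatin1 (hypothetical helper) converts a UTF-8 string to
// ISO-8859-1 bytes. It returns an error for the first rune that lies
// outside U+0000..U+00FF, because ISO-8859-1 cannot represent it.
func encodeLatin1(s string) ([]byte, error) {
	out := make([]byte, 0, len(s))
	for _, r := range s {
		if r > 0xFF {
			return nil, fmt.Errorf("rune %q (U+%04X) is not representable in ISO-8859-1", r, r)
		}
		out = append(out, byte(r))
	}
	return out, nil
}

func main() {
	for _, s := range []string{"Grüße", "5 €"} {
		b, err := encodeLatin1(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("% X\n", b)
	}
}
```

A strict encoder like this fails early on the producing side, rather than leaving the failure to a standards-compliant decoder further downstream.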

It appears that there is an intention to provide alternative indexes to character set encodings, like ianaindex. Would you support a subpackage uts22 that is compliant with Unicode Technical Standard #22, and leave WHATWG as the default?

Manlio Perillo

Jan 23, 2016, 9:23:07 AM
to golang-nuts, manlio....@gmail.com, mp...@golang.org
Thanks.

Manlio 

oju...@gmail.com

Jan 24, 2016, 3:33:21 AM
to golang-nuts, uli.k...@gmail.com, manlio....@gmail.com
I was working for a news agency when we got a surprise with the EURO sign disappearing from our news. Then I learned that Windows-1252 and ISO-8859-1 "are the same, but not quite" and that most of the time, when people state the information is ISO-8859-1 encoded, that really means Windows-1252.

Manlio Perillo

Jan 24, 2016, 9:10:15 AM
to golang-nuts, uli.k...@gmail.com, manlio....@gmail.com, oju...@gmail.com
On Sunday, January 24, 2016 at 09:33:21 UTC+1, oju...@gmail.com wrote:
> I was working for a news agency when we got a surprise with the EURO sign disappearing from our news. Then I learned that Windows-1252 and ISO-8859-1 "are the same, but not quite" and that most of the time, when people state the information is ISO-8859-1 encoded, that really means Windows-1252.
 

I'm not against considering ISO-8859-1 an alias for Windows-1252, since it is probably the right thing to do.
I'm against the idea of *not supporting at all* the ISO-8859-1 decoder in the charmap package.

A normal user will access an encoding by its name, and here ISO-8859-1 should be considered an alias for Windows-1252.
However if I really want ISO-8859-1, I should be able to get it from charmap.ISO_8859_1.
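The split between lookup by name and explicit access can be sketched as follows. This is a standard-library illustration, not the x/text design: the names decoder, byLabel, lookup, decodeLatin1 and decodeWindows1252 are all invented here, and the label table holds only three entries. The high-byte table follows the Windows-1252 mapping, with the five unassigned bytes left as C1 controls.

```go
package main

import (
	"fmt"
	"strings"
)

// decoder is a hypothetical type for a single-byte charset decoder.
type decoder func([]byte) string

// cp1252High maps bytes 0x80..0x9F to their Windows-1252 code points.
// 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned and kept as C1 controls.
var cp1252High = [32]rune{
	0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
	0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
	0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
	0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
}

// decodeLatin1 is the strict ISO-8859-1 decoder: byte value == code point.
func decodeLatin1(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

// decodeWindows1252 differs from decodeLatin1 only for bytes 0x80..0x9F.
func decodeWindows1252(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		if c >= 0x80 && c <= 0x9F {
			runes[i] = cp1252High[c-0x80]
		} else {
			runes[i] = rune(c)
		}
	}
	return string(runes)
}

// byLabel is a tiny label index in the WHATWG spirit: the ISO-8859-1
// labels resolve to the Windows-1252 decoder.
var byLabel = map[string]decoder{
	"windows-1252": decodeWindows1252,
	"iso-8859-1":   decodeWindows1252,
	"latin1":       decodeWindows1252,
}

// lookup resolves a charset label case-insensitively.
func lookup(label string) (decoder, bool) {
	d, ok := byLabel[strings.ToLower(strings.TrimSpace(label))]
	return d, ok
}

func main() {
	data := []byte{0x80, ' ', '5'}
	if d, ok := lookup("ISO-8859-1"); ok {
		fmt.Printf("by label: %q\n", d(data)) // lenient, web-style decoding
	}
	fmt.Printf("strict:   %q\n", decodeLatin1(data)) // explicit ISO-8859-1
}
```

Callers who go through the label get the lenient behavior, while code that names the strict decoder directly gets real ISO-8859-1 semantics.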

Also, this behavior should be documented.
Documentation should be added even if nothing is changed, both in the charmap package documentation and in the Windows-1252 variable documentation, explaining why ISO-8859-1 is not available.

Thanks
Manlio

Nigel Tao

Jan 28, 2016, 7:23:46 PM
to Marcel van Lohuizen, golang-nuts
On Sat, Jan 23, 2016 at 1:10 PM, Nigel Tao <nige...@golang.org> wrote:
> I'll ask mpvl, co-author of the golang.org/x/text packages, for a third opinion.

mpvl: ping.

Marcel van Lohuizen

Feb 4, 2016, 3:19:55 AM
to golang-nuts, mp...@golang.org
Sorry for chiming in late.

I agree that x/text should not only support WhatWG. Browsers are a major use case, but not the only one. Also, there is no need for x/text to limit the selection of encodings to one particular point of view.

Case in point, the x/text repo has both an encoding/htmlindex (done) and encoding/ianaindex (draft checked in) to allow for different standards in selecting encodings. The former conforms to WhatWG, the latter to IANA/MIME.

It makes sense to disallow or discourage some encodings for security reasons, but ISO-8859-1 would be good to have. Internally, Encodings are marked with an IANA MIB. The purpose of this ID is to have some handle on which sets are different and which are not, and on how exactly they are defined. Any encoding that is listed in x/text/encoding/internal/identifier/mib.go is fair game to be added (ideally only the common ones and the others on a need-to-have basis). IMO ;)

Nigel Tao

Feb 4, 2016, 5:11:52 AM
to Uli Kunitz, golang-nuts
Yeah, the more that I think about it, the more I agree with y'all that
we should have a charmap.ISO8859_1 as distinct from
charmap.Windows1252.

Uli, do you still want to cook up a CL? I'm happy to do it tomorrow,
if you're busy.

As for a UTS 22 decoder, I'm not sure exactly what you're suggesting,
but I don't think we need one just yet. At the least, one doesn't have to
live under golang.org/x just yet.

mpvl

Feb 4, 2016, 11:11:36 AM
to golang-nuts, mp...@golang.org

Uli Kunitz

Feb 4, 2016, 4:10:06 PM
to golang-nuts, uli.k...@gmail.com

I proposed the uts22 package for the case that compliance with the WHATWG encoding standard would have been seen as paramount.

Nigel Tao

Feb 4, 2016, 6:50:37 PM
to Uli Kunitz, golang-nuts
On Fri, Feb 5, 2016 at 8:10 AM, Uli Kunitz <uli.k...@gmail.com> wrote:
> I proposed the uts22 package for the case that compliance with the WHATWG
> encoding standard would have been seen as paramount.

Well, it's not like we're seeking some sort of formal tick of approval
that we're Complying with WHATWG. Parsing
https://www.w3.org/TR/encoding/indexes/encodings.json (or
equivalently, AFAICT, https://encoding.spec.whatwg.org/encodings.json)
seems to work perfectly fine without having to write a UTS 22 decoder.

Ulrich Kunitz

Feb 4, 2016, 7:18:29 PM
to Nigel Tao, golang-nuts
I admit the whole proposal has not been thought through much; I didn't want to push the XML format. I regard their approach as quite reasonable, particularly the first chapter. Since ICU implements UTS 22 and does the right thing (from my point of view) regarding ISO 8859-1, I proposed it.

As we have discussed, using windows-1252 to decode ISO 8859-1 will generate practically no issues. But we should not support people encoding windows-1252 and labeling it as ISO_8859-1. This will break, and the issue will most likely involve the EURO sign €, which is supported by windows-1252 but not by ISO 8859-1.

mp...@golang.org

Feb 5, 2016, 4:08:38 AM
to Ulrich Kunitz, Nigel Tao, golang-nuts
I agree with Nigel that this is not the time to add UTS 22. Note that UTS 22 is quite old already and things have settled down a bit since. I think it is quite a bit of overkill for where we are at today. Using the IANA charset listing as index serves as a reasonable compromise and covers most use cases nowadays, I reckon.

In line with UTS-22, it would not be bad to be a bit more resilient during encoding by retrying with decomposition on failure or by detecting possible compositions first. For example, a character set supporting è will not convert e\u0300 identically (neither do any other implementations afaik, but it would be cool and consistent with the x/text philosophy of the user not having to worry about normalization). But one doesn't need UTS-22 for that.
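The "detect possible compositions first" idea can be illustrated with a small Go sketch. Everything here is an assumption for illustration: the compose table is a hand-written toy with five entries, and composePairs and encodeLatin1 are invented names. A real implementation would more likely rely on Unicode normalization (for example NFC from golang.org/x/text/unicode/norm) than on a hand-made table.

```go
package main

import "fmt"

// compose is a deliberately tiny, hand-written table that maps a base
// letter plus a combining mark to the precomposed character.
var compose = map[[2]rune]rune{
	{'e', '\u0300'}: '\u00E8', // e + combining grave     -> è
	{'e', '\u0301'}: '\u00E9', // e + combining acute     -> é
	{'a', '\u0308'}: '\u00E4', // a + combining diaeresis -> ä
	{'o', '\u0308'}: '\u00F6', // o + combining diaeresis -> ö
	{'u', '\u0308'}: '\u00FC', // u + combining diaeresis -> ü
}

// composePairs replaces every adjacent (base, mark) pair found in the
// table with its precomposed form and copies all other runes unchanged.
func composePairs(s string) string {
	in := []rune(s)
	out := make([]rune, 0, len(in))
	for i := 0; i < len(in); i++ {
		if i+1 < len(in) {
			if c, ok := compose[[2]rune{in[i], in[i+1]}]; ok {
				out = append(out, c)
				i++ // skip the combining mark that was just consumed
				continue
			}
		}
		out = append(out, in[i])
	}
	return string(out)
}

// encodeLatin1 composes first and then encodes to ISO-8859-1. It
// reports false if a rune above U+00FF remains after composition.
func encodeLatin1(s string) ([]byte, bool) {
	out := make([]byte, 0, len(s))
	for _, r := range composePairs(s) {
		if r > 0xFF {
			return nil, false
		}
		out = append(out, byte(r))
	}
	return out, true
}

func main() {
	// "cafe" followed by a combining grave accent.
	b, ok := encodeLatin1("cafe\u0300")
	fmt.Printf("% X %v\n", b, ok)
}
```

With this approach, "e\u0300" and "è" encode to the same single byte 0xE8, so the caller does not have to normalize the input first.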
