
Text file encodings in OS-X (ISO Latin1 8859 vs UTF-8)


JF Mezei

Dec 18, 2012, 3:43:25 PM
Out of curiosity, when I save a text file in TextEdit, I am given the
chance to specify the text encoding (ISO 8859-1 Latin1, UTF-8 and many
others) in the "save as" menu option.

How/where is this stored ? From the command line, is there a way to see
and possibly change the text encoding associated with a file ?

I recently used PHP to download data from the CRTC web site, which the HTTP
headers declare as UTF-8, but PHP has great problems dealing with
accented characters, both when the data is read directly via HTTP and when
the HTML files are first stored locally as text files.

And now, I have TextEdit telling me it can no longer save a text
document because I pasted text that probably contains characters not
possible in latin-1 so I have to save-as UTF-8.

I'd like to have a better understanding on how text files are processed
under OS-X.

John Holt

Dec 18, 2012, 4:58:00 PM
As far as I have been able to tell, encoding determination in Mac OS X is
just like everywhere else. In other words, the protocol for the file type
specifies (indirectly in some cases like XML) the encoding.

The determination of the encoding for a file can be a little tough. For
the most part, the encoding is simply implied.



Good luck,

--
John Holt

Wes Groleau

Dec 18, 2012, 10:57:47 PM
On 12-18-2012 16:58, John Holt wrote:
> As far as I have been able to tell, encoding determination in Mac OS X is
> just like everywhere else. In other words, the protocol for the file type
> specifies (indirectly in some cases like XML) the encoding.

I could be wrong, but I believe some programs use a resource fork and
many (including TextEdit) use a system-call version of the file command.

man file

to see how that works.

--
Wes Groleau

I've noticed lately that the paranoid fear of computers becoming
intelligent and taking over the world has almost entirely disappeared
from the common culture. Near as I can tell, this coincides with
the release of MS-DOS.
— Larry DeLuca


JF Mezei

Dec 19, 2012, 2:59:32 AM
On 12-12-18 23:35, Lewis wrote:

> UTF8 is the most common file type on the Internet.

But for text files, especially OS config files and bash scripts, I am
not sure if UTF-8 is the most common.



> You'd best make sure
> your web browser and php are setup properly.

When browsing those pages from CRTC, the accents display correctly. But
when using PHP to extract those pages and save the data to a text file
or create file name with those accented characters, it was screwed up.

For instance I would take "Alain Gagné" as a name and convert it to
"0057_Alain_Gagné.PDF", where this guy's submission was to be stored, but
the file name wouldn't have the é but some multi-character combo.

Just wondering if running PHP in a xterm window causes it to go into
some mode which won't support UTF8. The CRTC web site does provide UTF-8
as the Character set in the HTTP response header.
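To illustrate the step of honouring the charset announced in the HTTP response header, here is a minimal sketch in Python rather than PHP. The helper names (`charset_from_content_type`, `decode_body`), the fallback default and the sample header value are assumptions made for the example, not anything taken from the CRTC site.

```python
from email.message import Message


def charset_from_content_type(header_value, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type value, or a default."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


def decode_body(raw_bytes, content_type):
    """Decode a response body using the charset named in its header."""
    return raw_bytes.decode(charset_from_content_type(content_type))


# Hypothetical response: UTF-8 bytes plus a header announcing UTF-8.
raw = "Alain Gagn\u00e9".encode("utf-8")
text = decode_body(raw, "text/html; charset=UTF-8")

# Build the file name from decoded text, not from raw bytes.
filename = "0057_" + text.replace(" ", "_") + ".PDF"
print(filename)
```

The point of the sketch is only that the bytes are decoded once, with the encoding the server announced, before any string handling takes place.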

bi...@mix.com

Dec 19, 2012, 3:48:27 AM
JF Mezei <jfmezei...@vaxination.ca> writes:

Your article, by the way, was posted thusly -

> Content-Type: text/plain; charset=ISO-8859-1

But I don't know from whence it was created (the
computer with the problem, or somewhere else).

> Just wondering if running PHP in a xterm window causes it to go into
> some mode which won't support UTF8. The CRTC web site does provide UTF-8
> as the Character set in the HTTP response header.

You might want to start by defining LANG = en_CA.UTF-8 or maybe fr_CA.UTF-8
on your local machine. locale -a will give you a list of available choices.
xterm may have something to set as well, I don't use it so I don't know.

Billy Y..
--
sub #'9+1 ,r0 ; convert ascii byte
add #9.+1 ,r0 ; to an integer
bcc 20$ ; not a number

Paul Sture

Dec 19, 2012, 4:33:36 AM
In article <karv0r$pf8$1...@reader1.panix.com>, bi...@MIX.COM wrote:

> JF Mezei <jfmezei...@vaxination.ca> writes:
computer with the problem, or somewhere else).
>
> > Just wondering if running PHP in a xterm window causes it to go into
> > some mode which won't support UTF8. The CRTC web site does provide UTF-8
> > as the Character set in the HTTP response header.

FWIW I had a heck of a problem getting accented characters correct in
the Tiger version of Terminal so xterm might be the problem.

> You might want to start by defining LANG = en_CA.UTF-8 or maybe fr_CA.UTF-8
> on your local machine. locale -a will give you a list of available choices.
> xterm may have something to set as well, I don't use it so I don't know.

You can do that on the invoking line with something like:

LANG=fr_CA.UTF-8 php <mumble>

For example the following gives me French month abbreviations in the
dates (in Terminal), including the e-acute in "déc":

LANG=fr_CA.UTF-8 ls -l
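The same per-command override can be done from a script. The following is a small Python sketch, assuming a hypothetical helper `run_with_lang`; the child process here merely echoes its `LANG` value so the example does not depend on which locales are installed. Note that if `LC_ALL` is set in the environment it takes precedence over `LANG`.

```python
import os
import subprocess
import sys


def run_with_lang(argv, lang):
    """Run a command with LANG overridden for that one child process."""
    env = dict(os.environ, LANG=lang)
    result = subprocess.run(
        argv, env=env, capture_output=True, text=True, check=True
    )
    return result.stdout


# The child only reports the LANG value it inherited.
child = [sys.executable, "-c", "import os; print(os.environ['LANG'])"]
out = run_with_lang(child, "fr_CA.UTF-8")
print(out.strip())
```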

--
Paul Sture

Q: pleasecanyoufixmyspacebar?
A: myspaceisdeadyouneedtotryfacebook

Paul Sture

Dec 19, 2012, 11:43:43 AM
In article <nospam-B794AC....@news.chingola.ch>,
Paul Sture <nos...@sture.ch> wrote:

> In article <karv0r$pf8$1...@reader1.panix.com>, bi...@MIX.COM wrote:
>
> > JF Mezei <jfmezei...@vaxination.ca> writes:
> computer with the problem, or somewhere else).
> >
> > > Just wondering if running PHP in a xterm window causes it to go into
> > > some mode which won't support UTF8. The CRTC web site does provide UTF-8
> > > as the Character set in the HTTP response header.
>
> FWIW I had a heck of a problem getting accented characters correct in
> the Tiger version of Terminal so xterm might be the problem.
>
> > You might want to start by defining LANG = en_CA.UTF-8 or maybe fr_CA.UTF-8
> > on your local machine. locale -a will give you a list of available choices.
> > xterm may have something to set as well, I don't use it so I don't know.
>
> You can do that on the invoking line with something like:
>
> LANG=fr_CA.UTF-8 php <mumble>
>
> For example the following gives me French month abbreviations in the
> dates (in Terminal), including the e-acute in "déc":
>
> LANG=fr_CA.UTF-8 ls -l

And yes, in xterm too.

But please note I am using Xquartz here because I am on Mountain Lion.

JF Mezei

Dec 19, 2012, 1:24:05 PM
Here is another example:

Saved:
> https://services.crtc.gc.ca/pub/ListeInterventionList/Default-Defaut.aspx?en=2012-557&dt=r&lang=e

as an .html file on disk. Firefox displays Rachel Laperrière properly.


Both TextEdit and the X Window nedit display the HTML line as:

<span id="ctl00_ContentMain_gvData_ctl04_lblIntervenor">Laperrière,
Rachel</span><br />


And when PHP fetches the data with a

$level1html =
file_get_contents('https://services.crtc.gc.ca/pub/ListeInterventionList/Default-Defaut.aspx?en=2012-557&dt=c&Lang=e');


The data in the "level1html" variable is also as shown in the span. And when
PHP uses that data to create a filename, the resulting file in OS-X's
Finder is the "corrupt" one, not one with Laperrière.

I tried the setlocale in PHP to no avail.

And I tried to use iconv to convert it to plain ASCII with
transliteration so it would end up Rachel Laperriere and that also failed.
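For comparison, here is a rough Python sketch of accent-stripping transliteration. It is not what iconv does internally; it uses Unicode decomposition instead, and the helper name `to_ascii` is made up for the example. It also shows why the input has to be decoded with the right encoding first: the same bytes read as Latin-1 do not transliterate to the intended name.

```python
import unicodedata


def to_ascii(text):
    """Drop accents by decomposing characters and discarding non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# UTF-8 bytes for "Rachel Laperriere" with a grave accent on the first e.
raw = b"Rachel Laperri\xc3\xa8re"

good = to_ascii(raw.decode("utf-8"))    # decoded with the right encoding
bad = to_ascii(raw.decode("latin-1"))   # decoded with the wrong encoding

print(good)
print(bad)
```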

If the text encoding were auto detected based on content of file, how
come Textedit doesn't detect UTF-8 data ?

Now it gets stranger:
I do a "view source" from Firefox. Rachel's name displayed properly.
Select all, copy, and paste it into an empty textedit window. Rachel's
name still fine. Save the file as a text file:

Now, TextEdit can reopen the file and see Rachel's name fine, but nedit
will display the raw characters (which probably means nedit is just not
capable of supporting UTF-8 characters).

However, how come if I save the HTML to a file from firefox, textedit
fails, but if I view source , copy paste into text edit, save it , and
then reopen it, the file is fine ?

Seems to me there must be some hidden file attribute somewhere which
doesn't get set when firefox saves the source file, but does when
textedit saves it.



JF Mezei

Dec 19, 2012, 1:35:53 PM
BTW, here is how PHP displays my entry in an xterm:

name=Mezei, Jean-François
name3=Mezei_Jean-François
company=Vaxination Informatique
docs=Documents.aspx?ID=176417&Lang=e
number=0020

Here is how the file comes out in Finder after PHP has used my name to
construct a file containing the document:

0020_Mezei_Jean-François_01.PDF
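One plausible explanation for the two different garblings (two characters for the ç in the xterm output, four in the file name) is that the UTF-8 bytes were misread as Latin-1 once in the first case and twice in the second. This is a guess, not something confirmed in the thread. A small Python sketch of that mechanism, with made-up helper names:

```python
def misread_as_latin1(text):
    """Encode as UTF-8, then wrongly decode those bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def undo_misread(text):
    """Reverse one round of the mistake above."""
    return text.encode("latin-1").decode("utf-8")


once = misread_as_latin1("Fran\u00e7ois")   # one round of damage
twice = misread_as_latin1(once)             # two rounds of damage

print(repr(once))
print(repr(twice))

# Each round can be undone as long as no bytes were lost on the way.
repaired = undo_misread(undo_misread(twice))
print(repaired)
```

After two rounds the ç has become four characters, one of which is an invisible C1 control character, which would match a file name that looks like it has three odd characters in Finder.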

I guess I'll have to find a way to run Terminal.app on the xserve
(headless server) to see if PHP would then run in UTF-8 mode.


JF Mezei

Dec 19, 2012, 1:40:47 PM
Just tried running PHP in Terminal.App on the Xserve and it yielded the
same results.

In the "Advanced" preferences for Terminal.App the character encoding is
set to UTF-8 with the "set locale on startup" checked.


JF Mezei

Dec 19, 2012, 1:50:39 PM
With Finder, I manually corrected the accented characters (at this stage
of the process, the number of documents is smaller)

In "Terminal.app" the folder for Vidéotron comes out as:

Québecor_Média_inc


But in Xterm:

Que??becor_Me??dia_inc

So it is probably some resource I need to set in Xterm to tell it to use
UTF-8 by default, since in its current state it is unable to handle it.
Interesting that Xterm displays the ç as ÃÂ when dealing with data
in a file, but when I manually set a file name with Finder, Xterm then
displays the ç as ??
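A possible reading of "Que??becor", where the plain "e" survives and two question marks follow it: the Mac file system stores file names in decomposed form, so the é typed in Finder becomes "e" plus a combining acute accent, and that accent alone takes two bytes in UTF-8. The sketch below assumes a tool that prints "?" for every byte above 0x7F, which is a simplification of what ls does in a non-UTF-8 locale.

```python
import unicodedata

name = "Qu\u00e9becor"                      # precomposed e-acute, as typed
decomposed = unicodedata.normalize("NFD", name)
raw = decomposed.encode("utf-8")

print(raw)  # the bytes that end up in the directory entry

# Simulate a byte-oriented listing that cannot print non-ASCII bytes.
shown = "".join(chr(b) if b < 0x80 else "?" for b in raw)
print(shown)
```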




Richard Kettlewell

Dec 19, 2012, 2:50:59 PM
JF Mezei <jfmezei...@vaxination.ca> writes:
> Out of curiosity, when I save a text file in TextEdit, I am given the
> chance to specify the text encoding (ISO 8859-1 Latin1, UTF-8 and many
> others) in the "save as" menu option.
>
> How/where is this stored ? From the command line, is there a way to
> see and possibly change the text encoding associated with a file ?

TextEdit stores the encoding in an extended attribute, which you can
retrieve with xattr.

$ hexdump -C utf8.txt
00000000 66 69 6c 65 20 77 69 74 68 20 61 20 c2 a3 20 73 |file with a .. s|
00000010 69 67 6e 0a |ign.|
00000014
$ hexdump -C wlatin1.txt
00000000 66 69 6c 65 20 77 69 74 68 20 61 20 a3 20 73 69 |file with a . si|
00000010 67 6e 0a |gn.|
00000013
$ xattr -l utf8.txt
com.apple.TextEncoding: utf-8;134217984
$ xattr -l wlatin1.txt
com.apple.TextEncoding: windows-1252;1280

(The decimal value is an OSX-specific identifier for the encoding.)
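As a sketch of how the attribute value shown above could be taken apart, the Python below splits it into the encoding name and the numeric identifier. The function name `parse_text_encoding` is invented for the example. The hex forms of the two numbers (0x08000100 and 0x0500) appear to match Core Foundation's CFStringEncoding constants for UTF-8 and Windows Latin 1; treat that mapping as my reading rather than something stated in the thread. The last lines reproduce the byte difference visible in the two hexdumps.

```python
def parse_text_encoding(value):
    """Split 'name;number' into (name, number); number may be absent."""
    name, _, number = value.partition(";")
    return name, (int(number) if number else None)


for sample in ("utf-8;134217984", "windows-1252;1280"):
    name, number = parse_text_encoding(sample)
    print(name, number, hex(number))

# The pound sign is two bytes (c2 a3) in UTF-8 and one byte (a3) in
# Windows-1252, which accounts for the 0x14 versus 0x13 file sizes.
text = "file with a \u00a3 sign\n"
utf8_hex = text.encode("utf-8").hex(" ")
cp1252_hex = text.encode("cp1252").hex(" ")
print(utf8_hex)
print(cp1252_hex)
```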

But this is not a general answer; many other programs will neither set
the attribute nor pay any attention to it. For example, most Unix
programs will either ignore the question entirely or assume that the
encoding implied by the LC_CTYPE locale setting holds for all files.

--
http://www.greenend.org.uk/rjk/

bi...@mix.com

Dec 19, 2012, 8:35:52 PM
Here's a handy web test page I stumbled upon today -

UTF-8 Sampler
http://www.columbia.edu/~fdc/utf8/

And to fill some of what Apple doesn't provide (the
author's web site is not responding for me now) -

Code2000 Font
http://www.fonts2u.com/code2000.font

Not that any of this solves the OP's problem, but, hey...

Wes Groleau

Dec 19, 2012, 11:00:05 PM
On 12-19-2012 03:48, bi...@MIX.COM wrote:
> You might want to start by defining LANG = en_CA.UTF-8 or maybe fr_CA.UTF-8
> on your local machine. locale -a will give you a list of available choices.
> xterm may have something to set as well, I don't use it so I don't know.


I have all of my locales set to en_US.UTF-8 and Terminal's default also
set to UTF-8. Most of the time it works pretty well. There are programs
that feel obligated to use octal for ALL non-ASCII characters. And
there is an occasional odd inconsistency, such as:

iMac:~ wgroleau$ touch X=㌳䑄啕∢晦睷袈香ꪪ==X
iMac:~ wgroleau$ ls X*
X=㌳䑄啕∢晦睷袈香?==X
iMac:~ wgroleau$ ls -lat | head -6
total 4133344
drwxr-x--- 732 wgroleau staff 24888 Dec 19 22:39 .
-rw-r--r-- 1 wgroleau staff 72 Dec 19 22:39 .signature
drwx------ 15 wgroleau staff 510 Dec 19 22:37 .dropbox
-rw-r--r-- 1 wgroleau staff 0 Dec 19 22:37 X=㌳䑄啕∢晦睷袈香ꪪ==X
drwx------ 2 wgroleau staff 68 Dec 19 22:07 .Trash

If you don't see at least one Chinese character there, YOUR newsreader
doesn't honor UTF-8 encoding headers.

There was only ONE equal sign at each end. What was before it was
U+AAAA which is apparently not available on my Mac (you can see what it
looks like on <http://www.unicode.org/charts/PDF/UAA80.pdf>).

Before that was U+9999 or 香

In Terminal, the 'ls -lat' did not have the extra equal sign, and for
the low vo, it had the glyph meaning HUH?!?

I find it interesting that 'ls' _ass_umed_ that Terminal couldn't handle
it and replaced it with ?= but when piped, left it alone and 'head'
displayed it the same as the shell (and pasting it into Thunderbird
changed the single glyph into a different one and an equal sign!)

--
Wes Groleau

“Statistics are like bikinis.
What they reveal is suggestive,
but what they conceal is vital.”
— Aaron Levenstein

Wes Groleau

Dec 19, 2012, 11:17:45 PM
On 12-19-2012 13:24, JF Mezei wrote:
> Both Textedit and the xwindows nedit display the html line as:
>
> <span id="ctl00_ContentMain_gvData_ctl04_lblIntervenor">Laperrière,
> Rachel</span><br />

When you don't tell TextEdit what encoding to use, it uses substantially
the same methods as described in 'man file'
(Or the O.S. does and tells it the result)

In this case, the è was farther into the file than the characters used
to guess, i.e., the guessing code only saw ASCII characters. In that
case there is no guess. You must have TextEdit's default as ISO Latin 1.

è is from decoding the UTF-8 bytes of è by Latin 1 rules.

I set my default to UTF-8. So for me TextEdit usually gets it
right. If your HTML file had been encoded in Latin 1, my TextEdit
would be trying to use the default UTF-8 and when it got to that
character it would have said "Cannot open file. Not UTF-8"

And then I would use the File->Open command to specify Latin 1
(always my second choice because ASCII is a subset of UTF-8 and
Latin 1 is the most common non-ASCII format). And since Latin 1 includes
ALL bytes, if it isn't Latin 1, but it is any eight-bit superset of
ASCII, I can see a good part of it correctly.
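The "UTF-8 first, Latin 1 second" routine described above can be sketched in a few lines of Python. This is an illustration of the strategy only, not how TextEdit is implemented, and `decode_with_fallback` is a name chosen for the example.

```python
def decode_with_fallback(raw):
    """Try strict UTF-8 first; fall back to Latin-1, which never fails."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode("latin-1"), "latin-1"


utf8_bytes = "Laperri\u00e8re".encode("utf-8")
latin1_bytes = "Laperri\u00e8re".encode("latin-1")

print(decode_with_fallback(utf8_bytes))
print(decode_with_fallback(latin1_bytes))
```

The order matters: Latin-1 accepts every byte sequence, so trying it first would never give UTF-8 a chance, whereas strict UTF-8 rejects most text that was really written in an eight-bit encoding.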

--
Wes Groleau

"What progress we are making! In the Middle Ages, they would have
burnt me; nowadays they are content with burning my books.”
— Sigmund Freud, 1933
"He was never to know that even that was only an illusory progress,
that ten years later they would have burned his body as well.”
— Ernest Jones, 1953

Wes Groleau

Dec 19, 2012, 11:28:01 PM
The shell (or 'ls') recognizes that the file name is UTF-8 and that the
locale can't handle that. So it substitutes question marks. It is a
bug however (or an artifact of an odd locale) that it puts two of them
for a two-byte encoding of ONE character.

I have all three encodings on "open and save" prefs for TextEdit set to
UTF-8 and "plain text" selected instead of RTF on the New Doc tab.
There may be some hidden prefs inherited from earlier versions of
TextEdit that had many more Pref tabs. But this works 99+% of the time
for me.

And Terminal handles almost everything when my locale is

iMac:~ wgroleau$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

Maybe that last one explains the remaining less than one percent. :-)


--
Wes Groleau

Pat's Polemics
http://Ideas.Lang-Learn.org/barrett

Wes Groleau

Dec 19, 2012, 11:34:43 PM
On 12-19-2012 17:20, Lewis wrote:
> I don't do X11/xterm. It's always seemed a very limited and kludgy
> solution.

Although I'm sure that folks have made improvements over time, one might
benefit by tempering one's expectations with the knowledge that xterm
and X11 are older than Unicode and UTF-8.

Wes Groleau

Dec 19, 2012, 11:38:31 PM
On 12-19-2012 14:50, Richard Kettlewell wrote:
> TextEdit stores the encoding in an extended attribute, which you can
> retrieve with xattr.
>
> $ hexdump -C utf8.txt
> 00000000 66 69 6c 65 20 77 69 74 68 20 61 20 c2 a3 20 73 |file with a .. s|
> 00000010 69 67 6e 0a |ign.|
> 00000014
> $ hexdump -C wlatin1.txt
> 00000000 66 69 6c 65 20 77 69 74 68 20 61 20 a3 20 73 69 |file with a . si|
> 00000010 67 6e 0a |gn.|
> 00000013
> $ xattr -l utf8.txt
> com.apple.TextEncoding: utf-8;134217984
> $ xattr -l wlatin1.txt
> com.apple.TextEncoding: windows-1252;1280

This is nice for me to know. I guess I can modify my earlier post to
say "If the xattr is not present, TextEdit will use the 'file' method of
guessing...."

> (The decimal value is an OSX-specific identifier for the encoding.)
>
> But this is not a general answer; many other programs will neither set
> the attribute nor pay any attention to it. For example, most Unix
> programs will either ignore the question entirely or assume that
> encoding implied by the LC_CTYPE locale setting holds for all files.

And it is reasonable to assume that a program that ignores it won't set
it either.

--
Wes Groleau

¡Qué quiero realmente hacer es comer un perrito caliente!
私が実際にしたいと思う何をホットドッグを食べることである!
http://Ideas.Lang-Learn.org/WWW?itemid=463

JF Mezei

Dec 20, 2012, 12:04:25 AM
On 12-12-19 20:35, bi...@MIX.COM wrote:
> Here's a handy web test page I stumbled upon today -
>
> UTF-8 Sampler
> http://www.columbia.edu/~fdc/utf8/

hey ! I had dealings with Frank da Cruz back in the heyday of Kermit on
VMS ! Very smart and nice fella !

For displaying the fonts on screen, my guess is that if I have latin-1
encoded fonts, they may have problems displaying accented characters that
are UTF-8.

However, that doesn't explain PHP getting UTF-8 data from the net and
not handling it as UTF-8 when processing.



Wes Groleau

Dec 20, 2012, 7:26:25 PM
On 12-20-2012 05:34, Tim Streater wrote:
> JF Mezei <jfmezei...@vaxination.ca> wrote:
>> However, that doesn't explain PHP getting UTF-8 data from the net and
>> not handling it as UTF-8 when processing.
>
> PHP doesn't do anything with it. It's a byte stream.

That's an over-simplification. In PHP, perl, and almost any other
language today, operations on strings or characters know what a
character is according to the encoding rules they are using.

They know that the length of "ᄑ∢㌳䑄啕晦睷袈香" is nine, but the size
is 27 bytes in UTF-8.
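The character-count versus byte-count distinction is easy to check. The sketch below uses Python, where `len` on a string counts code points; in PHP the analogous pair would be `mb_strlen` (with the mbstring extension) for characters and `strlen` for bytes. The sample string is written with escapes so the code points are unambiguous: U+1111 through U+9999, each of which needs three bytes in UTF-8.

```python
s = "\u1111\u2222\u3333\u4444\u5555\u6666\u7777\u8888\u9999"

chars = len(s)                    # number of code points
size = len(s.encode("utf-8"))     # number of bytes once encoded

print(chars, size)
```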


--
Wes Groleau

After the christening of his baby brother in church, Jason sobbed
all the way home in the back seat of the car. His father asked him
three times what was wrong. Finally, the boy replied, “That preacher
said he wanted us brought up in a Christian home, and I wanted to
stay with you guys."