Convert UTF-8

YOUNG

Dec 16, 2008, 8:05:14 PM

Hi,

I have Emacs 22.3.1 on Windows XP, and a file encoded in ASCII. I am
trying to read the file and convert it to UTF-8 with Emacs.

I have tried

M-x set-buffer-file-coding-system

and selected utf-8. The mode line then showed 'u', and since the buffer
was now modified, it showed '**' as well.

So I saved the file with "C-x s".

That seemed to work. I then exited Emacs, started it again, and read the
file once more. However, the file had not been converted at all.

Here is the output of "describe-current-coding-system":

----------------------

Coding system for saving this buffer:
- -- undecided-dos

Default coding system (for new files):
u -- mule-utf-8 (alias: utf-8)

Coding system for keyboard input:
* -- cp1252 (alias of windows-1252)

Coding system for terminal output:
* -- cp1252 (alias of windows-1252)

Defaults for subprocess I/O:
decoding: u -- mule-utf-8-dos

encoding: u -- mule-utf-8-unix

Priority order for recognizing coding systems when reading files:
1. mule-utf-8 (alias: utf-8)
2. iso-latin-1 (alias: iso-8859-1 latin-1)
3. mule-utf-16be-with-signature (alias: utf-16be-with-signature mule-utf-16-be utf-16-be)
4. mule-utf-16le-with-signature (alias: utf-16le-with-signature mule-utf-16-le utf-16-le)
5. iso-2022-jp (alias: junet)
6. iso-2022-7bit
7. iso-2022-7bit-lock (alias: iso-2022-int-1)
8. iso-2022-8bit-ss2
9. emacs-mule
10. raw-text
11. japanese-shift-jis (alias: shift_jis sjis cp932)
12. chinese-big5 (alias: big5 cn-big5 cp950)
13. no-conversion

Other coding systems cannot be distinguished automatically
from these, and therefore cannot be recognized automatically
with the present coding system priorities.

The following are decoded correctly but recognized as iso-2022-7bit-lock:
iso-2022-7bit-ss2 iso-2022-7bit-lock-ss2 iso-2022-cn iso-2022-cn-ext
iso-2022-jp-2 iso-2022-kr

Particular coding systems specified for certain file names:

OPERATION    TARGET PATTERN                       CODING SYSTEM(s)
---------    --------------                       ----------------
File I/O     "\\.dz\\'"                           (no-conversion . no-conversion)
             "\\.g?z\\(~\\|\\.~[0-9]+~\\)?\\'"    (no-conversion . no-conversion)
             "\\.tgz\\'"                          (no-conversion . no-conversion)
             "\\.tbz\\'"                          (no-conversion . no-conversion)
             "\\.bz2\\(~\\|\\.~[0-9]+~\\)?\\'"    (no-conversion . no-conversion)
             "\\.Z\\(~\\|\\.~[0-9]+~\\)?\\'"      (no-conversion . no-conversion)
             "\\.elc\\'"                          (emacs-mule . emacs-mule)
             "\\.utf\\(-8\\)?\\'"                 utf-8
             "\\(\\`\\|/\\)loaddefs.el\\'"        (raw-text . raw-text-unix)
             "\\.tar\\'"                          (no-conversion . no-conversion)
             "\\.po[tx]?\\'\\|\\.po\\."           po-find-file-coding-system
             "\\.\\(tex\\|ltx\\|dtx\\|drv\\)\\'"  latexenc-find-file-coding-system
             ""                                   find-buffer-file-type-coding-system
Process I/O  "[pP][lL][iI][nN][kK]"               (undecided-dos . undecided-dos)
             "[cC][mM][dD][pP][rR][oO][xX][yY]"   (undecided-dos . undecided-dos)
Network I/O  nothing specified
----------------------

Do you know how to convert a file to UTF-8 using emacs, please?

Andreas Politz

Dec 16, 2008, 9:27:34 PM

YOUNG wrote:
> Hi,
>
> I have a Emacs 22.3.1 for Windows XP, and there is a file encoded in
> ASCII. I am trying to read the file and convert it to UTF-8 with
> emacs.
>

If I am not mistaken, converting an ASCII file to UTF-8 is an identity operation,
since the latter is backward compatible with the former. So there would be nothing
to convert.

-ap

Harald Hanche-Olsen

Dec 17, 2008, 2:54:29 AM

+ Andreas Politz <pol...@fh-trier.de>:

> YOUNG wrote:
>>
>> I have a Emacs 22.3.1 for Windows XP, and there is a file encoded in
>> ASCII. I am trying to read the file and convert it to UTF-8 with
>> emacs.
>>
>
> If I am not mistaken, converting a ASCII file to UTF-8 is an identity
> operation, since the later is backwards compatible to the former. So
> there would be nothing to convert.

You are not at all mistaken of course, but many people take "ASCII" to
mean their favourite eight bit character set (typically Latin 1 or 9 in
western Europe).

But since the OP reports no change to his files, maybe they really were
proper ASCII to begin with. Or maybe he is confused about how to make
emacs use UTF-8 when loading the file? If so, he could do worse than
read the emacs info file, node "Recognize coding".

--
* Harald Hanche-Olsen <URL:http://www.math.ntnu.no/~hanche/>
- It is undesirable to believe a proposition
when there is no ground whatsoever for supposing it is true.
-- Bertrand Russell

YOUNG

Dec 17, 2008, 3:41:47 AM

On Dec 16, 11:54 pm, Harald Hanche-Olsen <han...@math.ntnu.no> wrote:
> + Andreas Politz <poli...@fh-trier.de>:

Well, I have no problem loading a UTF-8 file with Emacs at all.

The problem is that Emacs is not able to write UTF-8 at all.

For example, if a file is encoded in ASCII (or CP437, or ISO 8859, or
Latin 1 to 9; there are various names for it, but you already know what
I mean), I use M-x set-buffer-file-coding-system to select utf-8 for
writing, and then save the file. After that, I exit Emacs, start it
again, and read the saved file, expecting it to be UTF-8, but it is read
as ASCII again. This does not mean Emacs can't read UTF-8; rather, the
file itself is not encoded in UTF-8. I checked the file's encoding with
other applications such as Notepad++ and other editors, and all of them
say the file is still ASCII even though I wrote it as utf-8 in Emacs.

Again, there is no problem reading UTF-8. When a file is already encoded
in UTF-8, Emacs reads and writes it as UTF-8. That's good. However,
Emacs is not able to write UTF-8 if the file is encoded in ASCII. It
only writes ASCII no matter how I use "set-buffer-file-coding-system".

So, if somebody knows about this issue and how to write UTF-8 correctly
when a file is encoded in ISO 8859 (or CP437 or ASCII), I would
appreciate it if you shared the information.

Thanks,

Thierry Volpiatto

Dec 17, 2008, 4:59:46 AM

to help-gn...@gnu.org

YOUNG <brea...@gmail.com> writes:

I was using iso-8859-15 before switching my system to utf-8.
I just add this to my files (note that it is not -*- utf-8 encoding -*-):

,----
| # -*- coding: utf-8 -*-
`----

instead of

,----
| # -*- coding: iso-8859-15 -*-
`----

--
A + Thierry Volpiatto
Location: Saint-Cyr-Sur-Mer - France

Peter Dyballa

Dec 17, 2008, 5:43:36 AM

to YOUNG, help-gn...@gnu.org

Am 17.12.2008 um 02:05 schrieb YOUNG:

> I have a Emacs 22.3.1 for Windows XP, and there is a file encoded in
> ASCII. I am trying to read the file and convert it to UTF-8 with
> emacs.

You could also try:

(prefer-coding-system 'utf-8)

It's a global option. The best way to control GNU Emacs' behaviour in
this area is through environment variables such as LANG or LC_CTYPE that
name a UTF-8 based locale. I don't know how this is handled in MS Losedos.

From the Options menu, choose Mule and then "Set Coding Systems". There,
"For Next Command" (C-x RET c) lets you specify the coding system to use
when reading the file.

--
Greetings

Pete

Bake pizza not war!

Giorgos Keramidas

Dec 17, 2008, 6:17:04 AM

On Wed, 17 Dec 2008 00:41:47 -0800 (PST), YOUNG <brea...@gmail.com> wrote:
> Well, I have no problem to load UTF-8 file with emacs at all.
>
> The problem is that emacs is not able to write UTF-8 at all.
>
> For example, if a file is encoded in ASCII (or, CP437, or ISO 8859 or
> Latin 1 to 9; there are various aliases to indicating of it, but you
> already know what it means.), I set it up with M-x set-buffer-file-
> coding-system for writing utf-8 encoding. And, write (or save) it.
> After that, exit the emacs and re-run it again, and try to read the
> saved file to be expected UTF-8 encoding, but it reads again in ASCII.
> It does not mean emacs can't read utf-8, but the file itself is not
> encoded UTF-8. I check the file's encoding system with other
> application like NotePAD++ or other editors, and all say the file is
> still ASCII mode even though I write it as utf-8 in emacs.

ASCII contains only 7-bit characters. All the characters of the 7-bit
ASCII character set map to themselves in the UTF-8 coding system.

This means that when a file contains only characters from the ASCII
character set no conversion at all is needed from UTF-8 to ASCII or vice
versa.

If you set the buffer-file-coding-system to UTF-8 *and* type some text
that requires at least 8 bits to be represented correctly in UTF-8 (that
is, any non-ASCII character), then the file will be saved with multibyte
UTF-8 sequences that other tools can recognize.
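
A minimal Python sketch of this point (the sample strings are made up
for illustration): pure-ASCII text produces the same bytes under both
encodings, while a single accented character makes them differ.

```python
# Pure-ASCII text: the ASCII and UTF-8 encodings yield identical bytes,
# so "converting" such a file changes nothing on disk.
ascii_text = "Hello, Emacs"
ascii_bytes = ascii_text.encode("ascii")
utf8_bytes = ascii_text.encode("utf-8")
same_for_ascii = ascii_bytes == utf8_bytes

# Text with one non-ASCII character (U+00E9, e with acute accent):
# Latin-1 stores it as one byte, UTF-8 as a two-byte sequence.
accented = "caf\u00e9"
latin1_bytes = accented.encode("latin-1")
utf8_accented = accented.encode("utf-8")

print(same_for_ascii)
print(latin1_bytes, utf8_accented)
```

Only in the second case is there anything in the file for another
program to recognize as UTF-8.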

> Again, there is no problem in reading utf-8. When a file is encoded
> utf-8 correctly, emacs reads/writes it in utf-8. It's good. However,
> emacs is not able to write utf-8 if the file is encoded in ASCII. It
> only writes in ASCII encode no matter how I do
> "set-buffer-file-coding- system"
>
> So, if somebody knows this issue and how to write utf-8 correctly when
> a file is encoded in ISO8859 (or CP437 or ASCII), and if you share the
> information, it would be appreciated.

CP437 is very different from plain ASCII. It contains 8-bit characters
and there are other differences in the 0x00 - 0x1F code range. If you
ignore the 0x00-0x1F character differences you might be able to say that
CP437 is a 'superset' of ASCII, but they are not the same thing.
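
To illustrate why the source encoding matters, here is a small Python
sketch (the sample bytes are invented, and it deliberately stays out of
the 0x00-0x1F range): the same 8-bit byte decodes to different
characters under ISO 8859-1 and CP437, so the resulting UTF-8 differs.

```python
# "caf" followed by one byte with the high bit set.
raw = bytes([0x63, 0x61, 0x66, 0xE9])

as_latin1 = raw.decode("latin-1")  # 0xE9 is U+00E9 (e acute) in ISO 8859-1
as_cp437 = raw.decode("cp437")     # 0xE9 is U+0398 (Greek Theta) in CP437

# The UTF-8 result depends on which source encoding was assumed.
utf8_from_latin1 = as_latin1.encode("utf-8")
utf8_from_cp437 = as_cp437.encode("utf-8")

# The printable 7-bit range decodes identically under all three.
printable = bytes(range(0x20, 0x7F))
same_printable = (
    printable.decode("ascii")
    == printable.decode("latin-1")
    == printable.decode("cp437")
)

print(as_latin1, as_cp437, same_printable)
```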

Xah Lee

Dec 17, 2008, 7:04:17 AM

On Dec 17, 12:41 am, YOUNG <breadn...@gmail.com> wrote:
> Well, I have no problem to load UTF-8 file with emacs at all.
>
> The problem is that emacs is not able to write UTF-8 at all.
>
> For example, if a file is encoded in ASCII (or, CP437, or ISO 8859 or
> Latin 1 to 9; there are various aliases to indicating of it, but you
> already know what it means.), I set it up with M-x set-buffer-file-
> coding-system for writing utf-8 encoding. And, write (or save) it.
> After that, exit the emacs and re-run it again, and try to read the
> saved file to be expected UTF-8 encoding, but it reads again in ASCII.
> It does not mean emacs can't read utf-8, but the file itself is not
> encoded UTF-8. I check the file's encoding system with other
> application like NotePAD++ or other editors, and all say the file is
> still ASCII mode even though I write it as utf-8 in emacs.
>
> Again, there is no problem in reading utf-8. When a file is encoded
> utf-8 correctly, emacs reads/writes it in utf-8. It's good. However,
> emacs is not able to write utf-8 if the file is encoded in ASCII. It
> only writes in ASCII encode no matter how I do "set-buffer-file-coding-
> system"
>
> So, if somebody knows this issue and how to write utf-8 correctly when
> a file is encoded in ISO8859 (or CP437 or ASCII), and if you share the
> information, it would be appreciated.
>
> Thanks,

As others have mentioned, UTF-8 is a superset of ASCII, so a file that
contains only ASCII characters is byte-for-byte identical in either
encoding.

You mentioned ISO 8859, which is not ASCII. I read your two posts, but
don't quite understand what you wanted.

For some tips on Unicode in Emacs, you might check out:

• Emacs and Unicode Tips
http://xahlee.org/emacs/emacs_n_unicode.html

You might also beef up your understanding of character encodings:

http://en.wikipedia.org/wiki/ISO8859
http://en.wikipedia.org/wiki/ASCII
http://en.wikipedia.org/wiki/UTF-8

Xah
∑ http://xahlee.org/

☄

YOUNG

Dec 18, 2008, 3:35:18 AM

Hi,

Finally, I know what the problem is. Thank you all for helping with this
issue.

I am not an expert on encodings, so I appreciate the opportunity to
learn about them.

The problem is the BOM (Byte Order Mark). For UTF-8 it is usually
omitted, since a BOM at the start of the file can cause conflicts when
special characters are expected there, such as '#!' in a Unix shell
script. Therefore, if the text contains nothing that needs at least 8
bits to be represented in UTF-8, the encoding is treated as undefined or
ASCII in Emacs (I am not sure that is the right term, but let's call it
ASCII for convenience).

I conclude that Emacs does not have a feature for writing a BOM in
UTF-8; it only supports UTF-8 without a BOM. So I can understand why the
text was not written as UTF-8 when it does not contain any actual
non-ASCII characters. If the text contains a non-ASCII character and is
saved as utf-8, then there is no problem writing UTF-8 without a BOM.
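
A rough Python sketch of how a tool might label a file from its bytes
alone, which would explain the "still ASCII" reports. The function name
and labels are hypothetical, and this is not the actual logic of
Notepad++ or any other editor:

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Hypothetical labeling heuristic; real editors use their own rules."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    if all(byte < 0x80 for byte in data):
        # Nothing distinguishes this from BOM-less UTF-8.
        return "ascii"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown 8-bit encoding"


print(guess_encoding("plain text".encode("utf-8")))
print(guess_encoding("caf\u00e9".encode("utf-8")))
print(guess_encoding("caf\u00e9".encode("latin-1")))
print(guess_encoding("plain text".encode("utf-8-sig")))
```

With a heuristic like this, a 7-bit-only file saved as UTF-8 without a
BOM can only ever be reported as ASCII.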

Detailed information about Unicode and the BOM can be found at
http://unicode.org/faq/utf_bom.html

Thank you,

Harald Hanche-Olsen

Dec 18, 2008, 9:56:52 AM

+ YOUNG <brea...@gmail.com>:

> I could conclude emacs does not have the feature of having BOM in
> utf-8. It only supports utf-8 without BOM.

Not true. But you have to put the BOM (ZERO WIDTH NO-BREAK SPACE,
really) there yourself, since otherwise as you noted (in the elided
text) it can play havoc with shell scripts etc. If you want, e.g., every
file that is visited in text mode to start with a BOM you can add a hook
function to before-save-hook that ensures this before saving.

Also, at least the emacsen I am currently using (version 23 from CVS)
will recognize an initial BOM and automagically pick the utf-8 encoding
when it sees the corresponding three bytes at the top of the file.
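
The three bytes in question are EF BB BF, the UTF-8 encoding of U+FEFF.
A minimal Python sketch, using Python's standard "utf-8-sig" codec as a
stand-in for an editor that writes the BOM (the sample text is made up),
shows what gets prepended and why a '#!' line is then no longer first:

```python
import codecs

text = "#!/bin/sh"

plain = text.encode("utf-8")         # no BOM
with_bom = text.encode("utf-8-sig")  # BOM prepended

bom = codecs.BOM_UTF8
starts_with_bom = with_bom.startswith(bom)
shebang_still_first = with_bom.startswith(b"#!")

# Decoding with "utf-8-sig" strips a leading BOM if one is present.
round_trip = with_bom.decode("utf-8-sig")
# Decoding with plain "utf-8" keeps it as U+FEFF at the start of the text.
kept = with_bom.decode("utf-8")

print(bom, starts_with_bom, shebang_still_first)
```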

> Detailed information about unicode and BOM is found in
> http://unicode.org/faq/utf_bom.html

The use of zero width no-break space as a marker to indicate coding is
also widely regarded as unwise. I am too lazy to find any of the
references that will support my claim, so take it with a grain of salt
if you will.