those funny non-ASCII characters
Newsgroups: gnu.emacs.help
From: "Buchs, Kevin" <buchs.ke...@mayo.edu>
Date: Thu, 24 May 2012 18:49:29 -0500
Subject: those funny non-ASCII characters
I often paste content from web pages into an emacs org-mode buffer and I
get the odd quote characters or dashes that are not ASCII. I created a
lisp function to remove the unicode ones that are just 8 bits. Lately I
am seeing that there are characters that are not being caught. They show
up in emacs as the expected character. When I kill/yank them into lisp
code, they are not being found. When I save the buffer, I am asked for
coding and chose raw text. When the file is opened again, these
characters are showing up as some sort of special symbol (dashed circle
with flag off the top) followed by doubles/triples of \2xx. For example,
the dash character I just stored was this sequence: circle-flag \200
\231. Using GNU/Linux od to dump them I get octal byte values such as: 340
245 206 340 244 206 210 200 and for the dash mentioned above 342 200 231.

I am very naive in regard to coding, so please excuse my ignorance. I
would guess these are 16-bit (Unicode16) characters. Can someone
enlighten me as to how I can determine what these characters are (after
pasted into a buffer) and how I can code a function to replace them with
ASCII equivalents? The only thing I could think of was hexl mode, but
that didn't turn out well. Thanks.
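
The byte values in the dump above can be checked from Lisp. What follows is only a sketch, assuming the numbers are in od's default octal notation and that the saved text is UTF-8; my-decode-octal-bytes is a hypothetical helper name, not an Emacs built-in.

```elisp
;; Sketch only: assumes octal input (od's default) and UTF-8 text.
;; `my-decode-octal-bytes' is a hypothetical name used for illustration.
(defun my-decode-octal-bytes (octal-strings)
  "Decode OCTAL-STRINGS, a list like (\"342\" \"200\" \"231\"), as UTF-8.
Return a list of (CODEPOINT . NAME) pairs, one per decoded character."
  (let* ((bytes (mapcar (lambda (s) (string-to-number s 8)) octal-strings))
         (text (decode-coding-string (apply #'unibyte-string bytes) 'utf-8)))
    (mapcar (lambda (ch) (cons ch (get-char-code-property ch 'name)))
            (string-to-list text))))

;; The three bytes reported for the "dash":
(my-decode-octal-bytes '("342" "200" "231"))
```

Under those assumptions the three bytes 342 200 231 decode to a single character, U+2019, which Unicode names RIGHT SINGLE QUOTATION MARK.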

Kevin Buchs | Senior Engineer | SPPDG | 507-538-5459 |
buchs.ke...@mayo.edu
Mayo Clinic | 200 First Street SW | Rochester, MN 55905 |
http://www.mayo.edu/sppdg


 
Newsgroups: gnu.emacs.help, comp.emacs
From: Xah Lee <xah...@gmail.com>
Date: Thu, 24 May 2012 17:56:59 -0700 (PDT)
Subject: Re: those funny non-ASCII characters
On May 24, 4:49 pm, "Buchs, Kevin" <buchs.ke...@mayo.edu> wrote:

better to embrace unicode than fight it.

what encoding you get when you paste is rather complex. I guess it
depends on the source you copy from, as each web page can be in a
different charset and encoding, and i'm not sure whether your OS does
some translation in the pasteboard.

maybe this will help.

〈Emacs File/Character Encoding/Decoding FAQ〉
http://xahlee.org/emacs/emacs_encoding_decoding_faq.html

〈Xah's Unicode Tutorial〉
http://xahlee.org/Periodic_dosage_dir/unicode.html

to replace non-ascii, you can use the regex

[[:nonascii:]]+

〈Char Classes - GNU Emacs Lisp Reference Manual〉
http://xahlee.org/emacs_manual/elisp/Char-Classes.html

〈Emacs Lisp: Convert Unicode String to ASCII (Zap Gremlins)〉
http://xahlee.org/emacs/emacs_zap_gremlins.html
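
As a rough illustration of the [[:nonascii:]] regex mentioned above, here is a small sketch. The function names are hypothetical, and replacing every run with "?" is just one possible policy, not something the linked pages prescribe.

```elisp
;; Sketch only; `my-flag-nonascii' and `my-next-nonascii' are
;; hypothetical names chosen for this example.
(defun my-flag-nonascii (string)
  "Return STRING with each run of non-ASCII characters replaced by \"?\"."
  (replace-regexp-in-string "[[:nonascii:]]+" "?" string t t))

(defun my-next-nonascii ()
  "Move point past the next run of non-ASCII characters, if any."
  (interactive)
  (if (re-search-forward "[[:nonascii:]]+" nil t)
      (message "Found: %s" (match-string 0))
    (message "No non-ASCII characters after point")))

;; An en dash followed by an em dash counts as one run:
(my-flag-nonascii (concat "a " (string #x2013 #x2014) " b"))  ; => "a ? b"
```

The interactive command is handy for stepping through a pasted buffer to see which characters are actually there before deciding what to replace them with.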

 Xah


 
Newsgroups: gnu.emacs.help
From: Eli Zaretskii <e...@gnu.org>
Date: Fri, 25 May 2012 09:36:20 +0300
Subject: Re: those funny non-ASCII characters

> Date: Thu, 24 May 2012 18:49:29 -0500
> From: "Buchs, Kevin" <buchs.ke...@mayo.edu>

> I am very naive in regard to coding, so please excuse my ignorance. I
> would guess these are 16-bit (Unicode16) characters. Can someone
> enlighten me as to how I can determine what these characters are (after
> pasted into a buffer)

With cursor on that character, type "C-u C-x =", and Emacs will show
everything it knows about that character, including its canonical
name.
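
The same lookup can be done from Lisp. This is a minimal sketch, with my-char-at-point-info as a hypothetical command name; it relies on get-char-code-property, which reads the Unicode data shipped with Emacs.

```elisp
;; Sketch only; `my-char-at-point-info' is a hypothetical name.
(defun my-char-at-point-info ()
  "Echo the code point and Unicode name of the character at point."
  (interactive)
  (let ((ch (char-after)))
    (if ch
        (message "U+%04X %s"
                 ch
                 (or (get-char-code-property ch 'name) "(no name)"))
      (message "No character at point"))))
```

With point on an en dash this should echo "U+2013 EN DASH", a much shorter report than the full "C-u C-x =" buffer.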

 
Newsgroups: gnu.emacs.help
From: "Buchs, Kevin" <buchs.ke...@mayo.edu>
Date: Fri, 25 May 2012 08:40:25 -0500
Subject: Re: those funny non-ASCII characters
Thanks, Xah and Eli, for contributing to my further understanding. I
went to a specific website where I got the content I copied and pasted
and I can see from the HTML that it has a charset=UTF-8, so I understand
that is Unicode 8-bit. Using the C-u C-x =, I see that the particular
character I pasted has a code point of 0x2013 (U+2013). I didn't see,
however, what the UTF-8 encoding of that code point was. Should I be
able to read that somewhere on the buffer of information I get with C-u
C-x = ? I was poking around the www.unicode.org website, trying to
understand how this U+2013 code point is encoded into UTF-8, but I
haven't determined that yet.

A fresh buffer in emacs for me on my Win-7 box has an encoding system of
iso-latin-1-dos. The coding system used to open and save files is the
same.

So, help me piece together what happens as I paste the UTF-8 text into a
buffer. First, the paste buffer must define that it is in UTF-8. Emacs
reads this information and inserts it into the byte string that defines
the buffer. Now, how does emacs record that it was a UTF-8 encoded
character? Does it translate it into a different internal encoding
instead of just recording the 8 bits transferred? Is this encoding used
as a superset of all possible encoding systems that emacs supports?

Now,  Xah, you suggest I embrace Unicode. What does that mean? Would it
involve marking all my lisp library files and my org-mode files with the
file variable -*- coding: utf-8 -*- ? Or is there another way to go
Unicode automatically?

I assume that if my lisp library files are encoded utf-8, then I can
paste that character from the web page into my call to replace-string in
order to substitute the longer dash of Unicode U+2013 with an ascii
hyphen or double hyphen. But, how does that really work? If the lisp
file is encoded utf-8, then how can I put an ascii character in the
replacement string?

I would appreciate it if someone could help me open this new door in my
brain a bit further.

Kevin Buchs | Senior Engineer | SPPDG | 507-538-5459 |
buchs.ke...@mayo.edu
Mayo Clinic | 200 First Street SW | Rochester, MN 55905 |
http://www.mayo.edu/sppdg


 
Newsgroups: gnu.emacs.help
From: Eli Zaretskii <e...@gnu.org>
Date: Fri, 25 May 2012 17:04:00 +0300
Subject: Re: those funny non-ASCII characters

> Date: Fri, 25 May 2012 08:40:25 -0500
> From: "Buchs, Kevin" <buchs.ke...@mayo.edu>

> Thanks, Xah and Eli, for contributing to my further understanding. I
> went to a specific website where I got the content I copied and pasted
> and I can see from the HTML that it has a charset=UTF-8, so I understand
> that is Unicode 8-bit. Using the C-u C-x =, I see that the particular
> character I pasted has a code point of 0x2013 (U+2013). I didn't see,
> however, what the UTF-8 encoding of that code point was. Should I be
> able to read that somewhere on the buffer of information I get with C-u
> C-x = ?

Yes, this part of "C-u C-x ="'s display:

            file code: #xE2 #x80 #x93 (encoded by coding system utf-8-dos)

shows you how it would be encoded in UTF-8.  If you see something like
"not encodable by ...", then you need to set the buffer's encoding
using "C-x RET f".  Under "file code", Emacs shows how the character
would be encoded if the buffer is saved to a disk file or sent to
another program or as an email message.
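
The "file code" bytes can also be computed directly. A small sketch, assuming UTF-8 is the encoding of interest; my-utf8-bytes is a hypothetical helper name.

```elisp
;; Sketch only; `my-utf8-bytes' is a hypothetical name.
(defun my-utf8-bytes (char)
  "Return the UTF-8 encoding of CHAR as a string such as \"#xE2 #x80 #x93\"."
  (mapconcat (lambda (byte) (format "#x%02X" byte))
             (encode-coding-string (string char) 'utf-8)
             " "))

(my-utf8-bytes #x2013)  ; => "#xE2 #x80 #x93"
```

The result for U+2013 matches the three bytes shown in the "file code" line quoted above.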

> I was poking around the www.unicode.org website, trying to
> understand how this U+2013 code point is encoded into UTF-8, but I
> haven't determined that yet.

See above: Emacs shows this under the right circumstances.

> So, help me piece together what happens as I paste the UTF-8 text into a
> buffer. First, the paste buffer must define that it is in UTF-8.

On Windows, Emacs always uses UTF-16 to pass text via the clipboard,
because doing so lets Emacs copy and paste any character from any
character set on Earth.

> Emacs reads this information and inserts it into the byte string
> that defines the buffer. Now, how does emacs record that it was a
> UTF-8 encoded character?

It doesn't.  What it records is the encoding to be used for the
current buffer if it is saved to disk or sent to some program.  That
encoding is a property of the buffer, not of the characters.

> Does it translate it into a different internal encoding

Yes, it does.

> Is this encoding used
> as a superset of all possible encoding systems that emacs supports?

Yes.  See the section "Text Representations" in the ELisp manual that
comes with Emacs, you will find the details there.
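
To make the "encoding is chosen on output, not stored per character" point concrete, here is a sketch that encodes one and the same character with several coding systems. The helper name my-encodings-of and the choice of three coding systems are assumptions made for the example.

```elisp
;; Sketch only; `my-encodings-of' is a hypothetical name.
(defun my-encodings-of (char)
  "Return an alist of (CODING-SYSTEM . BYTES) for CHAR."
  (mapcar (lambda (cs)
            (cons cs (append (encode-coding-string (string char) cs) nil)))
          '(utf-8 utf-16le utf-16be)))

;; One character in the buffer, three different byte sequences on output:
(my-encodings-of #x2013)
;; => ((utf-8 226 128 147) (utf-16le 19 32) (utf-16be 32 19))
```

Inside Emacs the en dash is a single character whichever of these is later used to write it out.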

 
Newsgroups: gnu.emacs.help
From: Jambunathan K <kjambunat...@gmail.com>
Date: Fri, 25 May 2012 20:12:49 +0530
Subject: Re: those funny non-ASCII characters

I think this will help.

  (prefer-coding-system 'utf-8)

--


 
Newsgroups: gnu.emacs.help, comp.emacs
From: Xah Lee <xah...@gmail.com>
Date: Fri, 25 May 2012 11:33:51 -0700 (PDT)
Subject: Re: those funny non-ASCII characters

hope Eli answered all your questions.

here are some additions.

• embrace unicode, because there's just going to be more and more of it.
Programming languages mostly default to unicode by spec (e.g. html, css,
JavaScript, Java, Haskell, …). Most OSes (Windows, Mac) and file
systems default to a unicode encoding now (not sure about linux).
Even emacs, starting with emacs 23, uses unicode as its default internal
encoding.

〈Unicode Popularity on Web by Google〉
http://xahlee.org/comp/unicode_on_web.html

• Unicode is about 2 things: ① a char set with an integer ID for each
char. ② several encodings for the char set, the most popular being utf-8
and utf-16 (the latter is the default on Mac, Windows). (an encoding is a
standard that turns a char from a char set into a byte sequence)

• in emacs, just put this in your init:
(set-language-environment "UTF-8")

that should set all encodings to utf-8, and shouldn't cause you any
problem if all your current files and elisp files are ascii, because
ascii is a compatible subset of utf-8.

• in emacs, call describe-char. That'll show the current char's
encoding as well as the byte sequence used for that particular encoding.
(this is emacs 24. Emacs 23 may not show the byte sequence... i don't
recall.)
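
Putting the init-file suggestions from this thread in one place, a sketch might look like the following. It is only one plausible arrangement; whether it suits a setup with existing Latin-1 files is not something the thread establishes.

```elisp
;; Sketch of an init-file fragment; adjust to taste.

;; Make UTF-8 the language environment, as suggested above.
(set-language-environment "UTF-8")

;; Prefer UTF-8 when Emacs has to guess or choose a coding system,
;; as suggested earlier in the thread.
(prefer-coding-system 'utf-8)

;; A single file can also declare its own encoding with a cookie on
;; its first line, for example in a Lisp file:
;;   ;; -*- coding: utf-8 -*-
```

After evaluating it, "C-h C RET" (describe-coding-system with no argument) lists the coding systems currently in effect, which is a quick way to see what changed.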

my unicode tutorial covers all these… feel free to ask me, or here, of
course.

 Xah

On May 25, 6:40 am, "Buchs, Kevin" <buchs.ke...@mayo.edu> wrote:


 
Newsgroups: gnu.emacs.help
From: "Buchs, Kevin" <buchs.ke...@mayo.edu>
Date: Wed, 30 May 2012 12:15:11 -0500
Subject: RE: those funny non-ASCII characters
I am reposting some of my questions from last Friday (plus a few more),
as I am still seeking assistance and there has been a lot of water over
the dam on this list.

Xah suggested I embrace Unicode. So I could use (prefer-coding-system
'utf-8) or the file variable: -*- coding: utf-8 -*-. Are there drawbacks
to the former? What about opening an ASCII coded file? Can emacs
properly detect it or does it come up as UTF-8? Or is there another way
to go Unicode automatically? If I embrace Unicode, then should I make my
Org-mode files no longer plain text?

I assume that if my lisp library files are encoded utf-8, then I can
paste that UTF-8 character from the web page into my call to
(replace-string ...) in order to substitute the longer dash of Unicode
U+2013 with an ASCII hyphen or double hyphen. But, how does that really
work? If the lisp file is encoded utf-8, then how can I put an ASCII
character in the replacement string? Or do I need to encode the hex
value of the ASCII character(s)?

Kevin Buchs | Senior Engineer | SPPDG | 507-538-5459 |
buchs.ke...@mayo.edu
Mayo Clinic | 200 First Street SW | Rochester, MN 55905 |
http://www.mayo.edu/sppdg


 
Newsgroups: gnu.emacs.help
From: Thien-Thi Nguyen <t...@gnuvola.org>
Date: Thu, 31 May 2012 09:17:00 +0200
Subject: Re: those funny non-ASCII characters
() "Buchs, Kevin" <buchs.ke...@mayo.edu>
() Wed, 30 May 2012 12:15:11 -0500

   I am reposting some of my questions from last Friday (plus a few more),
   as I am still seeking assistance and there has been a lot of water over
   the dam on this list.

Does this mean you are ignoring the previous responses?


 
Newsgroups: gnu.emacs.help
From: "Buchs, Kevin" <buchs.ke...@mayo.edu>
Date: Thu, 31 May 2012 09:57:37 -0500
Subject: RE: those funny non-ASCII characters

> Does this mean you are ignoring the previous responses?

Thien-Thi,

I did not intend to ignore any prior responses. I apologize if I have
missed some. I noted responses from Xah Lee, Eli Zaretskii and
Jambunathan. There was one other, for which I did not record the name.
Have I missed more? Please let me know if I have. I note that I get the
digests of this list.

My reason for reposting is that I did not have the answers to all the
questions I originally asked AND I had some additional questions. Did
you feel like the questions I reposted were in fact answered? If so,
perhaps I misunderstood.

Kevin Buchs | Senior Engineer | SPPDG | 507-538-5459 |
buchs.ke...@mayo.edu
Mayo Clinic | 200 First Street SW | Rochester, MN 55905 |
http://www.mayo.edu/sppdg


 
Newsgroups: gnu.emacs.help
From: PJ Weisberg <pjweisb...@gmail.com>
Date: Thu, 31 May 2012 08:59:39 -0700
Subject: Re: those funny non-ASCII characters

On Wednesday, May 30, 2012, Buchs, Kevin <buchs.ke...@mayo.edu> wrote:
> What about opening an ASCII coded file? Can emacs
> properly detect it or does it come up as UTF-8?

Emacs attempts to determine the correct coding system when it opens a file,
so you shouldn't have to worry about this.

The 128 characters that make up ASCII have the exact same representation in
UTF-8.  "Converting" an ASCII file to UTF-8 is a no-op.  Therefore,
treating an ASCII file as UTF-8 should cause no problems.
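
The no-op claim is easy to check from Lisp. A minimal sketch, with my-ascii-bytes-unchanged-p as a hypothetical helper name:

```elisp
;; Sketch only; `my-ascii-bytes-unchanged-p' is a hypothetical name.
(defun my-ascii-bytes-unchanged-p (string)
  "Return non-nil if STRING has the same bytes in US-ASCII and in UTF-8."
  (equal (encode-coding-string string 'us-ascii)
         (encode-coding-string string 'utf-8)))

(my-ascii-bytes-unchanged-p "plain -- text")  ; => t
```

For text made only of ASCII characters the two encodings produce identical bytes, which is why declaring such a file UTF-8 changes nothing on disk.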

> I assume that if my lisp library files are encoded utf-8, then I can
> paste that UTF-8 character from the web page into my call to
> (replace-string ...) in order to substitute the longer dash of Unicode
> U+2013 with an ASCII hyphen or double hyphen. But, how does that really
> work? If the lisp file is encoded utf-8, then how can I put an ASCII
> character in the replacement string? Or do I need to encode the hex
> value of the ASCII character(s)?

A = A.  The hyphen-minus is a hyphen-minus whether it's in an ASCII file as
00101101 or a UTF-16 file as 0000000000101101.  So, just type it with your
keyboard.
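
For the replacement function asked about earlier in the thread, a sketch could look like this. The table contents and the names my-ascii-replacements and my-asciify-region are assumptions for illustration; code points are written as numbers so the Lisp file itself stays pure ASCII, although typing the characters literally in a UTF-8 file would work as well.

```elisp
;; Sketch only; the names and the table contents are illustrative.
(defvar my-ascii-replacements
  '((#x2013 . "-")     ; EN DASH
    (#x2014 . "--")    ; EM DASH
    (#x2018 . "'")     ; LEFT SINGLE QUOTATION MARK
    (#x2019 . "'")     ; RIGHT SINGLE QUOTATION MARK
    (#x201C . "\"")    ; LEFT DOUBLE QUOTATION MARK
    (#x201D . "\"")    ; RIGHT DOUBLE QUOTATION MARK
    (#x2026 . "...")   ; HORIZONTAL ELLIPSIS
    (#x00A0 . " "))    ; NO-BREAK SPACE
  "Alist mapping Unicode code points to ASCII replacement strings.")

(defun my-asciify-region (start end)
  "Replace characters listed in `my-ascii-replacements' between START and END."
  (interactive "r")
  (save-excursion
    (save-restriction
      (narrow-to-region start end)
      (dolist (pair my-ascii-replacements)
        (goto-char (point-min))
        (let ((from (string (car pair))))
          ;; Literal search and literal replacement: no regexp, no case folding.
          (while (search-forward from nil t)
            (replace-match (cdr pair) t t)))))))
```

Select the pasted text and run M-x my-asciify-region; characters not in the table are left alone, so anything still unexpected can be inspected with "C-u C-x =" and added to the list.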

BTW, I don't know how Xah intended it, but when he said to "embrace
unicode," I interpreted it to mean, "Why don't you just leave em-dashes as
em-dashes instead of replacing them with two hyphen-minuses?"

--
-PJ

Gehm's Corollary to Clark's Law: Any technology distinguishable from
magic is insufficiently advanced.


 
Newsgroups: gnu.emacs.help
From: Thien-Thi Nguyen <t...@gnuvola.org>
Date: Thu, 31 May 2012 18:40:20 +0200
Subject: Re: those funny non-ASCII characters
() "Buchs, Kevin" <buchs.ke...@mayo.edu>
() Thu, 31 May 2012 09:57:37 -0500

   Did you feel like the questions I reposted were in fact answered?
   If so, perhaps I misunderstood.

I am simply ignorant of "water over the dam".
My mistake was not asking it directly:
What do you mean by "water over the dam"?


 
Newsgroups: gnu.emacs.help
From: "Buchs, Kevin" <buchs.ke...@mayo.edu>
Date: Thu, 31 May 2012 11:56:45 -0500
Subject: RE: those funny non-ASCII characters

> I am simply ignorant of "water over the dam".
> My mistake was not asking it directly:
> What do you mean by "water over the dam"?

Thien-Thi,

No problem. It is an expression meaning, in general, that lots of events
have come and gone, and are now passed. It is an analogy to the flowing
of water over a dam in a river, in the sense that once water flows up
over a dam, it is going downstream and has passed the reservoir behind
the dam and presumably passed your field of view.

In this specific instance I was referring to a large number of messages
having been posted to the email list by several people discussing a
topic that Xah Lee brought up. So, the busy-ness of the list made me
think that perhaps there were some people who were going to reply, but
the number of messages coming from the list got to be so large that they
just deleted them including my message or lost my message in inbox
clutter.

I could have applied the analogy of "water over the dam" even further to
say that: "though there were many messages posted on this list that are
now water over the dam, I would like to bring my message back to allow
it to float to the top again."

Kevin Buchs | Senior Engineer | SPPDG | 507-538-5459 |
buchs.ke...@mayo.edu
Mayo Clinic | 200 First Street SW | Rochester, MN 55905 |
http://www.mayo.edu/sppdg


 
Newsgroups: gnu.emacs.help
From: Thien-Thi Nguyen <t...@gnuvola.org>
Date: Thu, 31 May 2012 23:46:25 +0200
Subject: Re: those funny non-ASCII characters
() "Buchs, Kevin" <buchs.ke...@mayo.edu>
() Thu, 31 May 2012 11:56:45 -0500

   [...] once water flows up over a dam, it is going downstream
   and has passed the reservoir behind the dam and presumably
   passed your field of view.

   In this specific instance [...]

OK, thanks.  Now i understand.

   I could have applied the analogy of "water over the dam" even
   further to say that: "though there were many messages posted on
   this list that are now water over the dam, I would like to
   bring my message back to allow it to float to the top again."

The flow of messages is indeed like water.

I suppose everyone relates to this in their own way.

Using GNUS (now Gnus) to read these, i imagine myself an insect
buzzing around an upward turned flow (a geyser), first in summer
when the molecules dissociate quickly, then (later, always later)
in winter when they crystalize shard-like and treed, sometimes
under a brilliant sun refracted as rainbows, sometimes under a
brilliant moon that ghostly glows, sometimes in darkness lit only
by lucky grep rows.  A drip gleaned here and there for sustenance,
a drop left there and here for assonance, the rest left to what
entropy can penetrate the disks of gmane.

Anyway, Unicode is ASCII-compatible, so probably if you wrangle
your environment to Unicode by default, Emacs will also DTRT.
Check out <http://www.utf8everywhere.org>.  Yes, it does touch
upon topics best avoided in polite company, but oh well...


 
Newsgroups: gnu.emacs.help
From: rusi <rustompm...@gmail.com>
Date: Thu, 31 May 2012 19:42:40 -0700 (PDT)
Subject: Re: those funny non-ASCII characters
On Jun 1, 2:46 am, Thien-Thi Nguyen <t...@gnuvola.org> wrote:

> Anyway, Unicode is ASCII-compatible, so probably if you wrangle
> your environment to Unicode by default, Emacs will also DTRT.
> Check out <http://www.utf8everywhere.org>.

Thanks very useful

> Yes, it does touch
> upon topics best avoided in polite company, but oh well...

??

 
Newsgroups: gnu.emacs.help
From: Jason Rumney <jasonrum...@gmail.com>
Date: Thu, 31 May 2012 21:23:24 -0700 (PDT)
Subject: Re: those funny non-ASCII characters

On Thursday, 31 May 2012 01:15:11 UTC+8, Buchs, Kevin  wrote:
> Xah suggested I embrace Unicode. So I could use (prefer-coding-system
> 'utf-8) or the file variable: -*- coding: utf-8 -*-. Are there drawbacks
> to the former? What about opening an ASCII coded file? Can emacs
> properly detect it or does it come up as UTF-8?

ASCII is a subset of UTF-8, so the problem you are imagining does not exist.

 
Newsgroups: gnu.emacs.help
From: rusi <rustompm...@gmail.com>
Date: Thu, 31 May 2012 22:43:07 -0700 (PDT)
Subject: Re: those funny non-ASCII characters
On Jun 1, 9:23 am, Jason Rumney <jasonrum...@gmail.com> wrote:

> On Thursday, 31 May 2012 01:15:11 UTC+8, Buchs, Kevin  wrote:
> > Xah suggested I embrace Unicode. So I could use (prefer-coding-system
> > 'utf-8) or the file variable: -*- coding: utf-8 -*-. Are there drawbacks
> > to the former? What about opening an ASCII coded file? Can emacs
> > properly detect it or does it come up as UTF-8?

> ASCII is a subset of UTF-8, so the problem you are imagining does not exist.

This does not exactly work that way on windows.
eg recently saw a description of how notepad put a BOM mark in a
haskell-script which made the haskell scripts unrunnable

 
Newsgroups: gnu.emacs.help
From: Eli Zaretskii <e...@gnu.org>
Date: Fri, 01 Jun 2012 09:12:42 +0300
Subject: Re: those funny non-ASCII characters

> From: rusi <rustompm...@gmail.com>
> Date: Thu, 31 May 2012 22:43:07 -0700 (PDT)

> On Jun 1, 9:23 am, Jason Rumney <jasonrum...@gmail.com> wrote:
> > On Thursday, 31 May 2012 01:15:11 UTC+8, Buchs, Kevin  wrote:
> > > Xah suggested I embrace Unicode. So I could use (prefer-coding-system
> > > 'utf-8) or the file variable: -*- coding: utf-8 -*-. Are there drawbacks
> > > to the former? What about opening an ASCII coded file? Can emacs
> > > properly detect it or does it come up as UTF-8?

> > ASCII is a subset of UTF-8, so the problem you are imagining does not exist.

> This does not exactly work that way on windows.
> eg recently saw a description of how notepad put a BOM mark in a
> haskell-script which made the haskell scripts unrunnable

We are talking about Emacs, not about Notepad, so it's unclear to me
how what Notepad does is relevant to the OP's question.

 
Newsgroups: gnu.emacs.help
From: Xah Lee <xah...@gmail.com>
Date: Fri, 1 Jun 2012 00:03:12 -0700 (PDT)
Subject: Re: those funny non-ASCII characters
On May 31, 10:43 pm, rusi <rustompm...@gmail.com> wrote:

> On Jun 1, 9:23 am, Jason Rumney <jasonrum...@gmail.com> wrote:

> > On Thursday, 31 May 2012 01:15:11 UTC+8, Buchs, Kevin  wrote:
> > > Xah suggested I embrace Unicode. So I could use (prefer-coding-system
> > > 'utf-8) or the file variable: -*- coding: utf-8 -*-. Are there drawbacks
> > > to the former? What about opening an ASCII coded file? Can emacs
> > > properly detect it or does it come up as UTF-8?

> > ASCII is a subset of UTF-8, so the problem you are imagining does not exist.

> This does not exactly work that way on windows.
> eg recently saw a description of how notepad put a BOM mark in a
> haskell-script which made the haskell scripts unrunnable

haskell compiler probably should bear the blame. Last i read (~4 years
ago), the lang spec says source code should be unicode (i forgot if it
specified an encoding), however, no haskell compiler at the time
supported it. If your lang spec says unicode, you have to support BOM
mark.

〈Unicode BOM Byte Order Mark Hack〉
http://xahlee.org/comp/unicode_BOM_byte_orde_mark.html

http://www.unicode.org/faq/utf_bom.html#bom1

 Xah


 
Newsgroups: gnu.emacs.help
From: Doug Lewan <do...@shubertticketing.com>
Date: Fri, 1 Jun 2012 13:36:08 +0000
Subject: RE: those funny non-ASCII characters
Thanks for the UTF-8 pointer. I never appreciated just how complex this is.

When you get people involved in software it just sucks and it shouldn't.


 
Newsgroups: gnu.emacs.help
From: rusi <rustompm...@gmail.com>
Date: Fri, 1 Jun 2012 09:26:08 -0700 (PDT)
Subject: Re: those funny non-ASCII characters
On Jun 1, 12:03 pm, Xah Lee <xah...@gmail.com> wrote:

See http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf
(pg 36) "Use of a BOM is neither required nor recommended for UTF-8,
but may
be encountered in contexts where UTF-8 data is converted from other
encoding forms..."

More specifically the non-recommendation of bom: http://www.unicode.org/faq/utf_bom.html
"Note that some recipients of UTF-8 encoded data do not expect a BOM.
Where UTF-8 is used transparently in 8-bit environments, the use of a
BOM will interfere with any protocol or file format that expects
specific ASCII characters at the beginning, such as the use of "#!" of
at the beginning of Unix shell scripts. "
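
In Emacs terms, the difference is between the coding systems utf-8 and utf-8-with-signature. The sketch below shows what the BOM does to the first bytes of a script; my-starts-with-utf8-bom-p is a hypothetical helper name.

```elisp
;; Sketch only; `my-starts-with-utf8-bom-p' is a hypothetical name.
(defun my-starts-with-utf8-bom-p (bytes)
  "Return non-nil if unibyte string BYTES begins with the UTF-8 BOM."
  (and (>= (length bytes) 3)
       (equal (substring bytes 0 3)
              (unibyte-string #xEF #xBB #xBF))))

;; Plain utf-8 leaves "#!" as the very first bytes of the file:
(my-starts-with-utf8-bom-p
 (encode-coding-string "#!/bin/sh" 'utf-8))

;; utf-8-with-signature puts EF BB BF in front of it:
(my-starts-with-utf8-bom-p
 (encode-coding-string "#!/bin/sh" 'utf-8-with-signature))
```

The first call should return nil and the second non-nil. A file saved the second way no longer starts with "#!", which is the interference the quoted FAQ describes.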


 
Newsgroups: gnu.emacs.help
From: Xah Lee <xah...@gmail.com>
Date: Fri, 1 Jun 2012 14:06:33 -0700 (PDT)
Subject: Re: those funny non-ASCII characters
Xah wrote

On Jun 1, 9:26 am, rusi <rustompm...@gmail.com> wrote:

> See http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf
> (pg 36) "Use of a BOM is neither required nor recommended for UTF-8,
> but may
> be encountered in contexts where UTF-8 data is converted from other
> encoding forms..."

> More specifically the non-recommendation of bom: http://www.unicode.org/faq/utf_bom.html
> "Note that some recipients of UTF-8 encoded data do not expect a BOM.
> Where UTF-8 is used transparently in 8-bit environments, the use of a
> BOM will interfere with any protocol or file format that expects
> specific ASCII characters at the beginning, such as the use of "#!" of
> at the beginning of Unix shell scripts. "

didn't i mention these 2 points exactly in the link i gave??

 Xah


 
Newsgroups: gnu.emacs.help
From: rusi <rustompm...@gmail.com>
Date: Fri, 1 Jun 2012 20:17:35 -0700 (PDT)
Subject: Re: those funny non-ASCII characters
On Jun 2, 2:06 am, Xah Lee <xah...@gmail.com> wrote:

Yeah your own link says this: (as you know I often use and quote your
unicode pages :-) )

- In unix-like OSes, BOM for utf-8 conflicts with the Shebang (Unix)
hack.
- Many Window software add BOM to utf-8 files, e.g. Notepad.

But you also say

> If your lang spec says unicode, you have to support BOM mark

So I am not clear whats ur stand...

Let me make my own position clear:
The de jure unicode standard is set by the unicode consortium (or
whatever its called)
The de facto standard is set by microsoft and java
The two conflict


 
Newsgroups: gnu.emacs.help
From: Xah Lee <xah...@gmail.com>
Date: Sat, 2 Jun 2012 04:54:34 -0700 (PDT)
Subject: Re: those funny non-ASCII characters
On Jun 1, 8:17 pm, rusi <rustompm...@gmail.com> wrote:

BOM mark is part of the unicode standard. If a tech declares full
support for unicode, support for BOM mark is necessary.

BOM mark is a hack, but so is the unix shebang. The BOM being a
given, it wouldn't be a problem if utf-8 hadn't been invented. utf-8 was
invented by Ken Thompson and Rob Pike, largely to help the unix world move
forward to unicode. As it is, the BOM conflicts with the spirit of
utf-8 (because utf-8 is meant to be ASCII compatible as is, yet the BOM
byte sequence isn't in ASCII.)

i read the link Thien-Thi Nguyen posted 〔http://www.utf8everywhere.org/〕.
At first i found it very informative, but in
the end i wasn't convinced by its opinion that we should all adopt
utf-8 instead of utf-16. I think if one switches attitude, and sees utf-8
as the hack that introduced all these problems, then many of their
arguments for utf-8 don't stand.

side note... about that site, it's Windows oriented. As such, they
didn't explain many of the terms and Windows tech they use, e.g. i have
little idea what they mean by narrowchar or widechar, nor of the many
Windows libraries they mention.

also, the site is decidedly western-mind oriented. They forgot that in
china, the encoding used is GB 18030, which has the same char set as
unicode but a different encoding, and is also compatible with ascii. No
utf-8 nor utf-anything whatsoever. Chinese web traffic is like half
of the world's or something.

the site wishes utf-16 to go away. Windows, Mac, NTFS, HFS+ file
systems, all utf-16, plus java C# etc. Though, the web (html,xml,css)
are all utf-8. Neither are likely to go away. If Java and C# and NTFS
disappeared from the face of this earth, then maybe. lol. :D

 Xah


 