I often paste content from web pages into an emacs org-mode buffer and I
get the odd quote characters or dashes that are not ASCII. I created a
lisp function to remove the unicode ones that are just 8 bits. Lately I
am seeing that there are characters that are not being caught. They show
up in emacs as the expected character. When I kill/yank them into lisp
code, they are not being found. When I save the buffer, I am asked for
coding and chose raw text. When the file is opened again, these
characters are showing up as some sort of special symbol (dashed circle
with flag off the top) followed by doubles/triples of \2xx. For example,
the dash character I just stored was this sequence: circle-flag \200
\231. Using GNU/Linux od to dump them I get octal byte strings such as: 340 245
206 340 244 206 210 200 and for the dash mentioned above 342 200 231.
I am very naive in regard to coding, so please excuse my ignorance. I
would guess these are 16-bit (Unicode16) characters. Can someone
enlighten me as to how I can determine what these characters are (after
pasted into a buffer) and how I can code a function to replace them with
ASCII equivalents? The only thing I could think of was hexl mode, but
that didn't turn out well. Thanks.
Kevin Buchs | Senior Engineer | SPPDG | 507-538-5459 |
buchs.ke...@mayo.edu
Mayo Clinic | 200 First Street SW | Rochester, MN 55905 |
http://www.mayo.edu/sppdg
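A note on reading the od output quoted above: od prints bytes in octal by default, so the values are three-digit octal numbers, not hex. The sketch below is Python rather than Emacs Lisp, purely as an illustration outside Emacs, and it assumes the file on disk is UTF-8. The helper name decode_od_octal is made up for this example. Under that assumption, the sequence 342 200 231 decodes to U+2019, which is a curly apostrophe rather than a dash.

```python
import unicodedata


def decode_od_octal(dump: str) -> str:
    """Turn whitespace-separated octal byte values (od's default) into text.

    Assumes the bytes are valid UTF-8; raises UnicodeDecodeError otherwise.
    """
    raw = bytes(int(token, 8) for token in dump.split())
    return raw.decode("utf-8")


for ch in decode_od_octal("342 200 231"):
    # Prints: U+2019 RIGHT SINGLE QUOTATION MARK
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

An en dash (U+2013) would show up in the same kind of dump as 342 200 223.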
better to embrace unicode than fight it.
what encoding you get when you paste is rather complex. I guess it
depends on the source you copy from, as each web page can use a different
charset and encoding, and i am not sure whether your OS does some
translation in the pasteboard.
> I am very naive in regard to coding, so please excuse my ignorance. I
> would guess these are 16-bit (Unicode16) characters. Can someone
> enlighten me as to how I can determine what these characters are (after
> pasted into a buffer)
With cursor on that character, type "C-u C-x =", and Emacs will show
everything it knows about that character, including its canonical
name.
Thanks, Xah and Eli, for contributing to my further understanding. I
went to a specific website where I got the content I copied and pasted
and I can see from the HTML that it has a charset=UTF-8, so I understand
that is Unicode 8-bit. Using the C-u C-x =, I see that the particular
character I pasted has a code point of 0x2013 (U+2013). I didn't see,
however, what the UTF-8 encoding of that code point was. Should I be
able to read that somewhere on the buffer of information I get with C-u
C-x = ? I was poking around the www.unicode.org website, trying to
understand how this U+2013 code point is encoded into UTF-8, but I
haven't determined that yet.
A fresh buffer in emacs for me on my Win-7 box has an encoding system of
iso-latin-1-dos. The coding system used to open and save files is the
same.
So, help me piece together what happens as I paste the UTF-8 text into a
buffer. First, the paste buffer must define that it is in UTF-8. Emacs
reads this information and inserts it into the byte string that defines
the buffer. Now, how does emacs record that it was a UTF-8 encoded
character? Does it translate it into a different internal encoding
instead of just recording the 8 bits transferred? Is this encoding used
as a superset of all possible encoding systems that emacs supports?
Now, Xah, you suggest I embrace Unicode. What does that mean? Would it
involve marking all my lisp library files and my org-mode files with the
file variable -*- coding: utf-8 -*- ? Or is there another way to go
Unicode automatically?
I assume that if my lisp library files are encoded utf-8, then I can
paste that character from the web page into my call to replace-string in
order to substitute the longer dash of Unicode U+2013 with an ascii
hyphen or double hyphen. But, how does that really work? If the lisp
file is encoded utf-8, then how can I put an ascii character in the
replacement string?
I would appreciate it if someone could help me open this new door in my
brain a bit further.
> Thanks, Xah and Eli, for contributing to my further understanding. I
> went to a specific website where I got the content I copied and pasted
> and I can see from the HTML that it has a charset=UTF-8, so I understand
> that is Unicode 8-bit. Using the C-u C-x =, I see that the particular
> character I pasted has a code point of 0x2013 (U+2013). I didn't see,
> however, what the UTF-8 encoding of that code point was. Should I be
> able to read that somewhere on the buffer of information I get with C-u
> C-x = ?
Yes, this part of "C-u C-x ="'s display:
file code: #xE2 #x80 #x93 (encoded by coding system utf-8-dos)
shows you how it would be encoded in UTF-8. If you see something like
"not encodable by ...", then you need to set the buffer's encoding
using "C-x RET f". Under "file code", Emacs shows how the character
would be encoded if the buffer is saved to a disk file or sent to
another program or as an email message.
> I was poking around the www.unicode.org website, trying to
> understand how this U+2013 code point is encoded into UTF-8, but I
> haven't determined that yet.
See above: Emacs shows this under the right circumstances.
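For readers who want to see where the three bytes in the "file code" line come from, here is a minimal sketch of the UTF-8 bit-packing rule, written in Python as an illustration outside Emacs. The function name utf8_encode is hypothetical, and the sketch skips validation (it does not reject surrogates or out-of-range values).

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (no validation)."""
    if code_point < 0x80:          # 7 bits: one byte, same as ASCII
        return bytes([code_point])
    if code_point < 0x800:         # up to 11 bits: two bytes
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:       # up to 16 bits: three bytes
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),   # up to 21 bits: four bytes
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


encoded = utf8_encode(0x2013)
# Prints: #xE2 #x80 #x93
print(" ".join(f"#x{byte:02X}" for byte in encoded))
```

U+2013 needs 14 bits, so it falls in the three-byte case: the top 4 bits go into the first byte and the remaining 12 are split into two groups of 6.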
> So, help me piece together what happens as I paste the UTF-8 text into a
> buffer. First, the paste buffer must define that it is in UTF-8.
On Windows, Emacs always uses UTF-16 to pass text via the clipboard,
because doing so lets Emacs copy and paste any character from any
character set on Earth.
> Emacs reads this information and inserts it into the byte string
> that defines the buffer. Now, how does emacs record that it was a
> UTF-8 encoded character?
It doesn't. What it records is the encoding to be used for the
current buffer if it is saved to disk or sent to some program. That
encoding is a property of the buffer, not of the characters.
> Does it translate it into a different internal encoding
Yes, it does.
> Is this encoding used
> as a superset of all possible encoding systems that emacs supports?
Yes. See the section "Text Representations" in the ELisp manual that
comes with Emacs; you will find the details there.
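The decode-on-input, encode-on-output model described in this reply can be sketched as follows. This is an analogy in Python, not Emacs's actual implementation: Python's str stands in for the editor's internal representation, and save_as is a made-up helper standing in for saving a buffer with a given coding system.

```python
# Bytes as they might arrive from the Windows clipboard (UTF-16, little-endian).
clipboard_bytes = "2010\u20132012".encode("utf-16-le")

# Decoding yields characters; the source encoding is not attached to them.
text = clipboard_bytes.decode("utf-16-le")


def save_as(text: str, coding: str) -> bytes:
    """Encode text for writing to disk with the chosen (buffer-level) coding."""
    return text.encode(coding)


# UTF-8 can represent the en dash, so this works.
print(save_as(text, "utf-8"))

# Latin-1 has no en dash, which is the "not encodable by ..." situation.
try:
    save_as(text, "latin-1")
except UnicodeEncodeError as err:
    print("not encodable by latin-1:", err.reason)
```

The same characters can be written out under different encodings; whether that succeeds depends only on the encoding chosen for output, not on where the text came from.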
• embrace unicode, because it's just going to be more and more.
Many programming languages default to unicode by spec (e.g. html/
css/JavaScript, Java, Haskell, …). Most OSes (Windows, Mac) and file
systems default to a unicode encoding now (not sure about linux).
Even emacs, starting with emacs 23, uses unicode as its default internal
encoding.
• Unicode is about 2 things: ① a char set with an integer ID for each
char. ② several encodings for the char set, the most popular being utf-8
and utf-16 (the latter is the default on Mac and Windows). (an encoding
is a standard that maps a char from a char set to a byte sequence)
• in emacs, just put this in your init:
(set-language-environment "UTF-8")
that should set all encodings to utf-8, and shouldn't cause you any
problem if all your current files and elisp files are ascii, because
ascii is a compatible subset of utf-8.
• in emacs, call describe-char. That'll show the current char's
code point as well as the byte sequence used for the buffer's encoding.
(this is emacs 24. Emacs 23 may not show the byte sequence... i don't
recall.)
my unicode tutorial covers all these… feel free to ask me, or here, of
course.
Xah
I am reposting some of my questions from last Friday (plus a few more),
as I am still seeking assistance and there has been a lot of water over
the dam on this list.
Xah suggested I embrace Unicode. So I could use (prefer-coding-system
'utf-8) or the file variable: -*- coding: utf-8 -*-. Are there drawbacks
to the former? What about opening an ASCII coded file? Can emacs
properly detect it or does it come up as UTF-8? Or is there another way
to go Unicode automatically? If I embrace Unicode, then should I make my
Org-mode files no longer plain text?
I assume that if my lisp library files are encoded utf-8, then I can
paste that UTF-8 character from the web page into my call to
(replace-string ...) in order to substitute the longer dash of Unicode
U+2013 with an ASCII hyphen or double hyphen. But, how does that really
work? If the lisp file is encoded utf-8, then how can I put an ASCII
character in the replacement string? Or do I need to encode the hex
value of the ASCII character(s)?
> I am reposting some of my questions from last Friday (plus a few more),
> as I am still seeking assistance and there has been a lot of water over
> the dam on this list.
Does this mean you are ignoring the previous responses?
> Does this mean you are ignoring the previous responses?
Thien-Thi,
I did not intend to ignore any prior responses. I apologize if I have
missed some. I noted responses from Xah Lee, Eli Zaretskii and
Jambunathan. There was one other, for which I did not record the name.
Have I missed more? Please let me know if I have. I note that I get the
digests of this list.
My reason for reposting is that I did not have the answers to all the
questions I originally asked AND I had some additional questions. Did
you feel like the questions I reposted were in fact answered? If so,
perhaps I misunderstood.
On Wednesday, May 30, 2012, Buchs, Kevin <buchs.ke...@mayo.edu> wrote:
> What about opening an ASCII coded file? Can emacs
> properly detect it or does it come up as UTF-8?
Emacs attempts to determine the correct coding system when it opens a file,
so you shouldn't have to worry about this.
The 128 characters that make up ASCII have the exact same representation in
UTF-8. "Converting" an ASCII file to UTF-8 is a no-op. Therefore,
treating an ASCII file as UTF-8 should cause no problems.
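The no-op claim is easy to check outside Emacs. A small Python sketch (illustration only; the variable names are made up) that encodes all 128 ASCII characters both ways and compares the bytes:

```python
# All 128 ASCII characters, U+0000 through U+007F.
ascii_text = "".join(chr(i) for i in range(128))

as_ascii = ascii_text.encode("ascii")
as_utf8 = ascii_text.encode("utf-8")

print(as_ascii == as_utf8)                     # True: byte-for-byte identical
print(as_ascii.decode("utf-8") == ascii_text)  # True: ASCII bytes read back as UTF-8
```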
> I assume that if my lisp library files are encoded utf-8, then I can
> paste that UTF-8 character from the web page into my call to
> (replace-string ...) in order to substitute the longer dash of Unicode
> U+2013 with an ASCII hyphen or double hyphen. But, how does that really
> work? If the lisp file is encoded utf-8, then how can I put an ASCII
> character in the replacement string? Or do I need to encode the hex
> value of the ASCII character(s)?
A = A. The hyphen-minus is a hyphen-minus whether it's in an ASCII file as
00101101 or a UTF-16 file as 0000000000101101. So, just type it with your
keyboard.
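To make the "just type it" point concrete, here is a sketch of a replacement table in Python (not Emacs Lisp; it only illustrates the idea). The mapping and the name asciify are hypothetical, and the choice of ASCII stand-ins is an assumption: extend or change it to suit whatever characters turn up. The non-ASCII keys are written as escapes so the source stays ASCII, but pasting the characters themselves into a UTF-8 source file would work the same way.

```python
# Hypothetical mapping from typographic characters to ASCII stand-ins.
ASCII_EQUIVALENTS = {
    "\u2013": "-",   # EN DASH
    "\u2014": "--",  # EM DASH
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201C": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201D": '"',   # RIGHT DOUBLE QUOTATION MARK
}
_TABLE = str.maketrans(ASCII_EQUIVALENTS)


def asciify(text: str) -> str:
    """Replace the mapped characters; everything else passes through unchanged."""
    return text.translate(_TABLE)


sample = "\u201Cpages 10\u201320\u201D \u2014 it\u2019s fine"
# Prints: "pages 10-20" -- it's fine
print(asciify(sample))
```

The replacement strings are ordinary ASCII characters typed from the keyboard; nothing special is needed to put them in a UTF-8 file.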
BTW, I don't know how Xah intended it, but when he said to "embrace
unicode," I interpreted it to mean, "Why don't you just leave en dashes as
en dashes instead of replacing them with one or two hyphen-minuses?"
-- -PJ
Gehm's Corollary to Clark's Law: Any technology distinguishable from
magic is insufficiently advanced.
> I am simply ignorant of "water over the dam".
> My mistake was not asking it directly:
> What do you mean by "water over the dam"?
Thien-Thi,
No problem. It is an expression meaning, in general, that lots of events
have come and gone, and are now passed. It is an analogy to the flowing
of water over a dam in a river, in the sense that once water flows up
over a dam, it is going downstream and has passed the reservoir behind
the dam and presumably passed your field of view.
In this specific instance I was referring to a large number of messages
having been posted to the email list by several people discussing a
topic that Xah Lee brought up. So, the busy-ness of the list made me
think that perhaps there were some people who were going to reply, but
the number of messages coming from the list got to be so large that they
just deleted them including my message or lost my message in inbox
clutter.
I could have applied the analogy of "water over the dam" even further to
say that: "though there were many messages posted on this list that are
now water over the dam, I would like to bring my message back to allow
it to float to the top again."
> [...] once water flows up over a dam, it is going downstream
> and has passed the reservoir behind the dam and presumably
> passed your field of view.
> In this specific instance [...]
OK, thanks. Now i understand.
> I could have applied the analogy of "water over the dam" even
> further to say that: "though there were many messages posted on
> this list that are now water over the dam, I would like to
> bring my message back to allow it to float to the top again."
The flow of messages is indeed like water.
I suppose everyone relates to this in their own way.
Using GNUS (now Gnus) to read these, i imagine myself an insect
buzzing around an upward turned flow (a geyser), first in summer
when the molecules dissociate quickly, then (later, always later)
in winter when they crystallize shard-like and treed, sometimes
under a brilliant sun refracted as rainbows, sometimes under a
brilliant moon that ghostly glows, sometimes in darkness lit only
by lucky grep rows. A drip gleaned here and there for sustenance,
a drop left there and here for assonance, the rest left to what
entropy can penetrate the disks of gmane.
Anyway, Unicode is ASCII-compatible, so probably if you wrangle
your environment to Unicode by default, Emacs will also DTRT.
Check out <http://www.utf8everywhere.org>. Yes, it does touch
upon topics best avoided in polite company, but oh well...
On Jun 1, 2:46 am, Thien-Thi Nguyen <t...@gnuvola.org> wrote:
> Anyway, Unicode is ASCII-compatible, so probably if you wrangle
> your environment to Unicode by default, Emacs will also DTRT.
> Check out <http://www.utf8everywhere.org>.
Thanks, very useful.
> Yes, it does touch
> upon topics best avoided in polite company, but oh well...
On Thursday, 31 May 2012 01:15:11 UTC+8, Buchs, Kevin wrote:
> Xah suggested I embrace Unicode. So I could use (prefer-coding-system
> 'utf-8) or the file variable: -*- coding: utf-8 -*-. Are there drawbacks
> to the former? What about opening an ASCII coded file? Can emacs
> properly detect it or does it come up as UTF-8?
ASCII is a subset of UTF-8, so the problem you are imagining does not exist.
On Jun 1, 9:23 am, Jason Rumney <jasonrum...@gmail.com> wrote:
> On Thursday, 31 May 2012 01:15:11 UTC+8, Buchs, Kevin wrote:
> > Xah suggested I embrace Unicode. So I could use (prefer-coding-system
> > 'utf-8) or the file variable: -*- coding: utf-8 -*-. Are there drawbacks
> > to the former? What about opening an ASCII coded file? Can emacs
> > properly detect it or does it come up as UTF-8?
> ASCII is a subset of UTF-8, so the problem you are imagining does not exist.
It does not exactly work that way on Windows.
E.g. I recently saw a description of how Notepad put a BOM in a
Haskell script, which made the script unrunnable.
> On Jun 1, 9:23 am, Jason Rumney <jasonrum...@gmail.com> wrote:
> > On Thursday, 31 May 2012 01:15:11 UTC+8, Buchs, Kevin wrote:
> > > Xah suggested I embrace Unicode. So I could use (prefer-coding-system
> > > 'utf-8) or the file variable: -*- coding: utf-8 -*-. Are there drawbacks
> > > to the former? What about opening an ASCII coded file? Can emacs
> > > properly detect it or does it come up as UTF-8?
> > ASCII is a subset of UTF-8, so the problem you are imagining does not exist.
> This does not exactly work that way on windows.
> eg recently saw a description of how notepad put a BOM mark in a
> haskell-script which made the haskell scripts unrunnable
We are talking about Emacs, not about Notepad, so it's unclear to me
how what Notepad does is relevant to the OP's question.
On May 31, 10:43 pm, rusi <rustompm...@gmail.com> wrote:
> On Jun 1, 9:23 am, Jason Rumney <jasonrum...@gmail.com> wrote:
> > On Thursday, 31 May 2012 01:15:11 UTC+8, Buchs, Kevin wrote:
> > > Xah suggested I embrace Unicode. So I could use (prefer-coding-system
> > > 'utf-8) or the file variable: -*- coding: utf-8 -*-. Are there drawbacks
> > > to the former? What about opening an ASCII coded file? Can emacs
> > > properly detect it or does it come up as UTF-8?
> > ASCII is a subset of UTF-8, so the problem you are imagining does not exist.
> This does not exactly work that way on windows.
> eg recently saw a description of how notepad put a BOM mark in a
> haskell-script which made the haskell scripts unrunnable
the haskell compiler probably should bear the blame. Last i read (~4 years
ago), the lang spec said source code should be unicode (i forgot if it
specified an encoding), however, no haskell compiler at the time
supported it. If your lang spec says unicode, you have to support the
BOM.
> On May 31, 10:43 pm, rusi <rustompm...@gmail.com> wrote:
> > On Jun 1, 9:23 am, Jason Rumney <jasonrum...@gmail.com> wrote:
> > > On Thursday, 31 May 2012 01:15:11 UTC+8, Buchs, Kevin wrote:
> > > > Xah suggested I embrace Unicode. So I could use (prefer-coding-system
> > > > 'utf-8) or the file variable: -*- coding: utf-8 -*-. Are there drawbacks
> > > > to the former? What about opening an ASCII coded file? Can emacs
> > > > properly detect it or does it come up as UTF-8?
> > > ASCII is a subset of UTF-8, so the problem you are imagining does not exist.
> > This does not exactly work that way on windows.
> > eg recently saw a description of how notepad put a BOM mark in a
> > haskell-script which made the haskell scripts unrunnable
> haskell compiler probably should bear the blame. Last i read (~4 years
> ago), the lang spec says source code should be unicode (i forgot if it
> specified a encoding), however, no haskell compiler at the time
> supports it. If your lang spec says unicode, you have to support BOM
> mark.
See http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf (pg 36):
"Use of a BOM is neither required nor recommended for UTF-8, but may
be encountered in contexts where UTF-8 data is converted from other
encoding forms..."
More specifically, on the non-recommendation of the BOM, see
http://www.unicode.org/faq/utf_bom.html:
"Note that some recipients of UTF-8 encoded data do not expect a BOM.
Where UTF-8 is used transparently in 8-bit environments, the use of a
BOM will interfere with any protocol or file format that expects
specific ASCII characters at the beginning, such as the use of "#!" of
at the beginning of Unix shell scripts."
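The interference with "#!" can be seen at the byte level. A short Python sketch (illustration only; the script text is made up) comparing the same text saved as plain UTF-8 and as UTF-8 with a BOM, using Python's "utf-8-sig" codec:

```python
import codecs

script = "#!/bin/sh\necho hello\n"

plain = script.encode("utf-8")
with_bom = script.encode("utf-8-sig")  # same text with the BOM prepended

print(codecs.BOM_UTF8)                         # b'\xef\xbb\xbf'
print(plain.startswith(b"#!"))                 # True
print(with_bom.startswith(b"#!"))              # False: the file now starts with EF BB BF
print(with_bom.decode("utf-8-sig") == script)  # True: a BOM-aware reader recovers the text
```

A reader that knows about the BOM loses nothing, but anything that inspects the first two bytes of the file no longer finds "#!".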
On Jun 1, 9:26 am, rusi <rustompm...@gmail.com> wrote:
> See http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf (pg 36):
> "Use of a BOM is neither required nor recommended for UTF-8, but may
> be encountered in contexts where UTF-8 data is converted from other
> encoding forms..."
> More specifically, on the non-recommendation of the BOM, see
> http://www.unicode.org/faq/utf_bom.html:
> "Note that some recipients of UTF-8 encoded data do not expect a BOM.
> Where UTF-8 is used transparently in 8-bit environments, the use of a
> BOM will interfere with any protocol or file format that expects
> specific ASCII characters at the beginning, such as the use of "#!" of
> at the beginning of Unix shell scripts."
didn't i mention these 2 points exactly in the link i gave??
> On Jun 1, 9:26 am, rusi <rustompm...@gmail.com> wrote:
> > See http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf (pg 36):
> > "Use of a BOM is neither required nor recommended for UTF-8, but may
> > be encountered in contexts where UTF-8 data is converted from other
> > encoding forms..."
> > More specifically, on the non-recommendation of the BOM, see
> > http://www.unicode.org/faq/utf_bom.html:
> > "Note that some recipients of UTF-8 encoded data do not expect a BOM.
> > Where UTF-8 is used transparently in 8-bit environments, the use of a
> > BOM will interfere with any protocol or file format that expects
> > specific ASCII characters at the beginning, such as the use of "#!" of
> > at the beginning of Unix shell scripts."
> didn't i mention these 2 points exactly in the link i gave??
Yeah your own link says this: (as you know I often use and quote your
unicode pages :-) )
- In unix-like OSes, BOM for utf-8 conflicts with the Shebang (Unix)
hack.
- Many Window software add BOM to utf-8 files, e.g. Notepad.
But you also say
> If your lang spec says unicode, you have to support BOM mark
So I am not clear what your stand is...
Let me make my own position clear:
The de jure Unicode standard is set by the Unicode Consortium (or
whatever it's called).
The de facto standard is set by Microsoft and Java.
The two conflict.
> > didn't i mention these 2 points exactly in the link i gave??
> Yeah your own link says this: (as you know I often use and quote your
> unicode pages :-) )
> - In unix-like OSes, BOM for utf-8 conflicts with the Shebang (Unix)
> hack.
> - Many Window software add BOM to utf-8 files, e.g. Notepad.
> But you also say
> > If your lang spec says unicode, you have to support BOM mark
> So I am not clear whats ur stand...
> Let me make my own position clear:
> The de jure unicode standard is set by the unicode consortium (or
> whatever its called)
> The de facto standard is set by microsoft and java
> The two conflict
The BOM is part of the unicode standard. If a tech declares full
support for unicode, support for the BOM is necessary.
The BOM is a hack, but so is the unix shebang. The BOM being a
given, there wouldn't be any problem if utf-8 hadn't been invented. utf-8
was invented by Ken Thompson and Rob Pike, unix people, largely to help
the unix world move forward to unicode. As it is, the BOM conflicts with
the spirit of utf-8 (because utf-8 is meant to be ASCII compatible as is,
yet the BOM byte sequence isn't in ASCII.)
i read the link Thien-Thi Nguyen posted 〔http://www.utf8everywhere.org/〕.
At first i found it very informative, but in
the end i wasn't convinced by its opinion that we should all adopt
utf-8 instead of utf-16. I think if one switches attitude, and sees utf-8
as the hack that introduced all these problems, then many of their
arguments for utf-8 don't stand.
side note... that site is Windows oriented. As such, they
didn't explain many of the terms and Windows tech they use, e.g. i have
little idea what they mean by narrowchar or widechar, nor of the many
Windows libraries they mention.
also, the site is decidedly western-mind oriented. They forgot that in
china, the encoding used is GB 18030, which has the same char set as
unicode but a different encoding, and is also compatible with ascii. No
utf-8 nor utf-anything whatsoever. Chinese web traffic is like half
of the world's or something.
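The two properties claimed for GB 18030 here (ASCII compatible, same character set as unicode but different bytes) can be checked with Python's built-in "gb18030" codec. This is only a sketch for illustration; the sample strings are made up.

```python
ascii_sample = "replace-string"
dash = "\u2013"  # EN DASH

# ASCII text has the same bytes in GB 18030 as in plain ASCII.
print(ascii_sample.encode("gb18030") == ascii_sample.encode("ascii"))  # True

# The en dash is representable, but with different bytes than in UTF-8.
gb_bytes = dash.encode("gb18030")
print(gb_bytes != dash.encode("utf-8"))    # True
print(gb_bytes.decode("gb18030") == dash)  # True: round-trips without loss
```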
the site wishes utf-16 to go away. Windows, Mac, the NTFS and HFS+ file
systems, all use utf-16, plus java, C# etc. Though, the web (html, xml, css)
is all utf-8. Neither is likely to go away. If Java and C# and NTFS
disappeared from the face of this earth, then maybe. lol. :D