New submission from Yuya Nishihara <y...@tcha.org>:
Some text editors, like Notepad.exe, insert BOM (byte order mark) silently if you save Mercurial.ini as UTF-8.
IMHO, they shouldn't insert BOM for UTF-8, but it's really hard to debug because BOM isn't visible. So it seems reasonable to skip/recognize BOM before reading Mercurial.ini.
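To make the symptom concrete, here is a minimal sketch of the proposal, not Mercurial's actual config reader. The helper name `strip_utf8_bom` and the sample file contents are made up for illustration; only `codecs.BOM_UTF8` is a real Python constant.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical Mercurial.ini contents as an editor might save them with a BOM.
raw = codecs.BOM_UTF8 + b"[ui]\nusername = Example User\n"

# With the BOM in place, the first line no longer starts with "[",
# so a parser looking for a section header would not recognise it.
first_line_with_bom = raw.splitlines()[0]
first_line_clean = strip_utf8_bom(raw).splitlines()[0]
```

Working on bytes rather than decoded text keeps the sketch encoding-agnostic: nothing after the first three bytes is interpreted.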
> New submission from Yuya Nishihara <y...@tcha.org>:
> Some text editors, like Notepad.exe, insert BOM (byte order mark) silently if you save Mercurial.ini as UTF-8.
> IMHO, they shouldn't insert BOM for UTF-8, but it's really hard to debug because BOM isn't visible. So it seems reasonable to skip/recognize BOM before reading Mercurial.ini.
I was under the impression that UTF-8 may have an optional BOM, and Python even defines a constant for it:
In [2]: codecs.BOM_UTF8
Out[2]: '\xef\xbb\xbf'
So why do you say it "shouldn't"?
> Yuya Nishihara wrote:
>> New submission from Yuya Nishihara <y...@tcha.org>:
>> Some text editors, like Notepad.exe, insert BOM (byte order mark) silently if you save Mercurial.ini as UTF-8.
>> IMHO, they shouldn't insert BOM for UTF-8, but it's really hard to debug because BOM isn't visible. So it seems reasonable to skip/recognize BOM before reading Mercurial.ini.
> I was under impression that UTF-8 might have optional BOM marker, and Python even has this constant defined:
> In [2]: codecs.BOM_UTF8
> Out[2]: '\xef\xbb\xbf'
> So, why you say it "shouldn't"?
Because it is optional, has no benefit, and "never" is used?
Mercurial is not particularly encoding-aware but very encoding-transparent. Encoding Mercurial.ini in any ASCII superset is fine, and BOMs could probably be removed or ignored when parsed, but in that case the BOM should probably be prepended to all value strings too ... and that would cause other strange issues.
FWIW I'm -0 on special handling of BOM - but a strip on the config file content before parsing should do no harm.
Perhaps we could warn if any non-7-bit characters are found before the first # or =?
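A rough sketch of what such a check could look like, taking the suggestion literally. The function name is hypothetical and this is not existing Mercurial code; a real check would presumably also have to consider the other characters that can start a config line.

```python
def has_high_bytes_before_first_marker(data: bytes) -> bool:
    """Report whether any byte >= 0x80 occurs before the first '#' or '='.

    If neither marker is present, the whole content is inspected.
    """
    positions = [p for p in (data.find(b"#"), data.find(b"=")) if p != -1]
    end = min(positions) if positions else len(data)
    return any(byte >= 0x80 for byte in data[:end])


# A BOM at the start of the file is caught ...
with_bom = has_high_bytes_before_first_marker(b"\xef\xbb\xbf[ui]\nusername = x\n")
# ... while non-ASCII bytes inside a value (after the first '=') are left alone.
value_only = has_high_bytes_before_first_marker(b"[ui]\nusername = J\xc3\xb6rg\n")
```

The check only looks at byte values, so it stays encoding-transparent: it never needs to know which ASCII superset the file is written in.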
Mads Kiilerich wrote:
> Alexander Belchenko wrote, On 04/27/2010 04:15 PM:
> > Yuya Nishihara wrote:
> >> New submission from Yuya Nishihara <y...@tcha.org>:
> >> Some text editors, like Notepad.exe, insert BOM (byte order mark) silently if you save Mercurial.ini as UTF-8.
> >> IMHO, they shouldn't insert BOM for UTF-8, but it's really hard to debug because BOM isn't visible. So it seems reasonable to skip/recognize BOM before reading Mercurial.ini.
> > I was under impression that UTF-8 might have optional BOM marker, and Python even has this constant defined:
> > In [2]: codecs.BOM_UTF8
> > Out[2]: '\xef\xbb\xbf'
> > So, why you say it "shouldn't"?
> Because it is optional, has no benefit, and "never" is used?
I heard it can be used for detection of character encoding,
but it seems silly to lose ascii compatibility just for such reason.
UTF-8 does exist for ascii transparency.
> Mercurial is not particular encoding-aware but very encoding-transparent. Encoding Mercurial.ini in any ascii-superset is fine, and BOMs could probably be removed or ignored when parsed, but in that case the BOM should probably be prepended to all value strings too ... and that would cause other strange issues.
> FWIW I'm -0 on special handling of BOM - but a strip on the config file content before parsing should do no harm.
> Perhaps we could warn if any non-7-bit characters if found before the first # or =?
That seems good for me. Stripping BOM is simple enough, but because
Mercurial doesn't care about encoding, warning comes after.
> Mads Kiilerich wrote:
>> Alexander Belchenko wrote, On 04/27/2010 04:15 PM:
>>> Yuya Nishihara wrote:
>>>> New submission from Yuya Nishihara <y...@tcha.org>:
>>>> Some text editors, like Notepad.exe, insert BOM (byte order mark) silently if you save Mercurial.ini as UTF-8.
>>>> IMHO, they shouldn't insert BOM for UTF-8, but it's really hard to debug because BOM isn't visible. So it seems reasonable to skip/recognize BOM before reading Mercurial.ini.
>>> I was under impression that UTF-8 might have optional BOM marker, and Python even has this constant defined:
>>> In [2]: codecs.BOM_UTF8
>>> Out[2]: '\xef\xbb\xbf'
>>> So, why you say it "shouldn't"?
>> Because it is optional, has no benefit, and "never" is used?
> I heard it can be used for detection of character encoding,
> but it seems silly to lose ascii compatibility just for such reason.
> UTF-8 does exist for ascii transparency.
I don't understand what "ascii transparency" means here. When somebody talks about "ascii" seriously, to me it sounds like pretending that we live on a flat world that stands on the back of a big turtle.
>>>> So, why you say it "shouldn't"?
>>> Because it is optional, has no benefit, and "never" is used?
>> I heard it can be used for detection of character encoding,
>> but it seems silly to lose ascii compatibility just for such reason.
>> UTF-8 does exist for ascii transparency.
> I don't understand what is "ascii transparency" here. When somebody said
> about "ascii" seriously, for me it sounds the same as pretend we're
> living in the flat world which stand on the back of big turtle.
Welcome to the UNIX world, where many people are scared of anything non-ASCII due to compatibility with ancient programs ;-)
> In [2]: codecs.BOM_UTF8
> Out[2]: '\xef\xbb\xbf'
> So, why you say it "shouldn't"?
Well, since utf-8 has no "customizable" byte order, "byte order mark" is a misnomer to start with. Second, while it's allowed, the byte order mark in a utf-8 document is *not recommended* by the official Unicode standard:
> Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature
(Unicode Standard 5.0 chapter 2)
and it's generally a pain in the ass.
On 28 Apr 2010, at 11:02 , Sune Foldager wrote:
> On 28-04-2010 09:29, Alexander Belchenko wrote:
>>>>> So, why you say it "shouldn't"?
>>>> Because it is optional, has no benefit, and "never" is used?
>>> I heard it can be used for detection of character encoding,
>>> but it seems silly to lose ascii compatibility just for such reason.
>>> UTF-8 does exist for ascii transparency.
>> I don't understand what is "ascii transparency" here. When somebody said
>> about "ascii" seriously, for me it sounds the same as pretend we're
>> living in the flat world which stand on the back of big turtle.
> Welcome to the UNIX world, where many people are scared of anything non-ASCII due to compatibility with ancient programs ;-)
There's also the issue that no two systems use the same encoding (let alone use them consistently), and even if you get two systems to agree on a (hopefully Unicode-based) encoding, they will probably disagree on something else, making all your earlier efforts pointless. For instance OSX uses NFD for file names where most Linux systems use NFC. This means a file name which displays fine might not be selectable via the console (and potentially via other APIs), because the NFC you'll enter (if the file was transferred from OSX to Linux) will not match the on-disk NFD name.
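To illustrate the mismatch, a small sketch using Python's `unicodedata` module. The file name is a made-up example, and standard NFD is used here as a stand-in for whatever a given file system actually stores.

```python
import unicodedata

# What a keyboard typically produces: "é" as one precomposed code point.
typed = unicodedata.normalize("NFC", "caf\u00e9.txt")
# The decomposed form: "e" followed by a combining acute accent.
on_disk = unicodedata.normalize("NFD", typed)

# The two names are canonically equivalent and render identically ...
same_text = unicodedata.normalize("NFC", on_disk) == typed
# ... but their UTF-8 byte sequences differ, so a byte-wise lookup fails.
same_bytes = typed.encode("utf-8") == on_disk.encode("utf-8")
```

Here `same_text` is true while `same_bytes` is false, which is exactly the situation where a name looks right on screen but cannot be matched by typing it.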
_______________________________________________
Mercurial-devel mailing list
Mercurial-de...@selenic.com
http://selenic.com/mailman/listinfo/mercurial-devel
> For instance OSX uses NFD for file names where most Linux systems use NFC. This means a file name which displays fine might not be selectable via the console (and potentially via other APIs), because the NFC you'll enter (if the file was transferred from OSX to Linux) will not match the on-disk NFD name.
I don't know much about the situation on Linux, but Mac OS X isn't quite as you describe. First of all, Mac OS X doesn't actually use NFC[1] but a custom, stable encoding. Files can be accessed via their NFC name; the kernel will transparently convert any illegal entities. Listing a directory, however, always yields the canonical strings.
In short: I don't believe it'll matter much how you encode file names on Mac OS X, unless you do file name completion, or try to compare paths for equality.
On 28 Apr 2010, at 13:41 , Dan Villiom Podlaski Christiansen wrote:
> On 28 Apr 2010, at 12:45, Masklinn wrote:
>> For instance OSX uses NFD for file names where most Linux systems use NFC. This means a file name which displays fine might not be selectable via the console (and potentially via other APIs), because the NFC you'll enter (if the file was transferred from OSX to Linux) will not match the on-disk NFD name.
> I don't know much about the situation on Linux, but Mac OS X isn't quite as you describe. First of all, Mac OS X doesn't actually use NFC[1]
Erm. I said it used NFD. It's actually an NFD variant, but that's still NFD-based, and it's a variant mostly because NFD was not completely specified at the time.
> but a custom, stable encoding. Files can be accessed via their NFC name; the kernel will transparently convert any illegal entities. Listing a directory, however, always yields the canonical strings.
Yes, OSX will perform canonical decomposition on CREATE and LOOKUP (though that doesn't mean there are no problems with that, see [3] for incompatibilities between OSX reporting NFD names and bash expecting NFC names for instance), but most Linux distributions will *not* [4], which is why I specifically talked about OSX -> Linux.
> On 28 Apr 2010, at 13:41 , Dan Villiom Podlaski Christiansen wrote:
>> On 28 Apr 2010, at 12:45, Masklinn wrote:
>>> For instance OSX uses NFD for file names where most Linux systems use NFC. This means a file name which displays fine might not be selectable via the console (and potentially via other APIs), because the NFC you'll enter (if the file was transferred from OSX to Linux) will not match the on-disk NFD name.
>> I don't know much about the situation on Linux, but Mac OS X isn't quite as you describe. First of all, Mac OS X doesn't actually use NFC[1]
> Erm. I said it used NFD. It's actually an NFD variant, but that's still NFD-based, and it's a variant mostly because NFD was not completely specified at the time
Sorry, I hit “send” too quickly; you're right, of course :)
>> but a custom, stable encoding. Files can be accessed via their NFC name; the kernel will transparently convert any illegal entities. Listing a directory, however, always yields the canonical strings.
> Yes, OSX will perform canonical decomposition on CREATE and LOOKUP (though that doesn't mean there are no problem with that, see [3] for incompatibilities between OSX reporting NFD names and bash expecting NFC names for instance), but most linux distributions will *not* [4], which is why I specifically talked about OSX -> Linux
Well, what Bash does is quite different from anything Mercurial does. I don't believe we do file name completion anywhere. There were some problems previously with path normalisation and the working copy, but I believe they have been fixed. Honestly, I don't see how path normalisation is relevant to configuration file magic markers…
On Wed, 2010-04-28 at 12:45 +0200, Masklinn wrote:
> For instance OSX uses NFD for file names where most Linux systems use
> NFC.
Actually, no, that's incorrect. Any sensible Unix is encoding-agnostic
and treats filenames as bytes. There is nothing in the path from the
syscall interface down to the disk on, say, ext3 that cares about any
bytes aside from '/' and '\0'. The only sense in which Linux systems
"use" NFC is that the input layer of most applications produces NFC form
from keyboard input (though this may very much depend on locale!) and it
reaches the disk unmolested. If you hand an application a string in NFD
form (for instance, because you rsynced it from an OS X box), it will
similarly reach the disk unmolested.
Here's how you add proper kernel support for Unicode to a traditional
Unix: you stand up on a mountain and say "I hereby declare that the
prefered filesystem encoding is UTF-8". Done. This is the beauty of
UTF-8 - huge bodies of code that were written to be properly
encoding-agnostic just continue to work. And this is all it means when
someone (like Linus) says Linux filesystems are UTF-8.
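The property that makes this work can be checked directly. The sketch below is only an illustration (the function name is made up): it verifies that no non-ASCII character ever encodes to a byte below 0x80 in UTF-8, so bytes such as '/' and '\0' can only ever mean themselves.

```python
def utf8_hides_no_ascii(limit: int = 0x10000) -> bool:
    """Check that every non-ASCII code point below `limit` encodes to
    UTF-8 bytes that are all >= 0x80."""
    for cp in range(0x80, limit):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points cannot be encoded as UTF-8
        if any(byte < 0x80 for byte in chr(cp).encode("utf-8")):
            return False
    return True


# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
ascii_unchanged = "hg/README".encode("utf-8") == "hg/README".encode("ascii")
# No multi-byte sequence can contain b"/" or b"\0" by accident.
no_hidden_ascii = utf8_hides_no_ascii()
```

This is why code that only looks for '/' and '\0' and passes every other byte through untouched keeps working when names are UTF-8.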
> This means a file name which displays fine might not be selectable
> via the console (and potentially via other APIs), because the NFC
> you'll enter (if the file was transferred from OSX to Linux) will not
> match the on-disk NFD name.
If you try to visually copy Unicode, you will fail. But this is not
restricted to NFC/NFD issues. Countless glyphs are indistinguishab1e,
not available in all fonts, or effectively impossible to type in a given
locale without a Unicode reference. But if you do ls + cut&paste, it'll
generally work fine.
> There's also the issue that no two systems use the same encoding (let alone use them consistently), and even if you get two systems to agree on an (hopefully unicode-based) encoding they probably will disagree on something else making all your earlier efforts pointless. For instance OSX uses NFD for file names where most Linux systems use NFC. This means a file name which displays fine might not be selectable via the console (and potentially via other APIs), because the NFC you'll enter (if the file was transferred from OSX to Linux) will not match the on-disk NFD name.
Well, NFD at least ensures a unique representation, whereas on Linux
anything goes :(
> On Wed, 2010-04-28 at 12:45 +0200, Masklinn wrote:
>> For instance OSX uses NFD for file names where most Linux systems use
>> NFC.
> Actually, no, that's incorrect. Any sensible Unix is encoding-agnostic
> and treats filenames as bytes.
I think this is a matter of perpetual debate :-p. POSIX states that file
names are "character data", not binary. "sensible Unix" is a subjective
statement.
On Wed, 2010-04-28 at 23:56 +0200, Sune Foldager wrote:
> On 28-04-2010 12:45, Masklinn wrote:
> > There's also the issue that no two systems use the same encoding (let alone use them consistently), and even if you get two systems to agree on an (hopefully unicode-based) encoding they probably will disagree on something else making all your earlier efforts pointless. For instance OSX uses NFD for file names where most Linux systems use NFC. This means a file name which displays fine might not be selectable via the console (and potentially via other APIs), because the NFC you'll enter (if the file was transferred from OSX to Linux) will not match the on-disk NFD name.
> Well, NFD at least ensures a unique representation, whereas on Linux
> anything goes :(
Yes, much like the lack of case folding, this is a perennial problem for
Linux users and developers everywhere. When oh when will someone ever
submit a case-folding Unicode-normalizing patch for ext3 that we are so
obviously dying for? Oh wait, yeah, no one actually cares about that.
The nightmare is actually for developers on those other systems, where
simply comparing filenames for equality is a black art. If you disagree,
please show me Python code for either Mac or Windows that gets all the
corner cases right in less than, say, 100 lines of code. Good luck:
using Python's NFD tables is insufficient and Windows' case-folding is
not exactly well-documented.
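For reference, the naive attempt looks something like the sketch below. It is only an illustration of the approach described above as insufficient: the function name is made up, and standard Unicode normalization and case folding are not claimed to match what any Mac or Windows file system actually does.

```python
import unicodedata


def naive_same_name(a: str, b: str) -> bool:
    """Compare two file names after generic Unicode case folding and NFD
    normalization. This is NOT the rule any particular file system uses."""
    def key(name: str) -> str:
        folded = unicodedata.normalize("NFD", name).casefold()
        # Case folding can produce text that is no longer normalized,
        # so normalize once more before comparing.
        return unicodedata.normalize("NFD", folded)
    return key(a) == key(b)


# Precomposed upper-case name versus decomposed lower-case name.
match = naive_same_name("Caf\u00e9.TXT", "cafe\u0301.txt")
mismatch = naive_same_name("a.txt", "b.txt")
```

It handles the easy cases, but where Python's tables and a file system's own folding rules disagree, this comparison gives a different answer from the file system, which is the point being made.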