[issue2162] BOM (byte order mark) support for Mercurial.ini

Yuya Nishihara

Apr 27, 2010, 9:48:13 AM
to mercuri...@selenic.com

New submission from Yuya Nishihara <yu...@tcha.org>:

Some text editors, like Notepad.exe, insert BOM (byte order mark) silently if
you save Mercurial.ini as UTF-8.

IMHO, they shouldn't insert BOM for UTF-8, but it's really hard to debug
because BOM isn't visible. So it seems reasonable to skip/recognize BOM
before reading Mercurial.ini.

See also the discussion at tortoisehg:
http://bitbucket.org/tortoisehg/stable/issue/1190/

----------
messages: 12401
nosy: youjah
priority: wish
status: unread
title: BOM (byte order mark) support for Mercurial.ini

____________________________________________________
Mercurial issue tracker <bu...@mercurial.selenic.com>
<http://mercurial.selenic.com/bts/issue2162>
____________________________________________________
_______________________________________________
Mercurial-devel mailing list
Mercuri...@selenic.com
http://selenic.com/mailman/listinfo/mercurial-devel

Alexander Belchenko

Apr 27, 2010, 10:15:50 AM
to mercuri...@selenic.com
Yuya Nishihara wrote:
> New submission from Yuya Nishihara <yu...@tcha.org>:
>
> Some text editors, like Notepad.exe, insert BOM (byte order mark) silently if
> you save Mercurial.ini as UTF-8.
>
> IMHO, they shouldn't insert BOM for UTF-8, but it's really hard to debug
> because BOM isn't visible. So it seems reasonable to skip/recognize BOM
> before reading Mercurial.ini.

I was under impression that UTF-8 might have optional BOM marker, and
Python even has this constant defined:

In [1]: import codecs

In [2]: codecs.BOM
codecs.BOM codecs.BOM_BE codecs.BOM_UTF32
codecs.BOM32_BE codecs.BOM_LE codecs.BOM_UTF32_BE
codecs.BOM32_LE codecs.BOM_UTF16 codecs.BOM_UTF32_LE
codecs.BOM64_BE codecs.BOM_UTF16_BE codecs.BOM_UTF8
codecs.BOM64_LE codecs.BOM_UTF16_LE

In [2]: codecs.BOM_UTF8
Out[2]: '\xef\xbb\xbf'

So, why you say it "shouldn't"?

Mads Kiilerich

Apr 27, 2010, 11:04:59 AM
to Alexander Belchenko, mercuri...@selenic.com
Alexander Belchenko wrote, On 04/27/2010 04:15 PM:
> Yuya Nishihara wrote:
>> New submission from Yuya Nishihara <yu...@tcha.org>:
>>
>> Some text editors, like Notepad.exe, insert BOM (byte order mark)
>> silently if you save Mercurial.ini as UTF-8.
>>
>> IMHO, they shouldn't insert BOM for UTF-8, but it's really hard to
>> debug because BOM isn't visible. So it seems reasonable to
>> skip/recognize BOM before reading Mercurial.ini.
>
> I was under impression that UTF-8 might have optional BOM marker, and
> Python even has this constant defined:
>
> In [1]: import codecs
>
> In [2]: codecs.BOM
> codecs.BOM codecs.BOM_BE codecs.BOM_UTF32
> codecs.BOM32_BE codecs.BOM_LE codecs.BOM_UTF32_BE
> codecs.BOM32_LE codecs.BOM_UTF16 codecs.BOM_UTF32_LE
> codecs.BOM64_BE codecs.BOM_UTF16_BE codecs.BOM_UTF8
> codecs.BOM64_LE codecs.BOM_UTF16_LE
>
> In [2]: codecs.BOM_UTF8
> Out[2]: '\xef\xbb\xbf'
>
> So, why you say it "shouldn't"?

Because it is optional, has no benefit, and "never" is used?

Mercurial is not particularly encoding-aware but very
encoding-transparent. Encoding Mercurial.ini in any ascii-superset is
fine, and BOMs could probably be removed or ignored when parsed, but in
that case the BOM should probably be prepended to all value strings too
... and that would cause other strange issues.

FWIW I'm -0 on special handling of BOM - but a strip on the config file
content before parsing should do no harm.
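
A minimal sketch of such a strip, assuming the config file is read as raw bytes before it reaches the parser. The helper name `strip_utf8_bom` is hypothetical and is not part of Mercurial; only `codecs.BOM_UTF8` comes from the standard library.

```python
import codecs

def strip_utf8_bom(data):
    """Return the config bytes without a leading UTF-8 BOM, if one is present."""
    # Hypothetical helper: only the very first three bytes are examined,
    # so a BOM-like sequence inside a value is left untouched.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# A file as Notepad might save it: BOM first, then ordinary ini content.
raw = codecs.BOM_UTF8 + b"[ui]\nusername = Yuya\n"
cleaned = strip_utf8_bom(raw)
```

Because the check only looks at the start of the data, files without a BOM pass through unchanged.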

Perhaps we could warn if any non-7-bit characters are found before the
first # or =?
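
One way such a check could look, as a rough sketch in Python 3. The function name `suspicious_prefix` is made up for illustration, and treating the whole file as one byte string is an assumption rather than how Mercurial's config reader works.

```python
import re

def suspicious_prefix(data):
    """Return the non-7-bit bytes that occur before the first '#' or '='."""
    # Everything up to the first comment marker or assignment should be
    # plain ASCII (section headers, key names, whitespace).
    match = re.search(rb"[#=]", data)
    head = data if match is None else data[:match.start()]
    return bytes(b for b in head if b >= 0x80)

with_bom = b"\xef\xbb\xbf[ui]\nusername = Yuya\n"
without_bom = b"[ui]\nusername = J\xc3\xb8rgen\n"

found = suspicious_prefix(with_bom)        # the three BOM bytes
nothing = suspicious_prefix(without_bom)   # empty: the non-ASCII byte is in a value
```

A caller could print a warning whenever the returned value is non-empty, while non-ASCII bytes inside values and comments stay allowed.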

/Mads

Yuya Nishihara

Apr 27, 2010, 11:53:27 AM
to Mads Kiilerich, Alexander Belchenko, mercuri...@selenic.com
Mads Kiilerich wrote:
> Alexander Belchenko wrote, On 04/27/2010 04:15 PM:
> > Yuya Nishihara wrote:
> >> New submission from Yuya Nishihara <yu...@tcha.org>:
> >>
> >> Some text editors, like Notepad.exe, insert BOM (byte order mark)
> >> silently if you save Mercurial.ini as UTF-8.
> >>
> >> IMHO, they shouldn't insert BOM for UTF-8, but it's really hard to
> >> debug because BOM isn't visible. So it seems reasonable to
> >> skip/recognize BOM before reading Mercurial.ini.
> >
> > I was under impression that UTF-8 might have optional BOM marker, and
> > Python even has this constant defined:
> >
> > In [1]: import codecs
> >
> > In [2]: codecs.BOM
> > codecs.BOM codecs.BOM_BE codecs.BOM_UTF32
> > codecs.BOM32_BE codecs.BOM_LE codecs.BOM_UTF32_BE
> > codecs.BOM32_LE codecs.BOM_UTF16 codecs.BOM_UTF32_LE
> > codecs.BOM64_BE codecs.BOM_UTF16_BE codecs.BOM_UTF8
> > codecs.BOM64_LE codecs.BOM_UTF16_LE
> >
> > In [2]: codecs.BOM_UTF8
> > Out[2]: '\xef\xbb\xbf'
> >
> > So, why you say it "shouldn't"?
>
> Because it is optional, has no benefit, and "never" is used?

I heard it can be used for detection of character encoding,
but it seems silly to lose ascii compatibility just for such reason.
UTF-8 does exist for ascii transparency.

> Mercurial is not particular encoding-aware but very
> encoding-transparent. Encoding Mercurial.ini in any ascii-superset is
> fine, and BOMs could probably be removed or ignored when parsed, but in
> that case the BOM should probably be prepended to all value strings too
> ... and that would cause other strange issues.
>
> FWIW I'm -0 on special handling of BOM - but a strip on the config file
> content before parsing should do no harm.
>
> Perhaps we could warn if any non-7-bit characters if found before the
> first # or =?

That seems good to me. Stripping the BOM is simple enough, but because
Mercurial doesn't care about encoding, the warning can come later.

Yuya,

Alexander Belchenko

Apr 28, 2010, 3:29:15 AM
to mercuri...@selenic.com
Yuya Nishihara wrote:

I don't understand what "ascii transparency" means here. When somebody
talks about "ascii" seriously, to me it sounds like pretending we live
on a flat world that stands on the back of a big turtle.

Sune Foldager

Apr 28, 2010, 5:02:06 AM
to Alexander Belchenko, mercuri...@selenic.com
On 28-04-2010 09:29, Alexander Belchenko wrote:
>>>> So, why you say it "shouldn't"?
>>> Because it is optional, has no benefit, and "never" is used?
>>
>> I heard it can be used for detection of character encoding,
>> but it seems silly to lose ascii compatibility just for such reason.
>> UTF-8 does exist for ascii transparency.
>
> I don't understand what is "ascii transparency" here. When somebody said
> about "ascii" seriously, for me it sounds the same as pretend we're
> living in the flat world which stand on the back of big turtle.

Welcome to the UNIX world, where many people are scared of anything
non-ASCII due to compatibility with ancient programs ;-)

/Sune

Masklinn

Apr 28, 2010, 6:45:58 AM
to mercuri...@selenic.com
On 27 Apr 2010, at 16:15 , Alexander Belchenko wrote:
>
> I was under impression that UTF-8 might have optional BOM marker, and Python even has this constant defined:
>
> In [1]: import codecs
>
> In [2]: codecs.BOM
> codecs.BOM codecs.BOM_BE codecs.BOM_UTF32
> codecs.BOM32_BE codecs.BOM_LE codecs.BOM_UTF32_BE
> codecs.BOM32_LE codecs.BOM_UTF16 codecs.BOM_UTF32_LE
> codecs.BOM64_BE codecs.BOM_UTF16_BE codecs.BOM_UTF8
> codecs.BOM64_LE codecs.BOM_UTF16_LE
>
> In [2]: codecs.BOM_UTF8
> Out[2]: '\xef\xbb\xbf'
>
> So, why you say it "shouldn't"?
Well, since utf-8 has no "customizable" byte order, "byte order mark" is a misnomer to start with. Second, while it's allowed, the byte order mark in a utf-8 document is *not recommended* by the official Unicode standard:

> Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature
(Unicode Standard 5.0 chapter 2)

and it's generally a pain in the ass.

On 28 Apr 2010, at 11:02 , Sune Foldager wrote:
> On 28-04-2010 09:29, Alexander Belchenko wrote:
>>>>> So, why you say it "shouldn't"?
>>>> Because it is optional, has no benefit, and "never" is used?
>>> I heard it can be used for detection of character encoding,
>>> but it seems silly to lose ascii compatibility just for such reason.
>>> UTF-8 does exist for ascii transparency.
>> I don't understand what is "ascii transparency" here. When somebody said
>> about "ascii" seriously, for me it sounds the same as pretend we're
>> living in the flat world which stand on the back of big turtle.
> Welcome to the UNIX world, where many people are scared of anything non-ASCII due to compatibility with ancient programs ;-)
There's also the issue that no two systems use the same encoding (let alone use them consistently), and even if you get two systems to agree on a (hopefully Unicode-based) encoding they probably will disagree on something else, making all your earlier efforts pointless. For instance OSX uses NFD for file names where most Linux systems use NFC. This means a file name which displays fine might not be selectable via the console (and potentially via other APIs), because the NFC you'll enter (if the file was transferred from OSX to Linux) will not match the on-disk NFD name.
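
The mismatch can be reproduced with Python 3's `unicodedata` module. This is only an illustration with a made-up file name; standard NFD is used here as a stand-in for whatever variant a given file system applies.

```python
import unicodedata

# The name as typed on a system whose input layer produces NFC.
typed = unicodedata.normalize("NFC", "r\u00e9sum\u00e9.txt")

# The same name in decomposed form, as it might arrive from an OS X volume.
on_disk = unicodedata.normalize("NFD", typed)

same_text = typed == on_disk                    # code points differ
same_bytes = typed.encode("utf-8") == on_disk.encode("utf-8")
lengths = (len(typed), len(on_disk))            # each accent adds a code point

# Normalizing both sides to one form makes the comparison succeed.
matches_after_normalizing = unicodedata.normalize("NFC", on_disk) == typed
```

Both strings render identically, yet they differ as code points and as UTF-8 bytes, which is why a byte-wise lookup fails.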

Dan Villiom Podlaski Christiansen

Apr 28, 2010, 7:41:46 AM
to Masklinn, mercuri...@selenic.com
On 28 Apr 2010, at 12:45, Masklinn wrote:

> For instance OSX uses NFD for file names where most Linux systems use NFC. This means a file name which displays fine might not be selectable via the console (and potentially via other APIs), because the NFC you'll enter (if the file was transferred from OSX to Linux) will not match the on-disk NFD name.

I don't know much about the situation on Linux, but Mac OS X isn't quite as you describe. First of all, Mac OS X doesn't actually use NFC[1] but a custom, stable encoding. Files can be accessed via their NFC name; the kernel will transparently convert any illegal entities. Listing a directory, however, always yields the canonical strings.

In short: I don't believe it'll matter much how you encode file names on Mac OS X, unless you do file name completion, or try to compare paths for equality.

[1] See <http://developer.apple.com/mac/library/technotes/tn/tn1150.html#UnicodeSubtleties>.
[2] Which is described in <http://developer.apple.com/mac/library/technotes/tn/tn1150table.html>.

--

Dan Villiom Podlaski Christiansen
dan...@gmail.com

Masklinn

Apr 28, 2010, 8:00:56 AM
to Dan Villiom Podlaski Christiansen, mercuri...@selenic.com
On 28 Apr 2010, at 13:41 , Dan Villiom Podlaski Christiansen wrote:
>
> On 28 Apr 2010, at 12:45, Masklinn wrote:
>> For instance OSX uses NFD for file names where most Linux systems use NFC. This means a file name which displays fine might not be selectable via the console (and potentially via other APIs), because the NFC you'll enter (if the file was transferred from OSX to Linux) will not match the on-disk NFD name.
>
> I don't know much about the situation on Linux, but Mac OS X isn't quite as you describe. First of all, Mac OS X doesn't actually use NFC[1]
Erm. I said it used NFD. It's actually an NFD variant, but that's still NFD-based, and it's a variant mostly because NFD was not completely specified at the time

> but a custom, stable encoding. Files can be accessed via their NFC name; the kernel will transparently convert any illegal entities. Listing a directory, however, always yields the canonical strings.
Yes, OSX will perform canonical decomposition on CREATE and LOOKUP (though that doesn't mean there are no problems with that, see [3] for incompatibilities between OSX reporting NFD names and bash expecting NFC names for instance), but most Linux distributions will *not* [4], which is why I specifically talked about OSX -> Linux

[3] http://www.mail-archive.com/bug-...@gnu.org/msg04070.html
[4] http://blogs.sun.com/nico/entry/filesystem_i18n

Dan Villiom Podlaski Christiansen

Apr 28, 2010, 8:20:42 AM
to Masklinn, mercuri...@selenic.com

On 28 Apr 2010, at 14:00, Masklinn wrote:

> On 28 Apr 2010, at 13:41 , Dan Villiom Podlaski Christiansen wrote:
>>
>> On 28 Apr 2010, at 12:45, Masklinn wrote:
>>> For instance OSX uses NFD for file names where most Linux systems use NFC. This means a file name which displays fine might not be selectable via the console (and potentially via other APIs), because the NFC you'll enter (if the file was transferred from OSX to Linux) will not match the on-disk NFD name.
>>
>> I don't know much about the situation on Linux, but Mac OS X isn't quite as you describe. First of all, Mac OS X doesn't actually use NFC[1]
> Erm. I said it used NFD. It's actually an NFD variant, but that's still NFD-based, and it's a variant mostly because NFD was not completely specified at the time

Sorry, I hit “send” too quickly; you're right, of course :)

>> but a custom, stable encoding. Files can be accessed via their NFC name; the kernel will transparently convert any illegal entities. Listing a directory, however, always yields the canonical strings.
> Yes, OSX will perform canonical decomposition on CREATE and LOOKUP (though that doesn't mean there are no problem with that, see [3] for incompatibilities between OSX reporting NFD names and bash expecting NFC names for instance), but most linux distributions will *not* [4], which is why I specifically talked about OSX -> Linux

Well, what Bash does is quite different from anything Mercurial does. I don't believe we do file name completion anywhere. There were some problems previously with path normalisation and the working copy, but I believe they have been fixed. Honestly, I don't see how path normalisation is relevant to configuration file magic markers…

--

Martin Geisler

Apr 28, 2010, 9:48:21 AM
to Dan Villiom Podlaski Christiansen, Masklinn, mercuri...@selenic.com
Dan Villiom Podlaski Christiansen <dan...@gmail.com> writes:

> Well, what Bash does is quite different from anything Mercurial does.
> I don't believe we do file name completion anywhere.

What about

hg status -I '*.txt'

I think that will give you the same problems as mentioned in the bash
bugreport.
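
As a rough illustration of where pattern matching and normalization can interact, here is a sketch using Python 3's `fnmatch` as a stand-in. It says nothing about how Mercurial's own matcher behaves, and the file names are invented. A purely ASCII pattern such as '*.txt' is unaffected; the trouble appears when the pattern contains an accented literal or counts characters with '?'.

```python
import fnmatch
import unicodedata

nfc_name = unicodedata.normalize("NFC", "caf\u00e9.txt")
nfd_name = unicodedata.normalize("NFD", nfc_name)   # 'e' + combining acute

# ASCII-only pattern: matches both forms.
star_nfc = fnmatch.fnmatchcase(nfc_name, "*.txt")
star_nfd = fnmatch.fnmatchcase(nfd_name, "*.txt")

# Pattern with an NFC literal: only the NFC name matches.
literal_nfc = fnmatch.fnmatchcase(nfc_name, "caf\u00e9*")
literal_nfd = fnmatch.fnmatchcase(nfd_name, "caf\u00e9*")

# '?' matches one code point, so the decomposed name is one too long.
single_nfc = fnmatch.fnmatchcase(nfc_name, "caf?.txt")
single_nfd = fnmatch.fnmatchcase(nfd_name, "caf?.txt")
```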

--
Martin Geisler

aragost Trifork
Professional Mercurial support
http://aragost.com/mercurial/

Matt Mackall

Apr 28, 2010, 12:59:12 PM
to Masklinn, mercuri...@selenic.com
On Wed, 2010-04-28 at 12:45 +0200, Masklinn wrote:
> For instance OSX uses NFD for file names where most Linux systems use
> NFC.

Actually, no, that's incorrect. Any sensible Unix is encoding-agnostic
and treats filenames as bytes. There is nothing in the path from the
syscall interface down to the disk on, say, ext3 that cares about any
bytes aside from '/' and '\0'. The only sense in which Linux systems
"use" NFC is that the input layer of most applications produce NFC form
from keyboard input (though this may very much depend on locale!) and it
reaches the disk unmolested. If you hand an application a string in NFD
form (for instance, because you rsynced it from an OS X box), it will
similarly reach the disk unmolested.

Here's how you add proper kernel support for Unicode to a traditional
Unix: you stand up on a mountain and say "I hereby declare that the
preferred filesystem encoding is UTF-8". Done. This is the beauty of
UTF-8 - huge bodies of code that were written to be properly
encoding-agnostic just continue to work. And this is all it means when
someone (like Linus) says Linux filesystems are UTF-8.
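
The property that makes this work can be checked in a few lines of Python 3: every byte of a multi-byte UTF-8 sequence has its high bit set, so an encoded name can never contain a stray '/' or '\0'. The sample name and the helper `utf8_is_byte_safe` are made up for illustration; UTF-16 is shown only as a contrast.

```python
def utf8_is_byte_safe(text):
    """True if no non-ASCII character in text encodes to a byte below 0x80."""
    return all(
        byte >= 0x80
        for ch in text
        if ord(ch) > 0x7F
        for byte in ch.encode("utf-8")
    )

name = "r\u00e9sum\u00e9-\u65e5\u672c\u8a9e.txt"
utf8_name = name.encode("utf-8")
utf16_name = name.encode("utf-16-le")

safe = utf8_is_byte_safe(name)
utf8_has_special = b"/" in utf8_name or b"\0" in utf8_name
utf16_has_nul = b"\0" in utf16_name   # ASCII letters become 'x\0' pairs
```

Code that only treats '/' and '\0' specially therefore handles UTF-8 names unchanged, while a UTF-16 name would be cut at the first NUL byte.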

> This means a file name which displays fine might not be selectable
> via the console (and potentially via other APIs), because the NFC
> you'll enter (if the file was transferred from OSX to Linux) will not
> match the on-disk NFD name.

If you try to visually copy Unicode, you will fail. But this is not
restricted to NFC/NFD issues. Countless glyphs are indistinguishab1e,
not available in all fonts, or effectively impossible to type in a given
locale without a Unicode reference. But if you do ls + cut&paste, it'll
generally work fine.

--
http://selenic.com : development and support for Mercurial and Linux

Sune Foldager

Apr 28, 2010, 5:56:29 PM
to Masklinn, mercuri...@selenic.com
On 28-04-2010 12:45, Masklinn wrote:
> There's also the issue that no two systems use the same encoding (let alone use them consistently), and even if you get two systems to agree on an (hopefully unicode-based) encoding they probably will disagree on something else making all your earlier efforts pointless. For instance OSX uses NFD for file names where most Linux systems use NFC. This means a file name which displays fine might not be selectable via the console (and potentially via other APIs), because the NFC you'll enter (if the file was transferred from OSX to Linux) will not match the on-disk NFD name.

Well, NFD at least ensures a unique representation, whereas on Linux
anything goes :(

/Sune

Sune Foldager

Apr 28, 2010, 6:09:15 PM
to Matt Mackall, Masklinn, mercuri...@selenic.com
On 28-04-2010 18:59, Matt Mackall wrote:
> On Wed, 2010-04-28 at 12:45 +0200, Masklinn wrote:
>> For instance OSX uses NFD for file names where most Linux systems use
>> NFC.
>
> Actually, no, that's incorrect. Any sensible Unix is encoding-agnostic
> and treats filenames as bytes.

I think this is a matter of perpetual debate :-p. POSIX states that file
names are "character data", not binary. "sensible Unix" is a subjective
statement.

/Sune

Matt Mackall

Apr 28, 2010, 6:59:04 PM
to Sune Foldager, Masklinn, mercuri...@selenic.com
On Wed, 2010-04-28 at 23:56 +0200, Sune Foldager wrote:
> On 28-04-2010 12:45, Masklinn wrote:
> > There's also the issue that no two systems use the same encoding (let alone use them consistently), and even if you get two systems to agree on an (hopefully unicode-based) encoding they probably will disagree on something else making all your earlier efforts pointless. For instance OSX uses NFD for file names where most Linux systems use NFC. This means a file name which displays fine might not be selectable via the console (and potentially via other APIs), because the NFC you'll enter (if the file was transferred from OSX to Linux) will not match the on-disk NFD name.
>
> Well, NFD at least ensures a unique representation, whereas on Linux
> anything goes :(

Yes, much like the lack of case folding, this is a perennial problem for
Linux users and developers everywhere. When oh when will someone ever
submit a case-folding Unicode-normalizing patch for ext3 that we are so
obviously dying for? Oh wait, yeah, no one actually cares about that.

The nightmare is actually for developers on those other systems, where
simply comparing filenames for equality is a black art. If you disagree,
please show me Python code for either Mac or Windows that gets all the
corner cases right in less than, say, 100 lines of code. Good luck:
using Python's NFD tables is insufficient and Windows' case-folding is
not exactly well-documented.

--
http://selenic.com : development and support for Mercurial and Linux
