svn checkout - special characters in file name are not encoding properly

suman....@asia.bnpparibas.com

Aug 9, 2010, 2:12:04 AM
to us...@subversion.apache.org

Hi,
I'm working on a French project. We recently migrated the project from CVS
to an SVN repository. Since the migration, file names that contain special
(Western European) characters come out wrong on checkout.

For example, the file is named "modŠle fields-replacements.xsl" in Subversion,
but after checking it out to the local machine it appears as "mod?le
fields-replacements.xsl". A '?' is displayed in the file name in place of
Š (S with caron).

Please help me get the correct file names from Subversion, without any
encoding problems.

Thank you,
Regards,
Sunny



This message and any attachments (the "message") is
intended solely for the addressees and is confidential. 
If you receive this message in error, please delete it and 
immediately notify the sender. Any use not in accord with 
its purpose, any dissemination or disclosure, either whole 
or partial, is prohibited except formal approval. The internet
can not guarantee the integrity of this message. 
BNP PARIBAS (and its subsidiaries) shall (will) not 
therefore be liable for the message if modified. 
Do not print this message unless it is necessary,
consider the environment.

                ---------------------------------------------

Ce message et toutes les pieces jointes (ci-apres le 
"message") sont etablis a l'intention exclusive de ses 
destinataires et sont confidentiels. Si vous recevez ce 
message par erreur, merci de le detruire et d'en avertir 
immediatement l'expediteur. Toute utilisation de ce 
message non conforme a sa destination, toute diffusion 
ou toute publication, totale ou partielle, est interdite, sauf 
autorisation expresse. L'internet ne permettant pas 
d'assurer l'integrite de ce message, BNP PARIBAS (et ses
filiales) decline(nt) toute responsabilite au titre de ce 
message, dans l'hypothese ou il aurait ete modifie.
N'imprimez ce message que si necessaire,
pensez a l'environnement.

Daniel Shahaf

Aug 9, 2010, 11:31:43 AM
to suman....@asia.bnpparibas.com, us...@subversion.apache.org
(I'm going to handle just the "svn repository" potential cause of the
problem. I'll let others handle the "client-side problem, but repository is
okay" potential cause.)

suman....@asia.bnpparibas.com wrote on Mon, Aug 09, 2010 at 11:42:04 +0530:
> Hi,
> I'm working on French project. Very recently we have migrated our project
> from CVS to SVN repository. After migration, when checking out the file
> names which have special characters like Western-Europe encoding fonts are
> giving problem.
>
> e.g.:"modŠle fields-replacements.xsl" is the file name inside the
> Subversion and after checking out the file into the local machine it is
> coming like "mod?le fields-replacements.xsl"
> Inside the file name '?' is displaying rather than Š(S with caron).
>
> Please help me to get the proper file-names from subversion without any
> encoding
> problem
>

We do support UTF-8 file names:

0:% x="modŠle fields-replacements.xsl"
0:% echo foo > $x
0:% $svnmucc put -mmsg $x file://`pwd`/r1/$x
r2 committed by daniel at 2010-08-09T15:28:18.165794Z
0:% $svn up wc1
A wc1/modŠle fields-replacements.xsl
Updated to revision 2.
0:% ls wc1
branches modŠle fields-replacements.xsl tags trunk
0:%

(this is under linux with a UTF-8 locale)


Given that you migrated from CVS to SVN, can you check that the filenames
inside the repository's filesystem are encoded in UTF-8 and not in iso-8859-*?

> Thank you,
> Regards,
> Sunny
>
>

<snip disclaimer>

Alexander Skwar

Aug 9, 2010, 12:22:57 PM
to Daniel Shahaf, suman....@asia.bnpparibas.com, us...@subversion.apache.org
Hi.


On 09.08.2010 at 17:31, Daniel Shahaf <d...@daniel.shahaf.name> wrote:

> suman....@asia.bnpparibas.com wrote on Mon, Aug 09, 2010 at 11:42:04 +0530:
>> Hi,
>> I'm working on French project. Very recently we have migrated our project
>> from CVS to SVN repository. After migration, when checking out the file
>> names which have special characters like Western-Europe encoding fonts are
>> giving problem.
>>
>> e.g.:"modŠle fields-replacements.xsl" is the file name inside the
>> Subversion and after checking out the file into the local machine it is
>> coming like "mod?le fields-replacements.xsl"
>> Inside the file name '?' is displaying rather than Š(S with caron).
>>
>> Please help me to get the proper file-names from subversion without any
>> encoding problem
>>
>
> We do support UTF-8 file names:
>

And I bet that exactly this is the problem: I bet he's on a platform
that doesn't use UTF-8 for encoding filenames (e.g. Windows).

How good is the ISO-8859-x support, especially if Linux and Windows are
used in parallel?

Alexander


Daniel Shahaf

Aug 9, 2010, 12:30:00 PM
to Alexander Skwar, suman....@asia.bnpparibas.com, us...@subversion.apache.org

In the repository filesystem, we use UTF-8 exclusively. APR handles
translating that UTF-8 to whatever the local OS supports.


Bert Huijben

Aug 9, 2010, 4:44:47 PM
to Alexander Skwar, Daniel Shahaf, suman....@asia.bnpparibas.com, us...@subversion.apache.org

All standard filesystems on Windows since Windows '95 / NT use UTF-16 (or UCS-2) in the filesystem (NTFS, FAT32, VFAT), so Windows can express any character expressed in utf-8 by simple recoding. On Windows '9X the usual way to access them was using the ANSI set, but all Windows '9X versions and even Windows 2000 are out of support now, so APR handles everything for us in Unicode now.

There might be some issues if you use network shares, especially if the network share is hosted on some NAS, because these systems sometimes use older protocols and/or filesystems which don't support Unicode.

Bert

suman....@asia.bnpparibas.com

Aug 10, 2010, 1:22:10 AM
to d...@daniel.shahaf.name, us...@subversion.apache.org

Please let me know how to check which encoding is used for file names in the repository, and help me change it to a proper encoding so that the files check out correctly.

--Sunny


Daniel Shahaf

Aug 10, 2010, 1:33:14 AM
to suman....@asia.bnpparibas.com, us...@subversion.apache.org
suman....@asia.bnpparibas.com wrote on Tue, Aug 10, 2010 at 10:52:10 +0530:
> Please let me know how to check the file system encoding type in
> repository,

Use tools that access the repository directly. (i.e., the tools that take
a *path*, rather than a URL, of the repository.)

For example:

svnlook tree --full-paths /path/to/repos
svnadmin dump /path/to/repos | grep '^Node-path:'
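
To tell UTF-8 from ISO-8859-* in that output, one option is to test whether each Node-path decodes as strict UTF-8. The following is only a minimal Python sketch, assuming the dump has been read as bytes. The helper name `non_utf8_node_paths` and the sample data are made up for illustration, and a line inside file content that happens to start with "Node-path: " would be picked up as well.

```python
def non_utf8_node_paths(dump_bytes):
    """Return the raw Node-path values that are not valid UTF-8."""
    prefix = b"Node-path: "
    bad = []
    for line in dump_bytes.split(b"\n"):
        if not line.startswith(prefix):
            continue
        raw = line[len(prefix):]
        try:
            raw.decode("utf-8")  # strict decoding: raises on invalid bytes
        except UnicodeDecodeError:
            bad.append(raw)
    return bad


# Made-up sample: the first path is UTF-8, the second is ISO-8859-1.
sample = (
    b"Node-path: trunk/mod\xc3\xa8le.xsl\n"
    b"Node-kind: file\n"
    b"\n"
    b"Node-path: trunk/mod\xe8le.xsl\n"
    b"Node-kind: file\n"
)
print(non_utf8_node_paths(sample))  # [b'trunk/mod\xe8le.xsl']
```

An empty result does not prove the names are right (pure ASCII is valid in both encodings), but any path reported here is certainly not UTF-8.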

> and assist me to change the encoding with a proper encoding
> format to get the files properly.
>

*If* the encoding in the repository filesystem is wrong, then you'll need to
rewrite history. I suppose one of the dumpfile manipulation tools out there
would be your best bet; someone on the list might be able to make a more
specific recommendation.

Vincent Lefevre

Aug 10, 2010, 10:59:58 AM
to us...@subversion.apache.org
On 2010-08-09 19:30:00 +0300, Daniel Shahaf wrote:
> In the repository filesystem, we use UTF-8 exclusively. APR handles
> translating that UTF-8 to whatever the local OS supports.

Which is meaningless, since under Unix, the locale is not related
to the OS, but to the process: one can have a shell session with
UTF-8 locales and another shell session with ISO-8859-* locales.
Unfortunately the svn client doesn't remember which one was used
in the first place. The consequence is that if the user works
with different locales, things go wrong (even if the user doesn't
execute any command with non-ASCII characters in its arguments).

--
Vincent Lefèvre <vin...@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / Arénaire project (LIP, ENS-Lyon)

Stefan Sperling

Aug 10, 2010, 11:42:57 AM
to us...@subversion.apache.org
On Tue, Aug 10, 2010 at 04:59:58PM +0200, Vincent Lefevre wrote:
> On 2010-08-09 19:30:00 +0300, Daniel Shahaf wrote:
> > In the repository filesystem, we use UTF-8 exclusively. APR handles
> > translating that UTF-8 to whatever the local OS supports.
>
> Which is meaningless, since under Unix, the locale is not related
> to the OS, but to the process: one can have a shell session with
> UTF-8 locales and another shell session with ISO-8859-* locales.

I don't understand your point.

The repository uses UTF-8 internally regardless of the locale of the
server process. mod_dav_svn actually runs in the "C" locale because the
httpd server does not propagate locale information to its modules for
"security reasons". mod_dav_svn still receives all filenames from the
client encoded in UTF-8.

The locale only matters when data is presented to the user (by the svn
client, or svnlook, or svnadmin, ...) in which case Subversion uses iconv
to translate the UTF-8 data into the character set of the current locale.
If that does not work, an error message is printed.
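
The conversion step can be imitated in a few lines. This is a rough Python sketch, not Subversion's code: Python's codecs stand in for iconv, and `to_native` is a made-up helper that returns None where svn would print its error.

```python
def to_native(name, charset):
    """Encode a repository name for a locale charset, or return None."""
    try:
        return name.encode(charset)
    except UnicodeEncodeError:
        return None  # the charset has no representation for some character


name = "mod\u0160le fields-replacements.xsl"  # U+0160 is Š (S with caron)

print(to_native(name, "utf-8"))        # b'mod\xc5\xa0le fields-replacements.xsl'
print(to_native(name, "iso-8859-15"))  # b'mod\xa6le fields-replacements.xsl'
print(to_native(name, "iso-8859-1"))   # None: latin1 has no Š
```

Whether a name converts therefore depends on the exact charset of the locale, not just on it being "Western European".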

> Unfortunately the svn client doesn't remember which one was used
> in the first place. The consequence is that if the user works
> with different locales, things go wrong (even if the user doesn't
> execute any command with non-ASCII characters in its arguments).

AFAIK there is no standard mechanism on UNIX for telling a process about
filename encodings. Filenames are just byte sequences. It's up to the
application to present the byte sequence to the user in a meaningful way.

One way of doing it is assuming the character set of the current
locale and hoping that this will work. That of course breaks down when people
try to work with the same set of files in locales using different character
sets (like latin1 vs. UTF-8). E.g. you can't check out a working copy using
a UTF-8 locale, and then use it with an svn client in a latin1 locale, and
expect things to just work, if you have filenames in the repository which
contain non-ASCII characters.
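
A small Python illustration of why that breaks, using "modèle.xsl" as an assumed example name: the same file name becomes two different byte sequences on disk depending on the charset used to create it.

```python
name = "mod\u00e8le.xsl"  # "modèle.xsl"

utf8_bytes = name.encode("utf-8")         # written by a client in a UTF-8 locale
latin1_bytes = name.encode("iso-8859-1")  # what a client in a latin1 locale looks for

print(utf8_bytes)    # b'mod\xc3\xa8le.xsl'
print(latin1_bytes)  # b'mod\xe8le.xsl'
print(utf8_bytes == latin1_bytes)  # False

# The UTF-8 bytes as they appear when read back under latin1:
print(utf8_bytes.decode("iso-8859-1"))  # modÃ¨le.xsl
```

The second client neither finds the file under the name it expects nor displays the existing one correctly.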

There are extensions in some systems like Linux, where filename encoding
can be specified at mount time and a process can query this information.
But the actual encoding of filenames might still differ (e.g. due to user
error). But more importantly since there is no common standard I don't
see how you'd solve this problem in a portable way.

Stefan

Vincent Lefevre

Aug 10, 2010, 1:44:35 PM
to us...@subversion.apache.org
On 2010-08-10 17:42:57 +0200, Stefan Sperling wrote:
> The locale only matters when data is presented to the user (by the svn
> client, or svnlook, or svnadmin, ...) in which case Subversion uses iconv
> to translate the UTF-8 data into the character set of the current locale.

The svn client also uses the locale for filename encoding.

> AFAIK there is no standard mechanism on UNIX for telling a process about
> filename encodings. Filenames are just byte sequences. It's up to the
> application to present the byte sequence to the user in a meaningful way.

Yes, however "meaningful" depends on what the user expects.

> One way of doing it is assuming the character set of the current
> locale and hoping that this will work. That of course breaks down when people
> try to work with the same set of files in locales using different character
> sets (like latin1 vs. UTF-8). E.g. you can't check out a working copy using
> a UTF-8 locale, and then use it with an svn client in a latin1 locale, and
> expect things to just work, if you have filenames in the repository which
> contain non-ASCII characters.

This is precisely the problem I've mentioned.

> There are extensions in some systems like Linux, where filename encoding
> can be specified at mount time and a process can query this information.
> But the actual encoding of filenames might still differ (e.g. due to user
> error). But more importantly since there is no common standard I don't
> see how you'd solve this problem in a portable way.

This is easy (at least from the specification point of view): once the
encoding has been determined[*], typically at checkout time, store the
encoding in the WC metadata (with the current WC layout, that would be
some file under the .svn directory), so that the next time the svn
client is used for this WC, the same encoding will be used, avoiding
inconsistencies (such as currently obtained by two "svn up" under two
different locales).

[*] There are several ways to do that, such as:
1. Use a charset specified by the user in the svn config file.
2. Use the current locale.
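
As a purely hypothetical Python sketch of this proposal (Subversion's .svn directory has no such file; the file name `filename-encoding` and both helper names are invented here), the encoding is written once and later runs prefer it over the current locale:

```python
import os
import tempfile

# Hypothetical metadata file; not part of any real working-copy format.
ENCODING_FILE = os.path.join(".svn", "filename-encoding")


def remember_encoding(wc_dir, encoding):
    """Record the filename encoding chosen for this working copy."""
    path = os.path.join(wc_dir, ENCODING_FILE)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="ascii") as f:
        f.write(encoding + "\n")


def encoding_for(wc_dir, locale_encoding):
    """Return the recorded encoding, falling back to the locale's."""
    try:
        with open(os.path.join(wc_dir, ENCODING_FILE), encoding="ascii") as f:
            return f.read().strip()
    except FileNotFoundError:
        return locale_encoding


wc = tempfile.mkdtemp()
print(encoding_for(wc, "ISO-8859-1"))  # ISO-8859-1 (nothing recorded yet)
remember_encoding(wc, "UTF-8")         # e.g. at checkout time
print(encoding_for(wc, "ISO-8859-1"))  # UTF-8 (the locale no longer matters)
```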

Stefan Sperling

Aug 10, 2010, 2:59:00 PM
to us...@subversion.apache.org
On Tue, Aug 10, 2010 at 07:44:35PM +0200, Vincent Lefevre wrote:
> On 2010-08-10 17:42:57 +0200, Stefan Sperling wrote:
> > There are extensions in some systems like Linux, where filename encoding
> > can be specified at mount time and a process can query this information.
> > But the actual encoding of filenames might still differ (e.g. due to user
> > error). But more importantly since there is no common standard I don't
> > see how you'd solve this problem in a portable way.
>
> This is easy (at least from the specification point of view): once the
> encoding has been determined[*], typically at checkout time, store the
> encoding in the WC metadata (with the current WC layout, that would be
> some file under the .svn directory), so that the next time the svn
> client is used for this WC, the same encoding will be used, avoiding
> inconsistencies (such as currently obtained by two "svn up" under two
> different locales).

I doubt this can be made to work properly. A feature like that is just
asking people to shoot themselves in the foot.

People simply should not mix character sets like that in their working copies.
There should be a project-wide convention about the encoding used for
filenames, and everyone should be using that encoding (unless there
really is a project-specific need to have filenames in multiple encodings
for some reason, but that's really rare -- and whoever does this should be
smart enough to deal with the consequences).

Right now, if the filename cannot be represented in the current locale,
you get this error: "svn: Can't convert string from 'UTF-8' to native encoding"

The native encoding is determined by the locale, but that does not matter.
The point is that, wherever encoding configuration happens to come from,
if the configured encoding cannot represent the character string stored
as UTF-8 in the repository, what is Subversion supposed to do? It cannot
really do anything with a filename it cannot represent in the character
set configured by the user, other than throwing an error.

The filename conversion to UTF-8 and back must not be lossy. Because
to uniquely identify a file the client needs to send the same UTF-8 byte
sequence it got from the server back to the server. And it needs to keep
doing so for backwards compatibility. This is biting us on Mac OS X by the
way, because some characters have multiple representations in UTF-8,
see http://subversion.tigris.org/issues/show_bug.cgi?id=2464

> [*] There are several ways to do that, such as:
> 1. Use a charset specified by the user in the svn config file.

That provides no advantage over checking the current locale.

> 2. Use the current locale.

That's what's being done. But we're not writing the information down in the
working copy meta data, and doing so is quite pointless as described above.

Stefan

Vincent Lefevre

Aug 10, 2010, 6:31:48 PM
to us...@subversion.apache.org
On 2010-08-10 20:59:00 +0200, Stefan Sperling wrote:
> On Tue, Aug 10, 2010 at 07:44:35PM +0200, Vincent Lefevre wrote:
> > This is easy (at least from the specification point of view): once the
> > encoding has been determined[*], typically at checkout time, store the
> > encoding in the WC metadata (with the current WC layout, that would be
> > some file under the .svn directory), so that the next time the svn
> > client is used for this WC, the same encoding will be used, avoiding
> > inconsistencies (such as currently obtained by two "svn up" under two
> > different locales).
>
> I doubt this can be made to work properly. A feature like that is just
> asking people to shoot themselves in the foot.

I don't see any problem with it. If you want another method, then fine,
but in any case, a command like "svn up" should not fail just because
it is executed under locales unexpected by the client.

> People simply should not mix character sets like that in their
> working copies.

It seems that you didn't understand what I proposed. My proposal is
just to *avoid* mixing character sets in filenames (contrary to what
svn currently does), i.e. to use a single character set, defined at
checkout time (for instance).

> There should be a project-wide convention about the encoding used for
> filenames, and everyone should be using that encoding

For the repository, of course, but it is already the case: UTF-8.
For working copies, if a single encoding must be defined, it should
be UTF-8 too, in particular to be sure to be able to represent all
the filenames that can occur.

> (unless there
> really is a project-specific need to have filenames in multiple encodings
> for some reason, but that's really rare -- and whoever does this should be
> smart enough to deal with the consequences).
>
> Right now, if the filename cannot be represented in the current locale,
> you get this error: "svn: Can't convert string from 'UTF-8' to native encoding"

which is bad and prevents users from writing POSIX-conforming scripts
using svn, i.e. under the POSIX locale (except on systems where the
POSIX locale uses UTF-8, but I don't know any).

> The native encoding is determined by the locale, but that does not matter.
> The point is that, wherever encoding configuration happens to come from,
> if the configured encoding cannot represent the character string stored
> as UTF-8 in the repository, what is Subversion supposed to do? It cannot
> really do anything with a filename it cannot represent in the character
> set configured by the user, other than throwing an error.

For filenames stored on disk, they (all of them) can be encoded using
UTF-8. Remember, filenames on a POSIX system are just sequences of
bytes. For what is output to the terminal, non-representable
characters can be displayed by a replacement character such as "?".
This can still be better than an error.
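
For terminal output, such a substitution could look like the Python sketch below. `for_display` is an illustrative name, and Python's "replace" error handler stands in for whatever the client would actually use.

```python
def for_display(name, terminal_charset):
    """Render a name for a terminal, substituting '?' where needed."""
    encoded = name.encode(terminal_charset, errors="replace")
    return encoded.decode(terminal_charset)


name = "mod\u0160le fields-replacements.xsl"  # U+0160 is Š

print(for_display(name, "ascii"))  # mod?le fields-replacements.xsl
print(for_display(name, "utf-8"))  # modŠle fields-replacements.xsl
```

The substitution is lossy, so it is only suitable for display, not for the names written to disk or sent back to the server.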

> The filename conversion to UTF-8 and back must not be lossy. Because
> to uniquely identify a file the client needs to send the same UTF-8 byte
> sequence it got from the server back to the server. And it needs to keep
> doing so for backwards compatibility. This is biting us on Mac OS X by the
> way, because some characters have multiple representations in UTF-8,
> see http://subversion.tigris.org/issues/show_bug.cgi?id=2464

This problem is due to the fact that Subversion doesn't enforce a
canonical representation (either NFC or NFD).
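
The two forms can be shown with Python's standard `unicodedata` module; "modèle" is just an assumed example name.

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "mod\u00e8le")  # è as a single code point
nfd = unicodedata.normalize("NFD", "mod\u00e8le")  # e followed by a combining grave

print(nfc.encode("utf-8"))  # b'mod\xc3\xa8le'
print(nfd.encode("utf-8"))  # b'mode\xcc\x80le'
print(nfc == nfd)           # False: same text for a reader, different bytes

# Normalizing both sides to one form makes them comparable again.
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
```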

Anyway there would still be problems with case-insensitive filesystems
for instance.

> > [*] There are several ways to do that, such as:
> > 1. Use a charset specified by the user in the svn config file.
>
> That provides no advantage over checking the current locale.

The advantage is that the user doesn't need to remember to use a UTF-8
based locale for the checkout. This would also allow the user to do
checkout by portable POSIX scripts (i.e. with LC_ALL=POSIX).

> > 2. Use the current locale.
>
> That's what's being done. But we're not writing the information down in the
> working copy meta data, and doing so is quite pointless as described above.

It's not pointless, or at least, something else needs to be done.
Currently "svn up" fails to work, and that's a problem.

Paul Ebermann

Aug 11, 2010, 5:11:25 AM
to us...@subversion.apache.org
Vincent Lefevre wrote:
> On 2010-08-10 20:59:00 +0200, Stefan Sperling wrote:
>
>> The native encoding is determined by the locale, but that does not matter.
>> The point is that, wherever encoding configuration happens to come from,
>> if the configured encoding cannot represent the character string stored
>> as UTF-8 in the repository, what is Subversion supposed to do? It cannot
>> really do anything with a filename it cannot represent in the character
>> set configured by the user, other than throwing an error.
>
> For filenames stored on disk, they (all of them) can be encoded using
> UTF-8. Remember, filenames on a POSIX system are just sequences of
> bytes. For what is output to the terminal, non-representable
> characters can be displayed by a replacement character such as "?".
> This can still be better than an error.

The thing is, users are using other tools than SVN to work with the files, too.

So if I look at my directory with a file manager, I want my filenames to be readable (and
renameable). The idea is that the user usually uses the same locale for all tools on a given
working copy, so all filenames look the same.

Of course this is not possible if some characters in the repository simply do not exist
in the current locale's charset. Maybe Subversion (or APR) could support (optionally, via a
command-line flag and/or config option) some reversible replacement encoding, like
"B@E4@lle.txt" for "Bälle.txt" if the current charset does not contain "ä". (And
files named this way would be allowed for add/import etc., resulting in "Bälle.txt" in the
repository.)

(This @-encoding is only an example, not really a proposal - this could conflict with
other uses of @.)
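
To make the idea concrete, here is a Python sketch of such a reversible escape, assuming the example file name is "Bälle.txt" (ä is U+00E4, matching the E4 in the escaped form). Both function names are made up, and as said above a literal "@" in a real file name would collide with this scheme.

```python
import re


def escape_name(name, charset):
    """Replace characters the charset lacks with @HEX@ (hex code point)."""
    out = []
    for ch in name:
        try:
            ch.encode(charset)
            out.append(ch)
        except UnicodeEncodeError:
            out.append("@%X@" % ord(ch))
    return "".join(out)


def unescape_name(escaped):
    """Turn @HEX@ sequences back into the characters they stand for."""
    return re.sub(r"@([0-9A-F]+)@", lambda m: chr(int(m.group(1), 16)), escaped)


name = "B\u00e4lle.txt"
escaped = escape_name(name, "ascii")
print(escaped)                         # B@E4@lle.txt
print(unescape_name(escaped) == name)  # True
```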


Paul (who uses a UTF-8-locale personally)

Vincent Lefevre

Aug 11, 2010, 6:35:59 AM
to us...@subversion.apache.org
On 2010-08-11 11:11:25 +0200, Paul Ebermann wrote:
> The thing is, users are using other tools than SVN to work with the
> files, too.
>
> So if I look at my directory with a file manager, I want my
> filenames to be readable (and renameable). The idea is that usually
> the user uses for one working copy always the same locale for all
> tools, so all filenames look same.

Yes, and this is another reason why the solution chosen by Subversion
doesn't work well. For instance, GNOME always uses UTF-8 for filename
encoding. So, if the user uses ISO-8859-* locales (for shell sessions),
one gets inconsistencies.

Stefan Sperling

Aug 11, 2010, 7:42:35 AM
to us...@subversion.apache.org
On Wed, Aug 11, 2010 at 12:31:48AM +0200, Vincent Lefevre wrote:
> On 2010-08-10 20:59:00 +0200, Stefan Sperling wrote:
> > Right now, if the filename cannot be represented in the current locale,
> > you get this error: "svn: Can't convert string from 'UTF-8' to native encoding"
>
> which is bad and prevents users from writing POSIX-conforming scripts
> using svn, i.e. under the POSIX locale (except on systems where the
> POSIX locale uses UTF-8, but I don't know any).

There's no reason your script could not configure a UTF-8 locale if that
is needed to represent filenames which exist in the repository.

> For filenames stored on disk, they (all of them) can be encoded using
> UTF-8. Remember, filenames on a POSIX system are just sequences of
> bytes. For what is output to the terminal, non-representable
> characters can be displayed by a replacement character such as "?".
> This can still be better than an error.

Throwing an error is a straightforward way of solving the problem.

We agree on the point that Subversion should use a single character
set for all filenames in the same working copy.
Because how should Subversion behave if some filenames convert fine to
the current character set, and some do not? E.g. what if my encoding
configuration setting is en_US.ISO8859-1? Should Subversion use ISO8859-1
for some filenames, and UTF-8 for those which cannot be represented in
ISO8859-1? That gets really confusing.

It seems that this conversation leads to the question of why Subversion
even bothers with checking the locale at all. It might as well always
create filenames in UTF-8, and leave the user with apparently mangled
filenames if they don't use a UTF-8 locale.

But that isn't a solution either, because now you have lots of
non-UTF-8 users complaining that Subversion cannot represent their
filenames properly, where previously it worked fine.

> > see http://subversion.tigris.org/issues/show_bug.cgi?id=2464
>
> This problem is due to the fact that Subversion doesn't enforce a
> canonical representation (either NFC or NFD).

Yes. I just brought it up because it is related indirectly to this
discussion.

> > > 2. Use the current locale.
> >
> > That's what's being done. But we're not writing the information down in the
> > working copy meta data, and doing so is quite pointless as described above.
>
> It's not pointless, or at least, something else needs to be done.
> Currently "svn up" fails to work, and that's a problem.

It doesn't fail if locales are used consistently.
If locales aren't configured consistently, that's a user error.
That's the best we can do.

I don't think this problem is specific to Subversion.
Other tools also suffer from the fact that POSIX doesn't specify a
standard for defining filename encodings. Maybe we can find a good
solution by looking around at how other tools handle this.
However, I'd expect many will just assume that the user wants filenames
to be encoded according to the current locale.
If everybody follows this convention, there is no problem, apart from
user errors during locale configuration.

Stefan

Stefan Sperling

Aug 11, 2010, 7:51:18 AM
to us...@subversion.apache.org
On Wed, Aug 11, 2010 at 12:35:59PM +0200, Vincent Lefevre wrote:
> On 2010-08-11 11:11:25 +0200, Paul Ebermann wrote:
> > The thing is, users are using other tools than SVN to work with the
> > files, too.
> >
> > So if I look at my directory with a file manager, I want my
> > filenames to be readable (and renameable). The idea is that usually
> > the user uses for one working copy always the same locale for all
> > tools, so all filenames look same.
>
> Yes, and this is another reason why the solution chosen by Subversion
> doesn't work well. For instance, GNOME always uses UTF-8 for filename
> encoding.

You might as well argue that Subversion's solution works well but
GNOME's solution does not.

Because there is no standard, it's perfectly fine for tools to use
different conventions for this. As it is, users need to be aware of
these problems and configure their tools to use the right encodings.

> So, if the user uses ISO-8859-* locales (for shell sessions),
> one gets inconsistencies.

So don't use GNOME if you don't want your filenames encoded in UTF-8,
or don't use a non-UTF-8 locale when working with files you want to
use from GNOME. Problem solved.

Stefan

Alexander Skwar

Aug 11, 2010, 10:20:38 AM
to us...@subversion.apache.org
Hi.

2010/8/11 Vincent Lefevre <vince...@vinc17.net>

> Yes, and this is another reason why the solution chosen by Subversion
> doesn't work well. For instance, GNOME always uses UTF-8 for filename
> encoding. So, if the user uses ISO-8859-* locales (for shell sessions),
> one gets inconsistencies.

Just curious - why should a user do that (on purpose) in the first
place, if he has to deal with filenames which are UTF-8 encoded?

Alexander
--
↯    Lifestream (Twitter, Blog, …) ↣ http://alexs77.soup.io/     ↯
↯ Chat (Jabber/Google Talk) ↣ a.s...@gmail.com , AIM: alexws77  ↯

Vincent Lefevre

Aug 11, 2010, 10:26:32 AM
to us...@subversion.apache.org
On 2010-08-11 13:42:35 +0200, Stefan Sperling wrote:
> On Wed, Aug 11, 2010 at 12:31:48AM +0200, Vincent Lefevre wrote:
> > On 2010-08-10 20:59:00 +0200, Stefan Sperling wrote:
> > > Right now, if the filename cannot be represented in the current locale,
> > > you get this error: "svn: Can't convert string from 'UTF-8' to native encoding"
> >
> > which is bad and prevents users from writing POSIX-conforming scripts
> > using svn, i.e. under the POSIX locale (except on systems where the
> > POSIX locale uses UTF-8, but I don't know any).
>
> There's no reason your script could not configure a UTF-8 locale if that
> is needed to represent filenames which exist in the repository.

Configuring a UTF-8 locale can yield non-portable behavior.
There's a good reason why various scripts do a "LC_ALL=C".

Moreover there's no portable way to select a UTF-8 locale.

And the POSIX API doesn't need a UTF-8 locale to handle filenames
with top-bit-set bytes.
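
This can be checked from Python by passing bytes paths, which skip locale-based decoding altogether. The sketch below assumes a Linux-style filesystem that stores names as opaque bytes; on macOS or Windows the name that is not valid UTF-8 may be rejected. Both file names are made-up examples.

```python
import os
import tempfile

workdir = os.fsencode(tempfile.mkdtemp())

latin1_name = b"mod\xe8le.xsl"    # top-bit-set byte, not valid UTF-8
utf8_name = b"mod\xc3\xa8le.xsl"  # the same name encoded as UTF-8

# Create both files using raw byte names; no locale is consulted.
for raw in (latin1_name, utf8_name):
    with open(os.path.join(workdir, raw), "wb") as f:
        f.write(b"foo\n")

# Listing with a bytes argument returns the names exactly as stored.
print(sorted(os.listdir(workdir)))
```

Both names coexist in the same directory as distinct files, whatever LC_ALL is set to.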

> We agree on the point that Subversion should use a single character
> set for all filenames in the same working copy.
> Because how should Subversion behave if some filenames convert fine to
> the current character set, and some do not? E.g. what if my encoding
> configuration setting is en_US.ISO8859-1? Should Subversion use ISO8859-1
> for some filenames, and UTF-8 for those which cannot be represented in
> ISO8859-1? That gets really confusing.
>
> It seems that this conversation leads to the question of why Subversion
> even bothers with checking the locale at all. It might as well always
> create filenames in UTF-8, and leave the user with apparently mangled
> filenames if they don't use a UTF-8 locale.
>
> But that isn't a solution either, because now you have lots of
> non-UTF-8 users complaining that Subversion cannot represent their
> filenames properly, where previously it worked fine.

That's why I suggested the encoding to be configurable.

> > It's not pointless, or at least, something else needs to be done.
> > Currently "svn up" fails to work, and that's a problem.
>
> It doesn't fail if locales are used consistently.

It fails even if locales are used consistently.

> I don't think this problem is specific to Subversion.

I haven't seen such problems with other tools.

> Other tools also suffer from the fact that POSIX doesn't specify a
> standard for defining filename encodings. Maybe we can find a good
> solution by looking around at how other tools handle this.

Most tools just ignore the encoding of filenames.

> However, I'd expect many will just assume that the user wants filenames
> to be encoded according to the current locale.
> If everybody follows this convention, there is no problem, apart from
> user errors during locale configuration.

You're asking the user, and even all users on the system where
the files are shared, to stick with a single locale. This is not
acceptable, this is contrary to POSIX requirements, and is also
a problem for SSH (where the user needs to use the same charset
on both sides). Under these conditions, the only possibility is
to encode the filenames in UTF-8 anyway. So why not enforce
that?

Vincent Lefevre

Aug 11, 2010, 10:29:56 AM
to us...@subversion.apache.org
On 2010-08-11 13:51:18 +0200, Stefan Sperling wrote:
> On Wed, Aug 11, 2010 at 12:35:59PM +0200, Vincent Lefevre wrote:
> > On 2010-08-11 11:11:25 +0200, Paul Ebermann wrote:
> > > The thing is, users are using other tools than SVN to work with the
> > > files, too.
> > >
> > > So if I look at my directory with a file manager, I want my
> > > filenames to be readable (and renameable). The idea is that usually
> > > the user uses for one working copy always the same locale for all
> > > tools, so all filenames look same.
> >
> > Yes, and this is another reason why the solution chosen by Subversion
> > doesn't work well. For instance, GNOME always uses UTF-8 for filename
> > encoding.
>
> You might as well argue that Subversion's solution works well but
> GNOME's solution does not.

That's wrong. GNOME lets me use any locale in shell sessions.
Subversion doesn't.

> > So, if the user uses ISO-8859-* locales (for shell sessions),
> > one gets inconsistencies.
>
> So don't use GNOME if you don't want your filenames encoded in UTF-8,

I meant: one gets inconsistencies between GNOME and Subversion.

> or don't use a non-UTF-8 locale when working with files you want to
> use from GNOME. Problem solved.

You're forcing the user to use a UTF-8 locale. Unacceptable.

Vincent Lefevre

Aug 11, 2010, 10:34:38 AM
to us...@subversion.apache.org
On 2010-08-11 16:20:38 +0200, Alexander Skwar wrote:
> 2010/8/11 Vincent Lefevre <vince...@vinc17.net>
> > Yes, and this is another reason why the solution chosen by Subversion
> > doesn't work well. For instance, GNOME always uses UTF-8 for filename
> > encoding. So, if the user uses ISO-8859-* locales (for shell sessions),
> > one gets inconsistencies.
> >
> Just curious - why should a user do that (on purpose) in the first
> place, if he has to deal with filenames which are UTF-8 encoded?

The user may need to do that (terminal limitations, remote
connection...), while he (the user) does *not* deal with
filenames with non-ASCII characters (such as in "svn up").

Michael Pruemm

Aug 11, 2010, 10:49:46 AM
to us...@subversion.apache.org
Vincent Lefevre wrote:
>> However, I'd expect many will just assume that the user wants filenames
>> to be encoded according to the current locale.
>> If everybody follows this convention, there is no problem, apart from
>> user errors during locale configuration.
>
> You're asking the user, and even all users on the system where
> the files are shared, to stick with a single locale. This is not
> acceptable, this is contrary to POSIX requirements, and is also
> a problem for SSH (where the user needs to use the same charset
> on both sides). Under these conditions, the only possibility is
> to encode the filenames in UTF-8 anyway. So why not enforce
> that?
>

But don't forget that different platforms may use different UTF-8
encodings for the same filename. Mac OS X encodes accented characters in
filenames in a different way than Linux.

- Michael

Vincent Lefevre

Aug 11, 2010, 11:05:32 AM
to us...@subversion.apache.org
On 2010-08-11 16:49:46 +0200, Michael Pruemm wrote:
> But don't forget that different platforms may use different UTF-8
> encodings for the same filename. Mac OS X encodes accented
> characters in filenames in a different way than Linux.

Yes, but that's another problem, for which I think that the only
solution would be that Subversion use filenames only with ASCII
characters (e.g. with transliteration). This has drawbacks, though.

Note also that Linux allows NFD, which means that if such a file
is created, trying to reuse the filename with a copy-paste fails!
Anyway this is not a Subversion-related problem (except if
Subversion doesn't convert NFD to NFC for files committed from
Mac OS X).

Vincent Lefevre

Aug 11, 2010, 11:23:31 AM
to us...@subversion.apache.org
On 2010-08-11 16:26:32 +0200, Vincent Lefevre wrote:
> On 2010-08-11 13:42:35 +0200, Stefan Sperling wrote:
> > On Wed, Aug 11, 2010 at 12:31:48AM +0200, Vincent Lefevre wrote:
> > > On 2010-08-10 20:59:00 +0200, Stefan Sperling wrote:
> > > > Right now, if the filename cannot be represented in the current locale,
> > > > you get this error: "svn: Can't convert string from 'UTF-8' to native encoding"
> > >
> > > which is bad and prevents users from writing POSIX-conforming scripts
> > > using svn, i.e. under the POSIX locale (except on systems where the
> > > POSIX locale uses UTF-8, but I don't know any).
> >
> > There's no reason your script could not configure a UTF-8 locale if that
> > is needed to represent filenames which exist in the repository.
>
> Configuring a UTF-8 locale can yield non-portable behavior.
> There's a good reason why various scripts do a "LC_ALL=C".
>
> Moreover there's no portable way to select a UTF-8 locale.

Actually, some UTF-8 locales may not give the expected behavior,
even with svn. For instance, one may want to parse the output of
"svn info", and using fr_FR.UTF-8 (if installed) gives localized
output (in French), which would make the parsing fail. And I think
that "Do not use svn for scripts, reimplement everything using
bindings." would not be a wise answer.

Paul Ebermann

Aug 11, 2010, 11:34:19 AM
to us...@subversion.apache.org
Vincent Lefevre wrote:
> On 2010-08-11 13:51:18 +0200, Stefan Sperling wrote:
>> On Wed, Aug 11, 2010 at 12:35:59PM +0200, Vincent Lefevre wrote:

>>> Yes, and this is another reason why the solution chosen by Subversion
>>> doesn't work well. For instance, GNOME always uses UTF-8 for filename
>>> encoding.
>> You might as well argue that Subversion's solution works well but
>> GNOME's solution does not.
>
> That's wrong. GNOME lets me use any locale in shell sessions.
> Subversion doesn't.

Yes, but GNOME does not allow using any locale in a file manager session (or, it ignores
the locale in the filemanager session, while the command line tools do not).

(KDE is similar, in my experiments.)

>>> So, if the user uses ISO-8859-* locales (for shell sessions),
>>> one gets inconsistencies.
>> So don't use GNOME if you don't want your filenames encoded in UTF-8,
>
> I meant: one gets inconsistencies between GNOME and Subversion.
>
>> or don't use a non-UTF-8 locale when working with files you want to
>> use from GNOME. Problem solved.
>
> You're forcing the user to use a UTF-8 locale. Unacceptable.

No, GNOME forces the user to use a UTF-8 locale, if ls, rm, cd and other command line
tools are to show/accept the same name as the GNOME file manager.

Subversion here simply behaves as any other Unix command line tool, it seems, with the
additional gotcha that it does not only use input from and output to terminal and file
system, but also from/to the repository.


Paul

Stefan Sperling

Aug 11, 2010, 1:55:01 PM
to us...@subversion.apache.org
On Wed, Aug 11, 2010 at 05:23:31PM +0200, Vincent Lefevre wrote:
> On 2010-08-11 16:26:32 +0200, Vincent Lefevre wrote:
> > Configuring a UTF-8 locale can yield non-portable behavior.

Such as?

> > There's a good reason why various scripts do a "LC_ALL=C".

Then those scripts are written for projects which use ASCII filenames.

> > Moreover there's no portable way to select a UTF-8 locale.

Then your script will have to deal with the intricacies of supporting
several platforms when selecting the UTF-8 locale. For most platforms,
"export LC_CTYPE=en_US.UTF-8" should work fine.

> Actually, some UTF-8 locales may not give the expected behavior,
> even with svn. For instance, one may want to parse the output of
> "svn info", and using fr_FR.UTF-8 (if installed) gives localized
> output (in French), which would make the parsing fail.

Well, you can use en_US.UTF-8 to force the output to English.

> And I think
> that "Do not use svn for scripts, reimplement everything using
> bindings." would not be a wise answer.

Scripts written against bindings won't break when we change the
command line output (which happens every once in a while).

Stefan
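
A minimal sketch of the platform-specific selection discussed in this message, assuming that "locale -a" lists the installed locales and that UTF-8 locale names end in ".UTF-8" or ".utf8" (both are common conventions, not guarantees). The helper name pick_utf8_locale and the sample list are made up for illustration. Only LC_CTYPE is chosen here, so the message language is left alone, provided LC_ALL is not set (LC_ALL overrides LC_CTYPE).

```shell
#!/bin/sh
# Hypothetical helper: read locale names on stdin (one per line) and
# print the first one whose charset suffix looks like UTF-8.
pick_utf8_locale() {
    grep -i -E '\.utf-?8$' | head -n 1
}

# Sample list for illustration only; a real script would feed in the
# output of: locale -a
sample_locales='C
POSIX
en_US
en_US.iso88591
fr_FR.utf8
en_US.UTF-8'

chosen=$(printf '%s\n' "$sample_locales" | pick_utf8_locale)

if [ -n "$chosen" ]; then
    # Print the command instead of running it, so the sketch has no
    # side effects on a working copy.
    echo "would run: LC_CTYPE=$chosen svn update"
else
    echo "no UTF-8 locale found" >&2
fi
```

If no UTF-8 locale is installed, the helper prints nothing, and the script has to fall back to something else; that is the portability gap raised earlier in the thread.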

Stefan Sperling

Aug 11, 2010, 1:56:28 PM
to us...@subversion.apache.org
On Wed, Aug 11, 2010 at 04:29:56PM +0200, Vincent Lefevre wrote:
> You're forcing the user to use a UTF-8 locale. Unacceptable.

No, we leave users a choice.
I consider your idea of forcing UTF-8 filenames on everybody unacceptable.

Stefan

Vincent Lefevre

Aug 11, 2010, 8:19:03 PM
to us...@subversion.apache.org
On 2010-08-11 19:55:01 +0200, Stefan Sperling wrote:
> On Wed, Aug 11, 2010 at 05:23:31PM +0200, Vincent Lefevre wrote:
> > On 2010-08-11 16:26:32 +0200, Vincent Lefevre wrote:
> > > Configuring a UTF-8 locale can yield non-portable behavior.
>
> Such as?

Outputting messages in a different language. Or any other
non-portable behavior. Who knows...

> > > There's a good reason why various scripts do a "LC_ALL=C".
>
> Then those scripts are written for projects which use ASCII filenames.

That's unspecified.

> > > Moreover there's no portable way to select a UTF-8 locale.
>
> Then your script will have to deal with the intricacies of supporting
> several platforms when selecting the UTF-8 locale. For most platforms,
> "export LC_CTYPE=en_US.UTF-8" should work fine.

Not all. Thus this is bad.

> Well, you can use en_US.UTF-8 to force the output to English.

Wrong. There's no guarantee that it works. And too often it doesn't.

> > And I think that "Do not use svn for scripts, reimplement
> > everything using bindings." would not be a wise answer.
>
> Scripts written against bindings won't break when we change the
> command line output (which happens every once in a while).

It is certainly a better solution, but requires too much work
in practice.

Vincent Lefevre

Aug 11, 2010, 8:26:28 PM
to us...@subversion.apache.org
On 2010-08-11 17:34:19 +0200, Paul Ebermann wrote:

> Vincent Lefevre wrote:
> > That's wrong. GNOME lets me use any locale in shell sessions.
> > Subversion doesn't.
>
> Yes, but GNOME does not allow using any locale in a file manager
> session (or, it ignores the locale in the filemanager session, while
> the command line tools do not).

The main point is that this is transparent to the user.

> > You're forcing the user to use a UTF-8 locale. Unacceptable.
>
> No, GNOME forces the user to use a UTF-8 locale, if ls, rm, cd and
> other command line tools are to show/accept the same name as the
> GNOME file manager.

That's just a minor display problem (not even always a problem).
But ls, rm and cd work fine under non-UTF-8 locales on any file.

> Subversion here simply behaves as any other Unix command line tool,

No, if the user uses a different locale, commands like "svn up" can
fail. There's no such problem with other Unix tools.

Vincent Lefevre

Aug 11, 2010, 8:30:50 PM
to us...@subversion.apache.org
On 2010-08-11 19:56:28 +0200, Stefan Sperling wrote:
> On Wed, Aug 11, 2010 at 04:29:56PM +0200, Vincent Lefevre wrote:
> > You're forcing the user to use a UTF-8 locale. Unacceptable.
>
> No, we leave users a choice.

The choice doesn't work.

> I consider your idea of forcing UTF-8 filenames on everybody unacceptable.

No, this is not *my* idea. Please read the thread again. I proposed to
leave the choice by storing the charset chosen by the user in the .svn
directory (or by whatever means future WC formats provide). But you
didn't want this solution.

Csaba Raduly

Aug 12, 2010, 3:59:30 AM
to Michael Pruemm, us...@subversion.apache.org
On Wed, Aug 11, 2010 at 4:49 PM, Michael Pruemm wrote:
> Vincent Lefevre wrote:
(snip)

>> Under these conditions, the only possibility is
>> to encode the filenames in UTF-8 anyway. So why not enforce
>> that?
>>
>
> But don't forget that different platforms may use different UTF-8 encodings
> for the same filename.

Huh? There's only one UTF-8 encoding for each Unicode code point. Are
you thinking of code pages?


--
Life is complex, with real and imaginary parts.
"Ok, it boots. Which means it must be bug-free and perfect. " -- Linus Torvalds
"People disagree with me. I just ignore them." -- Linus Torvalds

Olivier Sannier

Aug 12, 2010, 4:12:37 AM
to Csaba Raduly, Michael Pruemm, us...@subversion.apache.org
Csaba Raduly wrote:
> On Wed, Aug 11, 2010 at 4:49 PM, Michael Pruemm wrote:
>
>> Vincent Lefevre wrote:
>>
> (snip)
>
>>> Under these conditions, the only possibility is
>>> to encode the filenames in UTF-8 anyway. So why not enforce
>>> that?
>>>
>>>
>> But don't forget that different platforms may use different UTF-8 encodings
>> for the same filename.
>>
> Huh? There's only one UTF-8 encoding for each Unicode code point. Are
> you thinking of code pages?
>
I believe he is thinking of the way an accented character can be
encoded: either its direct codepoint or diacritic + letter

This makes more bytes, but in the end, it's all valid UTF-8

Vincent Lefevre

Aug 12, 2010, 4:14:58 AM
to us...@subversion.apache.org
On 2010-08-12 09:59:30 +0200, Csaba Raduly wrote:
> On Wed, Aug 11, 2010 at 4:49 PM, Michael Pruemm wrote:
> > Vincent Lefevre wrote:
> (snip)
> >> Under these conditions, the only possibility is
> >> to encode the filenames in UTF-8 anyway. So why not enforce
> >> that?
> >>
> >
> > But don't forget that different platforms may use different UTF-8 encodings
> > for the same filename.
>
> Huh? There's only one UTF-8 encoding for each Unicode code point. Are
> you thinking of code pages?

Michael means that there are several ways to represent the "same"
string (from a semantic point of view). The two canonical normalized
representations are NFC (precomposed) and NFD (decomposed). While Linux
does not try to normalize filenames (they are just viewed as a sequence
of bytes[*]), Mac OS X (at least with HFS+) requires that filenames be
valid UTF-8 strings (even in non-UTF-8 locales) and normalizes them to
NFD before storing them on disk.

[*] The locale doesn't matter, and top-bit-set bytes are allowed and
can be handled even in ASCII-based locales.
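
To make the byte-level difference concrete, here is a small sketch (the helper name hex_bytes is made up) that builds "é" in both forms with octal escapes and prints the resulting UTF-8 bytes. To a filesystem that compares names byte by byte, as described above for Linux, these are two different filenames.

```shell
#!/bin/sh
# Hypothetical helper: print the bytes of a string as lowercase hex,
# without separators.
hex_bytes() {
    printf '%s' "$1" | od -A n -t x1 | tr -d ' \n'
}

# "é" in NFC: the single code point U+00E9, UTF-8 bytes C3 A9.
nfc=$(printf '\303\251')
# "é" in NFD: "e" followed by U+0301 COMBINING ACUTE ACCENT (CC 81).
nfd=$(printf 'e\314\201')

echo "NFC: $(hex_bytes "$nfc")"   # prints "NFC: c3a9"
echo "NFD: $(hex_bytes "$nfd")"   # prints "NFD: 65cc81"

if [ "$nfc" = "$nfd" ]; then
    echo "same bytes"
else
    echo "different bytes, same visible character"
fi
```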

Michael Pruemm

Aug 12, 2010, 4:16:25 AM
to Olivier Sannier, Csaba Raduly, us...@subversion.apache.org

Yes, exactly. The encoding differs, but the filename is the same.

- Michael

Stefan Sperling

Aug 12, 2010, 11:16:37 AM
to us...@subversion.apache.org
On Thu, Aug 12, 2010 at 02:30:50AM +0200, Vincent Lefevre wrote:
> On 2010-08-11 19:56:28 +0200, Stefan Sperling wrote:
> > On Wed, Aug 11, 2010 at 04:29:56PM +0200, Vincent Lefevre wrote:
> > > You're forcing the user to use a UTF-8 locale. Unacceptable.
> >
> > No, we leave users a choice.
>
> The choice doesn't work.

It doesn't work for your twisted requirement of having a program
running in the C locale use UTF-8. Otherwise it works fine.



> > I consider your idea of forcing UTF-8 filenames on everybody unacceptable.
>
> No, this is not *my* idea. Please read the thread again. I proposed to
> leave the choice by storing the charset chosen by the user in the .svn
> directory (or by whatever means future WC formats provide). But you
> didn't want this solution.

It's not a solution because storing the information in wc meta-data
does not fix anything. What is Subversion supposed to do with filenames
that come down from the repository during an update, but cannot be
represented in the recorded charset? Using UTF-8 instead of the recorded
charset is out of the question. That leaves erroring out as the only
option. So how is that different from the current situation?

There is a clearly defined way of letting Subversion know which
charset to use. Set the locale. If you have to, set Subversion's locale
to something other than the locale your terminal uses.

~/bin/mysvn:
#!/bin/sh
env LC_CTYPE="en_US.<preferred charset>" svn update

Stefan

Vincent Lefevre

Aug 12, 2010, 6:27:30 PM
to us...@subversion.apache.org
On 2010-08-12 17:16:37 +0200, Stefan Sperling wrote:
> On Thu, Aug 12, 2010 at 02:30:50AM +0200, Vincent Lefevre wrote:
> > On 2010-08-11 19:56:28 +0200, Stefan Sperling wrote:
> > > On Wed, Aug 11, 2010 at 04:29:56PM +0200, Vincent Lefevre wrote:
> > > > You're forcing the user to use a UTF-8 locale. Unacceptable.
> > >
> > > No, we leave users a choice.
> >
> > The choice doesn't work.
>
> It doesn't work for your twisted requirement of having a program
> running in the C locale use UTF-8. Otherwise it works fine.

Many programs are able to use UTF-8 in a C locale, such as programs
that handle XML or any file format based on UTF-8.

Also, if Subversion needs a UTF-8 locale to handle UTF-8 data, why
wouldn't there be an option to do that?

> > No, this is not *my* idea. Please read the thread again. I proposed to
> > leave the choice by storing the charset chosen by the user in the .svn
> > directory (or by whatever means future WC formats provide). But you
> > didn't want this solution.
>
> It's not a solution because storing the information in wc meta-data
> does not fix anything. What is Subversion supposed to do with filenames
> that come down from the repository during an update, but cannot be
> represented in the recorded charset?

Throw an error. But if the user has chosen UTF-8 as the recorded
charset, this won't happen.

> Using UTF-8 instead of the recorded charset is out of the question.
> That leaves erroring out as the only option. So how is that
> different from the current situation?

With the current situation, a "svn up" run under the non-UTF-8 locale
will either fail or corrupt the working copy if some filenames have
non-ASCII characters. With UTF-8 recorded as the encoding for
filenames, "svn up" will always work. See the difference?

> There is a clearly defined way of letting Subversion know which
> charset to use. Set the locale. If you have to, set Subversion's
> locale to something other than the locale your terminal uses.
>
> ~/bin/mysvn:
> #!/bin/sh
> env LC_CTYPE="en_US.<preferred charset>" svn update

Wrong, wrong, wrong! Security hole!

Alexander Skwar

Aug 13, 2010, 2:16:48 AM
to us...@subversion.apache.org
Hi.

2010/8/13 Vincent Lefevre <vince...@vinc17.net>


>
> On 2010-08-12 17:16:37 +0200, Stefan Sperling wrote:

> > ~/bin/mysvn:
> >  #!/bin/sh
> >  env LC_CTYPE="en_US.<preferred charset>" svn update


>
> Wrong, wrong, wrong! Security hole!

No, not wrong, but totally correct - especially, if you need to parse the
output in some script, you (kinda) MUST set the locale to some value,
that you know. Leaving it in an undefined state is no good at all.

Vincent Lefevre

Aug 13, 2010, 3:37:57 AM
to us...@subversion.apache.org
On 2010-08-13 08:16:48 +0200, Alexander Skwar wrote:
> 2010/8/13 Vincent Lefevre <vince...@vinc17.net>
> >
> > On 2010-08-12 17:16:37 +0200, Stefan Sperling wrote:
>
> > > ~/bin/mysvn:
> > >  #!/bin/sh
> > >  env LC_CTYPE="en_US.<preferred charset>" svn update
> >
> > Wrong, wrong, wrong! Security hole!
>
> No, not wrong, but totally correct - especially, if you need to parse the
> output in some script, you (kinda) MUST set the locale to some value,
> that you know. Leaving it in an undefined state is no good at all.

No, it is wrong because the above script may send non-printable
characters to the terminal, such as control sequences. Such
control sequences can wreck the terminal and depending on its
configuration, send the contents to a printer.

Alexander Skwar

Aug 13, 2010, 3:47:37 AM
to us...@subversion.apache.org
Hi.

2010/8/13 Vincent Lefevre <vince...@vinc17.net>:
> On 2010-08-13 08:16:48 +0200, Alexander Skwar wrote:
>> 2010/8/13 Vincent Lefevre <vince...@vinc17.net>
>> >
>> > On 2010-08-12 17:16:37 +0200, Stefan Sperling wrote:
>>
>> > > ~/bin/mysvn:
>> > >  #!/bin/sh
>> > >  env LC_CTYPE="en_US.<preferred charset>" svn update
>> >
>> > Wrong, wrong, wrong! Security hole!
>>
>> No, not wrong, but totally correct - especially, if you need to parse the
>> output in some script, you (kinda) MUST set the locale to some value,
>> that you know. Leaving it in an undefined state is no good at all.
>
> No, it is wrong because the above script may send non-printable
> characters to the terminal, such as control sequences. Such
> control sequences can wreck the terminal and depending on its
> configuration, send the contents to a printer.

Well, if you want or need to parse the output of a program,
you'll need to make sure that it's in the "correct" locale. The
way to do that, is by setting the locale variables to the expected
values. Thus, it's totally correct to set LC_CTYPE to some
predefined value. Omitting this is just plain wrong. You're suggesting
that this should be omitted.

Stefan Sperling

Aug 13, 2010, 5:18:00 AM
to us...@subversion.apache.org
On Fri, Aug 13, 2010 at 09:37:57AM +0200, Vincent Lefevre wrote:
> On 2010-08-13 08:16:48 +0200, Alexander Skwar wrote:
> > 2010/8/13 Vincent Lefevre <vince...@vinc17.net>
> > >
> > > On 2010-08-12 17:16:37 +0200, Stefan Sperling wrote:
> >
> > > > ~/bin/mysvn:
> > > >  #!/bin/sh
> > > >  env LC_CTYPE="en_US.<preferred charset>" svn update
> > >
> > > Wrong, wrong, wrong! Security hole!

... in your terminal.

> > No, not wrong, but totally correct - especially, if you need to parse the
> > output in some script, you (kinda) MUST set the locale to some value,
> > that you know. Leaving it in an undefined state is no good at all.
>
> No, it is wrong because the above script may send non-printable
> characters to the terminal, such as control sequences. Such
> control sequences can wreck the terminal and depending on its
> configuration, send the contents to a printer.

Use a terminal that does not have security holes.
This has nothing to do with Subversion whatsoever.

Stefan Sperling

Aug 13, 2010, 5:39:43 AM
to us...@subversion.apache.org
On Fri, Aug 13, 2010 at 12:27:30AM +0200, Vincent Lefevre wrote:
> On 2010-08-12 17:16:37 +0200, Stefan Sperling wrote:
> > Using UTF-8 instead of the recorded charset is out of the question.
> > That leaves erroring out as the only option. So how is that
> > different from the current situation?
>
> With the current situation, a "svn up" run under the non-UTF-8 locale
> will either fail or corrupt the working copy if some filenames have
> non-ASCII characters. With UTF-8 recorded as the encoding for
> filenames, "svn up" will always work. See the difference?

Yes, I see the difference. It's a question of where the primary
configuration knob for the charset is located.

Right now, the source of charset information is always the locale.

You want it to be the locale at checkout time and some pre-recorded
value at update time (not sure what your idea is about all the other
subcommands).

But I don't think having two sources for this information is a good idea.
So I think Subversion should continue trusting users with setting their locale.
It's simple. It works.

But that's *my* opinion. We do lazy-consensus based development.
So talk to other people in the community if you like, and if you can convince
a majority of them, we can do it your way.

Stefan

Vincent Lefevre

Aug 13, 2010, 7:23:29 AM
to us...@subversion.apache.org
On 2010-08-13 09:47:37 +0200, Alexander Skwar wrote:
> Well, if you want or need to parse the output of a program,
> you'll need to make sure that it's in the "correct" locale. The
> way to do that, is by setting the locale variables to the expected
> values. Thus, it's totally correct to set LC_CTYPE to some
> predefined value. Omitting this is just plain wrong. You're suggesting
> that this should be omitted.

No, I'm not suggesting that it should be omitted. The environment
variables should be set *only* if the output is captured. So, it must
be done only in the script that will call svn, not by defining some
general svn wrapper.

Moreover, for portability reasons, only C or POSIX is guaranteed
to work. In particular, "en_US.<preferred charset>" (which is what has
been suggested) is incorrect on Maemo 4, where only en_US exists. But
svn will not use UTF-8 with the C/POSIX locale.

So, really, svn should provide an option that will select a UTF-8
locale.

Vincent Lefevre

Aug 13, 2010, 7:37:51 AM
to us...@subversion.apache.org
On 2010-08-13 11:18:00 +0200, Stefan Sperling wrote:
> On Fri, Aug 13, 2010 at 09:37:57AM +0200, Vincent Lefevre wrote:
> > On 2010-08-13 08:16:48 +0200, Alexander Skwar wrote:
> > > 2010/8/13 Vincent Lefevre <vince...@vinc17.net>
> > > >
> > > > On 2010-08-12 17:16:37 +0200, Stefan Sperling wrote:
> > >
> > > > > ~/bin/mysvn:
> > > > >  #!/bin/sh
> > > > >  env LC_CTYPE="en_US.<preferred charset>" svn update
> > > >
> > > > Wrong, wrong, wrong! Security hole!
>
> ... in your terminal.

Stop saying nonsense. What you proposed is wrong. The locale must
match the settings of the terminal. Period.

> > No, it is wrong because the above script may send non-printable
> > characters to the terminal, such as control sequences. Such
> > control sequences can wreck the terminal and depending on its
> > configuration, send the contents to a printer.
>
> Use a terminal that does not have security holes.

In general, one doesn't know that there is a security hole until it
is discovered. In the case of xterm, this has eventually been fixed
(after I reported the bug). But filtering non-printable characters
before sending them to the terminal should be done by the application.
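
As long as the application does not filter its own output, a calling script can approximate the filtering. The following is only a sketch: the function name strip_controls is made up, and this is not something svn itself provides. It deletes the C0 control bytes and DEL while keeping tab and newline. Bytes 0x80-0x9F are left untouched because they occur inside UTF-8 sequences, so 8-bit C1 control codes are not handled.

```shell
#!/bin/sh
# Hypothetical filter: remove C0 control bytes (except tab and newline)
# and DEL from stdin. LC_ALL=C makes tr operate on single bytes.
strip_controls() {
    LC_ALL=C tr -d '\000-\010\013-\037\177'
}

# Demo input: an ESC-based colour sequence wrapped around ordinary text.
demo=$(printf 'A\033[31mB\tC\n' | strip_controls)

# The ESC byte is gone, so the terminal sees harmless literal text.
printf '%s\n' "$demo"
```

A script would use it as "svn status | strip_controls" when the output goes to a terminal; output that is captured for parsing does not need it.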

Vincent Lefevre

Aug 13, 2010, 7:45:56 AM
to us...@subversion.apache.org
On 2010-08-13 11:39:43 +0200, Stefan Sperling wrote:
> Yes, I see the difference. It's a question of where the primary
> configuration knob for the charset is located.
>
> Right now, the source of charset information is always the locale.
>
> You want it to be the locale at checkout time and some pre-recorded
> value at update time (not sure what your idea is about all the other
> subcommands).
>
> But I don't think having two sources for this information is a good idea.

In any standard Unix system, the locale is process-dependent.
So, anyway, if you consider the locale, the source can change
from one call to another. So, considering the locale only is
definitely not a good idea.

> So I think Subversion should continue trusting users with setting
> their locale. It's simple. It works.

No, it doesn't work. If the locale changes (which is allowed on
any standard Unix system), "svn up" breaks the working copy.
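
The point that the locale is a per-process setting can be seen with a short sketch (the helper name report_ctype is made up). Each child process receives its own LC_CTYPE through its environment, so two consecutive commands run on the same working copy may operate under different charsets:

```shell
#!/bin/sh
# Hypothetical helper: start a child process with the given LC_CTYPE
# and let it report the value it received. The locale does not need to
# be installed, because the child only echoes the variable.
report_ctype() {
    env LC_CTYPE="$1" sh -c 'printf "%s\n" "$LC_CTYPE"'
}

first=$(report_ctype en_US.ISO8859-1)
second=$(report_ctype en_US.UTF-8)

echo "first call sees:  $first"
echo "second call sees: $second"
```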
