svnadmin: Path '....' is not in UTF-8

Torsten Krah

May 30, 2011, 4:27:10 PM

to us...@subversion.apache.org

I want to load a repository with a fresh dump and did:

svnadmin -q dump /repo1 | svnadmin load /repo2

This is the error I get:

svnadmin: Path 'Projektprofile/EMS(Newsletter,
Infomails, ?\192?\166).doc' is not in UTF-8

How can I fix this error? I am unable to load the dump into a new
repository.
What is causing this, and are there any known workarounds?

svn version: svn, version 1.6.12 (r955767)

The source repository is an fsfs repository from 1.5, and the destination
is an fsfs repository created with the 1.6 version above.

regards

Torsten

Torsten Krah

May 30, 2011, 4:51:39 PM

to us...@subversion.apache.org

Some more information about this problem:

svnadmin verify tells me the revision in question is OK in the source
repo.
Using vim to view the revision dump shows these 2 utf-8 chars at the end
of the path, which I guess are causing the trouble:

Projektprofile/EMS(Newsletter, Infomails, À¦).doc

Maybe someone got some nice ideas ;)

Daniel Shahaf

May 30, 2011, 5:30:54 PM

to Torsten Krah, us...@subversion.apache.org

1.6 checks that paths are in UTF-8 at the time they enter the
repository. This was always required but not always enforced.

Solution is to recode the pathnames (those that are neither in ASCII nor
in UTF-8). If none of the third-party dump manipulation tools can do
that, then you could patch svnsync or one of those tools to do the
recoding. (just inject a filename-recoding editor at the right place)

Torsten Krah wrote on Mon, May 30, 2011 at 22:51:39 +0200:
> Some more infos about those problem:
>
> svnadmin verify tells me the revision in question is ok in the source
> repo.
> Using vim to view the revision dump show those 2 utf-8 chars at the end

It doesn't show "two UTF-8 characters", since the filename contains two
bytes which do not form a valid UTF-8 sequence.

> of the path which i guess are making trouble:
>

> Projektprofile/EMS(Newsletter, Infomails, À¦).doc
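
The point about the two bytes can be checked with a minimal Python sketch. It assumes the bytes are 0xC0 0xA6, reading the `?\192?\166` escapes in the error message as decimal byte values; the helper name `is_valid_utf8` is made up for illustration.

```python
# Assumption: \192 and \166 in the svnadmin message are decimal byte values.
suspect = bytes([192, 166])  # 0xC0 0xA6


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as strict UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(suspect))             # False
print(is_valid_utf8("å".encode("utf-8")))  # True

# A viewer that falls back to Latin-1 shows the same bytes as two characters.
as_latin1 = suspect.decode("latin-1")
```

The bytes are rejected because 0xC0 never starts a valid UTF-8 sequence. The two characters a viewer displays come from interpreting each byte on its own in a single-byte encoding.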

Torsten Krah

May 30, 2011, 5:47:30 PM

to us...@subversion.apache.org, Daniel Shahaf

On Tuesday, 31.05.2011, at 00:30 +0300, Daniel Shahaf wrote:
> 1.6 checks that paths are in UTF-8 at the time they enter the
> repository. This was always required but not always enforced.

OK, so 1.6 does things that <1.6 did not do but should have.

>
> Solution is to recode the pathnames (those that are neither in ASCII
> nor
> in UTF-8).

Sorry, but your "solution" seems a little bit odd to me.
If <1.6 did not enforce this and 1.6 does, why does 1.6 not recode it
when it encounters such "things", at least via some optional
command line option?

Do you really want to tell me that Subversion (the "tool" used to manage
my code) is not able to load its own "dump", without at least providing
some "fix" tool itself if it did things not "right" before? Why should I
need to bother with "third-party" tools here? This should be done by
svn, shouldn't it?

> If none of the third-party dump manipulation tools can do
> that,

Which "third-party" tools do you have in mind that can do that for me?

> then you could patch svnsync or one of those tools to do the
> recoding. (just inject a filename-recoding editor at the right place)

Of course I'll take the source, patch it and get my repo working
again ... nice joke - it was a joke, right?

>
> It doesn't show "two UTF-8 characters", since the filename contains
> two
> bytes which do not form a valid UTF-8 sequence.

You're right, my fault.

Stefan Sperling

May 30, 2011, 5:54:17 PM

to Daniel Shahaf, Torsten Krah, us...@subversion.apache.org

On Tue, May 31, 2011 at 12:30:54AM +0300, Daniel Shahaf wrote:
> 1.6 checks that paths are in UTF-8 at the time they enter the
> repository. This was always required but not always enforced.
>
> Solution is to recode the pathnames (those that are neither in ASCII nor
> in UTF-8).

Yes, that's what needs to be done. Pathnames must be encoded in UTF-8.
Unfortunately it seems that this invalid pathname somehow entered the
repository when a server version was used that didn't enforce UTF-8
encoding.

I would try to edit the dump file with a hexeditor and replace the
offending two bytes with two spaces (or the proper UTF-8 character
if you know what should be there and the UTF-8 sequence has the same
number of bytes). I hope the number of paths affected by this problem
is small enough to keep this solution practical.
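
For anyone who would rather script the hex-editor step, here is a rough Python sketch of the same idea. It is not a Subversion tool. It assumes paths in the dump appear on header lines starting with `Node-path: ` or `Node-copyfrom-path: `, and it keeps the same-number-of-bytes rule from above. The function name `patch_node_paths` is made up for illustration. A line of file content that happens to start with one of those prefixes would be changed too, so check the result before loading it.

```python
def patch_node_paths(dump: bytes, bad: bytes, good: bytes) -> bytes:
    """Replace `bad` with `good`, but only on path header lines of a dump."""
    if len(bad) != len(good):
        raise ValueError("replacement must have the same number of bytes")
    headers = (b"Node-path: ", b"Node-copyfrom-path: ")
    out = []
    # Splitting and re-joining on b"\n" reproduces every other byte unchanged.
    for line in dump.split(b"\n"):
        if line.startswith(headers):
            line = line.replace(bad, good)
        out.append(line)
    return b"\n".join(out)


# Tiny stand-in for a real dump file (a real one would be read in binary mode).
sample = (
    b"Node-path: Projektprofile/EMS(Newsletter, Infomails, \xc0\xa6).doc\n"
    b"Node-kind: file\n"
)
fixed = patch_node_paths(sample, b"\xc0\xa6", b"  ")
fixed.decode("utf-8")  # succeeds now that the two invalid bytes are spaces
```

Work on a copy of the dump written to a file (rather than piping dump straight into load) so the original stays available if the patched version turns out wrong.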

> If none of the third-party dump manipulation tools can do that,

... then we should provide our users with a way of fixing it,
as we did for e.g. badly encoded revision properties.

> then you could patch svnsync or one of those tools to do the
> recoding. (just inject a filename-recoding editor at the right place)

Daniel, please keep in mind that this is the *users* list.
Maybe Torsten would like to try this, but I doubt that modifying
Subversion's code is the kind of advice he was looking for.
And I really don't think that this suggestion is something that people
who are not familiar with Subversion's code base should attempt to do.
If people modify the code without understanding it well, the chances of
unintentionally breaking things are way too high.

It's bad enough that Torsten has to edit the dump file to fix this.

Torsten Krah

May 30, 2011, 6:07:42 PM

to us...@subversion.apache.org, Stefan Sperling

> I would try to edit the dump file with a hexeditor and replace the
> offending two bytes with two spaces (or the proper UTF-8 character
> if you know what should be there and the UTF-8 sequence has the same
> number of bytes).

OK, let's take a hex editor and get rid of those bad sequences.

> I hope the number of paths affected by this problem
> is small enough to keep this solution practical.

I hope so too - let's see how to split my dump to get the hexedit
solution running.

>
> > If none of the third-party dump manipulation tools can do that,
>
> ... then we should provide our users with a way of fixing it,
> as we did for e.g. badly encoded revision properties.

That would be a "nice-to-have" feature :).

> Daniel, please keep in mind that this is the *users* list.

Yes.

> Maybe Torsten would like to try this, but I doubt that modifying
> Subversion's code is the kind of advice he was looking for.

You're right ;-)

> And I really don't think that this suggestion is something that people
> who are not familiar with Subversion's code base should attempt to do.
> If people modify the code without understand it well the chances of
> unintentionally breaking things are way too high.
>

I can try, but as you said, I am not familiar with the code base and
I'll bet things would be worse after my modifications ;-).

> It's bad enough that Torsten has to edit the dump file to fix this.

But I will take this "red pill" to see where the journey ends :-D

Stefan Sperling

May 30, 2011, 6:08:46 PM

to Torsten Krah, us...@subversion.apache.org, Daniel Shahaf

On Mon, May 30, 2011 at 11:47:30PM +0200, Torsten Krah wrote:
> If <1.6 did not enforce this and 1.6 does - why does 1.6 not recode it
> at the time it does encounter such "things" - at least via some optional
> command line option?

I think that is something we should add, yes.
We should also make svnadmin verify complain if paths are not in UTF-8.
That is two issues to file into our tracker right there.

Note that svnsync already has this kind of feature to handle badly
encoded revision properties.

> Do you really want to tell me that subversion (the "tool" used to manage
> my code) is not able to load its own "dump", at least by providing some
> "fix" tool by itself if it did things not "right" before - why should i
> need or bother with "third-party" tools here - this should be done by
> svn, shouldn't it?

Note that the API documentation for Subversion has always been saying
that paths are expected to be in UTF-8. It's just that the code didn't
enforce it. What probably happened here is that some third-party client
was used to add this file originally, and this third party client did
not convert the pathname to UTF-8 before sending it to the repository.
The standard svn client has been converting paths to UTF-8 since before 1.0.

Of course, that does not excuse the Subversion server's behaviour.
It should have verified the input and rejected the commit as invalid.
Alas, this verification step was only added in 1.6.

> > If none of the third-party dump manipulation tools can do
> > that,
>
> Which "third-party" tools you have in mind are able to do that for me?

A nice one is svndumptool: http://svn.borg.ch/svndumptool/
But it doesn't look like it has the feature you need.

I suppose the kind of corruption problem you are having is very rare.
If this was a common problem there would already be tool support for fixing it.

Daniel Shahaf

May 30, 2011, 6:09:31 PM

to Torsten Krah, us...@subversion.apache.org

Torsten Krah wrote on Mon, May 30, 2011 at 23:47:30 +0200:
> On Tuesday, 31.05.2011, at 00:30 +0300, Daniel Shahaf wrote:
> > Solution is to recode the pathnames (those that are neither in ASCII
> > nor in UTF-8).
>
> Sorry but your "solution" seems really a little bit odd to me.
> If <1.6 did not enforce this and 1.6 does - why does 1.6 not recode it
> at the time it does encounter such "things" - at least via some optional
> command line option?
>
> Do you really want to tell me that subversion (the "tool" used to manage
> my code) is not able to load its own "dump", at least by providing some
> "fix" tool by itself if it did things not "right" before - why should i
> need or bother with "third-party" tools here - this should be done by
> svn, shouldn't it?
>

As Stefan said, it would be nice if Subversion itself could fix that,
given that old released versions produced such (malformed) filesystems.

To my knowledge, currently there is no code in Subversion itself to do
this, hence my suggestion to use third-party tools.

> > If none of the third-party dump manipulation tools can do
> > that,
>
> Which "third-party" tools you have in mind are able to do that for me?
>

I know there are a couple of dumpfile manipulator tools that are
regularly suggested around this list, but I don't have a specific
recommendation.

One of the other list members might be able to answer this question.

Daniel Shahaf

May 30, 2011, 6:26:16 PM

to us...@subversion.apache.org, Torsten Krah

Torsten Krah wrote on Mon, May 30, 2011 at 23:47:30 +0200:

> On Tuesday, 31.05.2011, at 00:30 +0300, Daniel Shahaf wrote:
> > then you could patch svnsync or one of those tools to do the
> > recoding. (just inject a filename-recoding editor at the right place)
>
> Of cause i'll take the source, patch it and get my repo working
> again ... nice joke - it was a joke right?

That's how I'd solve the problem.

But then, I'm not a tech support person but a Subversion committer who
is already familiar with FSFS and dumpstream format.

Stefan Sperling wrote on Mon, May 30, 2011 at 23:54:17 +0200:
> On Tue, May 31, 2011 at 12:30:54AM +0300, Daniel Shahaf wrote:

> > then you could patch svnsync or one of those tools to do the
> > recoding. (just inject a filename-recoding editor at the right place)
>

> Daniel, please keep in mind that this is the *users* list.
> Maybe Torsten would like to try this, but I doubt that modifying
> Subversion's code is the kind of advice he was looking for.
> And I really don't think that this suggestion is something that people
> who are not familiar with Subversion's code base should attempt to do.
> If people modify the code without understand it well the chances of
> unintentionally breaking things are way too high.
>

I'm usually very right-winged on telling people "Don't edit anything
under $REPOS/db/ unless you can score A+ in an oral test on 'structure'
at 3am."

To the case at hand:

* There is probably a tool that allows performing the needed conversion.

* If there isn't, I think writing a "recode fspaths" filter against our APIs
isn't terribly hard (perhaps with some pointers on which APIs to start
with). What do you refer to by "breaking things"?

> It's bad enough that Torsten has to edit the dump file to fix this.

In the general case, I expect there are repositories out there that
contain fspaths in multiple encodings: say, UTF-8 and latin1 (and
possibly ISO-8859-15 too) in the same filesystem. That's going to be a
mess to fix no matter what tools you use.

Daniel Shahaf

May 30, 2011, 6:36:10 PM

to Torsten Krah, us...@subversion.apache.org, Stefan Sperling

> > Maybe Torsten would like to try this, but I doubt that modifying
> > Subversion's code is the kind of advice he was looking for.
>
> You're right ;-)

I was assuming that someone would point out a tool that does the
recoding at some point in the next 24 hours, which would render this
particular suggestion moot.

Daniel Shahaf

May 30, 2011, 6:41:54 PM

to Torsten Krah, us...@subversion.apache.org

Stefan Sperling wrote on Tue, May 31, 2011 at 00:08:46 +0200:
> On Mon, May 30, 2011 at 11:47:30PM +0200, Torsten Krah wrote:
> > If <1.6 did not enforce this and 1.6 does - why does 1.6 not recode it
> > at the time it does encounter such "things" - at least via some optional
> > command line option?
>
> I think that is something we should add, yes.

How would you handle a repository that contains the following
nodes/fspaths:

/foo/bår (in UTF-8)
/foo/bår (in latin1)

?

How would you handle a repository that contains:
/foo/barÉ (in latin1)
/foo/barŠ (in latin2)

?
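
Both questions can be made concrete with a small Python sketch. The byte strings are built from the example names above. Nothing here touches a repository, and the variable names are illustrative only.

```python
# Case 1: the same name stored once as UTF-8 and once as Latin-1.
utf8_name = "bår".encode("utf-8")          # b"b\xc3\xa5r"
latin1_name = "bår".encode("iso-8859-1")   # b"b\xe5r", not valid UTF-8

# Recoding the Latin-1 bytes to UTF-8 produces the bytes of the other node,
# so two distinct paths would collapse into one.
recoded = latin1_name.decode("iso-8859-1").encode("utf-8")
print(recoded == utf8_name)  # True

# Case 2: a Latin-2 name read under the assumption that everything is Latin-1.
latin2_name = "barŠ".encode("iso-8859-2")  # b"bar\xa9"
print(latin2_name.decode("iso-8859-2"))    # barŠ
print(latin2_name.decode("iso-8859-1"))    # bar©
```

In the first case an automatic recode would have to deal with a path collision. In the second it silently picks a wrong but valid character.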

> We should also make svnadmin verify complain if paths are not in UTF-8.

+1.

The validation that 'load' and 'commit' trigger is path_valid() in
fs_loader.c.

Stefan Sperling

May 30, 2011, 7:07:02 PM

to Daniel Shahaf, Torsten Krah, us...@subversion.apache.org

On Tue, May 31, 2011 at 01:41:54AM +0300, Daniel Shahaf wrote:
> How would you handle a repository that contains the following
> nodes/fspaths:
>
> /foo/bår (in UTF-8)
> /foo/bår (in latin1)
>
> ?
>
>
> How would you handle a repository that contains:
> /foo/barÉ (in latin1)
> /foo/barŠ (in latin2)
>
> ?

All the ISO-8859 (latin) encodings are single-byte encodings.
It's not possible to know what the encoding is supposed to be if
paths in different ISO-8859 encodings entered the repository.
They all decode to different but valid strings of characters.

In the first iteration of this feature I would simply assume one
user-specified source encoding and try to convert data that isn't
UTF-8 from the source encoding to UTF-8.
In case multiple single-byte encodings are present this means that some
characters will be wrong but the repository will work again without
manual intervention. In case multiple multi-byte encodings other than
UTF-8 are present this approach can fail and might require manual fixing
(no worse than the current situation).
This could still be improved upon if necessary.
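
A rough Python sketch of that first iteration, only to illustrate the behaviour described here: the function name `to_utf8` and its parameter are hypothetical and do not correspond to any existing Subversion API or option.

```python
def to_utf8(raw_path: bytes, source_encoding: str) -> bytes:
    """Keep valid UTF-8 as is; otherwise recode from one assumed encoding."""
    try:
        raw_path.decode("utf-8")
        return raw_path  # already valid UTF-8, leave untouched
    except UnicodeDecodeError:
        # For a multi-byte source encoding this decode can itself raise
        # UnicodeDecodeError, which is the "requires manual fixing" case.
        return raw_path.decode(source_encoding).encode("utf-8")


already_ok = to_utf8("bår".encode("utf-8"), "iso-8859-1")
recoded = to_utf8(b"b\xe5r", "iso-8859-1")
print(already_ok == recoded)  # True: both are now the UTF-8 bytes of "bår"
```

With a single-byte source encoding such as ISO-8859-1 the fallback always succeeds, since every byte maps to some character. That is why the result can be wrong yet still loadable.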

> > We should also make svnadmin verify complain if paths are not in UTF-8.
>
> +1.
>
> The validation that 'load' and 'commit' trigger is path_valid() in
> fs_loader.c.

Thanks for the hint. I'm now running tests on a patch for this.

Daniel Shahaf

May 30, 2011, 7:16:45 PM

to Torsten Krah, us...@subversion.apache.org

#define MBE multi-byte encoding
#define SBE single-byte encoding

Stefan Sperling wrote on Tue, May 31, 2011 at 01:07:02 +0200:
> On Tue, May 31, 2011 at 01:41:54AM +0300, Daniel Shahaf wrote:
> > How would you handle a repository that contains the following
> > nodes/fspaths:
> >
> > /foo/bår (in UTF-8)
> > /foo/bår (in latin1)
> >
> > ?
> >
> >
> > How would you handle a repository that contains:
> > /foo/barÉ (in latin1)
> > /foo/barŠ (in latin2)
> >
> > ?
>
> All the ISO-8859 (latin) encodings are single-byte encodings.
> It's not possible to know what the encoding is supposed to be if
> paths in different ISO-8859 encodings entered the repository.
> They all decode to different but valid strings of characters.
>
> In the first iteration of this feature I would simply assume one
> user-specified source encoding and try to convert data that isn't
> UTF-8 from the source encoding to UTF-8.
> In case multiple single-byte encodings are present this means that some
> characters will be wrong but the repository will work again without
> manual intervention. In case multiple multi-byte encodings other than
> UTF-8 are present this approach can fail and might require manual fixing
> (no worse than the current situation).
> This could still be improved upon if necessary.

True, I had overlooked these points.

One thing that jumps to mind is to have a list of encodings to
try --- i.e.,

svnadmin load --recode-paths-from=MBE1,MBE2,SBE

would attempt to interpret paths as UTF-8, failing that as MBE1, failing
that as MBE2, failing that as SBE.

(I know you use vim, so: compare the 'fencs' option in vim).
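
The try-in-order idea could look roughly like this in Python. This is a sketch of the proposal only: `--recode-paths-from` is the option suggested above, not an existing one, and `recode_path` is a made-up name.

```python
from typing import Sequence


def recode_path(raw_path: bytes, fallbacks: Sequence[str]) -> bytes:
    """Try UTF-8 first, then each fallback encoding in the order given."""
    for encoding in ("utf-8", *fallbacks):
        try:
            # Valid UTF-8 input round-trips to the identical bytes.
            return raw_path.decode(encoding).encode("utf-8")
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes, try the next
    raise ValueError(f"no listed encoding can decode {raw_path!r}")


# Multi-byte candidate first, single-byte catch-all last.
result = recode_path(b"b\xe5r", ["euc-jp", "iso-8859-1"])
```

Order matters: a single-byte encoding such as ISO-8859-1 accepts any byte sequence, so anything listed after it would never be tried.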

Bert Huijben

May 31, 2011, 7:46:26 AM

to Daniel Shahaf, Torsten Krah, us...@subversion.apache.org

> -----Original Message-----
> From: Daniel Shahaf [mailto:d...@daniel.shahaf.name]
> Sent: Tuesday, 31 May 2011 0:10
> To: Torsten Krah
> Cc: us...@subversion.apache.org
> Subject: Re: svnadmin: Path '....' is not in UTF-8 - svnadmin load fails
>
> Torsten Krah wrote on Mon, May 30, 2011 at 23:47:30 +0200:
> > On Tuesday, 31.05.2011, at 00:30 +0300, Daniel Shahaf wrote:
> > > Solution is to recode the pathnames (those that are neither in ASCII
> > > nor in UTF-8).
> >
> > Sorry but your "solution" seems really a little bit odd to me.
> > If <1.6 did not enforce this and 1.6 does - why does 1.6 not recode it
> > at the time it does encounter such "things" - at least via some optional
> > command line option?
> >
> > Do you really want to tell me that subversion (the "tool" used to manage
> > my code) is not able to load its own "dump", at least by providing some
> > "fix" tool by itself if it did things not "right" before - why should i
> > need or bother with "third-party" tools here - this should be done by
> > svn, shouldn't it?
> >
>
> As Stefan said, it would be nice if Subversion itself could fix that,
> given that old released versions produced such (malformed) filesystems.
>
> To my knowledge, currently there is no code in Subversion itself to do
> this, hence my suggestion to use third-party tools.

The problem is: We just know it isn't utf-8. But that doesn't tell us how to
fix it.

But which encoding does it have, if it isn't utf-8 as expected?

Without being told, Subversion would have to choose from hundreds of
different encodings (iso-8859-1?, etc., etc.), one of which might just be
the one you would like. (Or maybe your filesystem used format 101).

Subversion defines that it must be utf-8, so it can't answer this question
for you.

Bert

Torsten Krah

May 31, 2011, 7:59:39 AM

to us...@subversion.apache.org, Bert Huijben

> Subversion defines that it must be utf-8, so it can't answer this
> question
> for you.

Yes, it can't answer, but it could provide an option, e.g. on the command
line, to specify some "encodings" to try as a fallback when it
encounters path names that are not UTF-8. The "result" may not be what
it once was, but at least the import will succeed, and the user can fix
those "names" later (via rename) if it was not the right encoding for
every path.

Stefan Sperling

May 31, 2011, 8:25:12 AM

to Daniel Shahaf, Torsten Krah, us...@subversion.apache.org

Fixed 'svnadmin verify' in r1129641. Thanks for reporting this, Torsten!

Torsten Krah

Jun 3, 2011, 3:33:53 PM

to Stefan Sperling, Daniel Shahaf, us...@subversion.apache.org

Filed an issue for documentation purposes:

http://subversion.tigris.org/issues/show_bug.cgi?id=3911

Torsten Krah

Jun 3, 2011, 3:34:24 PM

to us...@subversion.apache.org

Made an issue to track this:

http://subversion.tigris.org/issues/show_bug.cgi?id=3912
