svndumpfilter does not exclude some files


Jason Heeris

Oct 14, 2012, 9:57:48 PM
to us...@subversion.apache.org
I am trying to use svndumpfilter to remove specific files and
directories from a repository dump, but I can't seem to remove all of
them.

Some of the files are, for example:

/specs/[01234] product x spec.pdf

There are other files in other top-level directories, but they all
have that "[xxxxx]" prefix.

There are also some directories I'm filtering out such as:

/results/RST-0001 (v0.01) #001

I created a text file to list these excluded path prefixes. I enter
each directory into it verbatim, but for the files I just enter the
prefix up to and including the "[01234]" part (to catch any moves,
renames, etc). So my "excludes.txt" contains something like

/specs/[01234]
...bunch of similar path prefixes
/results/RST-0001 (v0.01) #001
...bunch of similar path prefixes

I use 'svnadmin dump' to dump the repository contents, and then run

svndumpfilter exclude --targets "excludes.txt" < repo-01.dump > repo-02.dump 2> filter.log

Unfortunately, only the directories (the second kind of entries in my
'excludes.txt') seem to get filtered. Looking in 'filter.log', there's
no mention of removing the files starting with "[xxxxx]", but it does
explicitly say that the other directories were removed (although it
doesn't list all the files under them). Loading the resulting dump
file into a repository, the files are certainly still there, but the
directories aren't.

I've also tried putting a "*" on the end of each line and using the
"--pattern" argument, but the files are still not filtered out.

So how do I get svndumpfilter to get rid of these files from my
repository? Each dump/load cycle takes about seven hours, so I'm not
really able to try a lot of trial and error.

All of this is done on Windows Server 2003. Versions of both svnadmin
and svndumpfilter are 1.7.5 (r1336830), and they're the binaries that come
with VisualSVN Server 2.5.5.

Please CC me on replies.

— Jason Heeris

Jason Heeris

Oct 15, 2012, 12:56:40 AM
to us...@subversion.apache.org
Okay, I managed to cheat a bit, so I'm sharing my workaround here. In
my excludes file, I used the form:

/specs*01234*

...and for the directories:

/results/RST-0001 (v0.01) #001*

...and now everything seems to be expunged that should be.

I have no idea why the original form doesn't work, maybe it's the "["
characters or somesuch. Obviously you need to be a bit careful that
your patterns aren't too general; in my case those five-digit numbers
are a unique enough pattern to work for me.

— Jason

Stefan Sperling

Oct 15, 2012, 5:30:45 AM
to Jason Heeris, us...@subversion.apache.org
On Mon, Oct 15, 2012 at 12:56:40PM +0800, Jason Heeris wrote:
> Okay, I managed to cheat a bit, so I'm sharing my workaround here. In
> my excludes file, I used the form:
>
> /specs*01234*
>
> ...and for the directories:
>
> /results/RST-0001 (v0.01) #001*
>
> ...and now everything seems to be expunged that should be.
>
> I have no idea why the original form doesn't work, maybe it's the "["
> characters or somesuch. Obviously you need to be a bit careful that
> your patterns aren't too general; in my case those five-digit numbers
> are a unique enough pattern to work for me.

The square brackets are wildcard syntax saying "match any of the characters
listed within the brackets". This is part of the syntax of the fnmatch()
standard C function, see:
http://pubs.opengroup.org/onlinepubs/009695399/utilities/xcu_chap02.html#tag_02_13_01

The pattern you originally tried to use:

/specs/[01234]

would match any of the following paths:

/specs/0
/specs/1
/specs/2
/specs/3
/specs/4

But not this path, assuming '[01234]' is a literal part of the filename:

/specs/[01234] product x spec.pdf

The pattern you ended up using is a better way of matching those paths:

/specs*01234*

You should be able to quote the square brackets with a backslash to prevent
them from causing wildcard matching. For instance, the following should
match '/specs/[01234] product x spec.pdf':

/specs/\[01234\]*pdf
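
The bracket behaviour described above can be tried out in a few lines of Python. This is only a sketch: Python's fnmatch module is used as a stand-in for the matcher inside svndumpfilter, and the two may differ in details. In particular, Python's fnmatch does not treat a backslash as an escape character, so the sketch uses glob.escape(), which quotes "[" by writing it as "[[]". The variable names are made up for illustration.

```python
from fnmatch import fnmatchcase
import glob

target = "/specs/[01234] product x spec.pdf"

# Brackets form a character class: this pattern matches exactly one
# character out of 0, 1, 2, 3, 4 after "/specs/".
bracket_pattern = "/specs/[01234]"
matches_digit = fnmatchcase("/specs/1", bracket_pattern)
matches_literal = fnmatchcase(target, bracket_pattern)

# The workaround pattern avoids brackets entirely.
matches_workaround = fnmatchcase(target, "/specs*01234*")

# Python-specific quoting (an assumption of this sketch, not svndumpfilter
# syntax): glob.escape() turns "[" into "[[]" so it is matched literally.
escaped_pattern = glob.escape("/specs/[01234]") + "*pdf"
matches_escaped = fnmatchcase(target, escaped_pattern)

print(matches_digit, matches_literal, matches_workaround, matches_escaped)
```

Run as written, this prints "True False True True": the unescaped bracket pattern matches '/specs/1' but not the file whose name literally contains '[01234]'.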

Jason Heeris

Oct 15, 2012, 5:40:42 AM
to us...@subversion.apache.org
On 15 October 2012 17:30, Stefan Sperling <st...@elego.de> wrote:
> The square brackets are wildcard syntax saying "match any of the characters
> listed within the brackets". This is part of the syntax of the fnmatch()
> standard C function

Oh, that makes sense now. That's not the first time I've been bitten
by using square brackets in my filenames, but oh well.

It might be good to document that a bit better though, I don't think
it's in the SVN book or the help text.

Cheers,
— Jason

Stefan Sperling

Oct 15, 2012, 6:18:46 AM
to Jason Heeris, us...@subversion.apache.org
On Mon, Oct 15, 2012 at 05:40:42PM +0800, Jason Heeris wrote:
> On 15 October 2012 17:30, Stefan Sperling <st...@elego.de> wrote:
> > The square brackets are wildcard syntax saying "match any of the characters
> > listed within the brackets". This is part of the syntax of the fnmatch()
> > standard C function
>
> Oh, that makes sense now. That's not the first time I've been bitten
> by using square brackets in my filenames, but oh well.
>

Another aspect of prefix matching (i.e. without --pattern) that may be
unclear is that only entire path components are matched.

That is, if you use this prefix:

/specs/[01234]

A file called '/specs/1 foo bar.pdf' won't match, but any file within
a directory called '/specs/[01234]/' would match, as would the file
literally called '/specs/[01234]'. In all these cases the square brackets
don't carry special meaning because the --pattern option isn't used.
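
The whole-component rule can be sketched in Python as follows. This illustrates the behaviour described above and is not the actual svndumpfilter code; the function name prefix_matches is hypothetical.

```python
def prefix_matches(prefix: str, path: str) -> bool:
    """Return True if `path` is `prefix` itself or lies underneath it.

    The prefix is compared literally (no wildcards) and must end on a
    path-component boundary.
    """
    prefix = prefix.rstrip("/")
    return path == prefix or path.startswith(prefix + "/")


prefix = "/specs/[01234]"
for path in (
    "/specs/1 foo bar.pdf",               # different name: no match
    "/specs/[01234]/draft.pdf",           # inside a directory named [01234]
    "/specs/[01234]",                     # the literal path itself
    "/specs/[01234] product x spec.pdf",  # prefix ends mid-component: no match
):
    print(path, prefix_matches(prefix, path))
```

The last case is the one from the original report: the prefix stops in the middle of the file name, so the file is not treated as being under it.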

> It might be good to document that a bit better though, I don't think
> it's in the SVN book or the help text.

I agree!

Please feel free to make suggestions. You can send us your proposed
changes with relatively little effort.

To enhance the help text, check out a copy of Subversion's trunk
from https://svn.apache.org/repos/asf/subversion/trunk and edit
the appropriate section of the file subversion/svndumpfilter/main.c.
Then run 'svn diff' on the working copy, redirect the output to a file,
and send this file as an attachment to the dev@ list. See for details:
http://subversion.apache.org/docs/community-guide/general.html#patches

If editing C source code is too technical, feel free to simply send an
edited version of the output of 'svndumpfilter help exclude'.
You can redirect the output to a file for editing purposes:
svndumpfilter help exclude > help-exclude.txt
Somebody else can then embed these changes in the C source files.

The SVNbook has a separate web site with instructions for contributors,
see: http://svnbook.org

To get you started, here's a help text snippet I wrote to describe
fnmatch-style pattern matching in the help text of 'svn log' for a
new --search option in Subversion 1.8. Something similar could be done
for 'svndumpfilter help exclude' and 'svndumpfilter help include'.

log: Show the log messages for a set of revision(s) and/or path(s).
usage: 1. log [PATH][@REV]
       2. log URL[@REV] [PATH...]

[... Some help text omitted here ...]

If the --search option is used, log messages are displayed only if the
provided search pattern matches any of the author, date, log message
text (unless --quiet is used), or, if the --verbose option is also
provided, a changed path.
The search pattern may include "glob syntax" wildcards:
    ?      matches any single character
    *      matches a sequence of arbitrary characters
    [abc]  matches any of the characters listed inside the brackets
If multiple --search options are provided, a log message is shown if
it matches any of the provided search patterns. If the --search-and
option is used, that option's argument is combined with the pattern
from the previous --search or --search-and option, and a log message
is shown only if it matches the combined search pattern.
If --limit is used in combination with --search, --limit restricts the
number of log messages searched, rather than restricting the output
to a particular number of matching log messages.

Nico Kadel-Garcia

Oct 15, 2012, 6:53:48 AM
to Jason Heeris, us...@subversion.apache.org
On Mon, Oct 15, 2012 at 5:40 AM, Jason Heeris <jason....@gmail.com> wrote:
> On 15 October 2012 17:30, Stefan Sperling <st...@elego.de> wrote:
>> The square brackets are wildcard syntax saying "match any of the characters
>> listed within the brackets". This is part of the syntax of the fnmatch()
>> standard C function
>
> Oh, that makes sense now. That's not the first time I've been bitten
> by using square brackets in my filenames, but oh well.

So why do you do it? Similar to putting spaces and question marks and
quotation marks in file names, it can cause a lot of scripting
confusion for your hook scripts.

Jason Heeris

Oct 15, 2012, 7:21:52 AM
to Nico Kadel-Garcia, us...@subversion.apache.org
On 15 October 2012 18:53, Nico Kadel-Garcia <nka...@gmail.com> wrote:
> So why do you do it? Similar to putting spaces and question marks and
> quotation marks in file names, it can cause a lot of scripting
> confusion for your hook scripts.

I did it once, because I didn't realise it would cause problems, and
it continues to bite me today.

To be fair - SVN can handle all sorts of non-ASCII characters, spaces,
etc. so I didn't think that punctuation would be problematic. They're
valid on any filesystem in current use, and I write hook scripts in
Python, so string processing doesn't fall over with odd characters.

— Jason

Nico Kadel-Garcia

Oct 15, 2012, 8:05:36 AM
to Jason Heeris, us...@subversion.apache.org
Understandable, but it can really bite the next person who works with
your scripts or material. I've encountered a lot of adventures with
non-7-bit-ASCII character sets over the years, and it leads to me
doing a lot of sanitizing of filenames in hook scripts: usually, I
prefer to simply reject such filenames in the pre-commit script.
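
A filename check of the kind described above might look like the following Python sketch. The set of rejected characters is a hypothetical policy chosen for illustration, and the function names are made up. A real pre-commit hook would still need to obtain the changed paths (for example from 'svnlook changed' on the transaction) and exit non-zero to reject the commit; that part is not shown.

```python
from typing import List, Optional, Tuple

# Hypothetical policy: characters that tend to confuse globbing or
# shell-based scripts (space, ?, *, [, ], double and single quote, backslash).
RISKY_CHARS = set(' ?*[]"\'\\')


def rejection_reason(path: str) -> Optional[str]:
    """Return why `path` violates the policy, or None if it is acceptable."""
    if not path.isascii():
        return "contains non-ASCII characters"
    found = sorted(RISKY_CHARS.intersection(path))
    if found:
        return "contains risky characters: " + " ".join(repr(c) for c in found)
    return None


def rejected_paths(paths: List[str]) -> List[Tuple[str, str]]:
    """Collect (path, reason) pairs for every path that fails the policy."""
    rejected = []
    for path in paths:
        reason = rejection_reason(path)
        if reason is not None:
            rejected.append((path, reason))
    return rejected


for path, reason in rejected_paths([
    "specs/product-x-spec.pdf",
    "specs/[01234] product x spec.pdf",
]):
    print(f"{path}: {reason}")
```

Only the second path is reported, since it contains spaces and square brackets.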

Jason Heeris

Oct 15, 2012, 7:37:28 PM
to Nico Kadel-Garcia, us...@subversion.apache.org
On 15 October 2012 20:05, Nico Kadel-Garcia <nka...@gmail.com> wrote:
> Understandable, but it can really bite the next person who works with
> your scripts or material. I've encountered a lot of adventures with
> non-7-bit-ASCII character sets over the years, and it leads to me
> doing a lot of sanitizing of filenames in hook scripts: usually, I
> prefer to simply reject such filenames in the pre-commit script.

Yeah, I'm learning that :P

However — a small, pedantic part of me (not that small, actually)
still wants to do these things, and just insist that the tool-creators
(which sometimes includes myself) write "better" tools, rather than
insisting that those working with those tools (also includes me)
restrict how they can express information.

Of course, I pay for that attitude occasionally, when a system gets
used and abused in ways I didn't predict.

— Jason