Trivial merge of big text file: Dismal performance, 540x faster if binary.


krueger, Andreas (Andreas Krüger, DV-RATIO)

Jan 13, 2011, 6:49:55 AM
to us...@subversion.apache.org
Hello,

trivial merges of big text files are dismally slow. SVN can do
much better when it handles such merges the way it handles binary
files.

Briefly, I suggest SVN should detect the trivial merge situation
and use the fast binary code path even for text files.

I'd like to open a bug report / improvement suggestion on this.

What do folks think?


Here are the gory details:

This starts with some branch F and a big text file F/b.xml (see end of
message for details on "big"). This file has no SVN properties
whatsoever.

This got copied, with "svn cp", to some new branch T/b.xml.

Then a major overhaul of F/b.xml was checked in.

There had been no change in T/b.xml yet. So merging the overhaul
transaction from F to T is a *trivial* merge. As the result of that
merge, the T/b.xml content should be simply replaced with the content
of the overhauled F/b.xml.

That merge indeed worked as expected. Only it took 55:21 minutes on my
machine. During most of that time, there was very little network or
hard drive activity, but one CPU was kept 100% busy.


I found a way to speed this up considerably, by a factor of 540 in
this particular case, from 55 minutes to 6 seconds: Use binary instead
of text.

Gory details of this:

New F, new F/b.xml, with same content as before.

I lied to SVN and told it F/b.xml isn't a text file but binary
(setting svn:mime-type to application/octet-stream on F/b.xml).

After this, again svn cp to (a new T's) T/b.xml, and again the same
overhaul to F/b.xml .

The whole time, I was careful to not tell SVN there was any connection
to the previous experiment. In particular, no svn cp from the previous
experiment, but fresh checkin from workspace.

Again, the overhaul's merge from F/b.xml to T/b.xml resulted in
replacing the old T/b.xml content with the present F/b.xml content as
expected. Only this time, the merge took a mere six seconds or so
instead of 55.3 minutes: a factor-of-540 speed improvement.

I want to have that speed improvement, without needing to lie to SVN!

Regards,
and thanks to the SVN project members for providing fine software,

Andreas

P.S.:

Numbers, in case someone cares:

The original F/b.xml was 18,291,344 bytes and 456,951 lines.

The output of svn diff after the overhaul contained 676,136 lines,
(and that svn diff took quite a while to complete, which is
understandable and not part of this issue).

The overhauled F/b.xml was 18,311,873 bytes and 688,560 lines.

I saw similar performance problems with various SVN clients. The
times quoted above are from Cygwin's svn command line 1.6.12 on
Windows. The protocol was HTTPS; the server was Apache HTTPD with the
svn module (also 1.6.12).

--
Dr. Andreas Krüger, Senior Developer

Tel. (+49) (211) 280 69-1132
andreas...@hp.com

DV-RATIO NORDWEST GmbH, Habsburgerstraße 12, 40547 Düsseldorf, Germany
 
für
 
Hewlett-Packard GmbH, Herrenberger Str. 140, 71034 Böblingen, www.hp.com/de
Geschäftsführer: Volker Smid (Vorsitzender), Michael Eberhardt, Thorsten Herrmann,
Martin Kinne, Heiko Meyer, Ernst Reichart, Rainer Sterk
Vorsitzender des Aufsichtsrates: Jörg Menno Harms
Sitz der Gesellschaft: Böblingen, Amtsgericht Stuttgart HRB 244081, WEEE-Reg.-Nr. DE 30409072

Stefan Sperling

Jan 13, 2011, 7:44:20 AM
to krueger, Andreas (Andreas Krüger, DV-RATIO), us...@subversion.apache.org
On Thu, Jan 13, 2011 at 11:49:55AM +0000, krueger, Andreas (Andreas Krüger, DV-RATIO) wrote:
> Numbers, in case someone cares:
>
> The original F/b.xml was 18,291,344 byte and 456,951 lines.
>
> The output of svn diff after the overhaul contained 676,136 lines,
> (and that svn diff took quite a while to complete, which is
> understandable and not part of this issue).
>
> The overhauled F/b.xml was 18,311,873 byte and 688,560 lines.
>
> I had similar performance problem experiences with various SVN
> clients. The times quoted above were Cygwin's svn command line 1.6.12
> on Windows. Protocol used was HTTPS, server Apache HTTPD with svn
> module (also 1.6.12).

Can you share both versions of this file, privately if needed?
I'd like to reproduce.

Thanks,
Stefan

Johan Corveleyn

Jan 13, 2011, 7:55:58 AM
to krueger, Andreas (Andreas Krüger, DV-RATIO), us...@subversion.apache.org

Hi Andreas,

This is interesting, because it just so happens that I've been working
on a feature branch in svn (on and off for the past half year) with
performance improvements for the diff algorithm, especially for big
files (I have also been using a "big" xml file for testing, of around
60,000 lines).

Textual merging in svn makes use of a variant of the standard diff
algorithm, namely diff3. Just a couple of days ago, I finally
succeeded in making diff3 take advantage of those performance
improvements (haven't committed this to the branch yet, but maybe I'll
get to it tonight).

Would you be able to build an svn client from source? If so, could you
perhaps build a client from
http://svn.apache.org/repos/asf/subversion/branches/diff-optimizations-bytes
?

This already contains the performance improvement for regular 'svn
diff', so you could test if that makes any difference. If you wait
until I've committed the changes to diff3, you could perhaps see the
impact on the merge you're trying to do.

[note: this performance improvement is currently not included in the
svn trunk, so it's not currently on track to be included in 1.7.
However, I think it's still an option (depends on some more work on
the branch, and then possibly review, some tweaks, ... if the other
devs agree with this change)]

[note2: don't expect this perf improvement to bring it down to 6
seconds but it might still make a big difference (it works very well
if both files are quite similar, and the changes are close together in
the file (a lot of identical prefix and suffix)). Judging from your
description though, there is a big difference between both versions of
the file (of 200,000+ lines).]
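For illustration, the prefix/suffix scanning idea can be sketched like this (a simplified Python model, not the actual C implementation on the branch; all names are made up):

```python
def trim_common_ends(a, b):
    """Strip the identical prefix and suffix from two line lists so the
    expensive diff algorithm only has to look at the middle part."""
    n = min(len(a), len(b))
    prefix = 0
    while prefix < n and a[prefix] == b[prefix]:
        prefix += 1
    suffix = 0
    while suffix < n - prefix and a[-1 - suffix] == b[-1 - suffix]:
        suffix += 1
    return prefix, a[prefix:len(a) - suffix], b[prefix:len(b) - suffix], suffix

old = ["<root>", "  <item>1</item>", "  <item>2</item>", "</root>"]
new = ["<root>", "  <item>1</item>", "  <item>99</item>", "</root>"]
prefix, mid_old, mid_new, suffix = trim_common_ends(old, new)
# Only the single differing middle line is left for the expensive
# diff machinery; the common prefix and suffix are skipped in O(n).
```

When the two files share long identical head and tail regions, this reduces the input to the quadratic-ish diff core dramatically; when every line differs (as in the Perl-generated files later in this thread), it buys nothing.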

Cheers,
--
Johan

Stefan Sperling

Jan 13, 2011, 8:07:34 AM
to Johan Corveleyn, krueger, Andreas (Andreas Krüger, DV-RATIO), us...@subversion.apache.org
On Thu, Jan 13, 2011 at 01:55:58PM +0100, Johan Corveleyn wrote:
> Textual merging in svn makes use of a variant of the standard diff
> algorithm, namely diff3. Just a couple of days ago, I finally
> succeeded in making diff3 take advantage of those performance
> improvements (haven't committed this to the branch yet, but maybe I'll
> get to it tonight).
>
> Would you be able to build an svn client from source? If so, could you
> perhaps build a client from
> http://svn.apache.org/repos/asf/subversion/branches/diff-optimizations-bytes
> ?

Hey Johan,

I would be interested in doing testing and reviewing the changes
on your branch. There might still be enough time to get them into 1.7.

I don't have any suitably large XML files though.
If you and/or Andreas could provide some that would be great.

Thanks,
Stefan

Johan Corveleyn

Jan 13, 2011, 9:25:35 AM
to krueger, Andreas (Andreas Krüger, DV-RATIO), us...@subversion.apache.org
On Thu, Jan 13, 2011 at 2:07 PM, Stefan Sperling <st...@elego.de> wrote:
> On Thu, Jan 13, 2011 at 01:55:58PM +0100, Johan Corveleyn wrote:
>> Textual merging in svn makes use of a variant of the standard diff
>> algorithm, namely diff3. Just a couple of days ago, I finally
>> succeeded in making diff3 take advantage of those performance
>> improvements (haven't committed this to the branch yet, but maybe I'll
>> get to it tonight).
>>
>> Would you be able to build an svn client from source? If so, could you
>> perhaps build a client from
>> http://svn.apache.org/repos/asf/subversion/branches/diff-optimizations-bytes
>> ?
>
> Hey Johan,
>
> I would be interested in doing testing and reviewing the changes
> on your branch. There might still be enough time to get them into 1.7.

Thanks, that would be great (btw, danielsh also expressed an interest
in reviewing the branch). I will try to give a status update on the
dev-list after I've committed the changes for diff3.

> I don't have any suitably large XML files though.
> If you and/or Andreas could provide some that would be great.

I was thinking of writing a python script (as philip already
suggested) that can generate several variants of large files with
semi-random data. I have some prototype code for this lying around, so
if I find the time, I'll try to wrap this up and send it to the dev
list. OTOH, real-world examples are probably even better.

Cheers,
--
Johan

krueger, Andreas (Andreas Krüger, DV-RATIO)

Jan 13, 2011, 10:07:50 AM
to us...@subversion.apache.org
Hello, Stefan,

> Can you share both versions of this file ... ?

The bad news: No, sorry.

The good news: This is a generic problem. So it is quite easy to
come up with example files.

To be more specific than that, the following Perl script
generates two files (which aren't XML, but just lines with
numbers) that trigger the same SVN behavior. Binary merge: 3.6
seconds. Text merge: Longer than it takes to clean up the Perl
script, running it again to check it still works, and typing this
text to accompany it, finding I goofed, and sending the corrected
message again. :-).

Regards, Andreas


#!/usr/bin/perl -w

# Generate stupid files on stdout.

use strict;

# For the overhauled file, set to 1:
my $overhaul = 0;

my $number = 1;

for (1 .. 1000000) {

    # 1073741824 and 910111213 have no common divisor,
    # so this will take a while before it repeats.
    $number = ($number + 910111213) % 1073741824;

    my $printme;
    if ($overhaul) {
        $printme = ($number % 4 != 0 ? $number * 13 % 1073741824 : $number);
    } else {
        $printme = $number;
    }
    print $printme, "\n";
}




Tony Sweeney

Jan 13, 2011, 10:58:39 AM
to Johan Corveleyn, krueger, Andreas (Andreas Krüger, DV-RATIO), us...@subversion.apache.org
Why bother with a script? Just wget a few high traffic websites (slashdot, yahoo, dailykos, google news) or similar into a file every now and again.

Tony.

> -----Original Message-----
> From: Johan Corveleyn [mailto:jco...@gmail.com]
> Sent: 13 January 2011 14:26
> To: krueger, Andreas (Andreas Krüger, DV-RATIO);
> us...@subversion.apache.org
> Subject: Re: Trival merge of big text file: Dismal
> performance, 540x faster if binary.
>


Daniel Shahaf

Jan 13, 2011, 11:28:44 AM
to us...@subversion.apache.org, Johan Corveleyn, krueger, Andreas (Andreas Krüger, DV-RATIO)

How about taking periodic dumps of some large repository? I count on
propchanges to give the "small change in the middle of the file" effect.

Another option:

for i in 0 1 2 3 4 5 6 7 8 9; do
    cat $REPOS/db/revs/*/*$i
done | tar -cf- > "`date`"

> Thanks,
> Stefan

Daniel Shahaf

Jan 13, 2011, 11:32:56 AM
to us...@subversion.apache.org, Johan Corveleyn, krueger, Andreas (Andreas Krüger, DV-RATIO)

Without the tar.

>
> > Thanks,
> > Stefan

krueger, Andreas (Andreas Krüger, DV-RATIO)

Jan 14, 2011, 9:53:19 AM
to us...@subversion.apache.org, Johan Corveleyn
Hello, Johan and all,

first, for the record, here is another comparison between
binary and text merge performance, this time with the files
generated by my script (repeated below):

Binary merge took 3.56 seconds; text merge took 3:45:45 (h:mm:ss).
In this particular case, binary merge performance was 3805 times
faster than text merge performance.



> Textual merging in svn makes use of a variant of the standard diff
> algorithm, namely diff3. Just a couple of days ago, I finally
> succeeded in making diff3 take advantage of ... performance
> improvements ... .

Good news! Excellent! Thank you!

But... does this relate to my problem?

The improved diff3 will give a nice performance improvement in the
*general* case.

I certainly want that improvement!


Another nice performance improvement, of a factor of several hundred
(or thousand), could be obtained for big files in the *trivial* case,
if SVN didn't run diff3 at all but simply copied the result.

I also want this other improvement!


Finally:

SVN already contains the intelligence needed to find out whether a
merge is trivial or not: in the binary case, the trivial merges are
precisely the ones that SVN knows how to do.


Johan (or whoever else), please kindly enlighten me, should I be
missing something!

Regards, Andreas

Johan Corveleyn

Jan 14, 2011, 5:52:10 PM
to krueger, Andreas (Andreas Krüger, DV-RATIO), us...@subversion.apache.org

Ok, after rereading this thread, I'm starting to understand what you
mean: why would "merge" run an expensive diff algorithm when it can be
100% sure that it can simply copy the contents from the source to the
target (because the target file has not had any changes since it was
branched)?

I think it's a good suggestion, but I can't really comment more on
(the feasibility of) it, because I'm not that familiar with that part
of the codebase. I've only concentrated on the diff algorithm itself
(and how it's used by "svn diff" and "svn merge" (for text files)).
Maybe someone else can chime in to comment on that?

Of course, if there were any change to the target file, it wouldn't
be a trivial merge (copy contents) anymore, so you would immediately
fall back to the very expensive case. But I agree that's hardly a
reason not to take the shortcut when you can...
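As a sketch, the proposed shortcut amounts to one extra comparison in front of the expensive path (simplified Python model; the function names are made up, and the real merge code works on files, properties and conflict markers, not plain line lists):

```python
def diff3(base, mine, theirs):
    # Stand-in for the real (expensive) three-way text merge.
    raise NotImplementedError("full diff3 merge goes here")

def merge_file(mine, older, yours):
    """If the merge target ('mine') is identical to the pre-change side
    ('older'), the three-way merge result is just 'yours': a plain
    copy, no diff3 needed. Otherwise fall through to the real merge."""
    if mine == older:
        return list(yours)            # trivial merge: copy contents
    return diff3(older, mine, yours)  # expensive general case

# Target unchanged since branching, so the result is simply 'yours':
merged = merge_file(["a", "b"], ["a", "b"], ["a", "x", "b"])
```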

A couple more thoughts on the diff-side of the story:
- Your perl script presents an extremely hard case for the diff
algorithm, because:
* Files A and B differ in each and every one of the 1000000 lines (so
my prefix/suffix scanning optimization will not help at all).
* All lines in each file are also unique inside the file (this makes
it more expensive).
* Most lines have the same length as other lines (which also makes it
more expensive).
* The lines are very short (the diff algorithm is mainly proportional
with the number of lines).

These things are very atypical when you compare with real-world
examples. Usually there is some identical part at the beginning and/or
the end, and lines vary a lot in length. If there is a significant
portion of identical lines at the beginning and/or the end, the
optimizations in the diff-optimizations-bytes branch will help a lot.

- Interestingly, GNU diff calculates the diff between these files in 7
seconds on my machine. But if I give it the option '--minimal', it
also runs for hours (started it 2 hours ago; it's still running).

- Can you try the merge on your original example (big.xml) with the
option -x-b (ignore changes in amount of whitespace)? Just to know if
it would make a difference. In my tests this made diff *much* faster,
so I'm guessing the same for merge. Of course, this depends entirely
on the example data (won't help a bit for the perl-generated files
(will be slowed down even more)).

Cheers,
--
Johan

Daniel Shahaf

Jan 14, 2011, 7:33:38 PM
to Johan Corveleyn, krueger, Andreas (Andreas Krüger, DV-RATIO), us...@subversion.apache.org
Johan Corveleyn wrote on Fri, Jan 14, 2011 at 23:52:10 +0100:
> Ok, after rereading this thread, I'm starting to understand what you
> mean: why would "merge" perform an expensive diffing algorithm while
> it can just be 100% sure that it can simply copy the contents from the
> source to the target (because the target file has not had any changes
> since it was branched)?
>
> I think it's a good suggestion, but I can't really comment more on
> (the feasibility of) it, because I'm not that familiar with that part
> of the codebase. I've only concentrated on the diff algorithm itself
> (and how it's used by "svn diff" and "svn merge" (for text files)).
> Maybe someone else can chime in to comment on that?

In other words, merging changes from file.c@BRANCH to trunk should
detect that file@trunk and file@BRANCH@BRANCH-CREATION are the same
node-revision?

I don't know whether it does that... but giving the question more
visibility (as opposed to burying it in the middle of a paragraph on
users@) might help you get an answer. :-)

krueger, Andreas (Andreas Krüger, DV-RATIO)

Jan 17, 2011, 11:30:40 AM
to us...@subversion.apache.org
Hello, Daniel and all,

> In other words, merging changes from file.c@BRANCH to trunk should
> detect that file@trunk and file@BRANCH@BRANCH-CREATION are the same
> node-revision?

I think that my intended optimization should work in this case.
But I don't think that's the condition I mean.

It does not feel general enough.

But then, I also don't think this has to be discussed in the context
of this optimization proposal. After all, there is some condition
already implemented. SVN already knows how to check
whether a merge is possible or not in the binary case.

That condition IS what I want.

If a binary merge would be possible, take the fast path: do the
binary-style merge and don't bother with text diffs.


> but giving the question more
> visibility (as opposed to burying it in the middle of a paragraph on
> users@) might help you get an answer. :-)

Thanks for the hint!

I'd be more than willing to convert this to an issue at
http://subversion.tigris.org/issues .

Writing a self-contained script that demonstrates the performance
problem (starting with the creation of a scratch SVN repository) -
would that be a good next step?

Regards, Andreas

Johan Corveleyn

Jan 21, 2011, 4:19:07 AM
to krueger, Andreas (Andreas Krüger, DV-RATIO), us...@subversion.apache.org

Hi Andreas,

Yes, I think you should probably file an issue for this in the issue
tracker, referring to this thread. If you could write a self-contained
script to demonstrate, that would certainly be a good thing.

Just to confirm your hypothesis about the special shortcut in "svn
merge" for binary files, here is the relevant excerpt from
subversion/libsvn_client/merge.c (starting at line 1454):

[[[
  /* Special case: if a binary file's working file is
     exactly identical to the 'left' side of the merge, then don't
     allow svn_wc_merge to produce a conflict.  Instead, just
     overwrite the working file with the 'right' side of the
     merge.  Why'd we check for local mods above?  Because we want
     to do a different notification depending on whether or not
     the file was locally modified.

     Alternately, if the 'left' side of the merge doesn't exist in
     the repository, and the 'right' side of the merge is
     identical to the WC, pretend we did the merge (a no-op). */
  if ((mimetype1 && svn_mime_type_is_binary(mimetype1))
      || (mimetype2 && svn_mime_type_is_binary(mimetype2)))
    {
      /* For adds, the 'left' side of the merge doesn't exist. */
      svn_boolean_t older_revision_exists =
          !merge_b->add_necessitated_merge;
      svn_boolean_t same_contents;
      SVN_ERR(svn_io_files_contents_same_p(&same_contents,
                                           (older_revision_exists ?
                                            older_abspath : yours_abspath),
                                           mine_abspath, subpool));
      if (same_contents)
        {
          if (older_revision_exists && !merge_b->dry_run)
            {
              SVN_ERR(svn_io_file_move(yours_abspath, mine_abspath,
                                       subpool));
            }
          merge_outcome = svn_wc_merge_merged;
          merge_required = FALSE;
        }
    }
]]]

See also:
http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_client/merge.c?view=markup

That said, I'm not so sure that we could blindly take this same
shortcut for text files. It sounds like a trivial decision, but there
might be some hidden problems if we do this. We just need to be
careful, and think this through ...

Cheers,
--
Johan

Daniel Becroft

Jan 21, 2011, 6:29:35 PM
to Johan Corveleyn, krueger, Andreas (Andreas Krüger, DV-RATIO), us...@subversion.apache.org

(deja vu, I've just been looking at this bit of code)

For binary files, this special case causes issues (e.g. #3686), because it bypasses the general work-queue logic, and any special logic for properties does not get applied (e.g. svn:executable). This currently works for text files, since it runs the svn_client_mergeX() process.

Just my $0.02.

Cheers,
Daniel B.