Can you share both versions of this file, privately if needed?
I'd like to reproduce.
Thanks,
Stefan
Hi Andreas,
This is interesting, because it just so happens that I've been working
(on and off for the past half year) on a feature branch with
performance improvements for the diff algorithm in svn, especially for
big files (I have also been using a "big" XML file of around 60,000
lines for testing).
Textual merging in svn makes use of a variant of the standard diff
algorithm, namely diff3. Just a couple of days ago, I finally
succeeded in making diff3 take advantage of those performance
improvements (haven't committed this to the branch yet, but maybe I'll
get to it tonight).
Would you be able to build an svn client from source? If so, could you
perhaps build a client from
http://svn.apache.org/repos/asf/subversion/branches/diff-optimizations-bytes
?
This already contains the performance improvement for regular 'svn
diff', so you could test if that makes any difference. If you wait
until I've committed the changes to diff3, you could perhaps see the
impact on the merge you're trying to do.
[note: this performance improvement is currently not included in the
svn trunk, so it's not currently on track to be included in 1.7.
However, I think it's still an option (depends on some more work on
the branch, and then possibly review, some tweaks, ... if the other
devs agree with this change)]
[note2: don't expect this perf improvement to bring it down to 6
seconds, but it might still make a big difference. It works very well
if both files are quite similar and the changes are close together in
the file, i.e. when there is a lot of identical prefix and suffix.
Judging from your description, though, there is a big difference
between both versions of the (200,000+ line) file.]
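For illustration, the prefix/suffix scanning idea can be sketched in a
few lines of Python (my own sketch, not the actual code on the branch):

```python
def strip_common_ends(a, b):
    """Strip identical leading and trailing lines from line lists a and b,
    so the expensive LCS-based diff only has to look at the middle."""
    # Scan the identical prefix.
    p = 0
    while p < len(a) and p < len(b) and a[p] == b[p]:
        p += 1
    # Scan the identical suffix, without overlapping the prefix.
    s = 0
    while (s < len(a) - p and s < len(b) - p
           and a[len(a) - 1 - s] == b[len(b) - 1 - s]):
        s += 1
    return p, s, a[p:len(a) - s], b[p:len(b) - s]

# Two large, similar files: one changed line in the middle.
a = ["x"] * 1000 + ["old"] + ["y"] * 1000
b = ["x"] * 1000 + ["new"] + ["y"] * 1000
p, s, a_mid, b_mid = strip_common_ends(a, b)
# Only the one-line middle sections still need the full diff algorithm.
```

When the changes sit close together, nearly the whole file is skipped
in two linear scans before the quadratic-ish diff ever runs.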
Cheers,
--
Johan
Hey Johan,
I would be interested in doing testing and reviewing the changes
on your branch. There might still be enough time to get them into 1.7.
I don't have any suitably large XML files though.
If you and/or Andreas could provide some that would be great.
Thanks,
Stefan
Thanks, that would be great (btw, danielsh also expressed an interest
in reviewing the branch). I will try to give a status update on the
dev list after I've committed the changes for diff3.
> I don't have any suitably large XML files though.
> If you and/or Andreas could provide some that would be great.
I was thinking of writing a python script (as philip already
suggested) that can generate several variants of large files with
semi-random data. I have some prototype code for this lying around, so
if I find the time, I'll try to wrap this up and send it to the dev
list. OTOH, real-world examples are probably even better.
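As a rough sketch of what such a generator script might look like (the
function names and parameters here are made up for illustration, not
taken from any existing prototype):

```python
import random

def generate_base(n_lines, seed):
    """Generate n_lines of semi-random XML-ish data, reproducibly."""
    rng = random.Random(seed)
    return ['<item id="%d" value="%04d"/>' % (i, rng.randrange(10000))
            for i in range(n_lines)]

def mutate(lines, seed, change_rate=0.01):
    """Randomly rewrite a small fraction of lines, simulating an
    edited variant of the same large file."""
    rng = random.Random(seed)
    out = list(lines)
    for i in range(len(out)):
        if rng.random() < change_rate:
            out[i] = '<item id="%d" value="CHANGED"/>' % i
    return out

base = generate_base(60000, seed=1)
edited = mutate(base, seed=2)
```

Varying change_rate and where the changes cluster would let the test
exercise both the best case (changes in the middle) and harder cases.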
Cheers,
--
Johan
Tony.
> -----Original Message-----
> From: Johan Corveleyn [mailto:jco...@gmail.com]
> Sent: 13 January 2011 14:26
> To: krueger, Andreas (Andreas Krüger, DV-RATIO);
> us...@subversion.apache.org
> Subject: Re: Trival merge of big text file: Dismal
> performance, 540x faster if binary.
>
How about taking periodic dumps of some large repository? I count on
propchanges to give the "small change in the middle of the file" effect.
Another option:
for i in 0 1 2 3 4 5 6 7 8 9; do
  cat $REPOS/db/revs/*/*$i
done | tar -cf- > "`date`"
Without the tar.
Ok, after rereading this thread, I'm starting to understand what you
mean: why would "merge" perform an expensive diffing algorithm when it
can be 100% sure that it can simply copy the contents from the source
to the target (because the target file has not had any changes since
it was branched)?
I think it's a good suggestion, but I can't really comment more on
(the feasibility of) it, because I'm not that familiar with that part
of the codebase. I've only concentrated on the diff algorithm itself
(and how it's used by "svn diff" and "svn merge" (for text files)).
Maybe someone else can chime in to comment on that?
Of course, if there were any change to the target file, it wouldn't be
a trivial merge (copy contents) anymore, so you would immediately fall
back to the very expensive case. But I agree that's hardly a reason
not to take the shortcut when you can...
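In rough pseudocode terms, the shortcut under discussion might look
something like this Python sketch (not svn's actual control flow; the
function names are hypothetical):

```python
def three_way_merge(source_text, target_text, ancestor_text):
    # Placeholder for the real, expensive diff3-based merge.
    return target_text, "needs-diff3"

def merge_file(source_text, target_text, ancestor_text):
    """Sketch of a trivial-merge shortcut: skip diff3 entirely when
    one side is unchanged relative to the common ancestor."""
    if target_text == ancestor_text:
        # Target untouched since branching: copying the source is safe.
        return source_text, "merged"
    if source_text == ancestor_text:
        # Source untouched: there is nothing to merge.
        return target_text, "unchanged"
    # Both sides changed: fall back to the expensive 3-way merge.
    return three_way_merge(source_text, target_text, ancestor_text)
```

The interesting (and open) question is whether svn can cheaply prove
the "target == ancestor" precondition, e.g. via node-revision identity
rather than comparing file contents.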
A couple more thoughts on the diff-side of the story:
- Your perl script presents an extremely hard case for the diff
algorithm, because:
* Files A and B differ in every one of the 1,000,000 lines (so
my prefix/suffix scanning optimization will not help at all).
* All lines in each file are also unique inside that file (this makes
the diff more expensive).
* Most lines have the same length as other lines (this also makes it
more expensive).
* The lines are very short (the diff algorithm's cost is mainly
proportional to the number of lines).
These things are very atypical when you compare with real-world
examples. Usually there is some identical part at the beginning and/or
the end, and lines vary a lot in length. If there is a significant
portion of identical lines at the beginning and/or the end, the
optimizations in the diff-optimizations-bytes branch will help a lot.
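The pathological properties described above are easy to reproduce;
here is a small Python reconstruction (my own, not the original perl
script) that builds such a pair of files and checks each property:

```python
# Worst case for both prefix/suffix stripping and the LCS algorithm:
# every line differs between A and B, every line is unique within its
# file, and all lines are equally short.
n = 1000  # the real test used 1,000,000 lines

a = ["a%06d" % i for i in range(n)]
b = ["b%06d" % i for i in range(n)]

assert all(x != y for x, y in zip(a, b))      # no common prefix/suffix
assert len(set(a)) == n and len(set(b)) == n  # all lines unique
assert len(set(map(len, a + b))) == 1         # identical line lengths
```

Real-world files violate at least one of these properties almost
always, which is why the optimization pays off in practice.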
- Interestingly, GNU diff calculates the diff between these files in 7
seconds on my machine. But if I give it the option '--minimal', it
also runs for hours (started it 2 hours ago; it's still running).
- Can you try the merge on your original example (big.xml) with the
option -x-b (ignore changes in amount of whitespace)? Just to know if
it makes a difference. In my tests this made diff *much* faster, so
I'm guessing the same holds for merge. Of course, this depends
entirely on the example data; it won't help a bit for the
perl-generated files (those will be slowed down even more).
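Conceptually, ignoring the amount of whitespace means normalizing runs
of blanks before comparing lines; here is a Python sketch of that idea
(svn's actual -x-b handling lives in the C diff code and may differ in
detail):

```python
import re

def normalize_ws(line):
    """Collapse runs of spaces/tabs into a single space and drop
    trailing whitespace -- roughly what 'ignore changes in amount
    of whitespace' compares."""
    return re.sub(r"[ \t]+", " ", line.rstrip())

old = "int  x =  1;"
new = "int x = 1;"
# Under -b-style comparison these lines are equal, so a reformatting
# change produces no diff hunk at all.
same = normalize_ws(old) == normalize_ws(new)
```

If most of the churn between two versions is re-indentation, this
normalization can shrink the effective diff dramatically.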
Cheers,
--
Johan
In other words, merging changes from file.c@BRANCH to trunk should
detect that file@trunk and file@BRANCH@BRANCH-CREATION are the same
node-revision?
I don't know whether it does that... but giving the question more
visibility (as opposed to burying it in the middle of a paragraph on
users@) might help you get an answer. :-)
Hi Andreas,
Yes, I think you should probably file an issue for this in the issue
tracker, referring to this thread. If you could write a self-contained
script to demonstrate, that would certainly be a good thing.
Just to confirm your hypothesis about the special shortcut in "svn
merge" for binary files, here is the relevant excerpt from
subversion/libsvn_client/merge.c (starting at line 1454):
[[[
  /* Special case: if a binary file's working file is
     exactly identical to the 'left' side of the merge, then don't
     allow svn_wc_merge to produce a conflict.  Instead, just
     overwrite the working file with the 'right' side of the
     merge.  Why'd we check for local mods above?  Because we want
     to do a different notification depending on whether or not
     the file was locally modified.

     Alternately, if the 'left' side of the merge doesn't exist in
     the repository, and the 'right' side of the merge is
     identical to the WC, pretend we did the merge (a no-op). */
  if ((mimetype1 && svn_mime_type_is_binary(mimetype1))
      || (mimetype2 && svn_mime_type_is_binary(mimetype2)))
    {
      /* For adds, the 'left' side of the merge doesn't exist. */
      svn_boolean_t older_revision_exists =
          !merge_b->add_necessitated_merge;
      svn_boolean_t same_contents;

      SVN_ERR(svn_io_files_contents_same_p(&same_contents,
                                           (older_revision_exists ?
                                            older_abspath : yours_abspath),
                                           mine_abspath, subpool));
      if (same_contents)
        {
          if (older_revision_exists && !merge_b->dry_run)
            {
              SVN_ERR(svn_io_file_move(yours_abspath, mine_abspath,
                                       subpool));
            }
          merge_outcome = svn_wc_merge_merged;
          merge_required = FALSE;
        }
    }
]]]
See also:
http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_client/merge.c?view=markup
That said, I'm not so sure that we could blindly take this same
shortcut for text files. It sounds like a trivial decision, but there
might be some hidden problems if we do this. We just need to be
careful, and think this through ...
Cheers,
--
Johan