I have recently run into an issue with my subversion system (1.7.1)
where a specific component I am trying to build has failed. I have
had sucessful builds of this project before this issue happened since
we upgraded to 1.7.1. I am currently getting by this by using
TortoiseSVN 1.7.1 to do the check-in of the files left in the
workspace.
This issue seems to be intermittent as it does not happen every time.
I am using subversion 1.7.1 windows binaries and subversion is
hosting on Apache 2.2.21.
I have run "svnadmin verify" against the directory and it has come
back clean. I was wondering what type of events can cause this to
happen? Are there any resolutions?
Here are some entries from the apache log for the last time this
issue happened.
The Apache error log shows the following lines:
[Tue Jan 31 11:37:23 2012] [error] [client 9.31.13.109] Could not
MERGE resource "/repository/!svn/txn/61847-1bz5" into
"/repository/project/binaries/release/phase1/iteration/81/trunk
". [409, #0]
[Tue Jan 31 11:37:23 2012] [error] [client 9.31.13.109] An error
occurred while committing the transaction. [409, #160004]
[Tue Jan 31 11:37:23 2012] [error] [client 9.31.13.109] predecessor
count for the root node-revision is wrong: found 61815,
committing r61852 [409, #160004]
[Tue Jan 31 11:37:24 2012] [error] [client 9.31.13.109] Could not
fetch resource information. [404, #0]
[Tue Jan 31 11:37:24 2012] [error] [client 9.31.13.109] Named
transaction doesn't exist. [404, #0]
The following are from the Apache access log at the same time:
9.31.13.109 - username [31/Jan/2012:11:37:22 -0800] "MERGE
/repository/project/binaries/release/phase1/iteration/81/trunk
HTTP/1.1" 409 281
9.31.13.109 - username [31/Jan/2012:11:37:24 -0800] "DELETE
/repository/!svn/txn/61847-1bz5 HTTP/1.1" 404 232
Any help/comments would be appreciated. Thank you.
Jason Wong
This error message indicates a bug has been detected. The implications
of the bug are that walking backwards through history (eg, 'svn log -r
HEAD:0') may skip some revisions.
I'm interested in getting to the bottom of this.
There are two ways to identify instances of the bug: either by directly
querying the filesystem (explained below), or by running 'svn log -q ^/'
and looking for gaps in the sequences of revision numbers.
The second approch is as follows: look for revisions N such that
f(N)�!=�f(N-1), where f(N) is 'N minus (predecessors count of the
node-revision of / in revision N)'. To compute f(), you find the
predecessors count; it is given by the 'count:' header in the output
of the attached script:
% ./dump-noderev.pl /tmp/svn/r1 / 1
id: 0.0.r1/4760
type: dir
pred: 0.0.r0/17
count: 1
text: 1 4640 107 107 c34b7d0de08f48db97941642f719e2f4
cpath: /
copyroot: 0 /
So, in summary, there are two ways to identify revisions that trigger
the bug: by 'svn log' or by manually looking for gaps in the sequence of
node-revisions of the root of the filesystem.
I would ask you to run the attached script anyway, on the HEAD revision,
and review its output for sensibility --- in particular, what is the
value of the minfo-cnt header (if present).
> ��� [Tue Jan 31 11:37:24 2012] [error] [client 9.31.13.109] Could not
> fetch resource information.� [404, #0]
> ��� [Tue Jan 31 11:37:24 2012] [error] [client 9.31.13.109] Named
> transaction doesn't exist.� [404, #0]
>
> The following are from the Apache access log at the same time:
>
> ��� 9.31.13.109 - username [31/Jan/2012:11:37:22 -0800] "MERGE
> /repository/project/binaries/release/phase1/iteration/81/trunk
> HTTP/1.1" 409 281
> ��� 9.31.13.109 - username [31/Jan/2012:11:37:24 -0800] "DELETE
> /repository/!svn/txn/61847-1bz5 HTTP/1.1" 404 232
>
> Any help/comments would be appreciated. Thank you.
>
As I said, I'd be interested in isolating the cause of these errors.
Is there anything common to revisions that triggered the bug (as
explained above)? Are they concomitant with concurrent writes (commits,
propedits, 'svn lock' operations, 'svnadmin pack' operations)? What
version of svn does your server run (1.7.1?)? What operating system
does your server run? Is there anything noteworthy about its
filesystems or disks?
>
> Jason Wong
Thanks,
Daniel
I am working with our lead developer to come up with more details
on our build process. I will post this when I get it.
Our svn repository is 1.7.1 and is hosted on Apache 2.2.21 on a
Windows 2003 server. The server has running RAID5 with SCSI disks.
Because my systems are on Windows, I don't think the perl script
you had sent me will run as there are a couple commands in it
that are called which I don't have. Do you have any suggestions
for how I can run the script?
In the meantime, I am running svn log -q and will go though the
output to scan for missing revisions. I will let you know those
results when I have them.
Thank you.
Jason Wong
Okay. We've seen this happen on svn.apache.org, which today runs
httpd-2.3/FreeBSD/zfs, and was probably running this or a similar stack
when the problems triggered there.
> Because my systems are on Windows, I don't think the perl script
> you had sent me will run as there are a couple commands in it
> that are called which I don't have. Do you have any suggestions
> for how I can run the script?
>
Get xxd.exe from http://www.vim.org/ and cat.exe and sed.exe from
http://gnuwin32.sf.net (or from Cygwin). Delete from the script the
line that uses the 'head' command.
Or just reimplement it in your language of choice. Basically it runs
'svnlook', parses out the noderev-id as '%s.%s.%s', parses \3 as
'r%d.%d', and opens revision file \1 at offset \2 --- accounting for the
fact that that revision may have been 'svnadmin pack'ed.
If I spoke too fast, feel free to ask.
> In the meantime, I am running svn log -q and will go though the
> output to scan for missing revisions. I will let you know those
> results when I have them.
>
Thanks,
Daniel
> Thank you.
>
> Jason Wong
Typo fix: 'r%d.%d' -> 'r%d/%d'
There is a second use of 'head', which you shouldn't delete. So
instead, just get head.exe from the same place as the other two, or use
the following kind of statement:
my $line = do {
open FOO, "perl -V 2>&1 |";
<FOO>;
};
Lastly, there's a 'sed' invocation that uses single-quoted arguments.
All it does is "print the input up to the first empty line" --- feel
free to implement it differently. (One way:
my @lines = split /\n/, `command | goes | here`;
$_ and print or last for @lines;
Both of these examples could do with some error checking.)
Daniel
(yes, there's also a neater way to do this without split(). but it's
not a Perl class here)
>> Get xxd.exe from http://www.vim.org/ and cat.exe and sed.exe from
>> http://gnuwin32.sf.net (or from Cygwin). Delete from the script the
>> line that uses the 'head' command.
>
> There is a second use of 'head', which you shouldn't delete. So
> instead, just get head.exe from the same place as the other two, or use
> the following kind of statement:
Or install CygWin and run the scripts from inside CygWin. This does
present end-of-line issues, so be very careful about using "svn:eol
native" properties.
Hello.
Sorry for the delay. Here is an update of what I have done since
the last time I posted.
I have run "svn log -q ^/" on the respository and it came back with
no missing revisions.
Since I first posted, each of the projects we have tried to build
that had failed have since successfully been built without any changes
on our side.
I was having an issue with converting the script to run in windows as
I was only getting the first line returned so I set up cygwin.
I ran the script against both of the revisions (61815 and 61852) in
mentioned in the Apache error log and the output was the same for
each.
Commands:
dump-noderev.pl /repository
/project/binaries/release/phase1/iteration/81/trunk 61815
dump-noderev.pl /repository
/project/binaries/release/phase1/iteration/81/trunk 61852
Output:
id: 9-45362.0-61242.r61424/0
type: dir
pred: 9-45362.0-60310/0
count: 43
text: 58741 121716266 218 218 74eb31e90880ba1345fc49252ca6efe6
cpath: /project/binaries/release/phase1/iteration/81/trunk
copyfrom: 61423 /project/binaries/release/phase1/iteration/80/trunk
Is this information helpful? Let me know if this tells you anything. Thanks
Jason.
I stand corrected, then. I've confirmed on another instance of the bug
that 'svn log -q ^/' does not behave abnormally when the bug is present.
Sorry for the misinformation.
Question to devs: what operation will walk the predecessor links for
the root fspath? (and can therefore be used to identify instances of
the bug)
> Since I first posted, each of the projects we have tried to build
> that had failed have since successfully been built without any changes
> on our side.
>
What is the significance of this? I don't know how your build process
interacts with Subversion.
> I was having an issue with converting the script to run in windows as
> I was only getting the first line returned so I set up cygwin.
>
> I ran the script against both of the revisions (61815 and 61852) in
> mentioned in the Apache error log and the output was the same for
> each.
>
> Commands:
> dump-noderev.pl /repository
> /project/binaries/release/phase1/iteration/81/trunk 61815
> dump-noderev.pl /repository
> /project/binaries/release/phase1/iteration/81/trunk 61852
>
> Output:
> id: 9-45362.0-61242.r61424/0
> type: dir
> pred: 9-45362.0-60310/0
Are you sure that's the value of the pred: field? It contains only
one ".", instead of two.
> count: 43
> text: 58741 121716266 218 218 74eb31e90880ba1345fc49252ca6efe6
> cpath: /project/binaries/release/phase1/iteration/81/trunk
> copyfrom: 61423 /project/binaries/release/phase1/iteration/80/trunk
>
> Is this information helpful? Let me know if this tells you anything. Thanks
>
The fact that the output is identical suggests that the
/project/binaries/release/phase1/iteration/81/trunk tree hasn't changed
between those two revisions (or that there was a directory replace above
it).
However, this is the error you report:
��� [Tue Jan 31 11:37:23 2012] [error] [client 9.31.13.109] predecessor count for the root node-revision is wrong: found 61815, committing r61852� [409, #160004]
The metadata this error complains about will be output by these two
commands:
./dump-noderev.pl /repository / 61851
./dump-noderev.pl /repository / 61852
> Jason.
Cheers,
Daniel
I missed a part of it, you are right. here is the full pred line:
pred: 9-45362.0-60310.r60310/0
>> count: 43
>> text: 58741 121716266 218 218 74eb31e90880ba1345fc49252ca6efe6
>> cpath: /project/binaries/release/phase1/iteration/81/trunk
>> copyfrom: 61423 /project/binaries/release/phase1/iteration/80/trunk
>>
>> Is this information helpful? Let me know if this tells you anything. Thanks
>>
>
> The fact that the output is identical suggests that the
> /project/binaries/release/phase1/iteration/81/trunk tree hasn't changed
> between those two revisions (or that there was a directory replace above
> it).
>
> However, this is the error you report:
>
> [Tue Jan 31 11:37:23 2012] [error] [client 9.31.13.109] predecessor count for the root node-revision is wrong: found 61815, committing r61852 [409, #160004]
>
> The metadata this error complains about will be output by these two
> commands:
>
> ./dump-noderev.pl /repository / 61851
------------------------------
id: 0.0.r61851/33470
type: dir
pred: 0.0.r61850/3844
count: 61818
text: 61851 32225 1232 1232 7555349571e297c23e647cc2441d5b8f
cpath: /
copyroot: 0 /
minfo-cnt: 25685
------------------------------
> ./dump-noderev.pl /repository / 61852
------------------------------
id: 0.0.r61852/27663
type: dir
pred: 0.0.r61851/33470
count: 61819
text: 61852 26417 1233 1233 712fec619d55677e67aca8f7aa4ceb97
cpath: /
copyroot: 0 /
minfo-cnt: 25685
--------------------------------
>
>> Jason.
>
> Cheers,
>
> Daniel
>
Hi Daniel
Thanks for the quick reply. I have posted the results from the two
commands you have asked me to run above as well as the full pred
value that was incomplete.
Let me know if you need any other information.
Thanks.
Jason
Okay. Did /project/binaries/release/phase1/iteration/81/trunk@61852, or
anything under it, change between r60311 and r61423? (I'm intentionally
using the peg rev notation here --- these id's do not change when
a parent of the path in question gets copied.)
The output from these two tells me two things:
1. The minfo-cnt value is reasonable (within a typical ballpark).
That's relevant since minfo-cnt abnormalities were seen in another
instance of the bug.
2. Everything else looks correct: the 'id:'/'pred:' headers are accurate,
and the 'count:' header was incremented correctly. The 'count:' header
does, however, indicate that your repository has _in the past_ triggered
an instance of the bug.
In a bit more detail: the value of the 'count:' header should be equal to
the revision number given as the third argument to dump-noderev.pl.
(That revision number is also embedded in the 'id:' header, and is
practically guaranteed to be embedded in the 'text:' header as well.)
So, there are two things you can do to help us identify the bug:
1. Hunt for past instance of the bug, identify what revisions triggered
it, and try and identify a common pattern to those revisions. (This
basically calls for running 'dump-noderev.pl $REPOS /' in a loop and
looking for non-sequential 'count:' or 'pred:' headers in the output for
a pair of successive revisions.)
2. Look for new instances of the bug. You could periodically scan for
new instances of the bug, or implement a post-commit hook such as the
following (written for unix-like systems, sorry):
[[[
# look for a corruption or two
minfo_cnt() {
dump-noderev.pl $REPOS / "$1" | sed -ne 's/minfo-cnt: //p'
}
PREV_REV=`expr $REV - 1`
if expr `minfo_cnt $PREV_REV` - `minfo_cnt $REV` | grep ....... >/dev/null; then
# echo an error to stderr and mail the admin
exit 1
fi
skipped_root_noderevs() {
expr $1 - `dump-noderev.pl $REPOS / $1 | sed -ne 's/^count: //p'`
}
if [ "`skipped_root_noderevs $PREV_REV`" -ne "`skipped_root_noderevs $REV`" ]; then
# echo an error to stderr and mail the admin
exit 2
fi
]]]
> >
> >> Jason.
> >
> > Cheers,
> >
> > Daniel
> >
>
> Hi Daniel
>
> Thanks for the quick reply. I have posted the results from the two
> commands you have asked me to run above as well as the full pred
> value that was incomplete.
>
> Let me know if you need any other information.
> Thanks.
>
Replied above. The summary is that you have indeed ran into the bug,
but for some reason not in r61852 but sometime before that, (and why
did r61852 trigger the syslog error anyway? Good question) and now
we're at the point of trying to identify the cause of the bug --- at
least circumstantially.
> Jason
Thanks for your help so far,
Daniel
>
> The output from these two tells me two things:
>
> 1. The minfo-cnt value is reasonable (within a typical ballpark).
> That's relevant since minfo-cnt abnormalities were seen in another
> instance of the bug.
>
> 2. Everything else looks correct: the 'id:'/'pred:' headers are accurate,
> and the 'count:' header was incremented correctly. The 'count:' header
> does, however, indicate that your repository has _in the past_ triggered
> an instance of the bug.
This is true. We have seen the bug happen before. The first occurence
of this that we had seen was Dec. 7th, 2011, a few days after we went
from 1.6.16 to 1.7.1. That was the first time we had seen that happen.
At the time, we did not know about the cause and the developer who
had encountered the error didn't report it and was able to work
around it. From the Apache logs we have:
[Wed Dec 07 15:16:36 2011] [error] [client 10.2.3.1] predecessor
count for the root node-revision is wrong: found 59444,
committing r59478 [409, #160004]
[Wed Dec 07 15:33:47 2011] [error] [client 10.2.3.2] predecessor
count for the root node-revision is wrong: found 59482,
committing r59516 [409, #160004]
[Wed Dec 07 15:35:19 2011] [error] [client 10.2.3.3] predecessor
count for the root node-revision is wrong: found 59488,
committing r59522 [409, #160004]
[Wed Dec 07 15:44:10 2011] [error] [client 10.2.3.4] predecessor
count for the root node-revision is wrong: found 59505,
committing r59539 [409, #160004]
Of the ips above, the last line is from the build machine. The others
were from developer workstations. I mentioned the most recent two
times first as we were actually aware of the issue at that time and
it was recent so we knew to start looking into it. Between Dec. 7 and
Jan. 31, the bug has occurred 12 times, 3 of those times from the
build server. The rest are from workstations. This month, it has only
occurred once and it was from the build server.
Each of these times, the error has occurred in different parts of
the repository.
>
> In a bit more detail: the value of the 'count:' header should be equal to
> the revision number given as the third argument to dump-noderev.pl.
> (That revision number is also embedded in the 'id:' header, and is
> practically guaranteed to be embedded in the 'text:' header as well.)
> So, there are two things you can do to help us identify the bug:
>
> 1. Hunt for past instance of the bug, identify what revisions triggered
> it, and try and identify a common pattern to those revisions. (This
> basically calls for running 'dump-noderev.pl $REPOS /' in a loop and
> looking for non-sequential 'count:' or 'pred:' headers in the output for
> a pair of successive revisions.)
I will try and see if I can get this done this week.
>
> 2. Look for new instances of the bug. You could periodically scan for
> new instances of the bug, or implement a post-commit hook such as the
> following (written for unix-like systems, sorry):
>
> [[[
> # look for a corruption or two
> minfo_cnt() {
> dump-noderev.pl $REPOS / "$1" | sed -ne 's/minfo-cnt: //p'
> }
> PREV_REV=`expr $REV - 1`
> if expr `minfo_cnt $PREV_REV` - `minfo_cnt $REV` | grep ....... >/dev/null; then
> # echo an error to stderr and mail the admin
> exit 1
> fi
>
> skipped_root_noderevs() {
> expr $1 - `dump-noderev.pl $REPOS / $1 | sed -ne 's/^count: //p'`
> }
> if [ "`skipped_root_noderevs $PREV_REV`" -ne "`skipped_root_noderevs $REV`" ]; then
> # echo an error to stderr and mail the admin
> exit 2
> fi
> ]]]
>
I will talk to the build team here about the post-commit hook. We have had
the bug occur again since my last reply.
>
> Replied above. The summary is that you have indeed ran into the bug,
> but for some reason not in r61852 but sometime before that, (and why
> did r61852 trigger the syslog error anyway? Good question) and now
> we're at the point of trying to identify the cause of the bug --- at
> least circumstantially.
>
> Thanks for your help so far,
>
> Daniel
Hi Daniel.
Replies above. Sorry about the delay in replying. I have been really
busy of late. I will try and get the results this week, if not, it
will most likely be next week.
Thanks
Jason.
Just to be clear: These errors emitted by the 1.7.1 server prevent the
bug from corrupting new revisions. With a 1.6 server the problem would
go unnoticed and create bad revision data.
When this corruption occurs, the repository still works.
But the history links for affected revisions are incorrect.
Hi Stephan.
So I think I misunderstood why the error messages were occurring.
I had thought that there was a condition done by this check (in 1.7),
that was erroneously causing svn to reject the attempt to check-in.
I guess I am wondering that if this is the case, then why is it that
if the check-in fails, and then we manually check it in again using
tortoisesvn, that it works the second time?
So the errors prevent the bug from corrupting new revisions? Is this
something between the 1.7 versions or would this have been in 1.6
versions as well? We have been using svn for a while now and I am
wondering what this means that for 1.6, that this issue has been
occurring from communcations between 1.6 client and 1.6 server.
Also, is this bug something that svnadmin verify will not detect?
The last time we ran svnadmin verify, it said all was good.
If it is the case that this bug has been occuring for a long time,
what are the implications of the history links for affected
revisions? When you say the history links are incorrect, does it
just put in a random value or is it actually unreadable values?
Does this mean subsequent revisions that occur after these bad
revisions will propagate this bad information?
A developer asked me to pose the following question. If he was to
open a bad revision, would the client fail and give an error prompt
or would it display history information which could belong to other
files?
Thanks.
Jason.
The purpose of this error is to abort a commit in a situation where
Subversion would otherwise create a commit that has the wrong idea
about the number of predecessor revisions of the repository's root node.
The root node is changed in every commit. So for revision N, the
number of predecessor revisions should add up to N (including revision
zero). Somehow, this count sometimes ends up being wrong during a
commit, and such commits are now rejected. However, when checking
the count for sanity, Subversion accounts for the fact that older
revisions might have gotten the count wrong. I.e. the check only prevents
the count from becoming even more wrong in future revisions than it
already is.
> I guess I am wondering that if this is the case, then why is it that
> if the check-in fails, and then we manually check it in again using
> tortoisesvn, that it works the second time?
As far as I know, we don't know what the underlying problem is yet.
So I'm afraid I cannot answer all of your questions in a satisfactory
manner because I don't know the answers.
Daniel might know more -- he has been following the problem more closely
than me. And I hope he will correct me if I'm making any wrong or
misleading statements in this post :)
> So the errors prevent the bug from corrupting new revisions?
Yes. This error is a sanity check.
Right now, it's a sort of good thing to be getting the error.
Just not as good as a proper fix for the, as of yet unknown, underlying
problem that's triggering the sanity check.
> Is this something between the 1.7 versions or would this have been in 1.6
> versions as well?
We currently assume that is also happens with 1.6.
> We have been using svn for a while now and I am
> wondering what this means that for 1.6, that this issue has been
> occurring from communcations between 1.6 client and 1.6 server.
It has probably been happening since before the 1.7 release.
The problem was first discovered in the ASF repository. The first
occurrence there has been traced to a time before 1.7.0 was released.
At the time the ASF server was still running 1.6, and the first commit
that introduced the problem in the ASF repository was very likely
a 1.6 client.
> Also, is this bug something that svnadmin verify will not detect?
> The last time we ran svnadmin verify, it said all was good.
Apparently, svnadmin verify won't find it.
This should probably be fixed. But the priority right now is to
understand why it is happening in the first place.
> If it is the case that this bug has been occuring for a long time,
> what are the implications of the history links for affected
> revisions?
'svn log' skips some revisions where it shouldn't. This is the
only harmful effect of this bug, as far as I know.
> When you say the history links are incorrect, does it
> just put in a random value or is it actually unreadable values?
I don't know if it is random or somehow predetermined. That depends
on how the wrong predecessor value actually comes about.
> Does this mean subsequent revisions that occur after these bad
> revisions will propagate this bad information?
Yes, the predecessor count of subsequent revisions is off by
some constant value.
> A developer asked me to pose the following question. If he was to
> open a bad revision, would the client fail and give an error prompt
> or would it display history information which could belong to other
> files?
So far this hasn't triggered any noticeable errors apart from missing
items when viewing the log history of a path. Once the problem was
discovered by inspection of 'svn log' output, a sanity check was put
in place to prevent the problem from happening in new revisions.
This check is what you are hitting. What we need to do now is to figure
out why it's happening, and then fix that problem.
I'm not really surprised by this situation. I guess Jason means "the
same set of tree/text/prop/revprop changes" when he says "it", but by
the nature of the bug it may well have to do with some concurrency
issues on the server --- i.e., it depends on more than just the contents
being committed.
> Daniel might know more -- he has been following the problem more closely
> than me. And I hope he will correct me if I'm making any wrong or
> misleading statements in this post :)
Yeah, I'm following this thread :)
> > We have been using svn for a while now and I am
> > wondering what this means that for 1.6, that this issue has been
> > occurring from communcations between 1.6 client and 1.6 server.
>
> It has probably been happening since before the 1.7 release.
> The problem was first discovered in the ASF repository. The first
> occurrence there has been traced to a time before 1.7.0 was released.
> At the time the ASF server was still running 1.6, and the first commit
> that introduced the problem in the ASF repository was very likely
> a 1.6 client.
>
AI We should check whether it _was_ a 1.6 client..
> > Also, is this bug something that svnadmin verify will not detect?
> > The last time we ran svnadmin verify, it said all was good.
>
> Apparently, svnadmin verify won't find it.
> This should probably be fixed. But the priority right now is to
> understand why it is happening in the first place.
>
AI The 'verify' part is a relatively low-hanging fruit, though.
AI File an issue for the pred-count bug
> > If it is the case that this bug has been occuring for a long time,
> > what are the implications of the history links for affected
> > revisions?
>
> 'svn log' skips some revisions where it shouldn't. This is the
> only harmful effect of this bug, as far as I know.
>
As I'd mentioned upthread the ASF repository triggering this bug also
has bogus minfo-cnt headers. It's not known yet whether that is the
same bug, a separate bug, or a bug in something other than svn.
(that's also why I'm not filing an issue for the minfo-cnt bug yet --
it's not clear that the bug is in svn.)
> > When you say the history links are incorrect, does it
> > just put in a random value or is it actually unreadable values?
>
> I don't know if it is random or somehow predetermined. That depends
> on how the wrong predecessor value actually comes about.
>
The links are not garbage values -- they point to an existing location
(node-rev) in history. They don't point to the right location.
> > Does this mean subsequent revisions that occur after these bad
> > revisions will propagate this bad information?
>
> Yes, the predecessor count of subsequent revisions is off by
> some constant value.
>
Indeed; but that's just error propagation, rather than a bug.
Practically it implies fixing the effects of the bug will involve
changing all the historical metadata. (This probably means dump/load.
There is a way to implement this by doing in-place surgery (see l1488)
but it is of the "Don't try this at home" variety.)
The propagation of wrong counts, and the skipping of revisions during
backwards history walk, are both expected behaviours from the DAG layer
given the DAG in question (which includes a replacement of the root
node). In English: the lower level APIs are performing correctly given
the violating-the-invariants-of-higher-layers state of the filesystem.
> > A developer asked me to pose the following question. If he was to
> > open a bad revision, would the client fail and give an error prompt
> > or would it display history information which could belong to other
> > files?
>
> So far this hasn't triggered any noticeable errors apart from missing
> items when viewing the log history of a path. Once the problem was
Indeed. And that's about all the direct effects it can have. (Perhaps
some code is doing that history walk internally as part of another
higher-level operation, though.)
I guess minfo-cnt bug mentioned earlier should normally be very obvious,
as the values are off by an order of magnitude. If they are too high
then the tree walk to find svn:mergeinfo-ful nodes will not stop at the
root of trees that lack those nodes (so, inefficiency but not
incorrectness); if it's too low then it will ignore the svn:mergeinfo on
some nodes, so I expect merges will visibly misperform (but Stefan will
know better).
> discovered by inspection of 'svn log' output, a sanity check was put
> in place to prevent the problem from happening in new revisions.
> This check is what you are hitting. What we need to do now is to figure
> out why it's happening, and then fix that problem.
Daniel
(will read Jason's mail later -- time for breakfast.)
Done: r1294479 and followups. See also r1294470 which improves the
error message.
The latter will hopefully be in 1.7.4. svn_fs_verify() is not present
in 1.7.x so the r1294479 set of changes cannot easily be backported.
> AI File an issue for the pred-count bug
Issue #4129:
http://subversion.tigris.org/issues/show_bug.cgi?id=4129
Well, install fail2ban and have it mail you when that string appears
in the logs? I'll do so too...
> around it. From the Apache logs we have:
>
> [Wed Dec 07 15:16:36 2011] [error] [client 10.2.3.1] predecessor
> count for the root node-revision is wrong: found 59444,
> committing r59478 [409, #160004]
> [Wed Dec 07 15:33:47 2011] [error] [client 10.2.3.2] predecessor
> count for the root node-revision is wrong: found 59482,
> committing r59516 [409, #160004]
> [Wed Dec 07 15:35:19 2011] [error] [client 10.2.3.3] predecessor
> count for the root node-revision is wrong: found 59488,
> committing r59522 [409, #160004]
As Stefan mentioned, these represent commit attempts that were rejected
in order to prevent a new instance of the bug from entering the history.
> [Wed Dec 07 15:44:10 2011] [error] [client 10.2.3.4] predecessor
> count for the root node-revision is wrong: found 59505,
> committing r59539 [409, #160004]
>
> Of the ips above, the last line is from the build machine. The others
> were from developer workstations. I mentioned the most recent two
> times first as we were actually aware of the issue at that time and
> it was recent so we knew to start looking into it. Between Dec. 7 and
> Jan. 31, the bug has occurred 12 times, 3 of those times from the
> build server. The rest are from workstations. This month, it has only
> occurred once and it was from the build server.
>
What percentage of your commits are from the build server?
Is there anything noteworthy about commits that were in progress around
the time the bug occurred? (their svn:date's would be near the time
stamp in the httpd log)
> Each of these times, the error has occurred in different parts of
> the repository.
>
> Replies above. Sorry about the delay in replying. I have been really
> busy of late. I will try and get the results this week, if not, it
> will most likely be next week.
>
No problem.
Some things you could do are:
- What RA method do you use? svn:// or http://?
- Are the failing revisions always small (eg: just a URL-URL copy),
or always large (eg: results of a merge)?
- Do you have any caching enabled at the OS filesystem layer or
below it?
- Did you configure Subversion to use memcached?
- Did you configure a maximum cache size for Subversion?
- Could you try setting the maximum cache size to zero? (svnserve:
--memory-cache-size=0; mod_dav_svn: SVNInMemoryCacheSize 0)
This doesn't entirely disable caching, actually -- it just disables
the new 'membuffer' cache backend, reverting to another older
backend called 'inprocess'.
@Justin -- thanks for the information. I've asked Jason upthread some
questions (eg, whether there patterns of commits concurrent to the
failing ones), which likely apply to you too.
Daniel
Justin, Jason,
Some things you could do are:
- What RA method do you use? svn:// or http://?
- Are the failing revisions always small (eg: just a URL-URL copy),
or always large (eg: results of a merge)?
- Do you have any caching enabled at the OS filesystem layer or
below it?
- Did you configure Subversion to use memcached?
- Did you configure a maximum cache size for Subversion?
- Could you try setting the maximum cache size to zero? (svnserve:
--memory-cache-size=0; mod_dav_svn: SVNInMemoryCacheSize 0)
That's not clear enough. Could you show 'log -qv' of those revisions?
A wc-to-URL copy could touch just one or two files (compare
`svn log -qv --stop-on-copy http://svn.apache.org/repos/asf/subversion/tags/1.7.3`)
or a full tree (http://subversion.apache.org/faq.html#in-place-import).
Which is it?
>
> > - Could you try setting the maximum cache size to zero? (svnserve:
> > --memory-cache-size=0; mod_dav_svn: SVNInMemoryCacheSize 0)
> >
> >
> Apache is our server, so this is not applicable.
SVNInMemoryCacheSize is applicable.
Justin Johnson wrote on Wed, Feb 29, 2012 at 10:25:38 -0600:
> On Wed, Feb 29, 2012 at 10:15 AM, Daniel Shahaf <dani...@elego.de> wrote:
> > - Are the failing revisions always small (eg: just a URL-URL copy),That's not clear enough. Could you show 'log -qv' of those revisions?
> > or always large (eg: results of a merge)?
> >
> >
> As mentioned before, so far it appears to be 1) create a tag by copying an
> entire working copy of a branch to a URL, and 2) commit merge results for
> an entire branch.
>
A wc-to-URL copy could touch just one or two files (compare
`svn log -qv --stop-on-copy http://svn.apache.org/repos/asf/subversion/tags/1.7.3`)
or a full tree (http://subversion.apache.org/faq.html#in-place-import).
Which is it?
>SVNInMemoryCacheSize is applicable.
> > - Could you try setting the maximum cache size to zero? (svnserve:
> > --memory-cache-size=0; mod_dav_svn: SVNInMemoryCacheSize 0)
> >
> >
> Apache is our server, so this is not applicable.
Yes, thanks.
>
> > >
> > > > - Could you try setting the maximum cache size to zero? (svnserve:
> > > > --memory-cache-size=0; mod_dav_svn: SVNInMemoryCacheSize 0)
> > > >
> > > >
> > > Apache is our server, so this is not applicable.
> >
> > SVNInMemoryCacheSize is applicable.
> >
>
> Sorry, I missed that one. We have not specified SVNInMemoryCacheSize, so
> we're using the default.
... so please try SVNInMemoryCacheSize 0, and see if that makes the
issue less frequent.
The only known (and predicted) effect of the error is that some
revisions are wrongly skipped during a backward history walk --- such as
'log -r HEAD:0' does.
Creating an svnsync mirror is enough to fix the issue. (There is no
need to check the mirror, even: if the mirror is written to by a 1.7
server (or a 1.7 svnsync over file://), the normal validation that
Jason's error logs show will be done automatically.)
Hi Daniel
Not sure if I should be replying to this post or the latest post, but
here is where we are at currently.
I have had a developer here create a build of the latest SVN code
with your changes you mentioned in r1294470 for the svnadmin verify
command. We have run 'svnadmin verify' against every revision of our
hotcopy of our repository taken when we first brought this issue to
the forums and are now tracking down each of the revisions to see
what actions were being done at those times.
From the results, we see 25 error messages for predecessor count is
wrong and the first one appeared on January 26, 2011. Near that time
the following events occurred:
Jan. 14, 2011 - svn upgraded from 1.6.6 to 1.6.15
Jan. 14, 2011 - Apache HTTP server upgraded from 2.2.15 to 2.2.17
Jan. 21, 2011 - repository was pruned to delete some binary files.
Between January and our upgrade in Dec. to 1.7.1, we have had about
14,000 revisions and seen only 25 instances of this node revision
issue. During the times we had these errors, we were using svn
versions 1.6.15 and 1.6.16.
Fail2ban from what I could find does not look like it has a Windows
port which I currently have my production environment hosted on.
Thanks.
Jason
OK. So that means that either membuffer isn't the cause of the error,
or both membuffer and inprocess have (independent) bugs.
That said, I'm happy to see that it took you less than a day to report
that the bug no longer reproduces. Can you reproduce it at will?
---
I'm not completely sure what to try next. Were there any other
significant changes in 1.6.x->1.7.x besides caching? If so, the
next steps would be to disable _both_ inprocess and membuffer, or
to bisect.
Unfortunately, the latter requires building svn from source, and the
former requires either that, or having you upgrade to specially-built
binaries, or squeezing the "disable inprocess" feature into 1.7.4 (in
a way that doesn't break our compat policy too badly) and having you
upgrade to that.
devs/all --- thoughts? Is there a simpler way forward than the two
potential 'next step's I've outlined here?
Justin, what operating system does your server run?
Okay, that's great news, for two reasons:
1. It means building svn on windows isn't as painful as it used to be :)
2. It means I can ask you to build a custom server with the 'inprocess'
cache disabled, or (if all else fails) to bisect, per my previous email.
One of the things you could try is to disable caching: simply modify
the function create_cache() in libsvn_fs_fs/caching.c to always return
NULL in *CACHE_P. See below for another suggestion.
> command. We have run 'svnadmin verify' against every revision of our
> hotcopy of our repository taken when we first brought this issue to
> the forums and are now tracking down each of the revisions to see
> what actions were being done at those times.
>
Thanks! I do hope this work enables us to pinpoint and fix the bug.
> From the results, we see 25 error messages for predecessor count is
> wrong and the first one appeared on January 26, 2011. Near that time
> the following events occurred:
> Jan. 14, 2011 - svn upgraded from 1.6.6 to 1.6.15
> Jan. 14, 2011 - Apache HTTP server upgraded from 2.2.15 to 2.2.17
> Jan. 21, 2011 - repository was pruned to delete some binary files.
>
> Between January and our upgrade in Dec. to 1.7.1, we have had about
> 14,000 revisions and seen only 25 instances of this node revision
> issue. During the times we had these errors, we were using svn
> versions 1.6.15 and 1.6.16.
>
Thanks, very valuable information.
I've reviewed the 1.6.6->1.6.15 diff, and I have the following
suggestions:
- Change subversion/libsvn_fs_fs/fs.h such that
SVN_FS_FS__USE_LOCK_MUTEX is set to 1. It was set to 1 in 1.6.6
but to 0 in 1.6.15.
(This wouldn't explain why ASF saw it, but it might explain why you're
seeing it.)
> Fail2ban from what I could find does not look like it has a Windows
> port which I currently have my production environment hosted on.
>
Yeah, sorry. But you can write a cron job -- I mean, a Scheduled Task
-- that greps your error logs for "160004" every night and mails you it
it found anything, right?
That's the error code to watch for for many FS error conditions:
% ./tools/dev/which-error.py E160004
00160004 SVN_ERR_FS_CORRUPT
> Thanks.
>
> Jason
For convenience I'm attaching a patch that implements both of my
suggestions. Let us know please if it has any effect.
Cheers,
Daniel
... and what versions of httpd (Apache) and APR.
Justin Johnson wrote on Thu, Mar 01, 2012 at 07:45:08 -0600:
> On Wed, Feb 29, 2012 at 4:14 PM, Justin Johnson <justina...@gmail.com>wrote:
> > On Wed, Feb 29, 2012 at 11:22 AM, Daniel Shahaf <dani...@elego.de> wrote:
> >> ... so please try SVNInMemoryCacheSize 0, and see if that makes theOK. So that means that either membuffer isn't the cause of the error,
> >> issue less frequent.
> >>
> >
> > I'm a dork. I will do so once I take care of the appropriate change
> > control I have to deal with. Thanks.
> >
>
> We made the change and problem is still occurring.
or both membuffer and inprocess have (independent) bugs.
That said, I'm happy to see that it took you less than a day to report
that the bug no longer reproduces. Can you reproduce it at will?
Actually, it did take some work to get it going as we did not have
another system available to us and also did not have VC++ 6. We had
to use VS 2010 in order to do this. Also, for the other components
required (python,perl etc), the files after the install were copied
to the workstation to see if it would work as we did not want to
change the current workstation configuration by running the
installers. All in all, it did seem to work.
>
> 2. It means I can ask you to build a custom server with the 'inprocess'
> cache disabled, or (if all else fails) to bisect, per my previous email.
>
> One of the things you could try is to disable caching: simply modify
> the function create_cache() in libsvn_fs_fs/caching.c to always return
> NULL in *CACHE_P. See below for another suggestion.
>
>> command. We have run 'svnadmin verify' against every revision of our
>> hotcopy of our repository taken when we first brought this issue to
>> the forums and are now tracking down each of the revisions to see
>> what actions were being done at those times.
>>
>
> Thanks! I do hope this work enables us to pinpoint and fix the bug.
I will be going through the list to see what else was happening at the
same time on the apache server since it was alluded to that there may
be concurrency issues. I know the last two times that this error has
popped up, we had two svn operations starting at around the same time
according to the Apache logs. I will go through the previous apache
history to see if this was always the case or not.
I will look into it. We did ask developers to note any error messages
that they see from tortoisesvn now as the last time we saw the error
message pop up, we asked the developer what happened and he said that
an error message popped up and he just tried to check in again and it
worked. We will note the exact message next time.
>> Thanks.
>>
>> Jason
>
> For convenience I'm attaching a patch that implements both of my
> suggestions. Let us know please if it has any effect.
>
I will forward this to the developer to look at.
> Cheers,
>
> Daniel
>
Hi.
See replies above. I will post what we find.
Thanks.
Jason
Okay. The normal build requires just the *.exe and *.dll files to be
placed appropriately (such that the *.exe's and httpd's find their
libsvn_* DLL's at runtime) --- it doesn't require Administrator access,
for example.
To clarify, Perl is only required to build OpenSSL; it is not required
to build APR, Neon, or Subversion.
> >
> > 2. It means I can ask you to build a custom server with the 'inprocess'
> > cache disabled, or (if all else fails) to bisect, per my previous email.
> >
> > One of the things you could try is to disable caching: simply modify
> > the function create_cache() in libsvn_fs_fs/caching.c to always return
> > NULL in *CACHE_P. See below for another suggestion.
> >
> >> command. We have run 'svnadmin verify' against every revision of our
> >> hotcopy of our repository taken when we first brought this issue to
> >> the forums and are now tracking down each of the revisions to see
> >> what actions were being done at those times.
> >>
> >
> > Thanks! I do hope this work enables us to pinpoint and fix the bug.
>
> I will be going through the list to see what else was happening at the
> same time on the apache server since it was alluded to that there may
> be concurrency issues. I know the last two times that this error has
> popped up, we had two svn operations starting at around the same time
> according to the Apache logs. I will go through the previous apache
> history to see if this was always the case or not.
>
Thanks, looking forward to hear what you come up with.
FWIW, Justin's reply suggests that the error was seen on three different
platforms --- Windows, Solaris, and FreeBSD --- so that should narrow
down the range of possible explanations.
(I'll also note that at ASF's installation we are not running into new
instances of the bug.)
Hi Daniel.
I haven't gone through all the cases yet, but I have made progress
through quite a number of them and a pattern seems to be coming up.
I have attached 2 txt files. One shows the modified svnadmin verify
output from the binaries we built. The other shows the revisions and
what appears to have been occuring at the time of the bug. I figure
better to provide this now rather than delay any longer for the rest
of the results.
I will continue to go through the rest of the events and see if
there are other differences seen when the issue occurs. I hope
this information helps.
Thanks.
Jason
Hello.I have recently run into an issue with my subversion system (1.7.1)
where a specific component I am trying to build has failed. I have
had sucessful builds of this project before this issue happened since
we upgraded to 1.7.1. I am currently getting by this by using
TortoiseSVN 1.7.1 to do the check-in of the files left in the
workspace.This issue seems to be intermittent as it does not happen every time.
I am using subversion 1.7.1 windows binaries and subversion is
hosting on Apache 2.2.21.I have run "svnadmin verify" against the directory and it has come
back clean. I was wondering what type of events can cause this to
happen? Are there any resolutions?Here are some entries from the apache log for the last time this
issue happened.The Apache error log shows the following lines:
[Tue Jan 31 11:37:23 2012] [error] [client 9.31.13.109] Could not
MERGE resource "/repository/!svn/txn/61847-1bz5" into
"/repository/project/binaries/release/phase1/iteration/81/trunk
". [409, #0]
[Tue Jan 31 11:37:23 2012] [error] [client 9.31.13.109] An error
occurred while committing the transaction. [409, #160004]
[Tue Jan 31 11:37:23 2012] [error] [client 9.31.13.109] predecessor
count for the root node-revision is wrong: found 61815,
committing r61852 [409, #160004]
[Tue Jan 31 11:37:24 2012] [error] [client 9.31.13.109] Could not
fetch resource information. [404, #0]
[Tue Jan 31 11:37:24 2012] [error] [client 9.31.13.109] Named
transaction doesn't exist. [404, #0]The following are from the Apache access log at the same time:
9.31.13.109 - username [31/Jan/2012:11:37:22 -0800] "MERGE
/repository/project/binaries/release/phase1/iteration/81/trunk
HTTP/1.1" 409 281
9.31.13.109 - username [31/Jan/2012:11:37:24 -0800] "DELETE
/repository/!svn/txn/61847-1bz5 HTTP/1.1" 404 232Any help/comments would be appreciated. Thank you.
Jason Wong
Hi. Sorry for the delay --- was away from svn the last few days.
> I haven't gone through all the cases yet, but I have made progress
> through quite a number of them and a pattern seems to be coming up.
>
Is it safe to summarize your findings as: in every instance of the bug
(as determined by the new 'svnadmin verify' output), the victim revision
was started whilst (victim-1) was in progress?
That by itself is an everyday occurence, but I think it's nonetheless
a useful piece of information. I'll try and digest it further later
when I'm less sleepy (it's way past midnight here).
(As I understand ra_dav, the MERGE verb corresponds to the FS level
svn_fs_commit_txn(). Someone please correct me if I'm wrong.)
Thanks,
Daniel
Sure, we will try. What are your queries?
Right now we don't know what exactly causes the bug; we are still
investigating. When we make progress or find a fix, we will announce
that. The best place to watch for such announcements is by registering
to the issue tracker and adding yourself to the CC list on issue #4129
(link below):
http://subversion.tigris.org/issues/show_bug.cgi?id=4129
Please keep discussion of the bug on the mailing list, not on the
issue tracker.
Cheers,
Daniel
No problem. I have been really busy as well past couple weeks on other
tasks.
>> I haven't gone through all the cases yet, but I have made progress
>> through quite a number of them and a pattern seems to be coming up.
>>
>
> Is it safe to summarize your findings as: in every instance of the bug
> (as determined by the new 'svnadmin verify' output), the victim revision
> was started whilst (victim-1) was in progress?
>
From what is there so far, yes. We do have different operations
occurring at the same time, but for these ones, I see MERGE and DELETE
verbs overlapping in the same or near time intervals according to the
Apache logs. I just did a quick look in the Apache logs during a time
window where the bug wasn't triggered, and was able to see cases where
I have the following:
rev(x-1) merge
rev(x) merge
rev(x) delete
rev(x-1) delete
This seems fine.
The case that I had seen in my reported cases are as follows:
rev(x-1) merge
rev(x) merge
rev(x-1) delete
rev(x) delete
I would have to look more closely if needed to show that this case
always triggers the bug or not. We do not have a large userbase here
so it is harder to get a lot of overlapping hits for this type
of case.
I have tried forcing this sequence in the debugger, but that did not
reproduce the bug.
I was using trunk@1301511 mod_dav_svn, trunk client, and httpd-2.4.
I set a breakpoint in ra_neon's commit.c:commit_close_edit() and
ra_serf's commit.c:close_edit() before the DELETE of the activity. and
sent the requests in the following order: WC1-MERGE, WC2-MERGE,
WC1-DELETE, WC2-DELETE. I tried both serf and neon. I tried
mod_dav_svn of trunk@1301511 and 1.6.x@1301511. (Only httpv1 makes
sense, since in httpv2 DELETEs do not occur.)
I conclude that the specific MERGE/MERGE/DELETE/DELETE order, by itself,
is not sufficient to reproduce the bug.
I'd like to try and force the race at the other end --- force both httpd
threads to get as far as svn_fs_fs__commit_txn() or even svn_fs_fs__commit()
at the same time. I implemented that by adding a sleep(5+0) call in
fs_fs.c [1], and starting the two concurrent commits within a second of
each other. When I did that, the second commit took 10 seconds to
complete (I assume the loop in svn_fs_fs__commit() iterated twice), but
the bug still did not reproduce. In that case I used ra_serf and 1.6.x
server.
@@ -6011,1 +6013,2 @@ svn_fs_fs__commit(svn_revnum_t *new_rev_p,
+ if (getenv("SVN_")) sleep(5);
SVN_ERR(svn_fs_fs__with_write_lock(fs, commit_body, &cb, pool));
> I would have to look more closely if needed to show that this case
> always triggers the bug or not. We do not have a large userbase here
> so it is harder to get a lot of overlapping hits for this type
> of case.
>
Can you relate the time of the MERGE and DELETE operations (from the
access log) to the time of the commit?
The best way to determine the time of the commit is to look at the
modification time of the revision file (REPOS/db/revs/0/42) belonging
to it; this happens just before the commit is finalized.
In particular it would be interesting to see when rX's commit was done
relative to r(X-1)'s DELETE.
>
> > That by itself is an everyday occurence, but I think it's nonetheless
> > a useful piece of information. I'll try and digest it further later
> > when I'm less sleepy (it's way past midnight here).
> >
> > (As I understand ra_dav, the MERGE verb corresponds to the FS level
> > svn_fs_commit_txn(). Someone please correct me if I'm wrong.)
> >
> > Thanks,
> >
> > Daniel
> >
> >> I have attached 2 txt files. One shows the modified svnadmin verify
> >> output from the binaries we built. The other shows the revisions and
> >> what appears to have been occuring at the time of the bug. I figure
> >> better to provide this now rather than delay any longer for the rest
> >> of the results.
> >>
> >> I will continue to go through the rest of the events and see if
> >> there are other differences seen when the issue occurs. I hope
> >> this information helps.
> >>
> >> Thanks.
> >>
> >> Jason
> >
[1] Yes, this is rather crude. I couldn't do this with 'gdb -args httpd
-X' because I needed two concurrent httpd threads. In retrospect
I think there was an easier solution (use a threaded non-processed MPM,
or perhaps gdb the parent before it forked), but I only realized that
solution when I wrote this paragraph.
>> From what is there so far, yes. We do have different operations
>> occurring at the same time, but for these ones, I see MERGE and DELETE
>> verbs overlapping in the same or near time intervals according to the
>> Apache logs. I just did a quick look in the Apache logs during a time
>> window where the bug wasn't triggered, and was able to see cases where
>> I have the following:
>> rev(x-1) merge
>> rev(x) merge
>> rev(x) delete
>> rev(x-1) delete
>>
>> This seems fine.
>> The case that I had seen in my reported cases are as follows:
>> rev(x-1) merge
>> rev(x) merge
>> rev(x-1) delete
>> rev(x) delete
>>
>
> I have tried forcing this sequence in the debugger, but that did not
> reproduce the bug.
[...]
> I conclude that the specific MERGE/MERGE/DELETE/DELETE order, by itself,
> is not sufficient to reproduce the bug.
DELETE is extremely unlikely to be involved in this bug. mod_dav_svn
maintains a mapping of DAV activity ID to Subversion transaction name in
the dav/ subdirectory of a repository. The dav/ subdirectory is not
part of the Subversion filesystem. DELETE is used to remove the mapping
for a particular activity ID and to remove the transaction if it still
exists. Following a successful MERGE the transaction will have been
converted into a revision so DELETE has no effect on the the Subversion
filesystem. DELETE will only affect the Subversion filesystem if called
before a successful MERGE, and that will cause any subsequent MERGE for
the activity to fail.
--
uberSVN: Apache Subversion Made Easy
http://www.uberSVN.com
Okay, count me happy :-) I can reproduce this:
[[[
$SVN ps svn:mergeinfo '/branch/A:2' wc1/trunk/A
$SVN ps svn:mergeinfo '/branch/iota:2' wc2/trunk/iota
$SVN mkdir wc1/foo
$SVN mkdir wc2/bar
$svn ci -mm wc1 & $svn ci -mm wc2 & wait
]]]
Output:
[[[
Adding wc1/foo
Sending wc1/trunk/A
Adding wc2/bar
Sending wc2/trunk/iota
Committed revision 2.
subversion/svn/commit-cmd.c:183: (apr_err=160004)
subversion/libsvn_client/commit.c:876: (apr_err=160004)
subversion/libsvn_client/commit.c:876: (apr_err=160004)
svn: E160004: Commit failed (details follow):
subversion/libsvn_ra_serf/commit.c:2216: (apr_err=160004)
subversion/libsvn_ra_serf/commit.c:2216: (apr_err=160004)
subversion/libsvn_ra_serf/commit.c:2216: (apr_err=160004)
subversion/libsvn_ra_serf/util.c:774: (apr_err=160004)
subversion/libsvn_ra_serf/util.c:2087: (apr_err=160004)
subversion/libsvn_ra_serf/util.c:2087: (apr_err=160004)
subversion/libsvn_ra_serf/util.c:920: (apr_err=160004)
svn: E160004: predecessor count for the root node-revision is wrong: found (2+1 != 2), committing r3
]]]
Error log:
[[[
[Mon Mar 19 14:19:41.388413 2012] [dav:error] [pid 17156:tid 3064073072] [client ::1:40969] Could not MERGE resource "/t/r1/!svn/txn/1-2" into "/t/r1". [409, #0]
[Mon Mar 19 14:19:41.388497 2012] [dav:error] [pid 17156:tid 3064073072] [client ::1:40969] An error occurred while committing the transaction. [409, #160004]
[Mon Mar 19 14:19:41.388518 2012] [dav:error] [pid 17156:tid 3064073072] [client ::1:40969] predecessor count for the root node-revision is wrong: found (2+1 != 2), committing r3 [409, #160004]
[Mon Mar 19 14:19:42.024052 2012] [authz_core:debug] [pid 17156:tid 3055680368] mod_authz_core.c(783): [client ::1:40971] AH01626: authorization result of Require all granted: granted
[Mon Mar 19 14:19:42.024114 2012] [authz_core:debug] [pid 17156:tid 3055680368] mod_authz_core.c(783): [client ::1:40971] AH01626: authorization result of <RequireAny>: granted
[Mon Mar 19 14:19:42.024781 2012] [dav:error] [pid 17156:tid 3055680368] [client ::1:40971] Could not fetch resource information. [404, #0]
[Mon Mar 19 14:19:42.024820 2012] [dav:error] [pid 17156:tid 3055680368] [client ::1:40971] Named transaction doesn't exist. [404, #0]
]]]
Using another property, or omitting either propset, is enough to cause
the bug not to trigger.
The bug reproduced both with and without the following patch:
[[[
--- subversion/libsvn_fs_fs/tree.c (revision 1301511)
+++ subversion/libsvn_fs_fs/tree.c (working copy)
@@ -1 +1,2 @@
+#include <unistd.h>
/* tree.c : tree-like filesystem, built on DAG filesystem
@@ -1701 +1702,2 @@ svn_fs_fs__commit_txn(const char **conflict_p,
+ sleep(3);
err = svn_fs_fs__commit(new_rev, fs, txn, iterpool);
@@ -1729,3 +1731,4 @@ svn_fs_fs__commit_txn(const char **conflict_p,
cleanup:
+ sleep(3);
svn_fs_fs__reset_txn_caches(fs);
]]]
The bug reproduced with either "ServerLimit 1" or "ThreadLimit 1" in
httpd.conf. (That forced both commits to be served by the same process
(resp., by different processes).) I use httpd 2.4.1 with event MPM.
> The bug reproduced with either "ServerLimit 1" or "ThreadLimit 1" in
> httpd.conf. (That forced both commits to be served by the same process
> (resp., by different processes).) I use httpd 2.4.1 with event MPM.
I can reproduce ove ra_local:
svnadmin create repo
svn mkdir -mm file://`pwd`/repo/A
svn mkdir -mm file://`pwd`/repo/B
svn co file://`pwd`/repo wc1
svn co file://`pwd`/repo wc2
svn ps svn:mergeinfo /P:2 wc1/A
svn ps svn:mergeinfo /Q:2 wc2/B
svn mkdir wc1/X
svn mkdir wc2/Y
svn ci -mm wc1 & svn ci -mm wc2 & wait
Gives:
Sending wc1/A
Adding wc1/X
Sending wc2/B
Adding wc2/Y
Committed revision 3.
svn: E160004: Commit failed (details follow):
svn: E160004: predecessor count for the root node-revision is wrong: found 3, committing r4
Normally concurrent commits are supported. The bug under discussion has
to do with a flaw in the mechanism that causes concurrent commits to
succeed.
> I noticed a similar problem also, which I think might be related to the
> 'Predescessor' bug. Simultaneous SVN Updates from the same repository by
> different users fails silently (throws no error, but update doesn't
> happen). I think, probably an implicit lock is being put on the repository,
> preventing simultaneous operations.Please do let us know if some solution
> exists.
>
In FSFS there are no read locks, and different readers don't interfere
with each other or with others. You'll have to describe what you see in
more detail. (Exactly what you did, what you saw, and how it differed
from what you expected to see)
Thanks
Daniel
> Thanks
> Subu
> I can reproduce ove ra_local:
>
> svnadmin create repo
> svn mkdir -mm file://`pwd`/repo/A
> svn mkdir -mm file://`pwd`/repo/B
> svn co file://`pwd`/repo wc1
> svn co file://`pwd`/repo wc2
> svn ps svn:mergeinfo /P:2 wc1/A
> svn ps svn:mergeinfo /Q:2 wc2/B
> svn mkdir wc1/X
> svn mkdir wc2/Y
> svn ci -mm wc1 & svn ci -mm wc2 & wait
>
> Gives:
>
> Sending wc1/A
> Adding wc1/X
> Sending wc2/B
> Adding wc2/Y
>
> Committed revision 3.
> svn: E160004: Commit failed (details follow):
> svn: E160004: predecessor count for the root node-revision is wrong: found 3, committing r4
This is the problem code in libsvn_fs_fs/tree.c:merge
SVN_ERR(svn_fs_fs__dag_get_predecessor_count(&pred_count, source, pool));
SVN_ERR(update_ancestry(fs, source_id, target_id, target_path,
pred_count, pool));
if (svn_fs_fs__fs_supports_mergeinfo(fs))
SVN_ERR(svn_fs_fs__dag_increment_mergeinfo_count(target,
mergeinfo_increment,
pool));
target is dag_node_t* which is opaque outside dag.c and
target->node_revision->predecessor_count is 3 when we reach the above
code.
update_ancestry rewrites the file repo/db/transactions/2-2.txn/node.0.0
with the correctly updated value "count: 4" but nothing updates
target->node_revision->predecessor_count in memory.
svn_fs_fs__dag_increment_mergeinfo_count then rewrites
repo/db/transactions/2-2.txn/node.0.0
pulling the old value of target->node_revision->predecessor_count from
memory and putting "count: 3" back in the file.
If I use the debugger to manually set target->node_revision to NULL
inside svn_fs_fs__dag_increment_mergeinfo_count then the commit works.
I'm not exactly sure how all the FSFS caching layers are supposed to
interact. Is tree.c:update_ancestry supposed to update the in-memory
predecessor_count? Should there be a svn_fs_fs__dag_xxx function to
change the predecessor count? Should target->node_revision be set to
NULL soemwehere? Something else?
Just a note that another option here is for merge() to re-fetch
'dag_node_t *target' via dag.h APIs, to get another struct with a fresh
(actually: not yet populated) cache ->noderev member.
Is this problem specific to the FSFS backend?
--
C. Michael Pilato <cmpi...@collab.net>
CollabNet <> www.collab.net <> Distributed Development On Demand
No.
% ../runpytest svnadmin mergeinfo_race --fs-type bdb
2012-03-19 20:21:44 [WARNING] CWD: /home/daniel/src/svn/t1/subversion/tests/cmdline
2012-03-19 20:21:44 [WARNING] EXCEPTION: Failure: one or both commits failed
XFAIL: svnadmin_tests.py 29: concurrent mergeinfo commits invalidate pred-count
> On 03/19/2012 01:25 PM, Philip Martin wrote:
>> Philip Martin <philip...@wandisco.com> writes:
>>
>>> I can reproduce ove ra_local:
>>>
>>> svnadmin create repo
>>> svn mkdir -mm file://`pwd`/repo/A
>>> svn mkdir -mm file://`pwd`/repo/B
>>> svn co file://`pwd`/repo wc1
>>> svn co file://`pwd`/repo wc2
>>> svn ps svn:mergeinfo /P:2 wc1/A
>>> svn ps svn:mergeinfo /Q:2 wc2/B
>>> svn mkdir wc1/X
>>> svn mkdir wc2/Y
>>> svn ci -mm wc1 & svn ci -mm wc2 & wait
>>>
>>> Gives:
>>>
>>> Sending wc1/A
>>> Adding wc1/X
>>> Sending wc2/B
>>> Adding wc2/Y
>>>
>>> Committed revision 3.
>>> svn: E160004: Commit failed (details follow):
>>> svn: E160004: predecessor count for the root node-revision is wrong: found 3, committing r4
>
> Is this problem specific to the FSFS backend?
Yes, I think it is.
For BDB the dag_node_t type in dag.c doesn't have a node_revision
member. When update_ancestry does svn_fs_bdb__put_node_revision it
writes to the database and subsequent svn_fs_bdb__get_node_revision will
see the updated value.
For FSFS the svn_fs_fs__put_node_revision call writes to the
transactions subdir but the in-memory node_revision doesn't get changed.
I think that's a failure of the regression test, after the XFAIL the
repository looks like:
$ svn log -vqr3:2 file://`pwd`/svn-test-work/repositories/svnadmin_tests-29/
------------------------------------------------------------------------
r3 | jrandom | 2012-03-19 18:28:53 +0000 (Mon, 19 Mar 2012)
Changed paths:
M /A
A /d1
------------------------------------------------------------------------
r2 | jrandom | 2012-03-19 18:28:53 +0000 (Mon, 19 Mar 2012)
Changed paths:
A /d2
M /iota
------------------------------------------------------------------------
Also the test still XFAILs for FSFS with my FSFS patch to fix the
problem.
Gotcha. Thanks for the response.
I remember when the BDB code had an in-memory node-revision structure and
tried to maintain that using the trail construct. I remember also the day I
purged the code of that mess. :-)
Right as usual, implemented your fix in r1302591.
> If I use the debugger to manually set target->node_revision to NULL
> inside svn_fs_fs__dag_increment_mergeinfo_count then the commit works.
> I'm not exactly sure how all the FSFS caching layers are supposed to
> interact. Is tree.c:update_ancestry supposed to update the in-memory
> predecessor_count? Should there be a svn_fs_fs__dag_xxx function to
> change the predecessor count? Should target->node_revision be set to
> NULL soemwehere? Something else?
Moving update_ancestry from tree.c to dag.c is one way to fix the
problem. Daniel also suggested removing the node_revision member of
dag_node_t altogether and relying on new 1.7 caching to give us the
performance. I suppose we would still need a patch like this for 1.6.
Index: ../src/subversion/libsvn_fs_fs/dag.c
===================================================================
--- ../src/subversion/libsvn_fs_fs/dag.c (revision 1302591)
+++ ../src/subversion/libsvn_fs_fs/dag.c (working copy)
@@ -1296,3 +1296,27 @@
return SVN_NO_ERROR;
}
+
+svn_error_t *
+svn_fs_fs__dag_update_ancestry(dag_node_t *target,
+ dag_node_t *source,
+ apr_pool_t *pool)
+{
+ node_revision_t *source_noderev, *target_noderev;
+
+ if (! svn_fs_fs__dag_check_mutable(target))
+ return svn_error_createf
+ (SVN_ERR_FS_NOT_MUTABLE, NULL,
+ _("Attempted to update ancestry of non-mutable node"));
+
+ SVN_ERR(get_node_revision(&source_noderev, source, pool));
+ SVN_ERR(get_node_revision(&target_noderev, target, pool));
+
+ target_noderev->predecessor_id = source->id;
+ target_noderev->predecessor_count = source_noderev->predecessor_count;
+ if (target_noderev->predecessor_count != -1)
+ target_noderev->predecessor_count++;
+
+ return svn_fs_fs__put_node_revision(target->fs, target->id, target_noderev,
+ FALSE, pool);
+}
Index: ../src/subversion/libsvn_fs_fs/tree.c
===================================================================
--- ../src/subversion/libsvn_fs_fs/tree.c (revision 1302591)
+++ ../src/subversion/libsvn_fs_fs/tree.c (working copy)
@@ -1142,32 +1142,6 @@
}
-/* Teach node-revision TARGET_ID that node-revision SOURCE_ID is its
- predecessor. TARGET_PATH is used for error messages only. */
-static svn_error_t *
-update_ancestry(svn_fs_t *fs,
- const svn_fs_id_t *source_id,
- const svn_fs_id_t *target_id,
- const char *target_path,
- int source_pred_count,
- apr_pool_t *pool)
-{
- node_revision_t *noderev;
-
- if (svn_fs_fs__id_txn_id(target_id) == NULL)
- return svn_error_createf
- (SVN_ERR_FS_NOT_MUTABLE, NULL,
- _("Unexpected immutable node at '%s'"), target_path);
-
- SVN_ERR(svn_fs_fs__get_node_revision(&noderev, fs, target_id, pool));
- noderev->predecessor_id = source_id;
- noderev->predecessor_count = source_pred_count;
- if (noderev->predecessor_count != -1)
- noderev->predecessor_count++;
- return svn_fs_fs__put_node_revision(fs, target_id, noderev, FALSE, pool);
-}
-
-
/* Set the contents of CONFLICT_PATH to PATH, and return an
SVN_ERR_FS_CONFLICT error that indicates that there was a conflict
at PATH. Perform all allocations in POOL (except the allocation of
@@ -1219,7 +1193,6 @@
apr_hash_index_t *hi;
svn_fs_t *fs;
apr_pool_t *iterpool;
- int pred_count;
apr_int64_t mergeinfo_increment = 0;
/* Make sure everyone comes from the same filesystem. */
@@ -1543,9 +1516,7 @@
}
svn_pool_destroy(iterpool);
- SVN_ERR(svn_fs_fs__dag_get_predecessor_count(&pred_count, source, pool));
- SVN_ERR(update_ancestry(fs, source_id, target_id, target_path,
- pred_count, pool));
+ SVN_ERR(svn_fs_fs__dag_update_ancestry(target, source, pool));
if (svn_fs_fs__fs_supports_mergeinfo(fs))
SVN_ERR(svn_fs_fs__dag_increment_mergeinfo_count(target,
Index: ../src/subversion/libsvn_fs_fs/dag.h
===================================================================
--- ../src/subversion/libsvn_fs_fs/dag.h (revision 1302591)
+++ ../src/subversion/libsvn_fs_fs/dag.h (working copy)
@@ -603,6 +603,10 @@
dag_node_t *node,
apr_pool_t *pool);
+svn_error_t *
+svn_fs_fs__dag_update_ancestry(dag_node_t *target,
+ dag_node_t *source,
+ apr_pool_t *pool);
#ifdef __cplusplus
}
#endif /* __cplusplus */
++1
I have been following the thread: "#4129 is reproducible Re:
predecessor count for the root node-revision is wrong message".
It looks like you all have it figured out now. Good job.
Do you need any more information from me at this point? Thanks.
Jason Wong.
Thanks Jason. It would be useful if you could confirm that you do not
run into the error after rebuilding the server with r1302399 and
r1302613 applied. (If you run the test suite, apply r1302539 and
r1302591 too.) These revisions constitute the fix which is nominated
for inclusion in 1.6.18 and 1.7.5; see ^/subversion/branches/1.7.x/STATUS.
Cheers,
Daniel
> Jason Wong.
Hi Daniel.
The developer who built the svn client is away and will probably not be back
until later this week. What is your ETA for 1.7.5? Just wondering if that would
released before the developer I have is back.
Thanks.
Jason
> Cheers,
>
> Daniel
>
>> Jason Wong.
The time until 1.7.5 is counted in weeks, and 1.6.18 is scheduled to be
released next week.
The fix was merged to 1.6.x@HEAD today and barring surprises will be
included in 1.6.18.
I've learnt yesterday something new about the minfo-cnt corruption bug:
it can manifest not only as absurdly high values (on the order of 2**70),
but as far smaller wrong increments too (such as increment of 172
instead of of 0 on one occasion).
Could you determine whether said bug has occurred in your history? You
can do that by duplicating your repository using svnsync or dump|load,
running dump-noderev.pl on / of both copies at the same revisions, and
comparing the minfo-cnt values.
I would be interested in knowing whether they are equal between the two
copies.
Thanks,
Daniel
Jason Wong wrote on Thu, Feb 16, 2012 at 11:42:42 -0800:
> > ./dump-noderev.pl /repository / 61851
> ------------------------------
>
> id: 0.0.r61851/33470
> type: dir
> pred: 0.0.r61850/3844
> count: 61818
> text: 61851 32225 1232 1232 7555349571e297c23e647cc2441d5b8f
> cpath: /
> copyroot: 0 /
> minfo-cnt: 25685
> ------------------------------
Server-side.
> r1302613 but am not familiar with the code. Also, will I need to do
> something with my repositories to fix any corruption?
>
As noted on the issue, either create an svnsync mirror of them, or
dump|load them, to fix the corruption. (svnsync is a client-side tool.)
> Thanks.
> Justin
I will give it a go and let you know what I find.
Jason
> Moving update_ancestry from tree.c to dag.c is one way to fix the
> problem.
This was applied in r1302613. I believe this also fixes the minfo-cnt
corruption that has been observed. To reproduce apply the following
patch to the old client to allow commit to be paused:
Index: ../src/subversion/libsvn_fs_fs/tree.c
===================================================================
--- ../src/subversion/libsvn_fs_fs/tree.c (revision 1302612)
+++ ../src/subversion/libsvn_fs_fs/tree.c (working copy)
@@ -1694,6 +1694,7 @@
any future merges will only be between that node and whatever
the root node of the youngest rev is by then. */
err = merge_changes(NULL, youngish_root_node, txn, conflict, iterpool);
+ { char buf[256]; fputs("waiting...", stdout);fgets(buf, 255, stdin); }
if (err)
{
if ((err->apr_err == SVN_ERR_FS_CONFLICT) && conflict_p)
Now the scenario:
svnadmin create repo
svn mkdir -mm file://`pwd`/repo/{A,B,C}
svn co file://`pwd`/repo wc1
svn co file://`pwd`/repo wc2
svn co file://`pwd`/repo wc3
svn ps svn:mergeinfo /P:2 wc1/A
svn ps svn:mergeinfo /Q:2 wc2/B
svn ps svn:mergeinfo /R:2 wc3/C
Now commit wc1 using the patched client running under valgrind. At the
first "waiting..." prompt commit wc2 using a standard client. Continue
the first commit to get a second "waiting..." prompt and commit wc3
using a standard client. Continue the first commit and valgrind reports
errors such as:
==23311== Conditional jump or move depends on uninitialised value(s)
==23311== at 0x98E0E5A: svn_fs_fs__dag_increment_mergeinfo_count (dag.c:544)
==23311== by 0x9906847: merge (tree.c:1551)
==23311== by 0x9906A18: merge_changes (tree.c:1599)
==23311== by 0x9906C06: svn_fs_fs__commit_txn (tree.c:1696)
==23311== by 0x6C8C6D9: svn_fs_commit_txn (fs-loader.c:646)
==23311== by 0x6A5B12C: svn_repos_fs_commit_txn (fs-wrap.c:59)
==23311== by 0x6A51D0E: close_edit (commit.c:693)
==23311== by 0x4E4A34F: svn_client__do_commit (commit_util.c:1898)
==23311== by 0x4E45723: svn_client_commit6 (commit.c:1689)
==23311== by 0x4097D6: svn_cl__commit (commit-cmd.c:169)
==23311== by 0x416FA1: main (main.c:2699)
==23311== Syscall param write(buf) points to uninitialised byte(s)
==23311== at 0x62CC0D0: __write_nocancel (syscall-template.S:82)
==23311== by 0x5E7D4FC: apr_file_flush_locked (readwrite.c:317)
==23311== by 0x5E7D7AF: apr_file_flush (readwrite.c:340)
==23311== by 0x5E7CFA4: apr_unix_file_cleanup (open.c:77)
==23311== by 0x59FADAF: svn_io_file_close (io.c:3092)
==23311== by 0x98E9083: svn_fs_fs__put_node_revision (fs_fs.c:2318)
==23311== by 0x98E1039: svn_fs_fs__dag_increment_mergeinfo_count (dag.c:575)
==23311== by 0x9906847: merge (tree.c:1551)
==23311== by 0x9906A18: merge_changes (tree.c:1599)
==23311== by 0x9906C06: svn_fs_fs__commit_txn (tree.c:1696)
==23311== by 0x6C8C6D9: svn_fs_commit_txn (fs-loader.c:646)
==23311== by 0x6A5B12C: svn_repos_fs_commit_txn (fs-wrap.c:59)
==23311== Address 0xb2ee1dc is 92 bytes inside a block of size 4,096 alloc'd
==23311== at 0x4C244E8: malloc (vg_replace_malloc.c:236)
==23311== by 0x5E806D0: pool_alloc (apr_pools.c:1463)
==23311== by 0x5E7CDFB: apr_file_open (open.c:211)
==23311== by 0x59F5194: file_open (io.c:280)
==23311== by 0x59FABF6: svn_io_file_open (io.c:3049)
==23311== by 0x98E8FC4: svn_fs_fs__put_node_revision (fs_fs.c:2308)
==23311== by 0x98E1039: svn_fs_fs__dag_increment_mergeinfo_count (dag.c:575)
==23311== by 0x9906847: merge (tree.c:1551)
==23311== by 0x9906A18: merge_changes (tree.c:1599)
==23311== by 0x9906C06: svn_fs_fs__commit_txn (tree.c:1696)
==23311== by 0x6C8C6D9: svn_fs_commit_txn (fs-loader.c:646)
==23311== by 0x6A5B12C: svn_repos_fs_commit_txn (fs-wrap.c:59)
The commit fails because the count: is wrong, but if I disable that
check to allow the commit to complete then I see a bogus minfo-cnt such
as:
id: 0.0.r7/322
type: dir
pred: 0.0.r4/180
count: 5
text: 7 198 111 0 242bff40060f22bcf85959dcf552851a
cpath: /
copyroot: 0 /
minfo-cnt: 43
Using the current trunk with the issue 4129 fix applied this no longer
happens: I don't get the valgrind warnings and the minfo-cnt is correct.
Hello Daniel.
The svnsync took a while to run once I got it going. I ran the command
on the hotcopy I had made originally to keep the results consistant.
I have run the following two commands:
dump-noderev.pl /repo /
dump-noderev.pl /mirror2 /
Here are the outputs from the commands:
dump-noderev.pl /repo /
---------------------
id: 0.0.r62104/28771
type: dir
pred: 0.0.r62103/28680
count: 62071
text: 62104 27520 1238 1238 ea635421e867454f9f7bc503c8160a2c
cpath: /
copyroot: 0 /
minfo-cnt: 25707
---------------------
dump-noderev.pl /mirror2 /
-----------------------
id: 0.0.r62104/6122
type: dir
pred: 0.0.r62103/6039
count: 62104
text: 62104 4874 1235 1235 1f315ed2437ba5d70dba2587d9ef2d5a
cpath: /
copyroot: 0 /
minfo-cnt: 25707
-----------------------
Is this in line with what you expected?
Jason Wong.
It's in line with my expectations, insofar as on the mirror the 'count'
is correct.
It also indicates that you weren't bitten by the minfo-cnt part of this
bug. As you know from the dev@ thread, Philip identified that part and
fixed it too -- after my above email.
Thanks again for your help in chasing down this bug. It was backported
today towards 1.7.5 too.
Cheers,
Daniel
Hi Daniel.
No problem. I am glad the issues are fixed. Thank you for all your help
and patience with my slow replies. It has been a busy couple of months
for me in trying find the time to do these tests.
So for correcting the "count" information in our live repository, I
should run svnsync on it at some point? Is there anything I need to do
after running that command in order to have it not link to the original?
Thanks.
Jason Wong
Welcome.
> So for correcting the "count" information in our live repository, I
> should run svnsync on it at some point?
You already did, as you have a mirror. (Maybe you created it via
dump/load.) Now you just need to swap the mirror for the original
repository:
- stop commits
- svnsync sync URL/to/mirror2 URL/to/repo
(or svnadmin dump -r (YOUNGEST ON MIRROR2):HEAD --deltas --incremental repo \
| svnadmin load mirror2)
- rename repo repo.i4129-victim && rename mirror2 repo
- enable commits
> Is there anything I need to do
> after running that command in order to have it not link to the original?
You can optionally remove the svn:sync-* revprops from r0.
>
> Thanks.
>
>
> Jason Wong