Triage recovery of damaged Subversion repo


Michael K

Oct 28, 2022, 6:26:03 PM
to us...@subversion.apache.org
Hi, I could really use some direction, advice, etc. I came across some past messages that made me think some might be able to help with this. I'd appreciate any input!

I am working on an important Subversion repository that was hit by a targeted ransomware attack. Apparently the backups were deleted securely as well, though there is a backup from a few years back that was unaffected in different storage. In brief, the ransomware encrypted and overwrote (up to) the first 4 KB of data and also added some encrypted data and zero-padding to the end of every file. Since Subversion has many small files, the data has been slashed up badly and some is gone forever. But files larger than 4 KB have original data remaining.

My goal is to build a working repository with as much of the original data that is remaining as I can, like a triage operation. I have a backup that was not affected, but it does not contain the last few years of data. I need to utilize the data that is affected by ransomware encryption.

Eventually I plan to write a program that will work over all the affected revs and revprops files required and output new files. I'm coming at this without previous knowledge of the inner workings of Subversion, but I am comfortable working in a hex editor and writing programs that process raw data. So for now, I have been learning about Subversion from reading the documentation and while working hands-on with the raw data of these files in a hex editor. I've learned a bit about the "representations" within the revs files. That will probably be helpful since those provide units that each revs file can be broken down into. I can use that knowledge to try keeping full "representations" and discard partial ones.

Currently, I am trying to add a single new empty revision that Subversion will accept after testing with the "svnadmin verify" and "svn info" commands. I fabricated data for a revprops file on this new revision, I adjusted the "current" file to the new revision number, and I'm working on the revs file. If I can achieve that, I'll move on to adding a new revision that contains some original data.

I've learned about the footer of the revs files as I've come across errors when trying those commands. I know how the L2P_OFFSET and P2L_OFFSET work and I have remedied the errors when those offsets are incorrect. I also discovered some kind of item indexes from logical addressing (I think, not sure what they are called) which occur right after both "L2P_OFFSET" and "P2L_OFFSET" in the revs files. By looking at many files, I figured out how to calculate the binary representation for that based on the rev number (strange calculation). That got me past the error such as - "svn: E160054: Index rev / pack file revision numbers do not match" - from the svn info command.

And now I'm trying to get past the "L2P index checksum mismatch" error. I don't know yet how the "actual" checksum value is calculated. Thankfully Subversion's error message shows both the "expected" and "actual" checksums. So I've tried taking an MD5 hash on byte ranges of the L2P-INDEX area (and variations), but haven't gotten a match to that "actual" value yet.

If you could provide insight to where these 2 checksums come from, I'd be really grateful. Also, any other general thoughts on this project would be appreciated.

Michael

Daniel Shahaf

Nov 11, 2022, 9:09:30 AM
to Michael K, us...@subversion.apache.org
Michael K wrote on Fri, Oct 28, 2022 at 17:25:19 -0500:
> I am working on an important Subversion repository that was hit by a
> targeted ransomware attack. Apparently the backups were deleted securely as
> well, though there is a backup from a few years back that was unaffected in
> different storage. In brief, the ransomware encrypted and overwrote (up to)
> the first 4 KB of data and also added some encrypted data and zero-padding
> to the end of every file. Since Subversion has many small files, the data
> has been slashed up badly and some is gone forever. But files larger than 4
> KB have original data remaining.
>
> My goal is to build a working repository with as much of the original data
> that is remaining as I can, like a triage operation. I have a backup that
> was not affected, but it does not contain the last few years of data. I
> need to utilize the data that is affected by ransomware encryption.
>
> Eventually I plan to write a program that will work over all the affected
> revs and revprops files required and output new files. I'm coming at this
> without previous knowledge of the inner workings of Subversion, but I am
> comfortable working in a hex editor and writing programs that process raw
> data. So for now, I have been learning about Subversion from reading the
> documentation and while working hands-on with the raw data of these files
> in a hex editor. I've learned a bit about the "representations" within the
> revs files. That will probably be helpful since those provide units that
> each revs file can be broken down into. I can use that knowledge to try
> keeping full "representations" and discard partial ones.
>

Yes, rev files have quite a bit of internal structure: reps, node-rev
headers, changed-paths, P2L/L2P, final line. These are generally easy
to parse out of surrounding contexts (revprop files use counted-length
strings, reps have their header and "ENDREP" trailer, L2P-INDEX and
P2L-INDEX know their own length and have ASCII before and after them,
and everything else is ASCII in specific formats).

Similarly, it should be easy to recognize where the appended cryptogram
and padding start, since the part from L2P-INDEX to the last line is
distinctive and self-checksummed.
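
One way to locate that boundary programmatically is to search for the footer line itself. The following is only a sketch, assuming the footer is the ASCII text "<L2P offset> <MD5> <P2L offset> <MD5>" followed by a single byte holding the length of that text (my reading of the structure notes; please confirm against an undamaged rev file). The names `FOOTER_RE` and `find_footer` are made up for illustration.

```python
import re

# ASSUMPTION: footer text is "<l2p_offset> <md5> <p2l_offset> <md5>",
# followed by one terminal byte whose value is the length of that text.
FOOTER_RE = re.compile(rb"(\d+) ([0-9a-f]{32}) (\d+) ([0-9a-f]{32})")


def find_footer(data: bytes):
    """Return (original_length, l2p_offset, p2l_offset), or None if not found."""
    for m in FOOTER_RE.finditer(data):
        end = m.end()
        if end >= len(data):
            continue  # no room for the terminal byte
        start = end - data[end]  # terminal byte gives the footer text length
        if start < 0:
            continue
        full = FOOTER_RE.fullmatch(data, start, end)
        if full is None:
            continue
        l2p, p2l = int(full.group(1)), int(full.group(3))
        # Both offsets should point at the index magic strings.
        if (l2p < p2l <= start
                and data.startswith(b"L2P-INDEX\n", l2p)
                and data.startswith(b"P2L-INDEX\n", p2l)):
            return end + 1, l2p, p2l
    return None
```

Everything past the returned `original_length` would then be the appended cryptogram and padding. A rev file whose indexes fall inside the overwritten first 4 KB would fail the magic-string check and return `None`.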

I don't know by heart what elements will be serialized into the first
4KB of a rev file in logical addressing mode. (By the way, it's worth
looking up in the implementation what physical order it writes the items
to the file in. Chances are this wasn't left to chance.) What you
might find there is:

- File reps.

A rep is a compressed [see fsfs.conf, no relation to "self-compressed"
in the sense of having no base] svndelta [see notes/svndiff], whose
base, if there is one, might or might not be the preceding revision of
the node [see notes/skip-deltas and fsfs.conf] [note: this means it's
possible for rN+M of a file to be recoverable even if rN's rep is lost].

In principle, you can even dive down this rabbit hole of abstractions to
recover data from the surviving tail ends of partially-overwritten reps.

- Dir reps. These are like file reps but the content of the file is
an svn_hash_write2() hash mapping basenames to node-rev id's. IIRC,
the hashes are dumped in sorted order and the node-rev id's are also
fairly predictable, and in any case they are repeated in the node-rev
headers of the directory entries. It might even be possible to
reconstruct an overwritten dir rep from the remainder of the rev file.

- Node-rev headers. Parts of these are predictable (e.g., the "pred:"
value), or can be regenerated (e.g., the checksums), or inferred from
other parts of the rev file (e.g., "type: dir" can easily be guessed
if you still have the rep itself).

- Changed-paths. That's just an index/cache, IIRC, of information
derivable from the remainder of the file.
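
For reference, the counted-length hash format mentioned above (used by revprop files, and by dir reps as described here) is simple enough to read and write by hand. A minimal sketch, assuming the usual "K <len>", key, "V <len>", value, "END" layout; `dump_hash` and `parse_hash` are illustrative names, not Subversion APIs.

```python
def dump_hash(props: dict) -> bytes:
    """Serialize {bytes: bytes} as K/V records in sorted key order, then END."""
    out = bytearray()
    for key in sorted(props):
        val = props[key]
        out += b"K %d\n%s\nV %d\n%s\n" % (len(key), key, len(val), val)
    out += b"END\n"
    return bytes(out)


def parse_hash(data: bytes) -> dict:
    """Parse the format written by dump_hash; raises ValueError on damage."""
    props = {}
    pos = 0

    def read_counted(tag: bytes) -> bytes:
        nonlocal pos
        eol = data.index(b"\n", pos)  # ValueError if the data is truncated
        head = data[pos:eol]
        if not head.startswith(tag + b" "):
            raise ValueError("expected %r record at offset %d" % (tag, pos))
        n = int(head[2:])
        chunk = data[eol + 1:eol + 1 + n]
        if len(chunk) != n or data[eol + 1 + n:eol + 2 + n] != b"\n":
            raise ValueError("truncated record at offset %d" % pos)
        pos = eol + 2 + n
        return chunk

    while not data.startswith(b"END\n", pos):
        key = read_counted(b"K")
        props[key] = read_counted(b"V")
    return props
```

For example, `dump_hash({b"svn:log": b"hi"})` yields `b"K 7\nsvn:log\nV 2\nhi\nEND\n"`. Because the lengths are explicit, a parser can tell a complete record from one that was cut off by the overwritten region.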

> Currently, I am trying to add a single new empty revision that Subversion
> will accept after testing with the "svnadmin verify" and "svn info"
> commands. I fabricated data for a revprops file on this new revision, I
> adjusted the "current" file to the new revision number, and I'm working on
> the revs file. If I can achieve that, I'll move on to adding a new revision
> that contains some original data.
>

I assume you mean this:

[[[
echo Hello world > foo
svnadmin create r
svnmucc -U file://$(pwd)/r -mm put foo iota # 'svn import' would do the trick too
xxd < r/db/revs/1
]]]

Why would you need to /manually/ create a rev file with original data?
You can use 'svn commit' to create rev files (on top of the old, good
backup). I'd have thought you'd focus on trying to extract data from
the partially-corrupted rev files (e.g., reconstruct the fulltexts of
reps where it's possible to do so).

Anyway, regarding creating rev files:

The rev files you get by default have bells and whistles turned on. For
instance, they use DELTA and self-DELTA reps even though it's a lot
easier to fabricate a PLAIN rep, and you can use PLAIN anywhere you can
use DELTA.

For this reason, I'd recommend to try to create a 1.1-era rev file
first. Pass «--compatible-version=1.1 --fs-type=fsfs» to «svnadmin
create» above. (Subversion 1.1's FSFS is the oldest FSFS there is; see
`svnadmin info`.)

Word of warning: when you test things, do NOT test with the r0 rev file.
The C code hard-codes the assumption that r0 is empty.

> I've learned about the footer of the revs files as I've come across errors
> when trying those commands. I know how the L2P_OFFSET and P2L_OFFSET work
> and I have remedied the errors when those offsets are incorrect. I also
> discovered some kind of item indexes from logical addressing (I think, not
> sure what they are called) which occur right after both "L2P_OFFSET" and
> "P2L_OFFSET" in the revs files.

Do you mean "L2P-INDEX" and "P2L-INDEX"?

> By looking at many files, I figured out how
> to calculate the binary representation for that based on the rev number
> (strange calculation).

The checksum in the final line is just MD5.

> That got me past the error such as - "svn: E160054:
> Index rev / pack file revision numbers do not match" - from the svn info
> command.
>
> And now I'm trying to get past the "L2P index checksum mismatch" error. I
> don't know yet how the "actual" checksum value is calculated. Thankfully
> Subversion's error message shows both the "expected" and "actual"
> checksums. So I've tried taking an MD5 hash on byte ranges of the L2P-INDEX
> area (and variations), but haven't gotten a match to that "actual" value
> yet.
>
> If you could provide insight to where these 2 checksums come from, I'd be
> really grateful.

I think you're looking for the modified FNV-1A in structure-indexes
(which I suspect is svn_checksum_fnv1a_32), but anyway, try setting the
checksum's value to all-zeroes: by convention, such a checksum is
considered equal to everything in checksum comparisons. You might even
be able to use «svnfsfs load-index» for that (after removing the
appended data or adjusting svnfsfs's source).
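
The all-zeroes trick can be applied mechanically once the appended data has been removed. This is a sketch only, under the same assumption about the footer layout (ASCII "<offset> <MD5> <offset> <MD5>" plus one terminal byte holding the text length); `zero_footer_checksums` is a hypothetical helper name.

```python
import re

# ASSUMPTION: footer text "<l2p_offset> <md5> <p2l_offset> <md5>" followed by
# one byte holding the length of that text. Verify on an undamaged rev file.
FOOTER_RE = re.compile(rb"(\d+) ([0-9a-f]{32}) (\d+) ([0-9a-f]{32})")
ZERO_MD5 = b"0" * 32


def zero_footer_checksums(data: bytes) -> bytes:
    """Return data with both footer checksums replaced by all-zero MD5s.

    `data` must end with the footer, i.e. appended bytes are already removed.
    """
    if not data:
        raise ValueError("empty file")
    footer_len = data[-1]
    start = len(data) - 1 - footer_len
    m = FOOTER_RE.fullmatch(data, start, len(data) - 1) if start >= 0 else None
    if m is None:
        raise ValueError("footer not found at end of data")
    footer = b"%s %s %s %s" % (m.group(1), ZERO_MD5, m.group(3), ZERO_MD5)
    return data[:start] + footer + bytes([len(footer)])
```

Since a zeroed MD5 has the same 32 hex digits as a real one, the footer length and both offsets stay unchanged.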

The on-disk format is documented in subversion/libsvn_fs_fs/structure
(grep for "logical").

You can sidestep the entire L2P/P2L fabrication step by using physical
addressing. The C code as it stands makes use_log_addressing
a per-fs-instance knob rather than a per-rev-file one, but for your
purposes you can patch the C sources to pretend ffd->use_log_addressing
were FALSE for a specific fs instance and revnum range (the revnums
whose rev files you'll be fabricating). svn_fs_fs__item_offset() seems
a relevant callsite.

> Also, any other general thoughts on this project would be appreciated.

Enable post-commit email notifications with diffs?

Cheers,

Daniel

Michael K

Nov 28, 2022, 5:51:23 PM
to us...@subversion.apache.org, Daniel Shahaf
Daniel, thanks for your reply! It is greatly appreciated.


"Yes, rev files have quite a bit of internal structure: reps, node-rev
headers, changed-paths, P2L/L2P, final line. These are generally easy
to parse out of surrounding contexts (revprop files use counted-length
strings, reps have their header and "ENDREP" trailer, L2P-INDEX and
P2L-INDEX know their own length and have ASCII before and after them,
and everything else is ASCII in specific formats)."

I have frequently looked over the documentation at https://svn.apache.org/repos/asf/subversion/trunk/subversion/libsvn_fs_fs/structure.

I can definitely recognize the border between reps when I see an "ENDREP", 0A (newline), and then a "DELTA SVN". But then there are those that have significant data between "ENDREP" and "DELTA SVN", and I don't understand what is going on there yet. I don't know how I would split those if needed.
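
A quick way to inspect those in-between bytes is to list every complete rep and print what sits between consecutive ones. Going by the list of item types earlier in this thread, the gaps are presumably node-rev headers or changed-paths data, but that is a guess to be checked. The sketch assumes rep headers are a line reading "PLAIN", "DELTA", or "DELTA" followed by three numbers; binary rep content could contain "ENDREP" by chance, so results need a sanity check. Function names are illustrative.

```python
import re

# ASSUMPTION: a rep starts with a header line "PLAIN", "DELTA", or
# "DELTA <n> <n> <n>", and ends with the trailer line "ENDREP".
REP_HEADER_RE = re.compile(rb"^(?:PLAIN|DELTA(?: \d+ \d+ \d+)?)\n", re.MULTILINE)
TRAILER = b"ENDREP\n"


def rep_spans(data: bytes):
    """Return [(start, end)] for each rep that has both header and trailer."""
    spans = []
    pos = 0
    while True:
        m = REP_HEADER_RE.search(data, pos)
        if m is None:
            break
        trailer_at = data.find(TRAILER, m.end())
        if trailer_at < 0:
            break  # header without trailer: partial rep, discard
        end = trailer_at + len(TRAILER)
        spans.append((m.start(), end))
        pos = end
    return spans


def between_reps(data: bytes, spans):
    """Return the byte strings lying between consecutive reps."""
    return [data[a_end:b_start]
            for (_, a_end), (b_start, _) in zip(spans, spans[1:])]
```

A rep whose header fell inside the overwritten region is simply not found, which matches the plan of keeping full reps and discarding partial ones.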


"Similarly, it should be easy to recognize where the appended cryptogram
and padding start, since the part from L2P-INDEX to the last line is
distinctive and self-checksummed."

Yes, I have been able to remove the trailing bit that the ransomware added at the end of files. I made a program to process all files and it does that fine.

"I don't know by heart what elements will be serialized into the first
4KB of a rev file in logical addressing mode."

I'll mention that a great many of these rev files are smaller than 4 KB, so they contain no original data.


"Why would you need to /manually/ create a rev file with original data?
You can use 'svn commit' to create rev files (on top of the old, good
backup). I'd have thought you'd focus on trying to extract data from
the partially-corrupted rev files (e.g., reconstruct the fulltexts of
reps where it's possible to do so)."

The old, good backup went through rev 88214, while the original data repo goes through rev 241130. So that is a difference of 152916 revisions, and none of those revisions work. I assume that is because of the ~4 KB of damage at the beginning of each file. As far as I know, nothing can read anything automatically from those files, and Subversion will not show any data or verify any revisions once it hits the ransomware-affected files.

So I have been investigating a process to create rev files that includes remaining original data from the revs so they are functional within Subversion. If I can find a process to do that, then I can write a program that will execute that over all the ransomware-affected original revisions 88215 thru 241130. I am comfortable writing programs that process raw data from files in different ways. The plan would be to process all those files, output new files (completely outside of Subversion), and then access that within Subversion to check it. If something didn't work, I would rework the program and run it again.

I just started with an "empty" revision so that I would first know I can satisfy Subversion's minimum requirements for a revision (revprops and revs files). Someone related to the original project actually gave me an example repo for this purpose. Within that, they created a revision with a single change. They then used a dump filter to filter out the contents of that revision, and made a new repo from that. Then I look at the specific revision file in a hex editor. I've been looking at lots of files in a hex editor here.

Now, I am completely new to Subversion since this project. So if there is a better way to do this using Subversion, I'm certainly open to that! If I were to do an "svn commit", how would I include original data from the damaged repo?

My thought was these rev files include "reps" units, and those units are how I would include the original data in newly created rev files.

"(e.g., reconstruct the fulltexts of reps where it's possible to do so)"

Hmm I don't know what "the fulltexts" means.

I am also learning about SVNKit at the same time. Actually I was looking there to try to figure out the 2 hashes in the footer of the rev files. But it might be useful to use for this process.


"[note: this means it's possible for rN+M of a file to be recoverable even if rN's rep is lost].
...

In principle, you can even dive down this rabbit hole of abstractions to
recover data from the surviving tail ends of partially-overwritten reps."

That is intriguing to know. But as for diving down rabbit holes, I likely won't want to do that if it requires manual work per revision, or if it requires a lot of coding work with very little to gain. At this point I would love to get something that works and also contains some original data so that I know this is feasible. Then if I can improve from there, great.


"The rev files you get by default have bells and whistles turned on. For
instance, they use DELTA and self-DELTA reps even though it's a lot
easier to fabricate a PLAIN rep, and you can use PLAIN anywhere you can
use DELTA."

"For this reason, I'd recommend to try to create a 1.1-era rev file
first. Pass «--compatible-version=1.1 --fs-type=fsfs» to «svnadmin
create» above. (Subversion 1.1's FSFS is the oldest FSFS there is; see
`svnadmin info`.)"

All right, I understand what you are saying. So that would create a new repo that uses the old FSFS format. And that should be easier to fabricate. But that would differ from the backup repo I have with good data. So maybe that would have to be converted as well, if possible.


"Word of warning: when you test things, do NOT test with the r0 rev file.
The C code hard-codes the assumption that r0 is empty."

That is good to know, thank you.


"Do you mean "L2P-INDEX" and "P2L-INDEX"?"

I guess that is it. I see "L2P-INDEX", then 0A, then a value. Then later I see "P2L-INDEX", then 0A, then a value. Those values have an odd calculation apparently and I was able to derive them from the revision number. I'm not sure how it functions as an index. But I have read someone say they are meant to discourage analysis or something.


"The checksum in the final line is just MD5."

Which one? The final line of what? From what I understand, the end of a rev file has L2P-INDEX offset, FNV-1A hash, P2L-INDEX offset, FNV-1A, then terminal byte.
UPDATE - Ok now I know these are MD5, and the FNV-1A hashes are part of an intermediate step before the rev file is created.

Setting the value to all zeroes seemed to work, as now I have a different error message. Thank you!


"You can sidestep the entire L2P/P2L fabrication step by using physical
addressing."

That makes sense as well. I suppose the output of my program could be a repo that uses physical addressing. Again that would differ from the backup repo and so a conversion would be required if I would go that route.

Once again, huge thanks for your input!