To be included in this list, a project must have decompiled or disassembled at least 10% of the source code; this keeps the post from being polluted with hundreds of unfinished projects. An exemption to this rule can be made for consoles with very few reversing projects.
Recently, largely thanks to the huge success of the Super Mario 64 project, there has been revived interest in Nintendo 64 decompilation. While most projects are still in very early stages, there are a few that meet the criteria for this list.
This project aims to perfectly reconstruct the source code of the first five Touhou Project games by ZUN Soft (now Team Shanghai Alice), which were originally released exclusively for the NEC PC-9801 system.
A decompilation of the 2005 Java mobile port of Sonic the Hedgehog (1991). For Part 1, the decompilation focuses on the 240x320 resolution version, as it uses the original game's graphics; the lower-resolution versions function similarly. For Part 2, it targets the largest available resolution, 208x208. Part 1 was created by iFone, and Part 2 by Glu.
In 2019, I started decompiling the PC version of the Spider-Man game developed by Neversoft, most commonly known as Spider-Man 2000. The work didn't last long; I made few commits, the last on the 10th of August 2020.
Fast-forward to 2023, and my interest in going back to the decompilation had been re-ignited, all thanks to a few community members who still actively create mods and explore the game. Here's a non-exhaustive list of the most active ones:
Restarting was a no-brainer, as I had completely shifted my approach to the project and learned a lot more about the game. Even though I knew the game was written in C++, I had previously decided to use C. I had no strong reason to do so besides being more comfortable with C. This time I did not make the same mistake.
Previously, I was live-testing the code: I would rewrite a method, hook it into the game code, run the game, and see if it worked. Not only was this method tedious, it left the door open on the correctness of the code: did I mess up the signedness? Did I miss an edge case? This time I would follow a process similar to matching decompilation projects, but instead of trying to recreate the game byte for byte, I would strive for function parity. This means each function/subroutine would be recreated to match the original version as closely as possible. Cases that didn't match would need to be analyzed, as it's quite easy to make a change in code that alters the register allocation or instruction order.
Having a project that, when built, produces a binary that is an exact replica of the game is great, as it means all the original behaviours (bugs included) were preserved. One also develops a sense of how the original developers wrote code and thus becomes more efficient at decompiling further pieces of it.
It is a laborious, fruitless and tedious process. Even when using the same compiler toolchain as the original developers, there's a lot of "code massaging" before it produces the exact same output. After all that work, one thinks to themselves: did the developers really go through all this effort to write this routine? It looks nothing like real code. When I say "code massaging", it's not limited to changing the code but also the build flags: it's entirely possible that some object files were built with different optimizations.
There's no bigger frustration than struggling to get a function to match. When I was decompiling a super-small routine of a PlayStation game, I had the logic pinned down: it took parameters and wrote to some global variable. My code was doing exactly that, but had an extra copy to a register, which was driving me crazy. After a while I realized the issue: I had defined the function as returning an int; if I changed it to void, it would be a perfect match. All this to say that there are so many variables at play that getting a matching decompilation is just too much work.
Let's say we're decompiling a function: we can either do matching decompilation or equivalent decompilation. The matching process has been described, but how does one evaluate whether two pieces of assembly are equivalent? Two factors: sign correctness and memory accesses.
The answer is the second! For the first one, it's totally possible that the value changes in between the writes, so the generated assembly would contain a read for each write. BUT! It's totally possible to make the compiler generate the same assembly for both, provided you tweak the flags enough or use the restrict keyword to convince the compiler there's no way the value could change in between those writes.
The Macintosh version of the game contains symbols: function names, parameter types and names. These have been used to identify functions in the game; annoyingly, the compiler used for the PC version inlined quite a few routines. A cool thing about the Macintosh version is that it contains boundaries for the object files: where the code from foobar.cpp starts, there is a dummy section called .sinit_foobar_cpp, which has been extremely helpful in recreating a proper source directory layout.
The game shares the same engine with Tony Hawk's Pro Skater 2, and that game has had 2 demos for the PlayStation that contain symbols. They contain even more information than the Macintosh version does: they also contain class/structure definitions with offsets. Since both of these games (ab)use inheritance, this has been a godsend for outlining where a base class ends and a child class begins. For reference, here's the inheritance chain of the player class: CPlayer -> CSuper -> CBody -> CItem (-> CClass). CClass doesn't seem to be present in the PC and Mac versions; either it was fully inlined or removed during porting.
So far, 75 data structures have been added to the codebase. I've also created validation macros to make sure both structure sizes and memory accesses don't get messed up by padding or accidental mistakes. They are VALIDATE_SIZE, which takes a structure/class name and the expected size, and VALIDATE, which takes a structure/class name, a field name and the expected offset. For both macros, if the check fails, a message is printed.
By starting to write code for the "lower" level methods, I was able to identify functions that had been inlined. I've also started to develop an intuition for how the compiler works. For example, when there's a conditional jump, it jumps towards the un-met condition (or, as I used to say, "jumps further ahead in the source file").
Recently, to help identify which functions are matching and which need to be revisited, I developed a tagging system: @Ok, @NotOk and @TODO. The idea is for the tags to be easily searchable and to make it easy to generate high-level status reports.
Finally, I've also made the decompilation setup reproducible, which is basically IDA Pro, Ghidra and the same version of Visual Studio used by the developers (identified using Detect-It-Easy). In the beginning I was working on a different machine that, for a while, I didn't have access to.
In late 2023, I decided to stream every time I worked on the project. The goal was simple: to educate people on the complexity of this endeavor. Since I run a Discord community with close to a thousand members, it's not uncommon to get the question "what % of the work is left to do?", which is quite tricky to answer.
The decompilation process goes as follows: identify a function in the code and the data structures it uses; recreate those data structures in code; and finally recreate the function. It's a lot of work until the code is done. For a lot of streams I was just identifying fields in data structures and naming them when their purpose could be inferred from the scenario.
As described in this post, a decompilation project is tedious and laborious, so it's easy to lose motivation and leave it on the back burner for a long time. The solution I have found is consistency: try to work on it a bit every day, as all the effort compounds. For example, there are days where all I did was outline fields in classes, but eventually there will be a day where I won't need to open the Structures tab in IDA, and that will mean I have all the internal structures figured out.
I've also worked on removing all the inconveniences that could hinder me working on it, such as requiring a specific computer. Now that I have a portable and reproducible environment, I have fewer excuses not to tackle this project. A cool side effect of working on it regularly is that I feel more motivated; like a self-fulfilling prophecy, the more I work, the more motivated I feel.
I've talked about this with MrMartinIden, who is working on a decompilation project for the Ultimate Spider-Man PC version, and we both agree that this type of project requires much more than motivation to thrive.
For the first two instances, the idea of dealing with loops makes them daunting, and I haven't found a proper way to handle it. As for the third option, I have little experience with LLVM, so I'm not sure how feasible it is. If someone has expertise in this topic, please reach out to me! My contact information is on the About page.
I just want to say, I love reading this. I really wish you would do the same decompilation and analysis with Sega's Space Harrier (which no doubt used an earlier version of their "Super Scaler" technology).
Just wanted to let you know that there are those of us out there fascinated by decompiling and analyzing classic Sega arcade code, and thank you very much for taking the time to do this.
I have always loved CTR; it was the first game I played, except maybe for an Elmo game right before it. My sisters, my dad and I all played the 4-player multiplayer together, and we all loved the game.
You mentioned the SM64 decomp project, which to my knowledge is published on GitHub. But your work is currently not in any public repository, only available for download through a YouTube video description. Is there any reason for this?