In most supported languages, you can locate the raw descriptor of the original protobuf definition within the generated code. This descriptor usually takes the form of a binary string (see the next section for examples).
Our project implements a distinct method to determine whether a given input is possibly a nested protobuf. The core of this logic is the is_maybe_nested_protobuf function, which we recently enhanced to distinguish nested protobufs more accurately and handle them effectively.
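As a rough illustration of what such a heuristic can look like (a minimal sketch, not the project's actual implementation), one can walk the buffer and check that every field key and wire type is plausible:

```python
def _read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result = shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7
        if shift > 63:
            raise ValueError("varint too long")


def is_maybe_nested_protobuf(data):
    """Heuristic: True if `data` plausibly parses as a serialized protobuf."""
    pos, end = 0, len(data)
    if end == 0:
        return False
    try:
        while pos < end:
            key, pos = _read_varint(data, pos)
            field_number, wire_type = key >> 3, key & 7
            if field_number == 0:
                return False            # field numbers start at 1
            if wire_type == 0:          # varint
                _, pos = _read_varint(data, pos)
            elif wire_type == 1:        # 64-bit fixed
                pos += 8
            elif wire_type == 2:        # length-delimited
                length, pos = _read_varint(data, pos)
                pos += length
            elif wire_type == 5:        # 32-bit fixed
                pos += 4
            else:                       # 3/4 (deprecated groups) or invalid
                return False
        return pos == end               # the buffer must be consumed exactly
    except ValueError:
        return False
```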
You can extend or modify the is_maybe_nested_protobuf function based on your specific requirements or use cases. If you find a scenario where the current logic can be further improved, feel free to adapt the function accordingly.
In order to build Haskell packages with proto-lens, the Google protobuf compiler (which is a standalone binary named protoc) needs to be installed somewhere on your $PATH. You can get it by downloading the corresponding file for your system from the protobuf releases page. (The corresponding file will be named something like protoc-*.zip.)
Recently we extended the Datadog Agent to support extracting additional metrics from Kubernetes using the kube-state-metrics service. Metrics are exported through an HTTP API that supports content negotiation, so one can choose between a response body in plain text format and a binary stream encoded with Protocol buffers.
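For instance, with the requests library the two representations can be selected through the Accept header (the endpoint URL below is hypothetical; the protobuf media type is the one defined by the Prometheus exposition protocol):

```python
import requests

# Hypothetical kube-state-metrics endpoint.
URL = "http://kube-state-metrics:8080/metrics"

# Media type used by Prometheus clients to negotiate the protobuf encoding.
PROTO_ACCEPT = ("application/vnd.google.protobuf; "
                "proto=io.prometheus.client.MetricFamily; encoding=delimited")

text_response = requests.get(URL)  # plain text exposition format by default
proto_response = requests.get(URL, headers={"Accept": PROTO_ACCEPT})
```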
Messages in the real world can be way more complex, but for the scope of this article we will keep things simple. You can dive deeper by browsing the official docs, namely the language definition and the Python tutorial.
As we mentioned, the .proto file alone is not enough to use the message: we need some code representing the message in a programming language we can use in our project. A tool called protoc (short for Protocol buffers compiler) is provided along with the libraries exactly for this purpose: given a .proto file as input, it can generate message code in several different languages.
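As a minimal sketch, a simplified metric message could be declared like this (field names and numbers are illustrative, not the exact message discussed later):

```proto
syntax = "proto3";

// A deliberately simple metric message used in the following examples.
message Metric {
  string name = 1;
  int64 timestamp = 2;
  double value = 3;
}
```

Running protoc --python_out=. metric.proto then produces a metric_pb2.py module exposing a Metric class with the usual SerializeToString() and ParseFromString() methods.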
That is neat, but what if we want to encode/decode more than one metric from the same binary file, or stream a sequence of metrics over a socket? We need a way to delimit each message during the serialization process, so that processes at the other end of the wire can determine which chunk of data contains a single Protocol buffer message: at that point, the decoding part is trivial, as we have already seen.
This is quite easy to achieve, except that the Java implementation stores the size of the message in a Varint value. Varints are a serialization method that stores integers in one or more bytes: the smaller the value, the fewer bytes you need. Even if the concept is quite simple, the implementation in Python is not trivial, but stay with me, there is good news coming.
Protobuf messages are not self-delimited but some of the message fields are. The idea is always the same: fields are preceded by a Varint containing their size. That means that somewhere in the Python library there must be some code that reads and writes Varints - that is what the google.protobuf.internal package is for:
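For example, the internal encoder and decoder modules expose varint helpers. They are internal, so they come with no API stability guarantee; the metric_pb2 module below is the one generated from the earlier sketch:

```python
from google.protobuf.internal.encoder import _VarintBytes
from google.protobuf.internal.decoder import _DecodeVarint32
import metric_pb2  # generated earlier with protoc

metric = metric_pb2.Metric(name="cpu.load", timestamp=1431879600, value=0.42)

# Writing: prefix the serialized message with its byte size as a varint.
payload = metric.SerializeToString()
framed = _VarintBytes(len(payload)) + payload

# Reading: decode the varint prefix, then slice out the message bytes.
size, pos = _DecodeVarint32(framed, 0)
message_bytes = framed[pos:pos + size]
```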
There are a number of reasons why people need to serialize data: sending messages between two processes on the same machine, sending them across the internet, or both, and each of these use cases implies a different set of requirements. Protocol buffers are clever and efficient, but some optimizations and perks provided by the format are more visible when applied to certain data formats, or in certain environments: whether this is the right tool or not should be decided on a case by case basis.
Results are similar with a real world example: a payload returned by the kube-state-metrics API containing about 60 messages, encoded with Protocol buffers and gzip compression, takes about 6 KB; the same payload encoded with the Prometheus text format and gzip compression is pretty much the same size.
This is somewhat expected, since strings in Protobuf are UTF-8 encoded and our message is mostly text. If your data looks like our Metric message and you can use compression, payload size should not be a criterion to choose Protocol buffers over something else. Or to not choose it.
The encoding phase is where Protobuf spends the most time. The cpp encoder is not bad, serializing 10k messages in about 76 ms, while the pure Python implementation takes almost half a second. For one million messages the pure Python protobuf library takes about 40 seconds, so it was removed from the chart.
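A rough sketch of how numbers like these can be measured (not the exact benchmark harness used here):

```python
import timeit
import metric_pb2  # generated module from the earlier sketch

metric = metric_pb2.Metric(name="cpu.load", timestamp=1431879600, value=0.42)

# Serialize the same message 10,000 times and report the elapsed time.
elapsed = timeit.timeit(metric.SerializeToString, number=10_000)
print("10k messages serialized in {:.3f}s".format(elapsed))
```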
A Python version of the protocol used by the API is publicly available and can be included in our codebase, so we can ignore the setup and tooling overhead caused by the compilation process.
In terms of payload size, since the API server can compress HTTP responses, we can choose either of the two formats (Protobuf and plain text) provided by the kube-state-metrics API without changing the overall performance of the check.
As you can see, parsing complex data in text format is very different from our simple metric message: the Prometheus parser has to deal with multiple lines, comments, and nested messages. Putting aside the pure Python version of the library, parsing the binary payload outperforms plain text by one order of magnitude, crunching 60 metrics in about 2 milliseconds - the payload transfer will always be the bottleneck, no matter what.
Even if the Protobuf Python library does not support chained messages out of the box, the code needed to implement this feature is less than 10 lines. The data we get is structured, so once the message is parsed we are sure that a field contains what we expect, meaning less code, fewer tests, fewer hacks, and lower odds of a regression if the API changes.
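A minimal sketch of such a reader, building on the varint helpers shown earlier (the function name is ours, not part of the library):

```python
from google.protobuf.internal.decoder import _DecodeVarint32

def read_delimited(buf, message_class):
    """Yield parsed messages from a buffer of varint-delimited protobufs."""
    pos = 0
    while pos < len(buf):
        size, pos = _DecodeVarint32(buf, pos)
        message = message_class()
        message.ParseFromString(buf[pos:pos + size])
        pos += size
        yield message

# e.g. metrics = list(read_delimited(framed, metric_pb2.Metric))
```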
Google Play can be queried in two ways: using the official website or the Android client. The website contains pretty much all the useful information, such as app name and developer name, comments, last version number and release date, permissions required by the app, statistics, etc. I guess one could build a simple program that queries this website and parses the pages, but it would still have one limitation: you simply cannot download apps. Well, you can, but for this you will need an actual compatible phone, and as soon as you perform the install request, the application will get downloaded and installed on your phone. Then if you want to retrieve it in order to analyse it, you must plug in your phone and use adb pull. Some managed to get Google Play running within the emulator, but this is still complicated and not straightforward: you need Java, the Android SDK, a customized emulator ROM embedding Google Play, and you have to script everything yourself.
The weird thing is that the non-numeric assetId problem occurs quite often, but not on all apps. I guess this is because Google updated their API when they switched to Google Play; those projects are using the old version of the API. The only way to have up-to-date information and be able to download any app would then be to analyse the updated Android client, and adapt existing projects.
Here we go! We retrieve com.android.vending-1.apk from an up-to-date Android phone using adb, and we use our favorite Android RE tools. A first look at class names highlights a pretty explicit VendingProtos class, under the com.google.android.vending.remoting.protos package. It contains references to a package named com.google.protobuf.micro, embedded within the app. This package contains classes used to encode and decode messages. It is actually part of a public project, named micro-protobuf, which is a lightweight version of Protobuf. However, the underlying protocol remains the same.
Most of the network traffic is sent over HTTPS. After installing our own CA onto the phone and setting up an interception proxy like Burp, we can sniff the traffic. From a black-box perspective, the exchanged data looks like a binary stream:
This field has a tag equal to 3 (26 >> 3) and is a message whose name is AppDataProto. In order to get this sub-message structure, we would have to repeat the analysis process on the corresponding class, and so on.
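The arithmetic is the standard protobuf field key split, where the three low bits carry the wire type:

```python
key = 26                  # raw field key read from the stream
field_number = key >> 3   # -> 3
wire_type = key & 7       # -> 2, length-delimited (strings, embedded messages)
```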
We now have a way of recovering a message structure by analyzing the generated code. All we need now is automating the process. For this, we can use Androguard, a multi-purpose framework intended to make Android reversing easier. With Androguard, we can simply open an APK, decompile it, parse its Dalvik code, and do all sorts of things. Once installed, one can use the provided androlyze tool to dynamically interact with the framework, and then write a script to automate everything.
Then we extract the mergeFrom() method of each class by filtering the method list generated by dvm.get_methods_class(class_name). The basic block list of each method can be obtained with vma.get_method(m).basic_blocks.gets().
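Putting those calls together, the extraction loop can look like this (a sketch that reuses the dvm and vma objects from an androlyze session; the class name is illustrative):

```python
# `dvm` is the DalvikVMFormat object and `vma` the VMAnalysis object
# obtained from an androlyze session, as described above.
class_name = "Lcom/google/android/vending/remoting/protos/VendingProtos$AppDataProto;"

# Keep only the mergeFrom() method of the class.
merge_from = [m for m in dvm.get_methods_class(class_name)
              if m.get_name() == "mergeFrom"]

for method in merge_from:
    for block in vma.get_method(method).basic_blocks.gets():
        # Each basic block under the switch corresponds to one message field.
        print(block.get_name())
```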
The first is usually the one that implements the switch instruction. In Dalvik, a switch is often represented as a sparse-switch instruction, whose operand is a table called a sparse-switch-payload: essentially a list of (value, offset) pairs.
Each (value, offset) tuple corresponds to a case of the switch; if the value matches the compared register, execution continues at the corresponding offset. Once we are able to browse each case of the switch (and its target basic block), we can determine the name and type of each field by examining the names of the corresponding accessors.
Each basic block contains two accessor calls: readXXX() and setYYY(). Their goal is to read an incoming series of bytes and initialize one field of the message. XXX corresponds to the type of the field (here, string), and YYY to its name (city).
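In the automation script, recovering the field type and name from those accessor names boils down to simple string manipulation (a sketch, assuming the naming convention described above):

```python
import re

def parse_accessors(read_call, set_call):
    """Derive (field_type, field_name) from readXXX()/setYYY() accessor names."""
    field_type = re.match(r"read(\w+)", read_call).group(1).lower()
    field_name = re.match(r"set(\w+)", set_call).group(1)
    field_name = field_name[0].lower() + field_name[1:]  # 'City' -> 'city'
    return field_type, field_name

print(parse_accessors("readString", "setCity"))  # ('string', 'city')
```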
The resulting output is almost usable with protoc. Almost, because there is a duplicate message that you need to manually remove in order to make protoc happy. But after taking care of that detail, you have a working googleplay.proto that you can use to generate C++, Java and Python stubs for querying Google Play API!
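From there, generating and using the Python stub is the usual protoc workflow (message names below follow the article; the exact API surface depends on the recovered .proto):

```python
# Generated with: protoc --python_out=. googleplay.proto
import googleplay_pb2

# Illustrative round trip with the AppDataProto message mentioned earlier.
message = googleplay_pb2.AppDataProto()
data = message.SerializeToString()
parsed = googleplay_pb2.AppDataProto.FromString(data)
```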