all is a special case: it compiles all possible target languages, creating language-specific directories (as per language identifiers) inside the output directory, and then creating output module(s) for each language starting from there.
The main idea of Kaitai Struct is that you create a description of a binary data structure format using a formal language, save it as a .ksy file, and then compile it with the Kaitai Struct compiler into a target programming language.
Many file formats use some sort of safeguard against using a completely different file type in place of the required one. The simple way to do so is to include some "magic" bytes (AKA "file signature"): for example, checking that the first bytes of the file are equal to their intended values provides at least some degree of protection against such blunders.
This reads the first 4 bytes and compares them to the 4 bytes CA FE BA BE. If there is any mismatch (or fewer than 4 bytes are read), it throws an exception and stops parsing at an early stage, before any damage (pointless allocation of huge structures, waste of CPU cycles) is done.
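The check described above can be sketched in plain Python. This is only an illustration of the behavior, not the code the Kaitai Struct compiler generates; the names check_magic and SignatureMismatch are hypothetical.

```python
import io

MAGIC = bytes([0xCA, 0xFE, 0xBA, 0xBE])  # the signature quoted in the text


class SignatureMismatch(Exception):
    """Hypothetical error type raised when the magic bytes do not match."""


def check_magic(stream, expected=MAGIC):
    # Read exactly as many bytes as the signature is long.
    actual = stream.read(len(expected))
    # A short read (truncated input) fails the comparison as well.
    if actual != expected:
        raise SignatureMismatch(
            f"expected {expected.hex()}, got {actual.hex()}"
        )
    return actual


# Passes silently: the input starts with CA FE BA BE.
check_magic(io.BytesIO(b"\xca\xfe\xba\xbe\x00\x00\x00\x34"))
```

Failing this early means nothing else is read or allocated for an input that is clearly of the wrong type.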
When a value does not meet the specified criteria, Kaitai Struct raises a validation error, halting further parsing. This preemptive measure ensures the data being parsed is within the expected domain, providing a first layer of error handling.
Another popular way to avoid allocating huge fixed-size buffers is to use some sort of trailing delimiter. The most well-known example of this is probably the null-terminated string which became a standard string representation in C:
These 4 bytes actually represent the 3-character string "abc", plus one extra trailing byte "0" (AKA null) which serves as a delimiter or terminator. By agreement, C strings cannot include a zero byte: every time a function in C sees one, either in a stream or in memory, it treats it as a special mark to stop processing.
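A minimal Python sketch of reading such a string from a stream, assuming ASCII text; read_strz is a hypothetical helper name, not part of any Kaitai Struct runtime.

```python
import io


def read_strz(stream, encoding="ascii"):
    """Read bytes up to (not including) the first zero byte and decode them."""
    chunk = bytearray()
    while True:
        b = stream.read(1)
        if b == b"":
            raise EOFError("stream ended before a null terminator was found")
        if b == b"\x00":
            break  # the terminator is consumed but not kept
        chunk += b
    return chunk.decode(encoding)


stream = io.BytesIO(b"abc\x00def\x00")
first = read_strz(stream)   # "abc"
second = read_strz(stream)  # "def"
```

No length needs to be known in advance: each read stops at the terminator and leaves the stream positioned right after it.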
Another very widespread model is actually having both a fixed-size buffer for a string and a terminator. This is typically an artifact of serializing structures like this from C. For example, take this structure:
Effectively, the buffer is still 16 bytes, but its only meaningful contents are those up to the first null terminator. Everything beyond that is garbage, left over either from the buffer not being initialized at all (these ?? bytes could contain anything) or from parts of strings that previously occupied this buffer.
terminator, given that size is present, only works inside these 16 bytes, cutting the string short at the first terminator byte encountered and saving the application from getting all that trailing garbage.
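The combined behavior of a fixed size plus a terminator can be sketched in Python as follows. The helper name decode_fixed_strz and the garbage bytes in the sample are made up for illustration.

```python
def decode_fixed_strz(buf, size=16, encoding="ascii"):
    """Take exactly `size` bytes, then cut at the first zero byte, if any."""
    field = buf[:size]
    if len(field) < size:
        raise ValueError(f"need {size} bytes, got {len(field)}")
    end = field.find(b"\x00")
    if end != -1:
        field = field[:end]  # drop the terminator and the garbage after it
    return field.decode(encoding)


# 16-byte buffer: "abc", a null terminator, then 12 bytes of leftover garbage.
raw = b"abc\x00" + b"\xde\xad\xbe\xef" * 3
name = decode_fixed_strz(raw)  # "abc"
```

The size decides how many bytes are consumed from the stream; the terminator only decides how many of those bytes end up in the string.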
What do we do if we need to use many of the strings in such a format? Writing so many repetitive my_len- / my_str-style pairs would be bothersome and error-prone. Fear not, we can define another type in the same file and use it as a custom type in a stream:
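The same idea, factoring the length/string pair into one reusable unit, looks like this in a Python sketch. The layout is an assumption for illustration (a 4-byte little-endian length followed by UTF-8 text), and read_len_str and pack_len_str are hypothetical helpers.

```python
import io
import struct


def read_exact(stream, n):
    data = stream.read(n)
    if len(data) != n:
        raise EOFError(f"needed {n} bytes, got {len(data)}")
    return data


def read_len_str(stream, encoding="utf-8"):
    # Assumed layout: u4 little-endian length, then that many bytes of text.
    (length,) = struct.unpack("<I", read_exact(stream, 4))
    return read_exact(stream, length).decode(encoding)


def pack_len_str(text, encoding="utf-8"):
    # Helper used only to build sample input for the demo below.
    data = text.encode(encoding)
    return struct.pack("<I", len(data)) + data


sample = pack_len_str("track") + pack_len_str("album") + pack_len_str("artist")
stream = io.BytesIO(sample)
# One definition, reused for every string in the stream.
fields = [read_len_str(stream) for _ in range(3)]
```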
Some protocols and file formats have optional fields, which only exist under some conditions. For example, there can be a leading byte that designates whether some other field exists (1) or not (0). In Kaitai Struct, you can do that using the if key:
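What such a conditional field means at parse time can be sketched in Python. The optional field is assumed here to be a 4-byte little-endian unsigned integer, and read_optional is a hypothetical name.

```python
import io
import struct


def read_optional(stream):
    flag = stream.read(1)
    if len(flag) != 1:
        raise EOFError("missing flag byte")
    has_value = flag[0]
    value = None  # stays None when the field is absent
    if has_value == 1:
        raw = stream.read(4)
        if len(raw) != 4:
            raise EOFError("flag says the field exists, but it is truncated")
        (value,) = struct.unpack("<I", raw)
    return has_value, value


present = read_optional(io.BytesIO(b"\x01\x2a\x00\x00\x00"))  # (1, 42)
absent = read_optional(io.BytesIO(b"\x00"))                    # (0, None)
```

When the condition is false, nothing is read at all, so the next field starts right after the flag byte.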
This one reads 4-byte signed integer numbers until encountering -1. On encountering -1, the loop will stop and further sequence elements (if any) will be processed. Note that -1 would still be added to the array.
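The loop described above, including the detail that the terminating -1 is kept, can be sketched in Python. Little-endian byte order is an assumption here, and read_until_minus_one is a hypothetical name.

```python
import io
import struct


def read_until_minus_one(stream):
    values = []
    while True:
        raw = stream.read(4)
        if len(raw) != 4:
            raise EOFError("stream ended before -1 was found")
        (value,) = struct.unpack("<i", raw)  # 4-byte signed integer
        values.append(value)  # the terminating -1 is stored too
        if value == -1:
            break
    return values


stream = io.BytesIO(struct.pack("<5i", 10, 20, -1, 30, 40))
numbers = read_until_minus_one(stream)  # [10, 20, -1]
rest = stream.read()                    # 30 and 40 are left for later fields
```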
If we want to catch the "else" branch, i.e. match everything not matched by our ifs, we have to write an inverse of the sum of the ifs manually. For anything more than 1 or 2 types it quickly becomes a mess.
One needs to make sure that the type used in switch-on and the types used in cases are either identical or at least comparable. For example, comparing strings against integers will yield a compile-time error:
Quite a few protocols and file formats, especially those which aim to conserve space, pack multiple integers into one byte, using integer sizes less than 8 bits. For example, an IPv4 packet starts with a byte that packs both a version number and header length:
Using the meta/bit-endian key, we specify big-endian bit field order (see Specifying bit endianness for more info). In this mode, Kaitai Struct starts parsing bit fields from the most significant bit (MSB, 7) to the least significant bit (LSB, 0). In this case, "version" comes first and "len_header" second.
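The same MSB-first split, done by hand with bitwise operators, looks like this in Python. It assumes the usual 4-bit/4-bit division of the first IPv4 byte; split_version_ihl is a hypothetical name.

```python
def split_version_ihl(first_byte):
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("expected a single byte value")
    version = first_byte >> 4        # upper 4 bits (7..4) come first
    len_header = first_byte & 0x0F   # lower 4 bits (3..0) come second
    return version, len_header


version, len_header = split_version_ihl(0x45)  # (4, 5)
```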
Most formats using little-endian byte order with packed multi-byte bit fields (e.g. android_img, rar or swf) assume that such bit fields are unpacked manually using bitwise operators from a little-endian integer parsed in advance containing the whole bit field. The bit layout of the field is designed accordingly.
The expressions for extracting the values look exactly the same as for the big-endian order, but the actual bit layout will be different, because here the packed integer is read in little-endian (LE) byte order.
As you can see in the KSY snippet, the bit field members in seq are listed from the least significant value to the most significant. If we look at the bit masks of bit field members (which can be directly used for ANDing & with the 2-byte little-endian unsigned value), they would be sorted in ascending order (starting with the least significant value):
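To make the ascending-mask property concrete, here is a Python sketch for a made-up 16-bit layout (members a, b and c with widths 5, 6 and 5 bits). The layout and all names are hypothetical and are not taken from any of the formats mentioned above.

```python
import struct

# Hypothetical layout, listed from least to most significant member.
FIELDS = [("a", 5), ("b", 6), ("c", 5)]


def masks(fields):
    """Return (name, shift, mask) triples, least significant member first."""
    shift = 0
    out = []
    for name, width in fields:
        out.append((name, shift, ((1 << width) - 1) << shift))
        shift += width
    return out


def unpack_le_bits(raw2, fields=FIELDS):
    # Read the whole bit field as a 2-byte little-endian unsigned integer,
    # then AND with each mask and shift the result down.
    (value,) = struct.unpack("<H", raw2)
    return {name: (value & mask) >> shift for name, shift, mask in masks(fields)}


layout = masks(FIELDS)  # masks 0x001F, 0x07E0, 0xF800: ascending order
parsed = unpack_le_bits(struct.pack("<H", (3 << 11) | (2 << 5) | 1))
```

Because the first member occupies the lowest bits, each following mask is numerically larger than the previous one.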
The key meta/bit-endian specifies the default parsing direction (bit endianness) of bit-sized integers. It can only have the literal value le or be (run-time switching is not supported).
Like meta/endian, meta/bit-endian also applies to bX attributes in the current type and all subtypes, but it can be overridden using the le/be suffix (bXle/bXbe) for the individual bit integers. For example:
The doc key has a "sister" key doc-ref, which can be used to specify references to original documentation. This is very useful to keep track of what corresponds to what when transcribing an existing specification. Everywhere you can use doc, you can use doc-ref as well. Depending on the target language, this key would be rendered as something akin to a "see also" extra paragraph after the main docstring. For example:
The Kaitai Struct compiler will just ignore any key that starts with -, and silently allow it. These kinds of keys can be used to store arbitrary additional information, which can be accessible to external tools (i.e. other than the compiler). Feel free to add more arbitrary keys if you need to store extra structured information for some reason. For example, if you have 2 concurrent existing implementations in C++ and Java, you can store IDs for both of them for future reference:
In this format, instead of specifying just the identifier for every numeric value, you specify a YAML map, which has an id key for the identifier, and allows other regular keys (like doc and doc-ref) to specify documentation.
forensicswiki specifies an article name at Forensics Wiki, which is a CC-BY-SA-licensed wiki with information on digital forensics, file formats and tools. A full link could be generated as + this value + /.
loc specifies an identifier in the Digital Formats database of the US Library of Congress, a major effort to enumerate and document many file formats for digital preservation purposes. The value typically looks like fddXXXXXX, where XXXXXX is a 6-digit identifier.
mime specifies a MIME (Multipurpose Internet Mail Extensions) type, AKA "media type" designation, a string typically used in various Internet protocols to specify the format of a binary payload. As of 2019, there is a central registry of media types managed by IANA. The value must specify the full MIME type (both parts), e.g. image/png.
pronom specifies a format identifier in the PRONOM Technical Registry of the UK National Archives, which is a massive file formats database that catalogues many file formats for digital preservation purposes. The value typically looks like fmt/xxx, where xxx is a number assigned at PRONOM (this identifier is called a "PUID", AKA "PRONOM Unique Identifier" in PRONOM itself). If many different PRONOM formats correspond to a particular spec, specify them as a YAML array (see example above).
rfc specifies a reference to an RFC, one of the "Request for Comments" documents maintained by ISOC (Internet Society). Despite the confusing name, RFCs are typically treated as global, Internet-wide standards, and, for example, many networking / interoperability protocols are specified in RFCs. The value should be just the raw RFC number, without any prefixes, e.g. 1234.
wikidata specifies an item name at Wikidata, a global knowledge base. All Wikimedia projects (such as language-specific Wikipedias, Wiktionaries, etc.) use Wikidata at least for connecting various translations of encyclopedic articles on a particular subject, so keeping just a link to Wikidata is typically enough. The value typically follows a Qxxx pattern, where xxx is a number generated by Wikidata, e.g. Q535473.
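An external tool could sanity-check these values against the typical shapes described above. The following Python sketch does just that; it is a hypothetical helper, not something the Kaitai Struct compiler does, and the sample values are placeholders rather than real registry entries (apart from the examples quoted in the text).

```python
import re

# Typical value shapes as described above; real registries may allow more.
TYPICAL = {
    "loc": re.compile(r"fdd\d{6}"),
    "mime": re.compile(r"[^/\s]+/[^/\s]+"),  # both parts, e.g. image/png
    "pronom": re.compile(r"fmt/\d+"),
    "rfc": re.compile(r"\d+"),               # raw number, no prefix
    "wikidata": re.compile(r"Q\d+"),
}


def unusual_values(refs):
    """Return (key, value) pairs that do not match the typical shape."""
    odd = []
    for key, value in refs.items():
        pattern = TYPICAL.get(key)
        if pattern is None:
            continue  # keys without a described shape are not checked
        # A key may hold one value or a list of values.
        values = value if isinstance(value, list) else [value]
        for item in values:
            if not pattern.fullmatch(str(item)):
                odd.append((key, item))
    return odd


sample = {
    "loc": "fdd000000",
    "mime": "image/png",
    "pronom": ["fmt/1", "fmt/2"],
    "rfc": 1234,
    "wikidata": "Q535473",
}
problems = unusual_values(sample)  # empty list: everything looks typical
```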
However, what gets changed under the hood? It turns out that specifying size actually brings some new features: if you modify the person type to be less than 20 bytes long, it still reserves exactly 20 bytes for joe:
Every class that Kaitai Struct generates carries a concept of a "stream", usually available as an _io member. This is the default stream it reads from and writes to. This stream works just as you might expect from a regular IO stream implementation in a typical language: it encapsulates reading from files and memory, stores a pointer to its current position, and allows reading/writing of various primitives.
Declaring a new user-defined type in the middle of the seq attributes generates a new object (usually via a constructor call), and this object, in turn, needs its own IO stream. So, what are our options here?
In the "sized" case, we know the size a priori and want the object we create to be limited to that size. So, instead of passing an existing stream, we create a new substream that will be shorter and will contain the exact number of bytes requested.
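The effect of a sized substream can be sketched with Python's io.BytesIO. This illustrates the idea only and is not the runtime library's implementation; substream and joe_io are hypothetical names, and the 20-byte size follows the joe example above.

```python
import io


def substream(parent, size):
    """Consume exactly `size` bytes from parent and wrap them in a new stream."""
    data = parent.read(size)
    if len(data) != size:
        raise EOFError(f"requested {size} bytes, but only {len(data)} available")
    return io.BytesIO(data)


parent = io.BytesIO(b"A" * 20 + b"NEXT")
joe_io = substream(parent, 20)

used = joe_io.read(4)        # a shorter type may read only part of its 20 bytes
unused = joe_io.read()       # the remaining 16 bytes stay inside the substream
following = parent.read(4)   # the parent continues right after the 20 bytes
```

However little the nested object reads, the parent stream has already moved past all 20 reserved bytes, and the nested object cannot read beyond them.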
Kaitai Struct allows you to plug in some predefined "processing" algorithms to do decompression, de-obfuscation and decryption to get a clear stream, ready to be parsed. Consider parsing a file in which the main body is obfuscated by applying XOR with 0xaa for every byte:
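The de-obfuscation step itself is simple enough to sketch in Python; xor_bytes is a hypothetical helper name. Since XOR is its own inverse, the same function both obfuscates and restores the data.

```python
def xor_bytes(data, key=0xAA):
    """XOR every byte of data with a single-byte key."""
    return bytes(b ^ key for b in data)


obfuscated = xor_bytes(b"hello")   # what would be stored in the file
clear = xor_bytes(obfuscated)      # b"hello" again, ready to be parsed
```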