Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Tokenizer theory and practice

36 views
Skip to first unread message

Hans-Peter Diettrich

unread,
May 13, 2008, 8:51:21 AM5/13/08
to
I'm currently thinking about lexing/tokenizing binary data and Unicode,
in e.g. PDF files, and possible optimizations in LL/PEG parsers.

Binary data typically comes in fixed length chunks, where the parser has
to provide the length and encoding of the next token. This procedure is
quite different from lexing text, that's why I prefer the term
"tokenizer" in this context, and bottom-up (LR) parsers seem not to be
very usable in this case. A lexer at best can skip over binary data,
what often is sufficient, but then it should not be confused by embedded
null "characters" in the input. Traditional lexers often use null
characters as end markers, for either the end of the whole input or for
the end of the input buffer, so that this value needs special handling
anyways. Then it should not be hard to add another meaning to such
bytes, as legal members of the expected character set. But how to
implement that handling?

IMO the most time-efficient procedure would use an pointer into an input
buffer, so that special handling (e.g. fill_buffer) is only required
when a null byte is detected. A dedicated getter procedure instead can
implement transparent buffering, returning an dedicated (epsilon/EOF)
value only at the real end of the input. The latter case seems to be
more suitable with lexer generators, where null bytes could be
substituted by some non-zero code, which would not require changes to an
automatically generated lexer, apart from handling an extended character
set of at least 257 members. Opinions?


Unicode introduces a couple of problems into lexers, which I don't want
to discuss too deeply. Most important seems to be the expansion of the
character codes, from single to multiple bytes. Together with various
forms of encoding (UTF...), a getter method seems to be preferable over
using pointers. Then the getter can implement and encapsulate the
encoding, so that the lexer only has to deal with uniform (4 byte)
character codes.

In formal languages a mix of ASCII (keywords) and "real" Unicode parts
(literals...) can be expected, so that an optimized tokenizer should be
able to treat these parts differently. Again a top-down parser seems to
be preferable, because it can switch the tokenizer into the different
modes (SBCS, MBCS, binary...). But wouldn't it make sense, then, to
replace the lexer/tokenizer by multiple "matchers", which are called by
the parser as appropriate, and which can e.g. match expected literals
(required keywords...) without guessing of the kind of the next token.
But can such specialized functions really speed up or simplify parsing?


Back to binary files, including PDF, ZIP and other file formats with
structured content, a multi-level parser seems to be required in most
cases, which can separate the file into distinct parts, which can (often
must) be processed in random order. I vaguely remember an approach to
formalize the description and decoding of binary files - can somebody
give me more precise hints in that direction?

DoDi

cr88192

unread,
May 15, 2008, 9:13:16 PM5/15/08
to
"Hans-Peter Diettrich" <DrDiet...@aol.com> wrote in message

> I'm currently thinking about lexing/tokenizing binary data and Unicode,
> in e.g. PDF files, and possible optimizations in LL/PEG parsers.
>
> Binary data typically comes in fixed length chunks, where the parser has
> to provide the length and encoding of the next token. This procedure is
> quite different from lexing text, that's why I prefer the term
> "tokenizer" in this context, and bottom-up (LR) parsers seem not to be
> very usable in this case. A lexer at best can skip over binary data,
> what often is sufficient, but then it should not be confused by embedded
> null "characters" in the input. Traditional lexers often use null
> characters as end markers, for either the end of the whole input or for
> the end of the input buffer, so that this value needs special handling
> anyways. Then it should not be hard to add another meaning to such
> bytes, as legal members of the expected character set. But how to
> implement that handling?

<snip/>

> Back to binary files, including PDF, ZIP and other file formats with
> structured content, a multi-level parser seems to be required in most
> cases, which can separate the file into distinct parts, which can (often
> must) be processed in random order. I vaguely remember an approach to
> formalize the description and decoding of binary files - can somebody
> give me more precise hints in that direction?

IMO, parsers of this sort (especially text-intended parser
generators), are not a good approach to parsing binary data, nor do I
believe are the formalisms often employed, which are themselves
intended for textual structures.

Rather, I suggest looking at how such formats are actually structured,
and how they are commonly represented within format specs, and try to
derive from this a general notation, formalism, and possible
implementation.

an issue I think:
there are a wide variety of possible ways in which data can be, and often
is, structured:
uniform-sized blocks (common in filesystems, formats like GCF, B-Trees,
....);
structures and offsets (ELF, COFF, ...);
TLV structures (RIFF, PNG, ASN.1, ...);
bitstreams (Deflate, ...);
bytestreams (Java class, WBXML, many simpler LZ77 variants, ...);
escaped tags (JPEG, MPEG, FLAC, ZIP, ...);
....

So, rather than having a single formalism (such as LL or LR), more
likely, one will end up with a kind of mini programming language just
to specify how the pieces fit together.

presumed features:
structs (explicit on-disk layout), conditionals, support for variable length
fields and members, ability to deal with file offsets and similar ideas, ...

even then, there is likely to be a whole lot more that can't be formalized,
than that can be.

more likely, one can end up more with what would amount to a format-design
tool, than something that can actually reliably parse existing formats.

even then, it is usually not parsing the format that is the work, rather it
is processing the data that is usually a lot more work.


for example (just pulling something out of nowhere):
struct PACK_Entry {
char name[56];
u32le offset;
u32le size;
byte@offset data[size];
};

struct PACK_Header {
FOURCC magic="PACK";
u32le offset;
u32le ents;
PACK_Entry@offset dir[ents];
};

now, we can note that this is a trivial format, but even as such it is
awkward to formalize...

now, most other formats are nowhere near this trivial...


now, personally, I don't usually use such formalisms or tools, since
personally I have usually found it to be most effective simply to
write the parser myself, and if likely it were beyond my abilities to
write a parser, it would also likely be somewhat beyond the abilities
of these formalisms as well...

Dmitry A. Kazakov

unread,
May 16, 2008, 12:23:45 PM5/16/08
to
On Tue, 13 May 2008 14:51:21 +0200, Hans-Peter Diettrich wrote:

> I'm currently thinking about lexing/tokenizing binary data and Unicode,
> in e.g. PDF files, and possible optimizations in LL/PEG parsers.
>
> Binary data typically comes in fixed length chunks, where the parser has
> to provide the length and encoding of the next token. This procedure is
> quite different from lexing text, that's why I prefer the term
> "tokenizer" in this context, and bottom-up (LR) parsers seem not to be
> very usable in this case. A lexer at best can skip over binary data,
> what often is sufficient, but then it should not be confused by embedded
> null "characters" in the input.

I don't really understand the problem, maybe, you can elaborate
it. Why NUL is a problem and why tokens need to be "raw text." When I
do similar stuff, I do it in a way that the parser returned typed
objects rather than copies of the source. The whole idea to copy the
source is bogus, IMO.

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

Hans-Peter Diettrich

unread,
May 16, 2008, 6:08:05 PM5/16/08
to
cr88192 schrieb:

>> Back to binary files, including PDF, ZIP and other file formats with
>> structured content, a multi-level parser seems to be required in most
>> cases, which can separate the file into distinct parts, which can (often
>> must) be processed in random order. I vaguely remember an approach to
>> formalize the description and decoding of binary files - can somebody
>> give me more precise hints in that direction?
>
> IMO, parsers of this sort (especially text-intended parser
> generators), are not a good approach to parsing binary data, nor do I
> believe are the formalisms often employed, which are themselves
> intended for textual structures.

That's why I mentioned multi-level parsers, where the top level handles
the file structure. In PDF files the file structure is textual, with
embedded binary streams. In other files the structure can be binary,
with embedded text snippets.


> Rather, I suggest looking at how such formats are actually structured,
> and how they are commonly represented within format specs, and try to
> derive from this a general notation, formalism, and possible
> implementation.

It would be nice to start with some already existing formalism.

> an issue I think:
> there are a wide variety of possible ways in which data can be, and often
> is, structured:
> uniform-sized blocks (common in filesystems, formats like GCF, B-Trees,

> .....);


> structures and offsets (ELF, COFF, ...);
> TLV structures (RIFF, PNG, ASN.1, ...);
> bitstreams (Deflate, ...);
> bytestreams (Java class, WBXML, many simpler LZ77 variants, ...);
> escaped tags (JPEG, MPEG, FLAC, ZIP, ...);

> .....


>
> So, rather than having a single formalism (such as LL or LR), more
> likely, one will end up with a kind of mini programming language just
> to specify how the pieces fit together.

I hope to separate the data structures, as syntactical elements, from
attributes etc. as kind of semantic code. In the best case it should be
possible to derive both an serializer and an de-serializer from a given
formal description.


> presumed features:
> structs (explicit on-disk layout), conditionals, support for variable length
> fields and members, ability to deal with file offsets and similar ideas, ...
>
> even then, there is likely to be a whole lot more that can't be formalized,
> than that can be.

I don't expect that a single formalism can cover really all weird file
formats. But it would be nice to have more than a description of the
basic "tokens", which cannot be much more than numbers or strings - in a
lot of different formats, of course. I don't want to decode bitstreams
or encrypted data, that would be a task for an automaton in an sub-parser.


> more likely, one can end up more with what would amount to a format-design
> tool, than something that can actually reliably parse existing formats.
>
> even then, it is usually not parsing the format that is the work, rather it
> is processing the data that is usually a lot more work.

The task of the parser is finished, when a file is converted into a tree
structure, possibly with additional tables for file offsets, tag names
and more.

> for example (just pulling something out of nowhere):
> struct PACK_Entry {
> char name[56];
> u32le offset;
> u32le size;
> byte@offset data[size];
> };
>
> struct PACK_Header {
> FOURCC magic="PACK";
> u32le offset;
> u32le ents;
> PACK_Entry@offset dir[ents];
> };

Such a description indeed will cover the most common data structures, of
a fixed or counted length. We'll have to add (zero) terminated items
(strings and other vectors), and multi-level descriptions for offsets
(relative to some file section, table lookup...). A "whitespace"
equivalent will be required as well, for the description of alignment
and padding.


> now, personally, I don't usually use such formalisms or tools, since
> personally I have usually found it to be most effective simply to
> write the parser myself, and if likely it were beyond my abilities to
> write a parser, it would also likely be somewhat beyond the abilities
> of these formalisms as well...

I've decoded many file formats myself, and my most useful tool was a
hex viewer with formatting capabilities. It could display files as
tables of fixed-size records, where the record formats were described
by strings, with 'c' for a character, 'l' for a longword etc. The
table positions and formats were written to an file, ready for later
reuse. What I'm still missing is the next step, with more
sophisticated types, and an ability to convert the fixed table offsets
and counts into computed elements, as templates for viewing files of
the same type. It may be more promising to extend that old tool with
the capabilities, mentioned above, instead of formalizing all aspects
of automated parsing of non-textual files. BTW, IDA is a fine example
for such a tool...

DoDi

Hans Aberg

unread,
May 17, 2008, 5:13:26 AM5/17/08
to
Hans-Peter Diettrich wrote:
> Unicode introduces a couple of problems into lexers, which I don't want
> to discuss too deeply. Most important seems to be the expansion of the
> character codes, from single to multiple bytes.

Unicode regular expressions can be lexed directly by rewriting into UTF.
I posted some Haskell function for doing that here
http://lists.gnu.org/archive/html/help-flex/2005-01/msg00043.html

Hans Aberg

Hans-Peter Diettrich

unread,
May 17, 2008, 4:22:34 AM5/17/08
to
Dmitry A. Kazakov schrieb:

>> Binary data typically comes in fixed length chunks, where the parser has
>> to provide the length and encoding of the next token. This procedure is
>> quite different from lexing text, that's why I prefer the term
>> "tokenizer" in this context, and bottom-up (LR) parsers seem not to be
>> very usable in this case. A lexer at best can skip over binary data,
>> what often is sufficient, but then it should not be confused by embedded
>> null "characters" in the input.
>
> I don't really understand the problem, maybe, you can elaborate
> it. Why NUL is a problem and why tokens need to be "raw text."

These are problems of many (C based) lexer generators. A conversion,
into anything but an built-in data type, requires semantic code, which
makes a formal grammar usable only with the language of that code. Few
generators come with a built-in language for doing such conversions,
like MetaS or TextTransformer.

> When I do similar stuff, I do it in a way that the parser returned
> typed objects rather than copies of the source. The whole idea to
> copy the source is bogus, IMO.

Indeed, textual copies are of little use. Can you suggest a
descriptive formalism for the objects, returned by an lexer?

DoDi

cr88192

unread,
May 17, 2008, 7:06:04 PM5/17/08
to
"Hans-Peter Diettrich" <DrDiet...@aol.com> wrote in message

not sure if useful, but, in my recent incarnations of my compiler, the
parse, and many other subsystems, are making use of XML as the native
representation.

sadly, though XML is acceptable for textual inputs, for binary formats it is
less sensible.


another possible notation is a variation of S-Expressions, which tends to be
a lot more compact and concise than XML (both notationally, and in terms of
memory usage as well).

I had also used them sometimes within my compiler as well. the main
disadvantage of s-exps, however, is that they tend to be far less flexible
and annotatable than XML (this has often been an issue in my case).

with XML-based designs, often one can add new tags and attributes as needed
without needing to change existing code, but s-exp based notations tend to
lack this ability, usually requiring things to be explicitly designed and
requiring structural alteration (aka: lots of code modification...) to deal
with simple design changes or feature additions (sadly... it is often more
painful to deal with this issue with s-exps than with raw structs).


a kind of hybrid system could also be possible, combining both
binary-encoded datums (such as integer values) with a more open-ended
structure (tag based rather than positional).
this is possible with s-exps, albeit it is not common practice (it being
much more common to use then with a good old LISP-like positional
structure...).

also possible would be, rather than directly working with s-exps, making use
of a kind of schema, such that we can change the schema and also the
structure, without having to change all of the code (but, then, we almost
may as well be using a binary representation...).

or such...

cr88192

unread,
May 17, 2008, 7:51:34 PM5/17/08
to
"Hans-Peter Diettrich" <DrDiet...@aol.com> wrote in message
> cr88192 schrieb:
>
>>> Back to binary files, including PDF, ZIP and other file formats with
>>> structured content, a multi-level parser seems to be required in most
>>> cases, which can separate the file into distinct parts, which can (often
>>> must) be processed in random order. I vaguely remember an approach to
>>> formalize the description and decoding of binary files - can somebody
>>> give me more precise hints in that direction?
>>
>> IMO, parsers of this sort (especially text-intended parser
>> generators), are not a good approach to parsing binary data, nor do I
>> believe are the formalisms often employed, which are themselves
>> intended for textual structures.
>
> That's why I mentioned multi-level parsers, where the top level handles
> the file structure. In PDF files the file structure is textual, with
> embedded binary streams. In other files the structure can be binary,
> with embedded text snippets.
>

ok.

>
>> Rather, I suggest looking at how such formats are actually structured,
>> and how they are commonly represented within format specs, and try to
>> derive from this a general notation, formalism, and possible
>> implementation.
>
> It would be nice to start with some already existing formalism.
>

should one exist...


>> an issue I think:
>> there are a wide variety of possible ways in which data can be, and often
>> is, structured:
>> uniform-sized blocks (common in filesystems, formats like GCF, B-Trees,
>> .....);
>> structures and offsets (ELF, COFF, ...);
>> TLV structures (RIFF, PNG, ASN.1, ...);
>> bitstreams (Deflate, ...);
>> bytestreams (Java class, WBXML, many simpler LZ77 variants, ...);
>> escaped tags (JPEG, MPEG, FLAC, ZIP, ...);
>> .....
>>
>> So, rather than having a single formalism (such as LL or LR), more
>> likely, one will end up with a kind of mini programming language just
>> to specify how the pieces fit together.
>
> I hope to separate the data structures, as syntactical elements, from
> attributes etc. as kind of semantic code. In the best case it should be
> possible to derive both an serializer and an de-serializer from a given
> formal description.
>

ok.

so, one first describes all of the pieces, and then later how they are
assembled?...
ok, but it may be difficult to pull off...


>
>> presumed features:
>> structs (explicit on-disk layout), conditionals, support for variable
>> length
>> fields and members, ability to deal with file offsets and similar ideas,
>> ...
>>
>> even then, there is likely to be a whole lot more that can't be
>> formalized,
>> than that can be.
>
> I don't expect that a single formalism can cover really all weird file
> formats. But it would be nice to have more than a description of the
> basic "tokens", which cannot be much more than numbers or strings - in a
> lot of different formats, of course. I don't want to decode bitstreams
> or encrypted data, that would be a task for an automaton in an sub-parser.
>

yes, these would be types.
however, I find it doubtful that they could be recognized automatically,
except maybe in some crude likilihood-based measure ("this looks like it is
probably a 32 bit integer"...).


>
>> more likely, one can end up more with what would amount to a
>> format-design
>> tool, than something that can actually reliably parse existing formats.
>>
>> even then, it is usually not parsing the format that is the work, rather
>> it
>> is processing the data that is usually a lot more work.
>
> The task of the parser is finished, when a file is converted into a tree
> structure, possibly with additional tables for file offsets, tag names
> and more.
>

yes, ok.

however, in many formats, the parsing and processing are heavily
interconnected.
for example, how do you decode a JPEG, apart from first processing the more
basic elements (headers, huffman table, ...).

of course, one could limit themselves to context-independent fileformats,
but this is fairly limiting.


>> for example (just pulling something out of nowhere):
>> struct PACK_Entry {
>> char name[56];
>> u32le offset;
>> u32le size;
>> byte@offset data[size];
>> };
>>
>> struct PACK_Header {
>> FOURCC magic="PACK";
>> u32le offset;
>> u32le ents;
>> PACK_Entry@offset dir[ents];
>> };
>
> Such a description indeed will cover the most common data structures, of
> a fixed or counted length. We'll have to add (zero) terminated items
> (strings and other vectors), and multi-level descriptions for offsets
> (relative to some file section, table lookup...). A "whitespace"
> equivalent will be required as well, for the description of alignment
> and padding.
>

as noted, I also said that there would be conditionals, that or we could
also support BNF-based descriptions.

u64 uvli() {
byte v;
return((v&0x80)?(((v&0x7f)<<7)|uvli()):(v&0x7f));
};

u64 svli() {
uvli v;
return((v&1)?(-((v+1)>>1)):(v>>1));
}

of course, this does not make it clear how to encode these types...


>
>> now, personally, I don't usually use such formalisms or tools, since
>> personally I have usually found it to be most effective simply to
>> write the parser myself, and if likely it were beyond my abilities to
>> write a parser, it would also likely be somewhat beyond the abilities
>> of these formalisms as well...
>
> I've decoded many file formats myself, and my most useful tool was a
> hex viewer with formatting capabilities. It could display files as
> tables of fixed-size records, where the record formats were described
> by strings, with 'c' for a character, 'l' for a longword etc. The
> table positions and formats were written to an file, ready for later
> reuse. What I'm still missing is the next step, with more
> sophisticated types, and an ability to convert the fixed table offsets
> and counts into computed elements, as templates for viewing files of
> the same type. It may be more promising to extend that old tool with
> the capabilities, mentioned above, instead of formalizing all aspects
> of automated parsing of non-textual files. BTW, IDA is a fine example
> for such a tool...
>

your intent is for reverse engineering or something?...
that is what this sounds like to me at least...

or, at least, in my case when examining files (usually to verify encoders or
decoders are working about right) I often end up decomposing the structure
in my head. this works fairly well for byte-based formats at least, but
things like huffman-coded data or highly compressed formats go beyond my
mental abilities to decode.

usually for more complex or bulky formats, I write special tools...

Dmitry A. Kazakov

unread,
May 18, 2008, 4:29:59 AM5/18/08
to
On Sat, 17 May 2008 10:22:34 +0200, Hans-Peter Diettrich wrote:

> Dmitry A. Kazakov schrieb:


>
>> When I do similar stuff, I do it in a way that the parser returned
>> typed objects rather than copies of the source. The whole idea to
>> copy the source is bogus, IMO.
>
> Indeed, textual copies are of little use. Can you suggest a
> descriptive formalism for the objects, returned by an lexer?

Not with a bottom-up approach. But when parser does it top-down or
else somewhere in the middle, it well knows what to expect at the
cursor. Being at the top it knows the exact type, so that parsing
either fails or yields a token. Below that it knows only some set of
types, i.e. in OO terms, a class of types. In this case the returned
token would be a polymorphic object from that class (or else a
failure). The class could be like "infix operation","literal" etc. In
fact, this is merely the abstract factory pattern. The parser acts a
factory, the parsed source at the cursor determines the concrete token
type and then its value.

I think this could be formalized. One premise is that the set of
tokens forms a tree/forest-like hierarchy, which is, I believe, almost
always the case.

Hans-Peter Diettrich

unread,
May 18, 2008, 4:37:49 AM5/18/08
to
cr88192 schrieb:

>> I hope to separate the data structures, as syntactical elements, from
>> attributes etc. as kind of semantic code. In the best case it should be
>> possible to derive both an serializer and an de-serializer from a given
>> formal description.
>>
>
> ok.
>
> so, one first describes all of the pieces, and then later how they are
> assembled?...
> ok, but it may be difficult to pull off...

You are right. Looking at some popular binary file formats, the
construction (writing) must be done sequentially, with possible
patches of offsets in preceding tables. When an offset table is
written last, it must be read in first, from the end of the file. I'm
not sure whether a formal description of such procedures is possible,
for use in both reading and writing such an file. An according grammar
may become context sensitive, what classifies the problem as very
interesting from the scientific point of view, but it's unlikely that
it will result in a usable tool.


> as noted, I also said that there would be conditionals, that or we could
> also support BNF-based descriptions.
>
> u64 uvli() {
> byte v;
> return((v&0x80)?(((v&0x7f)<<7)|uvli()):(v&0x7f));
> };
>
> u64 svli() {
> uvli v;
> return((v&1)?(-((v+1)>>1)):(v>>1));
> }
>
> of course, this does not make it clear how to encode these types...

A common example would be UTF-8 encoding/decoding, which is easily
described in a procedural way, and possibly also in a grammar, but
deriving the code from such a grammar seems to exceed my capabilities.
<sigh>


> your intent is for reverse engineering or something?...
> that is what this sounds like to me at least...

That's the background, how I came to parsers at all ;-)

> usually for more complex or bulky formats, I write special tools...

After considering all the topics, mentioned in this thread, I better
leave the theory to other people, too. As you stated before:


>>
more likely, one can end up more with what would amount to a
format-design tool, than something that can actually reliably parse
existing formats.
<<

DoDi

cr88192

unread,
May 20, 2008, 5:54:45 AM5/20/08
to
"Hans-Peter Diettrich" <DrDiet...@aol.com> wrote in message
> cr88192 schrieb:
>
>>> I hope to separate the data structures, as syntactical elements, from
>>> attributes etc. as kind of semantic code. In the best case it should be
>>> possible to derive both an serializer and an de-serializer from a given
>>> formal description.
>>>
>>
>> ok.
>>
>> so, one first describes all of the pieces, and then later how they are
>> assembled?...
>> ok, but it may be difficult to pull off...
>
> You are right. Looking at some popular binary file formats, the
> construction (writing) must be done sequentially, with possible
> patches of offsets in preceding tables. When an offset table is
> written last, it must be read in first, from the end of the file. I'm
> not sure whether a formal description of such procedures is possible,
> for use in both reading and writing such an file. An according grammar
> may become context sensitive, what classifies the problem as very
> interesting from the scientific point of view, but it's unlikely that
> it will result in a usable tool.

yes, this is one problem.

with format writing, the problem is made a little worse, for example, in the
case of formats which can't easily be written sequentially, or have a
non-trivial representation for storage management.

this would seem to be the case, for example, for filesystem-like container
formats...

>> as noted, I also said that there would be conditionals, that or we could
>> also support BNF-based descriptions.
>>
>> u64 uvli() {
>> byte v;
>> return((v&0x80)?(((v&0x7f)<<7)|uvli()):(v&0x7f));
>> };
>>
>> u64 svli() {
>> uvli v;
>> return((v&1)?(-((v+1)>>1)):(v>>1));
>> }
>>
>> of course, this does not make it clear how to encode these types...
>
> A common example would be UTF-8 encoding/decoding, which is easily
> described in a procedural way, and possibly also in a grammar, but
> deriving the code from such a grammar seems to exceed my capabilities.
> <sigh>

yeah. likely any values of the sort would need builtin types, leading to
incompleteness issues (the formalism being unable to even represent many of
its basic constructs).

or, at least this would be the abstract case. including a C-like
mini-language for writing serializers and deserializers would be helpful,
but would reduce this from the level of a formalism to a special purpose
programming language.

>> your intent is for reverse engineering or something?...
>> that is what this sounds like to me at least...
>
> That's the background, how I came to parsers at all ;-)

ok.
I came from the direction of writing load/save code, and later interpreters.

long ago, some of my more-complex formats had a vaguely POV-ray style
syntax, and I later write a scheme interpreter. my skills gradually
improved, and my ability to write parsers somewhat improved.

2001: POV-ray like syntax (fileformats);
2002: Scheme interpreter
2003: well, graduated HS, worth note as a reference point, Scheme
interpreter inflated into an unmaintainable horror...
2004: new language, JS-like syntax (remaining Scheme influences)
2006: new language, C/JS hybrid style (experimentally used JIT compilation)
2007: harsh jump, compiler for C proper
2008: gradually rewriting the C compiler piece by piece it seems...

note that jumping from a JavaScript-like language to C is far more work than
would seem implied by the syntex (internally, C is a beast which is severely
different WRT the internals).


>> usually for more complex or bulky formats, I write special tools...
>
> After considering all the topics, mentioned in this thread, I better
> leave the theory to other people, too. As you stated before:
> >>
> more likely, one can end up more with what would amount to a
> format-design tool, than something that can actually reliably parse
> existing formats.
> <<
>

yeah.

sadly, I am not even really sure if a general solution is possible, albeit
clean design of fileformats is a worthwhile goal, I am not even so sure this
is possible (AFAIK ASN.1 was one attempt at this problem, but it is far from
a universal solution IMO).


possibly relevant (I have not really looked much into it), a format known as
TSN.1:
http://www.protomatics.com/TSN1_Spec.pdf

0 new messages