Parsing Binary Structures; Is there a better way / What is your way?

Martin P. Hellwig

unread,

Aug 5, 2009, 10:46:13 AM8/5/09

to

Hi List,

On several occasions I have needed (and build) a parser that reads a
binary piece of data with custom structure. For example (bogus one):

what I usually do is read the packet in binary mode, convert the output
to a concatenated 'binary string'(i.e. '0101011000110') and then use
slice indeces to get the right data portions.
Depending on what I need to do with these portions I convert them to
whatever is handy (usually an integer).

This works out fine for me. Most of the time I also put the ASCII art
diagram of this 'protocol' as a comment in the code, making it more
readable/understandable.

Though there are a couple of things that bothers me with my approach:
- This seems such a general problem that I think that there must be
already a general pythonic solution.
- Using a string for binary representation takes at least 8 times more
memory for the packet than strictly necessary.
- Seems to need a lot of prep work before doing the actual parsing.

Any suggestion is greatly appreciated.

--
MPH
http://blog.dcuktec.com
'If consumed, best digested with added seasoning to own preference.'

Jon Clements

unread,

Aug 5, 2009, 11:39:07 AM8/5/09

to

On 5 Aug, 15:46, "Martin P. Hellwig" <martin.hell...@dcuktec.org>
wrote:

> MPHhttp://blog.dcuktec.com

> 'If consumed, best digested with added seasoning to own preference.'

IIRC (and I have my doubts) the BitVector module may be of use, but
it's been about 3 years since I had to look at it. I think it used the
C equiv. of short ints to do its work. Otherwise, maybe the array
module, the struct module or even possibly ctypes.

Not much use, but might give a few pointers.

Jon.

Martin P. Hellwig

unread,

Aug 5, 2009, 1:05:08 PM8/5/09

to

Jon Clements wrote:
<cut>

> IIRC (and I have my doubts) the BitVector module may be of use, but
> it's been about 3 years since I had to look at it. I think it used the
> C equiv. of short ints to do its work. Otherwise, maybe the array
> module, the struct module or even possibly ctypes.
>
> Not much use, but might give a few pointers.
>
> Jon.

I tried struct before (it seemed logical) but it didn't really do what I
expected:
>>> import struct
>>> struct.unpack('?','x')
(True,)

I expected a list like (False, True, True, True, True, False, False,
False). But it is (actually I hope) just an error of my side, so if
somebody knows how to achieve what I want, with this module, please
enlighten me :-)

The array module doesn't seem to have an option for boolean values.

BitVector does look very promising, but after peeking in the source it
does more or less the same as what I do with converting back and forth
of 'bitstrings'.

I haven't thought about using ctypes for this, excellent suggestion!
Thank you for your feedback.

--
MPH

Bearophile

unread,

Aug 5, 2009, 1:17:29 PM8/5/09

to

Martin P. Hellwig:

> On several occasions I have needed (and build) a parser that reads a
> binary piece of data with custom structure. For example (bogus one):
>
> BE
> +---------+---------+-------------+-------------+------+--------+
> | Version | Command | Instruction | Data Length | Data | Filler |
> +---------+---------+-------------+-------------+------+--------+
> Version: 6 bits
> Command: 4 bits
> Instruction: 5 bits
> Data Length: 5 bits
> Data: 0-31 bits
> Filler: filling 0 bits to make the packet dividable by 8

Have you tried Hachoir? (I think its name may be changed to Fusil, I
don't know).

Bye,
bearophile

Masklinn

unread,

Aug 5, 2009, 1:22:42 PM8/5/09

to pytho...@python.org

The gold standard for binary parsing (and serialization) is probably
Erlang's bit syntax, but as far as Python goes you might be interested
by Hachoir (http://hachoir.org/ but it seems down right now).

It's not going to match your second point, but it can probably help
with the rest (caveat: I haven't used hachoir personally).

Masklinn

unread,

Aug 5, 2009, 1:26:53 PM8/5/09

to pytho...@python.org

On 5 Aug 2009, at 19:17 , Bearophile wrote:
> Have you tried Hachoir? (I think its name may be changed to Fusil, I
> don't know).

Name hasn't been changed (I think fusil is a subproject, something
like that) on the other hand the hachoir.org site is dead.

But apparently Hachoir was moved to bitbucket (http://bitbucket.org/haypo/hachoir/wiki/Home
) just in time so…

Paul Rubin

unread,

Aug 5, 2009, 2:12:05 PM8/5/09

to

"Martin P. Hellwig" <martin....@dcuktec.org> writes:
> what I usually do is read the packet in binary mode, convert the
> output to a concatenated 'binary string'(i.e. '0101011000110') and

Something wrong with reading the data words as an integer and using
old fashioned shifts and masks to get at the bit fields?

Martin P. Hellwig

unread,

Aug 5, 2009, 2:46:12 PM8/5/09

to

No not at all of course, I just found it more readable to slice it
instead of shifting and masking; i.e. if the ASCII art schematics of the
protocol are included as a comment it is almost an 1 to 1 implementation.

Is there an advantage using shifts and masks over my kitchen type solution?

Paul Rubin

unread,

Aug 5, 2009, 2:51:06 PM8/5/09

to

"Martin P. Hellwig" <martin....@dcuktec.org> writes:
> Is there an advantage using shifts and masks over my kitchen type solution?

Weren't you complaining about the 8-to-1 expansion from turning each bit
to an ascii char?

Martin P. Hellwig

unread,

Aug 5, 2009, 3:41:26 PM8/5/09

to

Yes you are (of course) right, my 'dream' solution would be something
that accepts slice indeces on bit level. Your reasoning did reveal some
flaws in my approach though ;-)

Jon Clements

unread,

Aug 5, 2009, 4:39:33 PM8/5/09

to

On 5 Aug, 20:41, "Martin P. Hellwig" <martin.hell...@dcuktec.org>
wrote:
> Paul Rubin wrote:

> > "Martin P. Hellwig" <martin.hell...@dcuktec.org> writes:
> >> Is there an advantage using shifts and masks over my kitchen type solution?
>
> > Weren't you complaining about the 8-to-1 expansion from turning each bit
> > to an ascii char?
>
> Yes you are (of course) right, my 'dream' solution would be something
> that accepts slice indeces on bit level. Your reasoning did reveal some
> flaws in my approach though ;-)
>
> --

> MPHhttp://blog.dcuktec.com

> 'If consumed, best digested with added seasoning to own preference.'

Ahh huh -- found it...

ID: DLT119
Desc: datasyzygy misc. development, production and accounts.
Date: 2008-12-09
File: pybits.py
"""
Implements a python level interface to reading bit fields with an
associated type. Fields are accessed via __getitem__ and must be a
valid python name. eg:

class MyStructure:
a = BitField(4, int)
b = BitField(6, int)
c = BitField(2, int)
d = BitField(36, str)
"""

Ummm, seems I left the rest pretty vague, but I'm guessing that's what
you're after, and I'm fairly curious to see what I wrote... mind you,
not too hard to write again now that's jogged my memory! (although
after a couple at the pub, not going to do it tonight!)

Cheers,

Jon.

Gabriel Genellina

unread,

Aug 6, 2009, 1:50:01 AM8/6/09

to pytho...@python.org

En Wed, 05 Aug 2009 11:46:13 -0300, Martin P. Hellwig
<martin....@dcuktec.org> escribi�:

> On several occasions I have needed (and build) a parser that reads a
> binary piece of data with custom structure. For example (bogus one):
>
> BE
> +---------+---------+-------------+-------------+------+--------+
> | Version | Command | Instruction | Data Length | Data | Filler |
> +---------+---------+-------------+-------------+------+--------+
> Version: 6 bits
> Command: 4 bits
> Instruction: 5 bits
> Data Length: 5 bits
> Data: 0-31 bits
> Filler: filling 0 bits to make the packet dividable by 8

> - Using a string for binary representation takes at least 8 times more

> memory for the packet than strictly necessary.

The size difference isn't so big; an integer takes 12 bytes and a string
takes 24+len bytes. "Data" above would take 56 bytes max when stored as a
string '0110001...' vs. 12 bytes when using a plain integer (sizes
computed on Windows 32bits). Plus, doing bitwise operations in Python
isn't a fast operation as it is in C, by example -- so your current
implementation might be a quite good one (in pure Python, I mean).

--
Gabriel Genellina

Dave Angel

unread,

Aug 6, 2009, 2:45:01 AM8/6/09

to pythonlist

One warning to Martin:

If you want your code portable across systems, watch out for
big-endian/little-endian issues, as well as alignment ones. Shift &
mask code tends to be highly specific to a particular endian-ness,
especially if trying to get multiple bits that cross a byte or word
boundary.

Over the years, I know I've seen at least three endian versions for the
same 32bit word. Something like abcd, dcba, and cdab.

One advantage of converting first to bitstrings, is that there's just
the two places to fix, for portability, the conversion from byte array
to bitstring, and the conversion back.

DaveA

Hendrik van Rooyen

unread,

Aug 6, 2009, 3:39:33 AM8/6/09

to Martin P. Hellwig, pytho...@python.org

On Wednesday 05 August 2009 16:46:13 Martin P. Hellwig wrote:
> Hi List,
>
> On several occasions I have needed (and build) a parser that reads a
> binary piece of data with custom structure. For example (bogus one):
>
> BE
> +---------+---------+-------------+-------------+------+--------+
>
> | Version | Command | Instruction | Data Length | Data | Filler |
>
> +---------+---------+-------------+-------------+------+--------+
> Version: 6 bits
> Command: 4 bits
> Instruction: 5 bits
> Data Length: 5 bits
> Data: 0-31 bits
> Filler: filling 0 bits to make the packet dividable by 8
>
> what I usually do is read the packet in binary mode, convert the output
> to a concatenated 'binary string'(i.e. '0101011000110') and then use
> slice indeces to get the right data portions.
> Depending on what I need to do with these portions I convert them to
> whatever is handy (usually an integer).

This bit banging stuff is a PITA, no matter what you do.
Python does not have bit fields like C.
And C bit fields are implementation dependent.
Write an extension?

Some time ago I asked a similar question, and
Castironpi came up with what was essentially an
indexed integer, with named bits.

It stores the bits natively, but I suspect that the
price paid is access time.

I enclose a module that you can adapt.
It talks about bytes but they are integers really.
It is different from what you are doing, as it
was aimed at reading and writing bits in
a hardware context.

If you get your head around the concept,
then it may give you some ideas. It should
be possible to extend the concept to
pass name,length tuples at construction time
instead of just a name with an implied length
of one bit, and it may make sense to change
the underlying type from an integer to an
array of one byte integers.

It is nice to be able to say:
val = bitname()
to read, and
bitname(1)
or
bitname(0)
to write.

I can also write:
outputbits[3] = 1
or
val = inputbits[5]

If you can successfully generalise it
to field names It should be very useful.

I cannot think of a way though, to not
have the "in" and "out" split, but you can
program your way around that - you do not
have to update "in place".

- Hendrik

bits.py

Hendrik van Rooyen

unread,

Aug 6, 2009, 3:58:30 AM8/6/09

to pytho...@python.org

There is nothing wrong with that, and in the final analysis, it is what
is used - however, I can sympathize with the OP, as it is much nicer
to just call a named thing than to juggle a bunch of hard coded
shifts and masks.

I think the real point here is sort of meta programming - what
is needed is a way to describe and name the elements of the
structure so that useful python objects are created.

The code I posted for bits does that, and I use it all the
time - it is almost as good as being back in 8031 assembler
with direct named bit access.

- Hendrik

Hendrik van Rooyen

unread,

Aug 6, 2009, 4:06:04 AM8/6/09

to Martin P. Hellwig, pytho...@python.org

On Wednesday 05 August 2009 21:41:26 Martin P. Hellwig wrote:

> Yes you are (of course) right, my 'dream' solution would be something
> that accepts slice indeces on bit level. Your reasoning did reveal some
> flaws in my approach though ;-)

This is the first time I have been compared to the sandman...

:-)

- Hendrik

sgriffiths

unread,

Aug 6, 2009, 4:44:19 AM8/6/09

to

On Aug 5, 3:46 pm, "Martin P. Hellwig" <martin.hell...@dcuktec.org>
wrote:

> MPHhttp://blog.dcuktec.com

> 'If consumed, best digested with added seasoning to own preference.'

Take a look at the bitstring module (in pypi or google code). It's
designed to help make this sort of thing easy and it's more fully
featured than BitVector or BitSet. Internally the data is stored as a
byte array, so memory isn't wasted. It will also do all the dirty work
of bit masking and shifting so that you can concentrate on the real
problems. For example:

>>> s = BitString('0x1232432312') # just to give us some data to play with
>>> ver, comm, instr, bitlen = s.read('uint6, bin4, bin5, uint5')
>>> data = s.readbits(bitlen)

Different interpretations of the binary data are given using Python
properties (e.g. s.hex, s.oct, s.uint, etc.) and it supports bit-wise
slicing, modification, finding, replacing and more. It is also still
in active development (full disclosure: I'm the author :-)).

Martin P. Hellwig

unread,

Aug 6, 2009, 2:50:30 PM8/6/09

to

Thanks all for your insights and suggestions.
It seems to me that there are a couple of ways to this bit manipulation
and a couple of foreign modules to assist you with that.

Would it be worth the while to do a PEP on this?
Personally I think that it would be nice to have a standard module in
Python for this but perhaps there is a good reason that there isn't
already one.

Hendrik van Rooyen

unread,

Aug 7, 2009, 3:15:34 AM8/7/09

to pytho...@python.org

On Thursday 06 August 2009 20:50:30 Martin P. Hellwig wrote:
> Thanks all for your insights and suggestions.
> It seems to me that there are a couple of ways to this bit manipulation
> and a couple of foreign modules to assist you with that.
>
> Would it be worth the while to do a PEP on this?
> Personally I think that it would be nice to have a standard module in
> Python for this but perhaps there is a good reason that there isn't
> already one.

I agree that it would be nice - I think it would be nice if something
like this is added to the struct module to give a "this is the obvious
way that python handles foreign data" module

- Hendrik

andrew cooke

unread,

Aug 7, 2009, 7:29:13 AM8/7/09

to

On Aug 5, 10:46 am, "Martin P. Hellwig" <martin.hell...@dcuktec.org>
wrote:

> Hi List,
>
> On several occasions I have needed (and build) a parser that reads a
> binary piece of data with custom structure. For example (bogus one):
>
> BE
> +---------+---------+-------------+-------------+------+--------+
> | Version | Command | Instruction | Data Length | Data | Filler |
> +---------+---------+-------------+-------------+------+--------+
> Version: 6 bits
> Command: 4 bits
> Instruction: 5 bits
> Data Length: 5 bits
> Data: 0-31 bits
> Filler: filling 0 bits to make the packet dividable by 8

hi,

sorry i'm a bit late here, but lepl does exactly this. also, as you
asked for in another post, the BitString class allows arbitrary
indexing into a sequence of bits. and because it's part of a
recursive descent parser you can match "anything" (lepl will handle -
although less efficiently - left recursive and ambiguous grammars).

see the example at http://www.acooke.org/lepl/binary.html#matching
which constructs an ethernet frame and then parses it, extracting the
source and destination addresses.

feel free to email me with more questions.

disclaimer: this is quite new and i don't know of anyone that actually
uses it; it is also Python3 only (because it uses bytes()).

andrew