On several occasions I have needed (and build) a parser that reads a
binary piece of data with custom structure. For example (bogus one):
BE
+---------+---------+-------------+-------------+------+--------+
| Version | Command | Instruction | Data Length | Data | Filler |
+---------+---------+-------------+-------------+------+--------+
Version: 6 bits
Command: 4 bits
Instruction: 5 bits
Data Length: 5 bits
Data: 0-31 bits
Filler: filling 0 bits to make the packet dividable by 8
what I usually do is read the packet in binary mode, convert the output
to a concatenated 'binary string'(i.e. '0101011000110') and then use
slice indeces to get the right data portions.
Depending on what I need to do with these portions I convert them to
whatever is handy (usually an integer).
This works out fine for me. Most of the time I also put the ASCII art
diagram of this 'protocol' as a comment in the code, making it more
readable/understandable.
Though there are a couple of things that bothers me with my approach:
- This seems such a general problem that I think that there must be
already a general pythonic solution.
- Using a string for binary representation takes at least 8 times more
memory for the packet than strictly necessary.
- Seems to need a lot of prep work before doing the actual parsing.
Any suggestion is greatly appreciated.
--
MPH
http://blog.dcuktec.com
'If consumed, best digested with added seasoning to own preference.'
IIRC (and I have my doubts) the BitVector module may be of use, but
it's been about 3 years since I had to look at it. I think it used the
C equiv. of short ints to do its work. Otherwise, maybe the array
module, the struct module or even possibly ctypes.
Not much use, but might give a few pointers.
Jon.
I tried struct before (it seemed logical) but it didn't really do what I
expected:
>>> import struct
>>> struct.unpack('?','x')
(True,)
I expected a list like (False, True, True, True, True, False, False,
False). But it is (actually I hope) just an error of my side, so if
somebody knows how to achieve what I want, with this module, please
enlighten me :-)
The array module doesn't seem to have an option for boolean values.
BitVector does look very promising, but after peeking in the source it
does more or less the same as what I do with converting back and forth
of 'bitstrings'.
I haven't thought about using ctypes for this, excellent suggestion!
Thank you for your feedback.
--
MPH
Have you tried Hachoir? (I think its name may be changed to Fusil, I
don't know).
Bye,
bearophile
It's not going to match your second point, but it can probably help
with the rest (caveat: I haven't used hachoir personally).
But apparently Hachoir was moved to bitbucket (http://bitbucket.org/haypo/hachoir/wiki/Home
) just in time so…
Something wrong with reading the data words as an integer and using
old fashioned shifts and masks to get at the bit fields?
No not at all of course, I just found it more readable to slice it
instead of shifting and masking; i.e. if the ASCII art schematics of the
protocol are included as a comment it is almost an 1 to 1 implementation.
Is there an advantage using shifts and masks over my kitchen type solution?
Weren't you complaining about the 8-to-1 expansion from turning each bit
to an ascii char?
Yes you are (of course) right, my 'dream' solution would be something
that accepts slice indeces on bit level. Your reasoning did reveal some
flaws in my approach though ;-)
Ahh huh -- found it...
ID: DLT119
Desc: datasyzygy misc. development, production and accounts.
Date: 2008-12-09
File: pybits.py
"""
Implements a python level interface to reading bit fields with an
associated type. Fields are accessed via __getitem__ and must be a
valid python name. eg:
class MyStructure:
a = BitField(4, int)
b = BitField(6, int)
c = BitField(2, int)
d = BitField(36, str)
"""
Ummm, seems I left the rest pretty vague, but I'm guessing that's what
you're after, and I'm fairly curious to see what I wrote... mind you,
not too hard to write again now that's jogged my memory! (although
after a couple at the pub, not going to do it tonight!)
Cheers,
Jon.
> On several occasions I have needed (and build) a parser that reads a
> binary piece of data with custom structure. For example (bogus one):
>
> BE
> +---------+---------+-------------+-------------+------+--------+
> | Version | Command | Instruction | Data Length | Data | Filler |
> +---------+---------+-------------+-------------+------+--------+
> Version: 6 bits
> Command: 4 bits
> Instruction: 5 bits
> Data Length: 5 bits
> Data: 0-31 bits
> Filler: filling 0 bits to make the packet dividable by 8
> - Using a string for binary representation takes at least 8 times more
> memory for the packet than strictly necessary.
The size difference isn't so big; an integer takes 12 bytes and a string
takes 24+len bytes. "Data" above would take 56 bytes max when stored as a
string '0110001...' vs. 12 bytes when using a plain integer (sizes
computed on Windows 32bits). Plus, doing bitwise operations in Python
isn't a fast operation as it is in C, by example -- so your current
implementation might be a quite good one (in pure Python, I mean).
--
Gabriel Genellina
If you want your code portable across systems, watch out for
big-endian/little-endian issues, as well as alignment ones. Shift &
mask code tends to be highly specific to a particular endian-ness,
especially if trying to get multiple bits that cross a byte or word
boundary.
Over the years, I know I've seen at least three endian versions for the
same 32bit word. Something like abcd, dcba, and cdab.
One advantage of converting first to bitstrings, is that there's just
the two places to fix, for portability, the conversion from byte array
to bitstring, and the conversion back.
DaveA
This bit banging stuff is a PITA, no matter what you do.
Python does not have bit fields like C.
And C bit fields are implementation dependent.
Write an extension?
Some time ago I asked a similar question, and
Castironpi came up with what was essentially an
indexed integer, with named bits.
It stores the bits natively, but I suspect that the
price paid is access time.
I enclose a module that you can adapt.
It talks about bytes but they are integers really.
It is different from what you are doing, as it
was aimed at reading and writing bits in
a hardware context.
If you get your head around the concept,
then it may give you some ideas. It should
be possible to extend the concept to
pass name,length tuples at construction time
instead of just a name with an implied length
of one bit, and it may make sense to change
the underlying type from an integer to an
array of one byte integers.
It is nice to be able to say:
val = bitname()
to read, and
bitname(1)
or
bitname(0)
to write.
I can also write:
outputbits[3] = 1
or
val = inputbits[5]
If you can successfully generalise it
to field names It should be very useful.
I cannot think of a way though, to not
have the "in" and "out" split, but you can
program your way around that - you do not
have to update "in place".
- Hendrik
There is nothing wrong with that, and in the final analysis, it is what
is used - however, I can sympathize with the OP, as it is much nicer
to just call a named thing than to juggle a bunch of hard coded
shifts and masks.
I think the real point here is sort of meta programming - what
is needed is a way to describe and name the elements of the
structure so that useful python objects are created.
The code I posted for bits does that, and I use it all the
time - it is almost as good as being back in 8031 assembler
with direct named bit access.
- Hendrik
> Yes you are (of course) right, my 'dream' solution would be something
> that accepts slice indeces on bit level. Your reasoning did reveal some
> flaws in my approach though ;-)
This is the first time I have been compared to the sandman...
:-)
- Hendrik
Take a look at the bitstring module (in pypi or google code). It's
designed to help make this sort of thing easy and it's more fully
featured than BitVector or BitSet. Internally the data is stored as a
byte array, so memory isn't wasted. It will also do all the dirty work
of bit masking and shifting so that you can concentrate on the real
problems. For example:
>>> s = BitString('0x1232432312') # just to give us some data to play with
>>> ver, comm, instr, bitlen = s.read('uint6, bin4, bin5, uint5')
>>> data = s.readbits(bitlen)
Different interpretations of the binary data are given using Python
properties (e.g. s.hex, s.oct, s.uint, etc.) and it supports bit-wise
slicing, modification, finding, replacing and more. It is also still
in active development (full disclosure: I'm the author :-)).
Would it be worth the while to do a PEP on this?
Personally I think that it would be nice to have a standard module in
Python for this but perhaps there is a good reason that there isn't
already one.
I agree that it would be nice - I think it would be nice if something
like this is added to the struct module to give a "this is the obvious
way that python handles foreign data" module
- Hendrik
hi,
sorry i'm a bit late here, but lepl does exactly this. also, as you
asked for in another post, the BitString class allows arbitrary
indexing into a sequence of bits. and because it's part of a
recursive descent parser you can match "anything" (lepl will handle -
although less efficiently - left recursive and ambiguous grammars).
see the example at http://www.acooke.org/lepl/binary.html#matching
which constructs an ethernet frame and then parses it, extracting the
source and destination addresses.
feel free to email me with more questions.
disclaimer: this is quite new and i don't know of anyone that actually
uses it; it is also Python3 only (because it uses bytes()).
andrew