Reading a binary file using Python


Ben Hearn

Mar 1, 2015, 10:58:50 AM
to python_in...@googlegroups.com
Hello all,

I am currently writing a tool for work that needs to read our in-house .model format, which is a binary file.

Here's the thing: the guy who wrote the tools that I inherited never finished them and never commented them, so I have been hacking at the old code for a few months now, which has been great fun and a brilliant learning process. He left an old file that seems to read a binary file, which is exactly what I need.

I am, however, confused about something involved in reading the data in a binary file, so here goes.

To get the information from the binary file it would appear that certain functions need to be called in a certain order. The only thing is all they essentially do is read the file in byte chunks. If I comment any of the functions out and run the code the data is all wrong.

So my question is, when you are reading binary files (or any files) every time you perform a read function or a struct.unpack does the position in the file that you read from change? Or is it something simpler like if you have written your binary file in a certain way, you HAVE to read it in exactly the same way?

I might be missing some crucial theory so any help would be much appreciated :)

Here are the function calls and the loop that contains the reading of info:

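import struct

def _read_uint32(file):
    # Read 4 bytes and decode them as an unsigned 32-bit integer.
    # The 'I' format uses the machine's native byte order.
    data = file.read(4)
    data = struct.unpack('I', data)[0]
    return data

def _read_string(file):
    # A string is stored as a uint32 length followed by that many bytes.
    size = _read_uint32(file)
    data = file.read(size)
    return data

def _read_int32(file):
    # Read 4 bytes and decode them as a signed 32-bit integer.
    data = file.read(4)
    data = struct.unpack('i', data)[0]
    return data

def _read_matrix4(file):
    # 16 values of 8 bytes each (presumably doubles), returned undecoded.
    data = file.read(8 * 16)
    return data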

And here is the function that I am using to loop through the file and gather the data:

def unserialize(path):
    # QtCore comes from whichever Qt binding the tool uses; MATERIAL is a
    # type-id constant that is not shown in this snippet.
    materials = []

    print "OPENING THIS FILE PATH: ", path

    file = QtCore.QFile(path)
    file.open(QtCore.QIODevice.ReadOnly)

    # The first 4 bytes are the file header.
    buf = file.read(4)

    # Size in bytes of the chunk data that follows.
    total_size = _read_uint32(file)
    start_offset = file.pos()

    passes = 1

    for i in range(passes):
        file.seek(start_offset)
        while (file.pos() - start_offset) < total_size:
            # Every chunk starts with its type id and its size.
            type_id = _read_uint32(file)
            chunk_size = _read_uint32(file)
            if type_id == 0 and i == 0:
                print "ID: ", _read_int32(file)
                print "NAME: ", _read_string(file)
                print "TRANSFORM ", _read_matrix4(file)
                print "PIVOT TRANSFORM ", _read_matrix4(file)
                print "PARENT ID ", _read_int32(file)
                print
            elif type_id == MATERIAL and i == 0:
                print _read_int32(file)
                print _read_string(file)
                print _read_int32(file)
            else:
                # Unknown chunk type: skip over its data.
                file.seek(file.pos() + chunk_size)

    file.close()

Cheers!

Ben

Brad Friedman

Mar 1, 2015, 12:57:19 PM
to python_in...@googlegroups.com
Yes, typical binary stream reading behavior is that the "cursor" or "read head" or "offset" or "pointer" moves with reads. You can also usually "seek" to an offset various ways to skip over/to different parts of the file rather than read through. On semi-sequential media like a platter hard-drive there are usually performance gains when data is read sequentially without intermediate seeks.
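
To make the moving cursor concrete, here is a minimal sketch using an in-memory stream instead of a real file. The three values and the little-endian layout are made up for the illustration. Note that struct.unpack never touches the file; it only decodes bytes that read() has already returned, so it is the read() and seek() calls that move the position.

```python
import io
import struct

# Hypothetical 12-byte payload: three little-endian uint32 values.
stream = io.BytesIO(struct.pack('<III', 10, 20, 30))

positions = [stream.tell()]        # the cursor starts at offset 0

# read(4) returns 4 bytes and advances the cursor by 4.
first = struct.unpack('<I', stream.read(4))[0]
positions.append(stream.tell())

# seek() moves the cursor without reading, here past the second value.
stream.seek(8)
positions.append(stream.tell())

third = struct.unpack('<I', stream.read(4))[0]
positions.append(stream.tell())    # now at the end of the data
```

A regular file opened with open(path, 'rb') behaves the same way for read(), seek() and tell().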

Binary file layout and calculation of offsets for a given file, to best suit the purpose of the format, can be a bit of a dark art at times. 

--
You received this message because you are subscribed to the Google Groups "Python Programming for Autodesk Maya" group.
To unsubscribe from this group and stop receiving emails from it, send an email to python_inside_m...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/python_inside_maya/45b8fdf5-c7e4-4f86-8dbc-f575659cf585%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Justin Israel

Mar 1, 2015, 1:49:48 PM
to python_in...@googlegroups.com


The file handle advances with each read, regardless of whether it is a binary or a plain text file. That is just how file operations work. The difference with a binary file is that if you seek to a completely random position, you would most likely not know how much to read to get anything useful. A binary file follows a protocol defined by its format. You start at a known point and read an expected amount. That may return a known value that indicates what to read next, or the protocol may require that you always just read N different amounts, in order, in a loop, until the file is done.
Something like an int32 is going to be a specific number of bytes in the file. So if you know you should be reading an int32 at this point, you have to both be at the starting position of that data and read exactly 4 bytes. Then you have your valid data and have advanced to the start of something else.
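
As a rough illustration of why the starting position matters, the sketch below decodes the same 4-byte int32 once from the correct offset and once from one byte too late. The two-field layout, the values and the helper name read_int32_at are invented for the example, and little-endian byte order is an assumption.

```python
import io
import struct

# Hypothetical layout: an int32 (-5) followed by a uint32 (1000).
data = struct.pack('<iI', -5, 1000)

def read_int32_at(buf, offset):
    """Seek to offset and decode the next 4 bytes as a little-endian int32."""
    stream = io.BytesIO(buf)
    stream.seek(offset)
    return struct.unpack('<i', stream.read(4))[0]

aligned = read_int32_at(data, 0)      # starts on the field boundary
misaligned = read_int32_at(data, 1)   # one byte late, straddles both fields

field_size = struct.calcsize('<i')    # an int32 is always 4 bytes here
```

The misaligned read still returns an integer without any error, which is why skipping one of the read calls makes all of the later data look wrong rather than failing outright.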
Some binary formats could use a protocol where you first read some kind of descriptor string which might say:

   100,s,4,i,4,i,50,s,30

and you would then know that you should next expect to read a 100-byte string, two 4-byte ints, and a 50-byte string, and that the next descriptor will be a 30-byte string describing a possibly different amount of data.
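
A descriptor-driven reader along those lines could look roughly like this. It is only a sketch of the idea, not any real format: the comma-separated "size,type" pairs, the trailing number giving the length of the next descriptor, the use of 0 to mean "stop", and the function name read_records are all assumptions made for the example.

```python
import io
import struct

def read_records(stream, first_descriptor_size):
    """Read descriptor-driven records until a descriptor announces size 0."""
    records = []
    descriptor_size = first_descriptor_size
    while descriptor_size:
        parts = stream.read(descriptor_size).decode('ascii').split(',')
        # The last entry is the byte length of the next descriptor.
        descriptor_size = int(parts.pop())
        fields = []
        # The remaining entries come in (size, type) pairs.
        for size, kind in zip(parts[0::2], parts[1::2]):
            raw = stream.read(int(size))
            if kind == 's':
                fields.append(raw.decode('ascii'))
            elif kind == 'i':
                fields.append(struct.unpack('<i', raw)[0])
            else:
                raise ValueError('unknown field type: %r' % kind)
        records.append(fields)
    return records

# Build a small two-record sample in memory.
second = b'2,s,0'                                  # one 2-byte string, then stop
first = ('5,s,4,i,4,i,%d' % len(second)).encode('ascii')
payload = first + b'hello' + struct.pack('<ii', 7, -1) + second + b'ok'

records = read_records(io.BytesIO(payload), len(first))
```

The reader has to know the size of the very first descriptor in advance; after that, each descriptor tells it how much to read next.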

Or maybe you have a header at the beginning that indexes a bunch of byte offsets to various components of the file format.
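
For the offset-index idea, here is one possible sketch. The layout (a uint32 count followed by one offset/size pair per section) and the names build_file and read_section are hypothetical, chosen only to show how an index lets you seek straight to the part you want.

```python
import io
import struct

def build_file(sections):
    """Pack sections behind a header holding a count and offset/size pairs."""
    header_size = 4 + 8 * len(sections)
    table, body, offset = b'', b'', header_size
    for blob in sections:
        table += struct.pack('<II', offset, len(blob))
        body += blob
        offset += len(blob)
    return struct.pack('<I', len(sections)) + table + body

def read_section(stream, index):
    """Use the header index to jump straight to one section."""
    stream.seek(0)
    count = struct.unpack('<I', stream.read(4))[0]
    if not 0 <= index < count:
        raise IndexError('section %d out of range' % index)
    # Each table entry is 8 bytes: a uint32 offset and a uint32 size.
    stream.seek(4 + 8 * index)
    offset, size = struct.unpack('<II', stream.read(8))
    stream.seek(offset)
    return stream.read(size)

stream = io.BytesIO(build_file([b'mesh-data', b'materials', b'anim']))
last = read_section(stream, 2)
```

With an index like this the reader does not have to walk through the earlier sections to reach a later one.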

There are a number of different ways it could work, but in your example it does seem like it is just a pattern of assumed reads that have to be consistently read over and over.
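
The loop in the question looks like a chunk walk: a 4-byte header, a total size, then repeated (type id, chunk size, data) records. Below is a self-contained sketch of that pattern on made-up data. The 'ABCD' header, the chunk payloads, the little-endian byte order and the helper names are assumptions for the example, not details of the real .model format.

```python
import io
import struct

def read_uint32(stream):
    return struct.unpack('<I', stream.read(4))[0]

def collect_chunks(stream, wanted):
    """Return the data of every chunk whose type id equals wanted."""
    stream.read(4)                      # header, not interpreted here
    total_size = read_uint32(stream)
    start = stream.tell()
    found = []
    while stream.tell() - start < total_size:
        type_id = read_uint32(stream)
        chunk_size = read_uint32(stream)
        if type_id == wanted:
            found.append(stream.read(chunk_size))
        else:
            # The stored size lets us skip chunks we do not understand.
            stream.seek(chunk_size, io.SEEK_CUR)
    return found

def make_chunk(type_id, data):
    return struct.pack('<II', type_id, len(data)) + data

body = make_chunk(0, b'node') + make_chunk(99, b'unknown!') + make_chunk(0, b'n2')
data = b'ABCD' + struct.pack('<I', len(body)) + body

nodes = collect_chunks(io.BytesIO(data), wanted=0)
```

Because every chunk carries its own size, whole chunks can be skipped safely, but the fields inside a chunk still have to be read in the order they were written.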


Benjam901

Mar 2, 2015, 3:44:52 AM
to python_in...@googlegroups.com
Hey guys,

Thanks for the solid info! It has cleared up a lot for me and it all makes a bit more sense now. I am still a little fuzzy about what the header does. The header for our files is a 4-byte string. What does it indicate, or is it there so the engine/computer is able to decode the file properly?

Cheers,

Ben

Justin Israel

Mar 2, 2015, 4:51:30 PM
to python_in...@googlegroups.com
We can't really be sure what exactly the header is used for in your case, but similar to my other examples, it can be used to store information about the rest of the file. Maybe it is a magic ID that can be checked to ensure the file content is exactly what you expect it to be. You could design a file so that a fixed number of bytes at the start is an ASCII or binary header, and then the body picks up at a known offset.



Brad Friedman

Mar 2, 2015, 4:59:16 PM
to python_in...@googlegroups.com
I'd agree. It's probably "magic". Which is usually assumed to be a standard constant value thrown at the head of a binary file to help verify the file-format is of the expected type, rather than corrupt garbage. Usually, a binary reader will check the value and gracefully fail if it's not the expected value. 
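
A check like that might look as follows. This is a minimal sketch: the signature b'MODL', the exception name BadMagicError and the function name open_model are placeholders, since the actual magic value of the .model format is not given in this thread.

```python
import io

MAGIC = b'MODL'  # hypothetical 4-byte signature, not the real one

class BadMagicError(ValueError):
    """Raised when a stream does not start with the expected signature."""

def open_model(stream):
    """Check the leading magic bytes and leave the cursor just after them."""
    magic = stream.read(len(MAGIC))
    if magic != MAGIC:
        raise BadMagicError('not a model file: magic was %r' % (magic,))
    return stream

good = open_model(io.BytesIO(MAGIC + b'\x00' * 8))

try:
    open_model(io.BytesIO(b'JUNKdata'))
    rejected = False
except BadMagicError:
    rejected = True
```

Failing early with a clear error is usually kinder than carrying on and decoding garbage from a file of the wrong type.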

Anthony Tan

Mar 2, 2015, 5:37:30 PM
to python_in...@googlegroups.com
Sounds about the right size for a magic number too; most seem to be around the 4-byte size, EXRs for example. If you get morbidly curious, crack open the /usr/share/misc/magic file (or wherever the equivalent is on your system) and revel in the voluminous number of file signatures out there.
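
For a taste of what such a signature table does, here is a tiny sketch that identifies a few well-known formats by their leading bytes. The table and the sniff function are just an illustration with four common signatures, nowhere near what the real magic database covers.

```python
# A handful of well-known signatures, checked against the start of a file.
SIGNATURES = [
    (b'\x89PNG\r\n\x1a\n', 'PNG image'),
    (b'\x76\x2f\x31\x01', 'OpenEXR image'),
    (b'PK\x03\x04', 'ZIP archive'),
    (b'\x1f\x8b', 'gzip data'),
]

def sniff(head):
    """Return a format name for the leading bytes, or None if unknown."""
    for magic, name in SIGNATURES:
        if head.startswith(magic):
            return name
    return None

exr_guess = sniff(b'\x76\x2f\x31\x01\x02\x00\x00\x00')
unknown_guess = sniff(b'hello world')
```

In practice you would pass in the first few bytes read from the file, for example open(path, 'rb').read(8).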

Ben Hearn

Mar 3, 2015, 3:52:56 AM
to python_in...@googlegroups.com
Thanks for the replies. The header was indeed a magic number. I managed to get my script to read the model file and assign each object the correct parent based on the hierarchy exported from Maya :)

Thanks a lot!

-- Ben
