Help reading structured binary data files


David McInnis

Sep 17, 2015, 9:09:15 PM
to julia-users
I'm in the process of switching from python to julia and have gotten stuck for a couple of days trying to read a, for me, typical data file.

In python I'd create a C-style format, open the file and read the data.
I don't see an equivalent method in Julia.

Ex:
Using a data structure of something like:    "<4sii28s4i"  
I'd figure out the size of the structure, point to the beginning byte, and then unpack it. 
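
For readers who have not used the Python side of this, the workflow described above can be sketched with the standard `struct` module. The field values and the in-memory record below are made up for illustration; only the format string comes from the post.

```python
import struct

# Format from the post: little-endian, a 4-byte string, two int32s,
# a 28-byte string, then four more int32s.
HEADER = struct.Struct("<4sii28s4i")

# Build a fake record in memory so the sketch runs without a data file.
# All field values here are invented for illustration.
raw = HEADER.pack(b"DEMO", 1, 2, b"Thu Sep 17 2015".ljust(28, b"\0"), 10, 20, 30, 40)

# "Figure out the size of the structure": 4 + 2*4 + 28 + 4*4 = 56 bytes.
record_size = HEADER.size

# "Point to the beginning byte, and then unpack it."
fields = HEADER.unpack_from(raw, 0)
label = fields[0].decode("ascii")
date = fields[3].rstrip(b"\0").decode("ascii")

print(record_size, label, fields[1], fields[2], date, fields[4:])
```

The question in this thread is how to get the same "describe the layout once, then unpack" behaviour in Julia.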

In Julia it looks like *maybe* I could make a data type to do this, but I can't figure out how.
There's also StrPack.jl,  but it too is a little beyond what I understand.

I work with a lot of different instruments, each with its own file format.  Usually I only need to read these files.  After processing I'll save everything into an hdf5 file.  

Thanks,  David.

Tom Breloff

Sep 17, 2015, 9:23:01 PM
to julia-users
I have an alternative to StrPack.jl here: https://github.com/tbreloff/CTechCommon.jl/blob/master/src/macros.jl.  If you have a type that mimics a c-struct, you can create like:

@packedStruct immutable MyStruct
  field1::UInt8
  field2::Byte
end

and it creates some methods: read, reinterpret, etc which can convert raw bytes into the immutable type.

I've only used it on very specific types of data, but it may work for you.

Stefan Karpinski

Sep 17, 2015, 9:59:48 PM
to Julia Users
You can pretty easily just read the pieces out of the file stream one at a time. E.g. if the file starts with the four-byte magic sequence "FuZz" and then has two big-endian two-byte words, and a bunch of bytes, you can do this:

# create a "file" in memory:

julia> io = IOBuffer()
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf, ptr=1, mark=-1)

julia> write(io, "FuZz")
4

julia> write(io, hton(UInt16(12)), hton(UInt16(34)))
4

julia> data = rand(UInt8, 12, 34)
12x34 Array{UInt8,2}:
 0x32  0x12  0xc5  0x8d  0x30  0x32  …  0x99  0x57  0x41  0x14  0xb1  0x28
 0x8f  0x0b  0x8a  0x81  0x1c  0x53     0xc5  0x9b  0x2b  0x88  0x87  0x6f
 0x1e  0xff  0xb1  0xac  0x74  0x08     0x1a  0x61  0x6a  0x54  0x8c  0x25
 0xca  0x70  0x87  0x9d  0x44  0xc7     0x48  0x62  0x10  0xf2  0x3e  0x40
 0xce  0x39  0x23  0xc9  0x54  0x15     0x8d  0xfd  0x32  0xfe  0xab  0x00
 0x0c  0xd1  0x86  0x66  0x06  0xa9  …  0x58  0x4f  0x45  0x4c  0x7e  0xe3
 0x1e  0x98  0xde  0x87  0x71  0x14     0x65  0x5b  0x0f  0xdb  0x5b  0xc5
 0x42  0xc1  0x75  0xc5  0x8d  0xd8     0x91  0x5d  0xce  0xa5  0x84  0x58
 0xf5  0xd7  0xdf  0x71  0x65  0x6e     0xd2  0xc8  0xec  0xcf  0x46  0xc7
 0x64  0x88  0x57  0x58  0x3f  0x5b     0x41  0xad  0x14  0xf8  0x03  0xf4
 0xa0  0xb5  0x42  0xed  0xed  0x80  …  0xb4  0xe3  0x5e  0xa7  0xde  0xa3
 0x6a  0x30  0x15  0xd9  0xe5  0xbd     0x17  0x3b  0xfd  0x6f  0x4e  0x98

julia> write(io, data)
408

julia> seek(io, 0)
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=416, maxsize=Inf, ptr=1, mark=-1)

# now read data back from that file:

julia> readbytes(io, 4) == b"FuZz" || error("invalid magic bytes")
true

julia> m = ntoh(read(io, UInt16))
0x000c

julia> n = ntoh(read(io, UInt16))
0x0022

julia> data = Array(UInt8, m, n);

julia> read!(io, data)
12x34 Array{UInt8,2}:
 0x32  0x12  0xc5  0x8d  0x30  0x32  …  0x99  0x57  0x41  0x14  0xb1  0x28
 0x8f  0x0b  0x8a  0x81  0x1c  0x53     0xc5  0x9b  0x2b  0x88  0x87  0x6f
 0x1e  0xff  0xb1  0xac  0x74  0x08     0x1a  0x61  0x6a  0x54  0x8c  0x25
 0xca  0x70  0x87  0x9d  0x44  0xc7     0x48  0x62  0x10  0xf2  0x3e  0x40
 0xce  0x39  0x23  0xc9  0x54  0x15     0x8d  0xfd  0x32  0xfe  0xab  0x00
 0x0c  0xd1  0x86  0x66  0x06  0xa9  …  0x58  0x4f  0x45  0x4c  0x7e  0xe3
 0x1e  0x98  0xde  0x87  0x71  0x14     0x65  0x5b  0x0f  0xdb  0x5b  0xc5
 0x42  0xc1  0x75  0xc5  0x8d  0xd8     0x91  0x5d  0xce  0xa5  0x84  0x58
 0xf5  0xd7  0xdf  0x71  0x65  0x6e     0xd2  0xc8  0xec  0xcf  0x46  0xc7
 0x64  0x88  0x57  0x58  0x3f  0x5b     0x41  0xad  0x14  0xf8  0x03  0xf4
 0xa0  0xb5  0x42  0xed  0xed  0x80  …  0xb4  0xe3  0x5e  0xa7  0xde  0xa3
 0x6a  0x30  0x15  0xd9  0xe5  0xbd     0x17  0x3b  0xfd  0x6f  0x4e  0x98
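
For comparison with the Python workflow the original poster is coming from, the same piece-by-piece read can be sketched with `io.BytesIO` and `struct`. This is not part of the answer above, only a side-by-side illustration of the same file layout (magic bytes, two big-endian two-byte words, then the payload); the random payload is a stand-in for real data.

```python
import io
import os
import struct

# Create a "file" in memory: magic bytes, two big-endian uint16 sizes, payload.
m, n = 12, 34
payload = os.urandom(m * n)
buf = io.BytesIO()
buf.write(b"FuZz")
buf.write(struct.pack(">HH", m, n))
buf.write(payload)
buf.seek(0)

# Read the pieces back one at a time, in file order.
if buf.read(4) != b"FuZz":
    raise ValueError("invalid magic bytes")
rows, cols = struct.unpack(">HH", buf.read(4))
data = buf.read(rows * cols)

print(rows, cols, len(data))
```

In both languages the reader simply consumes fields in the order they were written, converting byte order as each multi-byte value is read.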

ggggg

Sep 18, 2015, 12:52:56 PM
to julia-users
This is slightly different from your topic, but related. Numpy has a very nice interface for reading structured data that I've liked using. I felt like I would learn something if I tried to do similar things in Julia.

For example
import time
import numpy as np

# generate file to read
dt = [("i", np.int64), ("time", "<f8"), ("data", np.float64, 100)]
data0 = np.zeros(10, dtype=dt)
for i in range(len(data0)):
    data0[i] = (i, time.time(), np.random.rand(100))
data0.tofile("test.dat")

# read file
data = np.fromfile("test.dat", dtype=dt)
data["i"]  # gives [0,1,2,3,4,5,6,7,8,9]
data[0]  # (0, 1442593345.768639, [0.27580074453117376, 0.9292904639313211, ...])


It's easy to do most of the same in julia.
# generate file like object to read
io = IOBuffer()
for i = 1:10
    write(io, i)
    write(io, time())
    write(io, rand(100))
end
# define read type (like dtype in numpy)
immutable ReadType
    i::Int
    time::Float64
    data::NTuple{100,Float64} # not sure how else to do fixed size array, how would I do say a 10x10 Matrix?
end
len = position(seekend(io))
data = read(seekstart(io), ReadType, div(len, sizeof(ReadType)))

Int[d.i for d in data] # equivalent to data["i"] numpy example, not sure why I need to specify Int here
data[1] # equivalent to data[0] in numpy example

A few things appear to be missing in my Julia version: specification of byte order and the data["i"] syntax. The second is easy to add.
function Base.getindex(v::Vector, s::Symbol) # not sure it's possible to make it type stable while accepting a symbol
    T = fieldtype(eltype(v), s)
    T[getfield(entry, s) for entry in v]
end
data[:i] # equivalent to data["i"] in numpy


Patrick O'Leary

Sep 18, 2015, 4:05:56 PM
to julia-users
On Thursday, September 17, 2015 at 8:23:01 PM UTC-5, Tom Breloff wrote:
I have an alternative to StrPack.jl here: https://github.com/tbreloff/CTechCommon.jl/blob/master/src/macros.jl.  If you have a type that mimics a c-struct, you can create like:

@packedStruct immutable MyStruct
  field1::UInt8
  field2::Byte
end

and it creates some methods: read, reinterpret, etc which can convert raw bytes into the immutable type.

I've only used it on very specific types of data, but it may work for you.

This looks like a nice simple alternative. I haven't touched StrPack in a long time, but I believe the analogous StrPack syntax is:

@struct type MyStruct
  field1::UInt8
  field2::Byte
end, align_packed

Though since these are both 1-byte fields, there wouldn't be any padding under the default strategy anyways.
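
The padding remark can be illustrated from Python's `struct` module, whose native mode (`@`) aligns fields the way the platform's C compiler would, while standard mode (`=`) packs them back to back. This sketch is about alignment in general and does not use StrPack; the native sizes are platform dependent, so no specific value is claimed for the mixed case.

```python
import struct

# Two 1-byte fields: nothing to align, so packed and native layouts agree.
two_bytes_packed = struct.calcsize("=BB")
two_bytes_native = struct.calcsize("@BB")

# A 1-byte field followed by a 4-byte int: the native layout may insert
# padding so that the int starts on its alignment boundary.
mixed_packed = struct.calcsize("=BI")
mixed_native = struct.calcsize("@BI")

print(two_bytes_packed, two_bytes_native, mixed_packed, mixed_native)
```

A reader for an instrument file has to match whichever layout the writing program used, which is why both packing strategies exist.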

I'm not sure if this works with immutables--someone may have contributed that? Maintainer neglect of StrPack is acknowledged :D Have you considered possibly spinning your simplified version into its own package, Tom?

StrPack is/was pretty ambitious; I wanted to be able to get down to bit-level unpacking, with the goal of being able to parse a Novatel RANGECMP log (http://www.novatel.com/assets/Documents/Bulletins/apn031.pdf) entirely with an annotated type declaration. I'd still like to do this, but it's been hard to find the motivation to work on it. (There's a branch up on the repository which pushes towards this, but the work is incomplete.)

The manual alternative Stefan proposes is definitely a good option, especially if this is a one-off structure.

Tom Breloff

Sep 18, 2015, 4:19:02 PM
to julia-users
I never intended it to be a standalone package, mainly because, for it to be useful, I'd have to generalize a lot of stuff and add a lot of features.  If someone else wanted to though...

David McInnis

Sep 30, 2015, 12:31:36 PM
to julia-users
Sorry for the slow response, was called away.

As a starting place I'll try to stick with the builtin routines first.  
With Stefan's idea I've got something that works though I don't see a way to make it more..  ummm...   elegant.

Here's where I'm at:
myfile = "dnp.sam"
dnp = { "File" => myfile }

fh = open(myfile, "r")

dnp["Label"] = bytestring(readbytes(fh, 4))
dnp["Version"] = reinterpret( Uint16, readbytes(fh, 2) )
dnp["Revision"] = reinterpret( Uint16, readbytes(fh, 2) )
dnp["Date"] = bytestring(readbytes(fh, 28))
# and so on for 30 other variables

close(fh)


Any suggestions?

@Tom :  I love how clean your code looks.
@gggg :  We may be after the same thing.

Jameson Nash

Sep 30, 2015, 12:54:24 PM
to julia...@googlegroups.com
For the bitstypes, you can do `[read(fh, UInt16)]` to be a bit more concise.

David McInnis

Sep 30, 2015, 2:02:44 PM
to julia-users
Ah,  thank you..  that's much nicer.


@Tom and Patrick
I like the idea but don't understand how to specify say a string.   ??
field1::Uint16          # makes sense but how to do multiple numbers?
field2::asciistringofXbytes   -or- Xbytes and I'll convert it later

whee, a fire alarm.  

David McInnis

Oct 1, 2015, 4:39:40 PM
to julia-users
Related follow-on question..

Is there a nice way to get the data into a   type  format rather than a  dict ?

Here's the form I'm using now..
function DNP(myfile::String)

dnp = { "File" => myfile }
fh = open(myfile, "r")

dnp["Label"] = bytestring(readbytes(fh, 4))
dnp["Version"] = read(fh, Uint32)
dnp["Revision"] = read(fh, Uint32)
dnp["Date"] = bytestring(readbytes(fh, 28))
dnp["FileFormat"] = read(fh, Uint32)
dnp["FileType"] = bytestring(readbytes(fh,4))
dnp["OriginalFileName"] = bytestring(readbytes(fh,68))
dnp["ReferenceFileName"] = bytestring(readbytes(fh,68))
dnp["RelatedFileNameA"] = bytestring(readbytes(fh,68))
dnp["RelatedFileNameB"] = bytestring(readbytes(fh,68))
dnp["RelatedFileNameC"] = bytestring(readbytes(fh,68))
dnp["Annotate"] = bytestring(readbytes(fh,84))
dnp["InstrumentModel"] = bytestring(readbytes(fh,36))
dnp["InstrumentSerialNumber"] = bytestring(readbytes(fh,36))
dnp["SoftwareVersionNumber"] = bytestring(readbytes(fh,36))
dnp["CrystalMaterial"] = bytestring(readbytes(fh,36))
dnp["LaserWavelengthMicrons"] = read(fh, Float64)
dnp["LaserNullDoubling"] = read(fh, Uint32)
dnp["Padding"] = read(fh, Uint32)
dnp["DispersionConstantXc"] = read(fh, Float64)
dnp["DispersionConstantXm"] = read(fh, Float64)
dnp["DispersionConstantXb"] = read(fh, Float64)
dnp["NumChan"] = read(fh, Uint32)
dnp["InterferogramSize"] = read(fh, Uint32)
dnp["ScanDirection"] = read(fh, Uint32)
dnp["ACQUIREMODE"] = read(fh, Uint32)
dnp["EMISSIVITY"] = read(fh, Uint32)
dnp["APODIZATION"] = read(fh, Uint32)
dnp["ZEROFILL"] = read(fh, Uint32)
dnp["RUNTIMEMATH"] = read(fh, Uint32)
dnp["FFTSize"] = read(fh, Uint32)
dnp["NumberOfCoAdds"] = read(fh, Uint32)
dnp["SingleSided"] = read(fh, Uint32)
dnp["ChanDisplay"] = read(fh, Uint32)
dnp["AmbTemperature"] = read(fh, Float64)
dnp["InstTemperature"] = read(fh, Float64)
dnp["WBBTemperature"] = read(fh, Float64)
dnp["CBBTemperature"] = read(fh, Float64)
dnp["TEMPERATURE_DWR"] = read(fh, Float64)
dnp["EMISSIVITY_DWR"] = read(fh, Float64)
dnp["LaserTemperature"] = read(fh, Float64)
dnp["SpareI"] = read(fh, Uint32,10)
dnp["SpareF"] = read(fh, Float64,10)
dnp["SpareNA"] = bytestring(readbytes(fh,68))
dnp["SpareNB"] = bytestring(readbytes(fh,68))
dnp["SpareNC"] = bytestring(readbytes(fh,68))
dnp["SpareND"] = bytestring(readbytes(fh,68))
dnp["SpareNE"] = bytestring(readbytes(fh,68))
dnp["End"] = bytestring(readbytes(fh,4))


dnp["Interferograms"] = read(fh, Int16, dnp["InterferogramSize"], dnp["NumberOfCoAdds"])
fft_size = dnp["FFTSize"] * dnp["ZEROFILL"] * 512
dnp["Spectrum"] = read(fh, Float32, fft_size)

close(fh)

wavelength_range = 10000.0 / dnp["LaserWavelengthMicrons"]
spectral_range = wavelength_range / 2
spectral_binsize = spectral_range / fft_size
x_fft = [0:fft_size-1] * spectral_binsize
m = dnp["DispersionConstantXm"]
b = dnp["DispersionConstantXb"]
c = dnp["DispersionConstantXc"]
x_corrected = x_fft + c + exp10(m * x_fft + b)
dnp["WL_cm"] = x_corrected
dnp["WL_microns"] = 10000.0 ./ x_corrected


return dnp
end

Which works fine, but it leaves me with the data in a form that's ugly (to me) in calculations:  dnp["InterferogramSize"] * dnp["NumberOfCoAdds"]
Instead of:    dnp.InterferogramSize * dnp.NumberOfCoAdds

I can create an appropriate type like:
type DNP
Label
Version
Revision
Date
#### etc etc
end

..but I can't figure out a good way to get the data there.    Well, other than keeping my current function, defining the type, and then having another function to copy the data into the type...     ugggggly.    

I read all the docs I could find on types but never saw anything that hinted at a solution..    maybe a  function/type hybrid??
I tried creating the type within the function but didn't get anywhere.
Ideas?
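
For comparison with the Python workflow this thread started from, one way to get named-field access instead of dictionary lookups is to unpack straight into a named tuple. The sketch below covers only the first four header fields of the function above, with sizes taken from the post (4-byte label, two 32-bit unsigned integers, 28-byte date). The little-endian byte order and the names `Header`, `HEADER_FORMAT`, and `read_header` are assumptions made for illustration, and the input bytes are invented.

```python
import io
import struct
from collections import namedtuple

# First four header fields from the function above; byte order is assumed.
Header = namedtuple("Header", ["Label", "Version", "Revision", "Date"])
HEADER_FORMAT = struct.Struct("<4sII28s")

def read_header(fh):
    """Read one header from a binary file object and return it as a Header."""
    raw = fh.read(HEADER_FORMAT.size)
    label, version, revision, date = HEADER_FORMAT.unpack(raw)
    return Header(label.decode("ascii"), version, revision,
                  date.rstrip(b"\0").decode("ascii"))

# Made-up bytes standing in for a real instrument file.
fake = io.BytesIO(HEADER_FORMAT.pack(b"DEMO", 3, 1, b"Thu Oct 01 2015"))
hdr = read_header(fake)
print(hdr.Label, hdr.Version * hdr.Revision, hdr.Date)
```

The design point carries over to any language: declare the field names and the binary layout side by side, and have one reader function return the populated record, so no second copying step is needed.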

Tom Breloff

Oct 1, 2015, 6:41:30 PM
to julia-users
This is exactly what my packedStruct macro is for... define the type (composed of bitstypes) and the macro will set up helper methods to read raw bytes into the fields.  I'm curious if it suits your needs, and what is missing.

David McInnis

Oct 2, 2015, 10:01:29 AM
to julia-users
@Tom :  I couldn't figure out how to do arrays or strings of a specific number of bytes.

Tom Breloff

Oct 2, 2015, 2:28:29 PM
to julia...@googlegroups.com
Gotcha. Those would probably require some additional syntax in the type to be able to define an array/string length... Not too crazy, but not supported now.  If you figure out a clean solution, please post it here. 

Patrick O'Leary

Oct 3, 2015, 6:26:29 PM
to julia-users
Going back to StrPack, there's syntax for that:

@struct type SomeThings
    anInt::Int32
    aVectorWithSixElements::Vector{Int32}(6)
    aStringOfEightBytes::ASCIIString(8)
end

Tom Breloff

Oct 3, 2015, 7:38:44 PM
to julia-users
Thanks Patrick.  Yes StrPack is the way to go then.  My only warning is that StrPack has a bunch of logic for endianness, etc which slows it down a little, but for most purposes it should work well.  (Also, it might have changed since I last tried it ~8 months ago, so ymmv)

Patrick O'Leary

Oct 3, 2015, 8:19:03 PM
to julia-users
Trust me, it hasn't changed :D Those are totally valid points; I never really tried to wring out all the performance stuff, and Julia itself has changed a lot since I put it together--but it's not something I see myself getting to anytime soon.

David McInnis

Oct 8, 2015, 11:07:17 AM
to julia-users
Ohhh...  shiny!   This looks really nice, thanks.
It's certainly fast enough for my needs.