
Decoding Binary Files with unknown structure?


Shawn

Jul 1, 2010, 2:42:06 PM

Hello MathWorks Community,

I'm curious how to decode a binary file that is in some unknown structure..
The file appears to be written in ieee-le machine format.

I can open the file and read it using fopen and fread. fread works well, but it doesn't reveal the nature of the delimiters. It seems each line of the file has a different length. Alternatively, fgetl reads the file line by line, but only provides the char representation of the data.

Ideally, I would like to scan the file for a known string that indicates where a list of fields exists within the file. I would then like to read that list to reveal the order in which some number, N, of fields occur. The strings appear in char format amidst an undetermined number of binary bytes.
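
One way to scan for a known string without relying on lines is to read the whole file as raw bytes and search the byte stream. The following is only a minimal sketch: the stand-in data and the search string 'Field' are assumptions for illustration, not taken from a real file of this format.

```matlab
% Stand-in data (made up): text fragments embedded in arbitrary bytes.
bytes = [uint8([0 64 2 0]), uint8('Name'), uint8([10 0]), uint8('Field'), ...
         uint8([7 255 13]), uint8('Age'), uint8([10 0]), uint8('Field')];

% Write the stand-in data to a temporary file.
fname = [tempname, '.bin'];
fid = fopen(fname, 'w');
fwrite(fid, bytes, 'uint8');
fclose(fid);

% Read the whole file back as raw bytes. No line structure is assumed.
fid = fopen(fname, 'r');
raw = fread(fid, Inf, 'uint8=>uint8')';   % row vector of class uint8
fclose(fid);
delete(fname);

% Convert to char one-to-one (codes 0..255) so that strfind can be used.
txt = char(raw);
hits = strfind(txt, 'Field');             % 1-based byte offsets of each match
disp(hits)
```

The offsets in `hits` index directly into `raw`, so the bytes before and after each match can be inspected without any notion of a line.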

Jan Simon

Jul 1, 2010, 2:58:05 PM

Dear Shawn,

> I'm curious how to decode a binary file that is in some unknown structure..

Because there is an infinite number of possibilities for unknown structures, there is only one efficient and secure method:
Ask the person who has created the file.
Even if that person is an employee of the CIA, the chance of success is higher than with guessing at a structure that might match some, but not all, instances of the files.

Kind regards, Jan

us

Jul 1, 2010, 5:51:04 PM

"Shawn " <sha...@geemail.com> wrote in message <i0inht$qo$1...@fred.mathworks.com>...

a hint:
- use one of the many(!) dumpers floating around in the web...

us
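
A simple hex dump can also be produced in MATLAB itself. The following is a minimal sketch rather than a replacement for a dedicated dumper; the byte values are made-up stand-in data and the row width of 8 is an arbitrary choice.

```matlab
% Made-up stand-in bytes; in practice these would come from fread.
bytes = uint8([0 64 2 0 78 97 109 101 10 0 70 105 101 108 100 7 255 13]);
width = 8;                                   % bytes shown per row
lines = {};
for k = 1:width:numel(bytes)
    chunk = bytes(k:min(k + width - 1, numel(bytes)));
    hexPart = sprintf('%02X ', chunk);       % two hex digits per byte
    hexPart = [hexPart, blanks(3 * width - numel(hexPart))];  % pad short rows
    txt = char(chunk);
    txt(chunk < 32 | chunk > 126) = '.';     % mask non-printable bytes
    lines{end + 1} = sprintf('%06X  %s %s', k - 1, hexPart, txt);
end
fprintf('%s\n', lines{:});
```

Each row shows the zero-based offset, the bytes in hex, and the printable characters, which makes readable strings and the bytes around them easy to spot.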

Shawn

Jul 8, 2010, 1:05:03 PM

"Shawn " <sha...@geemail.com> wrote in message <i0inht$qo$1...@fred.mathworks.com>...
>


Thanks for the suggestions..
1) Unfortunately, the company that created the file format has disappeared, so I can't ask anyone to define the format.
2) Dumpers: maybe, I'm not sure about those. I would like to use only MATLAB to get in there and finesse out the appropriate information.


To elaborate on the problem: the information I would like is somewhere after the 1400th line of the binary file. I am unable to identify newlines without invoking fgetl, which seems strange (there must be an easy way to identify the end of a line). There does not seem to be a block structure to the binary file; i.e. each line has some unknown number of bytes. The file appears to be ieee-le with windows-1252 encoding.

I would like to scan the file for a particular string that follows each unique field variable string. The field variable strings themselves can change between otherwise similar binary files, so I cannot search for them directly. The goal is to determine the ordering of the field variables encoded in another data file, which this file defines.

I have identified a list of uint8 numbers that compose the marker that follows all field variables (e.g. 10 0 70 105 101 108 100) and would like to create a way to search for this in each line, keeping track of the order of each associated field variable listed just before this entry. However, each variable string is of unknown length and is also bordered by some other uninterpretable characters (e.g. 0 64 2 0). What would be a good way in MATLAB to search for the first identifier, then rewind along that line to the second identifier, to reveal the string of characters between them?
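
The search-then-rewind idea can be sketched on the raw byte stream instead of on lines. This is only an illustration under assumptions: the marker and border bytes are the examples quoted above, the surrounding data is made up, and a real file may not follow this layout.

```matlab
% Made-up stand-in data using the example marker and border bytes.
marker = uint8([10 0 70 105 101 108 100]);   % follows each field name
border = uint8([0 64 2 0]);                  % precedes each field name
raw = [border, uint8('Name'), marker, uint8([7 255 13]), ...
       border, uint8('Age'), marker];

% Search on a one-to-one char copy of the bytes (codes 0..255).
txt = char(raw);
markerPos = strfind(txt, char(marker));      % start offset of each marker
borderPos = strfind(txt, char(border));      % start offset of each border

names = {};
for m = markerPos
    prior = borderPos(borderPos < m);        % borders before this marker
    if isempty(prior)
        continue;                            % no border found: skip this hit
    end
    first = prior(end) + numel(border);      % first byte after nearest border
    names{end + 1} = txt(first:m - 1);       % text between border and marker
end
disp(names)
```

Because `markerPos` is in ascending order, the order of entries in `names` is the order in which the field names occur in the byte stream.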

Jan Simon

Jul 8, 2010, 3:09:04 PM

Dear Shawn,

> > I'm curious how to decode a binary file that is in some unknown structure..
> > The file appears to be written in ieee-le machine format.

How did you identify that?

> 1) Unfortunately the company that created the file format has disappeared, so can't ask anyone to define the format.

Perhaps you take the chance and post the name of the company and the extension and purpose of the files. The CSSM has a lot of readers, perhaps the grandma of a reader has invented the format...

> To elaborate on the problem - the information I would like is somewhere after the 1400th line of the binary file.

I stop after this sentence already: Binary files usually do not use "lines".
The data could be compressed also...

Jan

Shawn

Jul 20, 2010, 5:03:04 PM

"Jan Simon" <matlab.T...@nMINUSsimon.de> wrote in message <i157og$c2b$1...@fred.mathworks.com>...

Hi Jan,

> Dear Shawn,
>
> > > I'm curious how to decode a binary file that is in some unknown structure..
> > > The file appears to be written in ieee-le machine format.
>
> How did you identify that?

I called fopen with the file identifier to have MATLAB report the machine format and encoding, i.e.
[filename, permission, machineformat, encoding] = fopen(fid)

>
> > 1) Unfortunately the company that created the file format has disappeared, so can't ask anyone to define the format.
>
> Perhaps you take the chance and post the name of the company and the extension and purpose of the files. The CSSM has a lot of readers, perhaps the grandma of a reader has invented the format...

Ok, the file that I want to read has the extension '.BPF', which stands for 'bubble publishing form'. The company Bubble Publishing created Optical Mark Reading software. This particular file describes the format of optical marks on a piece of paper in relation to the question number and their spatial position.
Bubble Publishing was purchased by Scantron and the software is no longer supported.

>
> > To elaborate on the problem - the information I would like is somewhere after the 1400th line of the binary file.
>
> I stop after this sentence already: Binary files usually do not use "lines".
> The data could be compressed also...

This could be the case; however, 'fgetl' reads and reports something. How it knows where to stop, I have no idea. What it reports are lines of mixed interpretable and uninterpretable characters and strings. Each 'line' is of different length.
I am able to search each line using regular expressions for a combination of characters. I ended up keeping a count of the lines read so far while searching each line for a particular string that borders each variable field name; I then record the field name and its line number in the file to define the ordering of the variables. Unfortunately, this doesn't work for every '.bpf' file that I've tried. It would still be great to hear from someone who knows about the file structure.

>
> Jan

Thanks Jan,
Shawn

Jan Simon

Jul 20, 2010, 6:27:04 PM

Dear Shawn,

> > > > The file appears to be written in ieee-le machine format.

> I called fopen with the file identifier to have MATLAB report the machine format and encoding, i.e.
> [filename, permission, machineformat, encoding] = fopen(fid)

Called this way, "permission", "machineformat" and "encoding" simply echo the settings that were used when the file was opened with FOPEN(FileName). They are not detected from the contents of the file.
Try it:
file = which('plot.m'); % Arbitrary file!
fid = fopen(file, 'rb', 'b'); % 'b' is a synonym for ieee-be
[name, perm, fmt] = fopen(fid) % fmt echoes the big-endian setting
fclose(fid);
fid = fopen(file, 'rb', 'l'); % 'l' is a synonym for ieee-le
[name, perm, fmt] = fopen(fid) % fmt echoes the little-endian setting
fclose(fid);
The latter is the default on PCs. Therefore you do not know whether the file is big-endian or little-endian.
See: "help fopen"
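
Why the byte order matters can be sketched without any file at all: the same four bytes give very different numbers depending on the order assumed, and only the plausibility of the result hints at which is right. The byte values here are made up for illustration.

```matlab
b = uint8([1 0 0 0]);                 % four bytes as they might sit in a file
[~, ~, endian] = computer;            % 'L' or 'B': byte order of this machine
native  = typecast(b, 'uint32');      % interpreted in the machine's byte order
swapped = swapbytes(native);          % the same bytes in the opposite order
fprintf('native (%s): %u, swapped: %u\n', endian, native, swapped);
```

One of the two values is 1 and the other is 16777216; which one is "native" depends on the machine running the code.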

> This could be the case, however 'fgetl' reads and reports something - how it knows where to stop, i have no idea.

FGETL reads until a CHAR(10), CHAR(13) or CHAR([13, 10]); this can differ depending on whether the file is opened in binary or text mode (fopen(file, 'rb') or fopen(file, 'rt')). In binary files, bytes with the value 10 or 13 appear from time to time, e.g. the DOUBLE 6990767227 is written to the file as "0, 0, 176, 71, 234, 10, 250, 65" (at least this is the result of TYPECAST to UINT8; the byte order depends on le/be). There you find a CHAR(10), but this is *not* a line break!
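
The example can be checked directly. This short sketch assumes a little-endian machine, which is what TYPECAST uses on common PC platforms; on a big-endian machine the bytes would appear in reverse order.

```matlab
x = 6990767227;                  % an ordinary double value
b = typecast(x, 'uint8');        % its 8 bytes in the machine's byte order
disp(b)                          % little-endian: 0 0 176 71 234 10 250 65
pos = find(b == 10);             % position of the byte that looks like CHAR(10)
fprintf('Byte with value 10 at position %d of %d\n', pos, numel(b));
```

A line-oriented reader such as FGETL would split the number at that byte, which is one reason the "lines" have such irregular lengths.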

My impression: as long as nobody reveals more details about the format, you have little chance of reading the file reliably. Spend your time on something else.

Kind regards, Jan

Steven_Lord

Jul 21, 2010, 1:19:39 PM

"Shawn " <sha...@geemail.com> wrote in message news:i252u8$28h$1...@fred.mathworks.com...


> "Jan Simon" <matlab.T...@nMINUSsimon.de> wrote in message
> <i157og$c2b$1...@fred.mathworks.com>...

*snip*

>> Perhaps you take the chance and post the name of the company and the
>> extension and purpose of the files. The CSSM has a lot of readers,
>> perhaps the grandma of a reader has invented the format...
>
> Ok, the file that I am wanting to read has the extension '.BPF' which
> stands for 'bubble publishing form'.. The company Bubble Publishing
> created Optical Mark Reading software. This particular file describes the
> format of optical marks on a piece of paper in relation to the question
> number and their spatial position. Bubble Publishing was purchased by
> Scantron and is no longer supported.

Then I think your best bet is probably going to be to contact Scantron and
see if they're willing to provide you with the specification for the format
or a conversion tool to convert your data set into something that either can
be read by a supported application or whose format is documented.

--
Steve Lord
sl...@mathworks.com
comp.soft-sys.matlab (CSSM) FAQ: http://matlabwiki.mathworks.com/MATLAB_FAQ
To contact Technical Support use the Contact Us link on
http://www.mathworks.com
