I have been using the BinaryRead/BinaryWrite functions to do this, but am
finding that they run significantly (dramatically, even) slower than
reading the entire file using Import. It would be great to know if
anyone has found a solution for this and, if not, whether it's something
that's likely to improve in future versions.
Let me illustrate my attempts with an example (apologies for the
length of this; I've tried to make it as succinct as possible while
remaining non-trivial):
(* initial definitions *)
{nRow, nCol} = {100000,10};
mat = RandomReal[{-1,1}, {nRow,nCol}];
file = "C:\\falloon\\test.dat";
fmt = ConstantArray["Real64", nCol];
In[240]:= (* METHOD 1A: write to file using Export: very efficient *)
Export[file, mat, "Real64"] // AbsoluteTiming
Out[240]= {0.0937500,C:\falloon\test.dat}
In[241]:= (* METHOD 2A: write to file line-by-line *)
If[FileExistsQ[file], DeleteFile[file]];
str = OpenWrite[file, BinaryFormat->True];
Do[BinaryWrite[file, row, fmt], {row, mat}] // AbsoluteTiming
Close[str];
Out[249]= {2.1718750,Null}
(* METHOD 3A: write to file element-by-element *)
If[FileExistsQ[file], DeleteFile[file]];
str = OpenWrite[file, BinaryFormat->True];
Do[BinaryWrite[file, mat[[i,j]], "Real64"], {i,nRow}, {j,nCol}] //
AbsoluteTiming
Close[str];
Out[253]= {11.4296875,Null}
In[266]:= (* METHOD 1B: read entire file using Import *)
mat2 = Partition[Import[file, "Real64"], nCol]; // AbsoluteTiming
mat2 == mat
Out[266]= {0.1093750,Null}
Out[267]= True
In[255]:= (* METHOD 2B: read file line-by-line *)
str = OpenRead[file, BinaryFormat->True];
mat2 = Table[BinaryRead[str, fmt], {nRow}]; // AbsoluteTiming
Close[str];
mat == mat2
Out[256]= {11.7500000,Null}
Out[258]= True
In[259]:= (* METHOD 3B: read file element-by-element *)
str = OpenRead[file, BinaryFormat->True];
mat3 = Table[BinaryRead[str, "Real64"], {nRow}, {nCol}]; //
AbsoluteTiming
Close[str];
mat == mat3
Out[260]= {2.2812500,Null}
Out[262]= True
So, based on this example, I guess my questions can be summarized as:
1. Why is line-by-line or element-by-element reading so much slower
than importing everything at once?
2. Why is line-by-line writing faster than element-by-element, but
the other way around when reading?
3. Is there any solution or workaround that avoids reading the entire
file at once?
Many thanks for any help!
Cheers,
Peter.
The expensive part of a write operation is the interaction with the
storage medium (disk drive?), which requires establishing the proper
logical and physical relationships. For example, any write involves finding
the physical space on the disk, which may or may not be a simple set of
contiguous physical addresses, and updating the data that defines the file
to ensure that the proper set of physical addresses is read back in the
proper order to reconstruct the file. This not only requires a lot of
instructions, it involves the slowest interaction on your computer:
waiting for the read-write head to be in the proper location on the
disk.
These factors are inherent in the use of a disk drive. The design
tradeoff is capacity and persistence against speed -- drives give
capacity and persistence, RAM and cache give speed.
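One rough way to see how much of the cost comes from the number of individual
write operations, rather than the amount of data, is to write the same data
once as a single call and once as many small calls; a minimal sketch (the path
and size are placeholders):

(* one large call: the whole buffer is handed over at once *)
buf = RandomReal[{-1, 1}, 10^6];
str = OpenWrite["C:\\falloon\\overhead.dat", BinaryFormat -> True];
BinaryWrite[str, buf, "Real64"] // AbsoluteTiming
Close[str];

(* many small calls: same data, but 10^6 separate BinaryWrite invocations *)
str = OpenWrite["C:\\falloon\\overhead.dat", BinaryFormat -> True];
Do[BinaryWrite[str, buf[[i]], "Real64"], {i, Length[buf]}] // AbsoluteTiming
Close[str];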
ekt
Regards,
Kurt Tekolste
Kurt, thanks for the comments. I appreciate your point, and I
certainly agree that these types of considerations would explain why
the Import versions (method 1 in my example) are faster than the
others.
But I don't think that's the whole story since, by this reasoning,
method 2B should presumably be faster than 3B, whereas the reverse is
true! (It has been suggested that BinaryReadList may be more suitable,
but I haven't tried this yet...)
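Presumably it would be a drop-in replacement for method 2B, something like
the following untested sketch (reusing file, nRow, and nCol from above),
reading one row's worth of elements per call:

str = OpenRead[file, BinaryFormat -> True];
mat4 = Table[BinaryReadList[str, "Real64", nCol], {nRow}]; // AbsoluteTiming
Close[str];
mat4 == mat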
Obviously working with everything in memory is the way to go if at all
possible; what I'm looking for is the optimal (or at least, a not-too-
sub-optimal) solution when that *isn't* possible.
Cheers,
Peter.
While I don't have deep enough knowledge to answer your questions, I would
suggest looking at BinaryReadList. It has an optional third parameter which
specifies how many elements you want to read at once. In the following
example, I read your test file in 1000 iterations, reading 1000 elements
at each iteration. Even though I used AppendTo to form the resulting
1000 x 1000 matrix, the timing is comparable to that of the single-shot
BinaryReadList call.
In[1]:= file = "C:\\test.dat";
In[2]:=
res = {};
str = OpenRead[file, BinaryFormat -> True];
Do[AppendTo[res,
BinaryReadList[str, "Real64", 1000]], {1000}]; // AbsoluteTiming
Close[str];
Out[4]= {0.1250016, Null}
In[6]:= res1 = {};
str = OpenRead[file, BinaryFormat -> True];
res1 = BinaryReadList[str, "Real64", 1000000]; // AbsoluteTiming
Close[str];
Out[8]= {0.0937512, Null}
In[10]:= Flatten[res] == res1
Out[10]= True
If you keep the stream open, you can read more of your file when needed,
without losing the efficiency of BinaryReadList. It looks like the only
limitation is that you will not have random access to the file, only
sequential. It would be nice if, in the future, the various read-write
Mathematica built-ins supported random-access streams, with functionality
similar to, say, Java's RandomAccessFile class.
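For example, a small sketch along these lines (readRows is just an
illustrative helper; nCols = 10 matches the row length in Peter's original
example) returns the next block of rows on each call, picking up wherever
the stream left off:

readRows[stream_, nRows_, nCols_] :=
  Partition[BinaryReadList[stream, "Real64", nRows*nCols], nCols]

str = OpenRead[file, BinaryFormat -> True];
chunk1 = readRows[str, 1000, 10];   (* rows 1 .. 1000 *)
chunk2 = readRows[str, 1000, 10];   (* rows 1001 .. 2000 *)
Close[str];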
Regards,
Leonid