I have been using the BinaryRead/BinaryWrite functions to do this, but am
finding that they run significantly (dramatically, even) slower than
reading the entire file using Import. It would be great to know if
anyone has found a solution for this and, if not, whether it's something
that's likely to improve in future versions.
Let me illustrate my attempts with an example (apologies for the
length of this; I've tried to make it as succinct as possible while
remaining non-trivial):
(* initial definitions *)
{nRow, nCol} = {100000,10};
mat = RandomReal[{-1,1}, {nRow,nCol}];
file = "C:\\falloon\\test.dat";
fmt = ConstantArray["Real64", nCol];
In[240]:= (* METHOD 1A: write to file using Export: very efficient *)
Export[file, mat, "Real64"] // AbsoluteTiming
Out[240]= {0.0937500,C:\falloon\test.dat}
In[241]:= (* METHOD 2A: write to file line-by-line *)
If[FileExistsQ[file], DeleteFile[file]];
str = OpenWrite[file, BinaryFormat->True];
Do[BinaryWrite[file, row, fmt], {row, mat}] // AbsoluteTiming
Close[str];
Out[249]= {2.1718750,Null}
(* METHOD 3A: write to file element-by-element *)
If[FileExistsQ[file], DeleteFile[file]];
str = OpenWrite[file, BinaryFormat->True];
Do[BinaryWrite[file, mat[[i,j]], "Real64"], {i,nRow}, {j,nCol}] //
AbsoluteTiming
Close[str];
Out[253]= {11.4296875,Null}
In[266]:= (* METHOD 1B: read entire file using Import *)
mat2 = Partition[Import[file, "Real64"], nCol]; // AbsoluteTiming
mat2 == mat
Out[266]= {0.1093750,Null}
Out[267]= True
In[255]:= (* METHOD 2B: read file line-by-line *)
str = OpenRead[file, BinaryFormat->True];
mat2 = Table[BinaryRead[str, fmt], {nRow}]; // AbsoluteTiming
Close[str];
mat == mat2
Out[256]= {11.7500000,Null}
Out[258]= True
In[259]:= (* METHOD 3B: read file element-by-element *)
str = OpenRead[file, BinaryFormat->True];
mat3 = Table[BinaryRead[str, "Real64"], {nRow}, {nCol}]; //
AbsoluteTiming
Close[str];
mat == mat3
Out[260]= {2.2812500,Null}
Out[262]= True
So, based on this example, I guess my questions can be summarized as:
1. Why is line-by-line or element-by-element reading so much slower
than importing everything at once?
2. Why is line-by-line writing faster than element-by-element, but
the other way around when reading?
3. Is there any solution or workaround that avoids reading the entire
file at once?
Many thanks for any help!
Cheers,
Peter.
The expensive part of a write operation is the interaction with the
storage medium (disk drive?), which requires establishing the proper
logical and physical relationships. For example, any write involves finding
the physical space on the disk, which may or may not be a simple set of
contiguous physical addresses, and updating the data that defines the file
to ensure that the proper set of physical addresses is read back in the
proper order to reconstruct the file. This not only requires a lot of
instructions, it involves the slowest interaction on your computer:
waiting for the read-write head to be in the proper location on the
disk.
These factors are inherent in the use of a disk drive. The design
tradeoff is capacity and persistence against speed -- drives give
capacity and persistence, RAM and cache give speed.
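One rough way to see how much of the cost comes from the number of individual
write operations, rather than the amount of data, is to write the same data
once as a single call and once as many small calls; a minimal sketch (the path
and size are placeholders):

(* one large call: the whole buffer is handed over at once *)
buf = RandomReal[{-1, 1}, 10^6];
str = OpenWrite["C:\\falloon\\overhead.dat", BinaryFormat -> True];
BinaryWrite[str, buf, "Real64"] // AbsoluteTiming
Close[str];

(* many small calls: same data, but 10^6 separate BinaryWrite invocations *)
str = OpenWrite["C:\\falloon\\overhead.dat", BinaryFormat -> True];
Do[BinaryWrite[str, buf[[i]], "Real64"], {i, Length[buf]}] // AbsoluteTiming
Close[str];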
ekt
Regards,
Kurt Tekolste
Kurt, thanks for the comments. I appreciate your point, and I
certainly agree that these types of considerations would explain why
the Import versions (method 1 in my example) are faster than the
others.
But I don't think that's the whole story since, by this reasoning,
method 2B should presumably be faster than 3B, whereas the reverse is
true! (It has been suggested that BinaryReadList may be more suitable,
but I haven't tried this yet...)
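Presumably it would be a drop-in replacement for method 2B, something like
the following untested sketch (reusing file, nRow, and nCol from above),
reading one row's worth of elements per call:

str = OpenRead[file, BinaryFormat -> True];
mat4 = Table[BinaryReadList[str, "Real64", nCol], {nRow}]; // AbsoluteTiming
Close[str];
mat4 == mat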
Obviously working with everything in memory is the way to go if at all
possible; what I'm looking for is the optimal (or at least, a not-too-
sub-optimal) solution when that *isn't* possible.
Cheers,
Peter.
While I don't have deep enough knowledge to answer your questions, I would
suggest looking at BinaryReadList. It has an optional third parameter which
specifies how many elements you want to read at once. In the following
example, I read your test file in 1000 iterations, reading 1000 elements
at each iteration. Even though I used AppendTo to form the resulting
1000 x 1000 matrix, the timing is comparable to that of the single-shot
BinaryReadList call.
In[1]:= file = "C:\\test.dat";
In[2]:=
res = {};
str = OpenRead[file, BinaryFormat -> True];
Do[AppendTo[res,
BinaryReadList[str, "Real64", 1000]], {1000}]; // AbsoluteTiming
Close[str];
Out[4]= {0.1250016, Null}
In[6]:= res1 = {};
str = OpenRead[file, BinaryFormat -> True];
res1 = BinaryReadList[str, "Real64", 1000000]; // AbsoluteTiming
Close[str];
Out[8]= {0.0937512, Null}
In[10]:= Flatten[res] == res1
Out[10]= True
If you keep the stream open, you can read more of your file when needed,
without losing the efficiency of BinaryReadList. It looks like the only
limitation is that you will not have random access to the file, only
sequential. It would be nice if, in the future, the various read-write
Mathematica built-ins supported random-access streams, with functionality
similar to, say, Java's RandomAccessFile class.
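For example, a small sketch along these lines (readRows is just an
illustrative helper; nCols = 10 matches the row length in Peter's original
example) returns the next block of rows on each call, picking up wherever
the stream left off:

readRows[stream_, nRows_, nCols_] :=
  Partition[BinaryReadList[stream, "Real64", nRows*nCols], nCols]

str = OpenRead[file, BinaryFormat -> True];
chunk1 = readRows[str, 1000, 10];   (* rows 1 .. 1000 *)
chunk2 = readRows[str, 1000, 10];   (* rows 1001 .. 2000 *)
Close[str];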
Regards,
Leonid