How can I reset a FileInputStream?

611 views
Skip to first unread message

Jacob Rief

unread,
Jan 13, 2010, 6:02:22 PM1/13/10
to Protocol Buffers
Hello Kenton,

currently I have the following problem: I have a very big file with
many small messages serialized with Protobuf. Each message contains
its owner separator and thus can be found even in an unsynchronized
stream. I move through this file using lseek64, because
FileInputStream::Skip only works into forwarding direction and
FileInputStream::BackUp can move back only up to the current buffer
boundary. Since I am the owner of the file descriptor, also used by
FileInputStream, I can randomly seek to any position in the file.
However after seek'ing, my FileInputStream is obviously in an unusable
state and has to be reset. Currently the only feasible solution is to
replace the current FileInputStream object by a new one - which,
somehow is quite inefficient!

Wouldn't it make sense to add a member function which resets a
FileInputStream to the state of a natively opened and repositioned
file descriptor? Or is there any other solution to randomly access the
raw content of the file, say by wrapping seek?

Regards,
Jacob

Kenton Varda

unread,
Jan 13, 2010, 6:25:44 PM1/13/10
to Jacob Rief, Protocol Buffers
On Wed, Jan 13, 2010 at 3:02 PM, Jacob Rief <jacob...@gmail.com> wrote:
Currently the only feasible solution is to
replace the current FileInputStream object by a new one - which,
somehow is quite inefficient!

What makes you think it is inefficient?  It does mean the buffer has to be re-allocated but with a decent malloc implementation that shouldn't take long.  Certainly the actual reading from the file would take longer.  Have you seen performance problems with this approach?
 
Wouldn't it make sense to add a member function which resets a
FileInputStream to the state of a natively opened and repositioned
file descriptor?

If there really is a performance problem with allocating new objects, then sure.

Jacob Rief

unread,
Jan 17, 2010, 5:46:47 PM1/17/10
to Protocol Buffers
Hello Kenton,

> What makes you think it is inefficient?  It does mean the buffer has to be
> re-allocated but with a decent malloc implementation that shouldn't take
> long.  Certainly the actual reading from the file would take longer.  Have
> you seen performance problems with this approach?

Well, in order to see any performance penalties, I would have to
implement FileInputStream::Reset() and compare the results with the
current implementation, (I can do that, if there is enough interest).
I reviewed the implementation and I saw that by reinstantiating a
FileInputStream object, 3 destructors and 3 constructors have to be
called, where one (CopyingInputStreamAdaptor) invalidates a buffer
which in the "Next() step" immediately afterwards has to be
reallocated. A Reset() function would avoid these unnecessary steps.

> If there really is a performance problem with allocating new objects, then
> sure.

From the performance point of view, its certainly not a big issue, but
from the code cleanness point of view, it is.
I have written a class named LzipInputStream, which offers a Reset()
functionality to randomly access any part of the uncompressed input
stream without having to decompress everything. Therefore this Reset()
function is called quite often and it has to destroy and recreate its
lower layer, ie. the FileInputStream. If each stackable ...InputStream
offers a Reset() function, the upper layer then only has to call Reset
() on the lower layer, instead of keeping track how to reconstruct the
lower layered FileInputStream object.

Regards, Jacob

Kenton Varda

unread,
Jan 18, 2010, 3:18:33 PM1/18/10
to Jacob Rief, Protocol Buffers
I think that you'll find the performance penalty is irrelevant.  It doesn't matter how many constructors and destructors are called.  In this context, all the sub-constructors/destructors can be fully inlined.  Even if they couldn't be, the overhead of six function calls is completely irrelevant compared to the overhead of reading from a file descriptor.  The deleting and re-allocating of the buffer is more expensive, but even that is likely to be dwarfed by the I/O costs.  Please do feel free to try it out and get actual numbers, but my guess is that you'll find no real difference.

As for code cleanliness, I find the Reset() method awkward since the user has to remember to call it at the same time as they do some other operation, like seeking the file descriptor.  Either calling Reset() or seeking the file descriptor alone will put the object in an inconsistent state.  It might make more sense to offer an actual Seek() method which can safely perform both operations together with an interface that is not so easy to misuse.

--
You received this message because you are subscribed to the Google Groups "Protocol Buffers" group.
To post to this group, send email to prot...@googlegroups.com.
To unsubscribe from this group, send email to protobuf+u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/protobuf?hl=en.




Jacob Rief

unread,
Jan 18, 2010, 6:14:51 PM1/18/10
to Kenton Varda, Protocol Buffers
Hello Kenton,

2010/1/18 Kenton Varda <ken...@google.com>:
(...snip...)


> As for code cleanliness, I find the Reset() method awkward since the user
> has to remember to call it at the same time as they do some other operation,
> like seeking the file descriptor.  Either calling Reset() or seeking the
> file descriptor alone will put the object in an inconsistent state.  It
> might make more sense to offer an actual Seek() method which can safely
> perform both operations together with an interface that is not so easy to
> misuse.

I agree, a 64bit Seek function would definitely be safer.
Here is a patch for version 2.3.0 to add FileInputStream::Seek(). The
patch also fixes a compiler warning.
I also adopted my library to use this function and it seems to work.
Regards, Jacob

protobuf-2.3.0-Seek.patch

Kenton Varda

unread,
Jan 18, 2010, 7:37:32 PM1/18/10
to Jacob Rief, Protocol Buffers
Did you do any tests to determine if the performance difference is relevant?

Jacob Rief

unread,
Jan 20, 2010, 5:17:28 PM1/20/10
to Kenton Varda, Protocol Buffers
Hello Kenton,
now I did some benchmarks, while Seek'ing though a FileInputStream.
The testing code looks like this:

boost::posix_time::ptime
t0(boost::posix_time::microsec_clock::local_time()); // initialize
boost::posix_time
boost::shared_ptr<google::protobuf::io::FileInputStream>
fileInStream = new
google::protobuf::io::FileInputStream(fileDescriptor);
boost::posix_time::ptime t1(boost::posix_time::microsec_clock::local_time());
// using Seek(), the function available through my patch
fileInStream->Seek(offset, whence);
boost::posix_time::ptime t2(boost::posix_time::microsec_clock::local_time());
// this is the default method of achieving the same
::lseek64(fileDescriptor, offset, whence);
fileInStream.reset(new google::protobuf::io::FileInputStream(fileDescriptor));
boost::posix_time::ptime t3(boost::posix_time::microsec_clock::local_time());
std::cerr << "t1: " << boost::posix_time::time_period(t1,
t2).length() << " t2: " << boost::posix_time::time_period(t2,
t3).length() << std::endl;

and on my Intel Core2 Duo CPU E8400 (3.00GHz) with 4GB of RAM,
gcc-Version 4.4.1 20090725, compiled with -O2
I get these numbers:
t1: 00:00:00.000001 t2: 00:00:00.000003
t1: 00:00:00.000001 t2: 00:00:00.000003
t1: 00:00:00.000001 t2: 00:00:00.000004
t1: 00:00:00.000001 t2: 00:00:00.000007
t1: 00:00:00.000001 t2: 00:00:00.000002
t1: 00:00:00.000001 t2: 00:00:00.000003
t1: 00:00:00.000002 t2: 00:00:00.000003
t1: 00:00:00.000001 t2: 00:00:00.000004
t1: 00:00:00.000001 t2: 00:00:00.000004
t1: 00:00:00.000001 t2: 00:00:00.000003
t1: 00:00:00.000001 t2: 00:00:00.000004

In absolute numbers, ~1 microsecond compared to 3-4 microseconds is
not a big difference,
but from a relative point of view, direct Seek'ing is much faster than
object recreation. And since
I have to seek a lot in the FileInputStream, the measured times will accumulate.

Regards, Jacob

2010/1/19 Kenton Varda <ken...@google.com>:

Henner Zeller

unread,
Jan 20, 2010, 5:25:22 PM1/20/10
to Jacob Rief, Kenton Varda, Protocol Buffers
On Wed, Jan 20, 2010 at 14:17, Jacob Rief <jacob...@gmail.com> wrote:
> Hello Kenton,
> now I did some benchmarks, while Seek'ing though a FileInputStream.
> The testing code looks like this:
>
>  boost::posix_time::ptime
> t0(boost::posix_time::microsec_clock::local_time()); // initialize
> boost::posix_time
>  boost::shared_ptr<google::protobuf::io::FileInputStream>
> fileInStream = new
> google::protobuf::io::FileInputStream(fileDescriptor);
>  boost::posix_time::ptime t1(boost::posix_time::microsec_clock::local_time());
>   // using Seek(), the function available through my patch
>  fileInStream->Seek(offset, whence);
>  boost::posix_time::ptime t2(boost::posix_time::microsec_clock::local_time());
>  // this is the default method of achieving the same
>  ::lseek64(fileDescriptor, offset, whence);
>  fileInStream.reset(new google::protobuf::io::FileInputStream(fileDescriptor));
>  boost::posix_time::ptime t3(boost::posix_time::microsec_clock::local_time());
>  std::cerr << "t1: " <<        boost::posix_time::time_period(t1,
> t2).length() << " t2: " << boost::posix_time::time_period(t2,
> t3).length() << std::endl;

Can you run this microbenchmark in a loop, say 10^6 times and use a
couple of different offsets you jump back and forth - and then time
the whole loop ? Just a single measurement is not terribly reliable
even though it would at least suggest a trend here.

Kenton Varda

unread,
Jan 20, 2010, 5:25:34 PM1/20/10
to Jacob Rief, Protocol Buffers
(1) Normally micro-benchmarks involve running the operation in a loop many times so that the total time is closer to 1s or more, not running the operation once and trying to time that.  System clocks are not very accurate at that scale, and depending on what kind of clock it is, it may actually take significantly longer to read the lock than it does not allocate memory.

(2) Your benchmark does not include the time spent actually reading the file, which is what I asserted would be much slower than re-allocating the buffer.  Sure, the seek itself is fast but it is pointless without actually reading.

(3) What memory allocator are you using?  With tcmalloc, a malloc/free pair should take around 50ns, two orders of magnitude less than your 4us measurement.

Jacob Rief

unread,
Jan 21, 2010, 2:50:31 PM1/21/10
to Protocol Buffers
Hello Kenton,

2010/1/20 Kenton Varda <ken...@google.com>:


> (1) Normally micro-benchmarks involve running the operation in a loop many
> times so that the total time is closer to 1s or more, not running the
> operation once and trying to time that.  System clocks are not very accurate
> at that scale, and depending on what kind of clock it is, it may actually
> take significantly longer to read the lock than it does not allocate memory.
> (2) Your benchmark does not include the time spent actually reading the
> file, which is what I asserted would be much slower than re-allocating the
> buffer.  Sure, the seek itself is fast but it is pointless without actually
> reading.

now I modified the benchmark, now the code looks like this

boost::posix_time::ptime time0(boost::posix_time::microsec_clock::local_time());
boost::posix_time::ptime time1(boost::posix_time::microsec_clock::local_time());
for (int i = 0; i<1000000; ++i) {
       const void* data;
       int size;
       fileInStream->Seek(offset, whence);
       fileInStream->Next(&data, &size);
}
boost::posix_time::ptime time2(boost::posix_time::microsec_clock::local_time());
for (int i = 0; i<1000000; ++i) {
       const void* data;
       int size;


       ::lseek64(fileDescriptor, offset, whence);
       fileInStream.reset(new
google::protobuf::io::FileInputStream(fileDescriptor));

       fileInStream->Next(&data, &size);
}
boost::posix_time::ptime time3(boost::posix_time::microsec_clock::local_time());
std::cerr << "t1: " <<  boost::posix_time::time_period(time1,
time2).length() << " t2: " << boost::posix_time::time_period(time2,
time3).length() << std::endl;

The difference now is less significant, but still measurable:
t1: 00:00:02.068949 t2: 00:00:02.389942
t1: 00:00:02.092842 t2: 00:00:02.429206
t1: 00:00:02.080614 t2: 00:00:02.394708
t1: 00:00:02.094289 t2: 00:00:02.429952
t1: 00:00:02.323403 t2: 00:00:03.723459
t1: 00:00:02.151486 t2: 00:00:03.711809
t1: 00:00:02.084442 t2: 00:00:02.416326
t1: 00:00:02.052930 t2: 00:00:02.383500

> (3) What memory allocator are you using?  With tcmalloc, a malloc/free pair
> should take around 50ns, two orders of magnitude less than your 4us
> measurement.

The 'new' operator is not overloaded. I use gcc-Version 4.4.1 20090725
(Red Hat 4.4.1-2)

Regards, Jacob

Kenton Varda

unread,
Jan 21, 2010, 5:16:41 PM1/21/10
to Jacob Rief, Protocol Buffers
Can you send me the complete source file for this benchmark?

Kenton Varda

unread,
Jan 27, 2010, 8:25:44 PM1/27/10
to Jacob Rief, Protocol Buffers
Your new results seem to indicate that the extra malloc/free can cost between 300ns and 1500ns.  At the low end, the overhead is negligible and not worth complicating the interface.  At the high end, your results seem hard to believe, unless your memory allocator is extremely poor.  And even if that is the case, why the variability?  Note that the memory allocator is part of libc, not GCC, so it's the libc version that matters.

I wanted the source code so that I could run the test using tcmalloc instead.  Feel free to try this yourself:


Even if we did add the Seek() method, using tcmalloc is probably a good idea for any C++ app that cares about performance.

On Thu, Jan 21, 2010 at 3:14 PM, Jacob Rief <jacob...@gmail.com> wrote:
Hello Kenton,
I added this code to my working project, but it would be too messy to
send the whole project. I could write a small testing application.
BTW, I also added the Seek() function to FileOutputStream, a) for
symmetry reasons and b) because in one occasion I have to reposition
the file descriptor while writing. Everything is included in this
patch.

Regards, Jacob

2010/1/21 Kenton Varda <ken...@google.com>:

Jacob Rief

unread,
Jan 29, 2010, 12:39:59 PM1/29/10
to Kenton Varda, Protocol Buffers
Hello Kenton,
here I added a small testing program, to show the difference together
with the Seek()-patch.
The benchmarks:
duration with Seek: 00:00:01.285103
duration conventional: 00:00:01.544068

duration with Seek: 00:00:01.295928
duration conventional: 00:00:01.565688

duration with Seek: 00:00:01.292432
duration conventional: 00:00:01.563996

duration with Seek: 00:00:01.292950
duration conventional: 00:00:01.619057

duration with Seek: 00:00:01.302095
duration conventional: 00:00:01.569414

duration with Seek: 00:00:01.291782
duration conventional: 00:00:01.549302

duration with Seek: 00:00:01.289552
duration conventional: 00:00:01.568220

as I told in my last email, I did not overload operator new.
Everything I use is an out-of-the-box Fedora 11 with their default
libraries: gcc-c++-4.4.1-2.fc11.x86_64, libstdc++-4.4.1-2.fc11.x86_64
and glibc-2.10.1-5.x86_64.
Sure, at Google the default allocator might be tcmalloc, but this adds
another layer of complexity to the building step. IMO releasing and
reallocating three or more blocks of memory always is less efficient
than just resetting the internal state of an object, regardless which
allocator you use.

I thought about a possible FileIn/OutputStream::Seek function and
after some rethinking I have to admit, that I don't really like that
idea. Let me explain why:
Assume you have an AESEncryptionInputStream stacked upon a
CompressionInputStream stacked upon a FileInputStream. Each stream
knows about its lower level which is a ZeroCopyInputStream. If for
some reason, you have to change the session key (which for instance is
encrypted using RSA) in the AESEncryptionInputStream, you must reset
all the internal buffers to a clean state. Then however,
ZeroCopyInputStream needs a virtual Reset() function so that each
stacked Stream can call Reset() on its lower level. This would allow
to flexibly stack different kinds of compatible streams over each
other. Without such a Reset() function the caller has to keep track
about deallocating and reallocating all the streams, which certainly
makes that code much harder to understand than just using one single
Reset() statement which does everything for the caller.

BTW: As you know, I wrote a streaming class using the LZMAlgorithm.
This class has a better compression ratio than GzipIn/OutputStream and
another big advantage: The LzipInputStream'ing class is able to
resynchronize to the beginning of any compression member inside an
LZMA-stream, so one can seek to any position of the underlying file
and still read data. In order to achieve this, I had to convince
Antonio Diaz, author of the LZMA library, to add a reset function to
its decoder. Otherwise it would have been much more complicate.
Thus it would be really nice to also have a Reset() function on the
"other side" of the interface.

Regards, Jacob


2010/1/28 Kenton Varda <ken...@google.com>:

test.cc
protobuf-2.3.0-Seek.patch
lzip_stream.cc
lzip_stream.h

Kenton Varda

unread,
Jan 29, 2010, 7:15:45 PM1/29/10
to Jacob Rief, Protocol Buffers
We can't add a new virtual method to the ZeroCopyInputStream interface because it would break existing implementations.

Besides that, it's unclear what the Reset() method means in an abstract sense.  Yes, you can define it for the particular set of streams that you're thinking of, but what does it mean in general?

What should ArrayInputStream::Reset() do?  In this case Reset() is nonsensical.

What should IstreamInputStream::Reset() do?  Should it only discard its own buffer, or should it also reset the underlying istream?  If that istream is itself wrapping a file descriptor, and you're trying to seek that file descriptor directly, then you need to reset the istream.  But maybe the user is actually calling IstreamInputStream::Reset() because they have seeked the istream itself and what IstreamInputStream to acknowledge this.  Who knows?  But you can't say that Reset() is only propagated down the stack by *some* implementations and not others.

No, we won't be adding a Reset() method because the meaning is unclear.

Meanwhile, you seem to have made an argument against FileInputStream::Seek():  Any streams layered on top of it will be broken if you Seek() the stream under them.  So you have to have some way to reset those streams, and the problem starts again!

Please just don't add anything new.  If you are unhappy with what ZeroCopy{Input,Output}Stream provide, you can always just create your own stream framework to use.  All you have to do is write a ZeroCopyStream wrapper around your own interface and protocol buffers will then be able to use it.  Presumably you'd only construct the wrapper object immediately before parsing/serializing and destroy it immediately after, so you could allocate it on the stack and avoid malloc overhead altogether.  This way you can do whatever you want with your streams.

Jacob Rief

unread,
Jan 30, 2010, 6:03:29 PM1/30/10
to Kenton Varda, Protocol Buffers
Hello Kenton,

2010/1/30 Kenton Varda <ken...@google.com>:


> We can't add a new virtual method to the ZeroCopyInputStream interface
> because it would break existing implementations.

but only on a binary level. God's sake you are not Microsoft :)

> Besides that, it's unclear what the Reset() method means in an abstract
> sense.  Yes, you can define it for the particular set of streams that you're
> thinking of, but what does it mean in general?

Put the object into a state equivalent to a state immediately after
construction. The Reset() "button" should be pressed only when you
know what you are doing - otherwise you will loose valuable data.

> What should ArrayInputStream::Reset() do?  In this case Reset() is
> nonsensical.

the same as its constructor without reallocation, ie.:
void ArrayInputStream::Reset() {
position_ = 0;
last_returned_size_ = 0;
}

> What should IstreamInputStream::Reset() do?  Should it only discard its own
> buffer, or should it also reset the underlying istream? If that istream is

void IstreamInputStream::Reset() {
impl_.Reset();
}

since impl_ is a CopyingInputStreamAdaptor which itself IS-A
ZeroCopyInputStream its Reset() is just another implementation, ie.

void CopyingInputStreamAdaptor::Reset() {
position_ = 0;
buffer_used_ = 0;
backup_bytes_ = 0;
}

> itself wrapping a file descriptor, and you're trying to seek that file
> descriptor directly, then you need to reset the istream.  But maybe the user
> is actually calling IstreamInputStream::Reset() because they have seeked the
> istream itself and what IstreamInputStream to acknowledge this.  Who knows?
>  But you can't say that Reset() is only propagated down the stack by *some*
> implementations and not others.

Since the creator of IstreamInputStream is the owner of the file
descriptor, its his responsibility to seek to whatever location
desired.

> No, we won't be adding a Reset() method because the meaning is unclear.
> Meanwhile, you seem to have made an argument against
> FileInputStream::Seek():  Any streams layered on top of it will be broken if
> you Seek() the stream under them.  So you have to have some way to reset
> those streams, and the problem starts again!

Exactly! Therefore instead of destroying and recreating them, a much
simpler Reset() function would do the job. The design of the streaming
classes is to consider a stream which can move only forward. It was
not designed for moving backwards or random access.

> Please just don't add anything new.  If you are unhappy with what
> ZeroCopy{Input,Output}Stream provide, you can always just create your own
> stream framework to use.

Well, I have to live with that decision. Maybe in the future some
other people have similar use cases. Maybe in version 3?

Just for curiosity, the protobuf code is really easy to read and to
understand. The only thing is disliked is the mapping of class names
to filenames. Is all the code inside Google written that clearly?

Regards, Jacob

Kenton Varda

unread,
Feb 1, 2010, 1:26:51 PM2/1/10
to Jacob Rief, Protocol Buffers
On Sat, Jan 30, 2010 at 3:03 PM, Jacob Rief <jacob...@gmail.com> wrote: 
> Please just don't add anything new.  If you are unhappy with what
> ZeroCopy{Input,Output}Stream provide, you can always just create your own
> stream framework to use.

Well, I have to live with that decision. Maybe in the future some
other people have similar use cases. Maybe in version 3?

I don't think this is that bad.  The ZeroCopyInputStream interface was really designed to be an adaptor for using other streaming frameworks with ProtocolBuffers.  It's not meant to be a streaming framework in itself -- that would be outside the scope of the library.  So, I'd rather keep the interface as minimal as possible, providing only what is strictly necessary for protocol buffers to operate.  I don't think Seek() or Reset() are strictly necessary, and since it's not clear to me that either one is necessarily "correct", I'd just rather not get into it.
 
Just for curiosity, the protobuf code is really easy to read and to
understand. The only thing is disliked is the mapping of class names
to filenames. Is all the code inside Google written that clearly?

Thanks!  I don't know if I can give an unbiased answer to this question, since I wrote most of the proto2 code, so of course to *me* it is the clearest code in all of Google.  ;)  I think in general the code here tends to be pretty high-quality due to our policy of reviewing every change pre-checkin.  However, there is certainly some ugly code, mostly code which evolved slowly over time with many different authors.

If you want to find out for yourself, we're hiring!  :)
Reply all
Reply to author
Forward
0 new messages