currently I have the following problem: I have a very big file with
many small messages serialized with Protobuf. Each message contains
its owner separator and thus can be found even in an unsynchronized
stream. I move through this file using lseek64, because
FileInputStream::Skip only works into forwarding direction and
FileInputStream::BackUp can move back only up to the current buffer
boundary. Since I am the owner of the file descriptor, also used by
FileInputStream, I can randomly seek to any position in the file.
However after seek'ing, my FileInputStream is obviously in an unusable
state and has to be reset. Currently the only feasible solution is to
replace the current FileInputStream object by a new one - which,
somehow is quite inefficient!
Wouldn't it make sense to add a member function which resets a
FileInputStream to the state of a natively opened and repositioned
file descriptor? Or is there any other solution to randomly access the
raw content of the file, say by wrapping seek?
Regards,
Jacob
Currently the only feasible solution is to
replace the current FileInputStream object by a new one - which,
somehow is quite inefficient!
Wouldn't it make sense to add a member function which resets a
FileInputStream to the state of a natively opened and repositioned
file descriptor?
> What makes you think it is inefficient? It does mean the buffer has to be
> re-allocated but with a decent malloc implementation that shouldn't take
> long. Certainly the actual reading from the file would take longer. Have
> you seen performance problems with this approach?
Well, in order to see any performance penalties, I would have to
implement FileInputStream::Reset() and compare the results with the
current implementation, (I can do that, if there is enough interest).
I reviewed the implementation and I saw that by reinstantiating a
FileInputStream object, 3 destructors and 3 constructors have to be
called, where one (CopyingInputStreamAdaptor) invalidates a buffer
which in the "Next() step" immediately afterwards has to be
reallocated. A Reset() function would avoid these unnecessary steps.
> If there really is a performance problem with allocating new objects, then
> sure.
From the performance point of view, its certainly not a big issue, but
from the code cleanness point of view, it is.
I have written a class named LzipInputStream, which offers a Reset()
functionality to randomly access any part of the uncompressed input
stream without having to decompress everything. Therefore this Reset()
function is called quite often and it has to destroy and recreate its
lower layer, ie. the FileInputStream. If each stackable ...InputStream
offers a Reset() function, the upper layer then only has to call Reset
() on the lower layer, instead of keeping track how to reconstruct the
lower layered FileInputStream object.
Regards, Jacob
--
You received this message because you are subscribed to the Google Groups "Protocol Buffers" group.
To post to this group, send email to prot...@googlegroups.com.
To unsubscribe from this group, send email to protobuf+u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/protobuf?hl=en.
2010/1/18 Kenton Varda <ken...@google.com>:
(...snip...)
> As for code cleanliness, I find the Reset() method awkward since the user
> has to remember to call it at the same time as they do some other operation,
> like seeking the file descriptor. Either calling Reset() or seeking the
> file descriptor alone will put the object in an inconsistent state. It
> might make more sense to offer an actual Seek() method which can safely
> perform both operations together with an interface that is not so easy to
> misuse.
I agree, a 64bit Seek function would definitely be safer.
Here is a patch for version 2.3.0 to add FileInputStream::Seek(). The
patch also fixes a compiler warning.
I also adopted my library to use this function and it seems to work.
Regards, Jacob
boost::posix_time::ptime
t0(boost::posix_time::microsec_clock::local_time()); // initialize
boost::posix_time
boost::shared_ptr<google::protobuf::io::FileInputStream>
fileInStream = new
google::protobuf::io::FileInputStream(fileDescriptor);
boost::posix_time::ptime t1(boost::posix_time::microsec_clock::local_time());
// using Seek(), the function available through my patch
fileInStream->Seek(offset, whence);
boost::posix_time::ptime t2(boost::posix_time::microsec_clock::local_time());
// this is the default method of achieving the same
::lseek64(fileDescriptor, offset, whence);
fileInStream.reset(new google::protobuf::io::FileInputStream(fileDescriptor));
boost::posix_time::ptime t3(boost::posix_time::microsec_clock::local_time());
std::cerr << "t1: " << boost::posix_time::time_period(t1,
t2).length() << " t2: " << boost::posix_time::time_period(t2,
t3).length() << std::endl;
and on my Intel Core2 Duo CPU E8400 (3.00GHz) with 4GB of RAM,
gcc-Version 4.4.1 20090725, compiled with -O2
I get these numbers:
t1: 00:00:00.000001 t2: 00:00:00.000003
t1: 00:00:00.000001 t2: 00:00:00.000003
t1: 00:00:00.000001 t2: 00:00:00.000004
t1: 00:00:00.000001 t2: 00:00:00.000007
t1: 00:00:00.000001 t2: 00:00:00.000002
t1: 00:00:00.000001 t2: 00:00:00.000003
t1: 00:00:00.000002 t2: 00:00:00.000003
t1: 00:00:00.000001 t2: 00:00:00.000004
t1: 00:00:00.000001 t2: 00:00:00.000004
t1: 00:00:00.000001 t2: 00:00:00.000003
t1: 00:00:00.000001 t2: 00:00:00.000004
In absolute numbers, ~1 microsecond compared to 3-4 microseconds is
not a big difference,
but from a relative point of view, direct Seek'ing is much faster than
object recreation. And since
I have to seek a lot in the FileInputStream, the measured times will accumulate.
Regards, Jacob
2010/1/19 Kenton Varda <ken...@google.com>:
Can you run this microbenchmark in a loop, say 10^6 times and use a
couple of different offsets you jump back and forth - and then time
the whole loop ? Just a single measurement is not terribly reliable
even though it would at least suggest a trend here.
2010/1/20 Kenton Varda <ken...@google.com>:
> (1) Normally micro-benchmarks involve running the operation in a loop many
> times so that the total time is closer to 1s or more, not running the
> operation once and trying to time that. System clocks are not very accurate
> at that scale, and depending on what kind of clock it is, it may actually
> take significantly longer to read the lock than it does not allocate memory.
> (2) Your benchmark does not include the time spent actually reading the
> file, which is what I asserted would be much slower than re-allocating the
> buffer. Sure, the seek itself is fast but it is pointless without actually
> reading.
now I modified the benchmark, now the code looks like this
boost::posix_time::ptime time0(boost::posix_time::microsec_clock::local_time());
boost::posix_time::ptime time1(boost::posix_time::microsec_clock::local_time());
for (int i = 0; i<1000000; ++i) {
const void* data;
int size;
fileInStream->Seek(offset, whence);
fileInStream->Next(&data, &size);
}
boost::posix_time::ptime time2(boost::posix_time::microsec_clock::local_time());
for (int i = 0; i<1000000; ++i) {
const void* data;
int size;
::lseek64(fileDescriptor, offset, whence);
fileInStream.reset(new
google::protobuf::io::FileInputStream(fileDescriptor));
fileInStream->Next(&data, &size);
}
boost::posix_time::ptime time3(boost::posix_time::microsec_clock::local_time());
std::cerr << "t1: " << boost::posix_time::time_period(time1,
time2).length() << " t2: " << boost::posix_time::time_period(time2,
time3).length() << std::endl;
The difference now is less significant, but still measurable:
t1: 00:00:02.068949 t2: 00:00:02.389942
t1: 00:00:02.092842 t2: 00:00:02.429206
t1: 00:00:02.080614 t2: 00:00:02.394708
t1: 00:00:02.094289 t2: 00:00:02.429952
t1: 00:00:02.323403 t2: 00:00:03.723459
t1: 00:00:02.151486 t2: 00:00:03.711809
t1: 00:00:02.084442 t2: 00:00:02.416326
t1: 00:00:02.052930 t2: 00:00:02.383500
> (3) What memory allocator are you using? With tcmalloc, a malloc/free pair
> should take around 50ns, two orders of magnitude less than your 4us
> measurement.
The 'new' operator is not overloaded. I use gcc-Version 4.4.1 20090725
(Red Hat 4.4.1-2)
Regards, Jacob
Hello Kenton,
I added this code to my working project, but it would be too messy to
send the whole project. I could write a small testing application.
BTW, I also added the Seek() function to FileOutputStream, a) for
symmetry reasons and b) because in one occasion I have to reposition
the file descriptor while writing. Everything is included in this
patch.
Regards, Jacob
2010/1/21 Kenton Varda <ken...@google.com>:
duration with Seek: 00:00:01.295928
duration conventional: 00:00:01.565688
duration with Seek: 00:00:01.292432
duration conventional: 00:00:01.563996
duration with Seek: 00:00:01.292950
duration conventional: 00:00:01.619057
duration with Seek: 00:00:01.302095
duration conventional: 00:00:01.569414
duration with Seek: 00:00:01.291782
duration conventional: 00:00:01.549302
duration with Seek: 00:00:01.289552
duration conventional: 00:00:01.568220
as I told in my last email, I did not overload operator new.
Everything I use is an out-of-the-box Fedora 11 with their default
libraries: gcc-c++-4.4.1-2.fc11.x86_64, libstdc++-4.4.1-2.fc11.x86_64
and glibc-2.10.1-5.x86_64.
Sure, at Google the default allocator might be tcmalloc, but this adds
another layer of complexity to the building step. IMO releasing and
reallocating three or more blocks of memory always is less efficient
than just resetting the internal state of an object, regardless which
allocator you use.
I thought about a possible FileIn/OutputStream::Seek function and
after some rethinking I have to admit, that I don't really like that
idea. Let me explain why:
Assume you have an AESEncryptionInputStream stacked upon a
CompressionInputStream stacked upon a FileInputStream. Each stream
knows about its lower level which is a ZeroCopyInputStream. If for
some reason, you have to change the session key (which for instance is
encrypted using RSA) in the AESEncryptionInputStream, you must reset
all the internal buffers to a clean state. Then however,
ZeroCopyInputStream needs a virtual Reset() function so that each
stacked Stream can call Reset() on its lower level. This would allow
to flexibly stack different kinds of compatible streams over each
other. Without such a Reset() function the caller has to keep track
about deallocating and reallocating all the streams, which certainly
makes that code much harder to understand than just using one single
Reset() statement which does everything for the caller.
BTW: As you know, I wrote a streaming class using the LZMAlgorithm.
This class has a better compression ratio than GzipIn/OutputStream and
another big advantage: The LzipInputStream'ing class is able to
resynchronize to the beginning of any compression member inside an
LZMA-stream, so one can seek to any position of the underlying file
and still read data. In order to achieve this, I had to convince
Antonio Diaz, author of the LZMA library, to add a reset function to
its decoder. Otherwise it would have been much more complicate.
Thus it would be really nice to also have a Reset() function on the
"other side" of the interface.
Regards, Jacob
2010/1/28 Kenton Varda <ken...@google.com>:
2010/1/30 Kenton Varda <ken...@google.com>:
> We can't add a new virtual method to the ZeroCopyInputStream interface
> because it would break existing implementations.
but only on a binary level. God's sake you are not Microsoft :)
> Besides that, it's unclear what the Reset() method means in an abstract
> sense. Yes, you can define it for the particular set of streams that you're
> thinking of, but what does it mean in general?
Put the object into a state equivalent to a state immediately after
construction. The Reset() "button" should be pressed only when you
know what you are doing - otherwise you will loose valuable data.
> What should ArrayInputStream::Reset() do? In this case Reset() is
> nonsensical.
the same as its constructor without reallocation, ie.:
void ArrayInputStream::Reset() {
position_ = 0;
last_returned_size_ = 0;
}
> What should IstreamInputStream::Reset() do? Should it only discard its own
> buffer, or should it also reset the underlying istream? If that istream is
void IstreamInputStream::Reset() {
impl_.Reset();
}
since impl_ is a CopyingInputStreamAdaptor which itself IS-A
ZeroCopyInputStream its Reset() is just another implementation, ie.
void CopyingInputStreamAdaptor::Reset() {
position_ = 0;
buffer_used_ = 0;
backup_bytes_ = 0;
}
> itself wrapping a file descriptor, and you're trying to seek that file
> descriptor directly, then you need to reset the istream. But maybe the user
> is actually calling IstreamInputStream::Reset() because they have seeked the
> istream itself and what IstreamInputStream to acknowledge this. Who knows?
> But you can't say that Reset() is only propagated down the stack by *some*
> implementations and not others.
Since the creator of IstreamInputStream is the owner of the file
descriptor, its his responsibility to seek to whatever location
desired.
> No, we won't be adding a Reset() method because the meaning is unclear.
> Meanwhile, you seem to have made an argument against
> FileInputStream::Seek(): Any streams layered on top of it will be broken if
> you Seek() the stream under them. So you have to have some way to reset
> those streams, and the problem starts again!
Exactly! Therefore instead of destroying and recreating them, a much
simpler Reset() function would do the job. The design of the streaming
classes is to consider a stream which can move only forward. It was
not designed for moving backwards or random access.
> Please just don't add anything new. If you are unhappy with what
> ZeroCopy{Input,Output}Stream provide, you can always just create your own
> stream framework to use.
Well, I have to live with that decision. Maybe in the future some
other people have similar use cases. Maybe in version 3?
Just for curiosity, the protobuf code is really easy to read and to
understand. The only thing is disliked is the mapping of class names
to filenames. Is all the code inside Google written that clearly?
Regards, Jacob
> Please just don't add anything new. If you are unhappy with what> ZeroCopy{Input,Output}Stream provide, you can always just create your ownWell, I have to live with that decision. Maybe in the future some
> stream framework to use.
other people have similar use cases. Maybe in version 3?
Just for curiosity, the protobuf code is really easy to read and to
understand. The only thing is disliked is the mapping of class names
to filenames. Is all the code inside Google written that clearly?