S3Object and closing streams

Peppe

May 6, 2007, 3:07:20 PM5/6/07
to JetS3t Users
The usage of data input streams in S3Object - both when storing and
retrieving objects - is a little unclear. In the case of files,
streams are lazily opened by JetS3t. In the case of an explicit
setDataInputStream() call, the stream is opened by the application
that uses JetS3t. Thus, it's a little unclear who is responsible for
closing a stream, but it appears to me that the application is. I
studied the source code (the REST implementation in particular) and
also did a test with a debug input stream (an overridden close() that
logs when it's called), and it seems that JetS3t indeed doesn't close
the streams - either on store or on retrieval of an object.
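A "close() monitor" of the kind described can be sketched in a few lines. The class and method names below are my own invention, not JetS3t's; any logging wrapper around the stream you hand to setDataInputStream() would do the same job:

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical debug wrapper: records and logs when close() is called,
// so you can verify whether the library ever closes the stream you gave it.
class CloseLoggingInputStream extends FilterInputStream {
    boolean closed = false;

    CloseLoggingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public void close() throws IOException {
        closed = true;
        System.out.println("close() called on stream");
        super.close();
    }
}

public class CloseLogDemo {
    public static void main(String[] args) throws IOException {
        CloseLoggingInputStream in =
            new CloseLoggingInputStream(new ByteArrayInputStream("data".getBytes()));
        while (in.read() != -1) { /* consume, as the library would on upload */ }
        System.out.println("closed before close(): " + in.closed);
        in.close();
        System.out.println("closed after close(): " + in.closed);
    }
}
```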

No sample code, however, closes the streams. That seems to be either
wrong (and misleading to new users), or I'm missing something.

To confuse matters further, I found one instance where JetS3t does
close the stream! In the putObjectWithSignedUrl() call in
RestS3Service, the input stream of the S3Object is closed! That method
seems a bit odd in general, as it modifies the passed S3Object rather
than returning a new one as the method comments state.

Can someone explain when close() is supposed to be called by the
application using JetS3t and when it gets called by JetS3t itself?
Getting close() right is pretty important when using files - otherwise
you run out of file descriptors - so this is an area where the
documentation could perhaps be improved.

Thanks!

Peppe

James Murty

May 7, 2007, 8:43:15 AM5/7/07
to jets3t...@googlegroups.com
Hi Peppe,

Thanks for bringing this to my attention - you're right that the data
input stream handling is incorrect.

The REST S3 service should indeed be responsible for closing the input
streams of S3Object instances provided to it. The fact that this
wasn't happening was an oversight, hidden by the fact that the multi-
threaded service used by all JetS3t applications did close these
streams. I have committed updated code that closes streams in the
RestS3Service.putObjectImpl() method and removed the stream closing
from the multi-threaded service, which is a little risky but will at
least bring any other errors to light more quickly.

The general rules for applications using JetS3t are:
- applications are not responsible for closing the streams of objects
uploaded to S3
- applications are responsible for closing the streams of objects
downloaded from S3, presumably after first consuming the data
(failure to close these streams can cause nasty side-effects beyond
open streams, such as network connections being held longer than
necessary)
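In code, the download-side rule amounts to "consume, then close in a finally block". The drainAndClose helper below is my own sketch, not part of JetS3t; with the real service you would pass it the stream obtained from a downloaded S3Object (the in-memory stream here is just a stand-in so the example runs on its own):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class DownloadCloseDemo {
    // Consume the object's data, then close the stream in a finally block
    // so the underlying network connection is released even on error.
    static byte[] drainAndClose(InputStream in) throws IOException {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            in.close(); // the application's responsibility on download
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the data stream of a downloaded S3Object.
        InputStream downloaded = new ByteArrayInputStream("object data".getBytes());
        byte[] data = drainAndClose(downloaded);
        System.out.println("read " + data.length + " bytes");
    }
}
```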

As you point out, the DataInputStream/File distinction is a little
odd. The use of file objects instead of input streams was a late
addition to the toolkit that was necessary due to platform limits on
the number of open files that are permitted. Unfortunately Java
doesn't allow the creation of InputStream objects that aren't
automatically opened, so the necessary work-around is to ensure that
file input streams are only created at the moment they're needed. By
combining this with thread management (performed by the multi-
threaded service) you can ensure that only a limited number of files
are ever open at one time.
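The lazy-open work-around can be illustrated with a stand-in class like the one below (JetS3t's internal implementation is not shown here; the class is purely illustrative). The trick is to hold only a File reference and create the FileInputStream at the first read, so no file descriptor is held until data is actually needed:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;

// Illustrative sketch: the file descriptor is opened only on first read,
// not at construction time, so many instances can exist without
// exhausting the platform's open-file limit.
class LazyFileInputStream extends InputStream {
    private final File file;
    private InputStream in; // null until first read => no descriptor held

    LazyFileInputStream(File file) {
        this.file = file;
    }

    boolean isOpen() {
        return in != null;
    }

    @Override
    public int read() throws IOException {
        if (in == null) {
            in = new FileInputStream(file); // descriptor opened only now
        }
        return in.read();
    }

    @Override
    public void close() throws IOException {
        if (in != null) {
            in.close();
        }
    }
}

public class LazyOpenDemo {
    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("lazy", ".txt");
        f.deleteOnExit();
        try (FileWriter w = new FileWriter(f)) {
            w.write("hi");
        }
        LazyFileInputStream in = new LazyFileInputStream(f);
        System.out.println("open before read: " + in.isOpen());
        in.read();
        System.out.println("open after read: " + in.isOpen());
        in.close();
    }
}
```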

Good catch too with the comments on putObjectWithSignedUrl() - they
were plainly misleading. I have updated them.

Hopefully the most recent changes in CVS address these issues. It
would be great, Peppe, if you could check out this code and confirm
the fixes for me with your "stream closing" monitor.

Thanks for the precise and well-researched feedback, more is always
welcome.

Cheers,
James

Peppe

May 7, 2007, 12:09:57 PM5/7/07
to JetS3t Users
Thanks for the quick answer! It makes sense that close() is manual for
downloads - and close() on the stream should then propagate to the
network connection, which I haven't verified, but there appears to be
some code in place to do that. For uploads, it's reasonable for
close() to be JetS3t's responsibility, since in most cases you can't
really "reset" the input stream anyway, so it's unusable afterwards.
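The "can't reset" point is easy to check: whether a stream can be rewound depends on markSupported(), and the common upload sources can't be. A small demo, assuming nothing beyond the JDK:

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ResetDemo {
    public static void main(String[] args) throws IOException {
        // In-memory streams can be rewound via mark()/reset()...
        InputStream mem = new ByteArrayInputStream("abc".getBytes());
        System.out.println("ByteArrayInputStream markSupported: " + mem.markSupported());

        // ...but plain file streams - the usual upload source - cannot,
        // so once consumed by an upload they are effectively spent.
        File f = File.createTempFile("reset", ".txt");
        f.deleteOnExit();
        try (InputStream fileIn = new FileInputStream(f)) {
            System.out.println("FileInputStream markSupported: " + fileIn.markSupported());
        }
    }
}
```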

I think the ideal would be to have methods in S3Object for both
storing and retrieval, with the user explicitly opening and closing
the streams (an output stream for storing and an input stream for
retrieval) and being responsible for reading and writing the data. The
problem with that is all the other stuff that goes on - metadata,
content length, content type, ACL, etc. In that model it would all
have to happen "behind the scenes", which means you'd have to set it
up correctly before opening an output stream to store an object. The
nice thing is that you could always read data from an S3Object in a
consistent way, without the "read only once" behaviour. And one could
add some sort of data caching too, which would be useful for smaller
objects.

Is there an automatically built daily binary from CVS, or should I
pull the source from CVS myself?

Peppe

James Murty

May 13, 2007, 10:39:21 PM5/13/07
to jets3t...@googlegroups.com
Re CVS: there is no automatic daily build of the codebase, so a standard CVS checkout is the best way to obtain the latest version. The checked-out project includes Ant build scripts. I sometimes provide the latest unstable version as a distribution via the website on request (e.g. for non-developers who need bug fixes), but this is a manual process.


The main problem with the whole input/output stream ownership question is that, ultimately, the streams have to be controlled at the networking level. This is partly a side effect of using the HttpClient library, but it makes sense more generally because the networking layer is responsible for sending and receiving data in a timely manner, and for managing connections to achieve clean-up and reuse of networking resources.

Unfortunately, the more control the user has over the S3Object streams, the less reliable the toolkit as a whole will be. This is especially the case with S3, which has fairly unforgiving network timeouts - it can get cranky if it doesn't receive any data for only 10-15 seconds, so any extra delays added by users' code would be a problem.

A possible work-around to get the kind of behaviour you describe, where the data in S3Objects can be read multiple times and whenever the user wishes, would be to automatically write the data received from S3 to a temporary file on disk, then use that as a cache to be re-read. This is certainly possible, and probably even desirable in some cases, but it makes too many assumptions to belong at the S3Object level (what if the user's hard disk is full, or they only read the data once, in which case there's a lot of unnecessary file reading and writing?).
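The temp-file caching idea could look something like this sketch (the spoolToTempFile and readAll helpers are hypothetical names of my own; the in-memory stream stands in for the one-shot stream of a downloaded S3Object so the example is self-contained):

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class TempFileCacheDemo {
    // Spool the one-shot downloaded stream to a temporary file once,
    // then hand out a fresh FileInputStream for each subsequent read.
    static File spoolToTempFile(InputStream in) throws IOException {
        File tmp = File.createTempFile("s3cache", ".bin");
        tmp.deleteOnExit();
        try (FileOutputStream out = new FileOutputStream(tmp)) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } finally {
            in.close(); // the network stream is spent after spooling
        }
        return tmp;
    }

    static String readAll(File f) throws IOException {
        try (InputStream in = new FileInputStream(f)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the one-shot data stream of a downloaded object.
        InputStream downloaded = new ByteArrayInputStream("cached data".getBytes());
        File cache = spoolToTempFile(downloaded);
        // The data can now be read as many times as needed.
        System.out.println(readAll(cache));
        System.out.println(readAll(cache));
    }
}
```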

Maybe a wrapper object of some kind would best fulfil these goals? It could hide the complexity of the S3Object and S3Service interactions, and have some smarts to upload objects only once the user has finished providing the data, and to cache downloaded objects.