Uploading compressed files


Patrick Lightbody

Jun 2, 2010, 10:41:49 AM
to JetS3t Users
I've got the following warning in my logs:

WARN 06/02 14:38:10 o.j.s.i.r.h.RestS3S~ - Content-Length of data stream not set, will automatically determine data length in memory

It is coming from the code below. I understand that it has to
determine the length because I'm not setting the Content-Length
manually and I'm not using a File reference. As you can see, I'm
writing the raw *uncompressed* file to disk and then giving jets3t a
GZipDeflatingInputStream wrapped around a RepeatableFileInputStream.

It seems like the solution is to write the file compressed and then
just give jets3t a handle to the file. My problem is that there is no
equivalent GZipDeflatingOutputStream in the jets3t library. Will the
standard GZIPOutputStream provided by Java work? Initially I assumed
it would, but I started to have doubts after comparing the source of
the jets3t implementations with the JDK ones.

Any thoughts? Thanks!

Patrick

// get the key ready
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
String key = sdf.format(new java.util.Date()) + "/" +
        record.job.getAccountId() + "/" + record.job.getId() + "/" +
        tx.getId() + ".har";
S3Object s3Object = new S3Object(key);

// write to disk as GZIP'd JSON. We do this so that we don't buffer it in
// memory, resulting in potentially 2X memory being temporarily chewed up
File file = File.createTempFile("har-" + record.job.getId(), "-" + tx.getId());
try {
    OM.writeValue(new FileOutputStream(file), tx.toHar());

    // read the file back out and upload it
    s3Object.setDataInputStream(
            new GZipDeflatingInputStream(new RepeatableFileInputStream(file)));
    s3Object.setContentEncoding("gzip");
    s3Object.setContentType("application/json");
    s3.putObject(persistenceBucket, s3Object);
} finally {
    file.delete();
}

James Murty

Jun 2, 2010, 11:23:38 AM
to JetS3t Users
Hi Patrick,

You should definitely write out the pre-compressed file data if you
can. Caching the upload content and calculating lengths in-memory will
cause problems sooner or later.

Data written by the standard Java GZIPOutputStream should interoperate
fine with JetS3t's own (very primitive) Gzip streams. If it doesn't,
I should do something about it, probably by replacing the JetS3t
versions with more mature ones from a third-party library, so let me
know.
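
For the writing side, something along these lines should do it -- an
untested sketch that reuses the OM, record, tx, key, s3 and
persistenceBucket names from your snippet, uses java.util.zip.GZIPOutputStream,
and assumes the S3Object(File) constructor, which computes Content-Length
and MD5 from the file:

File file = File.createTempFile("har-" + record.job.getId(), "-" + tx.getId());
try {
    // compress while writing, so the temp file on disk is already gzip'd
    GZIPOutputStream gzOut = new GZIPOutputStream(new FileOutputStream(file));
    try {
        OM.writeValue(gzOut, tx.toHar());
    } finally {
        gzOut.close(); // flushes the gzip trailer
    }

    // built from a File, the S3Object knows its Content-Length up front,
    // so JetS3t no longer has to buffer the stream in memory
    S3Object s3Object = new S3Object(file);
    s3Object.setKey(key);
    s3Object.setContentEncoding("gzip");
    s3Object.setContentType("application/json");
    s3.putObject(persistenceBucket, s3Object);
} finally {
    file.delete();
}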

James

Patrick Lightbody

Jun 2, 2010, 7:19:35 PM
to jets3t...@googlegroups.com
James,
Yup - that worked. Might not hurt to add some minor docs on those
classes to indicate they are indeed compatible with the standard JDK
ones.

While I have you, I have a question that isn't exactly jets3t related
but I imagine you might have some good advice:

We're now storing ~5M records a month and expect that to continue to
grow. The structure for the keys is:

2010-06-02/CUST_ID/JOB_ID/TX_ID.har

Where CUST_ID = the internal customer ID, JOB_ID = a monitoring job
(for our website monitoring service), and TX_ID = a sequential number.

We want to be able to delete S3 objects after X months, so having
everything prefixed by date is nice. But we're also thinking we'll
want to be able to quickly fetch all files for customer Y and/or job Z
between dates A and B. I've been experimenting with
listObjectsChunked() and I can't quite figure out the delimiter
concept.

I understand prefix and it works great for our first requirement, but
I was hoping that maybe the delimiter could help with our second need?
If not, it's no big deal; we can always run a SQL query out of RDS to
identify the unique keys we need to pull down. But it'd be cool if I
could structure the keys in a way that achieves both requirements
without needing to query an external "roadmap" of our S3 objects.

Patrick


James Murty

Jun 2, 2010, 7:48:01 PM
to jets3t...@googlegroups.com
Hi Patrick,

Glad the GZip streams are still working OK. I'll add some notes to this effect to the Javadoc because the JetS3t versions are certainly basic, about as bare-bones as you can get while still working.

As for your delimiter question: if I understand your needs correctly, you may be able to get some of the way there using delimiters, though an alternative approach might be better. Basically, what you want to do is use '/' as a delimiter to help identify all the CUST_ID or JOB_ID "subdirectories" under the date ranges of interest.

What may be confusing you is that the "subdirectories" aren't returned as object keys from S3 since, on their own, they are not really complete keys at all -- just partial ones. When you get a chunked listing result from S3 you will need to pull out the CommonPrefixes strings to get the subdirectory path components.

The problem is that if you use a partial prefix like "2010-05" to find everything in May, applying a delimiter of "/" will only return the date portions of the path up to the next "/", and will not include any of the customer IDs. You would then need to list each exact date with a prefix like "2010-05-31/" (note the trailing slash in the prefix) and a delimiter of "/" to find the customer IDs that are the next deeper path components.
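
In code, the two-step listing looks roughly like this -- a sketch assuming
the listObjectsChunked(bucketName, prefix, delimiter, maxListingLength,
priorLastKey) overload and S3ObjectsChunk.getCommonPrefixes(), where s3 and
a bucketName string stand in for your service and bucket, and pagination
via priorLastKey is ignored for brevity:

// step 1: roll everything in May up to the first "/" after the prefix;
// the common prefixes are the exact date "subdirectories", e.g. "2010-05-31/"
S3ObjectsChunk days = s3.listObjectsChunked(
        bucketName, "2010-05", "/", 1000, null);

for (String dayPrefix : days.getCommonPrefixes()) {
    // step 2: list one level deeper under each date; the common prefixes
    // are now e.g. "2010-05-31/CUST_ID/"
    S3ObjectsChunk customers = s3.listObjectsChunked(
            bucketName, dayPrefix, "/", 1000, null);
    for (String customerPrefix : customers.getCommonPrefixes()) {
        System.out.println(customerPrefix);
    }
}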

This might be so painful that your RDS query is preferable?

James

Patrick Lightbody

Jun 6, 2010, 8:46:46 PM
to jets3t...@googlegroups.com
Cool, thanks for the tip. Yeah, I'd say it's easier just to query the DB :)

Patrick
