Syncing Discrete Paths


forum...@imaging-resource.com

Jan 18, 2008, 2:48:31 PM
to JetS3t Users
We are testing the JetS3t package and had a question about attempting
to sync discrete virtual directories under our bucket.

An example (not using 'real' bucket names, etc.):

We have a bucket called:

test.test.com

which we use with our DNS to be able to use our own URL for accessing
objects over the web.

We have one 'virtual folder' under that hierarchy called:

IMAGES

Under IMAGES we have several virtual directories:

Bob
Mary
Sue


That corresponds to the 'real' setup on our server of:

/var/www/html/IMAGES
/var/www/html/IMAGES/Bob
/var/www/html/IMAGES/Mary
/var/www/html/IMAGES/Sue

If we use the synchronize portion to synchronize everything, it works
as expected:

synchronize.sh UP test.test.com /var/www/html/IMAGES

That will sync all of the contents in Bob, Mary, etc.

Is there any way to sync only one of the 'virtual subfolders'? We've
tried various combinations but can't seem to get the results we want.

We got close, but it seems we ran into an issue with the delete flag.
We could get Bob to sync, for example, but Mary and Sue were deleted in
our test. Is there any way to turn off the delete behavior or constrain
the sync to the one virtual folder?

Thanks in advance.

Angel Vera

Jan 18, 2008, 3:01:53 PM
to jets3t...@googlegroups.com
You can try using .jets3t-ignore to skip directories. You create the .jets3t-ignore file at the parent level. For example:

if you have a directory /zzz/yyy/www.com, with the following content:

------------------content of /zzz/yyy/www.com----------
documents
pictures
camera
------------------------------------------------------------------------

you then create a .jets3t-ignore file inside /zzz/yyy/www.com, with the content below:
--------------------------
pictures
camera
--------------------------

and when JetS3t synchronizes, it will skip over the directories specified, as if they never existed.
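As a minimal shell sketch of setting that up (using a temporary directory as a stand-in for /zzz/yyy/www.com, so the paths here are placeholders):

```shell
# Stand-in for the sync root, e.g. /zzz/yyy/www.com
SYNC_ROOT=$(mktemp -d)
mkdir -p "$SYNC_ROOT/documents" "$SYNC_ROOT/pictures" "$SYNC_ROOT/camera"

# List the directories to skip, one name per line, in .jets3t-ignore
printf 'pictures\ncamera\n' > "$SYNC_ROOT/.jets3t-ignore"

cat "$SYNC_ROOT/.jets3t-ignore"
```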

forum...@imaging-resource.com

Jan 18, 2008, 4:50:40 PM
to JetS3t Users
Angel:

Thanks for the suggestion.

We have a large number of directories under IMAGES and the discrete
path that we want to sync would change so we'd have to update the
ignore file each time we only wanted to sync one virtual directory.

That would certainly work if we scripted it but hopefully there's a
more direct way.

James Murty

Jan 18, 2008, 6:28:43 PM
to jets3t...@googlegroups.com
One way to do this would be to use the --keepfiles option, which does not delete objects from S3. The problem with this approach is that you may end up with many outdated objects in S3 over time.

A better approach would be to run the synchronize command from within the specific directory you want to sync. If you specify an explicit S3 path in addition to the bucket name and use the * wildcard to refer to files within this directory, you can upload only the specific files you want.

For example, to sync only the contents of the Bob directory to the S3 path IMAGES/Bob in the test.test.com bucket:

cd /var/www/html/IMAGES/Bob
synchronize.sh  UP  test.test.com/IMAGES/Bob  *

Notice that the synchronize command includes a target S3 path after the bucket name, and that the files/folders inside the current directory are referred to using the wildcard.

This latter approach will still require some scripting to automate it, but the scripting should be simpler than using .jets3t-ignore files.
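That scripting could look something like this sketch (the bucket and folder names are the examples from this thread, and synchronize.sh is assumed to be on the PATH):

```shell
#!/bin/sh
# Sync each virtual folder separately, from inside that folder, so the
# comparison (and any deletes) is scoped to that folder alone.
BUCKET=test.test.com
SRC_ROOT=/var/www/html/IMAGES

for dir in Bob Mary Sue; do
  ( cd "$SRC_ROOT/$dir" && synchronize.sh UP "$BUCKET/IMAGES/$dir" * )
done
```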

Hope this helps,
James

James Murty

Jan 18, 2008, 6:31:44 PM
to jets3t...@googlegroups.com
Oh, and it's always worth mentioning that you can test exactly what a synchronize command will do without risking your data by including the --noaction option.
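For example, a dry run of the earlier command would look like:

```shell
cd /var/www/html/IMAGES/Bob
# --noaction reports what would be uploaded or deleted, without changing anything
synchronize.sh --noaction UP test.test.com/IMAGES/Bob *
```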

forum...@imaging-resource.com

Jan 22, 2008, 12:45:01 PM
to JetS3t Users
> For example, to sync only the contents of the Bob directory to the S3 path
> IMAGES/Bob in the test.test.com bucket:
>
> cd /var/www/html/IMAGES/Bob
> synchronize.sh  UP  test.test.com/IMAGES/Bob  *
>

James:

Thank you!

That seems to do what we want on a test run with one exception. If we
have an ignore file in said directory, it doesn't seem to be used. Is
that by design or is there a way to sync just the one directory while
also 'respecting' whatever ignore files might be in place in said
directory?

James Murty

Jan 22, 2008, 5:26:53 PM
to jets3t...@googlegroups.com
> James:
>
> Thank you!
>
> That seems to do what we want on a test run with one exception. If we
> have an ignore file in said directory, it doesn't seem to be used. Is
> that by design or is there a way to sync just the one directory while
> also 'respecting' whatever ignore files might be in place in said
> directory?

Ah, you have identified an oversight in the way JetS3t checks for .jets3t-ignore files.

I have just committed a patch to the CVS code base to make JetS3t look for a .jets3t-ignore file in the current working directory when files are synced from this directory. Previously, the current working directory was not checked - which is why your ignore settings are being, well, ignored.

If you are able to use the latest version of JetS3t from CVS, you can use the new fix straight away. If you are using an older version, you will either have to put up with this problem or patch the version you are using. I can provide you with the necessary code change if you wish to patch an older version.

Cheers,
James

forum...@imaging-resource.com

Jan 23, 2008, 2:42:20 PM
to JetS3t Users
James:

Thank you.

We're using the current stable version, so yes we'd prefer to patch
the version we are using.


James Murty

Jan 23, 2008, 6:29:54 PM
to jets3t...@googlegroups.com
I have attached a unified diff file to this email. If you apply this as a patch to the source file src/org/jets3t/service/utils/FileComparer.java and rebuild the project, the .jets3t-ignore fix will be applied.

Cheers,
James
FileComparer.diff

forum...@imaging-resource.com

Jan 24, 2008, 2:08:08 PM
to JetS3t Users
Thank you.

I follow the part about applying the patch, but how does one rebuild the
project itself?

Thanks,

James Murty

Jan 24, 2008, 6:03:25 PM
to jets3t...@googlegroups.com
The best approach is to check out the patched 0.5.0 version of the codebase from the CVS repository at java.net, and use the ANT build scripts it contains to create updated Jar files.

Here is a rough guide to the process. It's fairly complex, and the commands I describe below assume you're using a Linux or Mac OS X computer. Jump to the end of this message for a shortcut ;-)

- Install the Apache ANT build tools on your computer

- Sign up for a java.net account in the "jets3t" project with the Observer role: https://jets3t.dev.java.net/

- Check out the "Release-0_5_0-patched" branch of the CVS codebase with the following CVS command, using your java.net email address:
cvs -d :pserver:YourUs...@cvs.dev.java.net:/cvs co -r Release-0_5_0-patched jets3t

- Inside the "jets3t" codebase directory created by the CVS command, run the ANT build tool to compile a JetS3t distribution:
ant clean dist

- The directory "jets3t/dist/jets3t-0.5.0/jars/" will now contain Jar files built from the codebase. The file you are interested in is "jets3t-0.5.0.jar", which contains the updated FileComparer class

- Replace the "jets3t-0.5.0.jar" file in your normal JetS3t distribution with the updated version.
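Put together, the steps above look roughly like this (YOUR_USERNAME is a placeholder for your java.net account, and the final copy destination depends on where your existing JetS3t distribution lives):

```shell
# Check out the patched 0.5.0 branch (prompts for your java.net password)
cvs -d :pserver:YOUR_USERNAME@cvs.dev.java.net:/cvs co -r Release-0_5_0-patched jets3t

# Build a distribution with ANT
cd jets3t
ant clean dist

# Swap the freshly built jar into your existing JetS3t install
cp dist/jets3t-0.5.0/jars/jets3t-0.5.0.jar /path/to/your/jets3t-0.5.0/jars/
```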


As an alternative to all these steps, I have attached a pre-prepared Jar file I built for testing from the patched 0.5.0 codebase.

Hope this helps,
James
jets3t-0.5.0.jar

forum...@imaging-resource.com

Jan 25, 2008, 10:44:51 AM
to JetS3t Users
Thank you very much for the explanation and the pre-prepared jar.

forum...@imaging-resource.com

Feb 1, 2008, 6:52:58 PM
to JetS3t Users
James:

I downloaded the jets3t-0.5.0.jar file above and replaced the older
file in jets3t-0.5.0/jars with the updated file.

When I attempted to rerun a test command, I got the same results. The
ignore file appears to be ignored.

As best I can tell, the correct file was replaced since that's the
only .jar file with that name.

No cached files that I can see (running Java 1.6.x branch).

Any suggestions?

Thanks.

James Murty

Feb 1, 2008, 8:52:12 PM
to jets3t...@googlegroups.com
Oops, this is my fault. I've just re-checked my code, and it turns out that the change I sent you would ignore files in the current directory, but not directories. I assume your .jets3t-ignore file includes directory names?

I have attached an updated Jar file that includes a fix for this fault. As before, please replace your jars/jets3t-0.5.0.jar file with the attached version.

Hopefully this time I got it right...

James
jets3t-0.5.0.jar

forum...@imaging-resource.com

Feb 5, 2008, 3:35:28 PM
to JetS3t Users
Thank you.

On a quick check, that seems to perform as expected.

I'll let you know if we encounter any issues after we've tested more
extensively.

Thanks!

forum...@imaging-resource.com

Feb 15, 2008, 12:08:57 PM
to JetS3t Users
James:

We were able to get everything uploaded using the new jar but now are
running into what appears to be a timeout issue.

There are now some 500K files on S3 from the initial upload, and when
we attempt to do an UP sync of the full structure to upload various
changed files, it appears to time out while attempting to list/compare
the files. The error:

Exception in thread "main" org.jets3t.service.S3ServiceException: Failed to parse XML document with handler class org.jets3t.service.impl.rest.XmlResponsesSaxParser$ListBucketHandler
    at org.jets3t.service.impl.rest.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:108)
    at org.jets3t.service.impl.rest.XmlResponsesSaxParser.parseListBucketObjectsResponse(XmlResponsesSaxParser.java:125)
    at org.jets3t.service.impl.rest.httpclient.RestS3Service.listObjectsInternal(RestS3Service.java:793)
    at org.jets3t.service.impl.rest.httpclient.RestS3Service.listObjectsImpl(RestS3Service.java:758)
    at org.jets3t.service.S3Service.listObjects(S3Service.java:609)
    at org.jets3t.service.S3Service.listObjects(S3Service.java:587)
    at org.jets3t.service.S3Service.listObjects(S3Service.java:480)
    at org.jets3t.service.utils.FileComparer.buildS3ObjectMap(FileComparer.java:271)
    at org.jets3t.apps.synchronize.Synchronize.run(Synchronize.java:732)
    at org.jets3t.apps.synchronize.Synchronize.main(Synchronize.java:999)
Caused by: java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at com.sun.net.ssl.internal.ssl.InputRecord.readFully(InputRecord.java:293)
    at com.sun.net.ssl.internal.ssl.InputRecord.read(InputRecord.java:331)
    at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:782)
    at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:739)
    at com.sun.net.ssl.internal.ssl.AppInputStream.read(AppInputStream.java:75)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:254)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
    at org.apache.commons.httpclient.ChunkedInputStream.read(ChunkedInputStream.java:181)
    at java.io.FilterInputStream.read(FilterInputStream.java:111)
    at org.apache.commons.httpclient.AutoCloseInputStream.read(AutoCloseInputStream.java:107)
    at org.jets3t.service.io.InterruptableInputStream.read(InterruptableInputStream.java:72)
    at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.read(HttpMethodReleaseInputStream.java:123)
    at sun.nio.cs.StreamDecoder$CharsetSD.readBytes(StreamDecoder.java:411)
    at sun.nio.cs.StreamDecoder$CharsetSD.implRead(StreamDecoder.java:453)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:183)
    at java.io.InputStreamReader.read(InputStreamReader.java:167)
    at java.io.BufferedReader.fill(BufferedReader.java:136)
    at java.io.BufferedReader.read1(BufferedReader.java:187)
    at java.io.BufferedReader.read(BufferedReader.java:261)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1740)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.scanName(XMLEntityScanner.java:422)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(XMLDocumentFragmentScannerImpl.java:1291)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1756)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242)
    at org.jets3t.service.impl.rest.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:101)
    ... 9 more


Any suggestions?

Thanks.

James Murty

Feb 15, 2008, 6:01:23 PM
to jets3t...@googlegroups.com
This problem may be due to a fault with S3, and other AWS services, that has had a pretty widespread effect and caused unusual errors.

You can see the (massive) discussion about the issue here:
http://developer.amazonwebservices.com/connect/thread.jspa?threadID=19714

According to that thread, the issue has been resolved and things should be back to normal soon; let me know if this is not the case for you.

forum...@imaging-resource.com

Feb 16, 2008, 12:01:03 PM
to JetS3t Users
James:

Thank you.

I tried again this AM and unfortunately got the same error.

I get a similar error with Cockpit as well; it appears to time out
after loading info for only about 80K of the 500K objects.



James Murty

Feb 17, 2008, 8:54:32 PM
to jets3t...@googlegroups.com
Hmm, you are not the only user to report Read Timeout errors when listing the contents of a bucket that contains many objects.

In the 0.6.0 version of JetS3t I added a retry mechanism for this specific problem, but I don't know how effective this measure is, as I have not been able to replicate the issue when listing my buckets (my largest bucket is only 50K objects).

Could you try using the new online version of Cockpit to list the objects in this bucket to see if the retry mechanism is sufficient to solve this problem? If it is, I should be able to back-port the modification to the 0.5.0 version you are using.

Also, could you describe the object naming strategy you have used in your large bucket? I have been working on a new technique to speed up the object listing process by using multiple threads, but to work effectively it requires that the object names can be partitioned into roughly even groupings based on prefix strings. If your objects are named in a way that fits this model, the new listing feature could significantly speed up the synchronization process for large buckets.

Cheers,
James

forum...@imaging-resource.com

Feb 20, 2008, 2:04:23 PM
to JetS3t Users
James:

Thanks for the response.

I pulled down the 0.6.0 version and tried to list the objects with
cockpit.

It got further than before but timed out at about 155K of the 500K
items this time.

The error was a bit different this time, and it indicated an
out-of-memory error for the Java heap space:

Caused by: java.lang.OutOfMemoryError: Java heap space

Is there a way to increase that allocation perhaps?

Regarding the naming structure, the files are image files so there is
some commonality.

For example:

IMAGES/GROUPA/image1.jpg
IMAGES/GROUPA/image2.jpg
IMAGES/GROUPA/image3.jpg
IMAGES/GROUPA/THUMB/image1thumb.jpg
IMAGES/GROUPA/FULL/image1full.jpg
IMAGES/GROUPB/image1.jpg
IMAGES/GROUPB/image2.jpg
IMAGES/GROUPB/image3.jpg
IMAGES/GROUPB/THUMB/image1thumb.jpg
IMAGES/GROUPB/FULL/image1full.jpg

The groupings are fairly constant but the number of images in each
virtual folder does vary.

Thanks for all of your assistance.

James Murty

Feb 20, 2008, 4:53:40 PM
to jets3t...@googlegroups.com
You can increase the memory that is made available to the Java virtual machine by editing the synchronize script. The last line in the synchronize.sh and synchronize.bat scripts includes the setting -Xmx128M.

This setting allocates a maximum of 128 megabytes of memory to Java; to increase the amount of memory, change the "128" part of this setting to something larger, such as 512.
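For instance, the java invocation at the end of synchronize.sh would change along these lines (the exact line varies between JetS3t versions, and the classpath variable here is illustrative, so treat this as a sketch):

```shell
# Before: java -Xmx128M ...
# After: raise the maximum Java heap size to 512 MB
java -Xmx512M -classpath "$CP" org.jets3t.apps.synchronize.Synchronize "$@"
```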

I will contact you out-of-band to discuss the object naming scheme and possibilities for speeding up the object listing.

James
--
http://www.jamesmurty.com

James Murty

Feb 20, 2008, 11:45:01 PM
to jets3t...@googlegroups.com
I have just released a new version of the Synchronize application with a batch synching feature that should improve the performance and reliability of the application when synchronizing large buckets.

This latest version is new and has undergone only limited testing, but it will hopefully provide a solution for your synching difficulties.
 
Please see the following thread for information about this feature and further discussion:
http://groups.google.com/group/jets3t-users/t/8cc80dddac2a9706?hl=en

Cheers,
James



forum...@imaging-resource.com

Feb 22, 2008, 12:34:13 PM
to JetS3t Users
James:

Thanks.

I'll post feedback about the new version in the other thread.