Java Heap Space & Duplicate Files Rpt

Michael J. Bennett

Oct 20, 2011, 4:36:24 PM
to ace-devel
I've been testing ACE for about a week now on an area of our archival
store and have been pretty pleased with what the tool can do. I can
see us using this in production at UConn at some point soon.

I've run into a couple of issues, though, that I'd like to ask about.
One is that I've had to bump up my JVM's memory a couple of times to
avoid Java heap space crashes. As I've been doing this in small
increments in a hunt and peck manner, I figured I might as well just
ask what the optimum setting might be. I'm currently allocating 500m,
which seemed to work fine until another recent crash while running a
duplicate file list.

Speaking of the duplicate file list, my assumption was that it would
re-populate from scratch with new data after each full audit. But
what I'm currently seeing appears to be the same list even after
removing some previously reported duplicates and running a new audit.
What am I missing?

Thanks,

Michael

Mike

Oct 20, 2011, 7:24:10 PM
to ace-devel
Unfortunately, the duplicate file operation is very, very memory
intensive. Are you using a snapshot or the 1.6 release of ACE? The
latest snapshots, while requiring a minor db upgrade, have a rewritten
comparison mechanism that should perform a little bit better.
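
The gist, roughly: the compare ends up holding digest and path data for
the whole collection in memory at once, which is why it grows with
collection size. A minimal sketch of that general idea in Java
(illustration only, not the actual ACE code; the class and method names
here are made up):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustration only: group every path under its stored digest; any
    // digest that ends up with more than one path is a duplicate.
    public class DuplicateSketch {
        static Map<String, List<String>> groupByDigest(Map<String, String> digestByPath) {
            Map<String, List<String>> pathsByDigest = new HashMap<String, List<String>>();
            for (Map.Entry<String, String> e : digestByPath.entrySet()) {
                String digest = e.getValue();
                List<String> paths = pathsByDigest.get(digest);
                if (paths == null) {
                    paths = new ArrayList<String>();
                    pathsByDigest.put(digest, paths);
                }
                paths.add(e.getKey());  // e.getKey() is the file path
            }
            return pathsByDigest;  // entries with more than one path are duplicates
        }
    }

Since the whole map has to fit in the heap, memory use tracks the number
of files (and path lengths) in the collection being compared.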

In production on Chronopolis, I have our ACE monitor set to 4g and we
are able to compare collections of 5 million files. How large are the
collections you are trying to compare?
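
If you're running the ACE webapp under Tomcat, the heap comes from
whatever -Xmx the container's JVM was started with (for example -Xmx4g
via JAVA_OPTS or CATALINA_OPTS). A quick, generic way to double-check
what maximum heap a JVM actually picked up, nothing ACE-specific:

    // Prints the JVM's maximum heap in megabytes using the standard
    // Runtime API; run it with the same options you give Tomcat.
    public class HeapCheck {
        public static void main(String[] args) {
            long maxBytes = Runtime.getRuntime().maxMemory();
            System.out.println("Max heap: " + (maxBytes / (1024 * 1024)) + " MB");
        }
    }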

The comparison uses the database as it appears after the last audit.
Have you removed the files both from your underlying storage and from
ACE? When you rerun an audit, it will attempt to register anything
that it detects, even if you previously removed it from ACE.

-Mike


Michael J. Bennett

Oct 21, 2011, 10:54:40 AM
to ace-devel
Mike,

Thanks for the quick response. I'm currently running ACE 1.6 over
roughly 1TB's worth of data (91G files), just for testing purposes, over
a recently upgraded gigabit network connection to the file store. As
hoped, the network upgrade has allowed me to better assess the
scalability of running this in production. Based upon your Chronopolis
figures, which we'll never hit here but can be proportioned down to our
environment, it sounds like I might want to bump my JVM up further to
allow for additional headroom. Thanks for those comparative numbers.
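
(Rough back-of-envelope, assuming heap use during a compare scales more
or less linearly with file count: 4g across ~5 million files works out
to somewhere under 1 KB of heap per file, so a collection in the tens or
hundreds of thousands of files should need far less than that, plus
whatever baseline the webapp itself uses.)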

With regard to the duplicate file issue, I'd previously removed the
duplicates just from the file store and not from ACE. Such a
selective removal from ACE looks to be done by selecting the ACE
collection, navigating to "Browse," then "Remove," correct? If I'm
understanding you correctly, then I should be able to re-run an audit
on the collection and the "fixed" duplicates shouldn't appear in the
Show Duplicates File list. Is that right?

Michael

Mike

Oct 21, 2011, 1:05:11 PM
to ace-devel
Michael,

That's correct; you'll have to remove the files from ACE in order for
them not to be used during a compare. ACE should have marked them as
missing if you ran an audit after removing them from the local
filesystem.

-Mike
