java.io.File and unstable NFS filesystems

Wayne Fay

Jan 18, 2011, 4:33:25 PM
to java...@googlegroups.com
I've been running into a problem on and off for a while now and I'm
hoping maybe someone here might have some advice or suggestions...

For reasons unknown to me, it seems that my NFS mounts are
occasionally unstable (this is on Solaris and the NFS mounts are
hosted on NetApp Filers). Of course, WebLogic is writing logs to those
NFS mounts and all of my incoming and outgoing data files (100+ MB XML
files mostly) are located there along with logs of a Java daemon
process.

The Java daemon monitors the incoming data directory and automatically
"processes" the new files it finds by prepending a timestamp to the
name, moving it to an archive directory (on the same NFS mount), and
then parsing and working with the data in the XML files (mostly
reading and pushing it into an Oracle database).
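
The archive step described above could look roughly like the following. This is only a sketch under assumptions: the class name IncomingArchiver, the method archive, and the timestamp pattern are all hypothetical, since the thread does not show the daemon's code. It uses plain java.io.File and turns a failed rename into an exception, because renameTo reports failure only through its boolean return value.

```java
import java.io.File;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

class IncomingArchiver {
    // Hypothetical timestamp format; the thread does not say which one the daemon uses.
    private static final String STAMP_PATTERN = "yyyyMMddHHmmss";

    /**
     * Moves the incoming file to "<timestamp>_<original name>" inside archiveDir.
     * Returns the archived File, or throws IOException if the rename fails.
     */
    static File archive(File incoming, File archiveDir) throws IOException {
        String stamp = new SimpleDateFormat(STAMP_PATTERN).format(new Date());
        File target = new File(archiveDir, stamp + "_" + incoming.getName());
        // renameTo signals failure only with "false", so convert that
        // into an exception the caller cannot silently ignore.
        if (!incoming.renameTo(target)) {
            throw new IOException("Could not move " + incoming + " to " + target);
        }
        return target;
    }
}
```

A caller would pass the newly found file and the archive directory, then parse the returned File. An IOException here is one place where a dropped mount would be expected to surface.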

Every couple of months, the NFS mounts seem to drop out for a short
period of time (on the order of ~5 minutes or less). When it happens,
my Java daemon can no longer scan the filesystem for new files and it
never seems to recover even once the NFS mount is restored. And
sometimes I lose a WebLogic instance to boot. Most recently, I lost
the WL cluster admin instance AND a WL instance on the same box.

My network, OS and NetApp guys are all looking into why that is
happening (but it happens so infrequently that it is difficult to
investigate) and in the meantime, I need to find a way for my Java
processes to more gracefully recover from this. It is a real pain to
stop and restart this daemon process since we have ~10 instances
running on a variety of machines, and the production ones require
approvals to touch, etc.

Has anyone run into this? Does anyone have any specific suggestions or
advice? Obviously I can (and probably will) adjust the file scanner
code to try to catch this error and maybe throw away the File
object/handle and get a new one that might restore the connection to
the filesystem, but since that is basically already happening in the
daemon child process, I'm honestly unsure that will do much for me.
(Probably I will find there is a longer-lived File object somewhere
that is somehow trying to connect to the "old" filesystem and is in a
bad state, and so we're not really getting a "new" File object... but
I haven't dug deep into the code yet.)

Finally, if you have any suggestions on how I can effectively set up a
unit test for this disconnected filesystem so I can be certain that
I've fixed the problem, that would be appreciated too!

Thanks.
Wayne

Christian Catchpole

Jan 19, 2011, 5:52:09 PM
to The Java Posse
java.io.File shouldn't be holding any system handles. It basically
holds the path information. Note the thread from several months ago
about how inefficient this is on network drives when you are checking
attributes etc.: lots of round trips. The Java team said they weren't
going to "fix" (change) this behaviour.
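
A small way to see the "File is just a path" point on a local disk is sketched below. The class and method names are made up for illustration. It checks a directory three times through the same File instance: while it exists, after it is deleted, and after it is recreated. It only shows the Java side; what an NFS client caches or how a stale mount behaves is outside what this demonstrates.

```java
import java.io.File;

class FileIsJustAPath {
    /**
     * Queries the directory through the SAME File instance three times:
     * [0] after creating it, [1] after deleting it, [2] after recreating it.
     * Each isDirectory() call asks the filesystem again; the File object
     * itself stores only the path.
     */
    static boolean[] probe(File dir) {
        boolean[] seen = new boolean[3];
        dir.mkdir();
        seen[0] = dir.isDirectory();
        dir.delete();
        seen[1] = dir.isDirectory();
        dir.mkdir();
        seen[2] = dir.isDirectory();
        return seen;
    }
}
```

If the old File instance sees the directory again once it is back, then a long-lived File object on its own is unlikely to be what keeps the daemon stuck.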

If your code is using a FileInputStream or RandomAccessFile for
processing, this should be invalidated when the file system drops out.
You can't reopen these without constructing new ones from the File.
It's obviously hard to tell where the exact problem lies. Perhaps the
error handling and retry isn't working as cleanly as you expect (i.e.
it's irrelevant that it's a network drive). Perhaps the failing File
opens are killing the whole daemon process (or worse).
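
One way to make the "construct a new stream" step explicit is a small retry helper like the sketch below. The names RetryingOpener and openWithRetry, and the attempt and pause parameters, are assumptions for illustration, not anything from the daemon. Each attempt builds a fresh FileInputStream, so a stream opened before the outage is never reused.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

class RetryingOpener {
    /**
     * Tries to open the file up to maxAttempts times, sleeping pauseMillis
     * between failed attempts. Rethrows the last IOException if every
     * attempt fails.
     */
    static InputStream openWithRetry(File file, int maxAttempts, long pauseMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // A brand-new stream on every attempt.
                return new FileInputStream(file);
            } catch (IOException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis);
                }
            }
        }
        throw last;
    }
}
```

The important part is that the failure is caught per file and per attempt, so one bad open does not take down the polling loop or the whole process.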

You should conduct some experiments where you make the directory "go
away". Renaming your data directory should do the trick.
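
The rename experiment can be turned into a repeatable test with something like the sketch below. IncomingScanner, Pass, and scanOnce are hypothetical names. The scanner rebuilds its File from the path on every pass and treats a null from listFiles() as "directory unavailable" rather than "directory empty". Renaming a local directory is only an approximation of an NFS outage: a real stalled mount may block instead of failing quickly, which this does not reproduce.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class IncomingScanner {
    /** Outcome of one polling pass. */
    static final class Pass {
        final boolean available;   // false when the directory could not be listed
        final List<String> names;  // sorted names of regular files found

        Pass(boolean available, List<String> names) {
            this.available = available;
            this.names = names;
        }
    }

    /** Lists the directory once, without keeping any state between passes. */
    static Pass scanOnce(String dirPath) {
        // listFiles() returns null if the path is not a directory
        // or an I/O error occurs, so check before iterating.
        File[] found = new File(dirPath).listFiles();
        if (found == null) {
            return new Pass(false, Collections.<String>emptyList());
        }
        List<String> names = new ArrayList<>();
        for (File f : found) {
            if (f.isFile()) {
                names.add(f.getName());
            }
        }
        Collections.sort(names);
        return new Pass(true, names);
    }
}
```

A test can then scan, rename the directory away, scan again (expecting an unavailable pass), rename it back, and check that the next pass sees the files again.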

The other possibility is that the problem lies with the share
remount, and not necessarily with Java or your code, i.e. after the
remount (or just before it drops out) the directories are visible but
you still can't open the files properly (even if only for a short
time). I have no reason to think this is the case but it shouldn't be
overlooked. In other words, why did the share go away in the first
place?

As I say, you want to make sure your disk polling code is as
bulletproof as possible and then go from there.

Reinier Zwitserloot

Jan 20, 2011, 7:00:51 AM
to java...@googlegroups.com
Only somewhat related: There's such a thing as new new IO. Our own Carl Quinn has worked on it. This is a pretty big revamp of file I/O and might address either the OP's problem or Christian Catchpole's network speed issues with File's many, many round trips. I confess I don't know enough about it to know for sure, but perhaps you two may want to look at it, if only to dream of a future where you can rely on JDK 7 to be the standard Java version and use it.

Wayne Fay

Jan 20, 2011, 10:55:02 AM
to java...@googlegroups.com
> Only somewhat related: There's such a thing as new new IO. Our own Carl
> Quinn has worked on it. This is a pretty big revamp of file I/O and might
> address either the OP's problem or Christian Catchpole's network speed

Thanks Reinier and Christian for weighing in with your comments. I
will let you know how it turns out once I've nailed it down with a
unit test and resolved it. And yes, I am looking forward to JDK 7,
whenever it lands...

Wayne
