We had a couple of failures in a server application that we cannot yet
reproduce in a simple case. Analysis of the code suggests that the
only possible explanation is that the following occurs:

    os.remove(x)
    .... stuff
    if os.path.isfile(x):
        raise "Ooops, how did we get here?"
    file(x, "wb").write(content)

We end up in the raise. By the time we get to look at the system the
file is actually gone, presumably because of the os.remove.
The "stuff" is a handful of lines of code which don't take any
significant time to perform. There are no "try" blocks to mask a
failure in the os.remove call.
The application is multithreaded so it is possible that another thread
writes to the file between the "remove" and the "isfile", but at the
end of the failure the file is actually not on the filesystem and I
don't believe there is a way that the file could be removed again in
this scenario.
Regards,
Paul
The os.remove implementation (in posixmodule.c) uses
the DeleteFileW/A API call (depending on Unicode or not):
http://msdn2.microsoft.com/En-US/library/aa363915.aspx
There is no suggestion there that it might return before completion,
and I'd be surprised if it did.
The most likely bet would seem to be a race condition
as you suggest below. Doesn't have to be from a thread
in your program, although I assume you know best about
your own filesystem ;)
Another possibility -- I suppose, though without any
authority -- is that the .remove is silently swallowing
an error (i.e. failing at OS level without telling Python,
so no exception is raised). Much more likely is that
somewhere in that "...stuff" is something which
inadvertently recreates the file.
Don't suppose you've got some kind of flashy software
running which intercepts OS file-manipulation calls for
Virus or Archiving purposes?
TJG
As an afterthought, have you tried NTFS auditing, or
directory monitoring, such as:
to see the sequence of events on the directory? At least
that might confirm whether you are seeing
delete-create-delete or some other pattern.
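
As a rough, portable stand-in for proper directory monitoring, here is a
minimal polling sketch. The names snapshot, diff and watch are made up for
this example and are not from any library. Because it only compares two
directory listings taken some interval apart, it can miss a
delete-create-delete sequence that happens entirely between two polls, so
NTFS auditing or a change-notification API is still the better tool when
it is available.

```python
import os
import time


def snapshot(directory):
    """Map each file name in `directory` to its (mtime, size)."""
    state = {}
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        try:
            st = os.stat(path)
        except OSError:
            continue  # the file vanished between listdir and stat
        state[name] = (st.st_mtime, st.st_size)
    return state


def diff(before, after):
    """Return a list of (event, name) pairs describing what changed."""
    events = []
    for name in sorted(set(before) | set(after)):
        if name not in before:
            events.append(("created", name))
        elif name not in after:
            events.append(("deleted", name))
        elif before[name] != after[name]:
            events.append(("modified", name))
    return events


def watch(directory, interval=0.05, rounds=None):
    """Print changes to `directory`; loop forever if `rounds` is None."""
    previous = snapshot(directory)
    n = 0
    while rounds is None or n < rounds:
        time.sleep(interval)
        current = snapshot(directory)
        for event, name in diff(previous, current):
            print(time.strftime("%H:%M:%S"), event, name)
        previous = current
        n += 1
```

Running watch() on the directory in a separate process while the server
is under load should at least show whether the file reappears after the
remove.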
TJG
>
> The application is multithreaded so it is possible that another thread
> writes to the file between the "remove" and the "isfile", but at the
> end of the failure the file is actually not on the filesystem and I
> don't believe there is a way that the file could be removed again in
> this scenario.
>
This sure sounds like a thread race condition. In theory, the os.remove call
failing to actually remove the file before returning might also do this,
but it seems unlikely that a bug that blatant in a fundamental OS call could
survive very long, even in Windoze.
I'd take the time to really examine the multiple threads of work you're running
to make sure one of them isn't removing the file just as another creates it.
Better still, use a locking semaphore around the code that creates/deletes the file
to guarantee mutual exclusion.
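
A minimal sketch of that locking idea, assuming every thread in the
application goes through the same two helpers. The names replace_file,
delete_file and _file_lock are hypothetical, not taken from the
application. Note that a threading.Lock only excludes threads inside this
one process; it does nothing about other programs touching the file.

```python
import os
import threading

# One lock shared by every thread that creates or deletes the file.
_file_lock = threading.Lock()


def replace_file(path, content):
    """Delete `path` if present and rewrite it, as one critical section."""
    with _file_lock:
        if os.path.isfile(path):
            os.remove(path)
        with open(path, "wb") as f:
            f.write(content)


def delete_file(path):
    """Delete `path` if present, under the same lock as replace_file."""
    with _file_lock:
        if os.path.isfile(path):
            os.remove(path)
```

If the failure still shows up with all file access funnelled through a
lock like this, a race between the application's own threads can be ruled
out.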
> The most likely bet would seem to be a race condition
> as you suggest below. Doesn't have to be from a thread
> in your program, although I assume you know best about
> your own filesystem ;)
My first thought, after discounting the os.remove early return, was
that it has to be from a thread in our application. But,
a) it is highly unlikely due to the way tasks are scheduled
b) even if it did occur I don't see a code path that ends with the
file not there
But, until I read the next part of your note, it was still the only
credible explanation ...
>
> Don't suppose you've got some kind of flashy software
> running which intercepts OS file-manipulation calls for
> Virus or Archiving purposes?
>
... I'm wondering if this is the culprit. I now recall that the
Spambayes project saw a weird error due to Google Desktop Search where
GDS would intervene at such a low level that some file system level
"invariants" ... aren't! I don't remember the details but I think you
delete or create a file and GDS jumps in to backup / index it and you
don't have the access you thought you had moments ago.
I don't think GDS is running on this server but it is running a lot of
other enterprise monitoring apps and maybe they are doing a similar
thing.
I'm off to investigate more on this front!
Thanks,
Paul
>
> I'd take the time to really examine the multiple threads of work you're running
> to make sure one of them isn't removing the file just as another creates it.
> Better still, use a locking semaphore around the code that creates/deletes the file
> to guarantee mutual exclusion.
The locking-semaphore idea is a good one - it would remove any
possibility that this kind of race condition is causing the problem.
Thanks,
Paul
No, but the server is actually a VMWare VM and the drive is a virtual
drive.
I'm thinking that this may be significant as it may be that the VMWare
VHD driver is the "flashy software running which intercepts OS file-
manipulation calls" that Tim Golden suggested.
Paul
As I mentioned in another reply, this server is virtual and so is the
drive. I'm wondering if this might also be significant.
Paul
Yes. If another application has opened the file with FILE_SHARE_DELETE,
os.remove succeeds but the file doesn't actually disappear until the last open
handle to it is closed.
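
One commonly used way around a pending delete is to rename the file to a
throwaway name first and delete that, so the original name is free again
as soon as the rename returns. A minimal sketch follows; the helper name
remove_via_rename and the ".deleting-" prefix are made up for this
example. It assumes the other process's sharing mode also permits the
rename, which as far as I know FILE_SHARE_DELETE does, but I have not
verified that against this particular setup.

```python
import os
import uuid


def remove_via_rename(path):
    """Rename `path` to a unique sibling name, then delete that name."""
    directory = os.path.dirname(os.path.abspath(path))
    doomed = os.path.join(directory, ".deleting-" + uuid.uuid4().hex)
    os.rename(path, doomed)  # the original name is free once this returns
    os.remove(doomed)        # may remain pending while handles are open
```

With this, a following os.path.isfile(path) check looks at a name that no
longer refers to the old file, even if the renamed copy lingers for a
moment.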
Roger
Roger,
Thanks - you've hit the nail on the head. This is the final piece of
the puzzle and I've now been able to reproduce the problem!
The cause is ...
- a TSVNCache.exe (Tortoise SVN) process is scanning the file with
FILE_SHARE_DELETE access at the moment that the os.remove occurs
- this causes os.remove to return but the file is still there while
the scan completes
- next, os.path.isfile returns True and the app raises an exception
- a short while later the scan is complete and Windows deletes the
file
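
Given that sequence, one possible workaround is to wait for the file to
really disappear before checking for it. This is only a sketch:
remove_and_wait is a hypothetical helper, and the timeout and polling
interval are arbitrary values that would need tuning.

```python
import os
import time


def remove_and_wait(path, timeout=5.0, interval=0.05):
    """Remove `path`, then poll until it is gone or `timeout` expires.

    Returns True if the file has disappeared, False if it is still there
    after `timeout` seconds.
    """
    os.remove(path)
    deadline = time.monotonic() + timeout
    while os.path.exists(path):
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True
```

A False return then means some other process is still holding the file
open, which is a more useful thing to log than the original exception.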
Thanks to everyone who responded - I didn't expect to be able to get
to the bottom of this so quickly!
Thanks,
Paul