I want my script to generate a ~1KB status file several times a second.
The script may be terminated at any time but the status file must not
be corrupted.
When the script is started next time the status file will be read to
check what needs to be done.
My initial solution was a thread that writes status to a tmp file
first and then renames:
Your algorithm may not write and flush all data to disk. You need to do
additional work. You must also store the tmpfile on the same partition
(better: same directory) as the status file:
import os

with open(tmp_file, "w") as f:
    f.write(status)
    # flush buffer and write data/metadata to disk
    f.flush()
    os.fsync(f.fileno())
# now rename the file
os.rename(tmp_file, status_file)
# finally flush metadata of directory to disk
# (abspath avoids an empty directory name when status_file has no path part)
dirfd = os.open(os.path.dirname(os.path.abspath(status_file)), os.O_RDONLY)
try:
    os.fsync(dirfd)
finally:
    os.close(dirfd)
What are you keeping in this status file that needs to be saved
several times per second? Depending on what type of state you're
storing and how persistent it needs to be, there may be a better way
to store it.
On Sun, Jul 8, 2012 at 7:53 AM, Christian Heimes <li...@cheimes.de> wrote:
> Am 08.07.2012 13:29, schrieb Richard Baron Penman:
>> My initial solution was a thread that writes status to a tmp file
>> first and then renames:
> Your algorithm may not write and flush all data to disk. You need to do
> additional work. You must also store the tmpfile on the same partition
> (better: same directory) as the status file:
> with open(tmp_file, "w") as f:
>     f.write(status)
>     # flush buffer and write data/metadata to disk
>     f.flush()
>     os.fsync(f.fileno())
> # now rename the file
> os.rename(tmp_file, status_file)
> # finally flush metadata of directory to disk
> dirfd = os.open(os.path.dirname(status_file), os.O_RDONLY)
> try:
>     os.fsync(dirfd)
> finally:
>     os.close(dirfd)
On Sun, 8 Jul 2012 21:29:41 +1000, Richard Baron Penman <richar...@gmail.com> declaimed the following in gmane.comp.python.general:
>> and then on startup read from tmp_file if status_file does not exist.
>> But this seems awkward.
> It also violates your requirement -- since the "crash" could take
> place with a partial "temp file".
> I'd suggest that, rather than deleting the old status file, you
> rename IT -- and only delete it IF you successfully rename the temp
> file.
Yes, this is much better. Almost perfect. Don't forget to consult your system documentation and check whether the rename operation is atomic. (Most probably it will only be atomic if the original and the renamed file are on the same physical partition and/or mount point.)
But even if the rename operation is atomic, there is still a race condition. Your program can be terminated after the original status file has been deleted and before the temp file has been renamed. In that case the status file will be missing, even though your program has already done some work whose new status it could not write out.
Here is an algorithm that can always write and read a status (but it might not be the latest one). You can keep the last two status files.
Writer:
* create temp file, write new status info
* create lock file if needed
* flock it
* try:
    * delete older status file
    * rename temp file to new status file
* finally: unlock the lock file
Reader:
* flock the lock file
* try:
    * select the newer status file
    * read status info
* finally: unlock the lock file
It is guaranteed that you will always have a status to read, and in most cases this will be the latest one (because the writer only locks for a short time). However, it is still questionable, because your writer may be waiting for the reader to unlock, so the new status info may not be written immediately.
It would really help if you could tell us what you are trying to do that needs a status file.
> But even if the rename operation is atomic, there is still a race
> condition. Your program can be terminated after the original status file
> has been deleted, and before the temp file was renamed. In this case,
> you will be missing the status file (although your program already did
> something just it could not write out the new status).
You are contradicting yourself. Either the OS provides a fully
atomic rename or it doesn't. All POSIX-compatible operating systems provide
an atomic rename that either replaces the target atomically or fails without
losing the target. On a POSIX OS it doesn't matter whether the target exists.
You don't need locks or any other fancy stuff. You just need to make
sure that you flush the data and metadata correctly to the disk and
force a re-write of the directory inode, too. It's a standard pattern on
POSIX platforms and well documented in e.g. the maildir RFC.
You can use the same pattern on Windows, but it doesn't work as well and
doesn't guarantee file integrity, for two reasons:
1) Windows's rename isn't atomic if the right side (the target) exists.
2) Windows locks a file when a program opens it. Other programs can't
rename or overwrite the file. (You can get around the issue with some
extra work, though.)
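For readers on Python 3.3 or later, os.replace() overwrites an existing target on both POSIX and Windows, which covers the first point. A minimal sketch follows; the function name, the retry count and the delay are assumptions made for the example, and the retry loop is only a guess at the "extra work" for the case where another program briefly holds the status file open. On Windows the replacement is not guaranteed to be atomic.

```python
import os
import tempfile
import time


def replace_status(status_file, data, retries=5, delay=0.05):
    """Write data to a temp file in the same directory, then swap it in."""
    directory = os.path.dirname(os.path.abspath(status_file))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".status-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        for attempt in range(retries):
            try:
                os.replace(tmp, status_file)   # overwrites an existing target
                return
            except PermissionError:
                # Typically seen on Windows while another process has the
                # target open; wait briefly and try again.
                if attempt == retries - 1:
                    raise
                time.sleep(delay)
    finally:
        # Remove the temp file if it was never swapped in.
        if os.path.exists(tmp):
            os.unlink(tmp)
```

Creating the temp file with mkstemp in the status file's own directory keeps both names on the same file system, which the rename approach requires.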
> What are you keeping in this status file that needs to be saved
> several times per second? Depending on what type of state you're
> storing and how persistent it needs to be, there may be a better way
> to store it.
> Michael
This is for a threaded web crawler. I want to cache which URLs are
currently in the queue, so that if it is terminated the crawler can
continue next time from the same point.
> > and then on startup read from tmp_file if status_file does not exist.
> > But this seems awkward.
> It also violates your requirement -- since the "crash" could take
> place with a partial "temp file".
Can you explain why?
My thinking was if crash took place when writing the temp file this
would not matter because the status file would still exist and be read
from. The temp file would only be renamed when fully written.
On Sun, 08 Jul 2012 22:57:56 +0200, Laszlo Nagy wrote:
> Yes, this is much better. Almost perfect. Don't forget to consult your
> system documentation, and check if the rename operation is atomic or not.
> (Most probably it will only be atomic if the original and the renamed file
> are on the same physical partition and/or mount point).
On Unix, rename() is always atomic, and requires that source and
destination are on the same partition (if you want to "move" a file across
partitions, you have to copy it then delete the original).
> But even if the rename operation is atomic, there is still a race
> condition. Your program can be terminated after the original status file
> has been deleted, and before the temp file was renamed. In this case, you
> will be missing the status file (although your program already did
> something just it could not write out the new status).
In the event of abnormal termination, losing some data is to be expected.
The idea is to only lose the most recent data while keeping the old copy,
rather than losing everything. Writing to a temp file then rename()ing
achieves that.
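On the reading side, that idea might look like the sketch below at startup. The function name and the choice to return None on a first run are assumptions for illustration; the point is only that a leftover temp file is never trusted, because it may be a partial write.

```python
import os


def load_status(status_file, tmp_file):
    """Return the last completely written status, or None if there is none."""
    # A leftover temp file means the last write never finished; discard it.
    try:
        os.unlink(tmp_file)
    except FileNotFoundError:
        pass
    try:
        with open(status_file) as f:
            return f.read()
    except FileNotFoundError:
        return None  # first run: nothing to resume from
```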
> Is there a better way? Or do I need to use a database?
Using a database would seem to meet a lot of your needs. Don't forget that Python comes with a sqlite database engine included, so it shouldn't take you more than a few lines of code to open the database once and then write out your status every few seconds.
import sqlite3
con = sqlite3.connect('status.db')
...
with con:
    cur = con.cursor()
    cur.execute('UPDATE ...', ...)
and similar code to restore the status or create required tables on startup.
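For the crawler queue mentioned earlier in the thread, a fuller sketch might look like this. The table name, column and helper functions are invented for the example and are not from the original post. Note that by default a sqlite3 connection may only be used from the thread that created it, so a threaded crawler would need one connection per thread or a single writer thread.

```python
import sqlite3


def open_queue(path):
    """Open the database and create the queue table on first use."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE IF NOT EXISTS queue (url TEXT PRIMARY KEY)")
    con.commit()
    return con


def enqueue(con, urls):
    # "with con" commits on success and rolls back if an exception is raised.
    with con:
        con.executemany(
            "INSERT OR IGNORE INTO queue (url) VALUES (?)",
            [(u,) for u in urls],
        )


def mark_done(con, url):
    with con:
        con.execute("DELETE FROM queue WHERE url = ?", (url,))


def pending(con):
    """Return the URLs still waiting to be crawled, in insertion order."""
    return [row[0] for row in con.execute("SELECT url FROM queue ORDER BY rowid")]
```

On restart, calling open_queue() and then pending() gives the URLs that were queued but not yet marked done when the crawler stopped.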
> You are contradicting yourself. Either the OS is providing a fully
> atomic rename or it doesn't. All POSIX compatible OS provide an atomic
> rename functionality that renames the file atomically or fails without
> losing the target side. On POSIX OS it doesn't matter if the target exists.
Rename on some file system types (particularly NFS) may not be atomic.
> You don't need locks or any other fancy stuff. You just need to make
> sure that you flush the data and metadata correctly to the disk and
> force a re-write of the directory inode, too. It's a standard pattern on
> POSIX platforms and well documented in e.g. the maildir RFC.
> You can use the same pattern on Windows but it doesn't work as good.
That's because you're using the wrong approach. See how to use
ReplaceFile under Win32:
Renaming files is the wrong way to synchronize a
crawler. Use a database that has ACID properties, such as
SQLite. Far fewer I/O operations are required for small updates.
It's not the 1980s any more.
I use a MySQL database to synchronize multiple processes
which crawl web sites. The tables of past activity are InnoDB
tables, which support transactions. The table of what's going
on right now is a MEMORY table. If the database crashes, the
past activity is recovered cleanly, the MEMORY table comes back
empty, and all the crawler processes lose their database
connections, abort, and are restarted. This allows multiple
servers to coordinate through one database.
Please consider batching this data and doing larger writes. Thrashing
the hard drive is not a good plan for performance or hardware
longevity. For example, crawl an entire FQDN and then write out the
results in one operation. If your job fails in the middle and you
have to start that FQDN over, no big deal. If that's too big of a
chunk for your purposes, perhaps break each FQDN up into top-level
directories and crawl each of those in one operation before writing to
disk.
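One way to batch, sketched under assumptions: the class name, thresholds and the save callback below are hypothetical, and any of the writing approaches discussed in this thread could be plugged in as the callback. State is only written once enough changes have accumulated or enough time has passed.

```python
import time


class BatchedSaver:
    """Call save(state) only every max_pending changes or max_age seconds."""

    def __init__(self, save, max_pending=100, max_age=5.0, clock=time.monotonic):
        self._save = save
        self._max_pending = max_pending
        self._max_age = max_age
        self._clock = clock
        self._pending = 0
        self._last_flush = clock()

    def changed(self, state):
        # Record one change and write only if a threshold has been reached.
        self._pending += 1
        too_many = self._pending >= self._max_pending
        too_old = self._clock() - self._last_flush >= self._max_age
        if too_many or too_old:
            self.flush(state)

    def flush(self, state):
        # Write unconditionally, e.g. on clean shutdown.
        self._save(state)
        self._pending = 0
        self._last_flush = self._clock()
```

The trade-off is the one described above: after a crash, up to max_pending changes or max_age seconds of work may have to be redone.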
There are existing solutions for managing job queues, so you can
choose what you like. If you're unfamiliar, maybe start by looking at
celery.
On Mon, Jul 9, 2012 at 1:52 AM, Plumo <richar...@gmail.com> wrote:
>> What are you keeping in this status file that needs to be saved
>> several times per second? Depending on what type of state you're
>> storing and how persistent it needs to be, there may be a better way
>> to store it.
>> Michael
> This is for a threaded web crawler. I want to cache what URL's are
> currently in the queue so if terminated the crawler can continue next
> time from the same point.
"The ReplaceFile function combines several steps within a single
function. An application can call ReplaceFile instead of calling
separate functions to save the data to a new file, rename the original
file using a temporary name, rename the new file to have the same name
as the original file, and delete the original file."
About the best you can get in Windows, I think, is MoveFileTransacted,
but you need to be running Vista or later:
> You are contradicting yourself. Either the OS is providing a fully
> atomic rename or it doesn't. All POSIX compatible OS provide an atomic
> rename functionality that renames the file atomically or fails without
> losing the target side. On POSIX OS it doesn't matter if the target exists.
This is not a contradiction. Although the rename operation is atomic, the whole "change status" process is not, because it consists of two operations: #1 delete the old status file and #2 rename the new status file. Because there are two operations, there is still a race condition. I see no contradiction here.
> You don't need locks or any other fancy stuff. You just need to make
> sure that you flush the data and metadata correctly to the disk and
> force a re-write of the directory inode, too. It's a standard pattern on
> POSIX platforms and well documented in e.g. the maildir RFC.
That is not entirely true. We are talking about two processes: one is reading a file, the other is writing it. They can run at the same time, so forcibly flushing the disk cache won't help.
> Renaming files is the wrong way to synchronize a
> crawler. Use a database that has ACID properties, such as
> SQLite. Far fewer I/O operations are required for small updates.
> It's not the 1980s any more.
I agree with this approach. However, the OP specifically asked about "how to update status file".
> This is not a contradiction. Although the rename operation is atomic,
> the whole "change status" process is not. It is because there are two
> operations: #1 delete old status file and #2. rename the new status
> file. And because there are two operations, there is still a race
> condition. I see no contradiction here.
Sorry, but you are wrong. It's just one operation that boils down to
"point name to a different inode". After the rename op the file name
either points to a different inode or, in case of an error, still to the
old inode. The OS guarantees that all processes see either the first or
the second state (in other words: atomic).
POSIX has no operation that actually deletes a file. It just has an
unlink() syscall that removes an associated name from an inode. As soon
as an inode has no names and is not referenced by a file descriptor, the
file content and inode are removed by the operating system. rename() is
more like a link() followed by an unlink() wrapped in a system-wide
global lock.
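The "point name to a different inode" behaviour can be observed directly on a POSIX system. The following is a small demonstration, not part of the original message; the function and file names are made up, and it will not behave this way on Windows, where os.rename refuses to overwrite an existing file.

```python
import os
import tempfile


def rename_swaps_inode():
    """Return (old inode, new inode, inode after rename, temp name still exists)."""
    with tempfile.TemporaryDirectory() as d:
        status = os.path.join(d, "status")
        tmp = os.path.join(d, "status.tmp")
        with open(status, "w") as f:
            f.write("old")
        with open(tmp, "w") as f:
            f.write("new")
        old_inode = os.stat(status).st_ino
        new_inode = os.stat(tmp).st_ino
        # One call: the name "status" now refers to the temp file's inode,
        # and the temp name is gone. No separate unlink is needed.
        os.rename(tmp, status)
        inode_after = os.stat(status).st_ino
        return old_inode, new_inode, inode_after, os.path.exists(tmp)
```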
> It is not entirely true. We are talking about two processes. One is
> reading a file, another one is writing it. They can run at the same
> time, so flushing disk cache forcedly won't help.
You need to flush the data to disk as well as the metadata of the file
and its directory in order to survive a system crash. The close()
syscall already makes sure that all data is flushed into the IO layer of
the operating system.
With POSIX semantics the reading process will either see the full
content before the rename op or the full content after the rename op.
The writing process can replace the name (rename op) while the reading
process reads the status file because its file descriptor still points
to the old status file.
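A short demonstration of that last point, again assuming a POSIX system and using made-up names: a reader that opened the status file before the rename still reads the complete old content, while anyone opening the name afterwards gets the complete new content.

```python
import os
import tempfile


def reader_keeps_old_file():
    """Return (content seen by the already-open reader, content seen by a new reader)."""
    with tempfile.TemporaryDirectory() as d:
        status = os.path.join(d, "status")
        tmp = os.path.join(d, "status.tmp")
        with open(status, "w") as f:
            f.write("old status")
        reader = open(status)           # the reader opens before the rename
        try:
            with open(tmp, "w") as f:
                f.write("new status")
            os.rename(tmp, status)      # the writer replaces the name
            seen_by_open_reader = reader.read()
        finally:
            reader.close()
        with open(status) as f:
            seen_by_new_reader = f.read()
        return seen_by_open_reader, seen_by_new_reader
```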
>> You are contradicting yourself. Either the OS is providing a fully
>> atomic rename or it doesn't. All POSIX compatible OS provide an atomic
>> rename functionality that renames the file atomically or fails without
>> losing the target side. On POSIX OS it doesn't matter if the target
>> exists.
> This is not a contradiction. Although the rename operation is atomic,
> the whole "change status" process is not. It is because there are two
> operations: #1 delete old status file and #2. rename the new status
> file. And because there are two operations, there is still a race
> condition. I see no contradiction here.
On POSIX systems, you can avoid the race condition. The trick is to
skip step #1. The rename will implicitly delete the old file, and
it will still be atomic. The whole process now consists of a single
step, so the whole process is now atomic.
>> You don't need locks or any other fancy stuff. You just need to make
>> sure that you flush the data and metadata correctly to the disk and
>> force a re-write of the directory inode, too. It's a standard pattern on
>> POSIX platforms and well documented in e.g. the maildir RFC.
> It is not entirely true. We are talking about two processes. One is
> reading a file, another one is writing it. They can run at the same
> time, so flushing disk cache forcedly won't help.
On POSIX systems, it will work, and be atomic, even if one process is
reading the old status file while another process is writing the new
one. The old file will be atomically removed from the directory by
the rename operation; it will continue to exist on the hard drive, so
that the reading process can continue reading it. The old file will
be deleted when the reader closes it. Or, if the system crashes before
the old file is closed, it will be deleted when the system is restarted.
> This is not a contradiction. Although the rename operation is atomic,
> the whole "change status" process is not. It is because there are two
> operations: #1 delete old status file and #2. rename the new status
> file. And because there are two operations, there is still a race
> condition. I see no contradiction here.
> Sorry, but you are wrong. It's just one operation that boils down to
> "point name to a different inode". After the rename op the file name
> either points to a different inode or still to the old name in case of
> an error. The OS guarantees that all processes either see the first or
> second state (in other words: atomic).
> POSIX has no operation that actually deletes a file. It just has an
> unlink() syscall that removes an associated name from an inode. As soon
> as an inode has no names and is not referenced by a file descriptor, the
> file content and inode is removed by the operating system. rename() is
> more like a link() followed by an unlink() wrapped in a system wide
> global lock.
Then please help me understand this.
"Good" case:
process #1: unlink(old status file)
process #1: rename(new status file)
process #2: open(new status file)
process #2: read(new status file)
"Bad" case:
process #1: unlink(old status file)
process #2: open(???) -- there is no file on disk here, this system call returns with an error!
process #1: rename(new status file)
If it were possible to rename + unlink in one step, then it would be okay. Can you please explain what I am missing?
>> This is not a contradiction. Although the rename operation is atomic,
>> the whole "change status" process is not. It is because there are two
>> operations: #1 delete old status file and #2. rename the new status
>> file. And because there are two operations, there is still a race
>> condition. I see no contradiction here.
> On Posix systems, you can avoid the race condition. The trick is to
> skip step #1. The rename will implicitly delete the old file, and
> it will still be atomic. The whole process now consists of a single
> step, so the whole process is now atomic.
Well, I didn't know that this would work. At least it does not work on Windows 7 (which should be POSIX compatible?)
>>> f = open("test.txt","wb+")
>>> f.close()
>>> f2 = open("test2.txt","wb+")
>>> f2.close()
>>> import os
>>> os.rename("test2.txt","test.txt")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
WindowsError: [Error 183] File already exists
>>>
I have also tried this on FreeBSD and it worked.
Now, let's go back to the original question:
>>>This works well on Linux but Windows raises an error when status_file already exists.
It SEEMS that the op wanted a solution for Windows....
> Well, I didn't know that this is going to work. At least it does not
> work on Windows 7 (which should be POSIX compatible?)
Nope, Windows's file system layer is not POSIX compatible. For example,
you can't remove or replace a file while it is opened by a process.
Lots of small things work slightly differently on Windows or not at all.
On Jul 12, 2:39 pm, Christian Heimes <li...@cheimes.de> wrote:
> Windows's file system layer is not POSIX compatible. For example
> you can't remove or replace a file while it is opened by a process.
Sounds like a reasonable fail-safe to me. Not much unlike a car
ignition that will not allow starting the engine if the transmission
is in any *other* gear besides "park" or "neutral", OR a governor (be
it mechanical or electrical) that will not allow the engine RPMs to
exceed a maximum safe limit, OR even, ABS systems which "pulse" the
brakes to prevent overzealous operators from losing road-to-tire
traction when decelerating the vehicle.
You could say: "Hey, if someone is dumb enough to shoot themselves in
the foot then let them"... however, sometimes fail-safes not only save
the dummy from a life of limps, they also prevent catastrophic
"collateral damage" to the rest of us.