I am traversing a large set of directories using

for root, dirs, files in os.walk(basedir):
    run program
Because the directory set is huge, the traversal takes days to
complete.
Sometimes the program crashes because of a programming error.
As each directory is processed, its name is written to a file.
I want to be able to restart the walk from the directory where it
crashed.
Is this possible?
> I am traversing a large set of directories using
>
> for root, dirs, files in os.walk(basedir):
>     run program
>
> Because the directory set is huge, the traversal takes days to
> complete.
> Sometimes the program crashes because of a programming error.
> As each directory is processed, its name is written to a file.
What, a proper, honest-to-goodness core dump?
Or do you mean an exception?
> I want to be able to restart the walk from the directory where it
> crashed.
>
> Is this possible?
Quick and dirty with no error-checking:
# Untested
last_visited = open("last_visited.txt", 'r').read()
for root, dirs, files in os.walk(last_visited or basedir):
    open("last_visited.txt", 'w').write(root)
    run program
--
Steven
If the 'dirs' list were guaranteed to be sorted, you could remove, at
each level, every directory already traversed. But it's not :(
Perhaps a better approach would be to collect, once, all the
directories to be processed and write them to a text file -- these are
the pending directories. Then read from the pending file and process
every directory in it. If the process aborts for any reason, manually
delete the lines already processed and restart.
If you use a database instead of a text file, and mark entries as "done"
after processing, you can avoid that last manual step and the whole
process may be kept running automatically. In some cases you may want to
choose the starting point at random.
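A rough sketch of the database variant, untested; the file name
"pending.db" and the process_dir() call are made-up placeholders for
the real work:

import os
import sqlite3

basedir = '.'  # wherever the walk starts

db = sqlite3.connect("pending.db")
db.execute("CREATE TABLE IF NOT EXISTS pending"
           " (path TEXT PRIMARY KEY, done INTEGER DEFAULT 0)")

# First run: collect every directory into the pending table.
# INSERT OR IGNORE makes it safe to re-run the collection too.
for root, dirs, files in os.walk(basedir):
    db.execute("INSERT OR IGNORE INTO pending (path) VALUES (?)",
               (root,))
db.commit()

# Every run: process whatever is not yet marked done.
for (path,) in db.execute(
        "SELECT path FROM pending WHERE done = 0").fetchall():
    process_dir(path)  # placeholder for the real work
    db.execute("UPDATE pending SET done = 1 WHERE path = ?", (path,))
    db.commit()        # commit per directory, so a crash loses nothing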
--
Gabriel Genellina
Wouldn't this only walk the directory the exception occurred in, and
not the remaining unwalked dirs from basedir?
Something like this should work:
import os
basedir = '.'
walked = open('walked.txt', 'r').read().split()
unwalked = ((r, d, f) for r, d, f in os.walk(basedir) if r not in walked)
for root, dirs, files in unwalked:
    # do something
    print root
    walked.append(root)
    open('walked.txt', 'w').write('\n'.join(walked))
I would write my own walker which sorts the directory entries it finds
before walking them, and which can skip entries until it reaches the
desired starting path: e.g. if I want to start at "/foo/bar", skip the
entries in the root directory whose names sort before "foo", and the
entries in the subdirectory "/foo" whose names sort before "bar".
I assume it's the operation that you are doing on each file that is
expensive, not the walk itself.
If that's the case, then you might be able to get away with just
leaving some kind of breadcrumbs whenever you've successfully processed
a directory or a file, so you can quickly short-circuit entire
directories or files on the next run, without having to implement any
kind of complicated start-where-I-left-off algorithm.
The breadcrumbs could be hidden files in the file system, or an
easily-indexable list of files that you persist, etc.
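For the hidden-file flavour, something like this (untested; the marker
name and process_dir() are made up):

import os

basedir = '.'          # wherever the walk starts
MARKER = ".processed"  # hypothetical breadcrumb file name

for root, dirs, files in os.walk(basedir):
    if os.path.exists(os.path.join(root, MARKER)):
        continue              # already done on an earlier run
    process_dir(root, files)  # placeholder for the expensive work
    # Drop the breadcrumb only after the directory succeeded.
    open(os.path.join(root, MARKER), 'w').close()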
What are you doing that takes so long?
Also, I can understand why the operations on the files themselves
might crash, but can't you catch an exception and keep on chugging?
Another option, if you do not do some kind of pruning on the fly, is
to persist the list of files that you need to process up front to a
file, or a database, and persist the index of the last successfully
processed file, so that you can restart as needed from where you left
off.
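A rough sketch of that, untested; "files.txt", "index.txt" and
process_file() are invented names:

import os

basedir = '.'

# One-time pass: snapshot every file to process, in walk order.
if not os.path.exists("files.txt"):
    snapshot = open("files.txt", 'w')
    for root, dirs, files in os.walk(basedir):
        for name in files:
            snapshot.write(os.path.join(root, name) + '\n')
    snapshot.close()

paths = [line.rstrip('\n') for line in open("files.txt")]
try:
    start = int(open("index.txt").read())  # last checkpoint
except IOError:
    start = 0                              # first run

for i in range(start, len(paths)):
    process_file(paths[i])  # placeholder for the real work
    # Checkpoint after every file so a crash restarts right here.
    open("index.txt", 'w').write(str(i + 1))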
> Steven D'Aprano <ste...@REMOVE.THIS.cybersource.com.au> wrote:
>> # Untested
>> last_visited = open("last_visited.txt", 'r').read()
>> for root, dirs, files in os.walk(last_visited or basedir):
>>     open("last_visited.txt", 'w').write(root)
>>     run program
>
> Wouldn't this only walk the directory the exception occurred in, and
> not the remaining unwalked dirs from basedir?
Only if you have some sort of branching file hierarchy with multiple sub-
directories in the one directory, instead of a nice simple linear chain
of directories a/b/c/d/.../z as nature intended.
*wink*
Yes, good catch. I said it was untested. You might be able to save the
parent of the current directory, and restart from there instead.
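That is, change the save line in the earlier snippet to something like:

    open("last_visited.txt", 'w').write(os.path.dirname(root))

os.path.dirname gives the parent, so the restart re-walks the whole
parent subtree rather than a single leaf directory.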
--
Steven
Unless you're indexing a read-only device (whether hardware
read-only like a CD, or permission-wise read-only like a network
share or a non-priv user walking system directories)...
> Also, I can understand why the operations on the files themselves
> might crash, but can't you catch an exception and keep on chugging?
I also wondered about this one, perhaps logging the directory in which
the exception happened, to revisit later. :)
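Along these lines (untested; "failed.txt" and process_dir() are
placeholders):

import os

basedir = '.'
failed = open("failed.txt", 'a')  # directories to revisit later

for root, dirs, files in os.walk(basedir):
    try:
        process_dir(root, files)   # placeholder for the real work
    except Exception:
        failed.write(root + '\n')  # log it and keep on chugging
        failed.flush()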
-tkc
Something like this (symbolically):
import os, string
lastrun = map(string.strip, logfile.readlines())
newlog = ... open logfile in append mode ...
for root, dirs, files in os.walk(basedir):
    if root not in lastrun:
        run program
        newlog.write(root + '\n')
        newlog.flush()
--
Piet van Oostrum <pi...@vanoostrum.org>
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]
Now Fair Trade home wares at http://www.zylja.com