[Python-Dev] Planning on removing cache invalidation for file finders

2 views
Skip to first unread message

Brett Cannon

unread,
Mar 1, 2013, 9:31:02 PM3/1/13
to python-dev
As of right now, importlib keeps a cache of what is in a directory for its file finder instances. It uses mtime on the directory to try and detect when it has changed to know when to refresh the cache. But thanks to mtime granularities of up to a second, it is only a heuristic that isn't totally reliable, especially across filesystems on different OSs.

This is why importlib.invalidate_caches() came into being. If you look in our test suite you will see it peppered around where a module is created on the fly to make sure that mtime granularity isn't a problem. But it somewhat negates the point of the mtime heuristic when you have to make this function call regardless to avoid potential race conditions.

http://bugs.python.org/issue17330 originally suggested trying to add another heuristic to determine when to invalidate the cache. But even with the suggestion it's still iffy and in no way foolproof.

So the current idea is to just drop the invalidation heuristic and go full-blown reliance on calls to importlib.invalidate_caches() as necessary. This makes code more filesystem-agnostic and protects people from hard-to-detect errors when importlib only occasionally doesn't detect new modules (I know it drove me nuts for a while when the buildbots kept failing sporadically and only on certain OSs).

I would have just made the change but Antoine wanted it brought up here first to make sure that no one was heavily relying on the current setup. So if you have a good, legitimate reason to keep the reliance on mtime for cache invalidation please speak up. But since the common case will never care about any of this (how many people generate modules on the fly to being with?) and to be totally portable you need to call importlib.invalidate_caches() anyway, it's going to take a lot to convince me to keep it.

Nick Coghlan

unread,
Mar 2, 2013, 10:36:04 AM3/2/13
to Brett Cannon, python-dev
I think you should keep it. A long running service that periodically
scans the importers for plugins doesn't care if modules take a few
extra seconds to show up, it just wants to see them eventually.
Installers (or filesystem copy or move operations!) have no way to
inform arbitrary processes that new files have been added.

It's that case where the process that added the modules is separate
from the process scanning for them, and the communication is one way,
where the heuristic is important. Explicit invalidation only works
when they're the *same* process, or when they're closely coupled so
the adding process can tell the scanning process to invalidate the
caches (our test suite is mostly the former although there are a
couple of cases of the latter).

I have no problem with documenting invalidate_caches() as explicitly
required for correctness when writing new modules which are to be read
back by the same process, or when there is a feedback path between two
processes that may be confusing if the cache invalidation is delayed.
The implicit invalidation is only needed to pick up modules written by
*another* process.

In addition, it may be appropriate for importlib to offer a
"write_module" method that accepts (module name, target path,
contents). This would:

1. Allow in-process caches to be invalidated implicitly and
selectively when new modules are created
2. Allow importers to abstract write access in addition to read access
3. Allow the import system to complain at time of writing if the
desired module name and target path don't actually match given the
current import system state.

Cheers,
Nick.

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia
_______________________________________________
Python-Dev mailing list
Pytho...@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/dev-python%2Bgarchive-30976%40googlegroups.com

Nick Coghlan

unread,
Mar 2, 2013, 10:37:56 AM3/2/13
to Brett Cannon, python-dev
On Sun, Mar 3, 2013 at 1:36 AM, Nick Coghlan <ncog...@gmail.com> wrote:
> It's that case where the process that added the modules is separate
> from the process scanning for them, and the communication is one way,
> where the heuristic is important. Explicit invalidation only works
> when they're the *same* process, or when they're closely coupled so
> the adding process can tell the scanning process to invalidate the
> caches (our test suite is mostly the former although there are a
> couple of cases of the latter).

s/are/may be/ (I don't actually remember if there are or not off the
top of my head)

Brett Cannon

unread,
Mar 2, 2013, 11:16:28 AM3/2/13
to Nick Coghlan, python-dev
But if they are doing the scan they can also easily invalidate the caches before performing the scan.
 

It's that case where the process that added the modules is separate
from the process scanning for them, and the communication is one way,
where the heuristic is important. Explicit invalidation only works
when they're the *same* process, or when they're closely coupled so
the adding process can tell the scanning process to invalidate the
caches (our test suite is mostly the former although there are a
couple of cases of the latter).

That's only true if the scanning process has no idea that another process is adding modules. If there is an expectation then it doesn't matter who added the file as you just assume cache invalidation is necessary.
 

I have no problem with documenting invalidate_caches() as explicitly
required for correctness when writing new modules which are to be read
back by the same process, or when there is a feedback path between two
processes that may be confusing if the cache invalidation is delayed.

Already documented as such.
 
The implicit invalidation is only needed to pick up modules written by
*another* process.

In addition, it may be appropriate for importlib to offer a
"write_module" method that accepts (module name, target path,
contents). This would:

1. Allow in-process caches to be invalidated implicitly and
selectively when new modules are created

I don't think that's necessary. If people don't want to blindly clear all caches for a file they can write the file, search the keys in sys.path_importer_cache for the longest prefix for the newly created file, and then call the invalidate_cache() method on that explicit finder.

2. Allow importers to abstract write access in addition to read access

That's heading down the virtual filesystem path which I don't want to go down any farther than I have to. The API is big enough as it is and the more entangled it gets the harder it is to change/fix, especially with the finders having a nice, small API compared to the loaders.
 
3. Allow the import system to complain at time of writing if the
desired module name and target path don't actually match given the
current import system state.

I think that's more checking than necessary for a use case that isn't that common. 

Nick Coghlan

unread,
Mar 2, 2013, 12:24:01 PM3/2/13
to Brett Cannon, python-dev
On Sun, Mar 3, 2013 at 2:16 AM, Brett Cannon <br...@python.org> wrote:
> On Sat, Mar 2, 2013 at 10:36 AM, Nick Coghlan <ncog...@gmail.com> wrote:
>> I think you should keep it. A long running service that periodically
>> scans the importers for plugins doesn't care if modules take a few
>> extra seconds to show up, it just wants to see them eventually.
>> Installers (or filesystem copy or move operations!) have no way to
>> inform arbitrary processes that new files have been added.
>
>
> But if they are doing the scan they can also easily invalidate the caches
> before performing the scan.

"I just upgraded to Python 3.4, and now my server process isn't see new plugins"

That's a major backwards compatibility breach, and hence clearly
unacceptable in my view. Even the relatively *minor* compatibility
breach of becoming dependent on the filesystem timestamp resolution
for picking up added modules, creating a race condition between
writing the file and reading it back through the import system, has
caused people grief. When you're in a hole, the first thing to do is
to *stop digging*.

You can deprecate the heuristic if you want (and can figure out how),
but a definite -1 on removing it without at least the usual
deprecation period for backwards incompatible changes.

It may also be worth tweaking the wording of the upgrade note in the
What's New to mention the need to always invalidate the cache before
scanning for new modules if you want to reliably pick up new modules
created since the application started (at the moment the note really
only mentions it as something to do after *creating* a new module).

Brett Cannon

unread,
Mar 2, 2013, 6:38:15 PM3/2/13
to Nick Coghlan, python-dev
On Sat, Mar 2, 2013 at 12:24 PM, Nick Coghlan <ncog...@gmail.com> wrote:
On Sun, Mar 3, 2013 at 2:16 AM, Brett Cannon <br...@python.org> wrote:
> On Sat, Mar 2, 2013 at 10:36 AM, Nick Coghlan <ncog...@gmail.com> wrote:
>> I think you should keep it. A long running service that periodically
>> scans the importers for plugins doesn't care if modules take a few
>> extra seconds to show up, it just wants to see them eventually.
>> Installers (or filesystem copy or move operations!) have no way to
>> inform arbitrary processes that new files have been added.
>
>
> But if they are doing the scan they can also easily invalidate the caches
> before performing the scan.

"I just upgraded to Python 3.4, and now my server process isn't see new plugins"

That's a major backwards compatibility breach, and hence clearly
unacceptable in my view. Even the relatively *minor* compatibility
breach of becoming dependent on the filesystem timestamp resolution
for picking up added modules, creating a race condition between
writing the file and reading it back through the import system, has
caused people grief. When you're in a hole, the first thing to do is
to *stop digging*.

You can deprecate the heuristic if you want (and can figure out how),
but a definite -1 on removing it without at least the usual
deprecation period for backwards incompatible changes.

That part is easy: ImportWarning still exists so simply continuing to check the directory and noticing when a difference exists that affects subsequent imports and then raising the warning will handle that.
 

It may also be worth tweaking the wording of the upgrade note in the
What's New to mention the need to always invalidate the cache before
scanning for new modules if you want to reliably pick up new modules
created since the application started (at the moment the note really
only mentions it as something to do after *creating* a new module).


As of right now with the check that's all that is needed, but yes, if the deprecation does occur it would be worth changing it.

Erik Bray

unread,
Mar 2, 2013, 8:09:49 PM3/2/13
to Nick Coghlan, python-dev
On Sat, Mar 2, 2013 at 10:36 AM, Nick Coghlan <ncog...@gmail.com> wrote:
> In addition, it may be appropriate for importlib to offer a
> "write_module" method that accepts (module name, target path,
> contents). This would:
>
> 1. Allow in-process caches to be invalidated implicitly and
> selectively when new modules are created
> 2. Allow importers to abstract write access in addition to read access
> 3. Allow the import system to complain at time of writing if the
> desired module name and target path don't actually match given the
> current import system state.

+1 to write_module(). This would be useful in general, I think.
Though perhaps the best solution to the original problem is to more
forcefully document: "If you're writing a module and expect to be able
to import it immediately within the same process, it's necessary to
manually invalidate the directory cache."

I might go a little further and suggest adding a function to only
invalidate the cache for the relevant directory (the proposed
write_module() function could do this). This can already be done with
something like:

dirname = os.path.dirname(module_filename)
sys.path_importer_cache[dirname].invalidate_caches()

But that's a bit onerous considering that this wasn't even necessary
before 3.3. There should be an easier way to do this, as there's no
sense in invalidating all the directory caches if one is only writing
new modules to a specific directory or directories.

Erik

Antoine Pitrou

unread,
Mar 2, 2013, 8:16:30 PM3/2/13
to pytho...@python.org
On Sat, 2 Mar 2013 11:16:28 -0500
Brett Cannon <br...@python.org> wrote:
> > In addition, it may be appropriate for importlib to offer a
> > "write_module" method that accepts (module name, target path,
> > contents). This would:
> >
> > 1. Allow in-process caches to be invalidated implicitly and
> > selectively when new modules are created
>
> I don't think that's necessary. If people don't want to blindly clear all
> caches for a file they can write the file, search the keys in
> sys.path_importer_cache for the longest prefix for the newly created file,
> and then call the invalidate_cache() method on that explicit finder.

That's too complicated for non-import experts IMHO.

Regards

Antoine.

Antoine Pitrou

unread,
Mar 3, 2013, 6:29:48 AM3/3/13
to pytho...@python.org
On Sat, 2 Mar 2013 18:38:15 -0500
Brett Cannon <br...@python.org> wrote:
> >
> > You can deprecate the heuristic if you want (and can figure out how),
> > but a definite -1 on removing it without at least the usual
> > deprecation period for backwards incompatible changes.
> >
>
> That part is easy: ImportWarning still exists so simply continuing to check
> the directory and noticing when a difference exists that affects subsequent
> imports and then raising the warning will handle that.

Won't that raise spurious ImportWarnings for people who don't actually
care about that?

Regards

Antoine.

Brett Cannon

unread,
Mar 3, 2013, 12:08:00 PM3/3/13
to Antoine Pitrou, python-dev
On Sat, Mar 2, 2013 at 8:16 PM, Antoine Pitrou <soli...@pitrou.net> wrote:
On Sat, 2 Mar 2013 11:16:28 -0500
Brett Cannon <br...@python.org> wrote:
> > In addition, it may be appropriate for importlib to offer a
> > "write_module" method that accepts (module name, target path,
> > contents). This would:
> >
> > 1. Allow in-process caches to be invalidated implicitly and
> > selectively when new modules are created
>
> I don't think that's necessary. If people don't want to blindly clear all
> caches for a file they can write the file, search the keys in
> sys.path_importer_cache for the longest prefix for the newly created file,
> and then call the invalidate_cache() method on that explicit finder.

That's too complicated for non-import experts IMHO.

Which is why they can just call importlib.import_module(). 

Brett Cannon

unread,
Mar 3, 2013, 12:12:27 PM3/3/13
to Antoine Pitrou, python-dev
On Sun, Mar 3, 2013 at 6:29 AM, Antoine Pitrou <soli...@pitrou.net> wrote:
On Sat, 2 Mar 2013 18:38:15 -0500
Brett Cannon <br...@python.org> wrote:
> >
> > You can deprecate the heuristic if you want (and can figure out how),
> > but a definite -1 on removing it without at least the usual
> > deprecation period for backwards incompatible changes.
> >
>
> That part is easy: ImportWarning still exists so simply continuing to check
> the directory and noticing when a difference exists that affects subsequent
> imports and then raising the warning will handle that.

Won't that raise spurious ImportWarnings for people who don't actually
care about that?

It shouldn't. If the implementation I have in my head works (set of original files, another set of what mtime says is there to know what would not have been found w/o the cache invalidation), then it will only come up when someone would break in Python 3.5 if the cache invalidation is removed. Plus warnings are off by default as it is.
 

Regards

Antoine.


_______________________________________________
Python-Dev mailing list
Pytho...@python.org
http://mail.python.org/mailman/listinfo/python-dev

Brett Cannon

unread,
Mar 3, 2013, 12:31:56 PM3/3/13
to Antoine Pitrou, python-dev
That obviously should have said importlib.invalidate_caches(). =)

But how about this as a compromise over introducing write_module(): invalidate_caches() can take a path for something to specifically invalidate. The path can then be passed to the invalidate_caches() on sys.meta_path. In the case of PathFinder it would take that path, try to find the directory in sys.path_importer_cache, and then invalidate the most specific finder for that path (if there is one that has any directory prefix match).

Lots of little details to specify (e.g. absolute path forced anywhere in case a relative path is passed in by sys.path is all absolute paths? How do we know something is a file if it has not been written yet?), but this would prevent importlib from subsuming file writing specifically for source files and minimize performance overhead of invalidating all caches for a single file.

PJ Eby

unread,
Mar 4, 2013, 3:52:14 PM3/4/13
to Brett Cannon, Antoine Pitrou, python-dev
On Sun, Mar 3, 2013 at 12:31 PM, Brett Cannon <br...@python.org> wrote:
> But how about this as a compromise over introducing write_module():
> invalidate_caches() can take a path for something to specifically
> invalidate. The path can then be passed to the invalidate_caches() on
> sys.meta_path. In the case of PathFinder it would take that path, try to
> find the directory in sys.path_importer_cache, and then invalidate the most
> specific finder for that path (if there is one that has any directory prefix
> match).
>
> Lots of little details to specify (e.g. absolute path forced anywhere in
> case a relative path is passed in by sys.path is all absolute paths? How do
> we know something is a file if it has not been written yet?), but this would
> prevent importlib from subsuming file writing specifically for source files
> and minimize performance overhead of invalidating all caches for a single
> file.

ISTR that when we were first discussing caching, I'd proposed a
TTL-based workaround for the timestamp granularity problem, and it was
mooted because somebody already proposed and implemented a similar
idea. But my approach -- or at least the one I have in mind now --
would provide an "eventual consistency" guarantee, while still
allowing fast startup times.

However I think the experience with this heuristic so far shows that
the real problem isn't that the heuristic doesn't work for the normal
case; it works fine for that. Instead, what happens is that *it
doesn't work when you generate modules*.

And *that* problem can be fixed without even invalidating the caches:
it can be fixed by doing some extra work when writing a module - e.g.
by making sure the directory mtime changes again after the module is
written.

For example, create the module under a temporary name, verify that the
directory mtime is different than it was before, then keep renaming it
to different temporary names until the mtime changes again, then
rename it to the final name. (This would be very fast on some
platforms, much slower on others, but the OS itself would tell you
when it had worked.) A utility routine to "write_module()" or
"write_package()" would be easier to find than advice that says to
invalidate the cache under thus-and-such conditions, as it would show
up in searches for writing modules dynamically or creating modules
dynamically, where you could only search for info about the cache if
you knew the cache existed.
Reply all
Reply to author
Forward
0 new messages