In writing some non-trivial amount of Python code I keep running into an organizational issue. I will try to state the problem fairly generally, and follow up with a (contrived) example.
The root cause of my difficulties is that by default, the relationship between a module hierarchy and the structure of files on disk is too strong for my taste. I want to separate the two as much as possible, but I do not want to resort to non-conventional "hacks" to do it. I am posting this in an attempt to present what I perceive to be a practical problem, and to get suggestions for solutions, or opinions on the most practical policy for how to deal with it.
Like I said, I would like a weaker relationship between file system structure and module hierarchy. In particular there are two things I would like:
* Least importantly, I don't like jamming code into __init__.py, as a personal preference.
* Most importantly, I do not like to jam large amounts of code into a single source file, just for the purpose of keeping the public interface in the same package.
A contrived but hopefully illustrative example:
We have an organization "Org", which has a library, and as part of that library is code that relates to doing something with animals. As a result, the interesting top-level package for this example is:
org.lib.animal
Suppose now that I want an initial implementation of the most important animal. I want to create the class (but see [1]):
org.lib.animal.Monkey
The public interface consists of that class only (and possibly a small handful of functions). The implementation is quite significant however - it is 500 lines of code long.
At this point, we had to jam those 500 lines of code into __init__.py. Let's ignore my personal preference of not liking to put code in __init__.py; the fact remains that we have 500 lines of code in a single source file.
Now, we want to continue working on this library, adding ten additional animals.
At this point, we have these choices (it seems to me):
(1) Simply add these to __init__.py, resulting in __init__.py being 5000 lines long[2].
(2) Put each animal into its own file, resulting in org.lib.animal.Monkey now becoming org.lib.animal.monkey.Monkey, and animal X becoming org.lib.animal.x.X.
The problem I have is that both of these solutions are, in my opinion, very ugly:
* (1) is ugly from a source code management perspective, because jamming 5000 lines of code for ten different animals into a single file is bad for obvious reasons.
* (2) is ugly because we introduce org.lib.animal.x.X for animal X, which is (a) redundant in terms of naming, and (b) redundant in function, since we have a single package for each animal containing nothing but a single class of the same name.
Clearly, (1) is bad due to file/source structure reasons, and (2) is bad for module organizational reasons. So we are back to my original wish - I want to separate the two, so that I can solve (1) independently of (2).
Now, I realize that __init__.py can contain arbitrary code, and that one can override __import__. However, I do not want to resort to "hacks" just to solve this problem; I would prefer some established convention in the community, or at least something that is elegant.
What are people's thoughts on this problem?
Let me just shoot down one possible suggestion right away, to show you what I am trying to accomplish:
I do *not* want to simply break out X into org.lib.animal.x, and have org.lib.animal import org.lib.animal.x.X as X. While this naively solves the problem of being able to refer to X as org.lib.animal.X, the solution is anything but consistent because the *identity* of X is still org.lib.animal.x.X. Examples of ways this breaks things:
* X().__class__.__name__ gives unexpected results.
* Automatically generated documentation will document using the "real" package name.
* Moving the *actual* classes around by way of this aliasing would break things like pickled data structure as a result of the change of actual identity, unless one *always* pre-emptively maintains this shadow hierarchy (which is a problem in and of itself).
Thus, it's not clean. It breaks the module abstraction and as a result has unintended consequences. I am looking for some kind of clean solution. What do people do about this in practice?
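To make the aliasing concern concrete, here is a minimal sketch. It assumes a throwaway package named `animal` (a hypothetical stand-in for org.lib.animal) written to a temporary directory, and it uses the explicit relative import form that Python 3 requires. It only shows which name the re-exported class reports; note that the bare class name is unaffected, and it is the recorded module that differs.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package on disk (hypothetical layout, for illustration):
#   animal/__init__.py   re-exports Monkey
#   animal/monkey.py     defines Monkey
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "animal")
os.mkdir(pkg_dir)
with open(os.path.join(pkg_dir, "monkey.py"), "w") as f:
    f.write("class Monkey:\n    pass\n")
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("from .monkey import Monkey\n")

sys.path.insert(0, root)
importlib.invalidate_caches()
animal = importlib.import_module("animal")

# The short name is reachable at package level...
print(animal.Monkey.__name__)    # Monkey
# ...but the class still records the module that defined it.
print(animal.Monkey.__module__)  # animal.monkey
```

Anything that builds a qualified name from `__module__` (documentation tools, pickle, loggers) would therefore see the longer path unless something changes it.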
[1] Optionally, we might introduce an "animals" package such that it would become org.lib.animal.animals.Monkey, if we thought we were going to have a lot of public API outside of the animals themselves. This does not affect this discussion however, as the exact same thing would apply to org.lib.animal.animals as applies to org.lib.animal in the above example.
[2] Ignoring for now that it may not be realistic that every animal implementation would be that long; in many cases a lot of code would be in common. But feel free to substitute something else (a Zoo, say).
-- / Peter Schuller
PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schul...@infidyne.com>' Key retrieval: Send an E-Mail to getpgp...@scode.org E-Mail: peter.schul...@infidyne.com Web: http://www.scode.org
On Wed, 23 Jan 2008 03:49:56 -0600, Peter Schuller wrote:
> Let me just shoot down one possible suggestion right away, to show you
> what I am trying to accomplish:
>
> I do *not* want to simply break out X into org.lib.animal.x, and have
> org.lib.animal import org.lib.animal.x.X as X.
Then you shoot down the idiomatic answer I guess. That's what most people do.
Peter Schuller <peter.schul...@infidyne.com> writes:
> Let me just shoot down one possible suggestion right away, to show
> you what I am trying to accomplish:
>
> I do *not* want to simply break out X into org.lib.animal.x, and
> have org.lib.animal import org.lib.animal.x.X as X.
Nevertheless, that seems the best (indeed, the Pythonic) solution to your problem as stated. Rather than just shooting it down, we'll have to know more about what actual problem you're trying to solve to understand why this solution doesn't fit.
> While this naively solves the problem of being able to refer to X as
> org.lib.animal.X, the solution is anything but consistent because
> the *identity* of X is still org.lib.animal.x.X.
The term "identity" in Python means something separate from this concept; you seem to mean "the name of X".
> * X().__class__.__name__ gives unexpected results.

Who is expecting them otherwise, and why is that a problem?
> * Automatically generated documentation will document using the
>   "real" package name.
Here I lose all track of what problem you're trying to solve. You want the documentation to say exactly where the class "is" (by name), but you don't want the class to actually be defined at that location? I can't make sense of that, so probably I don't understand the requirement.
--
"If it ain't bust don't fix it is a very sound principle and remains so despite the fact that I have slavishly ignored it all my life." -- Douglas Adams
Ben Finney
>> I do *not* want to simply break out X into org.lib.animal.x, and
>> have org.lib.animal import org.lib.animal.x.X as X.

> Nevertheless, that seems the best (indeed, the Pythonic) solution to
> your problem as stated. Rather than just shooting it down, we'll have
> to know more about what actual problem you're trying to solve to
> understand why this solution doesn't fit.
That is exactly what my original post was trying very hard to explain. The problem is the discrepancy that I described between the organization desired in terms of file system structure, and the organization required in terms of module hierarchy. The reason it is a problem is that, by default, there is an (in my opinion) too strong connection between file system structure and module hierarchy in Python.
>> While this naively solves the problem of being able to refer to X as
>> org.lib.animal.X, the solution is anything but consistent because
>> the *identity* of X is still org.lib.animal.x.X.

> The term "identity" in Python means something separate from this
> concept; you seem to mean "the name of X".
Not necessarily. In part it is the name, in that __name__ will be different. But to the extent that calling code can potentially import them under different names, it's identity. Because importing the same module under two names results in two distinct modules (two distinct module objects) that have no relation with each other. So for example, if a module has a single global protected by a mutex, there are suddenly two copies of that. In short: identity matters.
> Who is expecting them otherwise, and why is that a problem?
Depends on the situation. One example is that if your policy is that instances log using a logger named by the fully qualified name of the class, then someone importing and using x.y.z.Class will expect to be able to grep for x.y.z.Class in the output of the log file.
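As a rough illustration of that logging policy, here is a small sketch. The helper `class_logger` is a hypothetical name, and the module path is set by hand so the example fits in one file; it is meant only to show where the logger name comes from.

```python
import logging


def class_logger(cls):
    """Return a logger named after the class's fully qualified name."""
    return logging.getLogger(f"{cls.__module__}.{cls.__qualname__}")


class Monkey:
    pass


# Simulate a class that was defined in a submodule and re-exported one level
# up (hypothetical module name, assigned manually for this illustration).
Monkey.__module__ = "org.lib.animal.monkey"

logger = class_logger(Monkey)
print(logger.name)  # org.lib.animal.monkey.Monkey
```

Someone who imported the class as org.lib.animal.Monkey and grepped the log for that string would find nothing, which is the mismatch described above.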
>> * Automatically generated documentation will document using the
>>   "real" package name.

> Here I lose all track of what problem you're trying to solve. You want
> the documentation to say exactly where the class "is" (by name), but
> you don't want the class to actually be defined at that location? I
> can't make sense of that, so probably I don't understand the
> requirement.
You are baffled that what I seem to want is that the definition of the class (file on disk) be different from the location inferred by the module name. Well, this is *exactly* what I want because, like I said, I do not want the strong connection between file system structure and module hierarchy. The fact that this connection exists is what is causing my problems.
Please note that this is not any kind of crazy-brained idea; lots of languages have absolutely zero relationship between file location and modules/namespaces.
I realize that technically Python does not have this either. Like I said in the original post, I do realize that I can override __import__ with any arbitrary function, and/or do magic in __init__. But I also did not want to resort to hacks, and would prefer that there be some kind of well-established solution to the problem.
Although I was originally hesitant to use an actual example for fear of giving the sense that I was trying to start a language war, your answer above prompts me to do so anyway, to show in concrete terms what I mean, for those that wonder why/how it would work.
So for example, in Ruby, there is no problem having:
File monkey.rb:
module Org
  module Lib
    module Animal
      class Monkey
        ...
      end
    end
  end
end
File tiger.rb:
module Org
  module Lib
    module Animal
      class Tiger
        ...
      end
    end
  end
end
This is possible because the act of addressing code to be loaded into the interpreter is not connected to the namespace/module system, but rather to the file system.
Some languages avoid (but do not eliminate) the problem I am having without having this disconnect. For example, Java does have a strong connection between file system structure and class names. However the critical difference is that in Java, everything is modeled around classes, and class names map directly to the file system structure. So in Java, you would have the class
org.lib.animal.Monkey
in
<wherever>/org/lib/animal/Monkey.java
and
org.lib.animal.Tiger
in
<wherever>/org/lib/animal/Tiger.java
In other words, introducing a separate file does not introduce a new package. This works well as long as you are fine with having everything related to a class in the same file.
The problem is that with Python, not everything is a class, and a file translates to a module, not a class. So you cannot have your source in different files without introducing as many packages as you introduce files.
-- / Peter Schuller
On Jan 23, 4:49 am, Peter Schuller <peter.schul...@infidyne.com> wrote:
> I do *not* want to simply break out X into org.lib.animal.x, and have
> org.lib.animal import org.lib.animal.x.X as X. While this naively
> solves the problem of being able to refer to X as org.lib.animal.X,
> the solution is anything but consistent because the *identity* of X is
> still org.lib.animal.x.X. Examples of ways this breaks things:
>
> * X().__class__.__name__ gives unexpected results.
> * Automatically generated documentation will document using the "real"
>   package name.
> * Moving the *actual* classes around by way of this aliasing would
>   break things like pickled data structure as a result of the change
>   of actual identity, unless one *always* pre-emptively maintains
>   this shadow hierarchy (which is a problem in and of itself).
You can reassign the class's module:
    from org.lib.animal.monkey import Monkey
    Monkey.__module__ = 'org.lib.animal'
(Which, I must admit, is not a bad idea in some cases.)
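As a rough sketch of what this reassignment changes, consider pickle, which stores a class by module name plus class name. The module names below (`animal_pkg`, `animal_pkg.monkey`) are hypothetical, and the modules are built in memory purely for illustration; in a real project they would be files on disk.

```python
import pickle
import sys
import types

# In-memory stand-ins for an implementation module and its public package
# (hypothetical names, registered so that pickle can look them up).
impl = types.ModuleType("animal_pkg.monkey")
public = types.ModuleType("animal_pkg")
sys.modules["animal_pkg"] = public
sys.modules["animal_pkg.monkey"] = impl


class Monkey:
    pass


impl.Monkey = Monkey              # where the class "really" lives
public.Monkey = Monkey            # the re-export in the package
Monkey.__module__ = "animal_pkg"  # the reassignment suggested above

data = pickle.dumps(Monkey())
print(b"animal_pkg.monkey" in data)  # False
restored = pickle.loads(data)
print(type(restored) is Monkey)      # True
```

With the reassignment in place, the pickled data refers only to the public name, so the implementation file could presumably be moved later without invalidating existing pickles, as long as the package keeps re-exporting the class.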
>>> I do *not* want to simply break out X into org.lib.animal.x, and
>>> have org.lib.animal import org.lib.animal.x.X as X.
>>>
>>> While this naively solves the problem of being able to refer to X as
>>> org.lib.animal.X, the solution is anything but consistent because
>>> the *identity* of X is still org.lib.animal.x.X.

>> The term "identity" in Python means something separate from this
>> concept; you seem to mean "the name of X".

> Not necessarily. In part it is the name, in that __name__ will be
> different. But to the extent that calling code can potentially import
> them under different names, it's identity. Because importing the same
> module under two names results in two distinct modules (two distinct
> module objects) that have no relation with each other. So for
> example, if a module has a single global protected by a mutex, there
> are suddenly two copies of that. In short: identity matters.
That's not true. It doesn't matter if you import a module several times at different places and with different names; it's always the same module object.
py> from xml.etree import ElementTree
py> import xml.etree.ElementTree as ET2
py> import xml.etree
py> ET3 = getattr(xml.etree, 'ElementTree')
py> ElementTree is ET2
True
py> ET2 is ET3
True
Ok, there is one exception: the main script is loaded as __main__, but if you import it using its own file name, you get a duplicate module. You could also confuse Python by adding a package root to sys.path and then importing the same modules from inside and from outside that package under different names, but... just don't do that!
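The duplicate-module case can be imitated with a short sketch. It does not reproduce the sys.path confusion itself; instead it loads one file under two different module names with importlib, bypassing sys.modules, which has the same effect. The file name `counter.py` and the helper `load_as` are hypothetical.

```python
import importlib.util
import os
import tempfile

# A tiny module with one piece of global state (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "counter.py")
with open(path, "w") as f:
    f.write("registry = []\n")


def load_as(name):
    """Execute the file at `path` as a fresh module called `name`."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


first = load_as("counter")
second = load_as("pkg.counter")

first.registry.append("seen")
print(first is second)   # False
print(second.registry)   # []
```

Each load executes the file again, so every module-level global exists twice. Ordinary `import x.y.z` and `import x.y.z as B` never do this, because they share the single entry in sys.modules.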
> I realize that technically Python does not have this either. Like I
> said in the original post, I do realize that I can override __import__
> with any arbitrary function, and/or do magic in __init__. But I also
> did not want to resort to hacks, and would prefer that there be some
> kind of well-established solution to the problem.
I don't really understand what your problem is exactly, but I think you don't require any __import__ magic or arcane hacks. Perhaps the __path__ package attribute may be useful to you. You can add arbitrary directories to this list, which are searched for submodules of the package. This way you can (partially) decouple the file structure from the logical package structure. But I don't think it's a good thing...
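A minimal sketch of the __path__ idea follows, assuming a hypothetical package `zoo` and an unrelated directory `elsewhere`, both created in a temporary location. The package's __init__.py appends the extra directory to its own __path__, so a file stored there is importable as a submodule of the package.

```python
import importlib
import os
import sys
import tempfile

root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "zoo")          # the package's own directory
extra_dir = os.path.join(root, "elsewhere")  # a directory outside the package
os.mkdir(pkg_dir)
os.mkdir(extra_dir)

# zoo/__init__.py adds the extra directory to the package's search path.
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write(f"__path__.append({extra_dir!r})\n")
# tiger.py is stored outside the package directory.
with open(os.path.join(extra_dir, "tiger.py"), "w") as f:
    f.write("class Tiger:\n    pass\n")

sys.path.insert(0, root)
importlib.invalidate_caches()
tiger = importlib.import_module("zoo.tiger")
print(tiger.Tiger.__module__)  # zoo.tiger
```

This moves where submodules are found, but each file is still its own module (`zoo.tiger`), so it does not by itself merge several files into one module namespace.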
> In other words, introducing a separate file does not introduce a new
> package. This works well as long as you are fine with having
> everything related to a class in the same file.
>
> The problem is that with Python, not everything is a class, and a
> file translates to a module, not a class. So you cannot have your
> source in different files without introducing as many packages as you
> introduce files.
Isn't org.lib.animal a package, reflected as a directory on disk? That's the same both for Java and Python. Monkey.py and Tiger.py would be modules inside that directory, just like Monkey.java and Tiger.java. Aren't they the same thing?
>> Not necessarily. In part it is the name, in that __name__ will be
>> different. But to the extent that calling code can potentially import
>> them under different names, it's identity. Because importing the same
>> module under two names results in two distinct modules (two distinct
>> module objects) that have no relation with each other. So for
>> example, if a module has a single global protected by a mutex, there
>> are suddenly two copies of that. In short: identity matters.

> That's not true. It doesn't matter if you import a module several times
> at different places and with different names; it's always the same module
> object.
Sorry, this is all my stupidity. I was being daft. When I said importing under different names, I meant exactly that. As in, applying hacks to import a module under a different name by doing it relative to a different root directory. This is however not what anyone is suggesting in this discussion. I got my wires crossed. I fully understand that "import x.y.z" or "import x.y.z as B", and so on, do not affect the identity of the module.
> Ok, there is one exception: the main script is loaded as __main__, but if
> you import it using its own file name, you get a duplicate module.
> You could confuse Python adding a package root to sys.path and doing
> imports from inside that package and from the outside with different
> names, but... just don't do that!
Right :)
> I don't really understand what your problem is exactly, but I think you
> don't require any __import__ magic or arcane hacks. Perhaps the __path__
> package attribute may be useful to you. You can add arbitrary directories
> to this list, which are searched for submodules of the package. This way
> you can (partially) decouple the file structure from the logical package
> structure. But I don't think it's a good thing...
That sounds useful if I want to essentially put the contents of a directory somewhere else, without using a symlink. In this case my problem is more related to the "file == module" and "directory == module" semantics, since I want to break contents in a single module out into several files.
> Isn't org.lib.animal a package, reflected as a directory on disk? That's
> the same both for Java and Python. Monkey.py and Tiger.py would be modules
> inside that directory, just like Monkey.java and Tiger.java. Aren't they
> the same thing?
No, because in Java Monkey.java is a class. So we have class Monkey in package org.lib.animal. In Python we would have class Monkey in module org.lib.animal.monkey, which is redundant and does not reflect the intended hierarchy. I have to either live with this, or put Monkey in .../animal/__init__.py. Neither option is what I would want, ideally.
Java does still suffer from the same problem since it forces "class == file" (well, "public class == file"). However it is less of a problem since you tend to want to keep a single class in a single file, while I have a lot more incentive to split up a module into different files (because you may have a lot of code hiding behind the public interface of a module).
So essentially, Java and Python have the same problem, but certain aspects of Java happen to mitigate the effects of it. Languages like Ruby do not have the problem at all, because the relationship between files and modules is non-existent.
-- / Peter Schuller
On Thu, 24 Jan 2008 11:57:49 -0200, Peter Schuller <peter.schul...@infidyne.com> wrote:
> In this case my problem is more related to the "file == module" and
> "directory == module" semantics, since I want to break contents in a
> single module out into several files.
You already can do that; just import the public interface of those several files into the desired container module. See below for an example.
>> Isn't org.lib.animal a package, reflected as a directory on disk? That's
>> the same both for Java and Python. Monkey.py and Tiger.py would be
>> modules inside that directory, just like Monkey.java and Tiger.java.
>> Aren't they the same thing?

> No, because in Java Monkey.java is a class. So we have class Monkey in
> package org.lib.animal. In Python we would have class Monkey in module
> org.lib.animal.monkey, which is redundant and does not reflect the
> intended hierarchy. I have to either live with this, or put Monkey in
> .../animal/__init__.py. Neither option is what I would want, ideally.
You can also put, in animal/__init__.py:
    from monkey import Monkey
and now you can refer to it as org.lib.animal.Monkey, but keep the implementation of Monkey class and all related stuff into .../animal/monkey.py

"Gabriel Genellina" <gagsl-...@yahoo.com.ar> writes:
> You can also put, in animal/__init__.py:
>     from monkey import Monkey
> and now you can refer to it as org.lib.animal.Monkey, but keep the
> implementation of Monkey class and all related stuff into
> .../animal/monkey.py
This (as far as I can understand) is exactly the solution the original poster desired to "shoot down", for reasons I still don't understand.
--
"Reichel's Law: A body on vacation tends to remain on vacation unless acted upon by an outside force." -- Carol Reichel
Ben Finney
On Jan 25, 6:45 pm, Ben Finney <bignose+hates-s...@benfinney.id.au> wrote:

> "Gabriel Genellina" <gagsl-...@yahoo.com.ar> writes:
> > You can also put, in animal/__init__.py:
> >     from monkey import Monkey
> > and now you can refer to it as org.lib.animal.Monkey, but keep the
> > implementation of Monkey class and all related stuff into
> > .../animal/monkey.py
>
> This (as far as I can understand) is exactly the solution the original
> poster desired to "shoot down", for reasons I still don't understand.
Come on, the OP explained it quite clearly in his original post. Did you guys even read it?