a persistent object store for Python

Guido van Rossum

Nov 14, 1994, 10:44:28 AM

> I have often thought that it would be useful to be able to
> save the state of Python objects in a database of some
> sort. I had a look at the module 'persist' but have to
> say that I was disappointed since it simply isn't that
> useful. I know that there are plans to support persistence
> at the language level but I couldn't really wait for that.
> Therefore, I have written a module called 'pos' that provides
> a persistent store for Python objects. While it isn't fully
> orthogonal (meaning that ANY type can be made persistent) it
> serves most of my needs.

Cool! I especially like the idea of compressing the database (and
your almost completely automatic interface for specifying
compression). You also show what needs to be done for making class
instances persistent (i.e. make the module name available to the
persistency code and have an __init__ method that can be called
without arguments).

If you want to read something about the persistency discussion held at
the recent Python Workshop, have a look at this URL:

http://www.eeel.nist.gov/python/workshop11-94/persistency.html

I added a pointer to your page there.

Some notes:

- You define a module called "types.py". Unfortunately, starting with
release 1.1, Python has a standard module with the same name and
somewhat similar functionality, so I would suggest rewriting your
code to use that module, or renaming your version...

- I think there's a bug in your assumptions about recursive objects.
While the following, trivial, recursive list can be saved just fine:

l = []
l.append(l)
db = pos.create('spam')
db.setroot(l)
db.save()

The following causes problems:

l = []
l.append(l)
t = (l, l)
db = pos.create('spam')
db.setroot(t)
db.save()

It seems to run into an infinite loop because it thinks that immutable
objects cannot be recursive. I'm sure you can fix this.
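
A minimal sketch of the usual remedy, assuming nothing about the real pos internals: remember every container already visited, keyed by id(), and apply that check to tuples as well as lists. The name walk is hypothetical and only illustrates the traversal, not the saving.

```python
def walk(obj, memo=None):
    """Visit every object reachable from obj exactly once."""
    if memo is None:
        memo = {}
    # Check the memo for immutable containers too: a tuple cannot be
    # modified, but it can hold a list that leads back into a cycle.
    if id(obj) in memo:
        return memo
    memo[id(obj)] = obj
    if isinstance(obj, (list, tuple)):
        for item in obj:
            walk(item, memo)
    elif isinstance(obj, dict):
        for key, value in obj.items():
            walk(key, memo)
            walk(value, memo)
    return memo

l = []
l.append(l)
t = (l, l)
seen = walk(t)
print(len(seen))  # 2: the tuple and the self-referencing list
```

Because the memo is consulted before descending, the traversal terminates on the (l, l) case instead of looping.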

- Another bug, of a more mundane type:

>>> db = pos.create('spam')
Traceback (innermost last):
File "<stdin>", line 1, in ?
File "./pos.py", line 538, in create
raise PosError, PosDbExists
NameError: PosDbExists

Note that apparently PosDbExists doesn't exist! And after this,
creating a different database raises an internal error:

>>> db = pos.create('spam2')
Traceback (innermost last):
File "<stdin>", line 1, in ?
File "./pos.py", line 532, in create
raise PosError, PosGeneric
PosError: PosGeneric

Anders Lindstrom

Nov 14, 1994, 12:45:14 AM

I have often thought that it would be useful to be able to
save the state of Python objects in a database of some
sort. I had a look at the module 'persist' but have to
say that I was disappointed since it simply isn't that
useful. I know that there are plans to support persistence
at the language level but I couldn't really wait for that.
Therefore, I have written a module called 'pos' that provides
a persistent store for Python objects. While it isn't fully
orthogonal (meaning that ANY type can be made persistent) it
serves most of my needs.

If you have access to the WWW, the documentation is available
at http://www.gh.cs.su.oz.au/Users/anders/python/pos.html.

If you want a copy of the module get the files pos.py and types.py
from ftp.gh.cs.su.oz.au:pub/anders/python and install them in a
place where Python can find them.

Here's an excerpt from the WWW description page:

The 'pos' module provides a way to make almost arbitrary Python
objects persistent. It implements the Pos class (for Persistent Object Store).
Each instance of a Pos (simply called a database from now on) has a 'root
object'. This object is settable using the 'setroot' method described below.
Other than setting the root object, there is no explicit command to make an
object persistent. Rather, one simply has to make the object 'reachable'
from the root object. Reachability means that there is a path of references
from the root of the database to the object. When the database is saved, all
objects that are reachable from the root will be automatically included in
the database. For example, one could run the following script:

import pos

db = pos.create('foobar')

l = []

db.setroot(l)

l.append([1, 2, 3])
l.append({'key1':1, 'key2':2})

db.save()

and then quit Python. After this, one could write the following script:

import pos

db = pos.open('foobar')

print db.getroot()

which would produce the output:

[[1, 2, 3], {'key1': 1, 'key2': 2}]

Notice that everything referenced from the root is automatically saved.

The store preserves object identity. This means that if two objects reference
a third object before the database is saved then they will reference the
same object when the database is opened at a later stage. In other words, the
usual Python semantics for references are preserved.
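
A small illustration of what preserving identity means. The pos module itself is not assumed to be available here, so the standard pickle module is used as a stand-in serializer; only the behaviour being shown matters, not the API.

```python
import pickle

shared = [1, 2, 3]
root = {'a': shared, 'b': shared}   # two references to one list

# Round-trip through bytes, as a save followed by an open would do.
restored = pickle.loads(pickle.dumps(root))

print(restored['a'] is restored['b'])  # True: still a single shared list
print(restored['a'] is shared)         # False: it is a new copy of that list
```

Appending to restored['a'] is therefore visible through restored['b'], just as it would have been before saving.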

Anders.

--

Anders Lindstrom Phone: +61 2 692 4174
Basser Department of Computer Science Fax: +61 2 692 3838
Madsen Building F09
University of Sydney NSW 2006
Australia

WWW: http://www.gh.cs.su.oz.au/Users/anders/index.html

Guido van Rossum

Dec 1, 1994, 6:21:03 PM

Hello. This is to announce version 1.0 of the "flatten", "pos" and
"copy" modules that I intend to become part of a future Python
release. The "flatten" module implements a basic algorithm to convert
(nearly) arbitrary Python objects into byte streams and back. The
"pos" module uses "flatten" to implement a simple persistent object
store. The "copy" module is actually unrelated except that it uses
the hooks that classes can provide to support flattening to implement
a deep copy operation.

Consider this an alpha release at most -- I am interested in any kind
of feedback, from design to implementation to performance. The
interface, implementation and file format are all subject to change.

For now, the files are only retrievable using WWW clients. (Use "save
to disk", DON'T save the text from your Mosaic window.) The URL
describing the three modules (though it mostly goes into "flatten")
is:

http://www.eeel.nist.gov/python/workshop11-94/FlattenPython.html

I am grateful to Anders Lindstrom for his pos module -- mine doesn't
use his implementation but emulates (most of) his interface. (Anders
-- I hope you don't mind, if you do I will change the module name.)

For more background material, see this URL:

http://www.eeel.nist.gov/python/workshop11-94/persistency.html

Here's the documentation string for "flatten":

__doc__ = """\
Flattening Algorithm
--------------------

This module implements a basic but powerful algorithm for "flattening"
(a.k.a. serializing or marshalling) nearly arbitrary Python objects.
This is a more primitive notion than persistency -- although flatten
reads and writes file objects, it does not handle the issue of naming
persistent objects, nor the (even more complicated) area of concurrent
access to persistent objects. The flatten module can transform a complex
object into a byte stream and it can transform the byte stream into
an object with the same internal structure. The most obvious thing to
do with these byte streams is to write them onto a file, but it is also
conceivable to send them across a network or store them in a database.

Unlike the built-in marshal module, flatten handles the following correctly:

- recursive objects
- pointer sharing
- class instances

Flatten is Python-specific. This has the advantage that there are no
restrictions imposed by external standards such as CORBA (which probably
can't represent pointer sharing or recursive objects); however it means
that non-Python programs may not be able to reconstruct flattened Python
objects.

Flatten uses a printable ASCII representation. This is slightly more
voluminous than a binary representation. However, small integers actually
take *less* space when represented as minimal-size decimal strings than
when represented as 32-bit binary numbers, and strings are only longer
if they contain control characters or 8-bit characters. The big advantage
of using printable ASCII (and of some other characteristics of flatten's
representation) is that for debugging or recovery purposes it is possible
for a human to read the flattened file with a standard text editor. (I could
have gone a step further and used a notation like S-expressions, but the
parser would have been considerably more complicated and slower, and the
files would probably have become much larger.)

Flatten doesn't handle code objects, which marshal does.
I suppose flatten could, and maybe it should, but there's probably no
great need for it right now (as long as marshal continues to be used
for reading and writing bytecode objects), and at least this avoids
the possibility of smuggling Trojan horses into a program.

For the benefit of persistency modules written using flatten, it supports
the notion of a reference to an object outside the flattened data stream.
Such objects are referenced by a name, which is an arbitrary string of
printable ASCII characters. The resolution of such names is not defined
by the flatten module -- the persistent object module will have to implement
a method "persistent_load". To write references to persistent objects,
the persistent module must define a method "persistent_id" which returns
either None or the persistent ID of the object.

There are some restrictions on the flattening of class instances.

First of all, the class must be defined at the top level in a module.
Then, to make it easy for the flattening module to find the module name,
classes should have a class or instance variable __module__ whose value
is the module name containing the class name (this will be added
automatically in future generations of the interpreter).

Next, it must normally be possible to create class instances by calling
the class without arguments. If this is undesirable, the class can
define a method __getinitargs__ (XXX not a pretty name!), which should
return a *tuple* containing the arguments to be passed to the class
constructor.

Classes can influence how they are flattened -- if the class defines
the method __getstate__, it is called and the return state is flattened
as the contents for the instance, and if the class defines the
method __setstate__, it is called with the unflattened state. (Note
that these methods can also be used to implement copying class instances.)
If there is no __getstate__ method, the instance's __dict__
is flattened. If there is no __setstate__ method, the flattened object
must be a dictionary and its items are assigned to the new instance's
dictionary. (If a class defines both __getstate__ and __setstate__,
the state object needn't be a dictionary -- these methods can do what they
want.)

Note that when class instances are flattened, their class's code and data
is not flattened along with them. Only the instance data is flattened.
This is done on purpose, so you can fix bugs in a class or add methods and
still load objects that were created with an earlier version of the
class. If you plan to have long-lived objects that will see many versions
of a class, it may be worthwhile to put a version number in the objects so
that suitable conversions can be made by the class's __setstate__ method.

The interface is as follows:

To flatten an object x onto a file f, open for writing:

F = flatten.Flattener(f)
F.dump(x)

To unflatten an object x from a file f, open for reading:

U = flatten.Unflattener(f)
x = U.load()

The Flattener class only calls the method f.write with a string argument
(XXX possibly the interface should pass f.write instead of f).
The Unflattener calls the methods f.read (with an integer argument)
and f.readline (without argument), both returning a string.
It is explicitly allowed to pass non-file objects here, as long as they
have the right methods.

The following types can be flattened:

- None
- integers, long integers, floating point numbers
- strings
- tuples, lists and dictionaries containing flattenable objects
- class instances whose __dict__ or __setstate__() is flattenable

Attempts to flatten unflattenable objects will raise an exception
after having written an unspecified number of bytes to the file argument.

It is possible to make multiple calls to Flattener.dump() or to
Unflattener.load(), as long as there is a one-to-one correspondence
between Flattener and Unflattener objects and between dump and load calls
for any pair of corresponding Flattener and Unflattener. WARNING: this
is intended for flattening multiple objects without intervening modifications
to the objects or their parts. If you modify an object and then flatten
it again using the same Flattener instance, the object is not flattened
again -- a reference to it is flattened and the Unflattener will return
the old value, not the modified one. (XXX There are two problems here:
(a) detecting changes, and (b) marshalling a minimal set of changes.
I have no answers. Garbage Collection may also become a problem here.)
"""
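
A minimal sketch of the version-number advice in the docstring above. The flatten module is not assumed to be available, so the standard copy module, which honours the same __getstate__ and __setstate__ hook names, drives the round trip. The Point class, its VERSION attribute and the old 'pos' layout are all hypothetical.

```python
import copy

class Point:
    VERSION = 2

    def __init__(self, x=0, y=0):   # callable without arguments
        self.x, self.y = x, y

    def __getstate__(self):
        # Store a version number alongside the instance data.
        return {'version': self.VERSION, 'x': self.x, 'y': self.y}

    def __setstate__(self, state):
        # Hypothetical version 1 kept both coordinates in one tuple
        # under the key 'pos' and carried no version number.
        if state.get('version', 1) < 2:
            x, y = state['pos']
            state = {'x': x, 'y': y}
        self.x, self.y = state['x'], state['y']

clone = copy.deepcopy(Point(3, 4))      # goes through both hooks

old = Point.__new__(Point)              # simulate loading old data
old.__setstate__({'pos': (5, 6)})

print((clone.x, clone.y), (old.x, old.y))  # (3, 4) (5, 6)
```

The conversion lives entirely in __setstate__, so data written by an earlier version of the class can still be loaded after the class changes.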

Here's the documentation string for "pos":

__doc__ = """\
Persistent Object Store
-----------------------

This module implements a limited form of persistent objects,
using the algorithm implemented by the flatten module.

This particular interface was designed by Anders Lindstrom
<and...@cs.su.oz.au>, who used a different representation of the
database on disk (a Python program -- cute but if it grows too big the
parser will choke).

The requirement of the original interface that classes must be
subclasses of PosObject is dropped here -- instead, classes should
have a class or instance variable __module__ whose value is the
module name containing the class name (this will be added automatically
in future generations of the interpreter). For backward compatibility,
a PosObject class is provided which stores this attribute on the instance.
It must also be possible to create class instances by calling the class
without arguments.

The interface is the following:

db = pos.create(filename) -- create a new database
db = pos.open(filename) -- open an existing database
db.setroot(object) -- set the root object of the database
object = db.getroot() -- get the root object of the database
db.save() -- write the database to the file
db.compress() -- compress the database
db.uncompress() -- uncompress the database

Compression is transparent to the user: once a database has
acquired the 'compressed' attribute, it is automatically uncompressed
when it is opened, and it will automatically be saved in compressed
form. Compression uses gzip by default but also supports compress.
"""
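
To make the interface above concrete, here is a rough sketch of a store with the same method names. It is not the pos implementation: the standard pickle and gzip modules stand in for flatten and for the real compression handling, and the names TinyStore and open_store are hypothetical (open_store avoids shadowing the built-in open).

```python
import gzip
import os
import pickle
import tempfile

class TinyStore:
    def __init__(self, filename, root=None, compressed=False):
        self.filename = filename
        self.root = root
        self.compressed = compressed

    def setroot(self, obj):
        self.root = obj

    def getroot(self):
        return self.root

    def compress(self):
        self.compressed = True      # takes effect at the next save()

    def uncompress(self):
        self.compressed = False

    def save(self):
        # Everything reachable from the root is written in one go.
        opener = gzip.open if self.compressed else open
        with opener(self.filename, 'wb') as f:
            pickle.dump(self.root, f)

def create(filename):
    if os.path.exists(filename):
        raise FileExistsError(filename)
    return TinyStore(filename)

def open_store(filename):
    # Detect compression from the two-byte gzip magic number.
    with open(filename, 'rb') as f:
        compressed = f.read(2) == b'\x1f\x8b'
    opener = gzip.open if compressed else open
    with opener(filename, 'rb') as f:
        return TinyStore(filename, pickle.load(f), compressed)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'foobar')
    db = create(path)
    db.setroot([[1, 2, 3], {'key1': 1, 'key2': 2}])
    db.compress()
    db.save()
    print(open_store(path).getroot())
```

Detecting the compressed state from the file contents is one way to make compression transparent on open; the real module may record it differently.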

There's no documentation yet for "copy".

--Guido van Rossum, CWI, Amsterdam <mailto:Guido.va...@cwi.nl>
<http://www.cwi.nl/cwi/people/Guido.van.Rossum.html>

Aaron Watters

Dec 2, 1994, 11:42:28 AM

In article <9412012321.AA26359@tesla> Guido van Rossum <gu...@eeel.nist.gov> writes:
>Hello. This is to announce version 1.0 of the "flatten", "pos" and
>"copy" modules that I intend to become part of a future Python
>release. The "flatten" module implements a basic algorithm to convert
>(nearly) arbitrary Python objects into byte streams and back....

Guido, this is even cooler than usual.

>....Flatten uses a printable ASCII representation....

This is the Right Way to Go. I hope it can also be edited by hand
without screwing up retrieval. [Of course this might mess up
positional pointers, but could there be an option for nonpositional or
whoops-recoverable representations? Just guessing about the
implementation here...]

Imagine implementing all that junk in /etc using python -- but still
allowing ambitious sysadmins to mess with the data files by hand when
needed (as they often like to do).

It's actually nice that you're not forcing any concurrency
control decisions/overhead on us, since, in principle, we can
wrap up your module in classes that do it the way we like!

The only problem with this that I see (having not tried it) is
that now I'll be tempted to rip up all that code I have that
uses marshal. Earlier I tried to build large data structures
by simply using straight python declarations, but the interpreter choked
when the files got large -- I trust this won't be a problem
with the new stuff? -a.

Guido van Rossum

Dec 2, 1994, 5:43:08 PM

> >....Flatten uses a printable ASCII representation....
>
> This is the Right Way to Go. I hope it can also be edited by hand
> without screwing up retrieval. [Of course this might mess up
> positional pointers, but could there be an option for nonpositional or
> whoops-recoverable representations? Just guessing about the
> implementation here...]

Have a look at it yourself. It's not very readable (someone suggested
S-expressions but they seem too verbose to parse efficiently in a
prototype written in Python and will also be a lot larger) but it's
editable once you know what the code letters mean. I actually
modified the design so that there are no counts in there -- to create
a tuple, the instruction stream reads MARK <OBJECT> <OBJECT>
... TUPLE, where the original design read <OBJECT> <OBJECT> ... TUPLE
<COUNT>.
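
A toy sketch of the MARK idea, assuming a simplified instruction format of (opcode, argument) pairs rather than flatten's real code letters. The names run, program and the INT opcode are hypothetical.

```python
MARK = object()   # unique sentinel pushed on the stack by a MARK instruction

def run(instructions):
    """Interpret a tiny MARK ... TUPLE instruction stream."""
    stack = []
    for op, arg in instructions:
        if op == 'INT':
            stack.append(arg)
        elif op == 'MARK':
            stack.append(MARK)
        elif op == 'TUPLE':
            # Pop back to the most recent mark; no element count is needed.
            items = []
            while stack[-1] is not MARK:
                items.append(stack.pop())
            stack.pop()               # discard the mark itself
            items.reverse()           # restore the original order
            stack.append(tuple(items))
        else:
            raise ValueError('unknown opcode: %r' % (op,))
    return stack.pop()

program = [('MARK', None), ('INT', 1),
           ('MARK', None), ('INT', 2), ('INT', 3), ('TUPLE', None),
           ('TUPLE', None)]
print(run(program))  # (1, (2, 3))
```

With this layout, adding an element to a tuple by hand means inserting one instruction; with a trailing count, the count would also have to be corrected.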

> Imagine implementing all that junk in /etc using python -- but still
> allowing ambitious sysadmins to mess with the data files by hand when
> needed (as they often like to do).

I don't know what you mean there...

> It's actually nice that you're not forcing any concurrency
> control decisions/overhead on us, since, in principle, we can
> wrap up your module in classes that do it the way we like!

That's what came out of endless discussion rounds with e.g. the
Infoseek people, whose persistency implementation is probably more
robust right now (having been used for real applications) but is a
tangled combination of flattening, persistency and sharing...

> The only problem with this that I see (having not tried it) is
> that now I'll be tempted to rip up all that code I have that
> uses marshal. Earlier I tried to build large data structures
> by simply using straight python declarations, but the interpreter choked
> when the files got large -- I trust this won't be a problem
> with the new stuff? -a.

No, flatten scales well. The unflattener reads and interprets the
file a character or line at a time. All approaches that generate
Python code (e.g. Anders' original pos.py, and also the Infoseek code)
make the interpreter choke by the time things are getting interesting.

Aaron Watters

Dec 5, 1994, 9:09:55 AM

In article <9412022243.AA27983@tesla> Guido van Rossum <gu...@eeel.nist.gov> writes:
>> Imagine implementing all that junk in /etc using python -- but still
>> allowing ambitious sysadmins to mess with the data files by hand when
>> needed (as they often like to do).
>
>I don't know what you mean there...

Unix, in particular, has somewhat more than 5 buzzillion configuration
files, each with its own groovy format, most of which do very simple
things. They could all be replaced by a common general format, but
since people like to mess with these files using vi, hand editing
better be feasible (and commenting? -- if you don't have it yet add it
please!). This feature, although simple, is of *great* practical
importance for python. Congrats again on your good insight.
-a.

Quentin Stafford-Fraser

Dec 9, 1994, 7:27:14 AM

aa...@cis.njit.edu (Aaron Watters) wrote:

> ...hand editing
> better be feasible (and commenting? -- if you don't have it yet add it
> please!).

Mmmm. If you have comments in the text file, and you read it and then
write it again, I think you're bound to lose them, unless all python
objects get a comment field.

Now what happened to that online help discussion we had a year ago?

Quentin

Skip Montanaro

Dec 10, 1994, 9:58:28 AM

While Guido explicitly didn't do anything to make the flattened
representation accessible to non-Python environments, I think it's worth
considering how/whether to save flattened Python data in a
machine-independent binary form, preferably one with some sort of
self-documenting capability.

While at GE I had occasion to use netCDF for storing and retrieving
scientific data (see http://unidata.ucar.edu/packages/netcdf/index.html).
Naturally, scientific data tends to have lots of numbers with magnitudes
that make it very inefficient to represent as ASCII, so some of the design
assumptions were different than what Guido was working with. It was
designed for scientific data and must support FORTRAN and C interfaces, so
it wasn't very good with structs, but it might be a useful model to take a
look at. (In fact, it may support structs by now, what with FORTRAN 90 and
all.) It uses Sun's XDR for data storage and has both ASCII and binary
versions. Each file is self-describing. You can dump a file header that
tells you exactly what's in the file.
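
A quick way to see the size trade-off between the two approaches. This compares only the digits of a decimal string with a 4-byte big-endian integer (the layout XDR uses for integers); it ignores any type codes or terminators a real format such as flatten's would add. The helper name sizes is hypothetical.

```python
import struct

def sizes(n):
    """Return (ASCII decimal length, fixed binary length) in bytes."""
    text = str(n).encode('ascii')
    binary = struct.pack('>i', n)   # 4-byte big-endian signed integer
    return len(text), len(binary)

for n in (7, 1994, 123456789):
    print(n, sizes(n))
# 7 (1, 4)
# 1994 (4, 4)
# 123456789 (9, 4)
```

Small integers favour the text form, while large magnitudes favour the fixed-width binary form.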

--
Skip Montanaro sk...@automatrix.com (518)372-5791
Automatrix - World-Wide Computing Solutions http://www.automatrix.com/
Check out: OWP&P Architects: http://www.automatrix.com/owp/

Ken Manheimer

Dec 10, 1994, 12:13:45 PM

On Fri, 9 Dec 1994, Quentin Stafford-Fraser wrote:

> [Stuff concerning persistence of comments in pickled-object files...]

>
> Now what happened to that online help discussion we had a year ago?
>
> Quentin

We at least have a proposal for docstring-style comments for objects,
which i presented and we discussed at the NIST python meeting. I've
submitted the proposal via a web server. See:

http://www.eeel.nist.gov/python/workshop11-94/docstr-prop.html

Does that begin to address the online-help discussion to which you
refer? (I don't recall it, specifically, and may not have been on the
list at the time.)

I can forward a plain-text copy of that proposal, on request. (If you want
to look at the collection of "software management" proposals, see
'sftwr-mgmt-report.html' in the same dir. I'm quite interested in
feedback on any of those items...)

Ken
ken.ma...@nist.gov, 301 975-3539