Selecting from a self-referential mapper: recursive joins?


Eric Ongerth

Nov 25, 2008, 2:51:37 AM
to sqlalchemy
Below, I have attached a working test case. It works, yes -- but I
need to build an improved version of a particular method on one of my
classes. The following model will probably explain itself for the
most part. I'll let you read it first, then offer a few explanatory
notes afterward just in case. Finally, at the end, I will describe
the difference between what the method in question does now and what
I would like it to do.

What I'm looking for in a response: a description of what I need to
do to build a better version of the method in question, including any
further insight on the practice of joining at multiple levels of a
recursive / self-referential (but loop-free) graph.


---snip---


from sqlalchemy import *
from sqlalchemy.sql import *
from sqlalchemy.orm import *

engine = create_engine('sqlite://')

metadata = MetaData(bind=engine)

itemtypes = Table('itemtypes', metadata,
    Column('name', Text, primary_key=True))

itemtype_inheritance = Table('itemtype_inheritance', metadata,
    Column('itemtype_name', Text, ForeignKey('itemtypes.name'),
           primary_key=True),
    Column('parent_name', Text, ForeignKey('itemtypes.name'),
           primary_key=True))

features = Table('features', metadata,
    Column('id', Integer, primary_key=True),
    Column('name', Text),
    Column('root_itemtype_name', Text, ForeignKey('itemtypes.name')))

feature_dependencies = Table('feature_dependencies', metadata,
    Column('dependent_id', Integer, ForeignKey('features.id'),
           primary_key=True),
    Column('determinant_id', Integer, ForeignKey('features.id'),
           primary_key=True))

metadata.drop_all()
metadata.create_all()

itemtypes.insert().execute([
    {'name': 'Product'},
    {'name': 'Footwear'},
    {'name': 'Boot'},
    {'name': 'Ski'}
])

itemtype_inheritance.insert().execute([
    {'itemtype_name': 'Footwear', 'parent_name': 'Product'},
    {'itemtype_name': 'Boot', 'parent_name': 'Footwear'},
    {'itemtype_name': 'Ski', 'parent_name': 'Product'}
])

features.insert().execute([
    {'id': 1, 'name': 'Manufacturer', 'root_itemtype_name': 'Product'},
    {'id': 2, 'name': 'Model', 'root_itemtype_name': 'Product'},
    {'id': 3, 'name': 'Year', 'root_itemtype_name': 'Product'},
    {'id': 4, 'name': 'Gender', 'root_itemtype_name': 'Footwear'},
    {'id': 5, 'name': 'US Shoe Size', 'root_itemtype_name': 'Footwear'},
    {'id': 6, 'name': 'Length', 'root_itemtype_name': 'Ski'},
    {'id': 7, 'name': 'Weight', 'root_itemtype_name': 'Product'}
])

feature_dependencies.insert().execute([
    {'dependent_id': 7, 'determinant_id': 1},
    {'dependent_id': 7, 'determinant_id': 2},
    {'dependent_id': 7, 'determinant_id': 3},
    {'dependent_id': 7, 'determinant_id': 4},
    {'dependent_id': 7, 'determinant_id': 5},
    {'dependent_id': 7, 'determinant_id': 6}
])


class Itemtype(object):

    def __repr__(self):
        return 'Itemtype: %s' % (self.name)

    @property
    def inherited_features(self):
        return reduce(list.extend,
                      [base_itemtype.features
                       for base_itemtype in self.inherits],
                      [])

    @property
    def features(self):
        return self.own_features.extend(self.inherited_features)

    @property
    def dependent_features(self):
        return [f for f in self.features if f.determinants]

    @property
    def independent_features(self):
        return [f for f in self.features if not f.determinants]


class Feature(object):

    def __repr__(self):
        return '%s %s' % (self.root_itemtype_name, self.name)

    def determinants_in_scope_of(self, itemtype):
        return (session.query(Feature)
                .join(FeatureDependency.determinant)
                .join(Feature.root_itemtype)
                .filter(and_(FeatureDependency.dependent_id == self.id,
                             Itemtype.name == itemtype.name))
                ).all()


class FeatureDependency(object):

    def __repr__(self):
        return "F_D: %s depends on %s" % (self.dependent.name,
                                          self.determinant.name)


mapper(Itemtype, itemtypes, properties={
    'inherits': relation(Itemtype,
        secondary=itemtype_inheritance,
        primaryjoin=(itemtypes.c.name ==
                     itemtype_inheritance.c.itemtype_name),
        secondaryjoin=(itemtype_inheritance.c.parent_name ==
                       itemtypes.c.name),
        backref='progeny'),
    'own_features': relation(Feature,
        primaryjoin=(features.c.root_itemtype_name == itemtypes.c.name),
        backref=backref('root_itemtype', uselist=False))
})

mapper(Feature, features, properties={
    'dependents': relation(Feature,
        secondary=feature_dependencies,
        primaryjoin=(feature_dependencies.c.determinant_id ==
                     features.c.id),
        secondaryjoin=(feature_dependencies.c.dependent_id ==
                       features.c.id),
        backref=backref('determinants'))
})

mapper(FeatureDependency, feature_dependencies, properties={
    'dependent': relation(Feature,
        uselist=False,
        primaryjoin=(feature_dependencies.c.dependent_id ==
                     features.c.id),
        backref='feature_dependencies_as_dependent'),
    'determinant': relation(Feature,
        uselist=False,
        primaryjoin=(feature_dependencies.c.determinant_id ==
                     features.c.id),
        backref='feature_dependencies_as_determinant')
})


Session = sessionmaker(bind=engine)
session = Session()

Product = session.query(Itemtype).filter_by(name='Product').one()
Ski = session.query(Itemtype).filter_by(name='Ski').one()
Footwear = session.query(Itemtype).filter_by(name='Footwear').one()
Boot = session.query(Itemtype).filter_by(name='Boot').one()
Weight = session.query(Feature).filter_by(name='Weight').one()
Gender = session.query(Feature).filter_by(name='Gender').one()
Year = session.query(Feature).filter_by(name='Year').one()

---snip---


(To anyone who wants to play with this and doesn't know how: save the
above code as testcase.py and run it from a shell with
python -i testcase.py; you then have a Python interpreter with all of
the above objects still active, including the sqlalchemy session, and
the ability to query and experiment.)

Notes on what this is supposed to model: Here are some of the facts in
the problem domain.

Each Itemtype inherits from others above it, except for root itemtypes
(the only one of which in this example is Product). The reason for
using an inheritance paradigm here is so that you can set Features on
any Itemtype and have them inherited by all of the subcategories (sub-
Itemtypes) of that Itemtype. In other words, once you give the
itemtype called 'Product' a feature called 'Manufacturer' and a
feature called 'Model', now *everything* that inherits from Product
(in fact the whole rest of the graph, here) has those Features.

What you can't see here is that there is a similar structure holding
all the possible values of these features, and mapping combinations
thereof onto product codes and actual physical items. This is
essentially a knowledge base for possible point-of-sale and inventory
management systems. I'm leaving the "values" part out for the sake of
this simple example.

The reason for feature_dependencies: D.R.Y. In the real world, if the
value of one variable depends on a couple of others, you don't need
to re-record the dependent variable unless its determinants (i.e. the
variables upon which it depends) themselves change. The testcase
structure above encodes dependency links between Features in order to
model the fact that, for example, the Weight of a Product is a
function of (is determined by) that Product's Model, Year, Size, and
so forth. Better yet, these dependency relationships can easily be
seen as "cascading" or inheriting along the same paths as Itemtype
inheritance, so that feature dependencies need not be separately
entered/recorded at every node of the graph. It is sufficient to
declare a feature dependency at its salient point -- meaning, at the
highest node at which the dependent feature and its determinants
first make contact -- and then simply allow the functional
relationship (feature dependency) to propagate usefully downward
through the graph to more and more specific Itemtypes. This cascading
process can pick up additional feature dependencies as needed along
the way, until ultimately it reaches "terminal" itemtypes (i.e.
leaves of the tree) which are considered fully described. The
production version will include multiple inheritance, because the
real problem domain most definitely does NOT resolve to a simple
hierarchical tree; it's a more general graph, though at least it has
no loops (whew).

In this example case I have left all but one of the Features as
independent variables and set only one Feature, 'Weight', as a
dependent of several of the rest. In actual production my terminal
itemtypes will have hundreds of Features storing a rich spread of data
about the Items that will instantiate those Itemtypes, with the bulk
of those features inherited from Itemtypes higher in the hierarchy --
the 'abstract base classes' in this system, if you will.

Anyway, just having Weight, alone, depend on several of the other
Features is enough to keep this test case simple and show what it
needs to show.

So, please observe the following results in a shell, which are as
expected/desired, and comments thereunder.

>>> Weight.determinants
[Product Manufacturer, Product Model, Product Year, Footwear Gender,
Footwear US Shoe Size, Ski Length]

comment: that's great, there we see all of the variables that
determine the Weight of something. But it's a little confusing in
that form, because the Gender of an item of Footwear has nothing to
do with the Length of a Ski. So we have a method on Feature to tune
in closer.

>>> Weight.determinants_in_scope_of(Product)
[Product Manufacturer, Product Model, Product Year]

comment: now we're getting somewhere. Those, as requested, are the
determining factors for the Weight of an item, but only the ones that
were inherited from the abstract Product itemtype.

>>> Weight.determinants_in_scope_of(Footwear)
[Footwear Gender, Footwear US Shoe Size]

comment: that's great too. We see that at the Footwear level, every
item that belongs to the Footwear branch of the tree (including Boot,
and its future descendants Ski Boot and Downhill Ski Boot, and so
forth) will have its weight be a function of gender and size --
without my having to repeat that fact over and over again in the
instances or even in those specific terminal itemtypes.

However, here's where I'm coming up short, and the reason for this
whole post.

>>> Weight.determinants_in_scope_of(Boot)
[]

comment: That is as expected. But it's not what I'm really looking
for. What I would really rather have returned by this query is the
same determinant features reported for Footwear, because Footwear is
the parent of Boot and Boot should inherit them. Furthermore, the
response to Weight.determinants_in_scope_of(Boot) should also include
the responses from Weight.determinants_in_scope_of(Product) because
Footwear inherits from Product, therefore so does Boot.

>>> Weight.determinants_in_scope_of(Ski)
[Ski Length]

comment: This is as expected. But what I would really like to get
back is a result that reflects the inheritance of feature dependencies
from Product, the parent itemtype of Ski. In other words, I would
rather have the answer to this query be [Ski Length, Product
Manufacturer, Product Model, Product Year]. I would NOT want
'Footwear Gender' to show up because that is on a different branch of
the tree; Ski does not have itemtype inheritance from Footwear, only
from Product. Nor, conversely, should Ski Length influence the Weight
of a Boot.

So that's my question, then. How do I build a method similar to
determinants_in_scope_of() that follows on up the chain of itemtype
inheritance and returns the features on which a given Feature instance
is dependent (a.k.a. its 'determinants' in the language of this
particular model), all the way up to reach any root itemtypes no
matter how many levels of inheritance are involved?

I have been experimenting with the joins and filters in this method so
far... Where I keep failing is when I try to figure out how to have
the query climb up through an arbitrary number of levels in the
Itemtype inheritance structure.

Thank you for your patience with this long post; if you've read this
far, I hope it has been interesting. Hopefully Mike or others can
suggest a better set of joins or other techniques to tweak this
structure toward the behavior I'm aiming for.

Eric

Eric Ongerth

Nov 25, 2008, 3:42:31 AM
to sqlalchemy
Unfortunately, I posted the wrong version of my Itemtype class above;
fortunately it wasn't important for what I was trying to show. Please
replace class Itemtype with the following, and note the additional
test lines and commentary which I also forgot to include.


class Itemtype(object):

    def __repr__(self):
        return 'Itemtype: %s' % (self.name)

    @property
    def inherited_features(self):
        return [feature for base_itemtype in self.inherits
                for feature in base_itemtype.features]

    @property
    def features(self):
        result = self.own_features[:]
        if self.inherits:
            result.extend(self.inherited_features[:])
        return result

    @property
    def dependent_features(self):
        return [f for f in self.features if f.determinants]

    @property
    def independent_features(self):
        return [f for f in self.features if not f.determinants]


The code I posted the first time was not the right code for the
inherited_features and features properties on Itemtype. I must have
cut and pasted from the wrong test file.

Please observe the following interactions which may help paint a
clearer picture of what I'm doing with this stuff.

>>> Boot.features
[Footwear Gender, Footwear US Shoe Size, Product Manufacturer, Product
Model, Product Year, Product Weight]

>>> Boot.dependent_features
[Product Weight]

>>> Boot.independent_features
[Footwear Gender, Footwear US Shoe Size, Product Manufacturer, Product
Model, Product Year]

>>> Footwear.inherited_features
[Product Manufacturer, Product Model, Product Year, Product Weight]

>>> Footwear.own_features
[Footwear Gender, Footwear US Shoe Size]

These are all examples of some of the useful things I'm using this for
-- keeping track of which variables ('features'), in an inheriting/
cascading type system, are dependent and which ones are independent.
Later, when values are entered and stored for each feature, the
relationships between the values will be constrained by these same
feature dependency links, without having to repeat them at that time.

Sorry for the wrong code before. Oh yeah, I forgot to demonstrate
this too, but it's trivial:

>>> Model = session.query(Feature).filter_by(name='Model').one()
>>> Model.dependents
[Product Weight]
>>> Gender.dependents
[Product Weight]

... so all this works the other way around too, though requesting the
dependents of a given feature is not as interesting as requesting a
given feature's determinants.

Eric

Eric Ongerth

Nov 25, 2008, 4:57:40 AM
to sqlalchemy
Well! I guess that's exactly why we post sometimes -- the process of
producing the test case bumps the unconscious forward a few steps. I
quit and did some pleasure reading for a while then came back. Here's
my own answer that does exactly what I needed it to do.

Add the following property on Itemtype:

@property
def full_heritage(self):

Eric Ongerth

Nov 25, 2008, 5:04:28 AM
to sqlalchemy
Arghh. Accidentally hitting 'Tab' in google groups takes you to the
'Send' button, then your next spacebar press prematurely sends your
post.

Ok, add the following property on Itemtype:

@property
def full_heritage(self):
    result = self.inherits[:]
    if result:
        for inherited in self.inherits:
            result.extend(inherited.full_heritage)
    return result

... this is just recursively building a list of all nodes involved in
inheritance from a given Itemtype node upward through the graph.
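
For example, with the test data above this should yield (a quick
check; output shown as Itemtype.__repr__ prints it):

>>> Boot.full_heritage
[Itemtype: Footwear, Itemtype: Product]
>>> Ski.full_heritage
[Itemtype: Product]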

Then change the method on Feature:

def determinants_in_scope_of(self, itemtype):
    targets = map(lambda x: x.name, itemtype.full_heritage)
    targets.append(itemtype.name)
    return (session.query(Feature)
            .join(FeatureDependency.determinant)
            .join(Feature.root_itemtype)
            .filter(and_(FeatureDependency.dependent_id == self.id,
                         Itemtype.name.in_(targets)))
            ).all()
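
With that in place, the problem cases from my first post should come
back the way I described wanting them (output as implied by the test
data; row order may vary, since the query has no order_by):

>>> Weight.determinants_in_scope_of(Boot)
[Product Manufacturer, Product Model, Product Year, Footwear Gender,
Footwear US Shoe Size]
>>> Weight.determinants_in_scope_of(Ski)
[Product Manufacturer, Product Model, Product Year, Ski Length]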

Now this machinery does exactly what I want. I look forward to
showing you what it's really used for eventually. Ciao!

Eric

a...@svilendobrev.com

Nov 25, 2008, 5:42:55 AM
to sqlal...@googlegroups.com
Eric
i'm not sure i follow all your mapping setup as it's too detailed. but:

i've been battling similar data/feature inheritance+shading stuff
along a branchy, recursive directed graph (not a pure tree, as it has
alternative short-cut paths in it), all over bitemporal objects and
values (i.e. m2m relations all over, + grouping by time/objid), for
almost half a year. see these posts of mine:
 - Many to many self referential relationship /15.03
 - multiple-aspects revisited /23.06
 - tree/graph persistency, concurency-safe? 13.07
 - unions in query?
 - and probably most others
 - as well as this thread on pgsq...@postgresql.org:
   "optimizing a query over tree-like structure", 2008-09-30
my setup (law, company, department(s)/recursive, position, workplace,
employment) is all explained there, less the bitemporalism.

also see the thread "Is there a simple way to let records have the
same groups as it parents", or just look up "data inheritance" in the
group.

and so far i've reached these decisions, based on all the experiments
(i don't come from an sql background, and OO/practicalism doesn't
give much insight into what sql can handle):
 - the root-most branches are separate queries, and a pseudo
   multi-query mimics a plain one over all of those (it can also be a
   union of all of them, or one single query - but the single one has
   20+ tables in the FROM, and the union fails here and there). it
   also turned out that different root-most branches have slightly
   different meanings, hence it's good if they are loaded separately.
 - recursion is handled by expanding it, say, 3 levels deep, hoping
   that no one will go further (i.e. a.x or a.a and (a.a.x or a.a.a
   and (a.a.a.x or ...))) etc. -- see the sketch after this list.
 - everything is generated by node-type-walking at class level, and
   the strategy of alternativ'ing on each level can be different
   (i.e. it can start as a multiquery, follow as a union on branch A
   and as a single query on branch B). i can give this code if anyone
   dares read it..
 - the query returns all values wherever reachable/associated with
   some end-node
 - actual inheritance/shading etc. is done after that in python. it
   can (probably) be done at sql-level by a very specific order-by,
   but it's a nightmarish bitemporal query already, so no need to go
   to hell any further
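
(illustrative sketch of this fixed-depth expansion against Eric's
mappings -- not the actual generated code; it assumes a reasonably
recent sqlalchemy with aliased(), and the alias names are invented:)

from sqlalchemy import or_
from sqlalchemy.orm import aliased

# one alias of the self-referential 'inherits' relation per hardwired
# level; going deeper means adding more aliases, or union'ing
# per-level queries as described above
parent1 = aliased(Itemtype)
parent2 = aliased(Itemtype)

q = (session.query(Itemtype)
     .outerjoin(parent1, Itemtype.inherits)   # level 1: parents
     .outerjoin(parent2, parent1.inherits)    # level 2: grandparents
     .filter(or_(Itemtype.name == 'Product',
                 parent1.name == 'Product',
                 parent2.name == 'Product'))
     .distinct())
# q matches itemtypes that are, or inherit within two hops from,
# 'Product'; anything deeper than the hardwired depth is missed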

the times i got are of this kind: 10 nodes with 10 values each, x100
changes each, in about 20 sec, on a relatively slow machine /
postgres.

maybe we can work together to get something out of it.

Eric Ongerth

Nov 25, 2008, 12:58:49 PM
to sqlalchemy
What I don't like about my own solution, after all (see my 3rd post
on this thread, after the accidental 2nd one prematurely submitted),
is that it recursively traverses the Itemtype graph just to build the
list of itemtypes that constrains the scope of a request for the
features upon which a given feature is dependent. That traversal
makes it pretty slow, and moreover it represents "other" activity
outside of the query itself.

Intuitively I feel that there ought to be some query syntax that
expresses the request I am trying to make entirely in sql, without
this pre-traversal step to build an in_() list of Itemtype nodes. On
the other hand, I have seen Mike write that if sqlalchemy were
expected to traverse arbitrary m:m structures to arbitrary depth when
creating joins to satisfy a query, things could get out of hand too
quickly. It's the standard problem of the space of desired solutions
having too many possibilities, with those possibilities increasing
faster than the means of describing them, leaving too much ambiguity
for a general solution to be worth attempting.

So maybe my current solution with this recursively built "in_()
list" (of itemtypes that are the root_itemtype of features, following
the itemtype inheritance graph 'upward' from a given starting feature)
is about as good as it gets. I'm just reaching out intuitively to see
if anyone says something that sparks a leap forward in my
understanding of the shape of this type of problem.
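
For what it's worth, some databases do offer query syntax for exactly
this: recursive common table expressions. A sketch, assuming a
backend with WITH RECURSIVE support (e.g. PostgreSQL 8.4+; the sqlite
in my testcase doesn't have it), reusing the table names from the
testcase:

from sqlalchemy import text

# hypothetical: climb the inheritance graph in sql itself, then scope
# the determinants of one feature (Weight has id 7 in the test data)
determinants_sql = text("""
    WITH RECURSIVE heritage(name) AS (
        SELECT CAST(:start AS TEXT)
        UNION
        SELECT ii.parent_name
        FROM itemtype_inheritance ii
        JOIN heritage h ON ii.itemtype_name = h.name
    )
    SELECT f.*
    FROM features f
    JOIN feature_dependencies fd ON fd.determinant_id = f.id
    WHERE fd.dependent_id = :feature_id
      AND f.root_itemtype_name IN (SELECT name FROM heritage)
""")

rows = engine.execute(determinants_sql,
                      start='Boot', feature_id=7).fetchall()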

Will now respond to Svilen's post.

Eric

Eric Ongerth

Nov 25, 2008, 2:43:13 PM
to sqlalchemy
Svil,

Thanks for your reply. I have been following your posts with interest
over the past half year (or even longer, I think). At first I thought
you were crazy. But now I've found myself creating a model of similar
complexity, as necessary to express the domain I'm working on.

The purpose of my model is to "ingest" all of the easily expressible
facts about the characteristics ('features') of categories and
(recursive) subcategories of items within specific sub-regions of a
domain of physical items, and to encode and store those facts in a
minimally redundant and maximally searchable / indexable form. This
supports an "instant search" / "search while you type" widget that is
fully aware of the conceptual structure within the domain being
searched, unlike a mere full-text search over a flatfile or ordinary
(non-category-structure-aware) indexings of a flatfile. This
awareness should bring significant benefits: reducing the search to
its minimal consistent combinations of targets, and a sense of "just
bringing up exactly what the user was looking for".

In the weeks ahead I will revisit some of the threads you listed.
Thank you for the conclusions and suggestions you mentioned; they seem
reasonable.

Eric




a...@svilendobrev.com

Nov 25, 2008, 4:04:11 PM
to sqlal...@googlegroups.com
> Thanks for your reply. I have been following your posts with
> interest over the past half year (or I thought even longer). At
> first I thought you were crazy. But now I've found myself creating
> a model of similar complexity, as necessary to express the domain
> I'm working on.
i think you're better off than me, as u dont have the burden of
bitemporal/multiversion stuff, and the completely different NodeTypes
(hence polymorphic-assoc to values). seems your domain is better
mathematicalizable - or u havent hit the edge of the door yet.

having less types is also a relief, u dont need a dbcook layer to
automate/hide mappings and value-type-related stuff (right now in a
prototype scenario i have say 100 tables, 20 m2m assocs, 220 foreign
keys - the final version will probably be 200/100/600 or similar...
it's unthinkable to handle them one by one, esp. if most of the team
has never seen sql).

1st, make sure your model is correct. then, note that this
full_heritage() func (mine is called something similar,
order_of_inheritance()) has the trouble of firing too many queries
itself. once u start counting the queries fired around your current
model, u'll realize that such step-by-step recursive stuff - which is
nice and pretty ok for c/python/lisp/whatever - isn't going to be
very fast... but this also depends on the size of the searchables.
yours are simple, and maybe 1000 queries arent going to be too bad.
in my case, 1 awfully complex bitemporal query is 100x faster than
1000 smaller bitemporal ones.
i found it might be better in certain cases to get more data (2x)
than needed and filter it further in python, than to go crazy
expressing the filter in flat set arithmetics (which is what sql is,
more or less). (small) compromises are not a bad thing if placed
properly.
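
in the terms of Eric's testcase, that compromise could look like the
following sketch, which assumes the corrected full_heritage property
from earlier in this thread:

# over-fetch: every determinant of Weight, regardless of scope, comes
# back through the plain relation in one go...
candidates = Weight.determinants

# ...then the scope filter runs in python instead of in sql
allowed = set(t.name for t in Boot.full_heritage) | set([Boot.name])
in_scope = [f for f in candidates if f.root_itemtype_name in allowed]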

beware: my initial code for this graph-data-inheritance was the size
of yours, while currently it's like 5x bigger - but it can work
either way (oh well, 95% there).

the moral i got from all this is: if the whole wardrobe won't go
through the door, there are a lot of seemingly partial solutions that
in the end may do better than the initial requirement - and should
NOT be ignored - but they are notoriously difficult to guess, as the
obvious initial target "solution" obscures them. be it turning it
upside down, disassembling / cutting (the wardrobe or the door.. or
another door), replacing it with 3 small cupboards... swapping
rooms... burning it and using plain hangers off the walls... moving
elsewhere...

have fun, and eeer... i'm sorta looking for a job.
too bad this _very_ "interesting times" project doesn't pay the bills
well.
svil
www.svilendobrev.com

Eric Ongerth

Nov 25, 2008, 5:43:25 PM
to sqlalchemy
Yes, I am very glad to be free of multiversion and bitemporal
concerns, although I will eventually be setting this up for partial
multimaster asynchronous replication (but with a lot of intentional
compromises in order to avoid the major, currently unsolved, problems
in that entire field).

As for different NodeTypes and polymorphic associations to values, I
am indeed taking part in something similar, but I do not allow the
polymorphism to exist on the Python side of the code/database
distinction, nor do I allow the polymorphism even to enter the ORM
layer in mappers. All of my polymorphism is strictly contained within
the database using a 'vertical tables' approach. All of that dbcook
stuff scares me, though I think I can see why you want it. I will
eventually need to wade into the waters of stricter type checking and
conversion on my tables, and that will get me into a lot of similar
concerns, but hopefully not as deeply!

Also, yes, there are definitely a lot of 80/20 concerns here, and a
willingness to go for creative and seemingly partial solutions when
appropriate.

I found this wiki article very interesting (though I had already
reinvented most of what it refers to before I ever knew about it):

http://en.wikipedia.org/wiki/Entity-attribute-value_model

It was encouraging to invent a wheel, later to find that the design
decisions I made aligned nicely with an entire existing field of
wheels. In this case I don't consider the work of reinvention to be
wasted, because this project has been my vehicle for learning Python,
sqlalchemy and much more.

Wish I could offer to work with you, but this is my spare-time
project; I don't even have a coding job yet. I'm supporting myself
with manual labor while rolling this out. I think my project will at
least get me a decent job once it's far enough into beta stage to
show to anyone; or possibly it could become my own business next
year.

Eric

Eric Lemoine

Nov 26, 2008, 2:38:34 AM
to sqlal...@googlegroups.com
Eric, a friendly comment: you do sound as crazy as Svil :-)

Eric


a...@svilendobrev.com

Nov 26, 2008, 9:15:58 AM
to sqlal...@googlegroups.com
> All of that dbcook stuff scares me, though I think I can see
> why you want it.
heh. your model will look this way:
---------------
import dbcook.usage.plainwrap as o2r

class Text(o2r.Type): pass

class Itemtype(o2r.Base):
    name = Text()
    inherits = o2r.Association.Hidden('Itemtype', backref='progeny')
    own_features = o2r.Collection('Feature', backref='root_itemtype')

class Feature(o2r.Base):
    name = Text()
    dependents = o2r.Association.Hidden('Feature',
                                        backref='determinants')

class FeatureDependency(o2r.Base):
    dependent = o2r.Reference(Feature,
                              backref='feature_dependencies_as_dependent')
    determinant = o2r.Reference(Feature,
                                backref='feature_dependencies_as_determinant')

#eo model...

#sa-setup
import sqlalchemy, sys
meta = sqlalchemy.MetaData(
    sqlalchemy.create_engine('sqlite:///', echo='echo' in sys.argv))

# map attr-types to sa-column-types
fieldtypemap = {
    Text: dict(type=sqlalchemy.String(100)),
}

mybuild = o2r.Builder(meta,
    locals(),       # just scan anything here that looks like a subclass of Base
    fieldtypemap,
    generator=True  # show how this would look in plain sqlalchemy
)

if mybuild.generator:
    print '========= generated SA set-up'
    print mybuild.generator.out
    print '========= eo generated SA set-up'
--------------
that's it. u can run it and see the generated sa-tables, sa-mappers
etc., or just go ahead with actual usage. if u use the QueryX
replacement, u can use the references, collections and associations
in the same way (a.somerel == c), even via filter_by, and use plain
python funcs like lambda self: self.dependent.name.startswith('a') as
.filter()s... and no one will notice if u change a 1:m or m:1 into
m2m somewhere in the model.
i still can't get why no one's using it. maybe it's too easy... and
people cannot see the cogwheels.

> I will eventually need to wade into the waters of stricter
> type checking and conversion on my tables, and that will get me
> into a lot of similar concerns but hopefully not as deeply!

type checking/conversion has nothing to do with dbcook, it's a
separate layer, which has/needs its own dbcook-reflector.

have fun, and sorry for the spam
svil

rich

Jan 16, 2009, 11:15:00 PM
to sqlalchemy
Eric,

Any additional conclusions about recursively querying your dependency
graph? I just found SA, and have been wondering how feasible it would
be to manage and query a graph structure somewhat similar to the one
you've described. My problem is that I have a number of classes that
build expensive data attributes, each of which is one step toward
producing many possible different solutions. The classes are
organized so that all attributes built by a class depend on the same
set of parameters, so that if I know the set of parameter values, I
know that all of the data attributes built by that class are
congruent. As each class instance performs one or more
transformations, it needs to be able to retrieve the correct data
structures, i.e. those that have been built with the same parameter
values.

Currently I keep the data attributes in python shelves, which I plan
to continue doing, as they are too large to stick in sql (numpy
arrays of up to a GB or two), and I manage a half-baked sql db to
track the class instances, their parameter values, and shelf file
names. I want to use SA to manage the shelf file names and
directories by mirroring the dependency structure of the classes.

The problem is that there are a bunch of parameters, and the more
dependent the classes become (deep children in a graph), the more
parameter values are needed to exactly define the data attributes. In
addition, I don't know the complete dependency structure until
run-time, as some classes can depend on an arbitrary number of other
classes (these other classes have different methods to produce the
needed data).

If SA could manage a dependency graph as I build the data attributes,
I could then query it to check whether a needed data attribute with
the right specifications already existed, and in which shelf in which
directory (one shelf per class instance, holding the keyed data
attributes). Each class instance would not need to store the
parameter values of all the class instances it depended on (there may
be scores of such parameter values), but only the small subset
directly used by its methods, as a query could traverse the graph and
check which parameter values had been used at each step.

I thought that if I made a simple class structure that mirrored my
working classes, all inheriting from a class that would serve as the
nodes in the graph, I might be able to reasonably query to see
whether a data structure had already been built with the correct
parameter values. I guess this means that this class would be
self-referential, and that I would be performing some recursive
traversal for the query.
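
A sketch of that mirroring layer, in the same classic Table/mapper
style as Eric's testcase; every name below is invented for
illustration, and the parameter storage is deliberately naive:

from sqlalchemy import (MetaData, Table, Column, Integer, Text,
                        ForeignKey)
from sqlalchemy.orm import mapper, relation

metadata = MetaData()

nodes = Table('nodes', metadata,
    Column('id', Integer, primary_key=True),
    Column('classname', Text),    # which working class built this
    Column('params', Text),       # its own small set of parameter values
    Column('shelf_path', Text))   # where the big numpy payload lives

node_edges = Table('node_edges', metadata,
    Column('child_id', Integer, ForeignKey('nodes.id'),
           primary_key=True),
    Column('parent_id', Integer, ForeignKey('nodes.id'),
           primary_key=True))

class Node(object):
    pass

mapper(Node, nodes, properties={
    'depends_on': relation(Node,
        secondary=node_edges,
        primaryjoin=(nodes.c.id == node_edges.c.child_id),
        secondaryjoin=(node_edges.c.parent_id == nodes.c.id),
        backref='dependents'),
})

def full_dependency_closure(node):
    # recursive python-side traversal, same shape as the
    # full_heritage property earlier in this thread
    result = list(node.depends_on)
    for parent in node.depends_on:
        result.extend(full_dependency_closure(parent))
    return result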

Currently my graph would be less than ten nodes deep, occasionally
wide (a class might require a thousand instances of another class),
and speed is not much of an issue, as manipulating these big data
attributes takes the vast majority of the time and resources. Data
integrity, however, does matter: if I lose the parameter values used
to build an object, it can become useless.

I am being conservative by creating new classes to mirror the
dependency structure, as I don't want to take the deep plunge yet (I
have RAM issues and need to let the working classes die easily (maybe
not a problem), plus I've overloaded __getattr__ to fetch data
attributes from the shelves, and don't want to mess with this until I
understand SA a bit more).

Any lessons? I found your posts useful, though apparently from the
crazed fringe... :)

Rich

a...@svilendobrev.com

Jan 17, 2009, 3:52:01 AM
to sqlal...@googlegroups.com
u can look at my posts on the topic and all related too (-:)
search for recursive/recursion/graph/sets

in short:

- storage:
  - store the graph in pure form (nodes/edges only) - easy to
    maintain/update, very difficult to query
  - store some preprocessed (cached) denormalized form (all paths
    per node, or all reachable nodes, or ...) - hard/expensive to
    maintain/update, easy for some queries, hard for others -
    depends on the form used

- retrieving:
  - try making sql do the recursive stuff for you - not very
    scalable unless u use recursive sql (??)
  - do the recursion yourself and load one level at a time - can be
    slow; i.e. 10 levels = 10 queries
  - load the whole thing and do all the math in memory - may or may
    not be feasible

10 nodes deep is a lot. i mean, if u store only pure nodes and edges,
and need to find the paths/stuff related to A1, it would be like:
 A1 and stuff-there
or
 A1 -> A1' and stuff-there
or
 A1 -> A1" -> A1"' and stuff-there
or ...
which will make all of these appear in the FROM clause...; u can make
a union of these to avoid the FROM explosion, but it still has to be
expanded beforehand, as plain sql is not recursive (well, some sqls
are).

so u have to store the graph in a denormalized/cached form - nested
sets or whatever the other way was; i.e. u do the recursive graph
math (using kjbuckets or similar) before saving, and save the
intermediate non-recursive (and possibly repeating) results.
depending on what queries u need, u can/should store different stuff
- e.g. storing _all_ reachable nodes per node gives u the ability to
answer a certain class of queries, other things give u other
abilities - think of it like being able to move down one direction
and not another. choosing the representation also depends on the cost
of updating it - adding one node or one edge may cause updating 99%
of the data.
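
to make the "all reachable nodes per node" form concrete, a sketch
with invented table names - closure(ancestor, descendant) is
pre-seeded with an (n, n) row per node:

from sqlalchemy import text

# on adding the edge child -> parent, splice the two halves of the
# closure together: everything at or above the parent now reaches
# everything at or below the child
add_edge = text("""
    INSERT INTO closure (ancestor, descendant)
    SELECT sup.ancestor, sub.descendant
    FROM closure sup, closure sub
    WHERE sup.descendant = :parent
      AND sub.ancestor = :child
""")

# after that, "all ancestors of node X" is one flat non-recursive query:
ancestors = text("SELECT ancestor FROM closure WHERE descendant = :node")

# caveat: with multiple paths in a DAG the same pair can arrive twice,
# so dedupe it (SELECT DISTINCT plus a unique constraint, say)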

u can also store the graph in pure form and use sql only as storage,
i.e. always load the whole graph, do the recursive math yourself,
then query the db for extra data according to the results. i have one
working case where this is much faster/easier than any other approach
(users/roles/modules/funcs and permissions across each/all of these).

although sql is about set math internally, that part of it isn't very
popular or well developed - apart from having union, intersection,
subtract etc. operators. at least i could not find a reasonable way
to make it do my set arithmetics... well, i'm not an sql guru at all,
just using it as the least evil ;-).

ciao
svilen
www.svilendobrev.com