Relationship operator vs join and filter


Wouter Overmeire

Mar 21, 2011, 7:21:23 AM
to sqlalchemy
New to SQL and SQLAlchemy.
Using an in-memory SQLite database.

Trying to do some queries based on the examples in the SQLAlchemy
documentation.

I have classes: Tag, Article, Author, Award
article <--> tag: many to many
article <--> author: many to many (so one article can have multiple
authors)
award <--> author: many to many
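
For reference, the mappings look roughly like this (a minimal declarative
sketch; the table and column names follow my setup, while the award table
and the association-table definitions are filled in as assumptions):

from sqlalchemy import Table, Column, Integer, String, ForeignKey, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship, sessionmaker

Base = declarative_base()

# Plain association tables for the three many-to-many links.
tArticleTags = Table('tArticleTags', Base.metadata,
    Column('articleId', Integer, ForeignKey('tArticles.id')),
    Column('tagId', Integer, ForeignKey('tTags.id')))

tAuthorArticles = Table('tAuthorArticles', Base.metadata,
    Column('articleId', Integer, ForeignKey('tArticles.id')),
    Column('authorId', Integer, ForeignKey('tAuthors.id')))

tAuthorAwards = Table('tAuthorAwards', Base.metadata,   # table name assumed
    Column('authorId', Integer, ForeignKey('tAuthors.id')),
    Column('awardId', Integer, ForeignKey('tAwards.id')))

class Article(Base):
    __tablename__ = 'tArticles'
    id = Column(Integer, primary_key=True)
    txt = Column(String)
    tags = relationship('Tag', secondary=tArticleTags, backref='articles')
    authors = relationship('Author', secondary=tAuthorArticles, backref='articles')

class Tag(Base):
    __tablename__ = 'tTags'
    id = Column(Integer, primary_key=True)
    tag = Column(String)

class Author(Base):
    __tablename__ = 'tAuthors'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    awards = relationship('Award', secondary=tAuthorAwards, backref='authors')

class Award(Base):
    __tablename__ = 'tAwards'   # table name assumed
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine('sqlite://')   # in-memory SQLite
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()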

Find all articles written by 'author1':

articles = session.query(Article).filter(
    Article.authors.any(Author.name == 'author1')).all()

This works fine until I start increasing the number of articles and
authors. With 50 authors, each with 1000 articles, and 10 tags per
article (out of 100 tags in total), the above query takes forever; I
never let it complete and interrupted it after 15 minutes. The query is
just as in the documentation
(http://www.sqlalchemy.org/docs/orm/tutorial.html?highlight=blogpost#building-a-many-to-many-relationship),
so is it supposed to work on larger tables? Very likely I'm doing
something wrong.

If I do:

articles = session.query(Article).join(Article.authors).filter(
    Author.name == 'author1').all()

this is done in a fraction of a second.

Now trying to find articles written by 'author1' and having tag 'tag1':

qArticleByTag = session.query(Article).join(Article.tags).filter(Tag.tag == 'tag1')
qArticleByAuthor = session.query(Article).join(Article.authors).filter(Author.name == 'author1')
articles = qArticleByTag.intersect(qArticleByAuthor).all()

This works fine, but somehow I have the feeling there are more
efficient ways, since I think the above query goes through all the
articles twice. How do I search for articles written by 'author1'
within only the articles I have already selected by 'tag1'? I tried
what's in the documentation, and that works fine for a small number of
articles but not when I have around 50k articles.

I haven't yet found a fast way to query for articles with tag 'tag1'
written by any author who has won award 'award1'.

If anyone can help, it would be very much appreciated.

Thanks.



Michael Bayer

Mar 21, 2011, 6:42:40 PM
to sqlal...@googlegroups.com

On Mar 21, 2011, at 7:21 AM, Wouter Overmeire wrote:

> New to SQL and SQLAlchemy.
> Using an in-memory SQLite database.
>
> Trying to do some queries based on the examples in the SQLAlchemy
> documentation.
>
> I have classes: Tag, Article, Author, Award
> article <--> tag: many to many
> article <--> author: many to many (so one article can have multiple
> authors)
> award <--> author: many to many
>
> Find all articles written by 'author1':
>
> articles = session.query(Article).filter(
>     Article.authors.any(Author.name == 'author1')).all()
>
> This works fine until I start increasing the number of articles and
> authors. With 50 authors, each with 1000 articles, and 10 tags per
> article (out of 100 tags in total), the above query takes forever; I
> never let it complete and interrupted it after 15 minutes. The query is
> just as in the documentation
> (http://www.sqlalchemy.org/docs/orm/tutorial.html?highlight=blogpost#building-a-many-to-many-relationship),
> so is it supposed to work on larger tables? Very likely I'm doing
> something wrong.

Not really; the EXISTS format is known to be generally slower than a straight INNER JOIN. The operator is convenient, especially when doing a NOT EXISTS, but when looking for positive inclusion of sets, the more verbose join() approach is typically more efficient.
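
To see the difference, here are the two forms side by side (a rough sketch; the SQL in the comments is approximate, so print each query to see exactly what your mappings generate):

# any() renders as a correlated EXISTS subquery, roughly:
#   SELECT ... FROM "tArticles"
#   WHERE EXISTS (SELECT 1 FROM "tAuthorArticles", "tAuthors"
#                 WHERE "tArticles".id = "tAuthorArticles"."articleId"
#                 AND "tAuthors".id = "tAuthorArticles"."authorId"
#                 AND "tAuthors".name = ?)
exists_style = session.query(Article).filter(
    Article.authors.any(Author.name == 'author1'))

# join() renders as plain INNER JOINs, roughly:
#   SELECT ... FROM "tArticles"
#   JOIN "tAuthorArticles" ON "tArticles".id = "tAuthorArticles"."articleId"
#   JOIN "tAuthors" ON "tAuthors".id = "tAuthorArticles"."authorId"
#   WHERE "tAuthors".name = ?
join_style = session.query(Article).join(Article.authors).filter(
    Author.name == 'author1')

print(exists_style)   # inspect the generated SQL before running either one
print(join_style)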

>
> If I do:
>
> articles = session.query(Article).join(Article.authors).filter(
>     Author.name == 'author1').all()
>
> this is done in a fraction of a second.
>
> Now trying to find articles written by 'author1' and having tag 'tag1':
>
> qArticleByTag = session.query(Article).join(Article.tags).filter(Tag.tag == 'tag1')
> qArticleByAuthor = session.query(Article).join(Article.authors).filter(Author.name == 'author1')
> articles = qArticleByTag.intersect(qArticleByAuthor).all()
>
> This works fine, but somehow I have the feeling there are more
> efficient ways.

The SQL set operators like UNION, INTERSECT, etc. are not the most efficient and are cumbersome to use. Using two join() calls at once would be more straightforward:

session.query(Article).join(Article.tags).filter(Tag.tag == 'tag1').\
    join(Article.authors).filter(Author.name == 'author1')

Three sets of rows are intersected above, returning only those rows which are correlated and meet all criteria.

> since I think the above query goes through all the
> articles twice.

With INTERSECT, probably, though it's possible some query optimizers could figure out a more efficient pattern here (I'm not a deep enough expert in query optimizers to know for sure). Sticking with chains of joins is the way to go in any case.
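
The same chaining also covers the tag + award question from your first message, roughly like this (a sketch, assuming Author.awards is mapped as another many-to-many and Award has a name column):

articles = (session.query(Article)
            .join(Article.tags).filter(Tag.tag == 'tag1')
            .join(Article.authors)
            .join(Author.awards).filter(Award.name == 'award1')
            .distinct()   # an article with several matching authors would otherwise repeat
            .all())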


Wouter Overmeire

Mar 22, 2011, 3:35:22 AM
to sqlalchemy
Before using the intersect I also tried the query you suggest, but it
takes very long to complete compared to the intersect:

articles = session.query(Article).join(Article.tags).filter(Tag.tag == 'tag1').\
    join(Article.authors).filter(Author.name == 'author1').all()

takes 75 seconds.

qArticleByTag = session.query(Article).join(Article.tags).filter(Tag.tag == 'tag1')
qArticleByAuthor = session.query(Article).join(Article.authors).filter(Author.name == 'author1')
articles = qArticleByTag.intersect(qArticleByAuthor).all()

takes 0.8 seconds.

That's almost two orders of magnitude of difference in speed for the
same result! This I don't understand; I would expect the chain of
joins to be faster, since the join with Article.authors is on a
smaller table compared to qArticleByAuthor, and no intersect is
needed. Could it be that the chain of joins first builds a monster
table by joining all tags and then all authors, and only afterwards
selects the rows with tag='tag1' and author='author1', instead of
doing the Article.authors join on the articles already filtered by
tag?
>>> print session.query(Article).join(Article.tags).filter(Tag.tag=='tag1').join(Article.authors).filter(Author.name=='author1')
SELECT "tArticles".id AS "tArticles_id", "tArticles".txt AS
"tArticles_txt"
FROM "tArticles" JOIN "tArticleTags" AS "tArticleTags_1" ON
"tArticles".id = "tArticleTags_1"."articleId" JOIN "tTags" ON
"tTags".id = "tArticleTags_1"."tagId" JOIN "tAuthorArticles" AS
"tAuthorArticles_1" ON "tArticles".id =
"tAuthorArticles_1"."articleId" JOIN "tAuthors" ON "tAuthors".id =
"tAuthorArticles_1"."authorId"
WHERE "tTags".tag = ? AND "tAuthors".name = ?





Michael Bayer

Mar 23, 2011, 10:31:17 AM
to sqlal...@googlegroups.com
On Mar 22, 2011, at 3:35 AM, Wouter Overmeire wrote:


> Before using the intersect I also tried the query you suggest, but it
> takes very long to complete compared to the intersect:
>
> articles = session.query(Article).join(Article.tags).filter(Tag.tag == 'tag1').\
>     join(Article.authors).filter(Author.name == 'author1').all()
>
> takes 75 seconds.
>
> qArticleByTag = session.query(Article).join(Article.tags).filter(Tag.tag == 'tag1')
> qArticleByAuthor = session.query(Article).join(Article.authors).filter(Author.name == 'author1')
> articles = qArticleByTag.intersect(qArticleByAuthor).all()
>
> takes 0.8 seconds.
>
> That's almost two orders of magnitude of difference in speed for the
> same result! This I don't understand; I would expect the chain of
> joins to be faster, since the join with Article.authors is on a
> smaller table compared to qArticleByAuthor, and no intersect is
> needed. Could it be that the chain of joins first builds a monster
> table by joining all tags and then all authors, and only afterwards
> selects the rows with tag='tag1' and author='author1', instead of
> doing the Article.authors join on the articles already filtered by
> tag?

Well, the best way to figure that kind of thing out is to take a look at the plan your database is coming up with. On PostgreSQL this would be via EXPLAIN; for Oracle, MySQL, etc. the commands are different, so take a look at the docs for the DB you're using. It will show how indexes and such are being used (or not). It may mean that you just need to adjust your indexes for this particular query.
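
On SQLite, which you're using, the equivalent is EXPLAIN QUERY PLAN. A rough sketch of checking the plan (the statement just mirrors the generated SQL you posted, with the two filter values inlined by hand for inspection):

from sqlalchemy import text

plan = session.execute(text("""
    EXPLAIN QUERY PLAN
    SELECT "tArticles".id
    FROM "tArticles"
    JOIN "tArticleTags" ON "tArticles".id = "tArticleTags"."articleId"
    JOIN "tTags" ON "tTags".id = "tArticleTags"."tagId"
    JOIN "tAuthorArticles" ON "tArticles".id = "tAuthorArticles"."articleId"
    JOIN "tAuthors" ON "tAuthors".id = "tAuthorArticles"."authorId"
    WHERE "tTags".tag = 'tag1' AND "tAuthors".name = 'author1'
"""))
for row in plan:
    print(row)

# If the plan shows full scans of the association tables, indexes on the
# foreign-key columns used in the joins (e.g. "tArticleTags"."tagId" and
# "tAuthorArticles"."authorId") are the usual fix; plain secondary tables
# get no indexes beyond whatever primary key or unique constraints were
# declared on them.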



Wouter Overmeire

Mar 23, 2011, 10:43:53 AM
to sqlalchemy


Anyway, this turned out to be SQL related, not SA. I switched from
SQLite to MySQL and rebuilt the database. The query that took 75
seconds using SQLite takes only 0.04 (and I do mean 0.04) seconds
using MySQL, on the same machine. (Thanks to SQLAlchemy I could reuse
my code :-).) So bye bye SQLite and welcome MySQL.
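
For reference, the only code change the switch needed was the engine URL, roughly like this (the MySQL credentials and database name here are placeholders):

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Before: in-memory SQLite, as in the original post.
# engine = create_engine('sqlite://')

# After: MySQL; user, password and database name are placeholders.
engine = create_engine('mysql://user:password@localhost/articles')
Session = sessionmaker(bind=engine)

# The models and queries stay exactly the same.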
