Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

[OT] large db question about no joins

0 views
Skip to first unread message

Daniel Fetchinson

unread,
Apr 16, 2009, 2:45:49 PM4/16/09
to Python
[off but interesting topic]

Hi folks, I've come across many times the claim that 'joins are bad'
for large databases because they don't scale. Okay, makes sense, we
agree. But what I don't get, although I watched a couple of online
videos on this topic (one by the creator of flickr who gave a talk at
djangoconf and another one by a dev guy from facebook, I think the
links were posted here too) is how do people solve most basic problems
without joins? I guess the whole database layout has to be designed
with this in mind but suppose I have the following layout, how would I
redesign it so that I can have the same functionalities?

Let's say I have a table called zoo, another one called cage, another
one called animal and another one called leg. Each animal has some
number of legs, each cage has animals in it and each zoo has a number
of cages in it. Obviously these are represented by foreign keys. If I
need the total number of animals in a zoo, or the total number of legs
in the zoo I would use a query with join(s).

What would be the corresponding database layout that would scale and I
could get the total number of legs in the zoo or total number of
animals in the zoo without join(s)?

Cheers,
Daniel

[/off but interesting topic]

--
Psss, psss, put it down! - http://www.cafepress.com/putitdown

Aaron Brady

unread,
Apr 16, 2009, 3:20:11 PM4/16/09
to
On Apr 16, 1:45 pm, Daniel Fetchinson <fetchin...@googlemail.com>
wrote:

for cage in zoo:
for animal in cage:
for leg in animal:
count+= 1

</unhelpful>

I think that relational algebra wasn't designed to be able to
sacrifice joins.

Wait a second., how many legs in the zoo??

Aahz

unread,
Apr 16, 2009, 4:10:29 PM4/16/09
to
In article <ebaf2a84-e1b9-4a15...@o27g2000vbd.googlegroups.com>,

Aaron Brady <casti...@gmail.com> wrote:
>
>Wait a second., how many legs in the zoo??

That's so you can find out how many to pull.
--
Aahz (aa...@pythoncraft.com) <*> http://www.pythoncraft.com/

"If you think it's expensive to hire a professional to do the job, wait
until you hire an amateur." --Red Adair

John Fabiani

unread,
Apr 16, 2009, 5:11:50 PM4/16/09
to
Daniel Fetchinson wrote:

> Hi folks, I've come across many times the claim that 'joins are bad'
> for large databases because they don't scale

IMO that's bull...

Peter Otten

unread,
Apr 16, 2009, 5:16:01 PM4/16/09
to
John Fabiani wrote:

> IMO that's bull...

>> Okay, makes sense, we agree.

;)

Paul Rubin

unread,
Apr 16, 2009, 5:39:02 PM4/16/09
to
Daniel Fetchinson <fetch...@googlemail.com> writes:
> Hi folks, I've come across many times the claim that 'joins are bad'
> for large databases because they don't scale.

I think that means joins with very large result sets and lots of
different values being matched on between the two tables. The usual
use of a join in, say, web server programming, is to look up something
with just a few results and just one value being matched. That scales
pretty well.

Daniel Fetchinson

unread,
Apr 16, 2009, 7:04:17 PM4/16/09
to pytho...@python.org
>> Hi folks, I've come across many times the claim that 'joins are bad'
>> for large databases because they don't scale.
>
> I think that means joins with very large result sets and lots of
> different values being matched on between the two tables. The usual
> use of a join in, say, web server programming, is to look up something
> with just a few results and just one value being matched. That scales
> pretty well.

Well, on the Google App Engine there is no join either, if I remember
correctly, because it uses bigtable which doesn't support join (if I'm
wrong I'd be happy to be corrected).

So for example on GAE, how would I represent my data of zoos, cages,
animals, legs? Or any other data structure that in a traditional RDBMS
uses joins?

Cheers,
Daniel

Daniel Fetchinson

unread,
Apr 16, 2009, 7:05:22 PM4/16/09
to pytho...@python.org
>> Hi folks, I've come across many times the claim that 'joins are bad'
>> for large databases because they don't scale
>
> IMO that's bull...

If you think that's bull, why do you think the google app engine or
bigtable doesn't support joins?

Cheers,
Daniel

Martin P. Hellwig

unread,
Apr 16, 2009, 7:10:34 PM4/16/09
to
Daniel Fetchinson wrote:
> [off but interesting topic]
<cut>

> What would be the corresponding database layout that would scale and I
> could get the total number of legs in the zoo or total number of
> animals in the zoo without join(s)?
>
> Cheers,
> Daniel
>
> [/off but interesting topic]
>

That all comes down to the keywords, efficiency, robustness and
performance and you can only pick one of them. So which two can you
sacrifice? The good news is that it is only theoretical, if you have
your requirements (i.e. the query with the right results has to return
within an acceptable time for the user not to get frustrated) who cares
if it is not as fast as it theoretically could be? Especially if this
means you can sacrifice the theoretically performance for easier
deployment and maintenance.

To get back to the layout question, if I need to have a count of the
legs at any given time, I would like to see the business process that
requires this operation. Using the business process I could probably
narrow down the scope quite a bit like, it is not necessary to have a
precise count at time *now* but it is good enough to have an exact count
which is no more then 4 seconds old but the query does need to return
within 50 ms.

One solution could be to build a central data warehouse where all info
is stored in. Then have satellite db's that does ETL syncing with the
central one. The layout of the satellites contain an optimised table
version for the queries you want to throw at it.

If you need to scale it, that is a question of adding another satellite.
Efficiency is right out of the window because you store in essence just
multiple copies of the same data just in another order. However it is
robust (by having multiple copies) and has a predictable performance figure.

hth
--
mph

Robert Kern

unread,
Apr 16, 2009, 7:31:46 PM4/16/09
to pytho...@python.org
> If you think that's bull, why do you think the google app engine or
> bigtable doesn't support joins?

"Large database" is not synonymous with "distributed database".

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

norseman

unread,
Apr 16, 2009, 8:05:22 PM4/16/09
to Robert Kern, pytho...@python.org
Robert Kern wrote:
>
...(snip)

> "Large database" is not synonymous with "distributed database".
>

=======================
True!

And cross-code lookup tables can make otherwise very large 'bytes on
disk' rather small overall.

Z3 in common_names.dbf African Pygmy Zebra
Z3 in zoological.dbf ThatBigLongNameNobodyCanPronounce
^
|
Stored in MAIN dbf along with head count, bin number and
"Zoo" (location) code to use the Zoo thing.

Pgm knows where to find the dbfs and which fields to match.

But then everybody already knew that.


Steve

Paul Rubin

unread,
Apr 16, 2009, 8:28:13 PM4/16/09
to
Daniel Fetchinson <fetch...@googlemail.com> writes:
> If you think that's bull, why do you think the google app engine or
> bigtable doesn't support joins?

Join is a relational database concept. Bigtable is not a relational
database. Of course it does lookups, but they are not the same as
relational joins.

Daniel Fetchinson

unread,
Apr 16, 2009, 9:12:19 PM4/16/09
to Python
>>>> Hi folks, I've come across many times the claim that 'joins are bad'
>>>> for large databases because they don't scale
>>> IMO that's bull...
>>
>> If you think that's bull, why do you think the google app engine or
>> bigtable doesn't support joins?
>
> "Large database" is not synonymous with "distributed database".

Yes, sorry for using the wrong words, my question then is how do I
solve a problem similar to my zoo/cage/animal/leg problem with
distributed databases?

Cheers,
Daniel

Daniel Fetchinson

unread,
Apr 16, 2009, 9:22:42 PM4/16/09
to Python
>> If you think that's bull, why do you think the google app engine or
>> bigtable doesn't support joins?
>
> Join is a relational database concept. Bigtable is not a relational
> database. Of course it does lookups, but they are not the same as
> relational joins.

True! I gave a use case that can easily be represented in a relational
database, and what I'm wondering now is how I would represent it in a
bigtable (not relational database) in such a way that I can easily do
the lookups I did when I represented it in a relational database.

Daniel Fetchinson

unread,
Apr 16, 2009, 9:26:21 PM4/16/09
to Python
>> [off but interesting topic]
> <cut>
>
>> What would be the corresponding database layout that would scale and I
>> could get the total number of legs in the zoo or total number of
>> animals in the zoo without join(s)?
>>
>> Cheers,
>> Daniel
>>
>> [/off but interesting topic]
>>
>
> That all comes down to the keywords, efficiency, robustness and
> performance and you can only pick one of them. So which two can you
> sacrifice? The good news is that it is only theoretical, if you have
> your requirements (i.e. the query with the right results has to return
> within an acceptable time for the user not to get frustrated) who cares
> if it is not as fast as it theoretically could be? Especially if this
> means you can sacrifice the theoretically performance for easier
> deployment and maintenance.
>
> To get back to the layout question, if I need to have a count of the
> legs at any given time, I would like to see the business process that
> requires this operation.

Well, I gave the concrete example of zoo/cage/animal/leg because this
*is* the business logic. I need to know for example the total number
of animals, this is pretty understandable if you have a zoo. Or you
mean that I should give another example?

> Using the business process I could probably
> narrow down the scope quite a bit like, it is not necessary to have a
> precise count at time *now* but it is good enough to have an exact count
> which is no more then 4 seconds old but the query does need to return
> within 50 ms.

No, I'd like to have the exact number at any given time.

> One solution could be to build a central data warehouse where all info
> is stored in. Then have satellite db's that does ETL syncing with the
> central one. The layout of the satellites contain an optimised table
> version for the queries you want to throw at it.

Let's say I need to implement this on the google app engine, so I
can't make too many choices on my own. On the GAE, how would I do
this?

Cheers,
Daniel

Paul Rubin

unread,
Apr 16, 2009, 10:02:48 PM4/16/09
to
Daniel Fetchinson <fetch...@googlemail.com> writes:
> Yes, sorry for using the wrong words, my question then is how do I
> solve a problem similar to my zoo/cage/animal/leg problem with
> distributed databases?

In the case of BigTable, you could write a suitable MapReduce task.
You might look at the Wikipedia articles about MapReduce or Hadoop, or
the Hadoop docs, to see how this sort of thing is done.

Martin P. Hellwig

unread,
Apr 17, 2009, 5:42:25 AM4/17/09
to
Daniel Fetchinson wrote:
<cut>

> Well, I gave the concrete example of zoo/cage/animal/leg because this
> *is* the business logic. I need to know for example the total number
> of animals, this is pretty understandable if you have a zoo. Or you
> mean that I should give another example?

It might be the business logic but not the process.
http://en.wikipedia.org/wiki/Business_process

>> Using the business process I could probably
>> narrow down the scope quite a bit like,

<cut>


> No, I'd like to have the exact number at any given time.

Fair enough, but this is then the *exact* scope you like to answer me in.
>
<cut>


> Let's say I need to implement this on the google app engine, so I
> can't make too many choices on my own. On the GAE, how would I do
> this?

All other data modeling (e.g. EAV, relational, object orientated) will
result in a structure requiring extra calculation when doing the query
(and thus according your requirement isn't acceptable).

So I need to store the data in such a way that it is accessible before
the query with the minimal amount of processing.

Thus the only thing left is writing a subroutine in my code which
recalculates the amount of legs when there is a change in it and store
it just as a pair value:
amount_of_legs = Positive Integer

The thing with bogus requirements for an imaginary example is that the
solution is imaginary bogus too, leaving little room for applicability
in the real world.

hth
--
mph

andrew cooke

unread,
Apr 17, 2009, 8:46:16 AM4/17/09
to pytho...@python.org

on the more general point about exactly how to handle large data sets, i
found this article interesting -
http://highscalability.com/unorthodox-approach-database-design-coming-shard

andrew

Daniel Fetchinson

unread,
Apr 17, 2009, 11:12:11 AM4/17/09
to Python
>> Well, I gave the concrete example of zoo/cage/animal/leg because this
>> *is* the business logic. I need to know for example the total number
>> of animals, this is pretty understandable if you have a zoo. Or you
>> mean that I should give another example?
>
> It might be the business logic but not the process.
> http://en.wikipedia.org/wiki/Business_process
>
>>> Using the business process I could probably
>>> narrow down the scope quite a bit like,
> <cut>
>> No, I'd like to have the exact number at any given time.
> Fair enough, but this is then the *exact* scope you like to answer me in.
>>
> <cut>
>> Let's say I need to implement this on the google app engine, so I
>> can't make too many choices on my own. On the GAE, how would I do
>> this?
> All other data modeling (e.g. EAV, relational, object orientated) will
> result in a structure requiring extra calculation when doing the query
> (and thus according your requirement isn't acceptable).
>
> So I need to store the data in such a way that it is accessible before
> the query with the minimal amount of processing.
>
> Thus the only thing left is writing a subroutine in my code which
> recalculates the amount of legs when there is a change in it and store
> it just as a pair value:
> amount_of_legs = Positive Integer

Actually, this is a pretty good idea! In my case there are not so many
writes, but tons of reads. So doing this is actually absolutely
feasible.

> The thing with bogus requirements for an imaginary example is that the
> solution is imaginary bogus too, leaving little room for applicability
> in the real world.

It's not so bogus I think. But here is another example, even more
concrete, maybe that will show more clearly what I'm talking about,
sorry if I was not clear enough: let's say you have a music store and
you have cd's in the store.

In an relational database setting you would have a table for artists,
a table for cd's and a table for songs and a table for comments where
people can comment on songs. All of this with obvious foreign keys.
Now you want to display on your website the total number of cd's, the
total number of songs and the total number of comments because that
looks very cool on top of every page: 1242342 cd's and 134242342342
songs and 284284728347284728 comments!

Okay, so you need to issue a query with joins in your relational db.
Or in another query, you want to select all comments of an artist,
meaning all comments that were comments on a song belonging to cd's of
a given artist. This would also involve joins.

How would I do these queries if I can't have joins or in other words
how would I lay out my data storage?

Thanks by the way for your help and replies,

J. Cliff Dyer

unread,
Apr 17, 2009, 2:53:56 PM4/17/09
to jfab...@yolo.com, pytho...@python.org

OK. That makes four legs so far....

> --
> http://mail.python.org/mailman/listinfo/python-list
>

Martin P. Hellwig

unread,
Apr 17, 2009, 4:16:43 PM4/17/09
to
Daniel Fetchinson wrote:
<cut>

> In an relational database setting you would have a table for artists,
> a table for cd's and a table for songs and a table for comments where
> people can comment on songs. All of this with obvious foreign keys.
> Now you want to display on your website the total number of cd's, the
> total number of songs and the total number of comments because that
> looks very cool on top of every page: 1242342 cd's and 134242342342
> songs and 284284728347284728 comments!
>
> Okay, so you need to issue a query with joins in your relational db.
> Or in another query, you want to select all comments of an artist,
> meaning all comments that were comments on a song belonging to cd's of
> a given artist. This would also involve joins.
>
> How would I do these queries if I can't have joins or in other words
> how would I lay out my data storage?
>
> Thanks by the way for your help and replies,
> Daniel
>

What actually is happening here is that feature x which is usually in an
abstraction layer (like a RDBM) is not available to you. Which means you
have to write code which does the same but on your side of the code.

I played a bit around with the GAE some time ago and designed the
storage of all data in an sort of EAV model
(http://en.wikipedia.org/wiki/Entity-Attribute-Value_model)
The abstraction layer had all my meta code, counting stuff up was a
matter of creating a list of id's from one section and with that list
selecting the appropriate records on the other section, then count up
the number of results or use the result again, it is very likely you
have to do quite a lot of queries before you get the result.

So in other words, just lay out the data which makes the most sense to
you, the pain of recreating SQL like logic is there no matter what
layout you choose.

--
mph

Daniel Fetchinson

unread,
Apr 17, 2009, 8:36:44 PM4/17/09
to pytho...@python.org

Thanks, this wikipedia entry was actually very useful as well as your
other comments.

Thanks again,

Aaron Brady

unread,
Apr 17, 2009, 9:39:39 PM4/17/09
to
On Apr 17, 3:16 pm, "Martin P. Hellwig" <martin.hell...@dcuktec.org>
wrote:

Instead of creating a new table, just create an iterator that returns
successive entries in it. Just don't change the size of either table
during iteration.

Martin P. Hellwig

unread,
Apr 18, 2009, 12:04:22 PM4/18/09
to
Daniel Fetchinson wrote:
<cut>

> Thanks, this wikipedia entry was actually very useful as well as your
> other comments.
>
> Thanks again,
> Daniel
>

Your welcome, I usually take quite a lot of effort into designing before
I start coding. One tool I found very helpful was DIA, especially the
UML section. Have a look at it you might find it useful.
http://projects.gnome.org/dia/

--
MPH
http://blog.dcuktec.com

Tim Wintle

unread,
Apr 19, 2009, 1:04:46 AM4/19/09
to pytho...@python.org
On Fri, 2009-04-17 at 21:16 +0100, Martin P. Hellwig wrote:
> So in other words, just lay out the data which makes the most sense
> to
> you, the pain of recreating SQL like logic is there no matter what
> layout you choose.

I have to say that given the amount of pain most people seem to go
through when they first learn about relational databases, it seems
ironic how much people come to expect that that's the standard way to
query data!

At the end of the day it's all just one long sequential load of 0s and
1s on disk, and some code somewhere has to do all the logic.

(I'm a relational database user btw, but it seems to happen with object
database people too - we get so used to our own paradigm that we don't
think about other ways of doing the task)

Tim Wintle

0 new messages