Is one big table table quicker than 8 smaller tables?

lock

unread,

Dec 5, 2008, 12:54:54 AM12/5/08

to Google App Engine

I've been desperately trying to optimise my code to get rid of those
'High CPU' requests. Some changes worked others didn't, in the end
I've really only gained a marginal improvement. So I'm now
considering some significant structural changes and are wondering if
anyone has tried something similar and can share their experience.

The apps pretty simple it just geo-tags data points using the geoHash
algorithm, so basically each entry in the table is the geoHash of the
given lat/long with some associated meta data. Queries are then done
by a bounding box that is also geohashed and used as datastore query
filters. Due do some idiosyncrasies with using geoHash, any given
query may be split into up to 8 queries (by lat 90,0,-90 by long 180
90, 0, -90, 180), but generally the bounds fall into only one/two
division(s) and therefore only result in one datastore query.

All these queries are currently conducted on the one large datastore,
I'm wondering if it would be more efficient to break down this one
datastore into 8 separate tables (all containing the same type) and
query the table relevant to the current bounding box.

In summary I guess what I'm trying to ask is (sorry for the ramble),
does the query performance degrade significantly as the size of the
database increases?

Xavier Mathews

unread,

Dec 5, 2008, 10:32:18 AM12/5/08

to google-a...@googlegroups.com

Yes It Does Depending On The Code etc.

--
Xavier A. Mathews
Student/Browser Specialist/Developer/Web-Master
Google Group Client Based Tech Support Specialist
Hazel Crest Illinois
xavier...@gmail.com¥xavier...@hotmail.com¥trues...@yahoo.com
"Fear of a name, only increases fear of the thing itself."

ryan

unread,

Dec 5, 2008, 1:20:34 PM12/5/08

to Google App Engine

hi lachlan! i'll take these a little out of order.

On Dec 4, 9:54 pm, lock <lachlan.hu...@gmail.com> wrote:
> In summary I guess what I'm trying to ask is (sorry for the ramble),
> does the query performance degrade significantly as the size of the
> database increases?

in general, no. the high level rule of thumb is that query performance
is linear in the number of *results* you fetch, not the total
datastore size, number of property values in a list property, or
anything else. this is because queries are implemented as single dense
index scans...

...except for merge join, which is one caveat. first, the constant
factor for merge join queries, ie queries with multiple equals filters
and no sort orders, is higher than for queries that use a built-in or
composite index. also, in fully sparse cases, where there are many
entities that match each individual equals filter but only a few that
match all of them, merge join can sometime degrades to worse than
linear in the number of results. when that happens, we detect it and
return a NeedIndexError with the composite index that the developer
should add. the query will then use the composite index instead of
merge joining, and its performance will be deterministic.

the IN operator is another caveat, since it fans out to multiple
queries under the hood. there are a few other minor implementation
details too, e.g. de-duping. still, queries generally perform
deterministically. you will occasionally see timeouts on datastore
queries, but those are due to other factors that we're constantly
working on improving. they're not inherent in the datastore.

my google i/o talk has lots of background:
http://sites.google.com/site/io/under-the-covers-of-the-google-app-engine-datastore

> All these queries are currently conducted on the one large datastore,
> I'm wondering if it would be more efficient to break down this one
> datastore into 8 separate tables (all containing the same type) and
> query the table relevant to the current bounding box.

hmm. i'm not entirely sure. the fact that you fan out to multiple
datastore queries definitely jumped out at me. if this lets you reduce
that fan-out, then it might be worth it. otherwise, i wouldn't expect
to see a difference.

yejun

unread,

Dec 5, 2008, 4:10:12 PM12/5/08

to Google App Engine

Query speed doesn't not depends on your data size, since entire GAE
cluster is in one single table.

lock

unread,

Dec 5, 2008, 8:47:41 PM12/5/08

to Google App Engine

Thanks guys, that's what I was hoping to hear, you saved me a couple
hours trying to prove it for myself (not to mention the frustration).
After I went away and thought about it some more I figured there must
be some 'smarts' in the database to prevent the query time from
increasing. Otherwise how could any database scale well...

No merge joins or IN operators in my code, so nothing to worry about
there.

After a _lot_ more testing I'm finding that query time does scale with
the number of fetched _results_, not the DB size. During early
testing I convinced myself that increasing the DB size was slowing my
query down, when really the number of results were increasing as I
added more data, doh (it was getting late ;-) ).

The overall solution that seems to be working well for me at the
moment is to have different tables for different resolutions. As the
size of the geometric bounds increases I switch between a few tables,
each one with a lower fidelity therefore reducing the number of
results that can be returned. Visually it works similar to Level Of
Detail techniques you see in some 3D modeling packages.

yejun

unread,

Dec 5, 2008, 9:31:39 PM12/5/08

to Google App Engine

I think the max number of queries for one location should be 4 search
boxes. I could be wrong.

Nick Johnson

unread,

Dec 6, 2008, 5:46:39 AM12/6/08

to Google App Engine

On Dec 6, 1:47 am, lock <lachlan.hu...@gmail.com> wrote:
> Thanks guys, that's what I was hoping to hear, you saved me a couple
> hours trying to prove it for myself (not to mention the frustration).
> After I went away and thought about it some more I figured there must
> be some 'smarts' in the database to prevent the query time from
> increasing. Otherwise how could any database scale well...
>
> No merge joins or IN operators in my code, so nothing to worry about
> there.
>
> After a _lot_ more testing I'm finding that query time does scale with
> the number of fetched _results_, not the DB size. During early
> testing I convinced myself that increasing the DB size was slowing my
> query down, when really the number of results were increasing as I
> added more data, doh (it was getting late ;-) ).

One thing to bear in mind is that the dev_appserver performance is
_not_ representative of the production performance. The dev_appserver
holds the entire dataset in memory and does linear scans over an
entity type for queries, so performance there _will_ degrade with
respect to the size of an entity type.

>
> The overall solution that seems to be working well for me at the
> moment is to have different tables for different resolutions. As the
> size of the geometric bounds increases I switch between a few tables,
> each one with a lower fidelity therefore reducing the number of
> results that can be returned. Visually it works similar to Level Of
> Detail techniques you see in some 3D modeling packages.

I'm curious how you're doing this with only a limited number of
queries. Geohashing isn't ideally suited to satisfying bounding box
queries (though it's certainly better than storing plain lat/longs).

lock

unread,

Dec 6, 2008, 9:40:15 AM12/6/08

to Google App Engine

On Dec 6, 9:46 pm, Nick Johnson <arach...@notdot.net> wrote:
> On Dec 6, 1:47 am, lock <lachlan.hu...@gmail.com> wrote:
>
> > Thanks guys, that's what I was hoping to hear, you saved me a couple
> > hours trying to prove it for myself (not to mention the frustration).
> > After I went away and thought about it some more I figured there must
> > be some 'smarts' in the database to prevent the query time from
> > increasing. Otherwise how could any database scale well...
>
> > No merge joins or IN operators in my code, so nothing to worry about
> > there.
>
> > After a _lot_ more testing I'm finding that query time does scale with
> > the number of fetched _results_, not the DB size. During early
> > testing I convinced myself that increasing the DB size was slowing my
> > query down, when really the number of results were increasing as I
> > added more data, doh (it was getting late ;-) ).
>
> One thing to bear in mind is that the dev_appserver performance is
> _not_ representative of the production performance. The dev_appserver
> holds the entire dataset in memory and does linear scans over an
> entity type for queries, so performance there _will_ degrade with
> respect to the size of an entity type.

Oh really! That may have also contributed to my initial theory about
DB
performance being adversely affected by its size. Thanks for the tip,
definitely
something to keep in mind.
Hopefully in future versions of the SDK, the dev server will start to
better mimic
the behavior of the actual app engine framework. It would be really
great if
for example it gave similar CPU usage warnings.

>
>
> > The overall solution that seems to be working well for me at the
> > moment is to have different tables for different resolutions. As the
> > size of the geometric bounds increases I switch between a few tables,
> > each one with a lower fidelity therefore reducing the number of
> > results that can be returned. Visually it works similar to Level Of
> > Detail techniques you see in some 3D modeling packages.
>
> I'm curious how you're doing this with only a limited number of
> queries. Geohashing isn't ideally suited to satisfying bounding box
> queries (though it's certainly better than storing plain lat/longs).

Please tell me if I'm wrong but isn't geohashing the only way you can
do a
bounding box type query with a datastore query? I must admit during
early
development I just assumed I was going to be able to do a query
something
like:
'SELECT * WHERE lat < top AND lat > bottom AND long > right AND long <
left ...'
Got a bit of a shock when I found I could only query based on one
field.

The only other way I though of doing it was to query based on
longitude,
then just filter the results by lat in a loop after. Knowing what I
do now
(queries that return a lot of results chew up CPU cycles), I'd say
this would
be the wrong approach.

As for the level of detail stuff, it's nothing too sophisticated, I'll
try to
elaborate. It's unrelated to geohash.

My app has 4 tables, 1 contains all data points, the other 3 are of
varying
resolutions (LOD tables).

When adding a point (lat/long) it gets put into the table containing
all data
points. Next I start adding this same point to the appropriate LOD
tables, for
the 'high res' one I round the lat/long to 2 decimal places and
compute the
geohash. If the geohash is present in the table then this point has
been fully
added, otherwise it is added to the 'high res' table and we continue.
The same
lat/long is then rounded to 1 decimal place, the geohash of this is
then calculated
and checked if it is in the 'medium res' LOD table, if present then
just return.
If not then do something similar again for the 'High res' LOD table.

Points are obtained from the app by a bounding box, the lat/long of
the NE and SW
corners. From this we can calculate a rough size unit for the
bounding box. At
the moment I'm using the diagonal length in degrees squared. From
this number we
determine which table to query. For large bounding boxes the 'low
res' LOD table
is used, for small boxes the 'high res' LOD table is used. For even
smaller
bounding boxes I just get the results out of the table containing all
data points.

Hope that made sense.

Anyway, if you want to see it in action checkout
'bikebingle.appspot.com'. Please
enter as much random data as you want, all the stuff in there at the
moment is just
test data and will be removed soonish. If you find any bugs while
your there, I'd
love to know about them :-), hopefully BikeBingle will be 'going live'
in the next
couple days. BTW, I wouldn't click the 'Make random' button (for
debugging purposes),
it fires off a 100 post requests.... Of course by saying that I know
someones
going to click it ;-)

Reply all

Reply to author

Forward