Page Iteration

Emanuele Ziglioli

Apr 19, 2012, 10:07:21 PM

to Siena

Hi everyone,

I'm constantly running out of memory when fetching more than 20,000
rows.
I haven't been able to find a memory profiler, so I'm working by
trial and error with a unit test and an increasing number of rows.

One thing I'm trying is queries with pagination.
I think there's a mistake in the example here:
https://github.com/mandubian/siena/blob/master/source/documentation/manuals/first_model.textile

Iterate<Person> iter = query.iterPerPage(10);
while(iter.iterator().hasNext()){
System.out.println(iter.next());
}

If I call iterator() on every pass, a new iterator is created each time,
which is very slow. Doing this instead is very fast:

Iterator<Person> iter = q.iterPerPage(10).iterator();
while(iter.hasNext()){
System.out.println(iter.next());
}

I hope that would also reduce my memory requirements (by not having to
map all the entities at once). At least my test is no longer running
out of memory.
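
A minimal, self-contained sketch of the point above, using plain java.util types rather than Siena. The names IteratorReuseDemo, CountingIterable and visitAll are hypothetical and exist only for this illustration. An Iterable hands out a fresh iterator on every iterator() call, so the loop should obtain the iterator once and reuse it:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class IteratorReuseDemo {

    /** Wraps a list and counts how often iterator() is called on it. */
    static final class CountingIterable<T> implements Iterable<T> {
        private final List<T> items;
        private int iteratorCalls = 0;

        CountingIterable(List<T> items) {
            this.items = items;
        }

        @Override
        public Iterator<T> iterator() {
            iteratorCalls++;          // every call builds a fresh iterator
            return items.iterator();
        }

        int iteratorCalls() {
            return iteratorCalls;
        }
    }

    /** Obtains the iterator once, reuses it, and returns the number of elements visited. */
    static <T> int visitAll(Iterable<T> source) {
        Iterator<T> it = source.iterator();
        int visited = 0;
        while (it.hasNext()) {
            it.next();
            visited++;
        }
        return visited;
    }

    public static void main(String[] args) {
        CountingIterable<String> people =
                new CountingIterable<>(Arrays.asList("Ann", "Bob", "Cid"));
        int visited = visitAll(people);
        System.out.println(visited + " elements, "
                + people.iteratorCalls() + " iterator() call(s)");
    }
}
```

How expensive each extra iterator() call is depends on the Iterable; for a query-backed one it may well mean rebuilding query state, which would match the slowdown described above.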

E.

PS I've noticed this warning about chunk sizes (I paginate 1000
entities). Does anyone know anything about it?

WARNING: This query does not have a chunk size set in FetchOptions and
has returned over 1000 results. If result sets of this size are
common for this query, consider setting a chunk size to improve
performance.
To disable this warning set the following system property in
appengine-web.xml (the value of the property doesn't matter):
'appengine.datastore.disableChunkSizeWarning'

Pascal Voitot Dev

Apr 20, 2012, 3:04:34 AM

to siena-...@googlegroups.com

Let's try a few things with features I almost forgot :);)

On Fri, Apr 20, 2012 at 4:07 AM, Emanuele Ziglioli <the...@emanueleziglioli.it> wrote:

> Hi everyone,
>
> I'm constantly running out of memory when fetching more than 20,000
> rows.
> I haven't been able to find a memory profiler so I'm just trying with
> trial and error with a unit test and an increasing number of rows.
>
> One thing I'm trying is queries with pagination.
> I think there's a mistake in the example here:
> https://github.com/mandubian/siena/blob/master/source/documentation/manuals/first_model.textile
>
> Iterate<Person> iter = query.iterPerPage(10);
> while(iter.iterator().hasNext()){
> System.out.println(iter.next());
> }
>
> If I call iterator() every time, a new iterator is created which is
> very slow, instead doing this is very fast:
>
> Iterator<Person> iter = q.iterPerPage(10).iterator();
> while(iter.hasNext()){
> System.out.println(iter.next());
> }
Yes, with that code the iterator gets rebuilt at each turn!!! It creates a structure each time...
The error is certainly in the example!

> I hope that would also reduce my memory requirements (by not having to
> map all the entities at once). At least my test is no longer running
> out of memory.

I hope there is no leak...
Anyway, I can suggest you use the famous STATEFUL feature of Siena (I'm certainly the only guy in the world who knows it and has used it once :) )

Stateful mode in Siena is not a good fit if you do web development in stateless mode, as I always do (with Play), because it keeps state, and state is evil with respect to HTTP, as everyone knows ;)

Anyway, in your case you are going through the whole table in the same scope, so stateful mode is great for you.
What will stateful bring you? It uses a GAE CURSOR when iterating over the table, which is apparently far better in terms of performance and efficiency.

How to use it?

Iterator<Person> iter = q.stateful().iterPerPage(10).iterator();
while(iter.hasNext()){
System.out.println(iter.next());
}

I think you can increase iterPerPage(10) to something bigger. 10 is the number of entities you get for each page before the cursor moves. iterPerPage is a special iterable, as it is an iterable of iterables: it hides the fact that it fetches a page of 10, iterates over that page and, when it reaches the end, fetches another page of 10 and iterates over it, and so on.
The 1000-entity limit is a historical limitation in GAE afaik:
at the beginning it couldn't return more than 1000 entities.
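
The "iterable of iterables" idea can be sketched with plain Java, independent of Siena. This is only an illustration under assumptions: PagedIterator and its fetchPage callback (offset, limit) are hypothetical names, and the real iterPerPage may be implemented quite differently. The iterator fetches one page, hands out its elements, and fetches the next page only when the current one is used up:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.BiFunction;

final class PagedIterator<T> implements Iterator<T> {
    // Hypothetical page source: (offset, limit) -> at most `limit` rows.
    private final BiFunction<Integer, Integer, List<T>> fetchPage;
    private final int pageSize;
    private Iterator<T> current = Collections.emptyIterator();
    private int offset = 0;
    private boolean exhausted = false;
    private int pagesFetched = 0;

    PagedIterator(BiFunction<Integer, Integer, List<T>> fetchPage, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        this.fetchPage = fetchPage;
        this.pageSize = pageSize;
    }

    @Override
    public boolean hasNext() {
        // Load the next page only once the current one is used up.
        while (!current.hasNext() && !exhausted) {
            List<T> page = fetchPage.apply(offset, pageSize);
            pagesFetched++;
            offset += page.size();
            exhausted = page.size() < pageSize;   // a short page is the last one
            current = page.iterator();
        }
        return current.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }

    int pagesFetched() {
        return pagesFetched;
    }

    public static void main(String[] args) {
        List<Integer> table = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            table.add(i);
        }
        PagedIterator<Integer> it = new PagedIterator<>(
                (off, lim) -> table.subList(Math.min(off, table.size()),
                                            Math.min(off + lim, table.size())),
                10);
        int count = 0;
        while (it.hasNext()) {
            it.next();
            count++;
        }
        System.out.println(count + " rows in " + it.pagesFetched() + " pages");
    }
}
```

Only one page is held at a time, which is why a paged iteration can need less memory than mapping the whole result set at once.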

Do you get this warning with iterPerPage?
Please try stateful mode first and tell me if it's better.
As for the warning, we'll see after that ;)

regards
Pascal


Emanuele Ziglioli

Apr 20, 2012, 7:31:23 AM

to Siena

Well, I thought iterPerPage() already used cursors and did what you just
described for stateful():

Iterator<Person> iter = q.iterPerPage(1000).iterator();
while(iter.hasNext()){
System.out.println(iter.next());
}

I iterate with pages of 1000 entities and it seems to be working! I
haven't seen any errors since I've implemented it that way this
afternoon.

I don't really understand what stateful() would do differently, I
think your paged queries implementation is very effective!
And I'm happy to be one of the first ones in the world to use it :-)

Pascal Voitot Dev

Apr 20, 2012, 8:05:52 AM

to siena-...@googlegroups.com

Stateful mode allows forward pagination with cursors, which is the only direction GAE supports, but it also allows backward pagination because it stores the cursors! This is something Siena provides but GAE doesn't :)
Stateless pagination doesn't use cursors but offsets, which are less efficient than cursors afaik!
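
To see why offsets can cost more than cursors, here is a toy cost model in plain Java. It is a sketch under one stated assumption, not a measurement: a store that locates a page by offset still has to walk over the skipped rows, while a cursor lets it resume where the previous page ended. OffsetVsCursor and its method names are hypothetical.

```java
final class OffsetVsCursor {

    /** Rows walked when every page is located by offset (skipped rows are assumed to be scanned). */
    static long rowsScannedWithOffsets(int totalRows, int pageSize) {
        requirePositive(pageSize);
        long scanned = 0;
        for (int offset = 0; offset < totalRows; offset += pageSize) {
            int returned = Math.min(pageSize, totalRows - offset);
            scanned += offset + returned;   // skip `offset` rows, then read the page
        }
        return scanned;
    }

    /** Rows walked when every page resumes from the cursor left by the previous page. */
    static long rowsScannedWithCursor(int totalRows, int pageSize) {
        requirePositive(pageSize);
        long scanned = 0;
        for (int position = 0; position < totalRows; position += pageSize) {
            scanned += Math.min(pageSize, totalRows - position);   // read the page only
        }
        return scanned;
    }

    private static void requirePositive(int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
    }

    public static void main(String[] args) {
        System.out.println("offsets: " + rowsScannedWithOffsets(9200, 1000) + " rows walked");
        System.out.println("cursor:  " + rowsScannedWithCursor(9200, 1000) + " rows walked");
    }
}
```

In this model, 9200 rows read in pages of 1000 walk 54,200 rows with offsets against 9,200 with a cursor. The model says nothing about actual RPC counts or billing; it only shows the shape of the difference.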

Pascal

Emanuele Ziglioli

Apr 20, 2012, 4:23:01 PM

to Siena

> Stateless pagination doesn't use cursors but offsets which is less
> efficient than cursors afaik!

Ok, I'll see if I can profile stateful vs stateless!

Emanuele Ziglioli

Apr 23, 2012, 5:29:18 AM

to Siena

You were right!

I'm profiling with AppStats a paged iteration over 9200 entities with
a page size of 5000. Here are the results:

stateful: (116 RPCs)
service.call           #RPCs  real time  api time
memcache.Get              33       27ms       0ms
memcache.Set              33       24ms       0ms
datastore_v3.Next         29      119ms       0ms
datastore_v3.Get          16      671ms       0ms
datastore_v3.RunQuery      4     3046ms       0ms
datastore_v3.Put           1        2ms       0ms

stateless: (577 RPCs)
service.call           #RPCs  real time  api time
datastore_v3.Next        490     4669ms       0ms
memcache.Get              33       25ms       0ms
memcache.Set              33       34ms       0ms
datastore_v3.Get          16       13ms       0ms
datastore_v3.RunQuery      4     2703ms       0ms
datastore_v3.Put           1        3ms       0ms

I can now play with different page sizes and see if it makes any
difference.
Ultimately I'd like to do a keys-only query followed by batch gets,
in order to use memcache.

Thanks,
Emanuele


Pascal Voitot Dev

Apr 23, 2012, 5:52:26 AM

to siena-...@googlegroups.com

Quite an impressive gain :D... You're the first to test it in a real environment, congrats ;)

For keys only, you can't use iter or iterPerPage, but you can use fetchKeys in stateful mode too and activate pagination mode...

Query<YourClass> q = YourClass.all().paginate(500).stateful();

q.fetchKeys();
q.nextPage().fetchKeys();
q.nextPage().fetchKeys();
q.nextPage().fetchKeys();
q.nextPage().fetchKeys();
Going back to the previous page also works, and this is specific to Siena :)
q.previousPage().fetchKeys();

when there is nothing more, it returns an empty List...
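
The "keep asking for the next page until the list is empty" pattern can be written as a loop. The sketch below does not use Siena: KeyPager is a hypothetical stand-in that only mirrors the shape of the fetchKeys() and nextPage() calls shown above, and ListPager is an in-memory fake used to exercise the loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class KeyPagingSketch {

    /** Hypothetical stand-in for a paginated, stateful query; not Siena's real interface. */
    interface KeyPager<K> {
        List<K> fetchKeys();      // keys of the current page, empty once past the end
        KeyPager<K> nextPage();   // moves to the next page and returns this pager
    }

    /** In-memory pager, used only to exercise the loop below. */
    static final class ListPager<K> implements KeyPager<K> {
        private final List<K> all;
        private final int pageSize;
        private int start = 0;

        ListPager(List<K> all, int pageSize) {
            this.all = all;
            this.pageSize = pageSize;
        }

        @Override
        public List<K> fetchKeys() {
            int from = Math.min(start, all.size());
            int to = Math.min(from + pageSize, all.size());
            return new ArrayList<>(all.subList(from, to));
        }

        @Override
        public KeyPager<K> nextPage() {
            start += pageSize;
            return this;
        }
    }

    /** Drains the pager page by page, stopping at the first empty page. */
    static <K> List<K> collectAllKeys(KeyPager<K> pager) {
        List<K> keys = new ArrayList<>();
        List<K> page = pager.fetchKeys();
        while (!page.isEmpty()) {
            keys.addAll(page);
            page = pager.nextPage().fetchKeys();
        }
        return keys;
    }

    public static void main(String[] args) {
        List<Long> ids = Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L, 7L);
        System.out.println(collectAllKeys(new ListPager<>(ids, 3)));
    }
}
```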

regards
Pascal

Emanuele Ziglioli

Apr 23, 2012, 4:41:20 PM

to Siena

On Apr 23, 9:52 pm, Pascal Voitot Dev <pascal.voitot....@gmail.com>
wrote:

> quite impressive gain :D... you're the first to test it in real env,
> congrats ;)

even better when you cycle over more than 30,000 entries. So
pagination has reduced memory consumption, while stateful/cursors have
reduced costs by 80%!
Not bad... when can I buy you a beer :-)

>
> for keys only, you can't use iter or iterPerPage but you can use fetchKeys
> in stateful mode also and activate pagination mode...
>
> Query<YourClass> q = YourClass.all().paginate(500).stateful();
>
> q.fetchKeys();
> q.nextPage().fetchKeys();
> q.nextPage().fetchKeys();
> q.nextPage().fetchKeys();
> q.nextPage().fetchKeys();
> previous works also and this is special to Siena :)
> q.previousPage().fetchKeys();
>
> when there is nothing more, it returns an empty List...

I'll try. My previous attempts to fetch keys only and then fetch by
keys actually made things worse.
If there's an effective improvement, I'll see if Siena can do that
behind the scenes, as an option.

Pascal Voitot Dev

Apr 23, 2012, 4:50:12 PM

to siena-...@googlegroups.com

This is so great :D
80%!!!! huge!
If I come to NZ one day, you'll buy me a beer ;););)

Emanuele Ziglioli

Apr 23, 2012, 7:31:32 PM

to Siena

On Apr 24, 8:50 am, Pascal Voitot Dev <pascal.voitot....@gmail.com>
wrote:

> This is so great :D
> 80%!!!! huge!
> If I come to NZ on day, you'll pay me a beer ;););)

Pas de problème (no problem)! When we were in Wellington we used to go to a Belgian
pub, I love their Loeve. Plus you could eat 2kg of mussels for the
price of 1 :-)

More results, iterating over 9200 entities

stateful, page size = 5000: 115 RPCs
stateful, page size = 9200: 115 RPCs
stateful, page size = 1000: 121 RPCs
stateful, page size = 500: 121 RPCs

stateless, page size = 1000: 833 RPCs
stateless, page size = 9200: 560 RPCs

I wish appstats could track the memory usage, that's important in
order to be able to use cheaper instances.
When I load too many entities at once I run out of memory, the
instance gets killed and the task restarts (forever! until I clear the
task queue).

I've noticed with stateful() the memory usage has increased but I
can't see heap errors in the logs. With stateless the memory usage was
constant so I don't think there's a leak, at least for that case.

Pascal Voitot Dev

Apr 24, 2012, 2:42:16 AM

to siena-...@googlegroups.com

On Tue, Apr 24, 2012 at 1:31 AM, Emanuele Ziglioli <the...@emanueleziglioli.it> wrote:

> On Apr 24, 8:50 am, Pascal Voitot Dev <pascal.voitot....@gmail.com>
> wrote:
>
> > This is so great :D
> > 80%!!!! huge!
> > If I come to NZ one day, you'll pay me a beer ;););)
>
> Pas de problem! when we were in Wellington we used to go to a Belgian
> pub, I love their Loeve. Plus you could eat 2kg of mussels for the
> price of 1 :-)

I love many Belgian beers ;)
Mussels with French fries? This is the way you must eat them in the North of France or Belgium ;)

> More results, iterating over 9200 entities
>
> stateful, page size = 5000: 115 RPCs
> stateful, page size = 9200: 115 RPCs
> stateful, page size = 1000: 121 RPCs
> stateful, page size = 500: 121 RPCs
>
> stateless, page size = 1000: 833 RPCs
> stateless, page size = 9200: 560 RPCs
>
> I wish appstats could track the memory usage, that's important in
> order to be able to use cheaper instances.
> When I load too many entities at once I run out of memory, the
> instance gets killed and the task restarts (forever! until I clear the
> task queue).
>
> I've noticed with stateful() the memory usage has increased but I
> can't see heap errors in the logs. With stateless the memory usage was
> constant so I don't think there's a leak, at least for that case.

Offsets are not very good because I think a new query is rebuilt each time on the datastore side, which certainly explains why the memory usage stays the same. That requires lots of resources and is slow, but I don't see why there are heap errors.
With cursors, I think GAE keeps the resources alive, as SQL DB cursors do, and when you ask for the next page it simply resumes from the last cursor without creating any new resources.
My cursor implementation in Siena is quite brutal: I keep all cursors in memory, and that's why you can go backward and forward, whereas in GAE you can only go forward.
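
Keeping every cursor in memory to allow backward navigation can be sketched as follows. This is only an illustration of the idea, not Siena's code: CursorHistory is a hypothetical class and cursors are modelled as opaque strings. The start cursor of each visited page is remembered, so stepping back is just a matter of reusing an earlier entry.

```java
import java.util.ArrayList;
import java.util.List;

final class CursorHistory {
    // starts.get(i) is the opaque cursor at which page i begins;
    // page 0 begins with no cursor at all, represented here by null.
    private final List<String> starts = new ArrayList<>();
    private int current = 0;

    CursorHistory() {
        starts.add(null);
    }

    /** Cursor to resume from when (re)loading the current page. */
    String currentStart() {
        return starts.get(current);
    }

    /** Remembers where the current page ended (first visit only) and moves to the next page. */
    void forward(String endCursorOfCurrentPage) {
        if (current == starts.size() - 1) {
            starts.add(endCursorOfCurrentPage);
        }
        current++;
    }

    /** Moves to the previous page; returns false when already on the first page. */
    boolean back() {
        if (current == 0) {
            return false;
        }
        current--;
        return true;
    }

    /** Number of cursors held in memory; grows by one per newly visited page. */
    int rememberedCursors() {
        return starts.size() - 1;
    }

    public static void main(String[] args) {
        CursorHistory history = new CursorHistory();
        history.forward("cursor-after-page-0");
        history.forward("cursor-after-page-1");
        history.back();
        System.out.println("resume from: " + history.currentStart());
        System.out.println("cursors kept: " + history.rememberedCursors());
    }
}
```

One cursor is retained per visited page, so the history grows with the number of pages. That is consistent with stateful mode using somewhat more memory than stateless mode, though it may not be the only contributor.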

Pascal

Emanuele Ziglioli

Apr 25, 2012, 6:55:40 PM

to Siena

> Offsets is not very good because I think they rebuild a new query each time
> on datastore side which certainly explains why the memory usage stays the
> same. But this requires lots of resources and is slow but I don't see why
> there are heap errors.

No no, heap errors occurred when I was loading the whole table at once
(30,000 entries but also with fewer entries).
I can hit the heap limit doing a unit test too although the local db
is not a good reflection of the GAE datastore, in fact from what I've
read it uses even more resources, being an in memory db.

Stateless pagination shows constant memory usage that is also
lower than what I see for stateful pagination.

On the other hand, I haven't been able to find a good criterion for
choosing the page size. I believe it should be a tradeoff between
memory usage and RPCs, although, as I've reported before, the number of
RPCs doesn't change much with the page size. The RPC count is definitely
bigger when the page size is lower, 500 vs 1000 entries per page.
Above 1000 it doesn't change much. So I think GAE generates multiple
RPCs given a certain maximum amount of data that can be fetched at
once. I could play with the 'chunk size' option but it's not well
documented.
I think I'd be better off with memcache or, rather, moving that
type of long-running batch processing to MapReduce.
I've noticed MapReduce jobs are very quick at iterating over a whole
table after the first run, and that's a sign they might use memcache
internally.

Pascal Voitot Dev

Apr 26, 2012, 2:44:57 AM

to siena-...@googlegroups.com

At least I think map/reduce is mastered by Google, they invented it someway :D


Emanuele Ziglioli

May 11, 2012, 1:27:03 AM

to Siena

Hi everyone,

I've done some tests with stateful fetchKeys().
On Apr 24, 11:31 am, Emanuele Ziglioli <theb...@emanueleziglioli.it>
wrote:

> > > > Query<YourClass> q = YourClass.all().paginate(500).stateful();
>
> > > > q.fetchKeys();
> > > > q.nextPage().fetchKeys();
> > > > q.nextPage().fetchKeys();
> > > > q.nextPage().fetchKeys();
> > > > q.nextPage().fetchKeys();
> > > > previous works also and this is special to Siena :)
> > > > q.previousPage().fetchKeys();
>
> > > > when there is nothing more, it returns an empty List...

It works, but it looks like the minimum number of RPCs and the maximum use
of memcache occur when I fetch all keys first and then do batch
gets on subsets of those keys.

I've implemented a generic Iterator that does that. I haven't tested
it fully, so take it as a draft:

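import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

import siena.Model;
import siena.Query;
import siena.core.batch.Batch; // assumed package, verify against your Siena version

public class KeyBatchIterator<T extends Model> implements Iterator<T> {

    int pagesz = 1000;
    List<T> keys; // key-only models that have not been handed out yet
    List<T> page; // view onto the first pagesz entries of keys

    public KeyBatchIterator(Query<T> q) {
        keys = q.fetchKeys();
        page = nextPage();
    }

    // Batch-loads the next pagesz key-only models in place.
    private List<T> nextPage() {
        if (pagesz > keys.size())
            pagesz = keys.size();

        List<T> page = keys.subList(0, pagesz);
        if (page.isEmpty())
            return page; // nothing left: avoids page.get(0) on an empty list

        Batch batch = Model.batch(page.get(0).getClass());
        batch.get(page);

        return page;
    }

    @Override
    public boolean hasNext() {
        if (page.size() == 0 && keys.size() > 0)
            page = nextPage();

        return page.size() > 0;
    }

    @Override
    public T next() {
        if (!hasNext())
            throw new NoSuchElementException();

        // page is a view of keys, so this also removes the entry from keys
        return page.remove(0);
    }

    @Override
    public void remove() {
        throw new UnsupportedOperationException();
    }
}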
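
For comparison, here is a library-independent sketch of the same "fetch all keys first, then load values in batches" idea. It assumes nothing about Siena: BatchLoadingIterator is a hypothetical name, and the loadBatch function is a placeholder for whatever batch get the datastore layer offers. It tracks an index into the key list instead of removing elements, so the key list is never modified.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

final class BatchLoadingIterator<K, V> implements Iterator<V> {
    private final List<K> keys;
    private final int batchSize;
    private final Function<List<K>, List<V>> loadBatch; // hypothetical batch get
    private int nextKey = 0;                            // index of the first key not yet loaded
    private Iterator<V> current = Collections.emptyIterator();

    BatchLoadingIterator(List<K> keys, int batchSize, Function<List<K>, List<V>> loadBatch) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.keys = new ArrayList<>(keys);              // defensive copy
        this.batchSize = batchSize;
        this.loadBatch = loadBatch;
    }

    @Override
    public boolean hasNext() {
        // Load the next batch only when the current one is exhausted.
        while (!current.hasNext() && nextKey < keys.size()) {
            int end = Math.min(nextKey + batchSize, keys.size());
            current = loadBatch.apply(keys.subList(nextKey, end)).iterator();
            nextKey = end;
        }
        return current.hasNext();
    }

    @Override
    public V next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }

    public static void main(String[] args) {
        List<Integer> keys = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
        Iterator<String> it = new BatchLoadingIterator<Integer, String>(keys, 3, batch -> {
            List<String> values = new ArrayList<>();
            for (Integer k : batch) {
                values.add("entity-" + k);              // stands in for a real batch get
            }
            return values;
        });
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```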

Emanuele Ziglioli

May 11, 2012, 8:29:42 AM

to Siena

> I've done some tests with stateful fetchKeys().

> It works but it looks like the minimum amount of RPC and the maximum use
> of memcache occurs when I fetch all keys first and then I do batch
> gets on subsets of those keys.

I take that back. The number of RPCs varies wildly on the production
server and I have no idea why.
(2) 2012-05-11 12:14:49.823 "POST /_ah/queue/projectupdate" 200
real=44302ms api=0ms overhead=0ms (7292 RPCs)
(3) 2012-05-11 12:14:00.281 "POST /_ah/queue/projectupdate" 200
real=49337ms api=0ms overhead=0ms (142 RPCs)
(5) 2012-05-11 12:13:07.146 "POST /_ah/queue/projectupdate" 200
real=52378ms api=0ms overhead=0ms (650 RPCs)

I would expect around 142 based on around 20,000 entries.
Those occurrences with thousands of RPCs are killing my quota, with
millions of RPCs counted over one day.
I've set up a cron job to try and limit access to the datastore.

Pascal Voitot Dev

May 11, 2012, 8:38:51 AM

to siena-...@googlegroups.com

This is crazy :)
Why could it vary so much???? It takes the same amount of time with many more RPCs...
If it was a problem on the Siena side, it would be constant, but it's completely different at each call...

pascal

Emanuele Ziglioli

May 11, 2012, 8:48:38 AM

to Siena

On May 12, 12:38 am, Pascal Voitot Dev <pascal.voitot....@gmail.com>
wrote:

> This is crazy :)
> why could it vary so much???? it takes the same amount of time with much
> more RPCs...
> If it was a problem on Siena side, it would be constant but it's completely
> different at each call...

It is: on the development server it's constant. I see that only on GAE
servers. I just couldn't understand that quota usage until I saw
thousands of RPCs being generated at once.
I'll do some research to see whether other people are experiencing
that. I've tried with different algorithms with similar results.
