Sitemaps Memory Usage


David Cramer

Sep 29, 2008, 12:21:13 AM
to Django developers
We launched the new iBegin.com yesterday and, kudos to Django, we've
pushed over 600 req/s on a server shared with several other websites
and daemons (including sphinx and memcached).

BUT I noticed a huge issue right after launch with memory usage. It
was randomly spiking (and staying) high. The problem was not only that
I wasn't using select_related (although both would create equally
deadly problems), but that nothing like a generator is used when
outputting the sitemap.xml.

So, Yahoo, and Google, both hit our new sitemaps, which consist of 12
million entries (paged automatically, yay Django), and began eating up
5%-10% memory per sitemap.

Now what do we do to solve this?

I would think that it could be approached by somehow creating a file
like object which could yield the urls (and then passing that to
HttpResponse), but I'm not very familiar with how generators and file
objects work.
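
A minimal sketch of the "yield the urls" idea, in plain Python with no Django imports. The names iter_sitemap_xml and example_locations are hypothetical, not part of Django's sitemap framework, and the XML is reduced to bare <loc> entries. Whether a given Django version streams an iterator handed to HttpResponse, rather than joining it into one string first, is an assumption to check before relying on this.

```python
from xml.sax.saxutils import escape


def iter_sitemap_xml(locations):
    """Yield a sitemap document chunk by chunk instead of building one big string."""
    yield '<?xml version="1.0" encoding="UTF-8"?>\n'
    yield '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    for loc in locations:
        # escape() turns &, < and > into XML entities.
        yield "  <url><loc>%s</loc></url>\n" % escape(loc)
    yield "</urlset>\n"


def example_locations(count):
    """Hypothetical stand-in for a lazily evaluated source of URLs."""
    for i in range(count):
        yield "http://example.com/listing/%d/?a=1&b=2" % i


# Only one chunk exists at a time until something joins them.
chunks = iter_sitemap_xml(example_locations(3))
first = next(chunks)
rest = "".join(chunks)
```

Because both the URL source and the XML writer are generators, memory use stays flat only as long as every layer in between also consumes the chunks one at a time.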

Malcolm Tredinnick

Sep 29, 2008, 12:40:41 AM
to django-d...@googlegroups.com

On Sun, 2008-09-28 at 21:21 -0700, David Cramer wrote:
[...]

> So, Yahoo, and Google, both hit our new sitemaps, which consist of 12
> million entries (paged automatically, yay Django), and began eating up
> 5%-10% memory per sitemap.
>
> Now what do we do to solve this?
>
> I would think that it could be approached by somehow creating a file
> like object which could yield the urls (and then passing that to
> HttpResponse), but I'm not very familiar with how generators and file
> objects work.

Using generators for template output went very badly the last time we
tried it. There were all sorts of unexpected side-effects because
template rendering involves database accesses a lot of the time (due to
lazy evaluation of querysets).

This type of memory usage hit for querysets that are only going to be
used exactly once is something I've been thinking about a bit over the
past few months. There's a case for some kind of "don't cache in memory"
attribute on querysets so that __iter__ won't populate the internal
cache. It's usually going to be far more dangerous than useful, but in
some situations it will be promising. When we branch off from 1.0.x and
trunk can take features again, I'll probably add something along those
lines.

More specifically for your case, though, in a way that won't require
changing anything in core (and it then becomes a django-users type of
problem): try subclassing the GenericSitemap class and replacing the
items() method to return a subclass of QuerySet with an overridden
__iter__ that doesn't cache the results in memory. So, you're going to
write something like this:

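    from django.contrib.sitemaps import GenericSitemap
    from django.db.models.query import QuerySet

    class SkinnyQueryset(QuerySet):
        def __iter__(self):
            ...  # <-- your stuff here

    class LessGenericSitemap(GenericSitemap):
        def items(self):
            ...  # <-- return QuerySet subclass here.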

I've left the details up to you, since it's just a matter of looking at
the existing code.
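
To illustrate the difference between a caching and a non-caching __iter__, here is a small sketch that uses plain-Python stand-ins rather than Django's real QuerySet. CachingResults, SkinnyResults, and fetch_rows are hypothetical names, and the _result_cache attribute only mimics the idea of an internal cache. In real code, QuerySet.iterator() is also worth looking at, since it is documented as reading results without populating the queryset's cache.

```python
class CachingResults:
    """Stand-in for a queryset: the first iteration stores every row."""

    def __init__(self, fetch_rows):
        self._fetch_rows = fetch_rows  # callable returning an iterator of rows
        self._result_cache = None

    def __iter__(self):
        if self._result_cache is None:
            # All rows are held in memory for as long as this object lives.
            self._result_cache = list(self._fetch_rows())
        return iter(self._result_cache)


class SkinnyResults(CachingResults):
    """Same interface, but rows are streamed and never retained."""

    def __iter__(self):
        # Each iteration re-runs the fetch; nothing is cached.
        return iter(self._fetch_rows())


def fetch_rows():
    """Hypothetical row source standing in for a database cursor."""
    for i in range(5):
        yield "/business/%d/" % i


fat = CachingResults(fetch_rows)
skinny = SkinnyResults(fetch_rows)
fat_rows = list(fat)
skinny_rows = list(skinny)
```

The trade-off is the one described above: the skinny version saves memory for one-shot iteration but repeats the fetch every time it is iterated again.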

Regards,
Malcolm

David Cramer

Sep 29, 2008, 12:43:10 AM
to Django developers
Also in regards to this, using a template is overkill in my opinion.
Especially if it's going to cause extra headaches.

I'll try out your recommendation for now, thanks Malcolm.
