Date: Mon, 28 May 2012 12:48:52 -0700 (PDT)
Subject: Re: Multiprocess Queryset memory consumption increase (DEBUG=False and using iterator)
From: akaariai <akaar...@gmail.com>
To: Django users <email@example.com>
On May 28, 5:42 pm, pc <pie...@musmato.com> wrote:
> I am stumped. I am trying to process a lot of data (2 million records
> and up): once I have a QuerySet, I immediately feed it to a
> queryset_iterator that fetches results in chunks of 1000 rows each. I
> use MySQL and the DB server is on another machine (so I don't think it
> is MySQL caching).
> I kick off 4 sub-processes using multiprocessing.Process, but even if
> I keep it to 1, I eventually run out of memory.
> RAM usage just seems to increase steadily. When I finish processing a
> resultset, I assign a new resultset to the variable and would expect
> gc to get my memory back.
> Any ideas?
> def run(self):
>     .....
>             self.process_messages(job, messages)
>             messages = None
>             gc.collect()
>
> def queryset_iterator(queryset, chunksize=1000):
>     pk = 0
>     last_pk = queryset.order_by('-pk')[0].pk
>     queryset = queryset.order_by('pk')
>     while pk < last_pk:
>         for row in queryset.filter(pk__gt=pk)[:chunksize]:
>             pk = row.pk
>             yield row
>         gc.collect()
Are you sure the leak is not in process_messages?
I tested something similar on PostgreSQL, and the queryset_iterator
doesn't seem to leak memory:
def queryset_iterator(queryset, chunksize=100):
    pk = 0
    last_pk = queryset.order_by('-pk')[0].pk
    queryset = queryset.order_by('pk')
    while pk < last_pk:
        for row in queryset.filter(pk__gt=pk)[:chunksize]:
            pk = row.pk
            yield row
        gc.collect()

for i in queryset_iterator(TestModel.objects.all()):
    print(memory())

where memory() is from http://stackoverflow.com/questions/938733/python-tot...
The result seems stable and doesn't indicate any memory leak.
TestModel contains 100000 objects.
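
For reference, a minimal stand-in for memory() on Linux is to read the
resident set size of the current process from /proc/self/status; this is
just a sketch in the spirit of the linked answer, not necessarily the exact
helper used above:

def memory():
    # Return the resident set size (VmRSS) of the current process, in kB.
    # Linux-only: parses /proc/self/status.
    with open('/proc/self/status') as status:
        for line in status:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])
    return 0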
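
To check whether process_messages rather than the queryset is what holds on
to the memory, one option is to log the RSS around each call. A rough
sketch: report_memory below is just an illustrative helper (not from the
original post), memory() is the function above, and process_messages, job
and messages are the names from the quoted run():

import gc

def report_memory(label, func, *args, **kwargs):
    # Call func, force a collection, and print how much the process RSS changed.
    before = memory()
    result = func(*args, **kwargs)
    gc.collect()
    print('%s: %d kB -> %d kB' % (label, before, memory()))
    return result

Inside run() that would look something like:

    report_memory('process_messages', self.process_messages, job, messages)

If the RSS keeps climbing there but stays flat when the call is skipped, the
leak is in the processing, not in the iterator.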