Multiprocess Queryset memory consumption increase (DEBUG=False and using iterator)
Date: Mon, 28 May 2012 12:48:52 -0700 (PDT)
In-Reply-To: <8816a4e5-2766-4e2d-b14f-0bba7a95f4be@3g2000vbx.googlegroups.com>
References: <8816a4e5-2766-4e2d-b14f-0bba7a95f4be@3g2000vbx.googlegroups.com>
Message-ID: <697bd21b-337a-46bf-a390-e83a3d6aee80@x21g2000vbc.googlegroups.com>
Subject: Re: Multiprocess Queryset memory consumption increase (DEBUG=False
and using iterator)
From: akaariai <akaar...@gmail.com>
To: Django users <django-users@googlegroups.com>
On May 28, 5:42 pm, pc <pie...@musmato.com> wrote:
> I am stumped. I am trying to process a lot of data (2 million records
> and up) and once I have a QuerySet, I immediately feed it to a
> queryset_iterator that fetches results in chunks of 1000 rows each. I
> use MySQL and the DB server is on another machine (so I don't think it
> is MySQL caching).
>
> I kick off 4 sub-processes using multiprocessing.Process, but even if
> I keep it to 1, eventually, I will run out of memory.
>
> RAM usage just steadily seems to increase. When I finish processing a
> resultset, I would allocate a new resultset to the variable and I
> would expect gc to get my memory back.
>
> Any ideas?
>
> def run(self):
>      .....
>
>             messages=queryset_iterator(MessageDAO.get_all_messages_for_date(now))
>             self.process_messages(job,messages)
>             messages=None
>             gc.collect()
> ...
>
> def queryset_iterator(queryset, chunksize=1000):
>     pk = 0
>     last_pk = queryset.order_by('-pk')[0].pk
>     queryset = queryset.order_by('pk')
>     while pk < last_pk:
>         for row in queryset.filter(pk__gt=pk)[:chunksize]:
>             pk = row.pk
>             yield row
>         gc.collect()

Are you sure the leak is not in process_messages?
I tested something similar on PostgreSQL, and the queryset_iterator
doesn't seem to leak memory:

import gc
from django.db import connection

def queryset_iterator(queryset, chunksize=100):
    pk = 0
    last_pk = queryset.order_by('-pk')[0].pk
    queryset = queryset.order_by('pk')
    while pk < last_pk:
        # Check that Django's debug query log is not growing.
        print(len(connection.queries))
        for row in queryset.filter(pk__gt=pk)[:chunksize]:
            pk = row.pk
            yield row
        gc.collect()

for i in queryset_iterator(TestModel.objects.all()):
    print(memory())

where memory() is from http://stackoverflow.com/questions/938733/python-total-memory-used
The result seems stable and doesn't indicate any memory leak.
TestModel contains 100000 objects.
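
For anyone who wants to repeat the check without following the link, here is a minimal sketch of a memory() helper and a small harness around it. This is not the exact code from the Stack Overflow answer: it assumes Linux (it reads VmRSS from /proc/self/status), and sample_memory() is a hypothetical name I made up for illustration. It consumes any iterable, throws each item away, and records the resident memory every `every` items, so the iterator can be tested on its own with process_messages taken out of the picture.

```python
import gc


def memory():
    """Resident set size of this process in bytes, or None if unavailable.

    Assumption: Linux, where /proc/self/status reports VmRSS in kB.
    """
    try:
        with open("/proc/self/status") as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) * 1024
    except IOError:
        # No /proc filesystem (e.g. not Linux): report "unknown".
        pass
    return None


def sample_memory(iterable, every=1000):
    """Consume iterable, discarding each item, and return memory samples.

    One sample is taken after every `every` items, following a forced
    garbage collection, so growth between samples is not just garbage
    waiting to be collected.
    """
    samples = []
    for count, _item in enumerate(iterable, 1):
        if count % every == 0:
            gc.collect()
            samples.append(memory())
    return samples
```

Something like sample_memory(queryset_iterator(TestModel.objects.all())) should then return a roughly flat list if the iterator itself is not leaking. If the numbers only climb once process_messages is put back in the loop, that is where to look.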