Possible interest in a webcast/presentation about a Django site with 40mil+ rows of data?


Cal Leeming [Simplicity Media Ltd]

Jun 22, 2011, 9:15:48 AM
to django...@googlegroups.com
Hi all,

Some of you may have noticed that in the last few months I've made quite a few posts/snippets about handling large data sets in Django. At the end of this month (after what seems like a lifetime of trial and error), we're finally going to be releasing a new site which holds around 40mil+ rows of data, grows by about 300-500k rows each day, handles 5GB of uploads per day, and can handle around 1024 requests per second under stress testing on a moderately spec'd server.

As the entire thing is written in Django (and a bunch of other open source products), I'd really like to give something back to the community. (The stack includes Celery/RabbitMQ/Sphinx SE/PyQuery/Percona MySQL/NGINX/supervisord/Debian etc.)

Therefore, I'd like to see if there would be any interest in a webcast in which I would explain how we handle such large amounts of data, the trial and error processes we went through, some really neat tricks we've done to avoid bottlenecks, our own approach to smart content filtering, and some of the valuable lessons we have learned. The webcast would be completely free of charge, last a couple of hours (with a short break) and anyone can attend. I'd also offer up a Q&A session at the end.

If you're interested, please reply on-list so others can see.

Thanks

Cal

Michał Sawicz

Jun 22, 2011, 9:20:39 AM
to django...@googlegroups.com
On Wed, 2011-06-22 at 14:15 +0100, Cal Leeming [Simplicity Media Ltd] wrote:

> If you're interested, please reply on-list so others can see.

Sure, I'd attend.
--
Michał (Saviq) Sawicz <mic...@sawicz.net>


Thomas Weholt

Jun 22, 2011, 9:31:44 AM
to django...@googlegroups.com
Yes! I'm in.

Out of curiosity: when inserting lots of data, how do you do it? Using
the ORM? Have you looked at http://pypi.python.org/pypi/dse/2.1.0 ? I
wrote DSE to solve inserting/updating huge sets of data, but if
there's a better way to do it, that would be especially interesting to
hear more about (and sorry for the self-promotion).

Regards,
Thomas

> --
> You received this message because you are subscribed to the Google Groups
> "Django users" group.
> To post to this group, send email to django...@googlegroups.com.
> To unsubscribe from this group, send email to
> django-users...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/django-users?hl=en.
>

--
Mvh/Best regards,
Thomas Weholt
http://www.weholt.org

Cal Leeming [Simplicity Media Ltd]

Jun 22, 2011, 9:36:16 AM
to django...@googlegroups.com
Hey Thomas,

Yeah we actually spoke a little while ago about DSE. In the end, we used a custom approach which analyses data in blocks of 50k rows, builds a list of rows which need changing to the same value, then applies them in bulk using update() + F().

Here's our benchmark:

(42.11s) Found 49426 objs (match: 16107) (db writes: 50847) (range: 72300921 ~ 72350921), (avg 13.8 mins/million) - [('is_checked', 49426), ('is_image_blocked', 0), ('has_link', 1420), ('is_spam', 1)]
(44.50s) Found 49481 objs (match: 16448) (db writes: 50764) (range: 72350921 ~ 72400921), (avg 14.6 mins/million) - [('is_checked', 49481), ('is_image_blocked', 0), ('has_link', 1283), ('is_spam', 0)]
(55.78s) Found 49627 objs (match: 18516) (db writes: 50832) (range: 72400921 ~ 72450921), (avg 18.3 mins/million) - [('is_checked', 49627), ('is_image_blocked', 0), ('has_link', 1205), ('is_spam', 0)]
(42.03s) Found 49674 objs (match: 17244) (db writes: 51655) (range: 72450921 ~ 72500921), (avg 13.6 mins/million) - [('is_checked', 49674), ('is_image_blocked', 0), ('has_link', 1971), ('is_spam', 10)]
(51.98s) Found 49659 objs (match: 16563) (db writes: 51180) (range: 72500921 ~ 72550921), (avg 16.9 mins/million) - [('is_checked', 49659), ('is_image_blocked', 0), ('has_link', 1517), ('is_spam', 4)]

Could you let me know if those benchmarks are better/worse than using DSE? I'd be interested to see the comparison!

Cal

Shawn Milochik

Jun 22, 2011, 9:37:08 AM
to django...@googlegroups.com
Cal,

That sounds awesome. I wish you could present it at DjangoCon US too. :o/

Shawn

Thomas Weholt

Jun 22, 2011, 9:45:10 AM
to django...@googlegroups.com
On Wed, Jun 22, 2011 at 3:36 PM, Cal Leeming [Simplicity Media Ltd]
<cal.l...@simplicitymedialtd.co.uk> wrote:
> Hey Thomas,
> Yeah we actually spoke a little while ago about DSE. In the end, we actually
> used a custom approach which analyses data in blocks of 50k rows, builds a
> list of rows which need changing to the same value, then applied them in
> bulk using update() + F().

Hmmm, what do you mean by "bulk using update() + F()"? Something like
"update sometable set somefield1 = somevalue1, somefield2 = somevalue2
where id in (1,2,3 .....)"? Does "avg 13.8 mins/million" mean you
processed 13.8 million rows per minute? What kind of hardware did you
use?

Thomas

Cal Leeming [Simplicity Media Ltd]

Jun 22, 2011, 9:52:55 AM
to django...@googlegroups.com
Sorry, let me explain a little better.

(51.98s) Found 49659 objs (match: 16563) (db writes: 51180) (range: 72500921 ~ 72550921), (avg 16.9 mins/million) - [('is_checked', 49659), ('is_image_blocked', 0), ('has_link', 1517), ('is_spam', 4)]

map(lambda x: (x[0], len(x[1])), _obj_incs.iteritems()) = [('is_checked', 49659), ('is_image_blocked', 0), ('has_link', 1517), ('is_spam', 4)]

In the above example, it has found 49659 rows which need 'is_checked' changing to the value '1' (same principle applied to the other 3), giving a total of 51,180 database writes, split into 4 queries.

Those 4 fields have the IDs assigned to them:

    # queue this hit (and its parent, if it has one) for the is_image_blocked update
    if _f == 'block_images':
        _obj_incs.get('is_image_blocked').append(_hit_id)
        if _parent_id:
            _obj_incs.get('is_image_blocked').append(_parent_id)

Then I loop through those fields, and do an update() using the necessary IDs:

    # now apply the obj changes in bulk (massive speed improvements)
    for _key, _value in _obj_incs.iteritems():
        # update the child object
        Post.objects.filter(id__in=_value).update(**{_key: 1})

So in simple terms, we're not doing 51 thousand update queries; instead we're grouping them into bulk queries based on the field to be updated. It doesn't yet do grouping based on key AND value, simply because we didn't need it at the time, but if we release the code for public use, we'd definitely add this in.
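A minimal sketch of the per-field grouping described above, written for Python 3 and using an in-memory sqlite3 table purely as a stand-in for the Post model so that it runs anywhere. The table layout, the `ALLOWED_FIELDS` set and the "every fifth row has a link" rule are illustrative assumptions, not the original code.

```python
import sqlite3
from collections import defaultdict

# Stand-in for the Post table; sqlite3 is used only so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE post (id INTEGER PRIMARY KEY, "
    "is_checked INTEGER DEFAULT 0, has_link INTEGER DEFAULT 0)"
)
conn.executemany("INSERT INTO post (id) VALUES (?)", [(i,) for i in range(1, 11)])

# Field names are checked against this set before being placed into SQL text.
ALLOWED_FIELDS = {"is_checked", "has_link"}

# Analysis phase: collect IDs per field instead of writing row by row.
obj_incs = defaultdict(list)
for post_id in range(1, 11):
    obj_incs["is_checked"].append(post_id)
    if post_id % 5 == 0:  # hypothetical rule standing in for the content analysis
        obj_incs["has_link"].append(post_id)

# Write phase: one UPDATE per field, covering every ID collected for it.
queries = 0
for field, ids in obj_incs.items():
    assert field in ALLOWED_FIELDS
    placeholders = ",".join("?" * len(ids))
    conn.execute(
        "UPDATE post SET %s = 1 WHERE id IN (%s)" % (field, placeholders), ids
    )
    queries += 1

print(queries)  # number of UPDATE statements issued
```

Here twelve row changes (ten for is_checked, two for has_link) are applied with two statements, which is the same shape as the 4 queries for roughly 51k writes in the benchmark line.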

Hope this makes sense, let me know if I didn't explain it very well lol.

Cal

Cal Leeming [Simplicity Media Ltd]

Jun 22, 2011, 9:56:40 AM
to django...@googlegroups.com
Also, the 13.8 minutes per million is a benchmark based on the number of db writes and the total time the block took to execute (42.11s for the first block above; the 51.98s block works out to 16.9).
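As a worked check of that figure, the short Python 3 sketch below recomputes the mins/million column from the elapsed time and db-write count of each benchmark line quoted earlier in the thread. The function name is an illustrative assumption and not part of the original script.

```python
def mins_per_million(elapsed_seconds, db_writes):
    """Minutes needed for one million writes at the observed rate."""
    seconds_per_write = elapsed_seconds / db_writes
    return seconds_per_write * 1000000 / 60


# (elapsed seconds, db writes) copied from the five benchmark lines
batches = [
    (42.11, 50847),
    (44.50, 50764),
    (55.78, 50832),
    (42.03, 51655),
    (51.98, 51180),
]

for elapsed, writes in batches:
    print("%.1f mins/million" % mins_per_million(elapsed, writes))
```

The unit is minutes per million writes, so lower is better; it does not mean millions of rows per minute.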

Please also note, this code is doing a *heavy* amount of content analysis, but if you were to strip that out, the only overheads would be the map/filter/lambda, the time it takes to transmit to MySQL, and the time it takes for MySQL to perform the writes.

The database hardware spec is:

1x X3440 quad core (2 cores assigned to MySQL).
12GB memory (4 GB assigned to MySQL).
/var/lib/mysql mapped to 2x Intel M3 SSD drives in RAID 1.

Cal

Thomas Weholt

Jun 22, 2011, 10:17:55 AM
to django...@googlegroups.com

Actually, I started working on something similar, but tried to find
sets of fields, instead of just updating one field per update, but
didn't finish it because the actual grouping of the fields seemed to
take a lot of time/CPU/memory. Perhaps if I focused on updating one
field at a time it would be simpler. Might look at it again for DSE
3.0 ;-)

Thomas

Andre Terra

Jun 22, 2011, 10:25:49 AM
to django...@googlegroups.com
Hello, Cal

First of all, congrats on the newborn! The Django community will surely benefit from having yet another success story, especially considering how big this project sounds. Is there any chance you could open-source some of your custom made improvements so that they could eventually be merged to trunk?

I definitely noticed how you mentioned large dbs in the past few months. I, along with many others I assume, would surely like to attend the webcast, with the only impediment being my schedule/timezone.

I recently asked about working with temporary tables for filtering/grouping data from uploads and inserting queries from that temporary table onto a permanent database. To make matters worse, I wanted to make this as flexible as possible (i.e. dynamic models) so that everything could be managed from a web app. Do you have any experience you could share about any of these use cases? As far as I know, there's nothing in the ORM that replicates PostgreSQL's CREATE TEMPORARY TABLE. My experience with SQL is rather limited, but from asking around, it seems like my project could indeed benefit from such a feature. If I had to guess, I would assume other DBMSs would offer something similar, but being limited to Postgres is okay for me, for now, anyway.



Cheers,
André

Cal Leeming [Simplicity Media Ltd]

Jun 22, 2011, 10:32:53 AM
to django...@googlegroups.com
Hmm, that's odd, the grouping (map/reduce/filter/lambda) is extremely quick for me (even on a heavy data set).

My guess is that grouping would need to be done on a combination of field name + value, and would need to allow the user to specify what batch size to use (to prevent a MemoryError exception, or find some way to reduce the batch size when MemoryError is encountered).
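One way this could look, as a rough Python 3 sketch: changes are grouped on (field name, value) and each group is split into batches of a caller-chosen size. The helper names `group_changes` and `chunked`, and the sample data, are hypothetical and not taken from DSE or from the code discussed in this thread.

```python
from collections import defaultdict


def group_changes(changes):
    """Group (row_id, field, value) changes by their (field, value) pair."""
    grouped = defaultdict(list)
    for row_id, field, value in changes:
        grouped[(field, value)].append(row_id)
    return grouped


def chunked(ids, batch_size):
    """Yield successive slices of ids, each at most batch_size long."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


# Hypothetical pending changes: (row id, field name, new value)
changes = [
    (1, "is_spam", 1),
    (2, "is_spam", 0),
    (3, "is_spam", 1),
    (4, "has_link", 1),
    (5, "is_spam", 1),
]

# Each plan entry corresponds to one bulk write.
plan = [
    (field, value, batch)
    for (field, value), ids in group_changes(changes).items()
    for batch in chunked(ids, batch_size=2)
]
```

With Django, each plan entry would map to a single `Model.objects.filter(id__in=batch).update(**{field: value})` call. A smaller batch size trades more queries for a lower peak memory footprint.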

If you end up introducing it into 3.0, I'll definitely be interested in taking a look at the code :)

Cal



Cal Leeming [Simplicity Media Ltd]

Jun 22, 2011, 10:47:15 AM
to django...@googlegroups.com
On Wed, Jun 22, 2011 at 3:25 PM, Andre Terra <andre...@gmail.com> wrote:
> Hello, Cal
>
> First of all, congrats on the newborn! The Django community will surely benefit from having yet another success story, especially considering how big this project sounds. Is there any chance you could open-source some of your custom made improvements so that they could eventually be merged to trunk?

Thank you! Yeah, the plan is to release as many of the improvements as open source as possible. Although I'd rely heavily on the community to make them 'patch worthy' for the core, as the amount of spare time I have is somewhat limited.

The improvements list is growing by the day, and I usually try and post as many snippets as I can, and/or tickets etc. 

It sounds like Thomas's DSE might be the perfect place for the bulk update code too.
 

> I definitely noticed how you mentioned large dbs in the past few months. I, along with many others I assume, would surely like to attend the webcast, with the only impediment being my schedule/timezone.

Once we've got a list of all the people who want to attend, I'll send out a mail asking for everyone's timezone and availability, so we can figure out what is best for everyone.
 

> I recently asked about working with temporary tables for filtering/grouping data from uploads and inserting queries from that temporary table onto a permanent database. To make matters worse, I wanted to make this as flexible as possible (i.e. dynamic models) so that everything could be managed from a web app. Do you have any experience you could share about any of these use cases? As far as I know, there's nothing in the ORM that replicates PostgreSQL's CREATE TEMPORARY TABLE. My experience with SQL is rather limited, but from asking around, it seems like my project could indeed benefit from such a feature. If I had to guess, I would assume other DBMSs would offer something similar, but being limited to Postgres is okay for me, for now, anyway.

I haven't had any exposure to Postgres, but my experience with temporary tables hasn't been a nice one (in regards to MySQL at least). MySQL has many gotchas when it comes to temporary tables and indexing, and on more than one occasion, I found it was actually quicker to analyse/mangle/re-insert the data via Python code, than it was to attempt the modifications within MySQL using a temporary table.

It really does depend on what your data is, and what you want to do with it, which can make planning ahead somewhat tedious lol.

For our stuff, when we need to do bulk modifications, we have a filtering rules list which is run every hour against new rows (with is_checked=1 set on rows which have been checked). We then use bulk queries of 50k (id >= 0 AND id < 50000), rather than using LIMIT/OFFSET (because LIMIT/OFFSET gets slower and slower the further into the result set you go). Those queries are analysed/mangled within a transaction, and bulk updated using the method mentioned in the reply to Thomas.
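A small Python 3 sketch of the ID-range batching described above, again using an in-memory sqlite3 table as a stand-in so it is runnable. The helper `id_ranges`, the table and the tiny block size are assumptions for illustration only. With OFFSET the database generally has to step past all the skipped rows first, whereas a range on the primary key can be located through the index.

```python
import sqlite3


def id_ranges(first_id, last_id, block_size):
    """Yield half-open (low, high) ID ranges covering first_id..last_id."""
    low = first_id
    while low <= last_id:
        yield low, low + block_size
        low += block_size


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, is_checked INTEGER DEFAULT 0)")
conn.executemany("INSERT INTO post (id) VALUES (?)", [(i,) for i in range(0, 25)])

seen = []
for low, high in id_ranges(0, 24, block_size=10):
    # Same shape as "id >= 0 AND id < 50000", just with a smaller block.
    rows = conn.execute(
        "SELECT id FROM post WHERE id >= ? AND id < ? ORDER BY id", (low, high)
    ).fetchall()
    seen.append(len(rows))
```

If the ID sequence has gaps, a block returns fewer rows than the block size, which matches the benchmark lines showing just under 50k objects found per 50k-wide range.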

Sadly though, I can't say if the methods we use would be suitable for you, as we haven't tried it against Postgres, and we've only tested it against our own data set + requirements. This is what I mean by trial and error, it's a pain in the ass :)

Andre Terra

Jun 22, 2011, 11:00:47 AM
to django...@googlegroups.com
On Wed, Jun 22, 2011 at 11:47 AM, Cal Leeming [Simplicity Media Ltd] <cal.l...@simplicitymedialtd.co.uk> wrote:


On Wed, Jun 22, 2011 at 3:25 PM, Andre Terra <andre...@gmail.com> wrote:
Hello, Cal

First of all, congrats on the newborn! The Django community will surely benefit from having yet another success story, especially considering how big this project sounds. Is there any chance you could open-source some of your custom made improvements so that they could eventually be merged to trunk?

Thank you! Yeah, the plan is to release as much of the improvements as open source as possible. Although I'd rely heavily on the community to make them 'patch worthy' for the core, as the amount of spare time I have is somewhat limited. 

The improvements list is growing by the day, and I usually try and post as many snippets as I can, and/or tickets etc. 

It sounds like Thomas's DSE might be the perfect place for the bulk update code too.

Thanks a lot for the quick reply. I'll keep my eyes open for the code, and if unable to contribute with relevant modifications to the patches, I'll at least try to doc and test them!
 
 
I definitely noticed how you mentioned large dbs in the past few months. I, along with many others I assume, would surely like to attend the webcast, with the only impediment being my schedule/timezone.

Once we've got a list of all the people who want to attend, I'll send out a mail asking for everyones timezone and availability, so we can figure out what is best for everyone.

Definitely write me up for the list of attendees, then!
 
 


Thanks again for your enlightening input. Even with our different requirements, this was actually quite relevant as far as solving several doubts I had on how to go about this project.


Cheers,

André



Ivan Aleman

Jun 22, 2011, 11:39:48 AM
to django...@googlegroups.com
On 22 June 2011 08:15, Cal Leeming [Simplicity Media Ltd] <cal.l...@simplicitymedialtd.co.uk> wrote:

If you're interested, please reply on-list so others can see.



Sweet! Count me in :)
 


--
Iván

Anurag Chourasia

Jun 22, 2011, 11:49:53 AM
to django...@googlegroups.com
I am in :-)


creecode

Jun 22, 2011, 11:55:50 AM
to django...@googlegroups.com
Hello SleepyCal,


On Wednesday, June 22, 2011 6:15:48 AM UTC-7, SleepyCal wrote:

If you're interested, please reply on-list so others can see.

+1

Also if the webcast could be stored for later viewing that would be grand.

Toodle-loooooooooooooo..........
creecode

Cal Leeming [Simplicity Media Ltd]

Jun 22, 2011, 12:00:15 PM
to django...@googlegroups.com
On Wed, Jun 22, 2011 at 4:55 PM, creecode <cree...@gmail.com> wrote:
Hello SleepyCal,


On Wednesday, June 22, 2011 6:15:48 AM UTC-7, SleepyCal wrote:

If you're interested, please reply on-list so others can see.

+1

Also if the webcast could be stored for later viewing that would be grand.

Yup, I'm planning on recording in 1080p and posting on YouTube shortly afterwards.
 

Toodle-loooooooooooooo..........
creecode


Brian Bouterse

Jun 22, 2011, 12:00:56 PM
to django...@googlegroups.com
+1 for making it viewable after the fact




--
Brian Bouterse
ITng Services

creecode

Jun 22, 2011, 12:46:15 PM
to django...@googlegroups.com
Hello SleepyCal,


On Wednesday, June 22, 2011 9:00:15 AM UTC-7, SleepyCal wrote:

Yup, I'm planning on recording in 1080p and posting on Youtube shortly afterwards.

Fantastic!

Thank you,

Toodle-loooo...........
creecode

Malcolm Box

Jun 22, 2011, 4:55:29 PM
to django...@googlegroups.com
On 22 June 2011 14:15, Cal Leeming [Simplicity Media Ltd] <cal.l...@simplicitymedialtd.co.uk> wrote:
Hi all,

Therefore, I'd like to see if there would be any interest in webcast in which I would explain how we handle such large amounts of data, the trial and error processes we went through, some really neat tricks we've done to avoid bottlenecks, our own approach to smart content filtering, and some of the valuable lessons we have learned. The webcast would be completely free of charge, last a couple of hours (with a short break) and anyone can attend. I'd also offer up a Q&A session at the end.

If you're interested, please reply on-list so others can see.

Count me in.

Malcolm

serek

Jun 22, 2011, 5:24:51 PM
to Django users
Great idea.
I also would like to see this webcast


Mishen'ka

Jun 22, 2011, 5:49:00 PM
to Django users
I really like the idea, hope I can see it soon.

Interesting, thanks.


Thomas Weholt

Jun 22, 2011, 6:39:35 PM
to django...@googlegroups.com
On Wed, Jun 22, 2011 at 4:47 PM, Cal Leeming [Simplicity Media Ltd]
<cal.l...@simplicitymedialtd.co.uk> wrote:
>
>
> On Wed, Jun 22, 2011 at 3:25 PM, Andre Terra <andre...@gmail.com> wrote:
>>
>> Hello, Cal
>>
>> First of all, congrats on the newborn! The Django community will surely
>> benefit from having yet another success story, especially considering how
>> big this project sounds. Is there any chance you could open-source some of
>> your custom made improvements so that they could eventually be merged to
>> trunk?
>
> Thank you! Yeah, the plan is to release as much of the improvements as open
> source as possible. Although I'd rely heavily on the community to make them
> 'patch worthy' for the core, as the amount of spare time I have is somewhat
> limited.
> The improvements list is growing by the day, and I usually try and post as
> many snippets as I can, and/or tickets etc.
> It sounds like Thomas's DSE might be the perfect place for the bulk update
> code too.

FYI: Inspired by this discussion I've already started on a similar
feature (although somewhat simplified) for DSE v2.2.0 and you're
right; the speed increase is huge using the method described here,
even compared to my current solution (using cursor.executemany),
which is considerably faster than the Django ORM already. My testing
so far has been using PostgreSQL, not sure how MySQL will perform. I
expect to release DSE v2.2.0 with this feature in the next few days.
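To make the two approaches being compared concrete, here is a hedged Python 3 sketch that applies the same change once with `cursor.executemany`-style per-row statements and once with a single grouped statement. It uses sqlite3 only so that it runs without a database server; it is not DSE's code and it makes no claim about relative speed on PostgreSQL or MySQL.

```python
import sqlite3


def make_table():
    """Create a throwaway table with 100 rows (IDs 1..100)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, is_checked INTEGER DEFAULT 0)")
    conn.executemany("INSERT INTO post (id) VALUES (?)", [(i,) for i in range(1, 101)])
    return conn


ids = list(range(1, 51))

# Approach 1: one parameterised statement, executed once per row.
per_row = make_table()
per_row.executemany("UPDATE post SET is_checked = 1 WHERE id = ?", [(i,) for i in ids])

# Approach 2: a single statement covering all IDs that share the new value.
grouped = make_table()
placeholders = ",".join("?" * len(ids))
grouped.execute("UPDATE post SET is_checked = 1 WHERE id IN (%s)" % placeholders, ids)

count_sql = "SELECT COUNT(*) FROM post WHERE is_checked = 1"
per_row_count = per_row.execute(count_sql).fetchone()[0]
grouped_count = grouped.execute(count_sql).fetchone()[0]
```

Both variants leave the table in the same state; the difference is 50 statement executions versus one.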

Cal Leeming [Simplicity Media Ltd]

Jun 22, 2011, 6:47:24 PM
to django...@googlegroups.com
Nice, seems like there is a lot of positive feedback. 

Here's some further info about the webcast/webinar:

Expected Date: July/August
Expected Time: Somewhere between 9AM - 9PM GMT+0 (UK Time), I'll probably run a webpage vote so you guys can decide.
Via: GoToWebinar - screen share with VoIP.
Length: 1 hour max for actual presentation, 5 minute break, then another 1 hour max for Q&As.
Presentation Size: 1920x1080 (1080p)

There will be some basic slides (to keep things structured), but the majority of the time will be spent inside the browser + IDE + shell etc.

Cal



Cal Leeming [Simplicity Media Ltd]

Jun 22, 2011, 6:50:04 PM
to django...@googlegroups.com
Nice! Looking forward to seeing this :)
 


william ratcliff

Jun 22, 2011, 7:21:22 PM
to django...@googlegroups.com
Definitely looking forward to it!

Cal Leeming [Simplicity Media Ltd]

Jun 22, 2011, 7:26:19 PM
to django...@googlegroups.com
I was thinking.. what would be really nice would be a monkey patch that could catch any inserts/updates/saves from the ORM, and defer them until the user triggered the bulk update somehow. That way, DSE would be completely interchangeable with existing code, making it much more likely for people to adopt it.
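A rough Python 3 sketch of the "defer, then flush in bulk" half of that idea. It is deliberately not a monkey patch of the ORM: it is an explicit buffer that callers write to, and the class name `DeferredWrites` and the `apply_bulk` hook are hypothetical names invented for this example.

```python
from collections import defaultdict


class DeferredWrites:
    """Collect single-row field changes and hand them over in bulk on flush()."""

    def __init__(self, apply_bulk):
        # apply_bulk(field, value, ids) is supplied by the caller (hypothetical hook).
        self.apply_bulk = apply_bulk
        self.pending = defaultdict(list)

    def set(self, row_id, field, value):
        # Nothing is written here; the change is only recorded.
        self.pending[(field, value)].append(row_id)

    def flush(self):
        for (field, value), ids in self.pending.items():
            self.apply_bulk(field, value, ids)
        self.pending.clear()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Only write if the block finished without an exception.
        if exc_type is None:
            self.flush()
        return False


calls = []
with DeferredWrites(lambda field, value, ids: calls.append((field, value, list(ids)))) as writes:
    writes.set(1, "is_checked", 1)
    writes.set(2, "is_checked", 1)
    writes.set(2, "is_spam", 1)
```

In a Django project the hook could issue one `filter(id__in=ids).update(**{field: value})` per call. Intercepting `save()` transparently, as suggested above, would need considerably more care (signals, primary keys for new rows, read-after-write behaviour).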


Władysław Mettler

Jun 23, 2011, 8:59:18 AM
to django...@googlegroups.com
Definitely interested. Struggling with scaling a Django/Celery/Postgres/CouchDB/Memcached project now. Would love to see your approach.

Cheers
Vlad

higs

Jun 23, 2011, 9:23:46 AM
to Django users
> If you're interested, please reply on-list so others can see.
>
would love to see this. thanks for posting.

Derek

Jun 23, 2011, 9:55:10 AM
to Django users
Cal

Please post here when the recording becomes available... some of us do
not have access to HD-capable bandwidth yet!

Thanks
Derek

Cal Leeming [Simplicity Media Ltd]

Jun 23, 2011, 10:45:51 AM
to django...@googlegroups.com
Hi Derek,

I've had a look and apparently 4-5 Mbit/s should be sufficient for an HD webinar. If quite a few of you don't have this capability (or screen size), I could look into reducing it down to 1440x900 perhaps?

Let me know your thoughts guys. (maybe it's something I need to do a web vote for, along with the timezone etc)

Cal

Chris Calitz

Jun 23, 2011, 10:49:51 AM
to django...@googlegroups.com
Sounds really cool. I'm definitely in. 



John DeRosa

Jun 23, 2011, 10:58:19 AM
to django...@googlegroups.com, django...@googlegroups.com
Me

Steve Holden

Jun 23, 2011, 1:13:08 PM
to django...@googlegroups.com
And, as luck would have it, the US Call For Papers was just published:


regards
 Steve

On Wed, Jun 22, 2011 at 9:37 AM, Shawn Milochik <sh...@milochik.com> wrote:
Cal,

That sounds awesome. I wish you could present it at DjangoCon US too. :o/

Shawn






--
Steve Holden        +1 571 484 6266  +1 800 494 3119
Holden Web LLC             http://www.holdenweb.com/

Cal Leeming [Simplicity Media Ltd]

Jun 23, 2011, 1:15:16 PM
to django...@googlegroups.com
(sorry, should have sent this reply on-list)

You know, I was actually thinking about that, but the flights would cost way too much (like $1400+) ;/ If the project has interest next year, I'll probably ask about presenting at DjangoCon EU though.


Jason

Jun 23, 2011, 1:38:34 PM
to Django users
+1 Very interested.


Fatrix

Jun 23, 2011, 3:14:46 PM
to Django users
I would like to attend, too.

Thank you in advance!

CU
Fatrix

Alex Grigorievskiy

Jun 24, 2011, 8:55:09 AM
to Django users
Hi, Cal and everybody,
I'd like to see the webcast too,
thanks very much for that :)


bruno desthuilliers

Jun 24, 2011, 9:04:01 AM
to Django users
+1

Nolan Brubaker

Jun 24, 2011, 11:14:37 AM
to django...@googlegroups.com
Yes, very interested.  +1

On Fri, Jun 24, 2011 at 9:04 AM, bruno desthuilliers <bruno.des...@gmail.com> wrote:
+1

Cal Leeming [Simplicity Media Ltd]

Jun 24, 2011, 2:34:53 PM
to django...@googlegroups.com
Really glad to see there has been so much interest in this!

If possible, can everyone please use the following form to indicate what week day / timeslot they are available, along with their screen resolution. This will be open until 1st July.


Whichever week day / timeslot / screen resolution has 100% of the votes, will be the chosen one :) If it's split, we'll look at doing the webcast on two separate occasions. Hopefully this should give everyone a fair chance.

Thanks

Cal

Cal Leeming [Simplicity Media Ltd]

Jun 28, 2011, 11:30:54 AM
to django...@googlegroups.com
Second call for anyone who wants to attend this webcast, 3 days left to place your vote.

Cal

Andre Terra

Jun 29, 2011, 9:16:38 AM
to django...@googlegroups.com
Can't access google spreadsheets through my corporate proxy (go figure), but I will vote later today.

Really interested in watching this!


Andre

Cal Leeming [Simplicity Media Ltd]

Jun 29, 2011, 11:30:43 AM
to django...@googlegroups.com
Wow, that sucks! :X Never known a corp to block Google Docs

Matteius

Jun 29, 2011, 9:13:28 PM
to Django users
If I am unable to attend at the scheduled time I absolutely must watch
the resulting recording. Thanks for your efforts.


Wagner Vaz

Jun 30, 2011, 12:03:09 AM
to django...@googlegroups.com

Please, count me in.


Cal Leeming [Simplicity Media Ltd]

Jun 30, 2011, 8:43:20 AM
to django...@googlegroups.com
Hey all,

Last call for registering your interest in this webcast (28 votes so far).

If you haven't done so already, please visit the following URL:

https://spreadsheets.google.com/spreadsheet/viewform?hl=en_GB&formkey=dENyOVFSSkhSYnhBLVZGTktiN1Z3Y2c6MQ#gid=0

Poll ends tomorrow, at which point we will decide upon a fixed time/date(s).

Cal

Benedict Verheyen

Jun 30, 2011, 9:02:30 AM
to django...@googlegroups.com
I'm interested, visited the URL above already.
Time isn't going to be that big of an issue for me as I'm only 1 timezone away :)

Cheers,
Benedict

Cal Leeming [Simplicity Media Ltd]

Jul 2, 2011, 6:35:21 AM
to django...@googlegroups.com
Hi guys,

Alright, vote is closed and here are the results:


It looks like I might have to do two separate webcasts to cater for all timezones, but we'll see.

I'll post up a fixed time/date (or two) shortly.

Cal

Cal Leeming [Simplicity Media Ltd]

Jul 2, 2011, 8:20:41 AM
to django...@googlegroups.com

gradja

Jul 3, 2011, 10:55:38 AM
to Django users

> If you're interested, please reply on-list so others can see.

Count me in, and thanks for sharing your experience with others.
Graziella

Andre Santos

Jul 3, 2011, 11:28:15 AM
to django...@googlegroups.com
Maybe someone can record it and upload it somewhere for those who can't be online when you do the presentation...


Cal Leeming [Simplicity Media Ltd]

Jul 3, 2011, 11:46:41 AM
to django...@googlegroups.com
This has already been discussed earlier in the thread :)

Andre Santos

Jul 3, 2011, 2:19:24 PM
to django...@googlegroups.com
Oh, sorry :).


Cal Leeming [Simplicity Media Ltd]

Jul 12, 2011, 9:48:46 AM
to django...@googlegroups.com
Hi all,

Great response to this, 45 registered votes in total.

The webcast will take place on Monday 29th August 2011 - (Minimum resolution 1920x1080)

However, because the time zone results are almost split 50/50, I'm going to allow users to select from two time slots. If the results are still split 50/50, I'll do both an afternoon and an evening session.

Please use this form to register your place.

Cal

Venkatraman S

Jul 12, 2011, 1:31:14 PM
to django...@googlegroups.com
Can you please ..please...please..please record this session!?

-V

Cal Leeming [Simplicity Media Ltd]

Jul 12, 2011, 1:41:10 PM
to django...@googlegroups.com
Of course :) I'll make sure to take a high quality recording and stick it on YT.

Cal


AlexH

Jul 13, 2011, 3:31:57 AM
to django...@googlegroups.com
> If you're interested, please reply on-list so others can see.

Would be very interested in attending. Thanks!

Andre Terra

Jul 13, 2011, 6:37:14 AM
to django...@googlegroups.com
Unfortunately, I'm on GMT-0300 and none of those timezones work for me. Looking forward to the recorded version, though!

Cheers,
André


Cal Leeming [Simplicity Media Ltd]

Jul 25, 2011, 8:42:15 AM
to django...@googlegroups.com
Second call to register your attendance for this webcast, if you haven't already done so.

Thanks

Cal

nicolas HERSOG

Jul 25, 2011, 9:00:58 AM
to django...@googlegroups.com
I won't be able to follow the live session, but I can't wait to watch your recording on YT.

Thx!

Cal Leeming [Simplicity Media Ltd]

Aug 11, 2011, 1:31:24 PM
to django...@googlegroups.com
Last call on this, guys:


If you want to register your attendance, please do so now.

Thanks

Cal Leeming [Simplicity Media Ltd]

Aug 19, 2011, 7:32:26 PM
to django...@googlegroups.com
Hi all,

Okay, good news and bad news. 

The bad news is that the site/project which sparked this webcast has had some legal complications and is being shut down, less than 1 month after being released :L

The good news is that the webcast will still be going ahead.

For those of you that want to see the site it was related to, it's http://www.iliketochan.com . 

The site itself was a complete 4chan archive, dating back to 2010, peaking at 50 million posts and 17 million images.

Particularly sad about this because so much time and effort went into the site, but at the same time, it has been a great learning experience (both in code and business). So being able to share that on the webcast will be of some small comfort :)

Cal

Cal Leeming [Simplicity Media Ltd]

Aug 27, 2011, 4:33:58 PM
to django...@googlegroups.com
Hey guys,

I'm sending out the webcast invitations now (you should receive them from GoToMeeting in about 10 minutes). Last chance to jump on if you haven't already.

Cal

william ratcliff

Aug 27, 2011, 4:46:51 PM
to django...@googlegroups.com
Any problem if we register for both?

Cal Leeming [Simplicity Media Ltd]

Aug 27, 2011, 4:50:03 PM
to django...@googlegroups.com
Nope, we're allowed up to 100 attendees per time slot, and we are at 55 at the moment, so feel free to register for both.

Cal Leeming [Simplicity Media Ltd]

Aug 27, 2011, 4:54:13 PM
to django...@googlegroups.com
Bloody thing, had to send it directly to the list cos it doesn't have the option of bulk inviting people. How annoying :X

Anyone who didn't receive the invite email, please let me know.

Cal