What does scalability look like for Django once you reach the limit of
your initial serving capacity? I would be interested in what is occurring
at some of these large newspaper sites to handle the load.
The data that you are serving is one factor where undoubtedly much can
be accomplished with caching etc. My question, however, is more to do
with what to do when you are beyond the limits of what a single apache2
instance can do for you.
Many thanks.
Regards,
David
Hi David,
The general way of scaling a Django app is through caching and
hardware. For caching, we highly recommend memcached, which is capable
of spreading cache over several machines without having to duplicate
cache per machine. For hardware, you can load-balance your
Web-application servers. You can probably also do some sort of
database replication, but I haven't had to deal with that in my
experience.
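A rough sketch of what "spreading cache over several machines without
duplicating it" can look like: each key is hashed to exactly one server.
The server addresses, `server_for` and `ShardedCache` are hypothetical,
the hash-modulo placement is an assumption for illustration (real
memcached clients use their own hashing schemes), and plain dicts stand
in for the network so the example runs without memcached:

```python
import hashlib

# Hypothetical addresses; in a real deployment these would be memcached hosts.
SERVERS = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]


def server_for(key, servers=SERVERS):
    """Pick exactly one server for a key by hashing the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]


class ShardedCache:
    """In-memory stand-in: one dict per 'server' instead of a network call."""

    def __init__(self, servers=SERVERS):
        self.servers = list(servers)
        self.stores = {server: {} for server in self.servers}

    def set(self, key, value):
        self.stores[server_for(key, self.servers)][key] = value

    def get(self, key, default=None):
        return self.stores[server_for(key, self.servers)].get(key, default)


cache = ShardedCache()
cache.set("story:42", "<html>rendered page</html>")

# The value lives on one machine only, yet every web server can find it
# because they all compute the same hash.
copies = sum(1 for store in cache.stores.values() if "story:42" in store)
```

Every web server behind the load balancer computes the same placement,
so adding cache machines adds capacity rather than extra copies.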
Adrian
--
Adrian Holovaty
holovaty.com | djangoproject.com
Regards,
David
http://groups.google.com/group/django-users/browse_frm/thread/56698424ae3708ea
Thanx,
Z
I'd be glad to:
I have three web servers hitting the same database server. Some
content is shared; some is not. It works perfectly.
I really don't know what else to say about this; you've complained that
you're having problems, but I don't see any details of what those
problems are, and I've never experienced any of them in practice.
In your example::

    obj1 = objects.get_object(pk=1)
    obj2 = objects.get_object(pk=1)
    obj1.data = 1
    obj1.save()

Of *course* you'd expect that obj2.data != obj1.data -- Django's not
going to be able to hide the fact that you're using a database from you
(nor should it).
If this is actually a problem for you -- and not just a theoretical one
-- please give me more details so I don't have to assume that this is
just FUD.
Jacob
Django's original install base was a few sites that, between them, do
a few million hits. Django's current install base is everything from
personal blogs to high-traffic applications at the Washington Post.
While I don't mean to be insulting, this would seem to indicate that
if you can't get Django to scale, your problem is not in Django.
--
"May the forces of evil become confused on the way to your house."
-- George Carlin
Think distributed: two requests updating the same data concurrently.
Last write wins. Data might not be what you expect, as you can't make
sure that you have the version you directly read before updating. This
is the simple scenario. Often this doesn't matter, as "last write wins"
is quite acceptable. But sometimes - for example if you do financial
transactions or shop stuff - this might be quite undesirable. But even
then it might not hit you, as this is a solely update-related problem -
it won't happen with inserts, as you usually have automatically
generated id's and so concurrent inserts will get separate id's.
And it can't be really solved without transactions, as you would need
to be able to make sure that your fetched object is still the correct
version that's in the database. Think for example about something like
this:

    account = banking.get_account(pk=4711)
    # ... do some calculation on what to store in the account
    account.amount += result_of_calculation
    # ... do maybe some more calculations for other stuff
    account.save()

You can't be sure that your account still has the same amount when you
do the += and the save is done, so you can't make sure that you don't
stomp on updates others have already done in between. And just "move
the read/update/save calls into one spot in the code" won't cut it - it
would just make it less probable to hit this problem, but given a high
enough update rate you will still hit it.
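To make the interleaving concrete, here is a small sketch that replays
it deterministically. It assumes an in-memory SQLite table instead of
the Django ORM, and `load` and `save` are hypothetical helpers that
mimic "fetch an object" and "write it back":

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.execute("INSERT INTO account VALUES (4711, 100)")


def load(pk):
    # Read a snapshot of the row, like fetching an object through an ORM.
    row = conn.execute("SELECT amount FROM account WHERE id = ?", (pk,)).fetchone()
    return {"id": pk, "amount": row[0]}


def save(obj):
    # Write the in-memory value back unconditionally.
    conn.execute(
        "UPDATE account SET amount = ? WHERE id = ?", (obj["amount"], obj["id"])
    )


# Two requests load the same account before either one saves.
request_a = load(4711)
request_b = load(4711)

request_a["amount"] += 50  # deposit of 50
save(request_a)

request_b["amount"] += 25  # deposit of 25, computed from the stale value 100
save(request_b)

final_amount = load(4711)["amount"]
print(final_amount)  # 125, not the 175 that both deposits should add up to
```

The second save silently discards the first deposit, which is the "last
write wins" outcome described above.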
Actually I had exactly that problem in a small project where I used the
Django ORM out of laziness and did hit exactly this - I was collecting
sum data on object values in a cumulative table. Of course I switched
that to a simple SQL trigger mechanism, as it wasn't much more than
just a Python-written trigger (it was actually a _post_save code on
the detail object that kept tallies of object kinds on the collection
object). The SQL trigger was executed in the same transaction as the
update SQL and so the problem was solved.
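A minimal sketch of that kind of trigger, assuming SQLite and a
simplified schema (the `collection` and `item` tables and the
`item_count` column are made up for illustration, not the schema of the
project described above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE collection (
    id INTEGER PRIMARY KEY,
    item_count INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE item (
    id INTEGER PRIMARY KEY,
    collection_id INTEGER NOT NULL REFERENCES collection(id)
);
-- The tally is adjusted relative to its current value inside the same
-- transaction as the insert or delete, so no stale in-memory count is used.
CREATE TRIGGER item_insert_tally AFTER INSERT ON item
BEGIN
    UPDATE collection SET item_count = item_count + 1
    WHERE id = NEW.collection_id;
END;
CREATE TRIGGER item_delete_tally AFTER DELETE ON item
BEGIN
    UPDATE collection SET item_count = item_count - 1
    WHERE id = OLD.collection_id;
END;
INSERT INTO collection (id) VALUES (1);
""")

with conn:  # one transaction for the data change and the tally update
    conn.executemany(
        "INSERT INTO item (collection_id) VALUES (?)", [(1,), (1,), (1,)]
    )
    conn.execute("DELETE FROM item WHERE id = 2")

item_count = conn.execute(
    "SELECT item_count FROM collection WHERE id = 1"
).fetchone()[0]
```

Because the database applies `item_count + 1` itself, the application
never reads the tally, modifies it in memory and writes it back.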
But the account/amount problem above can't be solved on SQL level alone
- it depends entirely on application code, and so you are currently fubar.
At least unless we get transactions ;-)
Granted, in most content-oriented and presentation-oriented
applications you don't hit this problem - I did hit it because the
relevant application needed a fast overview of data in the catalog and
the items themselves were too many to count them on view time, so I
needed the tallies. Other content and presentation oriented sites
aren't that update-heavy to start with, and if they are, it's mostly
people updating different parts of the database (and if two update the
same thing, "last write wins" is the expected outcome anyway). So this
is more a problem for people who do more processing-oriented sites.
bye, Georg
I'm sure there are many very large websites using Django, but from what
I see many are newspaper-style (many reads, few if any writes except by
the admins). I'd be curious how many sites are doing dynamic updates by
many concurrent users? In a read-only / content / presentation oriented
site (as Hugo calls them) you won't run into this problem.
The banking example above or my example across two machines is exactly
the sort of problem that Django doesn't support.
"Of *course* you'd expect that obj2.data != obj1.data"
No ... I'd *absolutely* expect obj2.data == obj1.data or it should
throw an exception on the second write. That's what Hibernate supports
beautifully.
>I'm sure there are many very large websites using Django, but from what
>I see many are newspaper-style (many reads, few if any writes except by
>the admins). I'd be curious how many sites are doing dynamic updates by
>many concurrent users?
>
It's still not a problem for a typical web app (think of eshops) that
indeed does many concurrent updates. The thing is that these updates are
targeted at different data: each user updates its own piece. So another
condition that should be applied to your problem description is that
there should be many concurrent updates to shared data.
And even with such updates the problem would exist only with
applications that keep modified data in memory for later access. Again
it's not the case for a typical web app where each request does a short
cycle of read-process-update. Most of the time most of the state is
stored in a database and not distributed across many concurrent processes.
I do agree that the problem exists. However I don't agree that it's
common enough to claim that Django can only support a single-user scenario :-)
Perhaps single-user is an unqualified overstatement, but I don't
think you have to look too far to find the cases where it is correct.
That said, it is nice to see someone admit that this is a problem. :-)
Actually _that_ was never the problem: the ticket on transactions
(whose absence is the actual culprit in this problem area) is ticket #9
:-)
bye, Georg
Perhaps we need another trouble ticket for versioning?
My question is, what happens if I try to use Django to do the same
thing at the New York Daily News?
With, like, 10,000 players.
PS I'm still interested in hearing from people who could help execute
the thing.
Let's just put it this way. I'm using Django to power a database of
more than 4 million records (and growing every week):
http://projects.washingtonpost.com/congress/ . No problems. And
if/when there are problems, they'll be at the database level, not at
the Django level.
Aren't you just talking about the problem that one user may modify a
record while a different user is editing that record at the same time,
and the user editing the record will not automatically see the changes
that were committed by the first user?
If so, then this is not a problem unique to Django. This is how web
development -- and any sort of n-tier application development -- works.
These are stateless apps where the clients maintain no connection to
the server. How do you expect a web app to be immediately updated with
changes on server without continually polling the server for a refresh?
And do you know what kind of hit on scalability that's going to have?
Thousands and thousands of heavily used web apps are being used without
this automatic refresh that you're talking about. It's a design
constraint that app designers work with, in part because it's not a big
deal and in part because it would be a big drag on scalability to
provide that sort of auto-refresh.
Sorry, if you're not talking about any sort of auto-refresh at all, and
are just talking about the fact that in Django the second user's object
will automatically overwrite the object that the first user just saved
in the db, without any warning or error message, then I agree this is a
problem.
I'm a newbie. Is that really the way Django works? If so, I agree that
in the typical business app (i.e., not just cms-type apps) there
should be some way to return this warning to the user.
That is, there should be some way to simulate optimistic locking in the
db. Easiest way I can think of would be to add a timestamp to each
record and to make sure they match when doing updates or deletes, and
to return error message warning user that object has changed if there
is no match. Of course, I expect it'd be a little more difficult than
that...
-- Herb
"And it can't be really solved without transactions, as you would need
to be able to make sure that your fetched object is still the correct
version that's in the database."
If you're talking about starting a transaction that locks the record on
the db when a user begins to view/edit a record, and commits
transaction when user posts changes, then this would be a terrible way
to go for a web app, in my opinion. Works okay for client/server
situations, but it's hard to implement and drastically affects the
scalability advantage a web app has over c/s.
The best method is to simulate optimistic locking in the db.
The two ways I've seen the problem of simulating optimistic locking
solved are:
(1) have timestamp field in db that is used to identify record, along
with primary key. E.g., UPDATE _tbl_ . . . WHERE primarykeyval =
[value] and lastchangedval = [timestampval]. This way the record won't
be found if it has been updated since second user got his copy
(assuming timestamp is updated in every update).
(2) the second way doesn't involve using a timestamp field at all, it
just requires putting the pre-edited value of every non-blob field in
the where clause (won't work for blobs) whenever you do an update or
delete. This way if one of the fields has changed in meantime the
record simply won't be found and an error can be returned. This
requires having the pre-edited value of any changed fields to include when
the update's WHERE clause is created.
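Both approaches can be sketched in a few lines. This assumes SQLite, a
made-up `article` table, and an integer change counter standing in for
the timestamp in method (1); the function names and the `StaleRecord`
exception are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT, lastchanged INTEGER)"
)
conn.execute("INSERT INTO article VALUES (1, 'Draft', 1)")


class StaleRecord(Exception):
    """Raised when the row changed after the caller read it."""


def update_with_token(pk, new_title, lastchanged_seen):
    # Method (1): match on the primary key and the change token read earlier,
    # and bump the token on every successful update.
    cur = conn.execute(
        "UPDATE article SET title = ?, lastchanged = lastchanged + 1 "
        "WHERE id = ? AND lastchanged = ?",
        (new_title, pk, lastchanged_seen),
    )
    if cur.rowcount == 0:
        raise StaleRecord(pk)


def update_with_old_values(pk, new_title, old_title):
    # Method (2): match on the pre-edit value of each field instead of a token.
    cur = conn.execute(
        "UPDATE article SET title = ? WHERE id = ? AND title = ?",
        (new_title, pk, old_title),
    )
    if cur.rowcount == 0:
        raise StaleRecord(pk)


# Both users read the row while lastchanged is 1.
update_with_token(1, "First user's title", 1)  # succeeds, token becomes 2
try:
    update_with_token(1, "Second user's title", 1)  # stale token, no row matches
    second_update_rejected = False
except StaleRecord:
    second_update_rejected = True
```

In both variants the check and the write are a single UPDATE statement,
so there is no gap between comparing and saving. Method (2) as written
does not match fields that are NULL; a real implementation would need to
handle that case.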
-- Herb
It is too bad that Django doesn't support transactions. If you need
them, then you need them badly, and you should find a framework that
supports them to build your application. Many web applications don't
require them, and Django is great for them. I'm hoping Django will
support them soon, because it will broaden the class of applications it
can be used for.
--Ned.
--
Ned Batchelder, http://nedbatchelder.com
Someone needs to be paying attention to the Django PyCon sprint channel ;)
I'm about to check this in :) I've got a few things to fix before I
do, but expect Django to have transaction support by noon.
Jacob
Ned makes a good comment that having this functionality will broaden
the scope of applications that Django can address.
I'm going to dive into the magic stream this weekend, start to learn
the model code and try and lend a hand.
-Z
As Adrian wrote, that's currently not in the scope of Django. But if
you write some code for that, you might want to share it with others.
It shouldn't be too hard if you restrict your models with regard to
layout of fields - for example always name that version field
"version" and put the code to prevent writing outdated objects into the
.save() of the model class.
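As a sketch of how that could look, here is a plain Python class with a
"version" field whose save() refuses to write an outdated object. `Poll`
and `OutdatedObject` are hypothetical and this is not Django's model
API; it only shows where the check would sit, using SQLite underneath:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE poll (id INTEGER PRIMARY KEY, question TEXT, version INTEGER)"
)
conn.execute("INSERT INTO poll VALUES (1, 'Favourite colour?', 1)")


class OutdatedObject(Exception):
    """The row was saved by someone else after this object was loaded."""


class Poll:
    # Hypothetical stand-in for a model class; not Django's real Model API.
    def __init__(self, id, question, version):
        self.id, self.question, self.version = id, question, version

    @classmethod
    def get(cls, pk):
        row = conn.execute(
            "SELECT id, question, version FROM poll WHERE id = ?", (pk,)
        ).fetchone()
        return cls(*row)

    def save(self):
        # Only write if the stored version is still the one we loaded.
        cur = conn.execute(
            "UPDATE poll SET question = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (self.question, self.id, self.version),
        )
        if cur.rowcount == 0:
            raise OutdatedObject(f"poll {self.id} changed since it was loaded")
        self.version += 1


first = Poll.get(1)
second = Poll.get(1)
first.question = "Favourite color?"
first.save()
second.question = "Best colour?"
try:
    second.save()
    outcome = "saved"
except OutdatedObject:
    outcome = "rejected"
```

The caller decides what to do with the exception: reload and retry, or
report the conflict to the user.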
Oh, and transactions actually will solve a very large part of those
problems, because within a transaction a database makes sure that
others won't stomp on your data, either by implicit locking, versioning
or exception-raising. So only stored objects outside of transactional
control will pose problems, and I think in those cases it's fine for
Django to assign the duty of integrity checking to the programmer.
So, no, we don't need another ticket, we just might need some sample
code by someone who needs this feature ;-)
bye, Georg
This should help a lot for high-volume writes, I suspect.
Jacob
(1) Transaction is begun when a client makes a request from the
webserver. The webserver maintains the connection to the db (and the
db locks associated records or tables) until response is received back
from client. At that point the changes are applied, the transaction is
committed, and the connection is released. Because a client ties up a
connection from point of time between request from webserver and time
that changes are sent back, this is not a method that can scale very
well.
(2) Transaction is not begun until client returns changes to apply
them to database. The changes are applied to multiple records, perhaps
master/detail, all within the same transaction. Then transaction is
committed. If there's error then things are rolled back; thus no
change to master table or to detail table without guarantee that all
are made. Since this method maintains open transaction only for brief
period to apply changes in response sent from client it is very
scalable. Unlike method (1) above there are no long-open transactions
and connections are not generally tied up by open transactions. The
downside is that there is no locking within the db when a user
retrieves a request from the webserver. So you have to simulate some
sort of optimistic locking outside the db.
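A minimal sketch of option (2), assuming SQLite and a made-up
master/detail schema (`orders` and `order_line`); `place_order` is a
hypothetical helper. The transaction exists only while the changes are
applied, and a failure in any detail row rolls back the master row too:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT NOT NULL);
CREATE TABLE order_line (
    id INTEGER PRIMARY KEY,
    order_id INTEGER NOT NULL REFERENCES orders(id),
    quantity INTEGER NOT NULL CHECK (quantity > 0)
);
""")


def place_order(customer, quantities):
    # "with conn" commits if the block finishes and rolls back if it raises,
    # so the master row and all detail rows are applied together or not at all.
    with conn:
        cur = conn.execute("INSERT INTO orders (customer) VALUES (?)", (customer,))
        for quantity in quantities:
            conn.execute(
                "INSERT INTO order_line (order_id, quantity) VALUES (?, ?)",
                (cur.lastrowid, quantity),
            )


place_order("alice", [2, 1])
try:
    place_order("bob", [3, 0])  # the second line violates the CHECK constraint
except sqlite3.IntegrityError:
    pass  # the whole order for "bob" was rolled back

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
line_count = conn.execute("SELECT COUNT(*) FROM order_line").fetchone()[0]
```

Nothing is locked while the user is looking at a form, which is why a
separate optimistic check is still needed to detect overwrites.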
I'm trying to figure out what sort of transaction support you're
creating for Django. I have no desire or need to use option (1). It
works okay for client/server apps with relatively small numbers of
users, but I think it's bad programming practice for a disconnected or
distributed application, as any web app is.
With transaction support of type (2) you get the guarantee that any set
of changes will be applied in an atomic all-or-nothing way, but you
have the problem that the db itself isn't doing any locking. This can
be worked around fairly easily by using timestamp fields on records in
addition to a primary key, I think. And this method is very scalable,
unlike method (1).
Thanks for any clarification.
-- Herb
However, since the transactions apply only in the short-lived context of
when the user is saving changes and they're being applied back to the
database there still arises question of how to prevent one user from
overwriting another's changes without knowing. This type of
transaction support does nothing to prevent that.
Seems to me like incorporating some sort of "overwrite-prevention"
logic into the Django-framework is best method of solving the problem.
Many developers _need_ this functionality and it should be _desirable_
even when it's not absolutely necessary. Aren't uneditable timestamps
already part of every Django-table? Aren't they sufficient to provide
necessary versioning info? (I.e., if timestamp in db is different from
timestamp of record being updated or deleted then record has been
modified by intermediary user and -- at a minimum -- warning should be
provided, fuller logic would allow choice of completing or abandoning
the transaction, auto-retrieval of intermediately saved record for
comparison, etc.).
-- Herb