He's put together a newsessions package which could be good, but
obviously needs some discussion (here). So, discuss away!
The proposed solution for collision (more on that in a moment) seems
good -- the bug seems to happen solely because we don't immediately
instantiate and save a Session, so that'd work around it.
I'm more wary of tying to an IP address, because I don't think that
really buys any security; there are still ways to spoof or set up
man-in-the-middle attacks, so the only real solution would be to tell
paranoid people to use encrypted connections (which they should be
doing anyway if they're paranoid :)).
Though I have to say that I'm rather shocked by the ease of session
collision -- it requires the same pair of random numbers (out of a
range of sys.maxint) to come up, in the same order, within a fairly
short period of time, something that just doesn't seem likely to
happen a lot (given sys.maxint on my MacBook, such a collision should
happen roughly once for every 4.6 *quintillion* trials). I know the
'random' module isn't industrial-strength, but surely it's better than
that? And if it's not, then let's start salting it with more stuff.
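(For the curious, the arithmetic behind that figure, assuming a 32-bit
build where sys.maxint is 2**31 - 1:

    >>> import sys
    >>> sys.maxint
    2147483647
    >>> sys.maxint ** 2   # ordered pairs of randints
    4611686014132420609L

which is roughly 4.6 * 10**18, i.e. 4.6 quintillion.)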
--
"Bureaucrat Conrad, you are technically correct -- the best kind of correct."
As far as I can see, newsessions checks REMOTE_IP. Is this a good idea?
a) the client can sit behind a NAT that might hand out different IPs
b) the server can be behind NAT and not see the true IP at all. It
might see different IPs for the same client over time.
c) a crazy load balancer might get in the way
I wouldn't use REMOTE_IP.
In Ticket #1180, Anton said that users reported that they were able to
get someone else's session. I wonder if that might be related to
something quite different. Anton, was this by chance on a threaded
server with a python-mysql < 1.2.1p2?
Here's the (old) code that generates a session:

    def get_new_session_key(self):
        "Returns session key that isn't being used."
        # The random module is seeded when this Apache child is created.
        # Use person_id and SECRET_KEY as added salt.
        while 1:
            session_key = md5.new(str(random.randint(0, sys.maxint - 1))
                                  + str(random.randint(0, sys.maxint - 1))
                                  + settings.SECRET_KEY).hexdigest()
            try:
                self.get(session_key=session_key)
            except self.model.DoesNotExist:
                break
        return session_key
The chance that two consecutive session creations generate the same
session id is 1 : sys.maxint^2, on my system sys.maxint is 2^31. So
we're at 2^62. If the internal state of the random generator is big
enough, it might make sense to hash 4 randints instead of only 2. Or to
check for collisions at the same time as you save the session. But the
chances are good that Anton experiences a different problem.
Michael
(or any proxy, in which case REMOTE_IP will be the same for every
request and you'd want to be looking at the 'X-Forwarded-For' header
-- this is one of the problems with IP-based sessions)
One thing that I would like to suggest is that we do link the SID to a
user-agent (see Chris Shiflett's excellent article on this at [1]), as
this provides some protection (-via-obscurity) against these
collisions, since the attacker not only needs to get a valid SID, but
match it to the correct UA string. There's also no reason for a
"normal" user to change UA strings mid-session, so forcing a fresh
login in that case is acceptable. This can easily be added to the
existing session framework by including the UA string in the hashing.
--Simon
This could be easily fixed by providing Model._update() and Model._insert(),
as proposed here:
http://groups.google.com/group/django-developers/browse_thread/thread/6ddfb7da8319c6df/c2d3671b6a6fec5f
Abstract example code:
http://groups.google.com/group/django-developers/browse_thread/thread/6ddfb7da8319c6df/c2d3671b6a6fec5f#msg_56de2a0ef2b884cc
(First code block; besides needing fewer queries, this would prevent
race conditions.)
SessionManager could then calculate a session key and try to save an object
with it (allowing only INSERT, not UPDATE). If the key already exists,
the DB backend will raise an exception, since the key is unique
(it's the primary key). If it is free, the INSERT ensures that no other
session may use the key.
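A rough sketch of what that could look like -- note that _insert() here is
the *proposed* API (it doesn't exist yet), and the exception type would be
whatever duplicate-key error the backend raises:

    def get_new_session_key(self):
        while 1:
            session_key = md5.new(str(random.randint(0, sys.maxint - 1))
                                  + str(random.randint(0, sys.maxint - 1))
                                  + settings.SECRET_KEY).hexdigest()
            s = self.model(session_key=session_key, session_data='',
                           expire_date=datetime.datetime.now()
                                       + datetime.timedelta(weeks=2))
            try:
                s._insert()         # proposed API: INSERT only, never UPDATE
            except IntegrityError:  # key already taken -- try another
                continue
            return session_key      # the row is ours; no one else can take it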
If I get this right, the newsession patch is trying to do so, but will
fail, because s.save() (django/contrib/newsessions/models.py, line 29)
will overwrite (SQL UPDATE) existing sessions instead of failing to
create them.
One tiny advantage remains: there is less code (= time) between
calculating the session key and saving it, so fewer chances to hit a race
condition.
But I don't see a real fix there.
Simon G. wrote:
> One thing that I would like to suggest is that we do link the SID to a
> user-agent [...]
I've heard about some proxies and anonymizer services that actually do
change the user agent, but I haven't seen this myself, nor searched for
it. ;-)
Greetings, David Danier
UA strings change on every minor rev of some browsers, plus various
tool versions (.net, etc.). Having those users logged out would suck
for my purposes.
Perhaps a setting (sorry!)?
1. REMOTE_IP. I think it is a good idea to check it. Load balancers
act as proxy servers, and every web server has a module that can
restore REMOTE_IP from X-FORWARDED-FOR; for example, Apache's module is
called mod_rpaf. And of course it's very easy to make the REMOTE_IP
check something that can be enabled or disabled via settings.py, just
like it is in PHP -- such a modification is quite simple.
2. In my installation I use lighttpd and Django as a preforked FastCGI
server, Python 2.4. It's hard to reproduce such things, but they _do_
happen. One can say my installation is broken, but I think it's rather
a hole in Django if it allows such things to ever happen.
3. Perhaps the current sessions may be patched too; what is needed is to
force Django to try an INSERT into the table. But that looks like a dirty
hack to me, because the documentation says that if the object's pk
evaluates to True, Django checks whether such a row already exists in the
DB -- and while Django does that check, another process may get the same
session id, try it too, and we get a collision again. My approach is
rock-stable because of the different table design, and this is the only
reason why I used another package name (newsession) instead of just
patching the existing package.
I'm still wary of this -- again, it doesn't do anything to prevent
man-in-the-middle, and it introduces complexity and -- if we're
worried about session collisions -- introduces a whole new class of
bugs when you have a bunch of people behind a NAT or firewall who all,
from the POV of your server, have the same IP address.
> 2. In my installation I use lighttpd and Django as a preforked FastCGI
> server, Python 2.4. It's hard to reproduce such things, but they _do_
> happen. One can say my installation is broken, but I think it's rather
> a hole in Django if it allows such things to ever happen.
I think we definitely need to tighten up the time between selecting a
session key and shoving the session into the DB -- I'm thinking along
the lines of using get_or_create and retrying until you get back a
True creation value from it (that way you don't have to do anything
else -- the session is already in the DB when get_or_create returns).
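Something like this sketch, where generate_key() stands in for the raw
md5/random bit and expire_date is computed as usual:

    while 1:
        session_key = generate_key()
        obj, created = self.get_or_create(session_key=session_key,
                                          defaults={'session_data': '',
                                                    'expire_date': expire_date})
        if created:
            break    # the row is already stored, so the key is reserved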
But I'm still worried about the apparent fact that this is all coming
from key collisions, because what it points to is not a race condition
at the database level -- what it points to is bad randomness.
Python's random-number generator uses the Mersenne Twister, and in
Python 2.3 it seeds itself with the current system time (and in 2.4 it
tries system-supplied randomness facilities first, then falls back to
system time).
I'm not an expert on randomness by any means, but the problem we have
here is the same pair of random numbers being generated at the same
time or within a very short time of one another, and my first
inclination is to look at the fact that we're using a deterministic
RNG seeded -- by default -- with the system time.
So instead of trying to ensure safety at the database level (because,
frankly, unless we go to extremes we're not going to manage that), why
not look at better seeds for the RNG and solve the real problem?
Well, I don't agree with you here:
1. This option can be disabled by default and only enabled optionally
by the developer (like in PHP, as I said)
2. I have an ethernet connection @home and I sometimes log in to our
private web apps from home. Any 'c00l hacker' is able to sniff network
traffic, get my session id and use it to join my session too, just
because there is absolutely no checking of who uses a session. We added
IP checking at our application level, but why not implement this on the
Django side as an option? No one says this is a 100% guarantee of
getting rid of man-in-the-middle, but at least it prevents successful
hacks by kiddies on untrusted networks (i.e. public wi-fi, ethernet,
etc.).
> I think we definitely need to tighten up the time between selecting a
> session key and shoving the session into the DB -- I'm thinking along
> the lines of using get_or_create and retrying until you get back a
> True creation value from it (that way you don't have to do anything
> else -- the session is already in the DB when get_or_create returns).
>
> But I'm still worried about the apparent fact that this is all coming
> from key collisions, because what it points to is not a race condition
> at the database level -- what it points to is bad randomness.
Let's separate problems because there are 2 problems:
1. Somehow same ids are generated in a short period of time. So id
generation needs to be improved.
2. Django allows session collisions.
I am not an expert on the first problem, but I definitely think the
second problem must be solved in the way I suggested, instead of by
shortening the time between checking and storing the new record in the
DB. Just imagine: you have a gun that sometimes fires at its holder. I
suggest changing the construction to make that impossible in any way;
you suggest changing the way the holder should pull the trigger, hold
the gun, etc. But if we don't change the construction, whatever we do,
there still IS a possibility to shoot yourself :) Hope I explained it
well enough.
I guess I just feel that someone who's determined enough to sit around
and sniff for your session ID is also going to be determined enough to
defeat IP checking (which wouldn't be that hard -- again, the real
solution to that is encrypted connections, not an ever-growing list of
"sessions are tied to x, y and z" checks), so it doesn't get us any real
improvement in security. But I'll bow out of that debate.
> Let's separate problems because there are 2 problems:
> 1. Somehow same ids are generated in a short period of time. So id
> generation needs to be improved.
> 2. Django allows session collisions.
Problem 2 is a direct consequence of Problem 1. In fact, Problem 2 is,
I think, just a symptom of Problem 1 -- you can't have a DB race
condition with sessions unless you're generating identical session
keys for different people in the first place.
I'm not saying I'm against separating insert and update in the ORM,
though -- I'm just trying to think pragmatically: this is a nasty bug
that needs to be fixed as soon as possible, and separate insert/update
would be a non-trivial refactoring of the Django ORM and wouldn't
solve the underlying source of this bug, which is duplicated session
keys. Meanwhile, coming up with a better way to generate session keys
-- probably through seeding the RNG -- is easy to do and solves the
underlying problem.
Which means my vote is going to be for better session keys now, and a
design decision about insert/update when we do the QuerySet refactor
post-0.96.
> I suggest changing the construction to make that impossible in any way;
> you suggest changing the way the holder should pull the trigger, hold
> the gun, etc. But if we don't change the construction, whatever we do,
> there still IS a possibility to shoot yourself
Again, for me it's a question of priority -- if we can solve the
key-collision problem by seeding the RNG, that's a better solution for
right now than overhauling the ORM's object-saving methods. To run
with your analogy, we've already been shot and we're bleeding, and the
gun is still going off. Our priority right now should be to stop the
shooting and tend to our wounds, or at least to point it away from us,
not to go shopping for a better gun.
James, there is a concept of being 'fool proof'. Real hackers may do many
things, but the current model allows even 10-year-old kids to be hackers;
this is aimed just at them. There is no way to protect all sites with
SSL, and I would not like it if a neighbor's son could read my email
because he got my session id by running a simple Ethereal capture.
> I'm not saying I'm against separating insert and update in the ORM,
> though
Have you ever looked at my code? There are no hacks of the ORM; there is
just a different sessions table design where the session id is no longer
the primary key but just a unique key. The primary key is a separate
field, and this approach allows us to use the existing ORM to do things
the correct way, where a session collision is impossible at the table
level, so impossible by design.
And again: I am not voting against improving the ID generation; I
think both problems should be solved.
ak wrote:
I like that your code guarantees that you don't issue a session id that
is still in use. I don't care how this is achieved, be it with a
separate id or with using get_or_create, as long as it works without
leaving holes as the current implementation does. Is there a good reason
for the current implementation?
But I don't like that newsession insists on checking the IP.
You argue that it protects against an attacker who can sniff the session
id and replay it. But wouldn't a hacker who is able to observe the
traffic simply sniff the password, too? I see problems with IP checking
(as mentioned in posts before), but no real added security.
But something else must be going wrong in your case, as the chance of
getting the same session ids should be so tiny that you really shouldn't
be getting reports from your users. Comments say that the random number
generator is seeded when the server process starts, and it's seeded with
the current time. Isn't that bound to fail with multiple processes that
start at the same time? I don't know much about the random module, though.
Michael
And one more thing about IP checking: ok, I agree about network
sniffing (but there is still the possibility that the sniffer was started
after I logged in, so the attacker could not see my password but does see
my session id :) ). But please don't forget that session cookies may also
be stolen via XSS.
I've just recalled a case... I was using Django's randomly generated
passwords and it worked fine under mod_python. Then I switched to
FastCGI, and from some moment *all* generated passwords started coming out
identical. The thing was that mod_python was reloaded from time
to time by Apache (using MaxRequestsPerChild), while the flup-based FastCGI
processes were never reloaded. Over a long time, Python's random generator
just degenerated into handing out the exact same values (this was Python 2.4 on Debian).
I solved the problem by manually re-seeding the generator based on time
and a couple of other request-related data. This has now worked for about
half a year without problems. In our session case this can be as simple as:

    random.seed(str(datetime.now()) + str(id(request)))
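Spelled out with its imports (what exactly goes into the salt is
debatable; id(request) just happens to vary per request object):

    import random
    from datetime import datetime

    def reseed(request):
        # time alone isn't enough -- two processes can hit the same
        # tick -- so mix in something request-specific as well
        random.seed(str(datetime.now()) + str(id(request)))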
> I've just recalled a case... I was using Django's randomly generated
> passwords and it worked fine under mod_python. Then I switched to
> FastCGI, and from some moment *all* generated passwords started coming out
> identical. The thing was that mod_python was reloaded from time
> to time by Apache (using MaxRequestsPerChild), while the flup-based FastCGI
> processes were never reloaded. Over a long time, Python's random generator
> just degenerated into handing out the exact same values (this was Python 2.4 on Debian).
Fascinating. `manage.py runfcgi` now has a parameter `maxrequests`, so you
can force it to restart, too.
In Python 2.3, the random generator seems to be seeded at the time when the
module `random` is imported for the first time. I suspect that when all the
Apache child processes are forked, it's possible that a few get seeded
with the same system time. The same goes for FastCGI.
Beginning with python2.4, the seed also uses os.urandom() when available, so
it starts to get safer.
How about this: we could fix this for Python 2.3 by seeding the random
generator on Django startup, using the random function of the database
together with urandom (when available) and the system time.
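Maybe something like this at startup (untested sketch; I'm hand-waving
the database-randomness part):

    import os, random, time

    def seed_rng():
        try:
            entropy = os.urandom(16)   # python 2.4+, when the OS provides it
        except (AttributeError, NotImplementedError):
            entropy = ''               # python 2.3, or no OS entropy source
        random.seed(entropy + str(time.time()) + str(os.getpid()))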
I'm not sure whether it's a good idea to add periodic reseeding.
Michael
This is what SSL is for.
Tying the session to IP is a usability nightmare. What if I'm halfway
through a long application process when my wifi goes out of range for
a moment and the DHCP server gives me a new IP? Or I move from wifi to
my cable connection because I'm moving to my desk, or my DSL provider
decides it's time to give me a new IP, or I'm browsing from my phone
and my IP changes each time I connect?
Tying the session to IP provides no extra security and degrades the
experience in the common case. Applying SSL adds security and does not
degrade the experience.
Entropy:
Could the RNG be re-seeded when a new session is created? Is it an
expensive operation?
Could the entire request header be used as entropy? That doesn't
necessarily tie the session model to the request, as the entropy could
be an argument to get_new_session_key, and I would imagine 99% of
calls to this method come from the process_response method of
SessionMiddleware.
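For instance (sketch -- the entropy parameter is hypothetical):

    def get_new_session_key(self, entropy=''):
        # the same loop as today, with caller-supplied entropy mixed in
        while 1:
            session_key = md5.new(str(random.randint(0, sys.maxint - 1))
                                  + str(random.randint(0, sys.maxint - 1))
                                  + entropy + settings.SECRET_KEY).hexdigest()
            try:
                self.get(session_key=session_key)
            except self.model.DoesNotExist:
                break
        return session_key

SessionMiddleware.process_response could then pass, say, str(request.META)
as the entropy argument.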
I am quite new to Django and I've been trying to get all the info I can
by, amongst other means, reading through django-dev and django-users. I
don't remember in which I read this, but fastcgi, the way it is designed,
doesn't work well with threads. The actual details escape me, but I'm
certain the thing that made this come up had to do with information from
one user being magically overwritten by something submitted by another
user... which sounds terribly like your point #2.
The quick fix on that occasion was using a forking model in Apache
instead of multithreading, iirc.
Just a thought,
henrique
> Beginning with python2.4, the seed also uses os.urandom() when available, so it starts to get safer.
and our app runs under a Xen domU Debian installation, so maybe there's
something related to Xen kernel entropy
Uhm... But it still happens only once on first module import. So it's
not safer at all.
> I'm not sure whether it's a good idea to add periodic reseeding.
Why not? I'm not an expert either, but I just don't see any disadvantages
besides speed, which is still incomparable to the cost of INSERTing a
session into the database.
>
> Michael Radziej wrote:
> > Beginning with python2.4, the seed also uses os.urandom() when available, so
> > it starts to get safer.
>
> Uhm... But it still happens only once on first module import. So it's
> not safer at all.
If you have multiple processes which initialize their random
module at exactly the same system time (how granular is it?), you'll get
the same seed and therefore the same sequence of pseudorandom numbers in
Python 2.3.
In Python 2.4, the seeds will be different, thus you'll get different
numbers.
Just to be clear: something that prevents handing out a session
id that is already in use should be installed additionally.
>
> > I'm not sure whether it's a good idea to add periodic reseeding.
>
> Why not? I'm not an expert either, but I just don't see any disadvantages
> besides speed, which is still incomparable to the cost of INSERTing a
> session into the database.
The sequence should obey certain other statistical rules, and I am not sure
whether reseeding could break that. I simply know enough about it to be
cautious, and I'm not deep enough into it to recommend reseeding or not
reseeding ;-)
I'd like to just throw out a few comments.
1) The current session framework unquestionably needs some attention.
The way in which sessions are being generated isn't terribly
efficient, and can certainly lead to collisions.
2) James has suggested that the answer is, perhaps, better seeding. I
really don't think that this is an issue. Regardless of the seed,
there is still a mathematical chance that a collision will occur. The
very definition of random means that you can't predict what will come
next. There's a very real chance that any random number generator
will give you the exact same number 100 times in a row. It's a small
probability, but it is completely possible... any random number
generator that tries to avoid this is no longer completely random.
The default Python implementation of 'randomness' should be just fine
for our purposes. To my knowledge, Mersenne is not cryptographically
secure; however, we're not using it for cryptographic purposes...
we're using it as input to a hashing function.
This brings me to my next point -- hashing. Hashing converts input to
a fixed bit-length field. Even if we have truly random and truly
unique input to the hashing function, it's possible (again, highly
unlikely, but possible) that the result of hashing two inputs will
yield the same session key.
Unless we're enforcing this at the database level, there's still the
possibility of collision.
3) In terms of seeding, the default Python approach should be fine (I
can't speak at all to Python 2.3's suitability). It's based on
start-up time of the process. It's unlikely that these seed values
will occur at the -exact- same moment in time (it is, you guessed it,
still possible).
Continual reseeding will actually probably make the problem worse, and
will degrade performance. In a large deployment with lots of
processes continually reseeding, there's a much higher likelihood that
they will hit the same seed, and thus generate the same random number
pattern. I'm fairly certain that the likelihood of collision here is
much higher than using the default Python on-import seeding mechanism.
Again, I see the best solution is to move final collision avoidance to the DB.
4) In terms of tying sessions to an IP, I am very much against this.
It creates complexity, degrades user experience for some users, and
doesn't really add any security. If someone is in a location where
they can sniff your traffic, they can very easily spoof your IP. You
should be using SSL to encrypt this traffic if you are concerned about
this possibility.
5) Thank you to Anton and SmileyChris for bringing this issue
front-and-center again.
Recap:
+1 giving some love to the current session framework
+1 ensuring uniqueness at the DB level
-1 Giving Django a new RNG
-1 Re-seeding on every session generation
-1 Tying sessions to IP
- Ben
Yeah... I actually don't object to your approach at all, I'm just
guessing why random numbers could become... uhm... not that random.
This is true, but given the range the numbers are being pulled from,
the odds of a collision should be around one in four quintillion.
Obviously, we're nowhere near that, so something's up with the
"random" numbers we're getting -- occasional flukes are one thing, but
the apparently consistent collisions people are reporting are quite
another.
> The default Python implementation of 'randomness' should be just fine
> for our purposes. To my knowledge, Mersenne is not cryptographically
> secure; however, we're not using it for cryptographic purposes...
> we're using it as input to a hashing function.
Yes. And for a given input, md5 will always produce the same hash.
And -- most importantly for our purposes -- Python's RNG is completely
deterministic: given the same seed, the underlying Mersenne
implementation will produce the same sequence of numbers. And by
default on Python 2.3, the seed is the startup time, which means that
processes spawned at the same time (say, by a web server starting up)
can get the same seeds and thus will produce the same sequence of
"random" numbers, like clockwork.
I'm not 100% certain, but I'd bet that's why we're seeing more
collisions than we'd expect.
> Unless we're enforcing this at the database level, there's still the
> possibility of collision.
We are "enforcing it" at the database level -- the session key is the
primary key of the table and thus must be unique. The problem, in a
nutshell, seems to be this:
1. User A visits the site and server process A generates a session
key, checks the database to see whether that key is in use, finds that
it is not, and assigns that key to User A.
2. At or around the same time, User B visits the site and server
process B generates the same session key (probably because both
processes started out with the same seed for Python's RNG). Because
User A's session has not yet been serialized and stored, server
process B also finds that the key is not in use, and assigns that key
to User B.
3. At some later point, both sessions are serialized and stored; the
first one executes an INSERT -- because Django does not distinguish
INSERT from UPDATE for purposes of a model's 'save' method -- and the
second executes an UPDATE on that row. Assume for the moment that User
B's session gets the INSERT and User A's session gets the UPDATE.
4. When User B comes back to the site, he finds that the Django
session framework thinks he is User A.
There are several possible approaches here:
1. Manually seeding the RNG with a value unique to the server process,
or to the request, would move the probability of key collision closer
to what it "should" be, thus dramatically reducing the problem.
2. Changing the session framework's behavior so that it stores the
session in the database as soon as it finds an unused key (e.g., by
looping with 'get_or_create' until it gets a new object) would
drastically reduce the odds of two independent processes with the same
key each thinking that their key is unused.
3. Changing Django's ORM to separate INSERT from UPDATE would make it
impossible for two processes to each execute a 'save' on sessions with
the same key and think that they each created a new row.
Options 1 and 2 are relatively easy and could be implemented quickly.
Option 3 is attractive, but would be harder to do.
(though, technically, it would also be possible to use locking or
transaction isolation to achieve the database-level enforcement, but
I'm wary of doing so)
Personally I'd vote for a combination of options 1 and 2 for now, and
a long, hard look at option 3 when we do the QuerySet refactor.
Out of curiosity, how many people who are seeing this bug happen are
also running on Windows? On Unix we can count on fine-grained
timestamps for 2.3 and the presence of /dev/random for 2.4, but on
Windows... who knows?
> The sequence should obey certain other statistical rules, and I am not sure
> whether reseeding could break that. I simply know enough about it to be
> cautious, and I'm not deep enough into it to recommend reseeding or not
> reseeding ;-)
In the case of Python, the initial seed doesn't affect the quality of
the randomness too much, but identical seeds will produce identical
sequences of numbers -- you can verify this yourself by manually
calling random.seed(), passing in a seed of your choice, and watching
the sequence of random numbers it produces afterward. If you start a
different Python process and feed it the same seed, you'll see the
same sequence of numbers come out.
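For example:

    import random

    random.seed(42)
    a = [random.random() for i in range(3)]
    random.seed(42)    # reseed with the identical value
    b = [random.random() for i in range(3)]
    assert a == b      # same seed, same "random" sequence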
Problem is reduced... but still remains.
> 2. Changing the session framework's behavior so that it stores the
> session in the database as soon as it finds an unused key (e.g., by
> looping with 'get_or_create' until it gets a new object) would
> drastically reduce the odds of two independent processes with the same
> key each thinking that their key is unused.
Very inefficient for a generic module widely in use.
> 3. Changing Django's ORM to separate INSERT from UPDATE would make it
> impossible for two processes to each execute a 'save' on sessions with
> the same key and think that they each created a new row.
Ideal solution, exactly what it should have been in the first place.
Instead of ugly reinventions of integrity enforcement, this should be
left to the database -- exactly where it belongs.
> Options 1 and 2 are relatively easy and could be implemented quickly.
> Option 3 is attractive, but would be harder to do.
I don't think this would be any harder than:

    while True:
        session.id = generate_new_session_id()
        try:
            session.insert()
        except:        # duplicate key -- pick a new id and retry
            continue
        break          # INSERT succeeded; the key is reserved
What I don't understand is: why don't the developers do the correct
thing by providing separate insert() and update() in addition to
save()? All it requires is a trivial refactoring, and it actually
simplifies the code (also clarifying the intent of the programmer).
The problem is not with sessions alone, but with any CRUD view.
Say you get input from normal users. And your current policy is that
users may only add new data but can't change existing data (typical
for comments, forums, feedbacks etc.). Now, if you just construct an
object from user supplied data and simply call save() -- on
maliciously crafted data -- you may end up overwriting existing data!
Without availability of INSERT, the only way you can save yourself is
by doing get_or_create() (or something similar) -- which is twice as
slow. Why imitate C when we have exceptions in Python? And I heard
something about being "perfectionist"...
Silently overwriting existing data has always been stupid, especially
when the data has come from humans!
Regards,
Arvind
One reason it isn't there is because many of the use cases that require
differentiating between updating and creating a new object can already
be implemented without extra code. Requiring an application developer to
always specify update or create anew when it can usually be inferred
just adds to their workload without reason.
> The problem is not with sessions alone, but with any CRUD view.
>
> Say you get input from normal users. And your current policy is that
> users may only add new data but can't change existing data (typical
> for comments, forums, feedbacks etc.). Now, if you just construct an
> object from user supplied data and simply call save() -- on
> maliciously crafted data -- you may end up overwriting existing data!
If you create a new object, leaving the primary key field empty and call
save() on it, Django will never overwrite existing data. You will only
overwrite existing data if you also specify the primary key field. So we
have a way to enforce "create" already for the typical uses (e.g.
comments, forums, feedbacks).
The problem being discussed in this thread is not about update vs.
creation, it's about enforcing uniqueness and how best to ensure that
when we save the object we won't have a validity constraint violation.
> Without availability of INSERT, the only way you can save yourself is
> by doing get_or_create() (or something similar) -- which is twice as
> slow. Why imitate C when we have exceptions in Python? And I heard
> something about being "perfectionist"...
>
> Silently overwriting existing data has always been stupid, especially
> when the data has come from humans!
You might find your concerns listened to a bit more sympathetically if
you adopted a less confrontational tone in your email. Calling the
developers stupid and being sarcastic about perfectionism is probably
not the best approach to suggesting an alternative (particularly when,
despite your belief, it's not necessarily the unique solution). There
are arguments for your approach, but it's not a slam-dunk and is also
not particularly germane to this thread. It's a discussion worth having
in a clam manner in another thread.
Thank you for your future consideration,
Malcolm
Did anyone else picture mollusks having a conversation when they read
that? ;-)
--
Jon
Do we really need to re-seed the RNG? What if we did something like:
    session_key = md5.new("%x%s%s" % (random.getrandbits(64),
                                      datetime.datetime.utcnow(),
                                      settings.SECRET_KEY)).hexdigest()
datetime.datetime.utcnow could be replaced with whatever would be used
for the seed. You could include os.getpid if you desire.
This brings the differentiation into the key generation directly, not
the seed. If two seeds are the same, the result will be an identical
random number; however, this way, the two RNGs would have to be in
perfect sync -and- the functions would have to be called at the exact
same time to generate a pre-hash collision. This seems far less
likely than seeding at the same instant.
> 2. Changing the session framework's behavior so that it stores the
> session in the database as soon as it finds an unused key (e.g., by
> looping with 'get_or_create' until it gets a new object) would
> drastically reduce the odds of two independent processes with the same
> key each thinking that their key is unused.
A quick look at the code shows that "get_or_create" has the exact same problem.
It checks the DB, then, if the entry doesn't exist, follows the same
"obj.save()" path.
As a result, this solution would still result in a race condition very
similar to the current implementation.
> Options 1 and 2 are relatively easy and could be implemented quickly.
I think that the general spirit of option 1 is in the right direction.
Option 2 might be an acceptable stop-gap, but it's still not
bulletproof.
> Option 3 is attractive, but would be harder to do.
I agree it would be harder, but it is probably the best option (and
the only bulletproof one we've discussed to date). I understand that
some of the refactoring work taking place will make it easier, so it
may be worth waiting on those changes.
- Ben
Design discussion is healthy. For example, there has been quite a good
amount of feedback on why IP-locking is not really a sensible default.
Of course, some of the discussion has gone around in circles with the
same points being raised again and again, but that's normal for any
thread longer than four or five posts. :-)
Most of your proposed changes seem reasonable. At the moment, it's
post-0.96 from my perspective, since the patch needs to be rewritten,
reviewed, etc., and hopefully we are really close to a 0.96 release.
The current version (in #3716) is not a 3 line patch. We're not likely
to commit an entire parallel version of session management. Either the
current version is sufficient or it should be fixed. If you could
rewrite it as a small fix against the existing session module, making
sure it remains backwards compatible, that would be great and help move
it forwards.
Regards,
Malcolm
The point is: it's not possible to extend the session in an
upwards-compatible way, because there's no way to make sure
that an INSERT is used when you need to set the pk before save().
get_or_create() first tries a get(), and calls save() if the get() failed.
This still leaves an opportunity for a different process to add the same
session key in the time between get() and save().
It's possible to get this done with the help of another table, but
that would still need a syncdb after an update, and it's also
a rather dirty hack.
So, it's either a change in the database API or a change in the
database table for sessions.
I'm +1 for extending the API to allow explicit inserts.
I never suggested throwing away save(). save() (in semantics) should
remain as it is. Just its code should be broken into insert() and
update(), and it can insert() or update() depending on whether it got a
primary key as argument or not.
Check here for simple example:
http://groups.google.com/group/django-developers/browse_thread/thread/6ddfb7da8319c6df/c2d3671b6a6fec5f#msg_56de2a0ef2b884cc
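In other words, roughly this shape (hypothetical names; this is not the
code from that link):

    class Model:
        def save(self):
            # unchanged external semantics, now built from two
            # explicit primitives that could also be called directly
            if self._get_pk_val() and self._row_exists():
                self.update()   # UPDATE ... WHERE pk = ...
            else:
                self.insert()   # plain INSERT; fails on a duplicate pk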
> If you create a new object, leaving the primary key field empty and call
> save() on it, Django will never overwrite existing data. You will only
> overwrite existing data if you also specify the primary key field. So we
> have a way to enforce "create" already for the typical uses (e.g.
> comments, forums, feedbacks).
I do understand that, but when one gets input from the user, one
typically iterates through the keys in the REQUEST dictionary and
populates the necessary fields. In the present way of doing things, the
application developer has to check for and ignore id values if he
doesn't want an update in a view. This is unnecessary manual checking.
Please think about it for a moment.
> The problem being discussed in this thread is not about update vs.
> creation, it's about enforcing uniqueness and how best to ensure that
> when we save the object we won't have a validity constraint violation.
IMHO, the core problem is the absence of insert() and update(). Without
the availability of insert()-like semantics, one has to do inefficient
(and still not perfect) manual checking to enforce uniqueness of
primary keys. This problem occurs with any model that wants to use a
custom primary key.
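A tiny example of the hazard with a user-supplied primary key (sketch):

    from django.db import models

    class Page(models.Model):
        # the user-visible slug doubles as the primary key
        slug = models.CharField(maxlength=50, primary_key=True)
        body = models.TextField()

    # in a view: if the POSTed slug already exists, save() silently
    # turns into an UPDATE and clobbers the existing row
    Page(slug=request.POST['slug'], body=request.POST['body']).save()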
> You might find your concerns listened to a bit more sympathetically if
> you adopted a less confrontational tone in your email. Calling the
> developers stupid and being sarcastic about perfectionism is probably
> not the best approach to suggesting an alternative (particularly when,
> despite your belief, it's not necessarily the unique solution). There
> are arguments for your approach, but it's not a slam-dunk and is also
> not particularly germane to this thread. It's a discussion worth having
> in a clam manner in another thread.
Sorry for the sarcasm. But if you check the previous thread:
http://groups.google.com/group/django-developers/browse_thread/thread/6ddfb7da8319c6df/c2d3671b6a6fec5f
you may understand my frustration. I love Django and that's why I
cared to report back the problem (with solution).
The design of the ORM (in this regard) is unnecessarily restrictive and
leads to uniqueness problems whenever custom primary keys (anything
other than auto-incrementing integers supplied by the underlying
database) are used. The other proposed solutions to the sessions problem
are either very inefficient or not perfect.
IMHO, the problem should be fixed at the ORM level (especially, when
the fix is trivial). It would also be useful for people (like me) who
want to use custom primary keys in their models.
Arvind
Because doing so would introduce some backward incompatibilities.
Therefore, if this solution is considered, it will *only* be considered
*after* 0.96 is released. That said, this is still a bug that needs to be
addressed at some level prior to 0.96, and that is where options 1 & 2
come in.
>
> The problem is not with sessions alone, but with any CRUD view.
And that is why it has been said a few times already that this will be
looked at when QuerySet is refactored -- again, *after* 0.96. I'd really
like to see separate insert() and update() as well, but to push for them
before 0.96 would be wasting my breath, as the devs have made it very
clear (here and elsewhere) that it is not going to happen; they've
promised backward compatibility for 0.96.
>
> Say you get input from normal users. And your current policy is that
> users may only add new data but can't change existing data (typical
> for comments, forums, feedbacks etc.). Now, if you just construct an
> object from user supplied data and simply call save() -- on
> maliciously crafted data -- you may end up overwriting existing data!
>
> Without availability of INSERT, the only way you can save yourself is
> by doing get_or_create() (or something similar) -- which is twice as
> slow. Why imitate C when we have exceptions in Python? And I heard
> something about being "perfectionist"...
>
> Silently overwriting existing data has always been stupid, especially
> when the data has come from humans!
>
> Regards,
> Arvind
>
> >
--
Waylan Limberg
way...@gmail.com
That's not true, if you provide a save() method that calls either
update() or insert(). save() should not be removed; it makes many things
very easy. So only additional flexibility is added, without removing or
changing any interface.
Greetings, David Danier
I can't speak for anyone else, but...
Notice how I haven't said "I don't think we should do this". I've been
saying "I think the real source of the sessions bug is somewhere else,
and we should fix that first, then look at INSERT vs. UPDATE when we
do the QuerySet refactoring".
I agree that the INSERT/UPDATE distinction is something we should look
at, but I feel like this isn't really the right time; making an API
change in response to a bug -- even when it seems like a trivial
change -- often has unpleasant consequences, simply because it means
making the change quickly and under pressure.
So I'd prefer to deal with the key-collision problem at the source by
doing what we can to stop it from happening in the first place, and
then deal with INSERT/UPDATE when we don't have the specter of a nasty
bug hanging over our heads and pushing us to rush things. And since
we're already going to be looking at QuerySet after 0.96, I feel like
that's a natural time to sit down and look at model saving and INSERT
versus UPDATE.
James Bennett said:
> On 3/13/07, Michael Radziej <m...@noris.de> wrote:
> > a) the client can sit behind a NAT that might hand out different IPs
> > b) the server can be behind NAT and not see the true IP at all. It
> > might see different IPs for the same client over time.
> > c) a crazy load balancer might get in the way
>
> (or any proxy, in which case REMOTE_IP will be the same for every
> request and you'd want to be looking at the 'X-Forwarded-For' header
> -- this is one of the problems with IP-based sessions)
Also, X-Forwarded-For is not a standard header -- it's something Squid
does. Granted, Squid is the most common proxy, and most other proxies
will follow its lead in supplying this header, but there is absolutely
no guarantee that a HTTP/1.1 proxy will send this header.
Simon G. said:
> I actually passed this onto the security mailing address, as I thought
> it was better to be safe than sorry. Adrian's response was that tying
> things to the IP address is not a good idea (for the reasons that
> others have stated in this thread).
Agreed.
> One thing that I would like to suggest is that we do link the SID to a
> user-agent (see Chris Shiflett's excellent article on this at [1]), as
> this provides some protection (-via-obscurity) against these
> collisions, since the attacker not only needs to get a valid SID, but
> match it to the correct UA string. There's also no reason for a
> "normal" user to change UA strings mid-session, so forcing a fresh
> login in that case is acceptable.
>
> This can easily be added to the existing session framework by
> including the UA string in the hashing
I think the problem with doing that is a bit of a false sense of
security about collisions. With a full apartment complex behind a
NAT, there are easily 100 people surfing the net using the same
version of Internet Explorer, so adding user-agent doesn't solve
anything, it just pushes it down to "fewer people get screwed".
James Bennett said:
> Python's random-number generator uses the Mersenne Twister, and in
> Python 2.3 it seeds itself with the current system time (and in 2.4 it
> tries system-supplied randomness facilities first, then falls back to
> system time).
>
> I'm not an expert on randomness by any means, but the problem we have
> here is the same pair of random numbers being generated at the same
> time or within a very short time of one another, and my first
> inclination is to look at the fact that we're using a deterministic
> RNG seeded -- by default -- with the system time.
>
> So instead of trying to ensure safety at the database level (because,
> frankly, unless we go to extremes we're not going to manage that), why
> not look at better seeds for the RNG and solve the real problem?
If this race condition occurs between different processes, os.getpid()
will provide randomness. If this race condition occurs between
different threads in the same process, the file descriptor (integer)
of the connected socket will be different, because they are different
connections. Does Django have access to the Python socket object so
it can get sock.fileno() ?
Some combination of current time, current process id, and current file
descriptor should be enough to avoid MT seed race conditions.
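For example:

    import os, random, time

    def reseed(sock=None):
        # time catches restarts, the pid separates forked processes, and
        # the socket fd separates concurrent connections within a process
        parts = [str(time.time()), str(os.getpid())]
        if sock is not None:
            parts.append(str(sock.fileno()))
        random.seed(''.join(parts))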
ak said:
> 1. This option [IP-based session checking]
> can be disabled by default and only enabled optionally
> by the developer (like in PHP, as I said)
One of the benefits I see in Django is allowing non-web-application
develpers to "get things right" without really having to understand
everything that they're doing. Giving developers an option to use
IP-based checking has several downsides:
- it encourages using obscurity as an alternative to security
- it encourages developers to do something that is known to have very
real problems (the old AOL-proxy argument).
Just because PHP does something doesn't mean that it's a good idea.
If you really are worried about XSS attacks and the like, you should
look at Wordpress. All of their administrative actions are protected
by nonces, mitigating the damage that can be done by a successful
cookie theft (and it doesn't depend on proxies or clients "following
the rules"). Not that WP is a shining example of secure programming,
but their strict adherence to nonces does provide a good second layer
of defense.
> 2. I have an ethernet connection @home and I sometimes log in to our
> private web apps from home. Any 'c00l hacker' is able to sniff network
> traffic, get my session id and use it to join my session too, just
> because there is absolutely no checking of who uses a session. We added
> IP checking at our application level, but why not implement this on the
> Django side as an option? No one says this is a 100% guarantee of
> getting rid of man-in-the-middle, but at least it prevents successful
> hacks by kiddies on untrusted networks (i.e. public wi-fi, ethernet,
> etc.).
HTTPS for public sites or a secure VPN for company intranets.
Anything else I've seen suggested in this thread is defeatable by
someone sufficiently dedicated.
--
Tom Barta
It would be better to be backwards compatible. Certainly no backwards
incompatible solution will go in before 0.96 and once we make that
decision, it's worth taking the time required to make sure we have
examined all options.
Maybe we need to look at explicit insert() vs. update(), maybe one of
the alternatives is possible. Adding an extra field to the sessions
table would be a last resort, since no real extra information is carried
in that field -- it's an implementation workaround, by the looks of it.
So it looks like there is still some design work and implementation
experimenting that needs to be done here.
Regards,
Malcolm
That works too :)
I'm just wanting anything which reduces the odds of a collision,
because it feels like some deployments are getting lots of them (I
have a suspicion this is system-specific because of how Python
initially seeds its RNG).
> This brings the differentiation into the key generation directly, not
> the seed. If two seeds are the same, the result will be an identical
> random number; however, this way, the two RNGs would have to be in
> perfect sync -and- the functions would have to be called at the exact
> same time to generate a pre-hash collision. This seems far less
> likely than seeding at the same instant.
I think the collision problems are largely coming from groups of
server processes being spawned at the same time (anecdotally, someone
mentioned to me that on Windows Python can't get sub-second precision
for the initial time seed, which, if correct, may explain a lot of
this).
> A quick look at the code shows that "get_or_create" has the exact same problem.
> It checks the DB, then, if the entry doesn't exist, follows the same
> "obj.save()" path.
> As a result, this solution would still result in a race condition very
> similar to the current implementation.
Yes and no. Part of the problem right now, I think, is that we wait to
store the session, which increases the window in which a collision can
cause problems -- immediately saving the session as soon as we hit an
unused key reduces that window to, most likely, a fraction of a
second, which requires stupendous coincidences to screw up. I'd feel
comfortable with a combination of this and the above option as the
initial fix.
> I agree it would be harder, but it is probably the best option (and
> the only bulletproof one we've discussed to date). I understand that
> some of the refactoring work taking place will make it easier, so it
> may be worth waiting on those changes.
Agreed; again, I feel like trying to change the API or the mechanics
of model saving while we have a bug like this hanging over our heads
is a recipe for trouble down the line -- better to solve the
key-collision problem right now, and then have a little breathing room
to work out a model-save solution we can hang on to.
Again, I'm not disagreeing with the INSERT/UPDATE suggestion, I'm just
saying that a change to the model API should not take place while we
have a bug hanging over our heads; let's fix the random-number
collision, reduce the window between getting a session key and saving
the session, and then look at model saving once we can breathe again.
Making a design decision like this under the pressure of a bug that we
all want fixed is not a good idea ;)