Help resolve massive performance regression in 2.7 vs 2.5 runtime


Pol

Nov 12, 2011, 9:48:57 PM
to Google App Engine
Hi,

Since switching to the 2.7 runtime, logging in to http://www.everpix.com
went from about a second to anywhere from 15s to 60s. I tracked it
down to this single password-checking line:

from bcrypt import bcrypt
bcrypt.hashpw(password, self.password_hash) == self.password_hash

This comes from "a native Python implementation of the py-bcrypt
package from http://www.mindrot.org/projects/py-bcrypt/" grabbed from
here: https://github.com/erlichmen/py-bcrypt.
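
For what it's worth, here's roughly how I measured it (a sketch, not my
exact code):

import logging
import time

start = time.time()
bcrypt.hashpw(password, self.password_hash)
logging.info("bcrypt.hashpw took %.1fs", time.time() - start)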

So what's happening here and how can we fix this?

Thanks,

- Pol

Brian Quinlan

Nov 12, 2011, 9:58:24 PM
to google-a...@googlegroups.com
Hi Pol,

On Sun, Nov 13, 2011 at 1:48 PM, Pol <p...@everpix.net> wrote:
> Hi,
>
> Since switching to the 2.7 runtime, logging in to http://www.everpix.com
> went from about a second to anywhere from 15s to 60s. I tracked it
> down to this single password-checking line:
>
> from bcrypt import bcrypt
> bcrypt.hashpw(password, self.password_hash) == self.password_hash

What value are you using for "threadsafe" in your app.yaml?

How large is self.password_hash?

Cheers,
Brian

> This comes from "a native Python implementation of the py-bcrypt
> package from http://www.mindrot.org/projects/py-bcrypt/" grabbed from
> here: https://github.com/erlichmen/py-bcrypt.
>
> So what's happening here and how can we fix this?
>
> Thanks,
>
> - Pol

Brandon Wirtz

Nov 12, 2011, 10:58:40 PM
to google-a...@googlegroups.com
Unless you are protecting medical records, bcrypt is overkill if you do some
reasonably smart things like "block an IP after >9 failed logins."

Or if you just do something weird to the password BEFORE you SHA it, like
interleaving the user name into the password: Salt1 + UpSaEsRsNwAoMrEd + Salt2.

Or pick two hashes: SHA(pass) + MD5(pass).

Don't want to store all that string length? Odd characters from
SHA(pass+salt) + even characters from MD5(pass+salt).

Uniqueness of the method is more important than the method itself.

Pol

Nov 13, 2011, 12:50:17 AM
to Google App Engine
Hi Brian,

threadsafe is "true"

Here's an example password_hash for an original 8-character password:

$2a$04$cbM2uHwDphIG3jFFRpq1mui5aVjevnDUwhvQ77S/WG/qvJMpiXAL6
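
Side note: the "$2a$04$" prefix encodes the bcrypt cost (log rounds) in the
standard modular-crypt layout. A quick sketch to read it:

# "$2a$04$..." -> version "2a", cost 4, i.e. 2**4 key-setup rounds
cost = int(self.password_hash.split("$")[2])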

On Nov 12, 6:58 pm, Brian Quinlan <bquin...@google.com> wrote:
> Hi Pol,
>
> On Sun, Nov 13, 2011 at 1:48 PM, Pol <p...@everpix.net> wrote:
> > Hi,
>
> > Since switching to the 2.7 runtime, logging in to http://www.everpix.com
> > went from about a second to anywhere from 15s to 60s. I tracked it
> > down to this single password-checking line:
>
> > from bcrypt import bcrypt
> > bcrypt.hashpw(password, self.password_hash) == self.password_hash
>
> What value are you using for "threadsafe" in your app.yaml?
>
> How large is self.password_hash?
>
> Cheers,
> Brian
>
> > This comes from "a native Python implementation of the py-bcrypt
> > package from http://www.mindrot.org/projects/py-bcrypt/" grabbed from
> > here: https://github.com/erlichmen/py-bcrypt.

Brian Quinlan

Nov 13, 2011, 1:09:05 AM
to google-a...@googlegroups.com
Hi Pol,

Thanks for getting back to me.

On Sun, Nov 13, 2011 at 4:50 PM, Pol <p...@everpix.net> wrote:
> Hi Brian,
>
> threadsafe is "true"

There is a known issue where concurrent requests (enabled with
threadsafe) can be much slower than non-concurrent requests,
especially if the request is CPU-bound.

You might want to set threadsafe to "false" and see if that fixes the problem.
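
For example, a minimal app.yaml sketch (only the relevant lines; your
handlers and other settings will differ):

runtime: python27
api_version: 1
threadsafe: false  # disable concurrent requests while investigating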

Cheers,
Brian

Pol

Nov 13, 2011, 12:32:06 PM
to Google App Engine
Hi Brian,

It helps: it's now at ~5s, but that's still at least 2x slower than
on the 2.5 runtime.

So on December 1st, the 50% discount for frontend instances is gone. The
idea is to compensate by switching to Python 2.7 with multithreading,
but it looks like at this point it's a lose-lose situation: it runs
more requests at the same time, but they take longer. We're mid-November
already; do you guys think you'll have all of this working
perfectly within 2 weeks? The 1.6 SDK, which actually allows running 2.7
locally, was just released, so we're only starting to test now. It seems
to me that deferring the payment change one more month, to January 1st,
2012, wouldn't hurt :)

Brian Quinlan

Nov 13, 2011, 2:51:05 PM
to google-a...@googlegroups.com
Hi Pol,

Thanks for trying it again with threadsafe disabled and reporting back!

On Mon, Nov 14, 2011 at 4:32 AM, Pol <p...@everpix.net> wrote:
> Hi Brian,
>

> It helps: it's now at ~5s, but that's still at least 2x slower than
> on the 2.5 runtime.
>
> So on December 1st, the 50% discount for frontend instances is gone. The
> idea is to compensate by switching to Python 2.7 with multithreading,
> but it looks like at this point it's a lose-lose situation: it runs
> more requests at the same time, but they take longer. We're mid-November
> already; do you guys think you'll have all of this working
> perfectly within 2 weeks?

No, the issues with concurrent requests won't be fixed by the end of November.

But note that concurrent requests will *not* improve the utilization
of CPU-bound requests. Running multiple threads on the same CPU just
proportionally slows each thread down.
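
If you want to see the effect in isolation (plain CPython, outside App
Engine), here's a minimal sketch; the loop size is arbitrary, the ratio
between the two timings is the point:

import threading
import time

def spin(n=5000000):
    # Pure-Python busy loop: CPU-bound, holds the GIL while running.
    while n:
        n -= 1

start = time.time()
spin()
spin()
print "sequential: %.2fs" % (time.time() - start)

start = time.time()
threads = [threading.Thread(target=spin) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Roughly the same wall-clock time (or worse): CPU-bound threads
# don't run in parallel on one CPU.
print "threaded:   %.2fs" % (time.time() - start)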

> The 1.6 SDK, which actually allows running 2.7
> locally, was just released, so we're only starting to test now. It seems
> to me that deferring the payment change one more month, to January 1st,
> 2012, wouldn't hurt :)

You can send your billing-related suggestions to App Engine's
Engineering Director, Peter Magnusson (p...@google.com).

Thanks!

Cheers,
Brian

Pol

Nov 14, 2011, 11:19:26 AM
to Google App Engine
Hi Brian,

> > So on December 1st, the 50% discount for frontend instances is gone. The
> > idea is to compensate by switching to Python 2.7 with multithreading,
> > but it looks like at this point it's a lose-lose situation: it runs
> > more requests at the same time, but they take longer. We're mid-November
> > already; do you guys think you'll have all of this working
> > perfectly within 2 weeks?
>
> No, the issues with concurrent requests won't be fixed by the end of November.
>
> But note that concurrent requests will *not* improve the utilization
> of CPU-bound requests. Running multiple threads on the same CPU just
> proportionally slows each thread down.

That doesn't make sense: apps do a mix of CPU stuff and RPC stuff (and
possibly URL requests). What's the point of concurrent requests if it
slows down the CPU stuff while allowing you to parallelize your RPC calls?
The end result will be the same number of instances, as requests end up
taking longer. Isn't the scheduler supposed to watch all this and
ensure the CPU on each physical machine is not saturated?

Only apps that do long-poll URL requests and barely use the CPU would
benefit from concurrent requests then.

We were told: don't worry so much about hours-based pricing, just wait
for the 2.7 runtime, it'll have concurrent requests, it'll compensate
things. Clearly that doesn't work as promised if just turning
threadsafe ON makes a 2-second request turn into a 30-60 second
one: the scheduler is not doing the right thing.

It seems what you need is per-WSGIApplication control of the
concurrent setting instead of a global one, so you can turn it on
only where it makes sense.

Finally, no matter what, concurrent or not, there's still a problem, as
the 2.7 runtime appears slower than 2.5 in this simple empirical test. I'm
starting to suspect you are using the 2.7 transition as an opportunity
to run more virtual instances per physical machine.

- Pierre

Brian Quinlan

Nov 14, 2011, 1:27:22 PM
to google-a...@googlegroups.com
Hi Pierre,

On Tue, Nov 15, 2011 at 3:19 AM, Pol <p...@everpix.net> wrote:
> Hi Brian,
>
>> > So on December 1st, the 50% discount for frontend instances is gone. The
>> > idea is to compensate by switching to Python 2.7 with multithreading,
>> > but it looks like at this point it's a lose-lose situation: it runs
>> > more requests at the same time, but they take longer. We're mid-November
>> > already; do you guys think you'll have all of this working
>> > perfectly within 2 weeks?
>>
>> No, the issues with concurrent requests won't be fixed by the end of November.
>>
>> But note that concurrent requests will *not* improve the utilization
>> of CPU-bound requests. Running multiple threads on the same CPU just
>> proportionally slows each thread down.
>
> That doesn't make sense: apps do a mix of CPU stuff and RPC stuff (and
> possibly URL requests). What's the point of concurrent requests if it
> slows down the CPU stuff while allowing you to parallelize your RPC calls?

This pattern (a mix of CPU use and RPC calls) will benefit from
concurrent requests. I was writing about what I understood to be your
login example.

Presumably it does a single datastore read to access user information
(taking 40ms or so) and then spends 1 second doing cryptography.

> The end result will be the same number of instances, as requests end up
> taking longer. Isn't the scheduler supposed to watch all this and
> ensure the CPU on each physical machine is not saturated?
>
> Only apps that do long-poll URL requests and barely use the CPU would
> benefit from concurrent requests then.
>
> We were told: don't worry so much about hours-based pricing, just wait
> for the 2.7 runtime, it'll have concurrent requests, it'll compensate
> things. Clearly that doesn't work as promised if just turning
> threadsafe ON makes a 2-second request turn into a 30-60 second
> one: the scheduler is not doing the right thing.

Yes, these large latency increases are a bug:
http://code.google.com/p/googleappengine/issues/detail?id=6323

> It seems what you need is per-WSGIApplication control of the
> concurrent setting instead of a global one, so you can turn it on
> only where it makes sense.
>
> Finally, no matter what, concurrent or not, there's still a problem, as
> the 2.7 runtime appears slower than 2.5 in this simple empirical test. I'm
> starting to suspect you are using the 2.7 transition as an opportunity
> to run more virtual instances per physical machine.

That's not the case. The Python 2.7 runtime is slower than the Python
2.5 runtime in some cases and faster in others. We aren't publicizing
the reasons why at this point.

Cheers,
Brian

> - Pierre

Brandon Wirtz

Nov 14, 2011, 5:07:42 PM
to google-a...@googlegroups.com
Half-sized instances, so they may be half speed.

-----Original Message-----
From: google-a...@googlegroups.com
[mailto:google-a...@googlegroups.com] On Behalf Of Pol
Sent: Sunday, November 13, 2011 9:32 AM
To: Google App Engine
Subject: [google-appengine] Re: Help resolve massive performance regression
in 2.7 vs 2.5 runtime

Nick Johnson

Nov 14, 2011, 6:55:34 PM
to google-a...@googlegroups.com
No! Please, please don't do this. Obscurity is no substitute for security.

1) Bcrypt or similar is not 'overkill' no matter who you are. Users reuse passwords, and they're entitled to the best protection you can reasonably provide them.
2) Bcrypt is not there to protect against online attacks, it's there to protect against offline attacks, where an attacker obtains your hashed and salted passwords.
3) Doing "something weird" is security through obscurity. Do not base your security on your attacker not knowing what you did. Really, really don't just concatenate salts to the beginning or end of the password (see the sketch below for the standard pattern).
4) Both MD5 and SHA1 are Merkle–Damgård construction hashes (http://en.wikipedia.org/wiki/Merkle%E2%80%93Damg%C3%A5rd_construction). As a result, the concatenation of several hashes is no more secure than the most secure of the individual hashes.
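
For reference, the standard py-bcrypt pattern handles salting for you; a
minimal sketch (assuming py-bcrypt's gensalt(log_rounds) signature):

import bcrypt

password = "correct horse"   # example credentials only
candidate = "correct horse"

# At signup: gensalt() generates a fresh random salt per user, and the
# salt and cost are embedded in the resulting hash string itself.
password_hash = bcrypt.hashpw(password, bcrypt.gensalt(12))

# At login: passing the stored hash as the "salt" reuses the embedded
# salt and cost, so the outputs match iff the passwords match.
authenticated = bcrypt.hashpw(candidate, password_hash) == password_hash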

-Nick Johnson

Nick Johnson, Developer Programs Engineer, App Engine


Brandon Wirtz

Nov 14, 2011, 9:07:32 PM
to google-a...@googlegroups.com

If I know your salt, I can “De-Hash” bcrypts faster than I can any of the “weird” combinations, because there are libraries for doing so on ATI cards.

 

If you do something weird, a script kiddie can’t just pull code off the web and attack it.

 

You want to see who can offline-crack a set of 1M users? Your bcrypt list vs. my “weird” one. You don’t even have to give me the salt; I’ll have 10k of those cracked in the first 72 hours. 10-to-1 odds you won’t get through mine without my source code in my lifetime.

 

-Brandon Wirtz

 

PS
I don’t usually do the “trust me, I’m far more evil” thing, but the FBI, Homeland Security, and the CIA have been to my doorstep over things I have defeated, documented, or built to keep from being defeated. The first time, I was in 3rd grade.

Nick Johnson

Nov 14, 2011, 9:21:11 PM
to google-a...@googlegroups.com
Hi Brandon,

What you say is fine if your threat model only includes "script kiddies" who don't have your source code. If either of those is not true - you have an adversary with some level of independent skill, or your source code is compromised - any method that relies on obscurity for its security will fare very poorly.

One thing to bear in mind is that if your app is ever compromised, your password database and/or source may be posted publicly; at that point, you have to worry not just about the initial attacker, but about anyone with sufficient motivation.

Of course, using federated login like OpenID or the Users API obviates the need to store passwords at all, making it Someone Else's Problem. :)

-Nick

Stephen Buergler

Nov 14, 2011, 9:30:54 PM
to google-a...@googlegroups.com
It doesn't matter if you can have your ATI card up and running sooner if every single password attempt takes a whole lot longer to try. That is the main strength of bcrypt.

Brandon Wirtz

Nov 14, 2011, 9:52:41 PM
to google-a...@googlegroups.com

Nick,

 

I agree that my threat model assumes they didn’t get my source code. That “somebody else’s problem” approach works under the assumption that people are going to get my data, not my source code: I don’t ever write my own DB server code, I’m stuck using someone else’s, which means the vulnerability I am most likely to face is that somebody else’s screw-up will be where my problem lies.

 

Granted, this is a better strategy if you are running compiled code. Since my code lives on Google’s servers, I’m at the mercy of Google’s security, whereas if I were running compiled code it would be less likely someone would get it.

 

I would say that a unique salt per user is a good thing. The most common way to attack a large password database is to look at the most common entries and compare them against the most common passwords from other sources. If you know the 15 most-used passwords and the 15 most frequently occurring database entries, you are a long way toward knowing what those 15 values are and calculating the salt. You aren’t crunching millions of combinations, you are crunching thousands; once you have the salt, you take your already-deciphered list of the most common passwords, calculate the top 5k using bcrypt, and you now have about 50% of the data in fewer than 10k operations.

 

Compare that with my scenario. You have the data. You don’t have the source code. The UserID or other “spoiler” is in every salt, so the recurrence of a hash doesn’t correspond to a duplicate password, and now the computation is nearly impossible even if you have the source code, because you have to calculate every value for every user anyway.

 

Would Bcrypt(Pass+UserID+Salt) be the best? Yes. But MD5(Pass+UserID+Salt) is still going to be orders of magnitude more difficult than Bcrypt(Pass+Salt), because I can’t use knowledge of frequency tables to predict likely outcomes or detect duplicate passwords.

 

-Brandon

Jeff Schnitzer

Nov 15, 2011, 6:05:29 AM
to google-a...@googlegroups.com
On Mon, Nov 14, 2011 at 12:19 PM, Pol <p...@everpix.net> wrote:
>> But note that concurrent requests will *not* improve the utilization
>> of CPU-bound requests. Running multiple threads on the same CPU just
>> proportionally slows each thread down.
>
> That doesn't make sense: apps do a mix of CPU stuff and RPC stuff (and
> possibly URL requests). What's the point of concurrent requests if it
> slows down the CPU stuff while allowing you to parallelize your RPC calls?
> The end result will be the same number of instances, as requests end up
> taking longer. Isn't the scheduler supposed to watch all this and
> ensure the CPU on each physical machine is not saturated?
>
> Only apps that do long-poll URL requests and barely use the CPU would
> benefit from concurrent requests then.

The vast majority of web applications (on any platform) spend the majority of their time blocked waiting for I/O.  It's usually an order of magnitude difference - and this isn't long polling, it's just normal database access and whatnot.

It sounds like your bcrypt library is intolerably slow in Python, which creates a rare CPU-bound situation.  Either reduce the number of rounds or consider the more efficient but still reasonably secure alternatives in this thread:

 
Also, I +1 Nick's comment: DON'T roll your own.  There are plenty of off-the-shelf options written by people who have spent significant portions of their educated lives thinking about this problem.
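
On the rounds knob specifically: with py-bcrypt the cost is fixed at hash time via gensalt, and each +1 roughly doubles the work. A sketch (gensalt's cost parameter per py-bcrypt; adjust for your library):

import bcrypt

password = "hunter2"  # example only

# Lower cost = faster hashing; pick the largest cost you can afford.
cheap = bcrypt.hashpw(password, bcrypt.gensalt(8))
strong = bcrypt.hashpw(password, bcrypt.gensalt(12))
# Existing hashes keep the cost they were created with (it's in the
# "$2a$NN$" prefix), so verifying old hashes is unaffected.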

Jeff

Jeff Schnitzer

Nov 15, 2011, 6:10:43 AM
to google-a...@googlegroups.com
Dunno about the Python library, but the standard Java jBCrypt library uses a random salt per user by default.

I'm also very suspicious of this idea that the attacker doesn't have the source code.  Python is trivially easy to decompile.  Also, where do you keep your source code?  On GitHub?  That opens up a whole new set of attack vectors, including disgruntled GitHub employees.

Jeff

Waleed Abdulla

Nov 15, 2011, 7:37:08 PM
to google-a...@googlegroups.com
I just switched one of my apps to Python 2.7 and the average latency went up from 125ms to about 1.3 seconds (roughly 10x). My app is *not* CPU-bound. It has a mix of different use patterns (some requests use urlfetch, others load data from the datastore or memcache). That variety should make it a good candidate to benefit from multithreading, but that doesn't seem to be working as expected. While the number of instances went down from 9 to 6, the 10x increase in latency is not acceptable.

I spent a lot of time investigating, and finally changed threadsafe to false and that fixed the problem (and, of course, raised my instance count to its original level). Like the original poster, I was hoping that multithreading would reduce my instance cost, but that plan failed.

Am I doing something wrong? Or is multithreading not ready for live traffic yet?

Waleed
app id: bitpixelshr

Nick Johnson

Nov 15, 2011, 7:49:38 PM
to google-a...@googlegroups.com
Hi Waleed,

It's impossible to say what the cause of the latency might be without more details. You say you spent a lot of time investigating it; can you show us some of the data you gathered? Did you use appstats?
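
If not, enabling it is just a few lines; a minimal sketch of appengine_config.py at the app root (plus the appstats builtin in app.yaml for the /_ah/stats console):

# appengine_config.py
def webapp_add_wsgi_middleware(app):
    # Wrap the WSGI app so Appstats records timing for every RPC.
    from google.appengine.ext.appstats import recording
    return recording.appstats_wsgi_middleware(app)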

-Nick Johnson

Waleed Abdulla

Nov 15, 2011, 10:00:51 PM
to google-a...@googlegroups.com
Hi Nick,
    Here are more details, and I'm attaching AppStats screen shots. 

- Along with the switch to Python 2.7, we also switched to NDB, and we removed our use of memcache in favor of the built-in caching in NDB. So when we noticed the delays, the first thing we suspected was that NDB caching wasn't working, so we re-implemented our use of memcache. That didn't help. 

- We checked if there are "pending_ms" values in the logs. There weren't any. 

- The instance and latency settings in "Application Settings" are all set to "automatic". 

- To keep things simple, we tested with one of the simple requests (it fetches a Blob from memcache and serves it as an image, then updates values in memcache). That request takes 1s to 3s when threadsafe is True, and 100-300ms when threadsafe is False. 

- I just enabled appstats after your email and collected data for both cases. Most of the delay seems to be happening in RPC calls. For example, memcache calls seem to take 200ms+. I'm attaching a couple of screen shots for a simple request when threadsafe is True and False. 

Hope this helps.

Waleed
appstats_threadsafe_true.png
appstats_threadsafe_false.png

Waleed Abdulla

Nov 21, 2011, 4:29:21 PM
to google-a...@googlegroups.com, nickj...@google.com, bqui...@google.com
Bump!  

Nick, I sent the info you requested. 

I'm still trying to find an explanation for why using multithreading increased my response time by 10X. Any ideas?

Waleed

Brian Quinlan

Nov 21, 2011, 5:10:05 PM
to Waleed Abdulla, google-a...@googlegroups.com, nickj...@google.com
Hi Waleed,

This is likely related to this bug:
http://code.google.com/p/googleappengine/issues/detail?id=6323

Though I'm a bit surprised that it is impacting you this much since
you don't seem to be using much CPU.

Cheers,
Brian
