A few TaskScheduler calls landed in //net, ok?

Gabriel Charette

unread,

Feb 26, 2017, 12:37:33 PM2/26/17

to Ryan Sleevi, Francois Pierre Doray, net...@chromium.org, Robert Liao

Hey Ryan,

we automated much of the base::TaskScheduler migration and it was auto sharded to various owners. It came to my attention that some of them landed in //net (with various //net owner's approval -- picked randomly via script), just wanted to check that these usages were fine by you (given your previous concerns in some sections of the //net codebase). Some of them were specifically reviewed by you, so I assume it's fine but just making sure :).

Thanks,

Gab

Ryan Sleevi

unread,

Feb 26, 2017, 3:20:30 PM2/26/17

to Gabriel Charette, Ryan Sleevi, Francois Pierre Doray, net...@chromium.org, Robert Liao

Gab,

Thanks for bringing this up! And for CC'ing net-dev@, to make sure this information gets out wider (or in the event I'm completely wrong :D)

As we discussed previously, the concern is about a specific sequence of calls, where:

1) //net calls into 3P library (blocking API from 3P library)

2) 3P library calls back into //net via 3P libraries extension hooks (and 3P library requires synchronous answer)

3) //net needs to asynchronously answer (e.g. prompt for UI)

The concern emerges from the fact that the 3P library may grab locks in the call during #1, thus causing any other calls to the 3P library to block until #2/#3 are completed for the original caller. If the async hop in #3 (e.g. to prompt the user) ends up calling back into #1 (which should never happen), we potentially deadlock. If the async hop in #3 (e.g. to prompt the user) ends up starved, because all available workers are also stuck in #1, then we potentially starve and also deadlock.

So far, the approach to //net has been to ensure that any calls to #1 end up on a dedicated WorkerPool, as that allows unbounded growth for #1, and thus hopefully reduces the risk of any blocking on #3. We've made sure that the UI prompts never end up blocking on #1 again, so it's really just a question of allocation strategy.

For NSS, the concrete concern is those that need to transit the PKCS11PasswordFunc, which is implemented using the CryptoModuleBlockingPasswordDelegate, and which needs to post (and wait for a response from) the UI thread. Further, since the UI thread may involve any number of other threads (such as DBus or IO thread for IPC'ing to the GPU process), we basically need to ensure that any other Chromium thread is responsive - that is, making sure any NSS call goes through #1.

If the TaskScheduler had the ability to indicate "These are tasks which must not block other pooled tasks" - that is, that the TaskScheduler ensures they will run on one or more dedicated threads which will not block other TaskScheduler'd tasks - then I think we might be able to safely move things to the TaskScheduler, because we'd be getting the same WorkerPool semantics that we rely on.

Having said that, and having had the opportunity to sit down with you and talk through these issues (thanks!) and better figure out exactly where the uncertainty and unease were, I'm now thinking the following three call sequences may be 'unsafe', because they represent tasks in the #1 category:

- https://cs.chromium.org/chromium/src/chrome/browser/chromeos/net/client_cert_store_chromeos.cc?rcl=ddb49ff47dc851ae174a2224e1b43d42872f508b&l=84

- https://cs.chromium.org/chromium/src/net/cert/nss_cert_database_chromeos.cc?rcl=ddb49ff47dc851ae174a2224e1b43d42872f508b&l=52

- https://cs.chromium.org/chromium/src/net/cert/nss_cert_database.cc?rcl=ddb49ff47dc851ae174a2224e1b43d42872f508b&l=103 ( and related)

Basically, these are all potential entry points into the sequence I described (notably, they're focused on NSS files, unsurprisingly :D)

There's one more, which I'm not sure about, but my gut says it may not be safe, and that's around the relationship between https://cs.chromium.org/chromium/src/net/ssl/threaded_ssl_private_key.cc?rcl=ddb49ff47dc851ae174a2224e1b43d42872f508b&l=75 and https://cs.chromium.org/chromium/src/net/ssl/ssl_platform_key_nss.cc?rcl=ddb49ff47dc851ae174a2224e1b43d42872f508b&l=175 .

I believe the ThreadedSSLPrivateKey is safe, because of https://cs.chromium.org/chromium/src/net/ssl/ssl_platform_key_util.h?rcl=ddb49ff47dc851ae174a2224e1b43d42872f508b&l=23 - guaranteeing a (single) dedicated worker thread for these calls. If this was ever changed to use a TaskScheduler-managed TaskRunner, we'd be in trouble.

Note that while I've focused on NSS, the same concerns would exist on macOS/Windows. However, there, our integration with the OS/third-party APIs is much more limited, and thus the only changes for ending up in the scenario I mentioned (that I'm aware of) are contained in the ThreadedSSLPrivateKey, and thus safe.

I think you and I have covered all the other places of NSS integration - most notably, certificate verification - and we've made sure those use a WorkerPool (and thus continue to scale on the #1 tasks in a way that doesn't block the #2/#3 sequence).

Ryan Sleevi

unread,

Feb 26, 2017, 3:30:42 PM2/26/17

to Ryan Sleevi, Gabriel Charette, Francois Pierre Doray, net...@chromium.org, Robert Liao

On Sun, Feb 26, 2017 at 12:19 PM, Ryan Sleevi <rsl...@chromium.org> wrote:

There's one more, which I'm not sure about, but my gut says it may not be safe, and that's around the relationship between https://cs.chromium.org/chromium/src/net/ssl/threaded_ssl_private_key.cc?rcl=ddb49ff47dc851ae174a2224e1b43d42872f508b&l=75 and https://cs.chromium.org/chromium/src/net/ssl/ssl_platform_key_nss.cc?rcl=ddb49ff47dc851ae174a2224e1b43d42872f508b&l=175 .

Apologies, this was an earlier paragraph I meant to delete, due to the explanation below. That is, I do believe that the ThreadedSSLPrivateKey is safe (only after working through the sequence as described below)

Gabriel Charette

unread,

Feb 27, 2017, 3:29:00 PM2/27/17

to Ryan Sleevi, Gabriel Charette, Francois Pierre Doray, net...@chromium.org, Robert Liao

Ok we will revert these usages in M58 to be safe (as discussed we're almost done removing all callers of base::WorkerPool and will soon move it to net::WorkerPool per it being the only corner of the codebase with such special requirements given its interaction with 3P libraries).

How is the waiting in #2 implemented? base::WaitableEvent?

In the medium term we're planning to hook known blocking calls in Chromium (e.g. base::WaitableEvent::Wait()) in order for the scheduler to know that a thread is blocked and replace it with another thread in that pool -- guaranteeing that the pools are always processing pending work (while keeping things under control when work isn't blocked -- as opposed to base::WorkerPool which will unconditionally grow to one thread per task).

Thanks,

Gab

Ryan Sleevi

unread,

Feb 27, 2017, 3:36:45 PM2/27/17

to Gabriel Charette, Ryan Sleevi, Francois Pierre Doray, net...@chromium.org, Robert Liao

On Mon, Feb 27, 2017 at 12:28 PM, Gabriel Charette <g...@chromium.org> wrote:

Ok we will revert these usages in M58 to be safe (as discussed we're almost done removing all callers of base::WorkerPool and will soon move it to net::WorkerPool per it being the only corner of the codebase with such special requirements given its interaction with 3P libraries).

How is the waiting in #2 implemented? base::WaitableEvent?

There's two main entry points for #2 with NSS, and that's

https://cs.chromium.org/chromium/src/chrome/browser/ui/crypto_module_delegate_nss.h?rcl=90129f52048e8042a9bb2caa618305da80ff6019&l=21

and

https://cs.chromium.org/chromium/src/net/cert_net/nss_ocsp.cc?rcl=90129f52048e8042a9bb2caa618305da80ff6019&l=443

The former uses base::WaitableEvent, the latter uses a condition variable to implement timeout semantics (because it's related to URL fetching)

The former is potentially hit by any NSS call (so it's up to the component to ensure the proper use of a worker/dedicated task runner; e.g. the SSLPrivateKey stuff uses a single dedicated thread, as did some of the ChromeOS bits), the latter is hit only during certificate validation, which always uses base::WorkerPool

In the medium term we're planning to hook known blocking calls in Chromium (e.g. base::WaitableEvent::Wait()) in order for the scheduler to know that a thread is blocked and replace it with another thread in that pool -- guaranteeing that the pools are always processing pending work (while keeping things under control when work isn't blocked -- as opposed to base::WorkerPool which will unconditionally grow to one thread per task).

That sounds like it'd go a long way to solving the UI case, which would allow almost all the callers except for cert validation to move onto a shared pool.

François Doray

unread,

Feb 28, 2017, 8:08:53 PM2/28/17

to rsl...@chromium.org, Gabriel Charette, net...@chromium.org, Robert Liao

I reverted:

https://cs.chromium.org/chromium/src/chrome/browser/chromeos/net/client_cert_store_chromeos.cc?l=84

https://cs.chromium.org/chromium/src/net/cert/nss_cert_database_chromeos.cc?l=52

https://cs.chromium.org/chromium/src/net/cert/nss_cert_database.cc?l=103 and related

This call site https://cs.chromium.org/chromium/src/net/ssl/threaded_ssl_private_key.cc?rcl=ddb49ff47dc851ae174a2224e1b43d42872f508b&l=75 doesn't post to TaskScheduler so there is nothing to revert.

Will not try to reland before TaskScheduler supports growing its pool of threads dynamically.

Ryan Sleevi

unread,

Feb 28, 2017, 8:13:03 PM2/28/17

to François Doray, Ryan Sleevi, Gabriel Charette, net...@chromium.org, Robert Liao

Thanks Francois!

I think once we're able to grow once we recognize contention on a base::Lock/base::ConditionVariable - or simply a way to signal "Don't wait for me" to the scheduler (like we do for ScopedAllowIO and friends) - we'll be in a good place to convert the rest of //net (and its consumers) overs

Reply all

Reply to author

Forward