Ordering of mojo error handlers / problems with NetworkService restarts

45 views
Skip to first unread message

Łukasz Anforowicz

unread,
Oct 8, 2018, 6:39:23 PM10/8/18
to network-s...@chromium.org, chromium-mojo, Ken Rockot, shi...@chromium.org
Hello,

Consider the following situation:
  • content::StoragePartition::network_context_ is a network::mojom::NetworkContextPtr
  • The InterfacePtr is initially bound in content::StoragePartitionImpl::InitNetworkContext and an error handler is registered to call InitNetworkContext again after errors.
  • network::mojom::URLLoaderFactoryPtr is bound in RenderFrameHostImpl::CreateNetworkServiceDefaultFactoryInternal, by passing network::mojom::URLLoaderFactoryRequest into CreateURLLoaderFactory method of network::mojom::NetworkContext
    • Today, RenderFrameHostImpl registers an error handler (RenderFrameHostImpl::UpdateSubresourceLoaderFactories) via network::mojom::URLLoaderFactoryPtr (on a separate network::mojom::URLLoaderFactoryPtr maintained for the sole purpose of detecting mojo errors - see RenderFrameHostImpl::network_service_connection_error_handler_holder_)
    • For the future, I am trying to get rid of |RenderFrameHostImpl::network_service_connection_error_handler_holder_| and instead introduce generic RegisterNetworkServiceCrashHandler(base::RepeatingClosure).  See https://crrev.com/c/1263555.  This generic handler registers an error handler via network::mojom::NetworkServicePtr.
Problem: when RenderFrameHostImpl::UpdateSubresourceLoaderFactories runs, the network::mojom::NetworkContextPtr might or might not yet realize that an error occurred.  This means that RenderFrameHostImpl::UpdateSubresourceLoaderFactories might use a stale network::mojom::NetworkContext* (one that won't work going forward).

The problem doesn't go away even if PostTask is used - when executing a task posted from an error handler of network::mojom::NetworkServicePtr, I still see that network::mojom::NetworkContextPtr did not yet realize that an error has occurred (*).

Q1: I assume that the problem above might occur even today, right?  Can this be the source of https://crbug.com/889855 that tracks flakiness in NetworkServiceRestartBrowserTests?

Q2: What can I do about the timing/sync problem?  How can a user of network::mojom::URLLoaderFactoryPtr (e.g. RenderFrameHostImpl) handle errors in a way that ensures that the error/bound state of network::mojom::URLLoaderFactoryPtr and network::mojom::NetworkContextPtr is in sync when the handler runs?

Q3: Please shout if I made any mistakes in the analysis or in the code :-)


-Lukasz

(*) To repro the problem:
2. DISPLAY=:20 out/rel/content_browsertests --gtest_filter=*NetworkServiceRestartBrowserTest.FetchFromServiceWorkerControlledPage_PassThrough* --gtest_repeat=20
3. Notice that the test flakily fails and when it fails, a stale network::mojom::NetworkContext* is used from RenderFrameHostImpl::UpdateSubresourceLoaderFactories:

[ RUN      ] NetworkServiceRestartBrowserTest.FetchFromServiceWorkerControlledPage_PassThrough

[231847:231847:1008/151606.045385:ERROR:storage_partition_impl.cc(742)] StoragePartitionImpl::GetNetworkContext; is_bound = 0; has_error = 0; network_context_.get() = (nil)
[231847:231847:1008/151606.045756:ERROR:storage_partition_impl.cc(1343)] StoragePartitionImpl::InitNetworkContext; was_bound = 0; had_error = 0; network_context_.get() = 0x1937d93a6480
[231847:231847:1008/151606.045910:ERROR:storage_partition_impl.cc(742)] StoragePartitionImpl::GetNetworkContext; is_bound = 1; has_error = 0; network_context_.get() = 0x1937d93a6480
[231847:231847:1008/151606.143935:ERROR:storage_partition_impl.cc(742)] StoragePartitionImpl::GetNetworkContext; is_bound = 1; has_error = 0; network_context_.get() = 0x1937d93a6480
[231847:231847:1008/151606.984577:ERROR:storage_partition_impl.cc(742)] StoragePartitionImpl::GetNetworkContext; is_bound = 1; has_error = 0; network_context_.get() = 0x1937d93a6480
[231847:231847:1008/151606.994738:ERROR:storage_partition_impl.cc(742)] StoragePartitionImpl::GetNetworkContext; is_bound = 1; has_error = 0; network_context_.get() = 0x1937d93a6480
[231847:231847:1008/151607.019132:ERROR:embedded_worker_instance.cc(96)] embedded_worker_instance.cc: CreateFactoryBundle ...
[231847:231847:1008/151607.019199:ERROR:storage_partition_impl.cc(742)] StoragePartitionImpl::GetNetworkContext; is_bound = 1; has_error = 0; network_context_.get() = 0x1937d93a6480
[231847:231847:1008/151607.019306:ERROR:embedded_worker_instance.cc(126)] embedded_worker_instance.cc: CreateFactoryBundle ... done.
[231847:231847:1008/151607.019337:ERROR:embedded_worker_instance.cc(96)] embedded_worker_instance.cc: CreateFactoryBundle ...
[231847:231847:1008/151607.019381:ERROR:storage_partition_impl.cc(742)] StoragePartitionImpl::GetNetworkContext; is_bound = 1; has_error = 0; network_context_.get() = 0x1937d93a6480
[231847:231847:1008/151607.019444:ERROR:embedded_worker_instance.cc(126)] embedded_worker_instance.cc: CreateFactoryBundle ... done.
[231847:231847:1008/151607.117424:ERROR:storage_partition_impl.cc(742)] StoragePartitionImpl::GetNetworkContext; is_bound = 1; has_error = 0; network_context_.get() = 0x1937d93a6480
[231847:231847:1008/151607.119528:ERROR:storage_partition_impl.cc(742)] StoragePartitionImpl::GetNetworkContext; is_bound = 1; has_error = 0; network_context_.get() = 0x1937d93a6480

[231868:231904:1008/151607.148278:ERROR:network_service_test_helper.cc(114)] Intentionally terminating current process to simulate NetworkService crash for testing.

An error handler of network::mojom::NetworkServicePtr posts a task that executes RenderFrameHostImpl::UpdateSubresourceLoaderFactories:
[231847:231847:1008/151607.160191:ERROR:render_frame_host_impl.cc(4874)] RenderFrameHostImpl::UpdateSubresourceLoaderFactories ...
[231847:231847:1008/151607.160262:ERROR:storage_partition_impl.cc(742)] StoragePartitionImpl::GetNetworkContext; is_bound = 1; has_error = 0; network_context_.get() = 0x1937d93a6480
Notice that on the previous line, a stale |network_context_| is used (i.e. |has_error = 0| shows that InterfacePtr::encountered_error() still returned false despite NetworkService crash).
[231847:231847:1008/151607.160514:ERROR:render_frame_host_impl.cc(4887)] RenderFrameHostImpl::UpdateSubresourceLoaderFactories ... done.
[231847:231847:1008/151607.161109:ERROR:storage_partition_impl.cc(1343)] StoragePartitionImpl::InitNetworkContext; was_bound = 1; had_error = 1; network_context_.get() = 0x1937d954acd0
Only now |had_error| is true...
[231847:231847:1008/151608.152541:ERROR:embedded_worker_instance.cc(96)] embedded_worker_instance.cc: CreateFactoryBundle ...
[231847:231847:1008/151608.152646:ERROR:storage_partition_impl.cc(742)] StoragePartitionImpl::GetNetworkContext; is_bound = 1; has_error = 0; network_context_.get() = 0x1937d954acd0
[231847:231847:1008/151608.152756:ERROR:embedded_worker_instance.cc(126)] embedded_worker_instance.cc: CreateFactoryBundle ... done.
[231847:231847:1008/151608.152789:ERROR:embedded_worker_instance.cc(96)] embedded_worker_instance.cc: CreateFactoryBundle ...
[231847:231847:1008/151608.152851:ERROR:storage_partition_impl.cc(742)] StoragePartitionImpl::GetNetworkContext; is_bound = 1; has_error = 0; network_context_.get() = 0x1937d954acd0
[231847:231847:1008/151608.152928:ERROR:embedded_worker_instance.cc(126)] embedded_worker_instance.cc: CreateFactoryBundle ... done.
../../content/browser/network_service_restart_browsertest.cc:776: Failure
Expected equality of these values:
  "Echo"
    Which is: 0x317a9a
  EvalJs(shell(), script)
    Which is: "TypeError: Failed to fetch"

John Abd-El-Malek

unread,
Oct 8, 2018, 7:07:57 PM10/8/18
to Łukasz Anforowicz, Chong Zhang, network-service-dev, chromium-mojo, Ken Rockot, Makoto Shimazu
On Mon, Oct 8, 2018 at 3:39 PM Łukasz Anforowicz <luk...@chromium.org> wrote:
Hello,

Consider the following situation:
  • content::StoragePartition::network_context_ is a network::mojom::NetworkContextPtr
  • The InterfacePtr is initially bound in content::StoragePartitionImpl::InitNetworkContext and an error handler is registered to call InitNetworkContext again after errors.
  • network::mojom::URLLoaderFactoryPtr is bound in RenderFrameHostImpl::CreateNetworkServiceDefaultFactoryInternal, by passing network::mojom::URLLoaderFactoryRequest into CreateURLLoaderFactory method of network::mojom::NetworkContext
    • Today, RenderFrameHostImpl registers an error handler (RenderFrameHostImpl::UpdateSubresourceLoaderFactories) via network::mojom::URLLoaderFactoryPtr (on a separate network::mojom::URLLoaderFactoryPtr maintained for the sole purpose of detecting mojo errors - see RenderFrameHostImpl::network_service_connection_error_handler_holder_)
    • For the future, I am trying to get rid of |RenderFrameHostImpl::network_service_connection_error_handler_holder_| and instead introduce generic RegisterNetworkServiceCrashHandler(base::RepeatingClosure).  See https://crrev.com/c/1263555.  This generic handler registers an error handler via network::mojom::NetworkServicePtr.
Problem: when RenderFrameHostImpl::UpdateSubresourceLoaderFactories runs, the network::mojom::NetworkContextPtr might or might not yet realize that an error occurred.  This means that RenderFrameHostImpl::UpdateSubresourceLoaderFactories might use a stale network::mojom::NetworkContext* (one that won't work going forward).

I got confused by this as well before, and Chong explained that even if a stale interfaceptr is used, UpdateSubresourceLoaderFactories will be called a second time when that stale pointer notices the connection error.

--
You received this message because you are subscribed to the Google Groups "network-service-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to network-service...@chromium.org.
To post to this group, send email to network-s...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/network-service-dev/CAA_NCUG-vM-kgk6eEQZg%3Dv9ZS3-EXgm%2BUcniEtrpN-o0t0B3TA%40mail.gmail.com.

Chong Zhang

unread,
Oct 8, 2018, 7:21:16 PM10/8/18
to John Abd-El-Malek, Łukasz Anforowicz, network-service-dev, chromium-mojo, Ken Rockot, Makoto Shimazu
To add more context John had a similar patch before to create a universal crash observer on NetworkServicePtr, however it turns out to require more complexity due to the independence of each pipe.

IMO the relatively robust solution currently is to register error handler on the parent InterfacePtr, e.g. URLLoaderFactoryPtr relies on NetworkContextPtr, and NetworkContextPtr NetworkServicePtr.

Otherwise we might have to introduce dependencies on InterfacePtr in the future (which might not be what we want).

-
Chong

Matt Menke

unread,
Oct 8, 2018, 7:23:57 PM10/8/18
to cho...@chromium.org, John Abd-El-Malek, Łukasz Anforowicz, network-service-dev, chromium-mojo, Ken Rockot, shi...@chromium.org
Chong:  Note that in this case, we're in a difference process.

Thant having been said, what happens is, at worst, you end up getting a new pipe that's also dead, notice it's dead, and request a new one.  Eventually the browser process realizes the NetworkService and NetworkContext pointers are dead, and gives you a new, good, pointer.  So I don't think we should spend too much time worrying about this case - network process crashes should be rare, anyways.

Łukasz Anforowicz

unread,
Oct 8, 2018, 7:38:01 PM10/8/18
to Matt Menke, cho...@chromium.org, John Abd-El-Malek, network-service-dev, chromium-mojo, Ken Rockot, shi...@chromium.org
Thank you very much for all the replies!  So, how would I trigger repopulating NetworkService with GetNetworkService()->AddCorbExceptionForPlugin(process_id) data after a crash?
  • NetworkServicePtr is hidden/embedded inside network_service_instance_cc, so I can't call NetworkServicePtr::set_connection_error_handler.
  • I guess I can create/bind an extra/fake InterfacePtr object (maybe a NetworkContextPtr), but this might be called at a wrong moment (when NetworkService didn't yet refresh), right?
  • Maybe I can continue with the RegisterNetworkServiceCrashHandler approach, but comment that it is called when NetworkServicePtr errors out (+ that ordering wrt other error handlers is undefined).
WDYT?

To unsubscribe from this group and stop receiving emails from it, send an email to network-service-dev+unsub...@chromium.org.
To post to this group, send email to network-service-dev@chromium.org.

--
You received this message because you are subscribed to the Google Groups "network-service-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to network-service-dev+unsub...@chromium.org.
To post to this group, send email to network-service-dev@chromium.org.

Matt Menke

unread,
Oct 8, 2018, 7:40:21 PM10/8/18
to Łukasz Anforowicz, cho...@chromium.org, John Abd-El-Malek, network-service-dev, chromium-mojo, Ken Rockot, shi...@chromium.org
The method to crash the NetworkContext in the browser test fixture already waits for the NetworkService to be restarted.  If AddCorbExceptionForPlugin isn't call synchronously on NetworkService restart, perhaps it should be?

To unsubscribe from this group and stop receiving emails from it, send an email to network-service...@chromium.org.
To post to this group, send email to network-s...@chromium.org.

--
You received this message because you are subscribed to the Google Groups "network-service-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to network-service...@chromium.org.
To post to this group, send email to network-s...@chromium.org.

Łukasz Anforowicz

unread,
Oct 9, 2018, 7:02:11 PM10/9/18
to Matt Menke, cho...@chromium.org, John Abd-El-Malek, network-service-dev, chromium-mojo, Ken Rockot, Makoto Shimazu
On Mon, Oct 8, 2018 at 4:40 PM, Matt Menke <mme...@chromium.org> wrote:
The method to crash the NetworkContext in the browser test fixture already waits for the NetworkService to be restarted.  If AddCorbExceptionForPlugin isn't call synchronously on NetworkService restart, perhaps it should be?

In https://chromium-review.googlesource.com/1263555AddCorbExceptionForPlugin is indeed called synchronously on NetworkService restart.  I guess this (i.e. option #3 from my previous email) is good enough for this CL.

Taking a step back:
  • I understand now how crash notification work reliably today.  The potential to get multiple error notification for the same crash seems weird, but I guess I can't argue with the fact that it all works out in practice.
  • I understand why RenderFrameHostImpl has to listen for errors of NetworkContextPtr and/or URLLoaderFactoryPtr (and shouldn't listen for errors of NetworkServicePtr because they are not synchronized with error-state of NetworkContextPtr).  OTOH, I think it is undesirable to double the number of pipes and network::URLLoaderFactory objects just to enable error detection.  Maybe this can be fixed if 1) StoragePartitionImpl would expose [Un]RegisterNetworkContextErrorHandler and 2) RenderFrameHostImpl used this to react to errors (instead of creating an extra/artificial URLLoaderFactory)?  What do you all think about this idea?

To unsubscribe from this group and stop receiving emails from it, send an email to network-service-dev+unsub...@chromium.org.
To post to this group, send email to network-service-dev@chromium.org.

--
You received this message because you are subscribed to the Google Groups "network-service-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to network-service-dev+unsub...@chromium.org.
To post to this group, send email to network-service-dev@chromium.org.

Matt Falkenhagen

unread,
Oct 9, 2018, 7:47:50 PM10/9/18
to luk...@chromium.org, Matt Menke, cho...@chromium.org, John Abd-El-Malek, network-service-dev, chromium-mojo, Ken Rockot, Makoto Shimazu
I may be missing something but the flaky test above is about service worker which doesn't use the error notification mechanism from RenderFrameHostImpl::UpdateSubresourceLoaderFactories. The service worker in the renderer has a URLLoaderFactoryPtr to the network loader factory (ServiceWorkerContextClient::network_service_connection_error_handler_holder_). When that connection breaks the service worker self-terminates, so a new service worker can be created when needed.

It looks like what's happening in the flake is the new service worker is also getting a dead network service connection. Maybe that service worker will also self-terminate and eventually it'll work?

To unsubscribe from this group and stop receiving emails from it, send an email to network-service...@chromium.org.
To post to this group, send email to network-s...@chromium.org.

--
You received this message because you are subscribed to the Google Groups "network-service-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to network-service...@chromium.org.
To post to this group, send email to network-s...@chromium.org.

--
You received this message because you are subscribed to the Google Groups "network-service-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to network-service...@chromium.org.
To post to this group, send email to network-s...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/network-service-dev/CAA_NCUFPOyNNbuPY6dAx%3DhLp8-SXeP%2BNnsf2rjuTayigOzEz9w%40mail.gmail.com.

Matt Menke

unread,
Oct 10, 2018, 11:05:59 AM10/10/18
to Matt Falkenhagen, Łukasz Anforowicz, cho...@chromium.org, John Abd-El-Malek, network-service-dev, chromium-mojo, Ken Rockot, Makoto Shimazu
I, too, was thinking we were talking about something in the renderer, and not registering to watch for NetworkService re-creation in the browser process.

If it were up to me, I'd make these exceptions per-URLLoaderFactory, and then when the renderer requests a new URLLoaderFactory, we re-attach them then.  (I'd also add a side channel for adding exceptions to existing URLLoaderFactories).  I assume that would be a pretty big refactor, though, at least if SW's don't have access to per-frame URLLoaderFactories currently.  :(

Makoto Shimazu

unread,
Oct 11, 2018, 4:45:25 AM10/11/18
to mme...@chromium.org, Matt Falkenhagen, luk...@chromium.org, cho...@chromium.org, John Abd-El-Malek, network-s...@chromium.org, chromium-mojo, Ken Rockot, Makoto Shimazu
Regarding to NetworkServiceRestartBrowserTest.FetchFromServiceWorkerControlledPage_PassThrough and _NoFetchHandler, IIUC, they rely on RenderFrameHostImpl::UpdateSubresourceLoaderFactories() because these test cases use "fallback factory" connected to the URLLoaderFactory provided by NetworkService, and the factory is updated via UpdateSubresourceLoaderFactories(). This was introduced in crrev.com/c/1237720.
I don't know if the same thing happens on _RespondWithFetch because it should rely on the restarting mechanism explained by falken@. The timing of getting newer factory would be much after the disconnection happens in that test.


2018年10月11日(木) 0:06 Matt Menke <mme...@chromium.org>:
Reply all
Reply to author
Forward
0 new messages