Hey Daniel!
I was able to reproduce the MultiplexRouter lifetime issue via test code and some synchonization games, so I figured I'd put up a CL and we'd be able to get it in for Christmas!
I suspect this could be causing some of the heap corruption we've been seeing in Mojo. This sort of thing would be a hotpath.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
base::OnceClosure error_handler_called_for_testing_;I am really not a fan of test-only callbacks. Is this something we'll need forever? Are there any possible alternatives?
&MultiplexRouter::OnPipeConnectionError, base::WrapRefCounted(this),I think this needs a comment to explain why it won't be a leak, since it looks like `this` owns `connector_`.
Is it important to add this ref preemptively? We can't just `scoped_refptr protect(this);` in `OnPipeConnectionError()`?
Based on the CL description, it sounds like it wouldn't work because the error handler callback is posted, and so the ref needs to exist while the callback is in flight, but more comments would be good.
// Without the extra reference fromIt looks like the extra reference causes the WeakPtrFactory to be invalidated on the wrong thread sometimes:
```
[273332:273343:FATAL:base/memory/weak_ptr.cc:27] DCHECK failed: sequence_checker_.CalledOnValidSequence(&bound_at) || HasOneRef(). WeakPtrs must be invalidated on the same sequenced thread as where they are bound.
Check failed at:
#0 0x60f235f15466 in ___interceptor_backtrace ??:0:0
#1 0x60f23c1ad548 in base::debug::CollectStackTrace(base::span<void const*, 18446744073709551615ul, void const**>) ./../../base/debug/stack_trace_posix.cc:1052:7
#2 0x60f23c17e81f in base::debug::StackTrace::StackTrace(unsigned long) ./../../base/debug/stack_trace.cc:279:20
#3 0x60f23be973c5 in logging::LogMessage::Flush() ./../../base/logging.cc:705:29
#4 0x60f23be96f30 in logging::LogMessage::~LogMessage() ./../../base/logging.cc:694:3
#5 0x60f23be63dc1 in ~DCheckLogMessage ./../../base/check.cc:147:3
#6 0x60f23be63dc1 in logging::(anonymous namespace)::DCheckLogMessage::~DCheckLogMessage() ./../../base/check.cc:143:32
#7 0x60f23be62c30 in operator() ./gen/third_party/libc++/src/include/__memory/unique_ptr.h:77:5
#8 0x60f23be62c30 in reset ./gen/third_party/libc++/src/include/__memory/unique_ptr.h:290:7
#9 0x60f23be62c30 in logging::NotReachedError::~NotReachedError() ./../../base/check.cc:306:16
#10 0x60f23beb6140 in base::internal::WeakReference::Flag::Invalidate() ./../../base/memory/weak_ptr.cc:27:3
#11 0x60f23beb70a6 in base::internal::WeakReferenceOwner::~WeakReferenceOwner() ./../../base/memory/weak_ptr.cc:87:12
#12 0x60f23e3c5d69 in mojo::Connector::~Connector() ./../../mojo/public/cpp/bindings/lib/connector.cc:177:1
#13 0x60f23e40257d in mojo::internal::MultiplexRouter::~MultiplexRouter() ./../../mojo/public/cpp/bindings/lib/multiplex_router.cc:468:1
#14 0x60f2369f7d0c in ~TestableMultiplexRouter ./../../mojo/public/cpp/bindings/tests/testable_multiplex_router.cc:56:1
#15 0x60f2369f7d0c in ~TestableMultiplexRouter ./../../mojo/public/cpp/bindings/tests/testable_multiplex_router.cc:52:53
#16 0x60f2369f7d0c in non-virtual thunk to mojo::test::TestableMultiplexRouter::~TestableMultiplexRouter() ./../../mojo/public/cpp/bindings/tests/testable_multiplex_router.cc:0:0
#17 0x60f2369f87e8 in DeleteInternal<mojo::AssociatedGroupController> ./../../base/memory/ref_counted.h:438:5
#18 0x60f2369f87e8 in Destruct ./../../base/memory/ref_counted.h:391:5
#19 0x60f2369f87e8 in Release ./../../base/memory/ref_counted.h:427:7
#20 0x60f2369f87e8 in Release ./../../base/memory/scoped_refptr.h:392:8
#21 0x60f2369f87e8 in ~scoped_refptr ./../../base/memory/scoped_refptr.h:280:7
#22 0x60f2369f87e8 in reset ./../../base/memory/scoped_refptr.h:311:18
#23 0x60f2369f87e8 in mojo::test::TestableMultiplexRouter::Scheduler::ResetRouter() ./../../mojo/public/cpp/bindings/tests/testable_multiplex_router.cc:83:11
#24 0x60f2369f9202 in Invoke<void (mojo::test::TestableMultiplexRouter::Scheduler::*)(), mojo::test::TestableMultiplexRouter::Scheduler *> ./../../base/functional/bind_internal.h:730:12
#25 0x60f2369f9202 in MakeItSo<void (mojo::test::TestableMultiplexRouter::Scheduler::*)(), std::__Cr::tuple<base::internal::UnretainedWrapper<mojo::test::TestableMultiplexRouter::Scheduler, base::unretained_traits::MayNotDangle, (partition_alloc::internal::RawPtrTraits)0> > > ./../../base/functional/bind_internal.h:922:12
#26 0x60f2369f9202 in RunImpl<void (mojo::test::TestableMultiplexRouter::Scheduler::*)(), std::__Cr::tuple<base::internal::UnretainedWrapper<mojo::test::TestableMultiplexRouter::Scheduler, base::unretained_traits::MayNotDangle, (partition_alloc::internal::RawPtrTraits)0> >, 0UL> ./../../base/functional/bind_internal.h:1059:14
#27 0x60f2369f9202 in base::internal::Invoker<base::internal::FunctorTraits<void (mojo::test::TestableMultiplexRouter::Scheduler::*&&)(), mojo::test::TestableMultiplexRouter::Scheduler*>, base::internal::BindState<true, true, false, void (mojo::test::TestableMultiplexRouter::Scheduler::*)(), base::internal::UnretainedWrapper<mojo::test::TestableMultiplexRouter::Scheduler, base::unretained_traits::MayNotDangle, (partition_alloc::internal::RawPtrTraits)0>>, void ()>::RunOnce(base::internal::BindStateBase*) ./../../base/functional/bind_internal.h:972:12
#28 0x60f235fe2666 in base::OnceCallback<void ()>::Run() && ./../../base/functional/callback.h:155:12
#29 0x60f23bf756b8 in base::TaskAnnotator::RunTaskImpl(base::PendingTask&) ./../../base/task/common/task_annotator.cc:229:34
#30 0x60f23c022493 in RunTask<(lambda at ../../base/task/sequence_manager/thread_controller_with_message_pump_impl.cc:474:11)> ./../../base/task/common/task_annotator.h:113:5
#31 0x60f23c022493 in base::sequence_manager::internal::ThreadControllerWithMessagePumpImpl::DoWorkImpl(base::LazyNow*) ./../../base/task/sequence_manager/thread_controller_with_message_pump_impl.cc:472:23
#32 0x60f23c0208ce in base::sequence_manager::internal::ThreadControllerWithMessagePumpImpl::DoWork() ./../../base/task/sequence_manager/thread_controller_with_message_pump_impl.cc:346:40
#33 0x60f23c1c6f07 in base::MessagePumpEpoll::Run(base::MessagePump::Delegate*) ./../../base/message_loop/message_pump_epoll.cc:224:55
#34 0x60f23c024843 in base::sequence_manager::internal::ThreadControllerWithMessagePumpImpl::Run(bool, base::TimeDelta) ./../../base/task/sequence_manager/thread_controller_with_message_pump_impl.cc:647:12
#35 0x60f23bf2c855 in base::RunLoop::Run(base::Location const&) ./../../base/run_loop.cc:134:14
#36 0x60f23c111da3 in base::Thread::Run(base::RunLoop*) ./../../base/threading/thread.cc:361:13
#37 0x60f23c11266a in base::Thread::ThreadMain() ./../../base/threading/thread.cc:436:3
#38 0x60f23c16676f in base::(anonymous namespace)::ThreadFunc(void*) ./../../base/threading/platform_thread_posix.cc:102:13
#39 0x60f235f6d587 in asan_thread_start(void*) _asan_rtl_:0
```
namespace mojo {Please fix this WARNING reported by ClangTidy: check: modernize-concat-nested-namespaces
nested namespaces can be concatenated...
check: modernize-concat-nested-namespaces
nested namespaces can be concatenated (https://clang.llvm.org/extra/clang-tidy/checks/modernize/concat-nested-namespaces.html)
(Note: You can add `Skip-Clang-Tidy-Checks: modernize-concat-nested-namespaces` footer to the CL description to skip the check)
(Lint observed on `linux-clang-tidy-rel`, but not on `android-clang-tidy-rel`)
auto router = base::WrapRefCounted(new TestableMultiplexRouter(
std::move(message_pipe), config, set_interface_id_namespace_bit, runner,
"TestableMultiplexRouter"));```suggestion
auto router = base::MakeRefCounted<TestableMultiplexRouter>(
std::move(message_pipe), config, set_interface_id_namespace_bit, runner,
"TestableMultiplexRouter");
```
destructor_wait_->TimedWait(base::Seconds(1));Why 1 second?
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Connector connector_;the connector_ is owned by the MultiplexRouter, so it will be deleted at the same time. Is that not the issue here, that the connector_ and MultiplexRouter are deleted while the Connector is performing the error handling on its own sequence? Does the sequence check hit all the time if we make it prominent in the Connector dtor?
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
drive-by: Can we do something like [this](https://source.chromium.org/chromium/chromium/src/+/main:mojo/public/cpp/bindings/shared_remote.h?q=symbol%3A%5Cbmojo%3A%3ASharedRemoteBase%3A%3ARemoteWrapper%3A%3ADeleteOnCorrectThread%5Cb%20case%3Ayes) to ensure the clean up happens in the right thread?
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
I addressed the feedback. I integrated the Deleter code, which I think was a really great suggestion from @Andrea. I think by guaranteeing the deletion happens on the same sequence as all the other calls, we guarantee that the Connector will be shutdown before the calls would come in or we guarantee the calls would happen before the MultiplexRouter is actually deleted. I augmented the being_destructed_ to be a mutable atomic bool and had that set in DeleteOnCorrectSequence and added checks before all the base::WrapRefCounted calls for PostTasks to ensure AddRef won't happen after the destruction has been scheduled.
I will look at the remote_unittest failures probably tonight, but it seems like the issue is hot so I thought I'd put out what I have now as an update.
I am really not a fan of test-only callbacks. Is this something we'll need forever? Are there any possible alternatives?
I updated to friend class the RaceConditionScheduler and I run the callback there right before I call into the method.
Connector connector_;the connector_ is owned by the MultiplexRouter, so it will be deleted at the same time. Is that not the issue here, that the connector_ and MultiplexRouter are deleted while the Connector is performing the error handling on its own sequence? Does the sequence check hit all the time if we make it prominent in the Connector dtor?
It's not deleted at *exactly* the same time.
"For both user-defined or implicitly-defined destructors, after executing the body of the destructor and destroying any automatic objects allocated within the body, the compiler calls the destructors for all non-static non-variant data members of the class, in reverse order of declaration, then it calls the destructors of all direct non-virtual base classes in reverse order of construction (which in turn call the destructors of their members and their base classes, etc), and then, if this object is of most derived class, it calls the destructors of all virtual bases.
Even when the destructor is called directly (e.g. obj.~Foo();), the return statement in ~Foo() does not return control to the caller immediately: it calls all those member and base destructors first."
https://www.en.cppreference.com/w/cpp/language/destructor.html
So ~MultiplexRouter is destroyed, and then the member variables are destroyed one at a time in reverse order, with the task_runner_ being destroyed after the connector_ for example.
I added a MultiplexRouter::DeleteOnCorrectSequence() that gets called by a Destruct method, and that will ensure the destruction is on the right thread and made being_destructed_ a mutable atomic bool set when that method is called. I followed up with checks on that where there were AddRefs to prevent the Use After Free where we have posted the DeleteOnCorrectSequence but then we call something else that AddRefs and posts additional tasks to be handled after that would be called.
&MultiplexRouter::OnPipeConnectionError, base::WrapRefCounted(this),I think this needs a comment to explain why it won't be a leak, since it looks like `this` owns `connector_`.
Is it important to add this ref preemptively? We can't just `scoped_refptr protect(this);` in `OnPipeConnectionError()`?
Based on the CL description, it sounds like it wouldn't work because the error handler callback is posted, and so the ref needs to exist while the callback is in flight, but more comments would be good.
Done
This is perfect! Truly the solution could not have been complete without that suggestion!
Please fix this WARNING reported by ClangTidy: check: modernize-concat-nested-namespaces
nested namespaces can be concatenated...
check: modernize-concat-nested-namespaces
nested namespaces can be concatenated (https://clang.llvm.org/extra/clang-tidy/checks/modernize/concat-nested-namespaces.html)
(Note: You can add `Skip-Clang-Tidy-Checks: modernize-concat-nested-namespaces` footer to the CL description to skip the check)
(Lint observed on `linux-clang-tidy-rel`, but not on `android-clang-tidy-rel`)
Done
auto router = base::WrapRefCounted(new TestableMultiplexRouter(
std::move(message_pipe), config, set_interface_id_namespace_bit, runner,
"TestableMultiplexRouter"));```suggestion
auto router = base::MakeRefCounted<TestableMultiplexRouter>(
std::move(message_pipe), config, set_interface_id_namespace_bit, runner,
"TestableMultiplexRouter");
```
Done
destructor_wait_->TimedWait(base::Seconds(1));Emily AndrewsWhy 1 second?
To ensure there's never a hang in the test. I added an EXPECT_TRUE so it would be an error to have the time elapse.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Fixed the remoting_unittests issue and should be good to go in if you guys are pleased with the change.
I removed the feature key, since ensuring destruction on sequence is safe enough and will not result in hangs.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
the RefCountedThreadSafe::DeleteInternal implementation.Can the CL description mention why the webnn changes are needed? (Though a number of WebNN tests appear to be broken)
// Deletes this object on the correct sequence to protect againstThis comment should explain what "the correct sequence is". That might be better suited as a class-level comment line 24 that explains the threading requirements for the class as a whole; then this can just reference that.
if (being_destructed_) {I'm wondering if this (and other similar checks) are potentially racy.
What prevents this from being `false` here, but then being set to `true` before the `PostTask()` call below?
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Is this ready for a re-review?
Not yet, I think you are right about the race, but I'm thinking through when to get rid of the handler reference explicitly once the InterfaceEndpointClient and InterfaceEndpoint are deconstructed. I think I might have something later today.
The ASAN test failures are because I need to get the right RunLoop hook to run through the asynchronous deconstruction. That might take a little time too.
if (being_destructed_) {I'm wondering if this (and other similar checks) are potentially racy.
What prevents this from being `false` here, but then being set to `true` before the `PostTask()` call below?
I found a HasAtLeastOneRef API on RefCountedThreadSafe, and this will not be racy.
if (being_destructed_) {Emily AndrewsI'm wondering if this (and other similar checks) are potentially racy.
What prevents this from being `false` here, but then being set to `true` before the `PostTask()` call below?
I found a HasAtLeastOneRef API on RefCountedThreadSafe, and this will not be racy.
Actually, still a little racy
the RefCountedThreadSafe::DeleteInternal implementation.Can the CL description mention why the webnn changes are needed? (Though a number of WebNN tests appear to be broken)
I'm not surprised the WebNN tests need modifications. We've been struggling in WebNN to ensure that all the impl objects are destroyed during test shutdown because tasks need to be posted between threads.
void WebNNContextProviderImpl::SetDisconnectHandlerForTesting( // IN-TESTIs this comment necessary since the method is named "ForTesting"?