Replies inline. Also CC'ing chrome-memory@ for archiving, and so that I'm not a single point of contact.For context, Andre is trying to move the PoissonAllocationSampler (which is mainly used by the heap profiler) to use a different set of dispatch hooks to unblock memory-tagging extensions (MTE), which is an important security feature. (On Android, I believe?) I asked for the refactor to be behind a feature flag for a kill-switch in case it causes the heap profiler to regress, because we use that data often.What's the timeline & priority on this project? I feel like I'm becoming a bottleneck for you - do we need to find more reviewers to share the load?On Thu, Dec 15, 2022 at 10:29 AM Andre Kempe <Andre...@arm.com> wrote:Hej Joe.
we published a new PS for CL https://chromium-review.googlesource.com/c/chromium/src/+/4031766 . Ok, more than one, I had some problems with the iOS build.
What time zone are you in? After the new year it might be best to set up a video meeting / hackathon to work through these issues in real time, because it seems like many of them are integration issues that I could solve quickly.
We haven't addressed all issues as there are two open questions. But it would be great if you could have a look at the changes so far.
Unfortunately I'm on a sheriff rotation & have a lot of meetings in the next two days, so I don't think I'll have time to review in detail before going on vacation on Monday. Sorry! (I'm back on January 2.)As for the open questions (btw, PSA stands for PoissonAllocationSampler):
1) https://chromium-review.googlesource.com/c/chromium/src/+/4031766/comment/d0f1b098_ce9a9eaf/
So, there are a couple of things to consider here:
- the structure of the PSA. It doesn't allow for a complete un-initialize in testing due to various functions using static local variables as in https://source.chromium.org/chromium/chromium/src/+/main:base/sampling_heap_profiler/poisson_allocation_sampler.cc;l=223 I tried to turn the tests into parametrized tests. Unfortunately, they always fail once I start shuffling the test sequence because of some dependency between the tests and the un-initialize problem.
Yes, I tried to add various "reset state" functions to get around this, but I was never really happy with it. But it's performance sensitive code so I didn't want to go too far in getting rid of the globals to make it testable.- the way the PSA is being invoked with the Dispatcher is quite different from the way it is today, and I am not sure the that the testing done for the old invocation is adequate for the new invocation. In fact, there are quite extensive tests for the Dispatcher. Also, I wouldn't test the whole chain from malloc and free down to the Sampler (as this is already tested by the Dispatcher).
I do want to keep a full stack test from malloc to the HeapProfilerController collecting a report, because it will catch any regressions caused by layers being hooked up wrong. But to unblock MTE, you could disable / delete the tests that are causing the most problems right now, and file a bug to add a new end-to-end test.In hindsight the end-to-end HeapProfilerController test should be a browsertest, which will start a new subprocess (thus getting completely clean state for the PoissonAllocationSampler and dispatcher hooks). Ideally this would be done before switching to the new allocator, but the timing is awkward. If necessary we could be sure to manually test the heap profiler, and then add a test to catch regressions from changes we aren't watching as closely.We could also use base/test/multiprocess_test.h to move the lower-level tests that use globals into subprocesses, although that would need more boilerplate to get the test results.
- for the feature-flag we introduce quite an amount of code, especially for testing it thoroughly, thus rendering the flag as a possibility to fall back to a known stable path problematic. Additionally, these changes are likely to become obsolete due to the different the way the Sampler is invoked by the Dispatcher.
Now that I've had more time to review the code changes, I think that there's little chance of the new Dispatcher causing hidden problems. (Except for performance issues, see below.) If it doesn't work in production we should be able to spot that early in Canary and revert the CL that switches to the Dispatcher. But, I'd want to coordinate closely with you:- make sure the change is in a CL that's easy to revert if needed- make sure we know when you land it so we can check the heap profiler results on canary/dev over the next week (we'll need to check the release cadence for all platforms, eg. Linux doesn't really have a canary, to make sure it's rolled out everywhere we care about before we conclude it's fully working)- make sure no follow-up CL's that would make the revert harder are committed until we've verified it hasn't affected the heap profilerWe were wondering if there is another approach to bring the Dispatcher into action. I.e., using a FieldTrial for a smooth transition, or disabling the Dispatcher by default for the beginning and switching to enabled-by-default afterwards.
Yes, we can do that easily if there's a base::Feature in place. That's better than just using the Feature for a kill switch, in fact. But I thought your problems are mostly about the complexity of integrating the base::Feature into the code, with all the duplication it involves.
If not, then I'd propose to go for an intermediate CL to improve the testability of PSA and remove dependencies from other tests since this will probably also affect production code.
I'd be very happy to approve any CL that improves testability!
2) https://chromium-review.googlesource.com/c/chromium/src/+/4031766/comment/d187bfa8_8431636c/
I wouldn't want to move the code for setting up the Dispatcher into PSA as this we will have more observers in the future, and this would create quite a dependency between PAS and all the new observers. However, I wouldn't mind having a centralized point of setup. Do you know of a good point which all the delegates have access to?
I don't think there's one point that all the delegates call automatically this early in setup, but you could put the dispatcher setup into a utility function, eg. `heap_profiler_controller_ = InitializeMemoryObservers(command_line)`, so that each delegate just needs to call it instead of duplicating the setup code. Then additional observers would update that function to handle all the different types of observer.Regarding the on-demand-initialization, that's an issue, indeed. Once the Dispatcher is set up, the set of observers cannot be modified. This gives us a very good runtime performance as the Dispatcher doesn't use any conditional branches, locks and the like, but at the cost of some flexibility. From your point of view, would it be ok to implement an on/off switch in PSA as you proposed in the last paragraph of the comment.
Yes, as long as we have a Feature to disable it if we find performance problems due to the added conditional branch once it reaches beta or stable. I don't think testing in canary will be enough for that.I think we'd want to split RecordAlloc into two hooks:// staticreturn;// Sampling is in fact disabled. Reset the state of the sampler.// We do this check off the fast-path, because it's quite a rare state when// allocation hooks are installed but the sampler is not running.return;}}If the Feature is off, use the old dispatcher, and install this as the hook.If the Feature is on, duplicate the code into a new hook that swaps the order:if (LIKELY(!g_running.load(etc)) {// Sampling is disabled. This is the default state, but must be checked on the fast path because it's common that// allocation hooks are installed by the sampler is not running....}ThreadLocalData* = etc...instance_->DoRecordAllocThat way the feature's only checked once at startup, and we get the same performance when it's on.Optionally, we could keep both of those hooks, and for platforms that have MTE (or any other observers installed through the dispatcher), at startup install the observers + the PoissonAllocationSampler hook with LIKELY(!g_running). For platforms where PoissonAllocationSampler is the only observer, install the hook with UNLIKELY(!g_running) when the sampler is first used.But that might add more complexity than it's worth so lets skip it, and keep it in mind for an optimization if necessary.
Thanks for your ideas and help.
No problem! Sorry that this feature is causing such complications.Joe
Hej Joe,
Happy new year to you all.
Regarding your question
What time zone are you in? After the new year it might be best to set up a video meeting / hackathon to work through these issues in real time, because it seems like many of them are integration issues that I could solve quickly.
Yes, having a small meeting to discuss next steps would be greatly appreciated.
We’re in the UK. I assume you’re somewhere in the US, aren’t you? So, would 4th or 5th at ~5pm GMT work for you?
Thanks & Best Regards,
André