Isolating the browser process from third-party accessibility software


Patrick Monette

Feb 21, 2020, 2:30:18 PM
to Chromium-dev, Chris Hamilton, Andrew Ritz, Stefan Smolen, C.J. Hebert, Michael Ens
This is a discussion about implementing a cross-process parent/child window relationship between the browser process and a utility process where the main window would live, so that accessibility software would continue working without injecting into the main browser process.

First, some background:
As part of a stability effort to reduce third-party-related crashes in the browser process, we decided about two years ago to start blocking third-party software from injecting into Chrome processes. This change hasn't been implemented *yet* because of a number of issues we ran into.

One such issue is that a change we have to make (disabling legacy hooking) negatively impacts accessibility software, which we're committed to keeping in working order because it is critical to some of our users.

At the end of last year, a few months ago, we had a quick informal discussion with a few Microsoft engineers about possibly implementing a cross-process parent/child window relationship between the browser process and a sacrificial utility process that would host the window and that accessibility software would be allowed to inject into. This is something some of us (Chris Hamilton, me) were interested in but lacked the technical knowledge to make happen.

Andrew Ritz, Michael Ens and their team (Stefan Smolen, C.J. Hebert) seemed interested in collaborating on this, as they've actually implemented something similar in the non-Chromium-based Edge browser.

The Microsoft blog post goes a little bit into the details and risks of such an approach, but I'm wondering if it's possible to know more? We'd love to get this working as in addition to keeping accessibility software out of the main process, we'd also be able to keep Chrome compatible with some window managers that need to inject into the window process to function optimally.

Thanks in advance!

Stefan Smolen

Feb 23, 2020, 4:04:57 PM
to Patrick Monette, Chromium-dev, Chris Hamilton, Andrew Ritz, C.J. Hebert, Michael Ens, Kevin Babbitt, Rossen Atanassov
Thanks for starting the conversation,

I know that Kevin and Rossen both expressed interest / support for this topic so I'm adding them to the thread.


Michael Ens

Feb 23, 2020, 4:05:01 PM
to Stefan Smolen, Patrick Monette, Chromium-dev, Chris Hamilton, Andrew Ritz, C.J. Hebert, Kevin Babbitt, Rossen Atanassov

Hi,

 

To give context on who’s who, Kevin and Rossen are experts on accessibility (among other topics :) ), while Stefan, CJ, and I are experts on cross-process window parenting and input processing, and the process model in general. We’ve been meaning to open this thread (thank you) but wanted to bring Rossen and Kevin in on the discussion. They know of some substantial accessibility-specific issues that need to be worked through and that are large in their own right.

 

I’m going to give a long answer to your question. I’m not sure where I’m giving too much or too little detail, so I’m open to feedback. I 100% agree with Raymond’s assertion that this is like juggling chainsaws, but if I do say so myself, we have a few experienced chainsaw jugglers :).

 

What is Input Queue Attachment?

 

Most threads in Windows have a win32k message queue. There are a number of equivalent ways of looking at that queue; the way I like to look at it is as a queue of queues. When you call GetMessage or PeekMessage, you pull messages out of the queue in this order:

 

NonQueued > Posted > Input > Generated

 

The NonQueued “queue” is essentially all the other threads that are trying to interrupt your thread, eg. with a SendMessage operation that would be processed before GetMessage() or PeekMessage() actually returns.  Posted messages come from PostMessage and many APIs that ultimately wrap PostMessage.  Input messages are generally keyboard & pointer (mouse/touch/pen etc.), including messages related to keyboard focus and activation.  Generated messages are things like timers, paint messages, etc., that are generated when there’s no other work in the queue but a certain flag is set on the input queue.  WM_MOUSEMOVE / WM_POINTERMOVE is a special case that’s sometimes an input message, sometimes a generated message, due to how mouse coalescing works.
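As a rough illustration of the ordering described above, here is a minimal portable C++ sketch of a "queue of queues". It is only a model: `ThreadQueue`, `Source`, and `Message` are hypothetical names, not win32k types, and the model returns non-queued messages to the caller even though Windows dispatches them inside GetMessage/PeekMessage.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical model of one thread's sub-queues, highest priority first.
enum class Source { kNonQueued = 0, kPosted, kInput, kGenerated, kCount };

struct Message {
  Source source;
  std::string name;
};

class ThreadQueue {
 public:
  // Appends a message to the sub-queue that matches its source.
  void Push(Source source, const std::string& name) {
    assert(source != Source::kCount);
    queues_[static_cast<std::size_t>(source)].push_back(Message{source, name});
  }

  // Models the retrieval order NonQueued > Posted > Input > Generated:
  // pops the oldest message of the first non-empty sub-queue.
  // Returns false when every sub-queue is empty.
  bool Next(Message* out) {
    for (std::deque<Message>& q : queues_) {
      if (!q.empty()) {
        *out = q.front();
        q.pop_front();
        return true;
      }
    }
    return false;
  }

 private:
  std::array<std::deque<Message>, static_cast<std::size_t>(Source::kCount)>
      queues_;
};
```

Pushing a timer, a key press, a posted message, and a sent message in that order and then calling `Next` repeatedly yields them in the reverse order, because arrival time only matters within a single sub-queue.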


For the most part, every thread has a completely independent queue of queues.  But when you attach input queues, you have two threads that share the same Input Queue, and many of the same flags and member variables for the Input Queue’s State, such as this thread’s currently focused window.  I’m going to attempt a little ascii art, not sure if it’ll pull through:

 

Thread A:   NonQueued > Posted       Generated
                              \     /
                               Input
                              /     \
Thread B:   NonQueued > Posted       Generated

 

Problems with Attached Queues

Implicit synchronous dependency

 

If that diagram made any sense, the immediately obvious problem is that a deadlock in one thread can cause another idle thread to never process messages.  Consider if thread A receives a click (mousedown followed by mouseup), then B receives a click, then A receives a click again.  A and B are both idle.  The Input Queue looks like:

 

Adown Aup Bdown Bup Adown Aup

 

Then suppose that A can process messages just fine, but processing Bdown triggers an infinite loop.  Now the queue looks like:

 

Bup Adown Aup

 

Bdown has already been removed from the queue, but the damage is done.  Bup cannot be processed because thread B is wedged.  Adown can’t be processed because Bup is ahead of it in queueing order.  So even though thread A is completely idle, it cannot service any input whatsoever.
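The scenario above can be replayed with a small portable C++ model. This is a sketch under simplifying assumptions, not how win32k is implemented: `SharedInputQueue` and `RunWedgedBScenario` are hypothetical names, and the only rule modeled is that a thread can take an input event solely when that event is at the head of the shared queue and targets that thread.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

struct InputEvent {
  char target;       // 'A' or 'B': the thread owning the target window
  std::string name;  // "down" or "up"
};

// Hypothetical model of an attached (shared) input queue.
class SharedInputQueue {
 public:
  void Push(char target, const std::string& name) {
    events_.push_back(InputEvent{target, name});
  }

  // Pops the head only if it targets `thread`; otherwise the caller stays
  // stuck behind another thread's event and gets false.
  bool TryPop(char thread, InputEvent* out) {
    if (events_.empty() || events_.front().target != thread) return false;
    *out = events_.front();
    events_.pop_front();
    return true;
  }

  std::size_t size() const { return events_.size(); }

 private:
  std::deque<InputEvent> events_;
};

struct ScenarioResult {
  int processed_by_a;
  std::size_t stuck_in_queue;
};

// Replays: Adown Aup Bdown Bup Adown Aup, where B wedges while handling Bdown.
inline ScenarioResult RunWedgedBScenario() {
  SharedInputQueue q;
  q.Push('A', "down"); q.Push('A', "up");
  q.Push('B', "down"); q.Push('B', "up");
  q.Push('A', "down"); q.Push('A', "up");

  InputEvent ev;
  int processed_by_a = 0;
  while (q.TryPop('A', &ev)) ++processed_by_a;  // A handles Adown, Aup.

  const bool b_took_down = q.TryPop('B', &ev);  // B takes Bdown, then hangs.
  assert(b_took_down);
  (void)b_took_down;

  while (q.TryPop('A', &ev)) ++processed_by_a;  // Head is Bup: A gets nothing.
  return ScenarioResult{processed_by_a, q.size()};
}
```

In this model thread A handles only its first click; Bup, Adown and Aup stay queued even though A is idle, which matches the state described in the text.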

 

This can be particularly challenging when you call a function like SetFocus, including implicit calls when you call DefWindowProc on WM_MOUSEDOWN.  Focus is a synchronous queue state.  If you move focus from one window on the current input queue to another, they both get a synchronous message.  So moving focus to or from a thread that is processing messages, will block the other thread.

 

There’s a number of other subtle interactions.  This also leads to developers seeing a hang, attaching a debugger, and being confused when confronted with a completely idle thread that seemed to be frozen.

 

The good news is, if you can detect this, you can detach input queues just-in-time, and then A can proceed with processing messages.

 

Common queue flags causing Message Pump Infinite Loop (“sympathy hang”) or Stall

 

Interestingly, it looks like the chromium source is already dealing with this to some degree in MessagePumpForUI::WaitForWork, so it’s something you have already encountered in some cases, probably having to do with dialog windows. There is also a blog post from a Mozilla dev concerning the same code: https://dblohm7.ca/blog/2015/03/12/waitmessage-considered-harmful/. At first glance, I’m not sure the existing solution is really effective, but maybe I just need to sit down in the debugger to see what’s going on live. It’s a bit different from the workarounds and solutions I’ve seen before.

 

The fundamental problem is, in the above case, suppose thread A checks its queue for messages.  It has pending messages, but none it can process.  There are two basic ways to let the thread go to sleep in MsgWaitForMultipleObjects(), hinging on whether you pass MWMO_INPUTAVAILABLE or not.  With no flag, it says “wake me up when there’s new messages anywhere in the queue”.  If thread B eventually processes its message, thread A might never wake up even though it still has pending work because there’s no new pending work, until something “pokes” it – experienced by the user as a window that hangs until you just wave your mouse over the window and it clears up.  With the flag, it says “wake me up when there’s any input messages, new or not, for me to process”.  Then it wakes up immediately because there’s already input messages for it to process.  But it can’t process them yet due to the attached queue.  So you spin in a tight loop.  And this tight loop occupies CPU time that could be spent working on whatever is blocking thread B.  All the while suffering from increased battery usage.
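The two failure modes (stall versus spin) can be contrasted with a toy simulation. This is a deliberately simplified sketch, not real message pump code: `WaitPolicy`, `Outcome`, and `SimulateThreadA` are hypothetical names, each loop step stands for one chance to wake, and the simulation assumes no new message arrives while thread A waits. In real code the difference is whether MWMO_INPUTAVAILABLE is passed in the flags argument of MsgWaitForMultipleObjectsEx.

```cpp
#include <cassert>

// kNewInputOnly models waiting without MWMO_INPUTAVAILABLE;
// kInputAvailable models waiting with it.
enum class WaitPolicy { kNewInputOnly, kInputAvailable };

struct Outcome {
  int wakeups;     // How many times thread A woke up.
  bool processed;  // Whether A ever got to process its pending input.
};

// Thread A has pending input it has already seen. Until step `unblock_at`,
// that input sits behind thread B's event in the shared queue.
inline Outcome SimulateThreadA(WaitPolicy policy, int unblock_at,
                               int max_steps) {
  assert(unblock_at >= 0 && max_steps >= 0);
  Outcome out = {0, false};
  for (int t = 0; t < max_steps; ++t) {
    // Without the flag, only *new* input wakes A, and none arrives here.
    if (policy == WaitPolicy::kNewInputOnly) continue;
    // With the flag, already-queued input wakes A on every step.
    ++out.wakeups;
    if (t >= unblock_at) {  // B has finally consumed its event.
      out.processed = true;
      break;
    }
  }
  return out;
}
```

With `unblock_at = 3`, the flagless policy never wakes at all (the stall that lasts until something pokes the thread), while the flagged policy wakes four times and wastes the first three (the tight loop).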

 

Problems with cross process parenting besides Attached Queues

Owner/owned problem

 

Suppose you, or an API you call, create a top-level window based on the current window, whether modal or modeless. Top-level windows don’t have parents, but they often have owners, such that when you click the owner, the owned dialog comes to the top, or vice versa. When you attempt to make a child window an owner, the ownership is basically transferred to that child’s top-level ancestor. This means you can implicitly create a cross-process Owner/Owned relationship. First of all, this implicitly attaches input queues again, even if you detached them before. Secondly, it creates a synchronous dependency risk of its own. When the mouse clicks down on a background window, the Windows kernel sends a synchronous message to all owned windows to tell them to join that window in the Z-order. That means you have a sudden synchronous dependency from one thread to another, which cannot be detected ahead of time because the kernel made the call before usermode code even sees it happening. This one is particularly insidious because, unlike the Implicit Synchronous Dependency of attached input queues, once you’re in this state there’s no way out other than to have the hung thread recover on its own or to kill that process. Un-ownering the window after the fact will not repair the situation.

 

Why are Input Queues Attached?

You can “feel” the answer if you create an app with multiprocess UI and call AttachThreadInput with fAttach=FALSE, in a loop, to fully detach the input queues of the windows. In short, keyboard focus is messed up (sometimes it looks like something should have focus, but typing reveals that keyboard input is still going to the wrong window), clicking the child window doesn’t automatically bring the parent window to the top, etc. Win32 is generally built on the assumption that all windows that share a common ancestor top-level window share the same input queue, and some very basic things break down when you subvert that.

 

More literally, the input queues are attached because the Windows kernel automatically attaches queues between threads that have windows in a hierarchy relationship, and various triggers can cause Windows to re-evaluate that relationship and attempt to re-establish input queue attachment.  So even if you do detach input queues, if you’re not careful, they can be re-attached for you behind your back.  We have a decent survey of conditions that reattach input queues.  One of the trickiest to deal with has to do with IMEs.

 

Our Experience

Every previous Microsoft browser has included cross-process UI, at least since IE8.

 

Old hang resistance (accepting queue attachment as inevitable, accepting nontrivial work on both threads as also inevitable, and working around it)

In IE and earlier versions of the non-Chromium Edge, we spent a great deal of effort on dealing with this imperfectly.  Feel free to skip this section, the point I’m raising here is some concrete information on what “juggling chainsaws” really means.  The key points were:

 

  • Do not allow any dialog to be launched from the context of the child UI thread, even if created from third-party code (using shims/interceptors). We instead created something we call the “Alternate Owner Window”, and attempted to replicate win32k’s window z-order functionality without actually having a window ownership relationship.
  • Watchdog objects would monitor threads for responsiveness, and aggressively attempt to detach input queues and un-parent windows that were unresponsive.  If they later respond, they .
    • The riskiest time is the period between a hard-hang occurring and us detecting it, because in this time somebody can call SetFocus or process WM_MOUSEACTIVATE, and as soon as you do that the hang spreads to other threads.
    • Watchdog was periodic when user interactions were occurring, but also triggered on-demand before risky calls like SetFocus across threads.
  • We didn’t use MWMO_INPUTAVAILABLE on browser threads (we had one per top-level window, never attached to one another). Instead, we relied on knowing every other thread that should be attached to our main browser thread: before they transitioned from “processing input” -> “processing non-input”, they would PostMessage WM_NULL to the browser thread. This would wake the thread up just in case there was pending unprocessed input, just once, and then let it sleep. This only works when we have high confidence we can detect transitions from input processing -> not input processing on all other attached threads.
  • By convention, it was assumed that if our browser threads were doing a heavy load or infinite looping, it doesn’t really matter that any thread attached to it is spinning or suffering, because by definition the user impact is already unacceptably severe.  So the goal was to protect the main browser threads from other threads which could in theory process heavy loads without significant user impact.
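To make the watchdog bullet a little more concrete, here is a minimal heartbeat-based sketch in portable C++. It is an assumption about how such a check could be structured, not a description of the IE/Edge code: `ResponsivenessWatchdog` and its methods are hypothetical, and the actual detach step (for example AttachThreadInput with fAttach=FALSE) is left to the caller.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical watchdog: the monitored UI thread reports a heartbeat each
// time it returns to its message loop; another thread polls ShouldDetach().
class ResponsivenessWatchdog {
 public:
  using Clock = std::chrono::steady_clock;

  ResponsivenessWatchdog(Clock::time_point start,
                         std::chrono::milliseconds threshold)
      : last_heartbeat_(start), threshold_(threshold), attached_(true) {}

  // Called by the monitored thread whenever it is back in its message loop.
  void Heartbeat(Clock::time_point now) { last_heartbeat_ = now; }

  // Called periodically, and on demand before risky calls such as SetFocus.
  // True means the thread has been silent for longer than the threshold while
  // its input queue is still attached, so the caller should detach it now.
  bool ShouldDetach(Clock::time_point now) const {
    return attached_ && (now - last_heartbeat_) > threshold_;
  }

  // Records that the caller has detached the input queue.
  void MarkDetached() { attached_ = false; }

  bool attached() const { return attached_; }

 private:
  Clock::time_point last_heartbeat_;
  std::chrono::milliseconds threshold_;
  bool attached_;
};
```

Passing the current time in explicitly keeps the policy testable without sleeping; the threshold is a tuning parameter that trades detection latency against false positives.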

 

Overall, this leads to an experience where user interactions can be janky for about 1 second after some code enters a tight loop, but otherwise the browser is mostly resilient to problems from other threads.

 

Newer hang resistance (no queue attachment whatsoever, still having cross-process UI)

Newer versions of the pre-Chromium Edge browser used a different technique, just solving all the issues that input queue attachment solved, without the actual input queue attachment.  This technique gives much more symmetric independence (neither thread depends on the other, for the most part), and was fundamentally reliable with no hairy race conditions.  I believe this solution worked out great, and I’d like to take inspiration from it.  Unfortunately, we can’t import it as-is, because the precise implementation is tied pretty closely to UWP.  I’m not recommending that chromium switch to being a UWP on Windows, and I know it would at best take several releases of Windows to bring the whole implementation to win32 compatibility (we had looked into it in the past).

 

However, that’s for the entire “Component UI” API we designed.  Many fundamental implementation details could be generalized to win32 more easily as isolated features.  We can experiment internally with them and work with the various teams within the Windows organization to make these underlying APIs (or some wrappers thereof) stable and supported so we could feel comfortable using them going forward.  In particular, there is an API that allows you to give an HWND a false parent or a false child (just 1:1) for accessibility purposes only, without impacting input queue attachment, queue flags, or window ownership.  This API also allows us to fake out the dimensions and visibility state of that window.

 

Alternative (queue attachment, but total vacancy)

The other obvious solution is, if no attached thread ever does anything that risks taking nontrivial time, then we don’t have a problem.  We use MWMO_INPUTAVAILABLE, even though it effectively doubles CPU usage, because spin time is guaranteed to be so short that it makes no real difference.  We can simply accept all the other synchronous dependencies because we really don’t need to run both threads in parallel at any point, their workload is so sparse.  No dialog will pop up because the premise

 

I know the browser process UI thread already has a strict policy avoiding IO on the UI thread and that’s a great step here. A big risk is that we have to be sure nothing sneaks in that we don’t control, such as a dialog added from a thread we don’t own, because it ruins everything. One of the vectors by which things could sneak in is the very accessibility tools we are trying to keep out of proc in the first place. Here I want to defer to Kevin and Rossen as to the best way to do UIAutomation endpoints. Another vector is the window manager injectors you describe.

Michael Ens

Feb 23, 2020, 4:05:02 PM
to Stefan Smolen, Patrick Monette, Chromium-dev, Chris Hamilton, Andrew Ritz, C.J. Hebert, Kevin Babbitt, Rossen Atanassov

Correcting a missing thought below:

 

“No dialog will pop up because the premise" becomes “No dialog will pop up because launching a dialog would be nontrivial code."


Patrick Monette

Mar 5, 2020, 5:52:13 PM
to Chromium-dev, chr...@chromium.org, andre...@microsoft.com, ssm...@microsoft.com, cj.h...@microsoft.com, micha...@microsoft.com
Thanks Michael for the comprehensive overview!

I'm still trying to grasp all the details but I have a few questions for you already.

From your response, it seems that the solution used in non-Chromium Edge has quite appealing advantages over the other alternatives. While it would be a bit unfortunate to have a solution that works only on Windows 10, it's most likely a non-issue given that Windows 7 has already reached end of life and that Windows 8 usage is much lower than Windows 10 usage.

Since this requires some changes at the OS level, and would thus be a long-term solution, do you have a guesstimate of how long it could take to have those new APIs available? Also, are there any changes we could make in Chromium right now that would help us in the future?

On another note, I would like to create a small proof of concept for implementing option #1 (accepting queue attachment as inevitable, accepting nontrivial work on both threads as also inevitable, and working around it).

You've written in detail about the issues with attached input queues, but one thing that still isn't clear to me is how this would concretely be implemented in Chromium. This is probably a naive question, but do you think it'll require significant changes to how Chrome creates/manages its main window, or is there a simple way to transparently reparent the current Chrome main window to some kind of empty window? The UI is actually being drawn outside of the client area, over the window frame. Would that have to move into the parent window's process? Does most of the input handling have to move to that process too?

Thanks again for all the knowledge you're sharing with us.

Michael Ens

Mar 5, 2020, 6:54:59 PM
to Patrick Monette, Chromium-dev, Kevin Babbitt, Rossen Atanassov, chr...@chromium.org, Andrew Ritz, Stefan Smolen, C.J. Hebert

I actually think most of the experimentation would be purely on the chromium side.  For instance, the accessibility items ultimately just come down to calls to SetProp on the HWND in question with some well-defined internal property names.  We wouldn’t want to rely on that implementation detail being stable in the long-term, but I don’t think it would block prototyping.  The properties of interest are:

 

UIA_HWNDXOffset
UIA_HWNDYOffset
UIA_HWNDWidth
UIA_HWNDHeight
UIA_WindowVisibilityOverridden
CrossProcessChildHWND
CrossProcessParentHWND
UIA_UseSiblingAsChildForHitTesting

 

I can get back to you later on their meaning; it’s subtle and a bit weirder at points than you might expect because it’s only expected to be used internally. This does not constitute documentation :).

 

In terms of your question: yes, we can reparent the window relatively simply, which makes a proof of concept for #1 not too difficult, at least as far as windowing goes. Basically, just do what chromium does now, then launch a new process that creates a new top-level window and returns the handle (the HWND). Then have the Browser Process call SetWindowLongPtr to convert the existing top-level window from WS_POPUP to WS_CHILD, and SetParent to parent it to the HWND from the utility process. That’s almost how old Internet Explorer works, except in IE it’s the other way around, where the child process’ child windows are reparented to the top-level window. But regardless of who is parent, we need to perform the operations in the medium-IL process. The vast majority of input handling does not necessarily need to move to the new process, but things like the size/move modal loop (dragging by the titlebar) and closing the window from the taskbar are going to get in touch with this new top-level window first. That means, if we don’t move it, then we have the condition I warned about where both processes will receive input and potentially keyboard focus, so it opens up a whole lot of landmines. But it can be done. Windows will automatically route your pointer (mouse, touch, etc.) input to the child window just as it would any other child window, and there will be one keyboard focus window across both UI threads.
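The style change and reparenting step described above could look roughly like the following C++ sketch. It is an illustration only, not Chromium code: `ToChildStyle` and `ReparentIntoHost` are hypothetical helper names, `host_hwnd` is assumed to have already been received from the utility process over some IPC channel, and error handling is minimal. The Win32 part is guarded so that the style helper also builds off Windows.

```cpp
#include <cassert>
#include <cstdint>

#ifdef _WIN32
#include <windows.h>
#else
// Values as defined in WinUser.h, repeated so the helper builds off Windows.
#define WS_POPUP 0x80000000L
#define WS_CHILD 0x40000000L
#endif

// Clears WS_POPUP and sets WS_CHILD, leaving every other style bit untouched.
inline std::uint32_t ToChildStyle(std::uint32_t style) {
  style &= ~static_cast<std::uint32_t>(WS_POPUP);
  style |= static_cast<std::uint32_t>(WS_CHILD);
  return style;
}

#ifdef _WIN32
// `browser_hwnd`: the existing top-level browser window.
// `host_hwnd`: the top-level window created by the (hypothetical) host
// utility process. Returns false if SetParent fails.
inline bool ReparentIntoHost(HWND browser_hwnd, HWND host_hwnd) {
  const LONG_PTR old_style = GetWindowLongPtrW(browser_hwnd, GWL_STYLE);
  const LONG_PTR new_style = static_cast<LONG_PTR>(
      ToChildStyle(static_cast<std::uint32_t>(old_style)));
  // Update the style first, then reparent, as described in the text.
  SetWindowLongPtrW(browser_hwnd, GWL_STYLE, new_style);
  return SetParent(browser_hwnd, host_hwnd) != nullptr;
}
#endif
```

Keeping the bit manipulation in a separate pure function makes it easy to check that unrelated style bits such as WS_VISIBLE survive the conversion.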

 

We’d probably set things up so that whenever this “host” process owning the top level window goes down, then the browser process self-terminates, at least to begin with.  The long-haul part is creating careful mitigations for potential queue issues, and also adjusting crashpad etc. to be aware of the new reality that a hang in one thread could be due to something in a different process entirely that doesn’t seem to be waiting from usermode.

 

Then the next step is changing the accessibility objects themselves and serializing them out to this process that contains the top-level window. I know Kevin and Rossen think this will take a good deal of time and has some big rocks, and apparently they fell off the email, so I’m putting them back :).

Will Harris

Aug 13, 2020, 7:44:34 PM
to Chromium-dev, micha...@microsoft.com, Chris Hamilton, Andrew Ritz, ssm...@microsoft.com, C.J. Hebert, Patrick Monette, kbab...@microsoft.com, Rossen Atanassov
Hi,

I'm curious if there has been any progress on this work - I wonder if there was an idea of the steps needed to make this happen, and what the blockers were? We would still very much like a way to get legacy IME working but be able to prevent injection into the main browser process.

Thanks,

Will

Stefan Smolen

Aug 13, 2020, 8:21:34 PM
to Will Harris, Chromium-dev, Michael Ens, Chris Hamilton, Andrew Ritz, C.J. Hebert, Patrick Monette, Kevin Babbitt
Hey Will,

I have it on my schedule to investigate the cross-process accessibility and/or input mechanisms outlined by Michael in more detail, starting in the next few months. The work has been delayed for various reasons, including addressing feedback from the initial public release of Edge and global events since then, but it is still a priority for us.

-Stefan


Gabriel Charette

Oct 27, 2020, 11:25:56 AM
to Chromium-dev, micha...@microsoft.com, Chris Hamilton, Andrew Ritz, C.J. Hebert, kbab...@microsoft.com, Rossen Atanassov, ssm...@microsoft.com, Patrick Monette
[was: Isolating the browser process from third-party accessibility software]

Hi,

Just stumbled upon this thread (forked) in the context of trying to understand why we see multi-second hangs in calls to ::PeekMessage (most regularly from MessagePumpForUI::ProcessNextWindowsMessage(), but other callers in message_pump_win.cc see this too).

Would the synchronous implicit inter-thread input queue dependency mentioned above cause this? If I understand correctly, it shouldn't? It should only cause ::PeekMessage to return no-input MSG and have the thread go idle with pending input (and do the MWMO_INPUTAVAILABLE dance in WaitForWork())?

We know of course that ::PeekMessage can lead to processing sent-messages inline (and we see some hangs there too) but in this case we're talking about NtUserPeekMessage being on top of the stack..?

Here's an example of a surprising stack where we see such a hang (15 seconds! associated trace):

<aliased> NtUserPeekMessage
 _PeekMessage(tagMSG *,HWND__ *,unsigned int,unsigned int,unsigned int,unsigned int,int)
 PeekMessageW
 base::MessagePumpForUI::ProcessNextWindowsMessage()
 base::MessagePumpForUI::DoRunLoop()
 base::MessagePumpWin::Run(base::MessagePump::Delegate *)
 base::sequence_manager::internal::ThreadControllerWithMessagePumpImpl::Run(bool,base::TimeDelta)
 base::RunLoop::Run()
 ChromeBrowserMainParts::MainMessageLoopRun(int *)
 content::BrowserMainLoop::RunMainMessageLoopParts()
 content::BrowserMainRunnerImpl::Run()
 content::BrowserMain(content::MainFunctionParams const &)
 content::ContentMainRunnerImpl::RunServiceManager(content::MainFunctionParams &,bool)
 content::ContentMainRunnerImpl::Run(bool)
 content::RunContentProcess(content::ContentMainParams const &,content::ContentMainRunner *)
 content::ContentMain(content::ContentMainParams const &)
 ChromeMain
 MainDllLoader::Launch(HINSTANCE__ *,base::TimeTicks)
 wWinMain
 __scrt_common_main_seh
 BaseThreadInitThunk
 RtlUserThreadStart

Stefan Smolen

Oct 27, 2020, 6:56:55 PM
to Gabriel Charette, Chromium-dev, Michael Ens, Chris Hamilton, Andrew Ritz, C.J. Hebert, Kevin Babbitt, Rossen Atanassov, Patrick Monette
Chromium does not have any cross-thread input queue attachment on the browser thread, at least by default, so for most users it would not have the problem Michael explained with input queue attachment. A third-party program can install a global low-level input hook or directly attach input queues with any other thread (like the Browser UI thread) to cause "jank", though in that case other programs on the same computer would likely also notice similar issues.

I took a look at the trace but could not find any 15 second hang in PeekMessage, did you share the wrong trace?

From: Gabriel Charette <g...@chromium.org>
Sent: Tuesday, October 27, 2020 8:25 AM
To: Chromium-dev <chromi...@chromium.org>
Cc: Michael Ens <Micha...@microsoft.com>; Chris Hamilton <chr...@chromium.org>; Andrew Ritz <Andre...@microsoft.com>; C.J. Hebert <CJ.H...@microsoft.com>; Kevin Babbitt <kbab...@microsoft.com>; Rossen Atanassov <Rossen.A...@microsoft.com>; Stefan Smolen <ssm...@microsoft.com>; Patrick Monette <pmon...@chromium.org>
Subject: [EXTERNAL] Windows ::PeekMessage jank
 

Gabriel Charette

Oct 27, 2020, 7:47:54 PM
to Stefan Smolen, Gabriel Charette, Chromium-dev, Michael Ens, Chris Hamilton, Andrew Ritz, C.J. Hebert, Kevin Babbitt, Rossen Atanassov, Patrick Monette
Oops yes, wrong trace. Correct trace.

Stefan Smolen

Oct 28, 2020, 6:42:54 PM
to Gabriel Charette, Chromium-dev, Michael Ens, Chris Hamilton, Andrew Ritz, C.J. Hebert, Kevin Babbitt, Rossen Atanassov, Patrick Monette
In the trace we see that PeekMessage is dispatching some sent messages inline every few seconds during the 15 second interval (mostly WM_WINDOWPOSCHANGING, WM_WINDOWPOSCHANGED), so it's possible that it is simply busy receiving and dispatching sent messages slowly. Since it is busy dispatching sent messages inline, it's unlikely that messages of lower priority (posted, input, timer, paint) are involved. What puzzles me about this explanation is that the stack profiler did not capture any lengthy callbacks to the browser UI thread.

This does not sound like anything we've investigated before as a widespread issue, but if there is a reliable repro, we could learn more about which messages get dispatched to which HWND by collecting an ETW trace with Windows kernel providers that I know exist but would take some time to dig up.


Stefan Smolen

May 3, 2021, 5:31:35 PM
to Chromium-dev, Stefan Smolen, micha...@microsoft.com, Chris Hamilton, Andrew Ritz, C.J. Hebert, pmon...@chromium.org, kbab...@microsoft.com, w...@chromium.org

Apologies for the long delay in getting back. To update, I spent some time looking into this, and I'm going to summarize my understanding of the issue and recommend some options.

The main goal of this discussion seems to be how to apply the ProcessExtensionPointDisablePolicy mitigation to the Browser process. This mitigation isn't enough to prevent all third-party code injection, but it prevents many legacy injection techniques:
557798 - Block legacy hooking mechanisms on Win8+ - chromium

Chromium attempted the mitigation in the past, but users reported that some third-party software injecting into the Browser process stopped working properly, most notably old abandonware IMEs that use IMM32 rather than the newer TSF. A few of these are still popular, and I found this bug with more details:
1017694 - after update 78.0.3904.70 cannot input a win7 chinese - chromium

I also found a bug tracking other non-accessibility software, broken by this mitigation, that intentionally injects into the Browser process using accessibility APIs (hooks):
1018714 - Breaks Windows system hooks - chromium 

I do not think of either of these as "accessibility" software, whereas previous replies steered this discussion toward how we solved the accessibility problem in the non-Chromium Edge. One point of clarification: the legacy IMEs that don't work when ProcessExtensionPointDisablePolicy is applied to Chromium also don't work in the non-Chromium Edge, nor in other UWP-based UI like the in-box Windows 10 Settings app.

I didn't find a great explanation anywhere of exactly what ProcessExtensionPointDisablePolicy does, so I did some research and poking around to build a list:

 

As mentioned in the last 2 bullet points, accessibility software is still allowed to hook via some common entry points as long as it has the UIAccess capability, which is required for hooking system UI, so I expect most (if not all) accessibility software in use today already has this capability for that reason. Given that, I don't think the earlier ideas about inventing a way to connect accessibility trees to another process (via HWND parenting or props), where third-party accessibility binaries would get loaded, make much sense as a way to prevent accessibility tools from injecting into the browser process: they would still be loaded in the browser process anyway, since that is where Windows fires the hooks from HWND input.

Somewhat related: newer IME binaries use a different mechanism to load into the browser process, which isn't prevented by ProcessExtensionPointDisablePolicy.

To actually have accessibility and IME software run outside the browser process, we would need to avoid triggering the hooks used to inject that software. In the cases I looked at in depth, that means not having HWND focus / foreground in the Browser process, which is possible if we get rid of all HWND keyboard and mouse input use in the Browser process, for example by delegating those HWNDs to a separate (UI) process. In that case we may be able to get to a point where all of the relevant application events that third-party software needs to listen to are triggered by a new utility process that manages all of the HWNDs and input.
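To make the shape of that delegation concrete, here is a rough sketch in portable C++. Everything in it is an assumption for illustration: InputEvent, EventPipe, UiProcessWindow and BrowserInputSink are invented names, and the in-memory queue merely stands in for a real asynchronous IPC channel. The point is only that the side owning the window posts events without waiting, and the browser side never touches a window at all.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical, simplified input event that the UI process would serialize.
struct InputEvent {
  enum class Type { kKeyDown, kMouseMove, kFocusChanged };
  Type type;
  std::uint32_t payload;  // e.g. a key code or packed coordinates
};

// Stand-in for an asynchronous IPC pipe; a plain FIFO queue in this sketch.
class EventPipe {
 public:
  void Post(const InputEvent& event) { queue_.push_back(event); }

  // Pops the oldest event into |out|. Returns false if the pipe is empty.
  bool Take(InputEvent* out) {
    if (queue_.empty()) return false;
    *out = queue_.front();
    queue_.pop_front();
    return true;
  }

 private:
  std::deque<InputEvent> queue_;
};

// Would live in the UI process: the only place that owns a window and
// therefore the only place where hooks would fire.
class UiProcessWindow {
 public:
  explicit UiProcessWindow(EventPipe* to_browser) : to_browser_(to_browser) {}

  // Called from the window procedure; posts and returns without blocking.
  void OnWindowMessage(const InputEvent& event) { to_browser_->Post(event); }

 private:
  EventPipe* to_browser_;
};

// Would live in the browser process: has no window, only drains the pipe.
class BrowserInputSink {
 public:
  // Handles every queued event in order and returns how many were handled.
  std::size_t Drain(EventPipe* from_ui) {
    std::size_t count = 0;
    InputEvent event;
    while (from_ui->Take(&event)) {
      handled_.push_back(event);
      ++count;
    }
    return count;
  }

  const std::vector<InputEvent>& handled() const { return handled_; }

 private:
  std::vector<InputEvent> handled_;
};
```

The sketch deliberately ignores the hard part, namely interactions that are synchronous and reentrant today; it only shows the one-way, fire-and-forget direction.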

To summarize, I can think of two main approaches that let us move forward to enable this policy in the Browser process without Windows changes:

  • The simple option is to add a heuristic that detects cases where we think there is still user value in third-party code injection (e.g. old IMEs) and selectively disables ProcessExtensionPointDisablePolicy in the Browser process for that minority of users, while everyone else has the policy enabled. A generic detection wouldn't find some other corner-case hooking software, so we may want something else to special-case it: a compat list managed by the browser, or an override setting that an affected third-party product could use to disable the mitigation if it's known to conflict with their software. Alternatively, we could take a hard line and require that such software ship with UIAccess. The idea is that most users would have the policy applied and a minority would have it disabled due to the heuristic, and that this would be relatively simple to implement.
  • The more complicated option is to move all input (e.g. user-facing HWNDs) to a new Chromium (UI?) process, where IMEs and accessibility tools would get loaded as a consequence, and pipe the input and output from that process back to the Browser via IPC (for example with asynchronous mojo messages). The idea is that there is a large API surface between third-party code and the HWND: hooks related to the HWND (and UI thread), window messages related to the HWND, etc. Compared with everything happening in the browser process, the code directly interacting with the HWND is a minority of the code, so it may be isolatable. This approach will likely carry some complexity, as anything that relies on a "UI thread" or HWND would need to become UI-process aware. Several HWND / input interactions handled in UI code are normally synchronous and reentrant with OS callbacks (e.g. DefWindowProc triggering further window messages, or the move-size loop in a drag and drop operation), and that's not necessarily easy to implement the same way across multiple processes, but I think it could be possible with workarounds in some of the corner cases. Input / output for the Renderer process is already asynchronous and decoupled from OS input, so I don't expect that to add complexity. Because this sort of change moves third-party software out of the browser process, it would also move us closer to the point where we can enable other security mitigations like ACG in the browser process that were previously blocked by incompatibilities with third-party code, so I think of it as more of a north star approach.
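As a rough illustration of the first (heuristic) option, the decision could look something like the sketch below. All of the names are hypothetical (ThirdPartyEnvironment, ShouldEnableExtensionPointPolicy, the field names, and the idea of matching module names exactly against a compat list); how a legacy IME or an injected module would actually be detected is left open here.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical snapshot of what the browser learned about the machine
// before deciding whether to apply the mitigation.
struct ThirdPartyEnvironment {
  bool legacy_imm32_ime_active = false;  // result of some IME detection
  bool override_requested = false;       // opt-out set by affected software
  std::vector<std::string> injected_modules;  // module names seen injecting
};

// Returns true if ProcessExtensionPointDisablePolicy should be enabled.
// The policy stays off when an override is set, when a legacy IME is in
// use, or when any injected module appears on the browser's compat list.
// Module names are compared exactly (case-sensitive) in this sketch.
bool ShouldEnableExtensionPointPolicy(
    const ThirdPartyEnvironment& env,
    const std::vector<std::string>& compat_list) {
  if (env.override_requested) return false;
  if (env.legacy_imm32_ime_active) return false;
  for (const std::string& name : env.injected_modules) {
    if (std::find(compat_list.begin(), compat_list.end(), name) !=
        compat_list.end()) {
      return false;
    }
  }
  return true;
}
```

With this shape, the default for a machine with nothing detected is "policy on", and each of the three escape hatches is independent of the others.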

Interested in hearing thoughts from others on these ideas.

-Stefan

Will Harris

May 10, 2021, 4:13:31 PM5/10/21
to Chromium-dev, ssm...@microsoft.com, micha...@microsoft.com, Chris Hamilton, Andrew Ritz, C.J. Hebert, Patrick Monette, kbab...@microsoft.com, Will Harris

Hi Stefan,

Thanks for your analysis here. We agree that moving the HWND out of process might be a very complex task, and it seems that the heuristics approach might be the better way to make some headway in cleaning up the browser process memory space.

We are curious what exactly the heuristics would be, and how sure we can be that we will accurately detect the legitimate IMEs that might need ProcessExtensionPointDisablePolicy to be disabled. Do you have any telemetry on your end that might give us an estimate of the number of clients that would be opted out of the policy as a result?

I think the best next step here would be to upload a CL with your heuristics. We can then deploy telemetry to measure the % of clients, and then make a decision as to how to deploy this safely.

Regards,

Will

Chris Hamilton

May 10, 2021, 4:57:56 PM5/10/21
to Will Harris, Chromium-dev, ssm...@microsoft.com, micha...@microsoft.com, Andrew Ritz, C.J. Hebert, Patrick Monette, kbab...@microsoft.com
Last time we tried to enable this mitigation we mostly received complaints from users of Chinese, Korean and Japanese IME software. Once upon a time we had metrics counting how often legacy IMEs were used, and it was a non-trivial portion of users. We could revive those metrics, but my guess is that any simple heuristic like this will disable the mitigation for something like 25% of users. It's been a couple of years, so potentially people have moved on to newer versions of their preferred IME software, but we won't know until we measure again.

The other problem with this is that we'd be breaking a long tail of legitimate "productivity" apps without offering them an alternative, unless we were also to allow them to opt out of the mitigation. Going down this road forces us to judge individual pieces of software, which is not a great place to be.

For those reasons I still prefer a solution that moves the UI out of process entirely, despite the large complexity.

Cheers,

Chris