Devirtualizing/Inlining ensureMarked improves marking time by 10%-45%.

3 views
Skip to first unread message

Kouhei Ueno

unread,
Dec 7, 2014, 10:19:03 PM12/7/14
to oilpan-...@chromium.org
Hi team,

I applied a quick hack to measure speed up by following optimizations:
- Devirtualize/ilnine all Visitor methods called from ensureMarked
- Remove recursion level check.
These will make a single visitor->trace(Member) call compile to following 7 x86 insns from previous >50 insns:
 62b:   49 8b 5f 38             mov    rbx,QWORD PTR [r15+0x38]
 62f:   48 85 db                test   rbx,rbx
 632:   74 45                   je     679
 634:   8b 43 f0                mov    eax,DWORD PTR [rbx-0x10]
 637:   a8 01                   test   al,0x1
 639:   75 3e                   jne    679
 63b:   83 4b f0 01             or     DWORD PTR [rbx-0x10],0x1


The result is that we will get 10%-45% speed up on marking performance.

I think the measurement is positive enough that the optimizations are worth doing.
I'll start landing the approach incrementally in trunk.

--
Kouhei Ueno

Kentaro Hara

unread,
Dec 7, 2014, 10:25:30 PM12/7/14
to Kouhei Ueno, oilpan-...@chromium.org
This is a great improvement!

- Remove recursion level check.

How can we remove the recursion level check?


--
Kentaro Hara, Tokyo, Japan

Kouhei Ueno

unread,
Dec 7, 2014, 10:31:39 PM12/7/14
to Kentaro Hara, oilpan-...@chromium.org
On Mon, Dec 8, 2014 at 12:25 PM, Kentaro Hara <har...@chromium.org> wrote:
This is a great improvement!

- Remove recursion level check.

How can we remove the recursion level check?
For the measruement, I simply removed it.

To do this properly, I'd like to experiment placing a guard page.
Checking this explicitly on every ensureMarked() call is a significant cost that we would like to avoid.



--
Kouhei Ueno

Kentaro Hara

unread,
Dec 7, 2014, 10:38:31 PM12/7/14
to Kouhei Ueno, oilpan-...@chromium.org
To do this properly, I'd like to experiment placing a guard page.
Checking this explicitly on every ensureMarked() call is a significant cost that we would like to avoid.

Please refer to the previous discussion thread about eager tracing. We concluded that we cannot go with a guard page. We need to have a logic to fall back to a manual stack when stack overflows.

What you could do to reduce the number of CPU instructions is something like:

void Heap::collectGarbage() {
  intptr_t dummy;
  s_stackEnd = &dummy - 10 * 1024;  // Allow 10 KB stack for eager tracing
  ...;
}

void mark(void* object) {
  intptr_t dummy;
  if (LIKELY(s_stackEnd <= &dummy)) {
    object->mark();
    return;
  }
  pushToMarkingStack(object);

Kouhei Ueno

unread,
Dec 7, 2014, 11:05:56 PM12/7/14
to Kentaro Hara, oilpan-...@chromium.org
In short term (1-2wk), I'm going to focus on inlining/devirtualizing and I'd like to revisit the stack limit optimization later.

On Mon, Dec 8, 2014 at 12:38 PM, Kentaro Hara <har...@chromium.org> wrote:
To do this properly, I'd like to experiment placing a guard page.
Checking this explicitly on every ensureMarked() call is a significant cost that we would like to avoid.

Please refer to the previous discussion thread about eager tracing. We concluded that we cannot go with a guard page. We need to have a logic to fall back to a manual stack when stack overflows.
IIUC, the conclusion was that we shouldn't use the system stack up to its limit and crash.
I think there are still some room to explore, e.g. set up a reasonable guard page within the system limit and use it to switch to manual stack.
 
What you could do to reduce the number of CPU instructions is something like:

void Heap::collectGarbage() {
  intptr_t dummy;
  s_stackEnd = &dummy - 10 * 1024;  // Allow 10 KB stack for eager tracing
  ...;
}

void mark(void* object) {
  intptr_t dummy;
  if (LIKELY(s_stackEnd <= &dummy)) {
    object->mark();
    return;
  }
  pushToMarkingStack(object);
}
I think this is just as costly as the current approach.



--
Kouhei Ueno

Erik Corry

unread,
Dec 9, 2014, 5:11:46 AM12/9/14
to Kouhei Ueno, oilpan-...@chromium.org
This is really great!

For the recursion limit I am still a fan of type based static decisions.

For the depth test (either all the time or just for types on the blacklist) this approach may be worth trying.

static uint64_t stack_limit;

setup() {
  int dummy;
  uint64_t sp = reinterpret_cast<uint64_t>(&dummy);
  stack_limit = sp < 10000 ? 256 : sp - 10000;
}

mark() {
  int dummy;
  uint64_t sp = reinterpret_cast<uint64_t>(&dummy);
  if (sp < stack_limit) // Push, don't recurse.
}

To unsubscribe from this group and stop receiving emails from it, send an email to oilpan-review...@chromium.org.

Reply all
Reply to author
Forward
0 new messages