Devirtualizing/Inlining ensureMarked improves marking time by 10%-45%.

Kouhei Ueno

unread,

Dec 8, 2014, 3:19:03 AM12/8/14

to oilpan-...@chromium.org

Hi team,

I applied a quick hack to measure speed up by following optimizations:

- Devirtualize/ilnine all Visitor methods called from ensureMarked

- Remove recursion level check.

These will make a single visitor->trace(Member) call compile to following 7 x86 insns from previous >50 insns:

62b: 49 8b 5f 38 mov rbx,QWORD PTR [r15+0x38]

62f: 48 85 db test rbx,rbx

632: 74 45 je 679

634: 8b 43 f0 mov eax,DWORD PTR [rbx-0x10]

637: a8 01 test al,0x1

639: 75 3e jne 679

63b: 83 4b f0 01 or DWORD PTR [rbx-0x10],0x1

patch: https://codereview.chromium.org/782223002

The result is that we will get 10%-45% speed up on marking performance.

perf: https://nyaxtstep.com/static/tmp/devirtualize.html

I think the measurement is positive enough that the optimizations are worth doing.

I'll start landing the approach incrementally in trunk.

--

Kouhei Ueno

Kentaro Hara

unread,

Dec 8, 2014, 3:25:30 AM12/8/14

to Kouhei Ueno, oilpan-...@chromium.org

This is a great improvement!

- Remove recursion level check.

How can we remove the recursion level check?

--

Kentaro Hara, Tokyo, Japan

Kouhei Ueno

unread,

Dec 8, 2014, 3:31:39 AM12/8/14

to Kentaro Hara, oilpan-...@chromium.org

On Mon, Dec 8, 2014 at 12:25 PM, Kentaro Hara <har...@chromium.org> wrote:

This is a great improvement!

- Remove recursion level check.

How can we remove the recursion level check?

For the measruement, I simply removed it.

To do this properly, I'd like to experiment placing a guard page.

Checking this explicitly on every ensureMarked() call is a significant cost that we would like to avoid.

--

Kouhei Ueno

Kentaro Hara

unread,

Dec 8, 2014, 3:38:31 AM12/8/14

to Kouhei Ueno, oilpan-...@chromium.org

To do this properly, I'd like to experiment placing a guard page.
Checking this explicitly on every ensureMarked() call is a significant cost that we would like to avoid.

Please refer to the previous discussion thread about eager tracing. We concluded that we cannot go with a guard page. We need to have a logic to fall back to a manual stack when stack overflows.

What you could do to reduce the number of CPU instructions is something like:

void Heap::collectGarbage() {

intptr_t dummy;

s_stackEnd = &dummy - 10 * 1024; // Allow 10 KB stack for eager tracing

...;

}

void mark(void* object) {

intptr_t dummy;

if (LIKELY(s_stackEnd <= &dummy)) {

object->mark();

return;

}

pushToMarkingStack(object);

Kouhei Ueno

unread,

Dec 8, 2014, 4:05:56 AM12/8/14

to Kentaro Hara, oilpan-...@chromium.org

In short term (1-2wk), I'm going to focus on inlining/devirtualizing and I'd like to revisit the stack limit optimization later.

On Mon, Dec 8, 2014 at 12:38 PM, Kentaro Hara <har...@chromium.org> wrote:

To do this properly, I'd like to experiment placing a guard page.
Checking this explicitly on every ensureMarked() call is a significant cost that we would like to avoid.

Please refer to the previous discussion thread about eager tracing. We concluded that we cannot go with a guard page. We need to have a logic to fall back to a manual stack when stack overflows.

IIUC, the conclusion was that we shouldn't use the system stack up to its limit and crash.

I think there are still some room to explore, e.g. set up a reasonable guard page within the system limit and use it to switch to manual stack.

What you could do to reduce the number of CPU instructions is something like:

void Heap::collectGarbage() {
intptr_t dummy;
s_stackEnd = &dummy - 10 * 1024; // Allow 10 KB stack for eager tracing
...;
}

void mark(void* object) {
intptr_t dummy;
if (LIKELY(s_stackEnd <= &dummy)) {
object->mark();
return;
}
pushToMarkingStack(object);
}

I think this is just as costly as the current approach.

--

Kouhei Ueno

Erik Corry

unread,

Dec 9, 2014, 10:11:46 AM12/9/14

to Kouhei Ueno, oilpan-...@chromium.org

This is really great!

For the recursion limit I am still a fan of type based static decisions.

For the depth test (either all the time or just for types on the blacklist) this approach may be worth trying.

static uint64_t stack_limit;

setup() {

int dummy;

uint64_t sp = reinterpret_cast<uint64_t>(&dummy);

stack_limit = sp < 10000 ? 256 : sp - 10000;

}

mark() {

int dummy;

uint64_t sp = reinterpret_cast<uint64_t>(&dummy);

if (sp < stack_limit) // Push, don't recurse.

}

To unsubscribe from this group and stop receiving emails from it, send an email to oilpan-review...@chromium.org.

Reply all

Reply to author

Forward