RFC: Restructure OpResult Values to Optimize Memory Layout

47 views

Skip to first unread message

River Riddle

unread,

Sep 30, 2019, 7:29:12 PM9/30/19

to MLIR

RFC: Restructure OpResult Values to Optimize Memory Layout

The current implementation of values is very memory expensive. The existing implementation allocates an OpResult object per result in an operation, and a BlockArgument object per argument to a block. OpResults require a whopping 24-bytes per-result(on x64), with an additional overhead of 4-bytes per operation. This means that to store zero results on an operation we use 4-bytes, 28-bytes for 1 result, 52-bytes for 2 results, and so on and so forth.

sizeof OpResult: 24-bytes

// IRObjectWithUseList: 8-bytes

IROperand *firstUse;

// Value: 8-bytes

PointerIntPair<Type, 1, Kind> typeAndKind;

// OpResult: 8-bytes

Operation *const owner;

sizeof Operation(related to results): 4-bytes

// Operation: 4-bytes

unsigned numResults;

Proposal:

I propose that we use a more tightly packed representation for Operation results. This new representation will optimize for the case where an operation only has a few number of results(1-6). The cases where the number of results is greater will have additional runtime complexity in a few places, but this seems reasonable given that this is far from the common case. The new representation is essentially a suggestion that we no longer allocate pre-result OpResult objects, but instead pack them directly into Operation:

Operation contains a single use-list.

There isn’t a strong reason why we need a use-list per-result object. I propose that Operation contains a single use-list that is shared between all of the result values. This means that all results iterate the same use-list, but filter on the result number that the use corresponds to.

Results are packed into a single Type instance.

To save space, we can pack the result types into a single Type instance on the Operation. For the single result cases, this type directly corresponds to the result. In the case of more than one result, this type corresponds to a tuple type containing each of the result types.

Provide an anchor for results greater than the optimized size(i.e. currently >6)

This section ties into how we will represent the Value type in this new world. Without going into too much detail, as this will be covered in the next section, for the cases of a large number of results we will need to provide some anchor to point to. This is essentially a stripped version of OpResult, that just serves to identify the parent operation and the result number. More details on this are in the next section.

New Value Representation

Given the above, we will need a new representation for the Value class. As we have split apart the OpResult class, we will need a new value-typed representation that wraps a custom representation. This representation has 3 states:

The value is a block argument.
The value is the result of an operation, and the result number is small(<=6)
The value is the result of an operation, and the result number is large(>6)

With some tricky bit packing, we can encode all of these cases with a single pointer.

Case 1: Block Arguments

This proposal does not include changes to how block arguments are implemented, so these can be represented via a pointer to a structure containing the same elements as exists today.

Case 2: Operation Result (result number <= 6)

For the case of a small result number, we can encode the result number into the low bits of the owning Operation.

Case 3: Operation Result (result number > 6)

This is a tricky case. Unlike in case 2, the result number is too large to encode in the operation pointer. Therefore, we need to provide an anchor pointer that can identify both the operation and the result number. This can be achieved by having a trailing list of unsigned values in the parent operation. These values simply contain the result number. Then, when creating a value, we point to the address of the specific unsigned value that corresponds to the result (after result 6). With a little pointer arithmetic, we can backtrace to the owning Operation from this anchor.

New Memory Cost

With the proposed changes, let’s look at the new memory cost of an operation:

sizeof Operation(related to results):

/// Storage when ResultCount >= 0

// 1-bit, zero cost after bit-packing

bool hasResults : 1;

/// Storage when ResultCount > 0

// 8-bytes

PointerIntPair<Type, 1, bool> resultTypesAndIsPacked;

// 8-bytes

IROperand *firstUse;

/// Storage when ResultCount > 6

// 4-bytes per result after 6.

unsigned resultNos[ResultCount - 6];

Let’s compare this cost, with the original:

Memory Cost Per Result (in bytes)	Old	Proposed
0-Result	4	0
1-Result	28	16
N-Result(<=6)	4 + 24*N	16
N-Result(>6)	4 + 24*N	16 + 4*(N-6)

NOTE:

A prerequisite to this RFC is that we will need to make `Value` value typed. This will make the transition to the new representation seamless, but can be done separately as an NFC refactoring.

Thoughts?

Uday Bondhugula

unread,

Oct 1, 2019, 7:21:27 AM10/1/19

to MLIR

Sounds impactful, but I find this proposal really hard to read without iterating through it multiple times. Some questions inline.

On Tuesday, October 1, 2019 at 4:59:12 AM UTC+5:30, River Riddle wrote:

RFC: Restructure OpResult Values to Optimize Memory Layout
The current implementation of values is very memory expensive. The existing implementation allocates an OpResult object per result in an operation, and a BlockArgument object per argument

That's right, but on another note, all results here are part of a single allocation along with the operands, and thus spatial reuse is exploited. It'll be good to state somewhere whether the proposal preserves this.

to a block. OpResults require a whopping 24-bytes per-result(on x64), with an additional overhead of 4-bytes per operation. This means that to store zero results on an operation we use 4-bytes, 28-bytes for 1 result, 52-bytes for 2 results, and so on and so forth.

sizeof OpResult: 24-bytes
// IRObjectWithUseList: 8-bytes
IROperand *firstUse;

// Value: 8-bytes
PointerIntPair<Type, 1, Kind> typeAndKind;

// OpResult: 8-bytes
Operation *const owner;

sizeof Operation(related to results): 4-bytes
// Operation: 4-bytes
unsigned numResults;

Proposal:
I propose that we use a more tightly packed representation for Operation results. This new representation will optimize for the case where an operation only has a few number of results(1-6). The cases where the number of results is greater will have additional runtime complexity in a few places, but this seems reasonable given that this is far from the common case. The new

In fact, anything other than 0-result and 1-result ops is pretty uncommon (relatively speaking when compared to the number of 0-result and 1-result ops). Only the call op(?) when you are looking at mid/low level has multiple results?

representation is essentially a suggestion that we no longer allocate pre-result OpResult objects, but instead pack them directly into Operation:

Operation contains a single use-list.
There isn’t a strong reason why we need a use-list per-result object. I propose that Operation contains a single use-list that is shared between all of the result values. This means that all results iterate the same use-list, but filter on the result number that the use corresponds to.

Does this impact the efficiency of iterating over the use lists? For eg, if one of the results had 20 uses, but another one just one. I think this part isn't described later in the proposal.

Results are packed into a single Type instance.
To save space, we can pack the result types into a single Type instance on the Operation. For the single result cases, this type directly corresponds to the result. In the case of more than one result, this type corresponds to a tuple type containing each of the result types.

But MLIR doesn't have tuple types, and deliberately avoided them. Did you mean a tuple type that's meant for internal use (can't be used in the IR that's printed, parsed, or built)?

Provide an anchor for results greater than the optimized size(i.e. currently >6)

What's special about 6? Shouldn't this be related to the number of low bits available? Isn't that number 3 and so you can store up to 7?

This section ties into how we will represent the Value type in this new world. Without going into too much detail, as this will be covered in the next section, for the cases of a large number of results we will need to provide some anchor to point to. This is essentially a stripped version of OpResult, that just serves to identify the parent operation and the result number. More details on this are in the next section.

New Value Representation
Given the above, we will need a new representation for the Value class. As we have split apart the OpResult class, we will need a new value-typed representation that wraps a custom representation. This representation has 3 states:
The value is a block argument.
The value is the result of an operation, and the result number is small(<=6)
The value is the result of an operation, and the result number is large(>6)

With some tricky bit packing, we can encode all of these cases with a single pointer.

Case 1: Block Arguments
This proposal does not include changes to how block arguments are implemented, so these can be represented via a pointer to a structure containing the same elements as exists today.
Case 2: Operation Result (result number <= 6)

This may sound naive but it's not clear whether "result number" means the number of results or the number/position of the result among the list of results.

For the case of a small result number, we can encode the result number into the low bits of the owning Operation.

Yes, but what about the result values themselves? You are only talking about the result numbers here.

Case 3: Operation Result (result number > 6)
This is a tricky case. Unlike in case 2, the result number is too large to encode in the operation pointer. Therefore, we need to provide an anchor pointer that can identify both the operation and the result number. This can be achieved by having a trailing list of unsigned values in the parent operation. These values simply contain the result number. Then, when creating a value, we point

Could you rewrite this para? "result number" is a single thing - it's not clear what's meant by multiple values containing a result number.

to the address of the specific unsigned value that corresponds to the result (after result 6). With a little pointer arithmetic, we can backtrace to the owning Operation from this anchor.

New Memory Cost
With the proposed changes, let’s look at the new memory cost of an operation:

sizeof Operation(related to results):
/// Storage when ResultCount >= 0

// 1-bit, zero cost after bit-packing
bool hasResults : 1;

/// Storage when ResultCount > 0

// 8-bytes
PointerIntPair<Type, 1, bool> resultTypesAndIsPacked;

// 8-bytes
IROperand *firstUse;

/// Storage when ResultCount > 6

// 4-bytes per result after 6.
unsigned resultNos[ResultCount - 6];

Let’s compare this cost, with the original:

Did you mean resultTypeAndIsPacked?

How does this impact the efficiency of getDefiningOp if at all. Currently, it's just a dyn_cast to an OpResult and accessing its owner field.

And what about IROperandUse? Is that changed in any way? How does getUses work? Can you sketch out the new Operation class?

In general, I find this proposal pretty hard to read - unless iterated upon multiple times. I'll just wait for your response first. Could you consider posting this as an online doc (either GitHub md / an issue / Google doc) and update it in place?

~ Uday

River Riddle

unread,

Oct 1, 2019, 12:21:39 PM10/1/19

to Uday Bondhugula, MLIR

Hi Uday,

Thanks for the response. Some comments inline.

On Tue, Oct 1, 2019 at 4:21 AM 'Uday Bondhugula' via MLIR <ml...@tensorflow.org> wrote:

Sounds impactful, but I find this proposal really hard to read without iterating through it multiple times. Some questions inline.

Yes, sorry.

On Tuesday, October 1, 2019 at 4:59:12 AM UTC+5:30, River Riddle wrote:
RFC: Restructure OpResult Values to Optimize Memory Layout
The current implementation of values is very memory expensive. The existing implementation allocates an OpResult object per result in an operation, and a BlockArgument object per argument

That's right, but on another note, all results here are part of a single allocation along with the operands, and thus spatial reuse is exploited. It'll be good to state somewhere whether the proposal preserves this.

Spatial reuse is exploited given that the data for results will still be on the operation object, but it's important to note that the current concept of an OpResult is radically changing. We may still keep the OpResult class as an interface into a certain kind of Value, but we won't be storing them on Operations.

to a block. OpResults require a whopping 24-bytes per-result(on x64), with an additional overhead of 4-bytes per operation. This means that to store zero results on an operation we use 4-bytes, 28-bytes for 1 result, 52-bytes for 2 results, and so on and so forth.

sizeof OpResult: 24-bytes
// IRObjectWithUseList: 8-bytes
IROperand *firstUse;

// Value: 8-bytes
PointerIntPair<Type, 1, Kind> typeAndKind;

// OpResult: 8-bytes
Operation *const owner;

sizeof Operation(related to results): 4-bytes
// Operation: 4-bytes
unsigned numResults;

Proposal:
I propose that we use a more tightly packed representation for Operation results. This new representation will optimize for the case where an operation only has a few number of results(1-6). The cases where the number of results is greater will have additional runtime complexity in a few places, but this seems reasonable given that this is far from the common case. The new

In fact, anything other than 0-result and 1-result ops is pretty uncommon (relatively speaking when compared to the number of 0-result and 1-result ops). Only the call op(?) when you are looking at mid/low level has multiple results?

For the general case, yes 0-1 results are the most common cases throughout the entire ecosystem. This case is largely what we are trying to optimize towards. Being able to efficiently represent up to 6, is due to the number of low bits. That being said, there are some dialects that may produce additional control arguments, e.g. the TensorFlow executor dialect. These operations can commonly produce ~2 outputs.

representation is essentially a suggestion that we no longer allocate pre-result OpResult objects, but instead pack them directly into Operation:

Operation contains a single use-list.
There isn’t a strong reason why we need a use-list per-result object. I propose that Operation contains a single use-list that is shared between all of the result values. This means that all results iterate the same use-list, but filter on the result number that the use corresponds to.

Does this impact the efficiency of iterating over the use lists? For eg, if one of the results had 20 uses, but another one just one. I think this part isn't described later in the proposal.

Yes, there is some additional compute overhead to filter the use list for the specific result that is desired. This isn't a problem for the common case(1 result), but for above we are making a trade off between additional filtering and memory overhead.

Results are packed into a single Type instance.
To save space, we can pack the result types into a single Type instance on the Operation. For the single result cases, this type directly corresponds to the result. In the case of more than one result, this type corresponds to a tuple type containing each of the result types.

But MLIR doesn't have tuple types, and deliberately avoided them. Did you mean a tuple type that's meant for internal use (can't be used in the IR that's printed, parsed, or built)?

MLIR does have a Tuple type, we don't support it for the standard operations though.

https://github.com/tensorflow/mlir/blob/master/g3doc/LangRef.md#tuple-type

Provide an anchor for results greater than the optimized size(i.e. currently >6)

What's special about 6? Shouldn't this be related to the number of low bits available? Isn't that number 3 and so you can store up to 7?

6 is the number that we can store, given that we have 2 other cases that we need to be able to identify. The value in the low bits is a combination of the Value `kind`, and in this specific case the result number.

This section ties into how we will represent the Value type in this new world. Without going into too much detail, as this will be covered in the next section, for the cases of a large number of results we will need to provide some anchor to point to. This is essentially a stripped version of OpResult, that just serves to identify the parent operation and the result number. More details on this are in the next section.

New Value Representation
Given the above, we will need a new representation for the Value class. As we have split apart the OpResult class, we will need a new value-typed representation that wraps a custom representation. This representation has 3 states:
The value is a block argument.
The value is the result of an operation, and the result number is small(<=6)
The value is the result of an operation, and the result number is large(>6)

With some tricky bit packing, we can encode all of these cases with a single pointer.

Case 1: Block Arguments
This proposal does not include changes to how block arguments are implemented, so these can be represented via a pointer to a structure containing the same elements as exists today.
Case 2: Operation Result (result number <= 6)

This may sound naive but it's not clear whether "result number" means the number of results or the number/position of the result among the list of results.

Result number means the number of results.

For the case of a small result number, we can encode the result number into the low bits of the owning Operation.

Yes, but what about the result values themselves? You are only talking about the result numbers here.

What do you mean by "values"? The OpResult class as it is now won't exist anymore. This proposal is largely about decomposing it and merging the duplication with the other OpResults of an operation. The pieces of the `result` is stored on the actual operation. That means that in this case, we just need a pointer to the operation and the result number. From there we can access the specific components necessary.

Case 3: Operation Result (result number > 6)
This is a tricky case. Unlike in case 2, the result number is too large to encode in the operation pointer. Therefore, we need to provide an anchor pointer that can identify both the operation and the result number. This can be achieved by having a trailing list of unsigned values in the parent operation. These values simply contain the result number. Then, when creating a value, we point

Could you rewrite this para? "result number" is a single thing - it's not clear what's meant by multiple values containing a result number.

to the address of the specific unsigned value that corresponds to the result (after result 6). With a little pointer arithmetic, we can backtrace to the owning Operation from this anchor.

New Memory Cost
With the proposed changes, let’s look at the new memory cost of an operation:

sizeof Operation(related to results):
/// Storage when ResultCount >= 0

// 1-bit, zero cost after bit-packing
bool hasResults : 1;

/// Storage when ResultCount > 0

// 8-bytes
PointerIntPair<Type, 1, bool> resultTypesAndIsPacked;

// 8-bytes
IROperand *firstUse;

/// Storage when ResultCount > 6

// 4-bytes per result after 6.
unsigned resultNos[ResultCount - 6];

Let’s compare this cost, with the original:

Did you mean resultTypeAndIsPacked?

No, the type is different depending on the number of types. If there are more than 1, the type correpsonds to a TupleType containing each of the individual result types.

How does this impact the efficiency of getDefiningOp if at all. Currently, it's just a dyn_cast to an OpResult and accessing its owner field.

For the case where the number of results is small, getDefiningOp is O(1). For the large case, we need to do an address computation to find the Operation*. This is detailed slightly in the "case 3" section above.

And what about IROperandUse? Is that changed in any way? How does getUses work? Can you sketch out the new Operation class?

Operands don't change at all, they still hold a Value(though Value will now no longer be a pointer).

getUses, as detailed above, will need to filter on the specific result. This is necessary because we will only have one use list going forward.

The `sizeof Operation` is intended to give a sketch of the required components on Operation (for each case) moving forward.

In general, I find this proposal pretty hard to read - unless iterated upon multiple times. I'll just wait for your response first. Could you consider posting this as an online doc (either GitHub md / an issue / Google doc) and update it in place?

~ Uday

Memory Cost Per Result (in bytes)
Old
Proposed
0-Result
4
0
1-Result
28
16
N-Result(<=6)
4 + 24*N
16
N-Result(>6)
4 + 24*N
16 + 4*(N-6)

NOTE:
A prerequisite to this RFC is that we will need to make `Value` value typed. This will make the transition to the new representation seamless, but can be done separately as an NFC refactoring.

Thoughts?

--
You received this message because you are subscribed to the Google Groups "MLIR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mlir+uns...@tensorflow.org.
To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/mlir/ee06fb74-18ee-499c-aaae-53f2496fc215%40tensorflow.org.

Uday Bondhugula

unread,

Oct 2, 2019, 6:54:53 PM10/2/19

to MLIR

Okay - sounds good.

Results are packed into a single Type instance.
To save space, we can pack the result types into a single Type instance on the Operation. For the single result cases, this type directly corresponds to the result. In the case of more than one result, this type corresponds to a tuple type containing each of the result types.

But MLIR doesn't have tuple types, and deliberately avoided them. Did you mean a tuple type that's meant for internal use (can't be used in the IR that's printed, parsed, or built)?
MLIR does have a Tuple type, we don't support it for the standard operations though.
https://github.com/tensorflow/mlir/blob/master/g3doc/LangRef.md#tuple-type

I see - so, it's really a no tuple type on standard ops.

Provide an anchor for results greater than the optimized size(i.e. currently >6)

What's special about 6? Shouldn't this be related to the number of low bits available? Isn't that number 3 and so you can store up to 7?
6 is the number that we can store, given that we have 2 other cases that we need to be able to identify. The value in the low bits is a combination of the Value `kind`, and in this specific case the result number.

This section ties into how we will represent the Value type in this new world. Without going into too much detail, as this will be covered in the next section, for the cases of a large number of results we will need to provide some anchor to point to. This is essentially a stripped version of OpResult, that just serves to identify the parent operation and the result number. More details on this are in the next section.

New Value Representation
Given the above, we will need a new representation for the Value class. As we have split apart the OpResult class, we will need a new value-typed representation that wraps a custom representation. This representation has 3 states:
The value is a block argument.
The value is the result of an operation, and the result number is small(<=6)
The value is the result of an operation, and the result number is large(>6)

With some tricky bit packing, we can encode all of these cases with a single pointer.

Case 1: Block Arguments
This proposal does not include changes to how block arguments are implemented, so these can be represented via a pointer to a structure containing the same elements as exists today.
Case 2: Operation Result (result number <= 6)

This may sound naive but it's not clear whether "result number" means the number of results or the number/position of the result among the list of results.
Result number means the number of results.

"result count" or "NumResults" will help avoid confusion here.

For the case of a small result number, we can encode the result number into the low bits of the owning Operation.

Yes, but what about the result values themselves? You are only talking about the result numbers here.
What do you mean by "values"? The OpResult class as it is now won't exist anymore. This proposal is largely about decomposing it and merging the

Isn't an op's result still a 'Value'? It is those values I was referring to. What would op->getResult(i) return? Are these just going to exist but not stored in IR (it looks like that's what you mean from the line right below)? But this isn't stated in the proposal.

duplication with the other OpResults of an operation. The pieces of the `result` is stored on the actual operation. That means that in this case, we just need a pointer to the operation and the result number. From there we can access the specific components necessary.

Case 3: Operation Result (result number > 6)
This is a tricky case. Unlike in case 2, the result number is too large to encode in the operation pointer. Therefore, we need to provide an anchor pointer that can identify both the operation and the result number. This can be achieved by having a trailing list of unsigned values in the parent operation. These values simply contain the result number. Then, when creating a value, we point

Could you rewrite this para? "result number" is a single thing - it's not clear what's meant by multiple values containing a result number.

to the address of the specific unsigned value that corresponds to the result (after result 6). With a little pointer arithmetic, we can backtrace to the owning Operation from this anchor.

New Memory Cost
With the proposed changes, let’s look at the new memory cost of an operation:

sizeof Operation(related to results):
/// Storage when ResultCount >= 0

// 1-bit, zero cost after bit-packing
bool hasResults : 1;

/// Storage when ResultCount > 0

// 8-bytes
PointerIntPair<Type, 1, bool> resultTypesAndIsPacked;

// 8-bytes
IROperand *firstUse;

/// Storage when ResultCount > 6

// 4-bytes per result after 6.
unsigned resultNos[ResultCount - 6];

Let’s compare this cost, with the original:

Did you mean resultTypeAndIsPacked?
No, the type is different depending on the number of types. If there are more than 1, the type correpsonds to a TupleType containing each of the individual result types.

Sure, makes sense. But it's a single type in all cases, right? (either a single type or a single tuple type?)

How does this impact the efficiency of getDefiningOp if at all. Currently, it's just a dyn_cast to an OpResult and accessing its owner field.
For the case where the number of results is small, getDefiningOp is O(1). For the large case, we need to do an address computation to find the Operation*. This is detailed slightly in the "case 3" section above.

And what about IROperandUse? Is that changed in any way? How does getUses work? Can you sketch out the new Operation class?
Operands don't change at all, they still hold a Value(though Value will now no longer be a pointer).

Does that change the size of IROperandUse? Will Value just wrap around some pointer - what is its structure?

It's still not clear to me what the resultNos[resultCount - 6] is? Is "resultNo" a short for 'result number'? Then, isn't resultNos[ i ] == resultNos.size() + 6, and so why do you need to store anything in resultNos?

~ Uday

getUses, as detailed above, will need to filter on the specific result. This is necessary because we will only have one use list going forward.

The `sizeof Operation` is intended to give a sketch of the required components on Operation (for each case) moving forward.

In general, I find this proposal pretty hard to read - unless iterated upon multiple times. I'll just wait for your response first. Could you consider posting this as an online doc (either GitHub md / an issue / Google doc) and update it in place?

~ Uday

Memory Cost Per Result (in bytes)
Old
Proposed
0-Result
4
0
1-Result
28
16
N-Result(<=6)
4 + 24*N
16
N-Result(>6)
4 + 24*N
16 + 4*(N-6)

NOTE:
A prerequisite to this RFC is that we will need to make `Value` value typed. This will make the transition to the new representation seamless, but can be done separately as an NFC refactoring.

Thoughts?

--
You received this message because you are subscribed to the Google Groups "MLIR" group.

To unsubscribe from this group and stop receiving emails from it, send an email to ml...@tensorflow.org.

Reply all

Reply to author

Forward

0 new messages