RFC: Restructure OpResult Values to Optimize Memory Layout
The current implementation of values is very memory expensive. The existing implementation allocates an OpResult object per result in an operation, and a BlockArgument object per argument to a block. OpResults require a whopping 24-bytes per-result(on x64), with an additional overhead of 4-bytes per operation. This means that to store zero results on an operation we use 4-bytes, 28-bytes for 1 result, 52-bytes for 2 results, and so on and so forth.
sizeof OpResult: 24-bytes
sizeof Operation(related to results): 4-bytes
Proposal:
I propose that we use a more tightly packed representation for Operation results. This new representation will optimize for the case where an operation only has a few number of results(1-6). The cases where the number of results is greater will have additional runtime complexity in a few places, but this seems reasonable given that this is far from the common case. The new representation is essentially a suggestion that we no longer allocate pre-result OpResult objects, but instead pack them directly into Operation:
Operation contains a single use-list.
There isn’t a strong reason why we need a use-list per-result object. I propose that Operation contains a single use-list that is shared between all of the result values. This means that all results iterate the same use-list, but filter on the result number that the use corresponds to.
Results are packed into a single Type instance.
To save space, we can pack the result types into a single Type instance on the Operation. For the single result cases, this type directly corresponds to the result. In the case of more than one result, this type corresponds to a tuple type containing each of the result types.
Provide an anchor for results greater than the optimized size(i.e. currently >6)
This section ties into how we will represent the Value type in this new world. Without going into too much detail, as this will be covered in the next section, for the cases of a large number of results we will need to provide some anchor to point to. This is essentially a stripped version of OpResult, that just serves to identify the parent operation and the result number. More details on this are in the next section.
New Value Representation
Given the above, we will need a new representation for the Value class. As we have split apart the OpResult class, we will need a new value-typed representation that wraps a custom representation. This representation has 3 states:
The value is a block argument.
The value is the result of an operation, and the result number is small(<=6)
The value is the result of an operation, and the result number is large(>6)
With some tricky bit packing, we can encode all of these cases with a single pointer.
Case 1: Block Arguments
This proposal does not include changes to how block arguments are implemented, so these can be represented via a pointer to a structure containing the same elements as exists today.
Case 2: Operation Result (result number <= 6)
For the case of a small result number, we can encode the result number into the low bits of the owning Operation.
Case 3: Operation Result (result number > 6)
This is a tricky case. Unlike in case 2, the result number is too large to encode in the operation pointer. Therefore, we need to provide an anchor pointer that can identify both the operation and the result number. This can be achieved by having a trailing list of unsigned values in the parent operation. These values simply contain the result number. Then, when creating a value, we point to the address of the specific unsigned value that corresponds to the result (after result 6). With a little pointer arithmetic, we can backtrace to the owning Operation from this anchor.
New Memory Cost
With the proposed changes, let’s look at the new memory cost of an operation:
sizeof Operation(related to results):
Let’s compare this cost, with the original:
NOTE:
A prerequisite to this RFC is that we will need to make `Value` value typed. This will make the transition to the new representation seamless, but can be done separately as an NFC refactoring.
RFC: Restructure OpResult Values to Optimize Memory Layout
The current implementation of values is very memory expensive. The existing implementation allocates an OpResult object per result in an operation, and a BlockArgument object per argument
to a block. OpResults require a whopping 24-bytes per-result(on x64), with an additional overhead of 4-bytes per operation. This means that to store zero results on an operation we use 4-bytes, 28-bytes for 1 result, 52-bytes for 2 results, and so on and so forth.
sizeof OpResult: 24-bytes
sizeof Operation(related to results): 4-bytes
Proposal:
I propose that we use a more tightly packed representation for Operation results. This new representation will optimize for the case where an operation only has a few number of results(1-6). The cases where the number of results is greater will have additional runtime complexity in a few places, but this seems reasonable given that this is far from the common case. The new
representation is essentially a suggestion that we no longer allocate pre-result OpResult objects, but instead pack them directly into Operation:
Operation contains a single use-list.
There isn’t a strong reason why we need a use-list per-result object. I propose that Operation contains a single use-list that is shared between all of the result values. This means that all results iterate the same use-list, but filter on the result number that the use corresponds to.
Results are packed into a single Type instance.
To save space, we can pack the result types into a single Type instance on the Operation. For the single result cases, this type directly corresponds to the result. In the case of more than one result, this type corresponds to a tuple type containing each of the result types.
Provide an anchor for results greater than the optimized size(i.e. currently >6)
This section ties into how we will represent the Value type in this new world. Without going into too much detail, as this will be covered in the next section, for the cases of a large number of results we will need to provide some anchor to point to. This is essentially a stripped version of OpResult, that just serves to identify the parent operation and the result number. More details on this are in the next section.
New Value Representation
Given the above, we will need a new representation for the Value class. As we have split apart the OpResult class, we will need a new value-typed representation that wraps a custom representation. This representation has 3 states:
The value is a block argument.
The value is the result of an operation, and the result number is small(<=6)
The value is the result of an operation, and the result number is large(>6)
With some tricky bit packing, we can encode all of these cases with a single pointer.
Case 1: Block Arguments
This proposal does not include changes to how block arguments are implemented, so these can be represented via a pointer to a structure containing the same elements as exists today.
Case 2: Operation Result (result number <= 6)
For the case of a small result number, we can encode the result number into the low bits of the owning Operation.
Case 3: Operation Result (result number > 6)
This is a tricky case. Unlike in case 2, the result number is too large to encode in the operation pointer. Therefore, we need to provide an anchor pointer that can identify both the operation and the result number. This can be achieved by having a trailing list of unsigned values in the parent operation. These values simply contain the result number. Then, when creating a value, we point
to the address of the specific unsigned value that corresponds to the result (after result 6). With a little pointer arithmetic, we can backtrace to the owning Operation from this anchor.
New Memory Cost
With the proposed changes, let’s look at the new memory cost of an operation:
sizeof Operation(related to results):
Let’s compare this cost, with the original:
Sounds impactful, but I find this proposal really hard to read without iterating through it multiple times. Some questions inline.
On Tuesday, October 1, 2019 at 4:59:12 AM UTC+5:30, River Riddle wrote:RFC: Restructure OpResult Values to Optimize Memory Layout
The current implementation of values is very memory expensive. The existing implementation allocates an OpResult object per result in an operation, and a BlockArgument object per argument
That's right, but on another note, all results here are part of a single allocation along with the operands, and thus spatial reuse is exploited. It'll be good to state somewhere whether the proposal preserves this.
to a block. OpResults require a whopping 24-bytes per-result(on x64), with an additional overhead of 4-bytes per operation. This means that to store zero results on an operation we use 4-bytes, 28-bytes for 1 result, 52-bytes for 2 results, and so on and so forth.
sizeof OpResult: 24-bytes
sizeof Operation(related to results): 4-bytes
Proposal:
I propose that we use a more tightly packed representation for Operation results. This new representation will optimize for the case where an operation only has a few number of results(1-6). The cases where the number of results is greater will have additional runtime complexity in a few places, but this seems reasonable given that this is far from the common case. The new
In fact, anything other than 0-result and 1-result ops is pretty uncommon (relatively speaking when compared to the number of 0-result and 1-result ops). Only the call op(?) when you are looking at mid/low level has multiple results?
representation is essentially a suggestion that we no longer allocate pre-result OpResult objects, but instead pack them directly into Operation:
Operation contains a single use-list.
There isn’t a strong reason why we need a use-list per-result object. I propose that Operation contains a single use-list that is shared between all of the result values. This means that all results iterate the same use-list, but filter on the result number that the use corresponds to.
Does this impact the efficiency of iterating over the use lists? For eg, if one of the results had 20 uses, but another one just one. I think this part isn't described later in the proposal.
Results are packed into a single Type instance.
To save space, we can pack the result types into a single Type instance on the Operation. For the single result cases, this type directly corresponds to the result. In the case of more than one result, this type corresponds to a tuple type containing each of the result types.
But MLIR doesn't have tuple types, and deliberately avoided them. Did you mean a tuple type that's meant for internal use (can't be used in the IR that's printed, parsed, or built)?
Provide an anchor for results greater than the optimized size(i.e. currently >6)
What's special about 6? Shouldn't this be related to the number of low bits available? Isn't that number 3 and so you can store up to 7?
This section ties into how we will represent the Value type in this new world. Without going into too much detail, as this will be covered in the next section, for the cases of a large number of results we will need to provide some anchor to point to. This is essentially a stripped version of OpResult, that just serves to identify the parent operation and the result number. More details on this are in the next section.
New Value Representation
Given the above, we will need a new representation for the Value class. As we have split apart the OpResult class, we will need a new value-typed representation that wraps a custom representation. This representation has 3 states:
The value is a block argument.
The value is the result of an operation, and the result number is small(<=6)
The value is the result of an operation, and the result number is large(>6)
With some tricky bit packing, we can encode all of these cases with a single pointer.
Case 1: Block Arguments
This proposal does not include changes to how block arguments are implemented, so these can be represented via a pointer to a structure containing the same elements as exists today.
Case 2: Operation Result (result number <= 6)
This may sound naive but it's not clear whether "result number" means the number of results or the number/position of the result among the list of results.
For the case of a small result number, we can encode the result number into the low bits of the owning Operation.
Yes, but what about the result values themselves? You are only talking about the result numbers here.
Case 3: Operation Result (result number > 6)
This is a tricky case. Unlike in case 2, the result number is too large to encode in the operation pointer. Therefore, we need to provide an anchor pointer that can identify both the operation and the result number. This can be achieved by having a trailing list of unsigned values in the parent operation. These values simply contain the result number. Then, when creating a value, we point
Could you rewrite this para? "result number" is a single thing - it's not clear what's meant by multiple values containing a result number.to the address of the specific unsigned value that corresponds to the result (after result 6). With a little pointer arithmetic, we can backtrace to the owning Operation from this anchor.
New Memory Cost
With the proposed changes, let’s look at the new memory cost of an operation:
sizeof Operation(related to results):
Let’s compare this cost, with the original:
Did you mean resultTypeAndIsPacked?
How does this impact the efficiency of getDefiningOp if at all. Currently, it's just a dyn_cast to an OpResult and accessing its owner field.
And what about IROperandUse? Is that changed in any way? How does getUses work? Can you sketch out the new Operation class?
In general, I find this proposal pretty hard to read - unless iterated upon multiple times. I'll just wait for your response first. Could you consider posting this as an online doc (either GitHub md / an issue / Google doc) and update it in place?~ Uday
NOTE:
A prerequisite to this RFC is that we will need to make `Value` value typed. This will make the transition to the new representation seamless, but can be done separately as an NFC refactoring.
Thoughts?
--
You received this message because you are subscribed to the Google Groups "MLIR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mlir+uns...@tensorflow.org.
To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/mlir/ee06fb74-18ee-499c-aaae-53f2496fc215%40tensorflow.org.
Results are packed into a single Type instance.
To save space, we can pack the result types into a single Type instance on the Operation. For the single result cases, this type directly corresponds to the result. In the case of more than one result, this type corresponds to a tuple type containing each of the result types.
But MLIR doesn't have tuple types, and deliberately avoided them. Did you mean a tuple type that's meant for internal use (can't be used in the IR that's printed, parsed, or built)?MLIR does have a Tuple type, we don't support it for the standard operations though.
Provide an anchor for results greater than the optimized size(i.e. currently >6)
What's special about 6? Shouldn't this be related to the number of low bits available? Isn't that number 3 and so you can store up to 7?6 is the number that we can store, given that we have 2 other cases that we need to be able to identify. The value in the low bits is a combination of the Value `kind`, and in this specific case the result number.This section ties into how we will represent the Value type in this new world. Without going into too much detail, as this will be covered in the next section, for the cases of a large number of results we will need to provide some anchor to point to. This is essentially a stripped version of OpResult, that just serves to identify the parent operation and the result number. More details on this are in the next section.
New Value Representation
Given the above, we will need a new representation for the Value class. As we have split apart the OpResult class, we will need a new value-typed representation that wraps a custom representation. This representation has 3 states:
The value is a block argument.
The value is the result of an operation, and the result number is small(<=6)
The value is the result of an operation, and the result number is large(>6)
With some tricky bit packing, we can encode all of these cases with a single pointer.
Case 1: Block Arguments
This proposal does not include changes to how block arguments are implemented, so these can be represented via a pointer to a structure containing the same elements as exists today.
Case 2: Operation Result (result number <= 6)
This may sound naive but it's not clear whether "result number" means the number of results or the number/position of the result among the list of results.Result number means the number of results.
For the case of a small result number, we can encode the result number into the low bits of the owning Operation.
Yes, but what about the result values themselves? You are only talking about the result numbers here.What do you mean by "values"? The OpResult class as it is now won't exist anymore. This proposal is largely about decomposing it and merging the
duplication with the other OpResults of an operation. The pieces of the `result` is stored on the actual operation. That means that in this case, we just need a pointer to the operation and the result number. From there we can access the specific components necessary.
Case 3: Operation Result (result number > 6)
This is a tricky case. Unlike in case 2, the result number is too large to encode in the operation pointer. Therefore, we need to provide an anchor pointer that can identify both the operation and the result number. This can be achieved by having a trailing list of unsigned values in the parent operation. These values simply contain the result number. Then, when creating a value, we point
Could you rewrite this para? "result number" is a single thing - it's not clear what's meant by multiple values containing a result number.to the address of the specific unsigned value that corresponds to the result (after result 6). With a little pointer arithmetic, we can backtrace to the owning Operation from this anchor.
New Memory Cost
With the proposed changes, let’s look at the new memory cost of an operation:
sizeof Operation(related to results):
Let’s compare this cost, with the original:
Did you mean resultTypeAndIsPacked?No, the type is different depending on the number of types. If there are more than 1, the type correpsonds to a TupleType containing each of the individual result types.
How does this impact the efficiency of getDefiningOp if at all. Currently, it's just a dyn_cast to an OpResult and accessing its owner field.For the case where the number of results is small, getDefiningOp is O(1). For the large case, we need to do an address computation to find the Operation*. This is detailed slightly in the "case 3" section above.And what about IROperandUse? Is that changed in any way? How does getUses work? Can you sketch out the new Operation class?Operands don't change at all, they still hold a Value(though Value will now no longer be a pointer).
getUses, as detailed above, will need to filter on the specific result. This is necessary because we will only have one use list going forward.The `sizeof Operation` is intended to give a sketch of the required components on Operation (for each case) moving forward.
--In general, I find this proposal pretty hard to read - unless iterated upon multiple times. I'll just wait for your response first. Could you consider posting this as an online doc (either GitHub md / an issue / Google doc) and update it in place?~ Uday
NOTE:
A prerequisite to this RFC is that we will need to make `Value` value typed. This will make the transition to the new representation seamless, but can be done separately as an NFC refactoring.
Thoughts?
You received this message because you are subscribed to the Google Groups "MLIR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ml...@tensorflow.org.