Issue 150325685: distance weighted blend possible improvements

buganize...@google.com

unread,

Feb 26, 2020, 6:16:18 PM2/26/20

to gav1-deve...@googlegroups.com

Replying to this email means your email address will be shared with the team that works on this product.

https://issuetracker.google.com/issues/150325685

Reference Info: 150325685 distance weighted blend possible improvements

component: 750480
status: New
reporter: jo...@gmail.com
cc: ga...@googlegroups.com, jo...@gmail.com
type: Feature Request
priority: P4
severity: S2
retention: Component default

jo...@gmail.com added comment #1:
In 8bpp calculations the input range for the DWB is [-5132, 9212] (see calculations in convolve.cc/warp.cc)

For the DWB that value is multiplied by [0, 16] and shifted by 4 (divide by 16) plus kInterPostRoundBit (difference between compound and non-compound final shifts, also 4).

ARM has vqdmulhq_s16(). It may be possible to use this instruction to keep the intermediate values in 16 bits instead of stepping up to 32. This would be done by adding 12 (+1 for halving) to |weight|. The result would be x * (weight + 13) >> 17 or x * weight >> 4. The results could be half-added to average them.

X86 has _mm_mulhi_epi16(). It may provide similar opportunities.

Generated by Google IssueTracker notification system

You're receiving this email because you are subscribed to updates on Google IssueTracker issue 150325685.
Unsubscribe from this issue.

buganize...@google.com

unread,

Feb 26, 2020, 6:28:55 PM2/26/20

to gav1-deve...@googlegroups.com

Replying to this email means your email address will be shared with the team that works on this product.

https://issuetracker.google.com/issues/150325685

Changed

jo...@google.com added comment #2:
Sorry, weight + (1 << 13), not weight + 13

_______________________________

Reference Info: 150325685 distance weighted blend possible improvements

component: 750480
status: New
reporter: jo...@gmail.com

cc: ga...@googlegroups.com, jo...@gmail.com, jo...@google.com

type: Feature Request
priority: P4
severity: S2
retention: Component default

buganize...@google.com

unread,

Apr 4, 2022, 10:36:32 PM4/4/22

to gav1-deve...@googlegroups.com

Replying to this email means your email address will be shared with the team that works on this product.

https://issuetracker.google.com/issues/150325685

Changed
status: New → Assigned
assignee: <none> → pe...@google.com

_______________________________

Reference Info: 150325685 distance weighted blend possible improvements

component: 750480

status: Assigned
reporter: jo...@google.com
assignee: pe...@google.com
cc: ga...@googlegroups.com, jo...@google.com

type: Feature Request
priority: P4
severity: S2
retention: Component default

buganize...@google.com

unread,

Apr 5, 2022, 12:03:32 AM4/5/22

to gav1-deve...@googlegroups.com

Replying to this email means your email address will be shared with the team that works on this product.

https://issuetracker.google.com/issues/150325685

Changed

pe...@google.com added comment #3:

The original operations are

RightShiftWithRounding(A*w0 + B*w1, 8)

with saturation and w0 + w1 == 16. The largest value of w0 is 13.

13 = 0xD, 13 << 13 = 0x1A000 is out of range for a 16-bit input. Shifting weight left by 7 would have vqdmulhq_s16 result in (x*weight) >> 8, which is close to what we want.

In general, switching to right-shifting before adding loses information, which results in missing carry bits. This case is no exception.

_______________________________

Reference Info: 150325685 distance weighted blend possible improvements

component: 750480

status: Accepted

reporter: jo...@google.com
assignee: pe...@google.com
cc: ga...@googlegroups.com, jo...@google.com

type: Feature Request
priority: P4
severity: S2
retention: Component default

buganize...@google.com

unread,

Apr 7, 2022, 3:05:05 PM4/7/22

to gav1-deve...@googlegroups.com

Replying to this email means your email address will be shared with the team that works on this product.

https://issuetracker.google.com/issues/150325685

Changed

sl...@google.com added comment #4:
I think what Johann was trying to say is the weight would be shifted by 12 and the +1 (total 13) would be done by the vqdmulhq_s16 instruction. Shifting weight left by 12 would still fit in 16 bits, however in your example, weight would then be viewed as a negative number by the instruction. (0xd000) Weight could be shifted left by 11 to get around the sign issue, however after the vqdmulhq_s16, 4 bits of precision would be lost and as you stated, the carry bits would be missing and the add would produce incorrect results.

_______________________________

Reference Info: 150325685 distance weighted blend possible improvements

component: 750480

status: Infeasible

reporter: jo...@google.com
assignee: pe...@google.com
cc: ga...@googlegroups.com, jo...@google.com

type: Feature Request
priority: P4
severity: S2
retention: Component default

buganize...@google.com

unread,

Apr 18, 2022, 9:51:34 PM4/18/22

to gav1-deve...@googlegroups.com

Replying to this email means your email address will be shared with the team that works on this product.

https://issuetracker.google.com/issues/150325685

Changed
status: Infeasible → Assigned

jz...@google.com added comment #5:

I think this is possible if weight and the shifts are factored out correctly using the equivalence of w0 = 16 - w1 and w1 = 16 - w0.

Original equation:

(p0w0 + p1w1 + 128) >> 8

Factored forms:

= (p0w0 + p1(16 - w0) + 128) >> 8
= (p0w0 + 16p1 - p1w0 + 128) >> 8
= ((p0 - p1)w0 + 16p1 + 128) >> 8
= (((p0 - p1)w0) >> 4) + p1 + 8) >> 4

= (p0(16 - w1) + p1w1 + 128) >> 8
= (16p0 - p0w1 + p1w1 + 128) >> 8
= ((p1 - p0)w1 + 16p0 + 128) >> 8
= (((p1 - p0)w1) >> 4) + p0 + 8) >> 4

For the weight scaling we then have:

= (((p0 - p1)(w0 << 12)) >> 16) + p1 + 8) >> 4
= (((p1 - p0)(w1 << 12)) >> 16) + p0 + 8) >> 4

pmulhw and sqdmulh provide the multiply and descale operation in the weight group and the final shift with saturation should take care of the rest. For the sign change with w0 << 12 the equation could likely be adjusted based on the fact that -w0 = w1 - 16.

_______________________________

Reference Info: 150325685 distance weighted blend possible improvements

component: 750480

status: Assigned

reporter: jo...@google.com
assignee: pe...@google.com
cc: ga...@googlegroups.com, jo...@google.com

type: Feature Request
priority: P4
severity: S2
retention: Component default

buganize...@google.com

unread,

May 7, 2022, 7:45:36 PM5/7/22

to gav1-deve...@googlegroups.com

Replying to this email means your email address will be shared with the team that works on this product.

https://issuetracker.google.com/issues/150325685

Changed

pe...@google.com added comment #6:

The NEON is done in cl/446264625

SSE4 is next

_______________________________

Reference Info: 150325685 distance weighted blend possible improvements

component: 750480

status: Accepted

reporter: jo...@google.com
assignee: pe...@google.com
cc: ga...@googlegroups.com, jo...@google.com

type: Feature Request
priority: P4
severity: S2
retention: Component default

buganize...@google.com

unread,

May 11, 2022, 3:05:43 PM5/11/22

to gav1-deve...@googlegroups.com

Replying to this email means your email address will be shared with the team that works on this product.

https://issuetracker.google.com/issues/150325685

Changed

pe...@google.com added comment #7:

cl/446264625 covered the NEON case, but there's a problem with SSE4. There's no doubling, so the weight has to be shifted by 12 rather than 11. Additionally, we can't predict the order of the weights (see GetDistanceWeights[1]), so 13 << 12 is always possible, and gets negative sign extended in _mm_mulhi_epi16, unless we start with a branch that switches which derived formula is used. I'm exploring that approach now.

[1]https://source.corp.google.com/piper///depot/google3/third_party/libgav1/src/tile/prediction.cc;l=145;bpv=1;bpt=1;rcl=447614816

_______________________________

Reference Info: 150325685 distance weighted blend possible improvements

component: 750480

status: Accepted
reporter: jo...@google.com
assignee: pe...@google.com
cc: ga...@googlegroups.com, jo...@google.com

type: Feature Request
priority: P4
severity: S2
retention: Component default

buganize...@google.com

unread,

May 11, 2022, 3:23:22 PM5/11/22

to gav1-deve...@googlegroups.com

Replying to this email means your email address will be shared with the team that works on this product.

https://issuetracker.google.com/issues/150325685

Changed

jz...@google.com added comment #8:
I don't think a branch should be necessary. I suggested one approach at the end of comment #5, multiply by -weight. pmulhrsw can likely be used for the remainder of the downscale.

_______________________________

Reference Info: 150325685 distance weighted blend possible improvements

component: 750480

status: Accepted
reporter: jo...@google.com
assignee: pe...@google.com
cc: ga...@googlegroups.com, jo...@google.com

type: Feature Request
priority: P4
severity: S2
retention: Component default

buganize...@google.com

unread,

May 11, 2022, 3:38:50 PM5/11/22

to gav1-deve...@googlegroups.com

Replying to this email means your email address will be shared with the team that works on this product.

https://issuetracker.google.com/issues/150325685

Changed

pe...@google.com added comment #9:

I think the negation identity still involves having to swap p0-p1 with p1-p0. I think it's reasonable to just start the function with:

 const auto* pred_0 = (weight_0 < 9)
                           ? static_cast<const int16_t*>(prediction_0)
                           : static_cast<const int16_t*>(prediction_1);
  const auto* pred_1 = (weight_0 < 9)
                           ? static_cast<const int16_t*>(prediction_1)
                           : static_cast<const int16_t*>(prediction_0);
  const uint8_t weight = (weight_0 < 9) ? weight_0 : weight_1;

_______________________________

Reference Info: 150325685 distance weighted blend possible improvements

component: 750480

status: Accepted
reporter: jo...@google.com
assignee: pe...@google.com
cc: ga...@googlegroups.com, jo...@google.com

type: Feature Request
priority: P4
severity: S2
retention: Component default

buganize...@google.com

unread,

May 12, 2022, 12:50:25 AM5/12/22

to gav1-deve...@googlegroups.com

Replying to this email means your email address will be shared with the team that works on this product.

https://issuetracker.google.com/issues/150325685

Changed

jz...@google.com added comment #10:

I believe ((((p1 - p0) * -w0) >> 4) + p1 + 8) >> 4 will work in all cases.

_______________________________

Reference Info: 150325685 distance weighted blend possible improvements

component: 750480

status: Accepted
reporter: jo...@google.com
assignee: pe...@google.com
cc: ga...@googlegroups.com, jo...@google.com

type: Feature Request
priority: P4
severity: S2
retention: Component default

buganize...@google.com

unread,

May 19, 2022, 1:13:20 PM5/19/22

to gav1-deve...@googlegroups.com

Replying to this email means your email address will be shared with the team that works on this product.

https://issuetracker.google.com/issues/150325685

Changed

pe...@google.com added comment #11:

I revisited the negative weight idea because it seems like it should work in principle, but found the following:

>>> p0 = 9212
>>> p1 = -5132
>>> hex((p0 - p1) * (13 << 12))
'0x2d868000'
>>> hex((p1 - p0) * ((-13) << 12))
'0x2d868000'
>>> hex((2**16 - 13) << 12)
'0xfff3000'
>>> hex((p1 - p0) * 0x3000)
'-0xa818000'

_______________________________

Reference Info: 150325685 distance weighted blend possible improvements

component: 750480

status: Accepted
reporter: jo...@google.com
assignee: pe...@google.com
cc: ga...@googlegroups.com, jo...@google.com

type: Feature Request
priority: P4
severity: S2
retention: Component default

buganize...@google.com

unread,

May 19, 2022, 1:15:31 PM5/19/22

to gav1-deve...@googlegroups.com

Replying to this email means your email address will be shared with the team that works on this product.

https://issuetracker.google.com/issues/150325685

Changed

pe...@google.com added comment #12:

However, ((p0-p1) << 1) * (w0 << 11) does work, at the cost of one more shift. It seems reasonable that this is worth the benefit of removing the overhead of those branches.

_______________________________

Reference Info: 150325685 distance weighted blend possible improvements

component: 750480

status: Accepted
reporter: jo...@google.com
assignee: pe...@google.com
cc: ga...@googlegroups.com, jo...@google.com

type: Feature Request
priority: P4
severity: S2
retention: Component default

Reply all

Reply to author

Forward