WebP Parallel Analysis and Implementation


Pablo Enfedaque

unread,
May 18, 2016, 6:31:02 PM5/18/16
to WebP Discussion

Hello everyone,


I am working on a project to explore (and, if possible, develop) a massively parallel implementation of a lossy WebP encoder. I am not an expert in the coding system, so I would like to share here my initial ideas regarding the WebP lossy encoder's parallelism. I would greatly appreciate it if anyone could double-check my statements and point out any mistakes, or alternatively share any comments or suggestions.



    WebP lossy encoder parallelism analysis

------------------------------------------------------------------


I list here all the main relevant points (performance-wise) where I think the WebP encoder exposes parallelism, from coarse- to fine-grained:



1.- The Y, U and V planes can each be coded independently of the others, and in parallel.



2.- The coding of different macroblocks (MBs) presents dependencies that follow this pattern:


1   2   3   4   5

3   4   5   6   7

5   6   7   8   9

7   8   9  10  11


Each number corresponds to a different MB, and the number itself is the “dependency step” at which that MB can be processed. That is:


step 1:  only the corner top-left MB (1,1) can be processed.

step 2:  only the MB to its right, (1,2), can be processed.

step 3:  2 MBs, (1,3) and (2,1), can be processed independently and in parallel.

etc.


So, as the coding of the MBs advances, we have a kind of increasing-max-decreasing parallelism pattern, where:


max parallelism = min(image_rows, (image_columns + 1) / 2)
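
To make the pattern concrete, here is a minimal sketch (my own illustration, not libwebp code) that enumerates which MBs become ready at each dependency step; with 0-based rows/columns, step(r, c) = 2*r + c + 1 reproduces the table above:

/* Sketch: group MBs into wavefront steps for the luma dependency
   pattern above. All MBs sharing a step value are mutually independent
   and can be processed in parallel. Illustration only, not libwebp code. */
#include <stdio.h>

static int luma_step(int row, int col) {
  return 2 * row + col + 1;  /* left, top and top-right MBs get lower steps */
}

int main(void) {
  const int rows = 4, cols = 5;                   /* the 4x5 example above */
  const int last = luma_step(rows - 1, cols - 1);
  for (int step = 1; step <= last; ++step) {
    printf("step %2d:", step);
    for (int r = 0; r < rows; ++r)
      for (int c = 0; c < cols; ++c)
        if (luma_step(r, c) == step) printf(" (%d,%d)", r + 1, c + 1);
    printf("\n");                 /* e.g. "step  5: (1,5) (2,3) (3,1)" */
  }
  return 0;                       /* max group size = min(4, (5+1)/2) = 3 */
}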



3.- If I have understood it correctly, the MB dependencies for computing the 16x16 and 8x8 prediction modes look like this:


x x

x o


(“o” is the MB to process and “x” marks the MBs it depends on)


But when using the 4x4 luma block prediction mode, the MBs present an L-shaped dependency like this:


x x x

x o


This is because, when predicting a MB using the 4x4 mode, the dependencies of each 4x4 subblock (SB) with respect to the previous SBs follow this same L-shaped pattern. This means that SBs on the right border of the MB require the coefficients of the bottom-left corner SB of the top-right MB.


Because the 4x4 block mode is not used in U or V, the dependencies between MBs when coding these planes are less constrained, and look like this:


1   2   3   4   5

2   3   4   5   6

3   4   5   6   7

4   5   6   7   8


Now with steeper (faster) increasing/decreasing parallelism transitions, and:


max parallelism = min(image_rows, image_columns)
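
The same sketch applies here by just relaxing the step function (again, my illustration only):

/* Chroma variant of luma_step() above: only the left, top and top-left
   MBs must come earlier, so the step is simply the anti-diagonal index. */
static int chroma_step(int row, int col) {
  return row + col + 1;   /* reproduces the second table */
}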



4.- The 4 different 16x16 or 8x8 prediction modes inside a given MB can be computed independently and in parallel.



5.- The 10 different prediction modes inside a given SB can be computed independently and in parallel. Dependencies between different SBs inside the same MB follow the L-shaped pattern described previously and have to be respected, although they present exactly the same parallelism described in point 2 of this discussion (now over a 4x4 matrix of SBs instead of MBs, with a max parallelism of 2).



6.- Inside a given luma MB, the 4 different 16x16 prediction modes and the 10 prediction modes for the first SB can be computed independently and in parallel (i.e. points 4 and 5 of this discussion can be done in parallel).



------------------------------------------------------------------



I think that is all, thank you very much for your comments!


Pablo


If you are interested in any details of the project or want to share any further comments, feel free to contact me at:


penfe...@nvidia.com

pablo.e...@gmail.com



Pascal Massimino

unread,
May 19, 2016, 4:23:47 AM5/19/16
to WebP Discussion
Hi Pablo,

On Thu, May 19, 2016 at 12:31 AM, Pablo Enfedaque <pablo.e...@gmail.com> wrote:

Hello everyone,


I am working on a project to explore (and, if possible, develop) a massively parallel implementation of a lossy WebP encoder. I am not an expert in the coding system, so I would like to share here my initial ideas regarding the WebP lossy encoder's parallelism. I would greatly appreciate it if anyone could double-check my statements and point out any mistakes, or alternatively share any comments or suggestions.



    WebP lossy encoder parallelism analysis

------------------------------------------------------------------


I list here all the main relevant points (performance-wise) where I think the WebP encoder exposes parallelism, from coarse- to fine-grained:



1.- The Y, U and V planes can each be coded independently of the others, and in parallel.


yes. However, the U and V planes share the same prediction mode, so they are somewhat tied together.
 



2.- The coding of different macroblocks (MBs) presents dependencies that follow this pattern:


1   2   3   4   5

3   4   5   6   7

5   6   7   8   9

7   8   9  10  11


Each number corresponds to a different MB, and the number itself is the “dependency step” at which that MB can be processed. That is:


step 1:  only the corner top-left MB (1,1) can be processed.

step 2:  only the MB to its right, (1,2), can be processed.

step 3:  2 MBs, (1,3) and (2,1), can be processed independently and in parallel.

etc.


So, as the coding of the MBs advances, we have a kind of increasing-max-decreasing parallelism pattern, where:


max parallelism = min(image_rows, (image_columns + 1) / 2)


yes, this ordering is often called 'wavefront parallel processing'; see for instance page 136 here.



3.- If I have understood it correctly, the MB dependencies for computing the 16x16 and 8x8 prediction modes look like this:


x x

x o


(“o” is the MB to process and “x” marks the MBs it depends on)


But when using the 4x4 luma block prediction mode, the MBs present an L-shaped dependency like this:


x x x

x o


This is because, when predicting a MB using the 4x4 mode, the dependencies of each 4x4 subblock (SB) with respect to the previous SBs follow this same L-shaped pattern. This means that SBs on the right border of the MB require the coefficients of the bottom-left corner SB of the top-right MB.


yes, exactly.
 


Because the 4x4 block mode is not used in U or V, the dependencies between MBs when coding these planes are less constrained, and look like this:


1   2   3   4   5

2   3   4   5   6

3   4   5   6   7

4   5   6   7   8


Now with steeper (faster) increasing/decreasing parallelism transitions, and:


max parallelism = min(image_rows, image_columns)


Yes. Note that the 16x16 modes also fall into this category. But that's not really usable (I think), because you have to evaluate the 4x4 coding too for each MB, and pick the best option between 16x16 and 4x4.



4.- The 4 different 16x16 or 8x8 prediction modes inside a given MB can be computed independently and in parallel.


yes. However, evaluating distortion requires passing the reference pixels to each parallel evaluation, which can be costly.
The non-parallel version (that is: pass the boundary coded pixels and the reference pixels, generate all prediction modes at once, and compute the distortion for each mode) has less data traffic but less parallelism. Yet it can be pretty fast and lightweight. There's a CPU/memory-bandwidth trade-off here.



5.- The 10 different prediction modes inside a given SB can be computed independently and in parallel. Dependencies between different SBs inside the same MB follow the L-shaped pattern described previously and have to be respected, although they present exactly the same parallelism described in point 2 of this discussion (now over a 4x4 matrix of SBs instead of MBs, with a max parallelism of 2).


yes. Note, additionally, that not all prediction modes require the same L-shape availability. For instance, the horizontal prediction mode (HE4) only requires the left pixels, and the vertical one (VE4) only the immediate top ones. Not sure it can be exploited, but that's something to keep in mind.
Another point: the rightmost SBs (marked with an x below) can't use top-right pixels, because these are not yet known. So they don't really use L-shaped predictors per se; there are special replication rules for these.

 0  1  2  3
 4  5  6  7x
 8  9 10 11x
12 13 14 15x

4x4 sub-block ordering within a 16x16 MB.
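
If I recall the rule correctly, it amounts to something like the following; a rough sketch of the idea only, with a made-up name and layout, not the actual libwebp routine:

/* Rough sketch (made-up name/layout, not the actual libwebp code):
   subblocks 7, 11 and 15 have no reconstructed above-right neighbours
   yet, so the four above-right samples of subblock 3 (taken from the
   already-reconstructed row above the MB) are reused instead. */
#include <stdint.h>

static void FillAboveRight(const uint8_t above[20], /* 16 top + 4 top-right */
                           uint8_t top_right[4]) {
  for (int i = 0; i < 4; ++i) top_right[i] = above[16 + i];
}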



6.- Inside a given luma MB, the 4 different 16x16 prediction modes and the 10 prediction modes for the first SB can be computed independently and in parallel (i.e. points 4 and 5 of this discussion can be done in parallel).


yes. 



------------------------------------------------------------------



I think that is all, thank you very much for your comments!


hope it helps. Interested to see your progress here!
skal/






Pablo Enfedaque

unread,
May 20, 2016, 10:59:06 PM5/20/16
to WebP Discussion

Hi skal,


Thank you very much for your response! It is really useful to have your opinion and comments on these aspects of the encoder. A couple of comments about your response:


1.- It is true that U and V share prediction modes. Checking out the code, though, I am a bit confused about how it works. If I want to assess the prediction+reconstruction+distortion of a MB in U and V, I understand that I could reconstruct both planes independently. The thing is: when measuring distortion, rate, etc. for a given prediction mode in a MB, do I have to measure them together somehow (add them up, average them, or something similar)?


2.- I am planning to develop a CPU-GPU implementation, so an increase in main memory bandwidth usage should not be a major problem, especially considering the imminent arrival of 2.5D and 3D memories. Latency could be an issue, though, if I cannot expose enough parallelism at all times.


3.- IMO there is enough coarse-grained parallelism to exploit. Have you not developed a fully parallel CPU implementation due to other priorities, or did your analysis show that it would not be efficient performance-wise?


4.- Last and most important question: from what I understand, I could run a sufficiently large number of GPU threads in the encoder (those 32-wide vector “threads” called warps in CUDA). The key question now is:


- Is there pixel-level (or similarly fine-grained) parallelism when computing TTransform, ITransform, QuantizeBlock, Disto4/16 and, I think more importantly, GetResidualCost?


My initial guess was that there is (at least in the DCT, WHT and quantization). After a first look at the code I still think so, but I may easily be missing something.
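
As an example of the kind of per-coefficient independence I have in mind, here is a toy quantizer (my own sketch, not libwebp's QuantizeBlock; it ignores zigzag order and assumes non-zero quantizer steps):

/* Toy quantizer, illustration only: every loop iteration is independent
   of the others, so the 16 coefficients of a 4x4 block could each map to
   one GPU lane. Only finding the last non-zero coefficient needs a
   (cheap) reduction at the end. */
#include <stdint.h>

static int QuantizeToy(int16_t coeffs[16], const uint16_t q[16]) {
  int last = -1;
  for (int i = 0; i < 16; ++i) {                 /* parallel part */
    const int sign = (coeffs[i] < 0);
    const int v = (sign ? -coeffs[i] : coeffs[i]) / q[i];
    coeffs[i] = (int16_t)(sign ? -v : v);
    if (v > 0) last = i;                         /* reduction part */
  }
  return last;   /* index of the last non-zero coefficient, or -1 */
}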


I will keep you posted about my progress!


Cheers and thanks,


Pablo


Pascal Massimino

unread,
May 24, 2016, 9:53:18 AM5/24/16
to WebP Discussion
Hi Pablo,

On Sat, May 21, 2016 at 4:59 AM, Pablo Enfedaque <pablo.e...@gmail.com> wrote:

Hi skal,


Thank you very much for your response! It is really useful to have your opinion and comments on these aspects of the encoder. A couple of comments about your response:


1.- It is true that U and V share prediction modes. Checking out the code, though, I am a bit confused about how it works. If I want to assess the prediction+reconstruction+distortion of a MB in U and V, I understand that I could reconstruct both planes independently. The thing is: when measuring distortion, rate, etc. for a given prediction mode in a MB, do I have to measure them together somehow (add them up, average them, or something similar)?


You should really consider (mentally) the 8x8 U and V blocks to be of the same kind. If you look at the code (vp8enci.h:57), we actually consider them to be a single 16x8 sample block. They are always handled together as a pair (rate, distortion, reconstruction, etc.).
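
Schematically, the distortion for a chroma mode then becomes a single score over the pair, something like this (a simplified illustration with a made-up name, not the actual code):

/* Simplified illustration (made-up name, not the actual libwebp code):
   distortion for a chroma prediction mode is accumulated over U and V
   as one 16x8 sample block, so one score decides the mode shared by
   both planes. */
#include <stdint.h>

static uint32_t ChromaSSE16x8(const uint8_t* src, const uint8_t* pred,
                              int stride) {
  uint32_t sse = 0;
  for (int y = 0; y < 8; ++y) {
    for (int x = 0; x < 16; ++x) {   /* left 8 columns: U, right 8: V */
      const int d = src[y * stride + x] - pred[y * stride + x];
      sse += (uint32_t)(d * d);
    }
  }
  return sse;
}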


2.- I am planning to develop a CPU-GPU implementation, so an increase in main memory bandwidth usage should not be a major problem, especially considering the imminent arrival of 2.5D and 3D memories. Latency could be an issue, though, if I cannot expose enough parallelism at all times.


3.- IMO there is enough coarse-grained parallelism to exploit. Have you not developed a fully parallel CPU implementation due to other priorities, or did your analysis show that it would not be efficient performance-wise?


Mostly, the current code tries to avoid inter-thread communication (barriers, locks, etc.) as much as possible, and to split the job into two independent parts when possible (which is hard for the encoder).
 


4.- Last and most important question: from what I understand, I could run a sufficiently large number of GPU threads in the encoder (those 32-wide vector “threads” called warps in CUDA). The key question now is:


- Is there pixel-level (or similarly fine-grained) parallelism when computing TTransform, ITransform, QuantizeBlock, Disto4/16 and, I think more importantly, GetResidualCost?


My initial guess was that there is (at least in the DCT, WHT and quantization). After a first look at the code I still think so, but I may easily be missing something.



Well, a typical profile for a lossy encoding job looks like this (that's just one run, so don't take the exact % numbers at face value):

  9.86%     [.] GetResidualCostSSE2
  9.32%     [.] VP8PutBit
  7.43%     [.] PickBestIntra4
  7.18%     [.] ITransform
  6.62%     [.] GetCoeffs
  5.54%     [.] FTransform
  4.27%     [.] QuantizeBlock
  3.85%     [.] CollectHistogram
  3.81%     [.] Disto4x4
  3.80%     [.] SSE4x4
...

VP8PutBit is inherently serial. The blocking point is the Intra4x4 decision, because you need to finish deciding one 4x4 block before going to the next
(unless using a staggered wavefront, as we discussed). But the rest can be evaluated in parallel, I'd say.


Thanks!
skal/



Pablo Enfedaque

unread,
May 24, 2016, 5:46:23 PM5/24/16
to WebP Discussion

Hi skal,


Thanks for the answer! Everything is clear now :D


Regarding VP8PutBit (a.k.a. the boolean arithmetic coder), I was thinking of first computing the prediction/reconstruction/metrics for all MBs in one pass, and afterwards doing the arithmetic coding for all MBs in a second pass. Does that make sense?
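
In skeleton form (all names below are placeholders of my own, not libwebp API), I mean something like:

/* Hypothetical two-pass skeleton, placeholder names throughout:
   pass 1 carries all the parallelism, pass 2 keeps the serial coder. */
typedef struct {
  int modes[16];       /* chosen prediction modes (placeholder) */
  short coeffs[384];   /* quantized coefficients (placeholder)  */
} MBContext;
typedef struct { int pos; } BitWriter;   /* stand-in coder state */

static void AnalyzeAndReconstructMB(MBContext* mb) { (void)mb; }
static void WriteMBCoefficients(BitWriter* bw, const MBContext* mb) {
  (void)bw; (void)mb;   /* this is where VP8PutBit would be driven */
}

static void EncodeTwoPass(MBContext* mbs, int num_mbs, BitWriter* bw) {
  /* Pass 1: prediction, reconstruction, metrics. MBs within the same
     wavefront step are independent: this is where the GPU would run. */
  for (int i = 0; i < num_mbs; ++i) AnalyzeAndReconstructMB(&mbs[i]);
  /* Pass 2: arithmetic coding, serial (or one coder per partition). */
  for (int i = 0; i < num_mbs; ++i) WriteMBCoefficients(bw, &mbs[i]);
}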


The VP8 spec says that a separate arithmetic decoder (AC) runs per partition, so I was optimistic about arithmetic coding different MBs independently (i.e. running as many parallel AC instances as there are MBs). I don't know exactly whether that is possible, though, because there may be (for instance) propagation of context probabilities between the coding of different MBs. Do you think multiple parallel ACs over different MBs are possible (after all MBs have been measured and reconstructed)?


Thank you very much!!


Pablo

Pablo Enfedaque

unread,
Jun 8, 2016, 8:29:24 PM6/8/16
to WebP Discussion
Hi there,

I have been exploring the alternative that I described in the post above. The arithmetic coder and its context probabilities are a rather confusing part of the codec, though, and I am still not sure whether the idea posted above is feasible. Does anyone with a better understanding of the codec than me have an opinion on it? :)

Cheers,

Pablo