Initial OpenCL implementation


Aaron Watry

Apr 18, 2011, 2:00:54 PM
to Codec Developers
Hello all,

I've pushed the current code for my OpenCL VP8 decoder to a sandbox.
I have implemented initial subpixel prediction (sixtap and bilinear),
IDCT/Dequant, and loop filtering. The CL compiler and device
detection is present, and the presence of an OpenCL library is
detected at run-time through dlopen(). If the system is deemed unable
to use OpenCL for decoding, the CPU paths are used as a fallback.

While subpixel/IDCT/Loop Filtering is implemented, it's most
definitely not optimized. I'm planning on working on performance
optimization as a next step, starting with the loop filtering, then
working on refactoring the Macroblock decoding to increase the thread
count. If anyone wants to work on getting the Windows Cygwin or
Visual Studio configuration working, feel free. I've mainly been
developing in Linux/MacOS.

I've been doing most of my work in github.com/awatry/
libvpx.opencl.git, and I'll probably continue to use that as a primary
development store for now, but if anyone wants to take my current code
and run with it, go for it (or let me know about collaborating).

Anyway, let me know if you have any comments/questions. I can go into
more detail about the implementation for anyone who wants it.

--Aaron Watry

poreddy...@gmail.com

Sep 1, 2015, 6:58:13 AM
to Codec Developers
Hi,
   Can you briefly explain the parallelism? Do the CPU and GPU run in parallel? For example, if you schedule both inter prediction and IDCT/dequant on the GPU, does the CPU have to wait for their output? Please correct my understanding if it's wrong.

Aaron Watry

Sep 6, 2015, 3:36:04 PM
to Codec Developers, poreddy...@gmail.com


On Tuesday, September 1, 2015 at 5:58:13 AM UTC-5, poreddy...@gmail.com wrote:
Hi,
   Can you briefly explain the parallelism? Do the CPU and GPU run in parallel? For example, if you schedule both inter prediction and IDCT/dequant on the GPU, does the CPU have to wait for their output? Please correct my understanding if it's wrong.


The parallelism of the project that's on my github site isn't that amazing.  All of the decoding/subpixel filtering/IDCT/dequantization steps are done on the CPU unless you change some defines to force the subpixel/IDCT/dequantization steps onto the GPU.  The only thing that is always done on the GPU is the loop filtering step at the end of processing each frame, and since the final results of loop filtering feed into the next frame's decoding, that implies a full round-trip to the GPU and back just for loop filtering.

I was able to get the loop filtering algorithm to have a decent amount of parallelism by using a 2-macroblock horizontal offset when launching kernels, which reduces the need for any sort of barrier/locking within the kernels themselves.  That still left me with 254 NDRange kernel invocations per 1080p frame (down from 130560 per frame for the naive implementation, which filtered vertical MB edges, then vertical inner-block edges, then horizontal MB edges, then horizontal inner edges).  In either case, the loop filtering stage needs to be completed by the time anything attempts to use its result, so there is some amount of waiting done by the CPU... but there are parts that allow the CPU/GPU to run in parallel.

Realistically, if we want to achieve really good gains via opencl decoding of VP8, we'll need to:
1) Offload more/all of the decoding pipeline to the GPU to hide the latency of transferring things back and forth. This will require some changes to how the decoded macroblock data is stored, so that we can decompress all macroblocks for a frame up front, store them, and send them all to the GPU at once, instead of running the decompress/filter/IDCT/dequant steps repeatedly once per macroblock in a pipeline...  At least I believe that this was my conclusion back then... it's been a while.

2) If we do all of the macroblock decompression in the beginning, additional opportunities for parallelism in the subpixel/idct/dequant stages may present themselves.

3) Reduce the amount of data copied back and forth between varying memory spaces (although I haven't tried this with an APU with a combined memory space). I tried to do this in the loop filtering stage at least, but there's still room for improvement in the previous steps...

4) If we can modify the libvpx api to have a little more flexibility for VP8, we could add some ways to turn the resulting loop-filtered frames directly into GPU textures that can be passed off to the calling program, which would prevent an entire round-trip from/to the GPU's space when getting ready to render the frame. This would help the bandwidth situation a bit.

5) I never attempted to use OpenCL's image processing functionality.  At the time, not all of the platforms that I needed to support could handle images.  It may be worthwhile to convert the raw image data to GPU textures and use CL's image processing capabilities to see if that can help...  Like I said, I never tried... and for the most part, I never spent the time investigating the CL image API enough to know if it'd help.
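
For reference, item 5 would amount to something like the following device-side fragment, assuming the frame data has been uploaded as a CL image. This is a sketch only; nothing like it exists in the repository:

```c
/* OpenCL C device code (untested fragment): read pixels through the
 * image API instead of a raw global buffer.  A sampler gives free
 * clamping at frame edges, which the buffer path has to branch for. */
__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_NEAREST;

__kernel void copy_pixels(__read_only image2d_t src,
                          __write_only image2d_t dst) {
  int2 pos = (int2)(get_global_id(0), get_global_id(1));
  uint4 px = read_imageui(src, smp, pos);
  write_imageui(dst, pos, px);
}
```

Whether the sampler hardware and texture cache actually beat plain buffers for a loop filter's access pattern is exactly the open question from item 5.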
 
--Aaron

Frank Galligan

Sep 8, 2015, 6:14:35 PM
to Codec Developers, poreddy...@gmail.com
I know this is for VP8, but there was an attempt to create a GPU decoder for VP9 here [1]. It is a little old at this point, but they modified the decoder so all of the stages could be done per frame. Also, VP9 has a frame-parallel decode mode where you could do the loopfilter of the previous frame while decoding the next frame in parallel.

[1] https://chromium.googlesource.com/webm/libvpx/+/mcw2


Jeff Muizelaar

Sep 8, 2015, 9:08:05 PM
to codec...@webmproject.org, poreddy...@gmail.com
On Tue, Sep 8, 2015 at 6:14 PM, 'Frank Galligan' via Codec Developers
<codec...@webmproject.org> wrote:
> I know this is for VP8, but there was an attempt to create a GPU decoder for
> VP9 here [1]. It is a little old at this point, but they modified the
> decoder so all of the stages could be done per frame. Also VP9 has a frame
> parallel decode mode where you could do the loopfilter of the previous frame
> while in parallel decoding the next frame.
>
> [1] https://chromium.googlesource.com/webm/libvpx/+/mcw2
>

There doesn't seem to be a GPU decoder in that branch.

-Jeff