Zero copy and the fastest way to render already-decoded YUV frames


William Bittner

Dec 7, 2015, 2:51:50 PM
to Chromium-dev
Right now, after the stream has already been jitter-buffered, parsed, demuxed, and decoded, as well as played back with proper timing, I have an external SDK with callbacks that hand me a VideoFrame from WrapYUVExternalElement. I am using CEF. Previously I created a "faux" stream parser and decoder that did nothing, but the CPU and memory usage is huge compared to rendering directly with OpenGL from the SDK, without any CEF/Chromium. CEF/Chromium can use 300% CPU and 1.5 GB for about 4-5 1080p streams that are all scaled down considerably in the browser window.
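
To make the zero-copy goal concrete, here is a minimal sketch of what "wrapping" externally decoded YUV planes means: the frame object holds pointers into the SDK's buffers instead of copying the plane data. The struct and function names here are hypothetical stand-ins, not the real Chromium API:

```cpp
#include <cstdint>

// Hypothetical stand-in for a wrapped external YUV frame: it references the
// SDK's decoded plane memory directly rather than owning a copy of it.
struct WrappedYuvFrame {
  const uint8_t* y;
  const uint8_t* u;
  const uint8_t* v;
  int y_stride, u_stride, v_stride;
  int width, height;
};

// "Wraps" the planes: no memcpy, just pointer and stride bookkeeping.
// The caller must keep the underlying SDK buffers alive while the frame
// is in flight.
WrappedYuvFrame WrapExternalYuv(const uint8_t* y, const uint8_t* u,
                                const uint8_t* v, int y_stride, int u_stride,
                                int v_stride, int width, int height) {
  return {y, u, v, y_stride, u_stride, v_stride, width, height};
}
```

The lifetime caveat in the comment is the crux of the whole problem: something must guarantee the SDK's buffers stay valid until the compositor is done with them.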

The reason I need to avoid the video pipeline is the unnecessary CPU and memory overhead for features I don't need. Remember, I am making a "FakeStreamParser" and a "FakeVideoDecoder".
This is the bug I opened. 

I tried skipping from the point where I was supposed to enqueue the data into the buffer on the HTMLMediaElement all the way to void VideoFrameCompositor::UpdateCurrentFrame( (branch 2357), but it seems the _client doesn't exist, and since it wasn't designed for this... I will suffer.

Can you please help me and just point me in the right direction: which part of the massive code base should I start digging into to build a way to say:

Sit on an IPC pipe from my SDK -> get a notification ((**VideoFrame**, **Element Id**)) and do a zero-copy shared-memory transfer to the GPU process into a Skia bitmap, and bam, it is rendered within the DOM element (like an <img> tag).

We are looking at 5+ videos per page (up to 15), from 1080p to 4K (which needs to be downsampled).

Here is what I was investigating:

Something like direct Skia or OpenGL access with a matrix to transform and translate the output to the right size and position inside the DOM element would be great, or maybe something like PPAPI's 2D context.
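
The "matrix to transform and translate" idea can be sketched as a tiny 2D affine map that scales a decoded frame's coordinates down and offsets them to the element's position on the page. The values and the struct name are purely illustrative:

```cpp
#include <array>

// Illustrative 2D affine transform: scale then translate, i.e. the minimal
// matrix needed to place a downscaled frame at a DOM element's position.
struct Affine2D {
  float sx, sy;  // scale factors (e.g. 0.25 to show 1080p at quarter size)
  float tx, ty;  // translation (element's top-left corner in page pixels)

  std::array<float, 2> Apply(float x, float y) const {
    return {sx * x + tx, sy * y + ty};
  }
};
```

In a real compositor path this would be a 3x3 or 4x4 matrix handed to Skia or GL, but the scale-plus-translate pair is all that a rect-to-rect placement needs.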

Something else I was looking at was the device-capture code for the local camera: having a shared-memory ring buffer to the GPU would be great! (Zero copy is a must.)
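
The ring-buffer handoff being asked for can be sketched as a single-producer/single-consumer ring of pre-allocated frame slots: the decoder writes directly into a slot and the consumer reads the same memory, with no copy in between. This is a minimal in-process sketch; a real cross-process version would place the slots in shared memory and add synchronization with the GPU, and the slot count is arbitrary here:

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal SPSC ring of pre-allocated frame slots. The producer decodes
// directly into a slot; the consumer reads the same bytes (zero-copy handoff).
struct FrameRing {
  static constexpr size_t kSlots = 4;
  std::array<std::vector<uint8_t>, kSlots> slots;  // pre-allocated frame storage
  std::atomic<size_t> head{0};  // next slot the producer will fill
  std::atomic<size_t> tail{0};  // next slot the consumer will read

  explicit FrameRing(size_t frame_bytes) {
    for (auto& s : slots) s.resize(frame_bytes);
  }

  // Producer: returns a slot to decode directly into, or nullptr if full.
  uint8_t* AcquireWrite() {
    size_t h = head.load(std::memory_order_relaxed);
    if (h - tail.load(std::memory_order_acquire) == kSlots) return nullptr;
    return slots[h % kSlots].data();
  }
  void CommitWrite() { head.fetch_add(1, std::memory_order_release); }

  // Consumer: returns the oldest filled slot, or nullptr if empty.
  const uint8_t* AcquireRead() {
    size_t t = tail.load(std::memory_order_relaxed);
    if (t == head.load(std::memory_order_acquire)) return nullptr;
    return slots[t % kSlots].data();
  }
  void CommitRead() { tail.fetch_add(1, std::memory_order_release); }
};
```

The key property is that AcquireRead returns the very pointer the producer wrote through, so the frame bytes are never duplicated between the two sides.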

Just let me know what the ideal path to take would be. Any additional resources would be greatly appreciated!

Please help,

Thanks

William Bittner

Dec 9, 2015, 12:14:54 PM
to Chromium-dev
Note: the best pipeline seems to be WebRTC with WebMediaPlayer_ms; see my other post.

William Bittner

Dec 9, 2015, 12:19:04 PM
to Chromium-dev
I posted a bounty for someone who can have a one-hour call with me and explain things; please see: https://groups.google.com/a/chromium.org/forum/#!topic/chromium-dev/n5f89xwgCfY

Dan Sanders

Dec 9, 2015, 2:36:41 PM
to Chromium-dev
I just wanted to offer some help in the form of explaining why it isn't easy for the Chrome Media team to engage with your question.

Not only do we not support CEF at all (i.e., you should be asking in a CEF support channel), but what you are doing is impossible in Chromium (unless your SDK is software-only, in which case it should fit the current GpuMemoryBuffer path pretty well) because of sandbox restrictions on the renderer process. Frames coming from a platform API in Chromium are always produced in the GPU process, so all of our expertise is around managing such things.

As I said above, GpuMemoryBuffers are the standard path for moving YUV data from the renderer to the GPU process. If you can arrange for your decoder to write into a mapped GpuMemoryBuffer, then you can render from it in the GPU process with zero copies. The FFmpeg software decode path is one-copy; after decode the result is copied into a GpuMemoryBuffer. One reason for this is that we must eventually unmap the GpuMemoryBuffer, but FFmpeg still needs the frame as a reference.
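
The two paths described above, zero-copy decode into a mapped buffer versus the one-copy software path, can be sketched with a stand-in buffer class. All names here are illustrative placeholders, not the real gfx::GpuMemoryBuffer API:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-in for a GPU-mappable buffer with a Map()/Unmap() lifecycle. After
// Unmap(), the CPU side must not touch the memory, which is exactly why a
// decoder that still needs the frame as a reference (like FFmpeg) cannot
// decode into it directly and must copy instead.
class FakeGpuMemoryBuffer {
 public:
  explicit FakeGpuMemoryBuffer(size_t bytes) : storage_(bytes) {}
  uint8_t* Map() { mapped_ = true; return storage_.data(); }
  void Unmap() { mapped_ = false; }
  bool mapped() const { return mapped_; }

 private:
  std::vector<uint8_t> storage_;
  bool mapped_ = false;
};

// Zero-copy path: the decoder's output pointer *is* the mapped buffer.
void DecodeInto(uint8_t* dst, size_t n) {
  for (size_t i = 0; i < n; ++i) dst[i] = static_cast<uint8_t>(i);
}

// One-copy path: decode into a pool frame the decoder keeps as a reference,
// then copy the finished frame into the mapped buffer before unmapping.
void CopyDecodedFrame(const uint8_t* src, uint8_t* dst, size_t n) {
  std::memcpy(dst, src, n);
}
```

The difference between the two paths is one memcpy per frame, which is also the difference Dan describes between an SDK that can target a mapped buffer and the FFmpeg software decoder.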

In general, the Chromium media pipeline is not horribly inefficient, and your numbers do not match expectations. Do your videos have exceptionally long reorder windows or many concurrent reference frames? If so, is your SDK hiding the memory usage out-of-process? Is the memory usage primarily video frames using memory, or is it something else? When profiled, what is using most of this CPU time? What platform are you even running on, and how (in terms of memory primitive) is this SDK delivering frames?

I can help if you pose specific questions about the actual media pipeline in Chromium, but otherwise there is not much I can offer.


- Dan

William Bittner

Dec 21, 2015, 7:59:31 PM
to Chromium-dev
Hi,
  Most of the memory when profiled (nearly all of it) was in the DecoderBuffer. I have not profiled the CPU usage yet, but I know about the memcpys in various places such as AppendBufferInternalAsync. I believe the person who wrote the code I inherited when starting here used code paths in Chromium that exist but are mostly unused, whereas FFmpeg uses a pre-allocated pool and has optimizations inside the pipeline.

If I go straight to the GPU with GpuMemoryBuffers, how will I associate that buffer with the MediaStream / video tag in the DOM?