Video Hardware Acceleration Browser issues with FFMPEG


Lauren Post

Mar 9, 2010, 8:07:02 PM
to Chromium-dev
I posted this on the wrong discussion board first so reposting on
chromium-dev.

I've hooked in an ffmpeg solution using our VPU hardware acceleration
but our performance is not optimal because of a few issues in the
chromium browser code.

FFmpeg's AVCodecContext has get_buffer and release_buffer callbacks which
we could set in our FFmpeg plugin, but the media engine does not call
them to allocate the buffer.  We can decode into buffers given to our
plugin if the buffers are allocated with physically contiguous memory.
If the get_buffer and release_buffer function pointers are used instead
of calling avcodec_alloc_frame in video_decoder_impl.cc in
src\media\filters, then we can guarantee the type of memory that will
work with our hardware codec.
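For reference, a minimal sketch of the kind of get_buffer/release_buffer override described above, assuming a driver call that hands back physically contiguous memory.  The AVCodecContext/AVFrame structs here are trimmed stand-ins for the real libavcodec types, and vpu_alloc_contiguous() is a hypothetical driver call:

```c
/* Sketch: let the decoder write straight into memory we control
 * (e.g. physically contiguous VPU memory) via custom buffer callbacks.
 * AVCodecContext/AVFrame are minimal stand-ins for the real libavcodec
 * structs; vpu_alloc_contiguous() is a hypothetical driver allocation. */
#define _POSIX_C_SOURCE 200112L  /* for posix_memalign */
#include <stdlib.h>
#include <string.h>

typedef struct AVFrame {
    unsigned char *data[4];
    int linesize[4];
} AVFrame;

typedef struct AVCodecContext {
    int width, height;
    int (*get_buffer)(struct AVCodecContext *c, AVFrame *pic);
    void (*release_buffer)(struct AVCodecContext *c, AVFrame *pic);
} AVCodecContext;

/* Stand-in for a driver call returning page-aligned, physically
 * contiguous memory the hardware codec can decode into. */
static void *vpu_alloc_contiguous(size_t size) {
    void *p = NULL;
    if (posix_memalign(&p, 4096, size) != 0)
        return NULL;
    return p;
}

static int vpu_get_buffer(AVCodecContext *c, AVFrame *pic) {
    size_t y_size = (size_t)c->width * (size_t)c->height; /* YV12 layout */
    unsigned char *base = vpu_alloc_contiguous(y_size * 3 / 2);
    if (!base)
        return -1;
    pic->data[0] = base;                  /* Y plane */
    pic->data[1] = base + y_size;         /* U plane */
    pic->data[2] = base + y_size * 5 / 4; /* V plane */
    pic->linesize[0] = c->width;
    pic->linesize[1] = pic->linesize[2] = c->width / 2;
    return 0;
}

static void vpu_release_buffer(AVCodecContext *c, AVFrame *pic) {
    (void)c;
    free(pic->data[0]);  /* planes share one contiguous allocation */
    memset(pic->data, 0, sizeof(pic->data));
}
```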

We have a hardware CSC which we'd like to use as post-processing on our
codec output to produce RGB (and scale), but the decode engine in the
same file assumes all FFmpeg output is YUV.  RGB support exists in
src\media\base\buffers.h, but src\media\filters\ffmpeg_video_decode_engine
has no cases for RGB, and later EnqueueVideoFrame in
video_decoder_impl.cc assumes YUV-only output instead of looking at the
surface format returned by GetSurfaceFormat in
ffmpeg_video_decode_engine.cc.
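A sketch of what a format-aware GetSurfaceFormat() could look like once an RGB path exists, instead of the hardcoded YUV assumption.  The enum values below are illustrative stand-ins, not the real FFmpeg/Chromium constants:

```c
/* Sketch: map the decoder's pixel format to a surface format rather
 * than assuming every output is YV12.  All enum values are illustrative
 * stand-ins for the real FFmpeg PixelFormat / media surface constants. */
typedef enum {
    PIX_FMT_YUV420P,
    PIX_FMT_RGB565,
    PIX_FMT_RGB24,
    PIX_FMT_RGBA32
} PixelFormat;

typedef enum {
    SURFACE_YV12,
    SURFACE_RGB565,
    SURFACE_RGB24,
    SURFACE_RGBA,
    SURFACE_INVALID
} SurfaceFormat;

static SurfaceFormat GetSurfaceFormat(PixelFormat fmt) {
    switch (fmt) {
    case PIX_FMT_YUV420P: return SURFACE_YV12;
    case PIX_FMT_RGB565:  return SURFACE_RGB565;  /* RGB paths added */
    case PIX_FMT_RGB24:   return SURFACE_RGB24;
    case PIX_FMT_RGBA32:  return SURFACE_RGBA;
    default:              return SURFACE_INVALID;
    }
}
```

EnqueueVideoFrame would then branch on the returned format rather than assuming YV12.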

I know most FFmpeg apps assume FFmpeg has internal buffers, so memcpys
are done after the FFmpeg plugin finishes decoding.  Is there any way to
add a new capability to FFmpeg, similar to CODEC_CAP_DR1, where the
FFmpeg codec can decode into the buffers and a memcpy is not done?  We
have 720p working, but performance is too slow with the memcpys done in
the browser.

I have more questions but I don't want to make this too long.  Our main
issue is that we need the browser to support a hardware path that
eliminates the memcpys and software CSC when a hardware solution is
available.

Yes, I'm trying to make these changes myself to see if I can get them
working, but if anyone has any tips or issues I should be aware of, I'd
appreciate it.

Andrew Scherkus

Mar 9, 2010, 8:27:10 PM
to laure...@freescale.com, Chromium-dev
On Tue, Mar 9, 2010 at 5:07 PM, Lauren Post <laure...@freescale.com> wrote:
I posted this on the wrong discussion board first so reposting on
chromium-dev.

I've hooked in an ffmpeg solution using our VPU hardware acceleration
but our performance is not optimal because of a few issues in the
chromium browser code.

FFmpeg's AVCodecContext has get_buffer and release_buffer callbacks which
we could set in our FFmpeg plugin, but the media engine does not call
them to allocate the buffer.  We can decode into buffers given to our
plugin if the buffers are allocated with physically contiguous memory.
If the get_buffer and release_buffer function pointers are used instead
of calling avcodec_alloc_frame in video_decoder_impl.cc in
src\media\filters, then we can guarantee the type of memory that will
work with our hardware codec.

Very cool!!  I realized the same thing and it really does seem like some low-hanging performance fruit.
 
We have a hardware CSC which we'd like to use as post-processing on our
codec output to produce RGB (and scale), but the decode engine in the
same file assumes all FFmpeg output is YUV.  RGB support exists in
src\media\base\buffers.h, but src\media\filters\ffmpeg_video_decode_engine
has no cases for RGB, and later EnqueueVideoFrame in
video_decoder_impl.cc assumes YUV-only output instead of looking at the
surface format returned by GetSurfaceFormat in
ffmpeg_video_decode_engine.cc.

Yeah we have a mostly-baked-in assumption that all decoder output data is YV12 planar and we have to convert to RGB when rendering.  Adding support for an RGB path shouldn't be too large of a leap.
 
I know most FFmpeg apps assume FFmpeg has internal buffers, so memcpys
are done after the FFmpeg plugin finishes decoding.  Is there any way to
add a new capability to FFmpeg, similar to CODEC_CAP_DR1, where the
FFmpeg codec can decode into the buffers and a memcpy is not done?  We
have 720p working, but performance is too slow with the memcpys done in
the browser.

I haven't looked into CODEC_CAP_DR1 but it sounds similar to having application-allocated output buffers and using those for rendering.

I have more questions but I don't want to make this too long.  Our main
issue is that we need the browser to support a hardware path that
eliminates the memcpys and software CSC when a hardware solution is
available.

Keep them coming!  I've been working on a GPU-assisted CSC + scale + render path on Linux.  We package up the YV12 data and ship it off to our GPU process which creates a quad and scales + positions it relative to the other web content.  It's buggy and may break a little bit of website compatibility, but for large resolutions it's much, much faster than the CPU.

/src/chrome/gpu/gpu_video_layer_glx.* has the fun bits.

Out of curiosity, does your CSC hardware output directly to the framebuffer?  We've considered using hardware overlays, etc., but having the video output directly on top of the web page is the worst-case scenario for website compatibility (i.e., you can't render any playback controls on top of the video).
 
Yes, I'm trying to make these changes myself to see if I can get them
working, but if anyone has any tips or issues I should be aware of, I'd
appreciate it.

It sounds like we're working on similar problems and we've done a lot of research in this area as sandboxing video decoders and hardware scalers is a tricky issue :) 

Andrew

Lauren Post

Mar 9, 2010, 10:29:03 PM
to Chromium-dev
>
> > I know most FFmpeg apps assume FFmpeg has internal buffers, so memcpys
> > are done after the FFmpeg plugin finishes decoding.  Is there any way
> > to add a new capability to FFmpeg, similar to CODEC_CAP_DR1, where the
> > FFmpeg codec can decode into the buffers and a memcpy is not done?  We
> > have 720p working, but performance is too slow with the memcpys done in
> > the browser.
>
> I haven't looked into CODEC_CAP_DR1 but it sounds similar to having
> application-allocated output buffers and using those for rendering.

Actually CODEC_CAP_DR1 probably cannot be used, since it is used by
many codecs in FFmpeg, including H.264.  It might be best to add a new
capability define in avcodec.h that would imply that buffers are
render-ready, using the age field to determine how long the buffers
will be kept intact.  We usually keep a pipeline of at least 2-3
buffers before releasing buffers for decode.  I've seen comments in
VLC implying that the H.264 software codec uses internal buffers, so
no direct render can be done on that codec even though it states DR1
capability.  If this new capability were set, the media engine would
just take the FFmpeg AVFrame buf field and save the pointers in the
surface field instead of doing a memcpy.
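The zero-copy hand-off described above could look roughly like this.  CODEC_CAP_RENDER_READY and all the struct fields here are hypothetical stand-ins, not real libavcodec or Chromium names:

```c
/* Sketch: when a (hypothetical) CODEC_CAP_RENDER_READY capability is
 * set, the media engine adopts the decoder's output pointers directly;
 * otherwise it falls back to the per-plane memcpy done today. */
#include <string.h>

#define CODEC_CAP_RENDER_READY 0x10000  /* hypothetical new flag */

typedef struct { int capabilities; } Codec;
typedef struct { unsigned char *data[3]; } Frame;
typedef struct {
    unsigned char *planes[3];
    int copied;  /* for illustration: did we fall back to memcpy? */
} Surface;

static void EnqueueVideoFrame(const Codec *codec, Frame *frame,
                              Surface *surface, const int plane_bytes[3]) {
    for (int i = 0; i < 3; ++i) {
        if (codec->capabilities & CODEC_CAP_RENDER_READY) {
            surface->planes[i] = frame->data[i];  /* zero-copy: adopt */
        } else {
            memcpy(surface->planes[i], frame->data[i],
                   (size_t)plane_bytes[i]);       /* today's copy path */
            surface->copied = 1;
        }
    }
}
```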

> Keep them coming!  I've been working on a GPU-assisted CSC + scale + render
> path on Linux.  We package up the YV12 data and ship it off to our GPU
> process which creates a quad and scales + positions it relative to the other
> web content.  It's buggy and may break a little bit of website
> compatibility, but for large resolutions it's much, much faster than the
> CPU.

Our CSC does scaling also.  I am not sure if FFmpeg can get destination
rectangle coordinates so that the output buffer can be one that can be
rendered from, but if so, that would also help performance.  Our theory
is that if the CSC writes to the destination buffer, then only an
optimized memcpy to the frame buffer would be needed, instead of the
pixel-by-pixel writes the current software CSC does.  I'll let our GPU
team know of these GPU changes you have done.
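That "optimized memcpy" amounts to a row-at-a-time blit that respects the framebuffer pitch, rather than per-pixel writes.  A minimal sketch, with all names illustrative:

```c
/* Sketch: copy one scanline at a time with memcpy, handling a
 * framebuffer pitch that may be wider than the source row, instead of
 * the per-pixel writes a software CSC loop does. */
#include <string.h>

static void blit_rows(const unsigned char *src, int src_pitch,
                      unsigned char *dst, int dst_pitch,
                      int row_bytes, int rows) {
    for (int y = 0; y < rows; ++y)
        memcpy(dst + (size_t)y * dst_pitch,
               src + (size_t)y * src_pitch,
               (size_t)row_bytes);
}
```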


>
> /src/chrome/gpu/gpu_video_layer_glx.* has the fun bits.
>
> Out of curiosity does your CSC hardware output directly to the framebuffer?
>  We've considered using hardware overlays, etc.. but having the video output
> directly on top of the web page is the worst-case scenario for website
> compatibility (i.e., you can't render any playback controls on top of the
> video).
>

> It sounds like we're working on similar problems and we've done a lot of
> research in this area as sandboxing video decoders and hardware scalers is a
> tricky issue :)

Yes, we've implemented our solution on several other frameworks (WinCE,
OpenMAX, GStreamer).  Our optimal performance for 720p and now HD is
using overlay, with the VPU decoding directly into the framebuffer with
no memcpys.  The CSC is done on the same frame buffer.  I understand
your concerns with using overlay, but it might be the only solution for
large resolutions with power constraints like ours on ARM processors.

As a note, we are very concerned about the sandbox design and how it
will affect hardware acceleration access.

Lauren
