Use Android GraphicBuffer as Texture in Chrome Android WebView


Roger Yi

Jul 10, 2014, 2:32:16 AM
to graphi...@chromium.org
Hi, 

I noticed that the Chrome Android WebView uses Android GraphicBuffer as the backing store for GL textures. The GraphicBuffer is allocated by the Android ION memory allocator, which is a HAL module implemented by the SoC vendor. I have tried this before, but found some compatibility issues:

1. Some devices limit the number of GraphicBuffers that can be allocated (Galaxy Nexus).
2. Some devices cannot lock/unlock a GraphicBuffer for writing while it is bound to a texture (Tegra 2/3/4, Adreno 2xx, ...), which means the texture has to be unbound after the GL call, but the bind/unbind operations may be slow.
3. On some devices the GraphicBuffer lock/unlock operations are very slow, perhaps 5 ms+ (Nexus 10).

Does that mean SoC vendors need to provide a new ION driver to solve the above issues for Android 4.4?
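For reference, the pattern I am describing is roughly the following (a minimal sketch; it assumes the private android::GraphicBuffer class and EGLImage-based binding, the function name is made up, and error handling is omitted):

#include <cstring>
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>
#include <GLES2/gl2ext.h>
#include <ui/GraphicBuffer.h>  // private Android header

// Sketch: allocate a gralloc/ION-backed buffer, bind it to a GL texture
// through an EGLImage, then write pixels from the CPU.
void UploadTileViaGraphicBuffer(EGLDisplay display, GLuint texture,
                                const void* pixels, size_t bytes) {
  android::sp<android::GraphicBuffer> buffer(new android::GraphicBuffer(
      256, 256, android::PIXEL_FORMAT_RGBA_8888,
      android::GraphicBuffer::USAGE_SW_WRITE_OFTEN |
          android::GraphicBuffer::USAGE_HW_TEXTURE));

  const EGLint attrs[] = { EGL_IMAGE_PRESERVED_KHR, EGL_TRUE, EGL_NONE };
  EGLImageKHR image = eglCreateImageKHR(
      display, EGL_NO_CONTEXT, EGL_NATIVE_BUFFER_ANDROID,
      static_cast<EGLClientBuffer>(buffer->getNativeBuffer()), attrs);

  glBindTexture(GL_TEXTURE_2D, texture);
  // In real code glEGLImageTargetTexture2DOES is fetched via eglGetProcAddress.
  glEGLImageTargetTexture2DOES(GL_TEXTURE_2D,
                               static_cast<GLeglImageOES>(image));

  // Issue 2 above: on some GPUs this lock fails or stalls while the texture
  // is still bound. Issue 3: on some devices lock/unlock itself takes 5 ms+.
  void* mapped = nullptr;
  buffer->lock(android::GraphicBuffer::USAGE_SW_WRITE_OFTEN, &mapped);
  std::memcpy(mapped, pixels, bytes);
  buffer->unlock();
}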

Bo Liu

Jul 10, 2014, 11:52:22 AM
to Roger Yi, graphics-dev
On Wed, Jul 9, 2014 at 11:32 PM, Roger Yi <roge...@gmail.com> wrote:
Hi, 

I noticed that the Chrome Android WebView uses Android GraphicBuffer as the backing store for GL textures. The GraphicBuffer is allocated by the Android ION memory allocator, which is a HAL module implemented by the SoC vendor. I have tried this before, but found some compatibility issues:

1. Some devices limit the number of GraphicBuffers that can be allocated (Galaxy Nexus).

The only known limit I'm aware of is file descriptors. Each buffer uses 1 or 2 fds (depending on the vendor). This can be mitigated somewhat in webview by not allocating so many buffers.
 
2. Some devices cannot lock/unlock a GraphicBuffer for writing while it is bound to a texture (Tegra 2/3/4, Adreno 2xx, ...), which means the texture has to be unbound after the GL call, but the bind/unbind operations may be slow.

We had a workaround for nvidia: https://codereview.chromium.org/66033009
 
3. On some devices the GraphicBuffer lock/unlock operations are very slow, perhaps 5 ms+ (Nexus 10).

Can't do much about that...
 

Does that mean SoC vendors need to provide a new ION driver to solve the above issues for Android 4.4?

Yes.

Roger Yi

Sep 13, 2014, 6:02:54 AM
to graphi...@chromium.org, roge...@gmail.com, bo...@chromium.org
[Android WebView] Support async upload idle


It looks like trunk has turned off the use of GraphicBuffer for zero-copy upload by default? May I ask why?


On Thursday, July 10, 2014 at 11:52:22 PM UTC+8, Bo Liu wrote:

Bo Liu

Sep 13, 2014, 11:04:08 AM
to Roger Yi, graphics-dev
Unify code paths with chrome and get more consistent perf characteristics across devices.

Dana Jansens

Sep 13, 2014, 11:08:40 AM
to Bo Liu, Roger Yi, graphics-dev
On Sat, Sep 13, 2014 at 11:03 AM, Bo Liu <bo...@chromium.org> wrote:
Unify code paths with chrome and get more consistent perf characteristics across devices.

Hm, it's surprising to me that you'd prefer that over zero-copy by default. Is this temporary because it's not well tested yet, or are there known issues?

On desktop we're planning to switch very soon from idle uploads to the "one copy" rasterizer, which uses the zero-copy framework. And AFAIK we'd like to experiment with doing this on Android as well and deprecating the idle upload (PixelBufferRasterWorkerPool) code.

Zero copy should definitely perform better than idle uploads, modulo bugs, so I would have expected that to be preferable.

Bo Liu

Sep 13, 2014, 11:40:22 AM
to Dana Jansens, Roger Yi, graphics-dev
Known issues.

The file descriptor issue is one. It's always an exercise in frustration trying to balance performance against not going over the limit and outright crashing when there are n webviews drawing.

Also, GraphicBuffer is slow on some devices.

My goal for webview graphics right now is to be as similar to chrome on android as possible. If chrome is switching to a new uploader, then it should in theory just work in webview.

willy yu

Sep 13, 2014, 12:39:17 PM
to bo...@chromium.org, Dana Jansens, Roger Yi, graphics-dev
Hi Bo,

Do you mean that the webview will use idle upload (PixelBufferRasterWorkerPool) in the future?

We actually found that locking/unlocking the graphic buffer is slow on some vendors.
By definition, the graphic buffer can be locked/unlocked on a non-main thread.
Maybe we could operate on the graphics buffer on the raster thread.




Bo Liu

Sep 13, 2014, 1:05:52 PM
to willy yu, Dana Jansens, Roger Yi, graphics-dev
On Sat, Sep 13, 2014 at 9:39 AM, willy yu <jiaw...@gmail.com> wrote:
Hi Bo,

Do you mean that the webview will use idle upload (PixelBufferRasterWorkerPool) in the future?

It's already the case since crrev.com/289252

David Reveman

Sep 14, 2014, 7:30:54 PM
to bo...@chromium.org, willy yu, Dana Jansens, Roger Yi, graphics-dev
On Sat, Sep 13, 2014 at 1:05 PM, Bo Liu <bo...@chromium.org> wrote:


On Sat, Sep 13, 2014 at 9:39 AM, willy yu <jiaw...@gmail.com> wrote:
Hi Bo,

Do you mean that the webview will use idle upload (PixelBufferRasterWorkerPool) in the future?

It's already the case since crrev.com/289252

This is alright for now but we should try moving to the 1-copy rasterizer once https://codereview.chromium.org/562833004/ lands. 1-copy rasterizer can use gralloc or shared memory. We can use gralloc when more efficient and we don't have to worry about file descriptor limits in this case as gralloc buffers are only used as temporary staging buffers and we can control how many are created.

I'm hoping that we can remove PixelBufferRasterWorkerPool and async uploads soon.
 
 

We actually found that locking/unlocking the graphic buffer is slow on some vendors.
By definition, the graphic buffer can be locked/unlocked on a non-main thread.
Maybe we could operate on the graphics buffer on the raster thread.

Yes, I've been planning to allow this, and with some recent refactoring we're almost able to do it (we now create each SkCanvas and perform all format conversions on the raster thread), but we still need to move the mapping of buffers to the raster threads. I think the best approach is to expose the gfx::GpuMemoryBuffer type to the compositor and allow these buffers to be created and mapped outside the GLES2 interface. This should allow us to pay the cost of both allocating (sync IPC) and mapping/unmapping these buffers on the raster threads.
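To make the intent concrete, here is a simplified sketch of what rasterizing into a mapped buffer on a raster worker could look like (purely illustrative: the RasterBuffer interface and function name below are made up, not the actual cc/ API):

#include <cstddef>
#include "third_party/skia/include/core/SkBitmap.h"
#include "third_party/skia/include/core/SkCanvas.h"
#include "third_party/skia/include/core/SkImageInfo.h"
#include "third_party/skia/include/core/SkPicture.h"

// Hypothetical wrapper; Map/Unmap would be gralloc lock/unlock or a shmem
// mapping, depending on the backing.
class RasterBuffer {
 public:
  virtual ~RasterBuffer() {}
  virtual void* Map() = 0;
  virtual void Unmap() = 0;
  virtual size_t stride() const = 0;  // bytes per row
};

// Runs on a raster worker thread, so the (possibly slow) map/unmap cost is
// paid off the compositor's critical path.
void RasterTileOnWorkerThread(RasterBuffer* buffer, const SkPicture* picture,
                              int width, int height) {
  void* memory = buffer->Map();
  SkBitmap bitmap;
  bitmap.installPixels(SkImageInfo::MakeN32Premul(width, height), memory,
                       buffer->stride());
  SkCanvas canvas(bitmap);
  canvas.drawPicture(picture);  // software rasterization into the buffer
  buffer->Unmap();
}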

David

Kimmo Kinnunen

Sep 15, 2014, 3:28:45 AM
to David Reveman, bo...@chromium.org, willy yu, Dana Jansens, Roger Yi, graphics-dev
On 15.09.2014 02:30, David Reveman wrote:
> This is alright for now but we should try moving to the 1-copy
> rasterizer once https://codereview.chromium.org/562833004/ lands. 1-copy
> rasterizer can use gralloc or shared memory. We can use gralloc when
> more efficient and we don't have to worry about file descriptor limits
> in this case as gralloc buffers are only used as temporary staging
> buffers and we can control how many are created.
>
> I'm hoping that we can remove PixelBufferRasterWorkerPool and async
> uploads soon.
>
>
> We actually found that the lock/unlock graphic buffer is slow on
> some vendors.
> By definition that the graphic buffer can lock/unlock on
> non-main thread.
> Maybe we can operate the graphics buffer on the raster thread.

David,
Would it be possible to elaborate a bit on the Chrome plans wrt this
(and WebView, if it differs)? What does the one copy rasterizer
concretely mean?

What is the vision or goal for the mechanism to transfer the sw
rasterized bitmaps to the compositor textures?

Would it be preferable to have only one codepath, or do you see it as
acceptable to have a few different but "first class" code paths, maybe one
for gralloc and one for texture upload?

Or is the vision that all HW should use gralloc, and if that's suboptimal
for compositing, so be it?

The reason I'm asking is that on platforms that benefit from texture
swizzling, the gralloc API is a bit counter-intuitive. What appears to
be called "zero copy" will end up being rather many copies. These copies
are either copy operations or copied data.

The limiting factor of the gralloc API, as far as I understand, is that
the contract is that once you lock the buffer, the buffer needs to have
the expected bits in it, e.g. the texture data that was in the texture.
This is an expensive operation for swizzled textures. From the point of
view of the rasterization, doing this work serves no purpose, because the
data will be overwritten. For these platforms, gralloc will be quite
inferior to texture upload.

Locking "write only" might be a solution, a hint to the implementation
that it may provide zeroed data in the buffer instead of a readback or a
cached copy. As far as I can guess from the API definition, this is not
the semantics of the flag, though? I read the API as saying that if one
wants to read the bits (as in rasterizer blending), one must have R+W flags.
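For concreteness, the lock call and usage flags I am referring to are the gralloc v0 module API (sketch only, from <hardware/gralloc.h>):

#include <hardware/gralloc.h>

// A rasterizer that blends against existing pixels has to ask for read+write;
// whether a write-only lock may legitimately hand back undefined contents is
// exactly the open question above.
void LockForRaster(const gralloc_module_t* module, buffer_handle_t handle,
                   int width, int height) {
  void* vaddr = nullptr;
  module->lock(module, handle,
               GRALLOC_USAGE_SW_READ_OFTEN | GRALLOC_USAGE_SW_WRITE_OFTEN,
               0 /* left */, 0 /* top */, width, height, &vaddr);
  // ... blend / draw into vaddr ...
  module->unlock(module, handle);
}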

On platforms that benefit from swizzling, it is quite hard to get a more
optimal texture update than texture upload. This is code that runs
quite many times, and thus is expected to get a fair deal of attention.
The code structure that threads and GL contexts force is of course
cumbersome, compared to the more relaxed gralloc... (barring a SW
rasterizer that swizzles as part of the rasterization process, at least)

Br,
Kimmo


Bo Liu

Sep 15, 2014, 12:30:42 PM
to Kimmo Kinnunen, David Reveman, willy yu, Dana Jansens, Roger Yi, graphics-dev
On Mon, Sep 15, 2014 at 12:28 AM, Kimmo Kinnunen <kkin...@nvidia.com> wrote:
On 15.09.2014 02:30, David Reveman wrote:
This is alright for now but we should try moving to the 1-copy
rasterizer once https://codereview.chromium.org/562833004/ lands. 1-copy
rasterizer can use gralloc or shared memory. We can use gralloc when
more efficient and we don't have to worry about file descriptor limits
in this case as gralloc buffers are only used as temporary staging
buffers and we can control how many are created.

I'm hoping that we can remove PixelBufferRasterWorkerPool and async
uploads soon.


        We actually found that the lock/unlock graphic buffer is slow on
        some vendors.
        By definition that the graphic buffer can lock/unlock on
        non-main thread.
        Maybe we can operate the graphics buffer on the raster thread.

David,
Would it be possible to elaborate a bit on the Chrome plans wrt this (and WebView, if it differs)? What does the one copy rasterizer concretely mean?

What is the vision or goal for the mechanism to transfer the sw rasterized bitmaps to the compositor textures?

Would it be preferable to have only one codepath, or do you see acceptable to have few different but "first class" code paths, maybe one for gralloc and one for texture upload code path?

There are already n paths.
 

Or is the vision that all HW should use is gralloc if that's suboptimal for compositing, so be it?

Gralloc is a private API on android, so chrome can't use it (without hacking around the ndk)

For webview, I'd like to reduce differences from chrome. What's fast enough for chrome should be fast enough for webview as well. So I'd push for not using gralloc.


The reason I'm asking is because on platforms that benefit of texture swizzling, the gralloc api is a bit counter-intuitive. What appears to be called "zero copy" will end up being rather many copies. These copies are either copy operations or copied data.

The limiting factor of the gralloc API, as far as I understand, is that the contract is that once you lock the buffer, the buffer needs to have the expected bits in, eg. the texture data that was in the texture. This is expensive operation for swizzled textures. From the point of view of the rasterization, doing this work serves no purpose, because the data will be overwritten. For these platforms, gralloc will be quite inferior to texture upload.

Locking "write only" might be a solution, a tip to the implementation that one might provide zeroed data in the buffer instead of readback or cached copy. As far as I can guess form the API definition, this is not the semantics of the flag, though? I read the API so that if one wants to read the bits (as in rasterizer blending), one must have R+W flags.

Yep
 

Platforms that benefit form swizzling, it is quite hard to get more optimal texture update than texture upload. This is the code that runs quite many times, and thus is expected to get a fair deal of attention. The code structure that threads and gl contexts force is of course cumbersome, compared to more relaxed gralloc.. (Barring sw rasterizer that swizzles as part of the rasterization process, at least)

This is some very enlightening information from the driver side. Thanks :)

Eric Penner

Sep 15, 2014, 7:56:23 PM
to bo...@chromium.org, Kimmo Kinnunen, David Reveman, willy yu, Dana Jansens, Roger Yi, graphics-dev
I figure at this point we should just keep the best performing solution we have for each platform, while preparing for Ganesh that has different performance characteristics.

My concern is that the zero-copy framework doesn't provide any performance benefit on Android, since the actual zero-copy implementations ended up being slower (Gralloc was often faster but most gralloc operations remained on the CC thread).

If we really want to disable async, I recommend evaluating the N10 and other high-res devices first. The one-copy fallback is functionally equivalent to glMapTexImage, and we always had throttling of some form when we used to use that a few years ago.


Would it be preferable to have only one codepath, or do you see acceptable to have few different but "first class" code paths, maybe one for gralloc and one for texture upload code path?

There are already n paths.

Kimmo, what would be the recommended technique on NVidia? What I think would still warrant its own code-path is if NVidia can provide an extension that allows for persistently mapping a PBO in another process. Otherwise, it's still great to know what works best on NVidia, but like Bo says we have a lot of code paths.

Roger Yi

Sep 15, 2014, 10:43:02 PM
to graphi...@chromium.org, bo...@chromium.org, kkin...@nvidia.com, rev...@chromium.org, jiaw...@gmail.com, dan...@chromium.org, roge...@gmail.com, epe...@google.com
I am a little confused about how many upload paths there are in chromium and what their differences are...

I list what I know below:

1. Zero-copy upload: use a buffer that can be shared between the CPU and GPU as the tile's buffer, such as GraphicBuffer on Android.
2. One-copy upload: use a buffer that can be shared between the CPU and GPU, but only as a temporary staging buffer; it needs to be copied into the tile's buffer once (by the GPU?).
3. Async upload: use a normal bitmap and glTexImage2D to upload on the CPU when the compositor thread is idle.

Am I right? And does chromium have more upload paths not listed above?

---

And BTW,

Even if the lock/unlock of the GraphicBuffer can be put on the raster thread (and I have actually tried this before), on some devices the lock/unlock is extremely slow, which makes each tile's rasterization time exceed 10 ms; when you scroll the page fast enough, the screen will be empty for a long period...

On Tuesday, September 16, 2014 at 7:56:23 AM UTC+8, Eric Penner wrote:

Bo Liu

Sep 16, 2014, 2:12:48 AM
to Roger Yi, graphics-dev, kkin...@nvidia.com, rev...@chromium.org, willy yu, Dana Jansens, epe...@google.com
On Mon, Sep 15, 2014 at 7:43 PM, Roger Yi <roge...@gmail.com> wrote:
I am a little confused about how many upload paths there are in chromium and what their differences are...

I list what I know below:

I'm not really the authority on this area, but here goes...
 

1. Zero-copy upload: use a buffer that can be shared between the CPU and GPU as the tile's buffer, such as GraphicBuffer on Android.
2. One-copy upload: use a buffer that can be shared between the CPU and GPU, but only as a temporary staging buffer; it needs to be copied into the tile's buffer once (by the GPU?).

As David said, GpuMemoryBuffer can be backed by gralloc on Android, or by shared memory. I think this applies to both 1 and 2.
 
3. Async upload: use a normal bitmap and glTexImage2D to upload on the CPU when the compositor thread is idle.

For async upload, there's threaded upload (AsyncPixelTransferManagerEGL) and idle upload (AsyncPixelTransferManagerIdle)
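(For anyone unfamiliar with the idle variant, the rough idea is the sketch below, not the actual AsyncPixelTransferManagerIdle code: SW-rasterized bitmaps are queued and one glTexSubImage2D is issued whenever there is idle time between frames.)

#include <cstdint>
#include <deque>
#include <utility>
#include <vector>
#include <GLES2/gl2.h>

struct PendingUpload {
  GLuint texture;
  int width;
  int height;
  std::vector<uint8_t> pixels;  // SW-rasterized RGBA bitmap
};

static std::deque<PendingUpload> g_pending_uploads;

// Called when the compositor has idle time between frames.
void ProcessOneIdleUpload() {
  if (g_pending_uploads.empty())
    return;
  PendingUpload upload = std::move(g_pending_uploads.front());
  g_pending_uploads.pop_front();
  glBindTexture(GL_TEXTURE_2D, upload.texture);
  glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, upload.width, upload.height,
                  GL_RGBA, GL_UNSIGNED_BYTE, upload.pixels.data());
}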

willy yu

Sep 16, 2014, 5:17:57 AM
to bo...@chromium.org, Roger Yi, graphics-dev, kkin...@nvidia.com, rev...@chromium.org, Dana Jansens, epe...@google.com
On Tue, Sep 16, 2014 at 2:12 PM, Bo Liu <bo...@chromium.org> wrote:


On Mon, Sep 15, 2014 at 7:43 PM, Roger Yi <roge...@gmail.com> wrote:
I am a little confuse that how many upload paths in chromium and their differences... 

I list what I know in below:

I'm not really the authority on this area, but here goes...
 

1, zero-copy upload, use buffer can shared between CPU/GPU as tile's buffer such as GraphicBuffer on Android
2, one-copy upload, use buffer can shared between CPU/GPU but just temporary, will need to copy to the buffer of tile once (by GPU?)

As David said, GpuMemoryBuffer can be backed by gralloc on Android, or by shared memory. I think this applies to both 1 and 2.
 
3, async upload, use normal bitmap, use glTexImage2D to upload by CPU when compositor thread is idle

For async upload, there's threaded upload (AsyncPixelTransferManagerEGL) and idle upload (AsyncPixelTransferManagerIdle)
 

Am I right? and chromium have more upload paths not listed above? 

---

and BTW,

Even can put lock/unlock of GraphicBuffer in raster thread and actually I have try this before, but in some devices the lock/unlock is extreme slow, whick make each tile's rasterization time over 10ms, when you scroll the page fast enough, the screen will be empty for a long period...
 
 
Yes, it is indeed slow on some devices.
But, given the current Android WebView design, it is a trade-off.
Moving to the raster thread can improve smoothness, especially on high-resolution devices.
Hi Bo,
Is there any document or plan about this? Which path is best for performance?
 
Thanks a lot

Kimmo Kinnunen

Sep 16, 2014, 9:06:57 AM
to Eric Penner, bo...@chromium.org, David Reveman, willy yu, Dana Jansens, Roger Yi, graphics-dev
On 16.09.2014 02:56, Eric Penner wrote:
> I figure at this point we should just keep the best performing
> solution we have for each platform, while preparing for Ganesh that
> has different performance characteristics.

Ok, good to know. Thanks.

> Kimmo, what would be the recommended technique on NVidia?

(Talking from the mobile perspective, as this was related to the
Android gralloc work)

The old devices would benefit a bit from the gralloc code-path.
However, with the new HW such as the GPU in the "K1", the current
thinking is that normal texture upload with glTexImage2D would be the
fastest and probably also the most "asynchronous" way to update the
pixels. There's some memcpying done to achieve asynchronous upload, but
if I understand correctly, that should be quite fast compared to
cross-thread synchronisation. Whether or not uploading in an aux thread
vs in the main compositor helps prevent janks is still not entirely
clear to me at least, so I'd need to experiment with
AsyncPixelTransferManagerEGL. Before that, I can't give any good
recommendation, apart from the suggestion that preserving a non-gralloc
code-path would be great :)

Our current generation mobile hardware probably wouldn't benefit from
switching to PBOs, due to the HW not having full cache coherency.

> What I think would still warrant it's own code-path is if NVidia can
> provide an extension that allows for persistently mapping a PBO in
> another process.

How would that work with sandboxing / command buffer? Would references
to particular PBOs be sent cross-process as file descriptors and then
maybe mmapped in the raster process? I guess that could work, though I'm
no driver expert or spec writer. Sounds a bit tricky to specify..
Probably at this point, it's not worth the complication, since you
optimize away only a memcpy, and there's the cache issue..

Eric Penner

Sep 16, 2014, 2:54:03 PM
to Kimmo Kinnunen, bo...@chromium.org, David Reveman, willy yu, Dana Jansens, Roger Yi, graphics-dev
On Tue, Sep 16, 2014 at 6:06 AM, Kimmo Kinnunen <kkin...@nvidia.com> wrote:
On 16.09.2014 02:56, Eric Penner wrote:
I figure at this point we should just keep the best performing
solution we have for each platform, while preparing for Ganesh that
has different performance characteristics.

Ok, good to know. Thanks.

Kimmo, what would be the recommended technique on NVidia?

(Talking a from the mobile perspective, as this was related to the
android gralloc work)

The old devices would benefit a bit from the gralloc path code-path.
However, with the new hw such as the gpu with the "K1", current
thinking is that normal texture upload with glTexImage2D would be the fastest and probably also the most "asynchronous" way to update the pixels. There's some memcpying done to achieve asynchronous upload, but if I understand correctly, that should be quite fast compared to cross-thread synchronisation. Whether or not uploading in an aux thread vs in the main compositor helps prevent janks is still not entirely clear for me at least, so I'd need to try to experiment with AsyncPixelTransferManagerEGL. Before that, I can't give any good recommendation, apart from suggestion that preserving a non-gralloc code-path would be great :)

Our current generation mobile stuff wouldn't probably benefit of switching to PBOs, due to the hw not having full cache coherency.


What I think would still warrant it's own code-path is if NVidia can
provide an extension that allows for persistently mapping a PBO in
another process.

How would that work with sandboxing / command buffer? Would references to particular PBOs be sent cross-process as file descriptors and then maybe mmapped in the raster process? I guess that could work, though I'm no driver expert or spec writer. Sounds a bit tricky to specify.. Probably at this point, it's not worth the complication, since you optimize away only a memcpy, and there's the cache issue..


Does non-cache coherency imply you do a full CPU copy out of the PBO synchronously on the CPU? If so then indeed nothing is going to beat texSubImage.

Ideally an extension would let us map one large persistently mapped PBO in another process, using mmap to map it and standard fences and PBO operations on the GL side. If mmap doesn't work, next best would be a standardized ioctl on a file descriptor to do the same, but not sure how Android would feel about that. But yeah, that assumes PBOs provide some kind of benefit.

David Reveman

Sep 16, 2014, 3:56:55 PM
to bo...@chromium.org, Kimmo Kinnunen, willy yu, Dana Jansens, Roger Yi, graphics-dev
On Mon, Sep 15, 2014 at 12:29 PM, Bo Liu <bo...@chromium.org> wrote:


On Mon, Sep 15, 2014 at 12:28 AM, Kimmo Kinnunen <kkin...@nvidia.com> wrote:
On 15.09.2014 02:30, David Reveman wrote:
This is alright for now but we should try moving to the 1-copy
rasterizer once https://codereview.chromium.org/562833004/ lands. 1-copy
rasterizer can use gralloc or shared memory. We can use gralloc when
more efficient and we don't have to worry about file descriptor limits
in this case as gralloc buffers are only used as temporary staging
buffers and we can control how many are created.

I'm hoping that we can remove PixelBufferRasterWorkerPool and async
uploads soon.


        We actually found that the lock/unlock graphic buffer is slow on
        some vendors.
        By definition that the graphic buffer can lock/unlock on
        non-main thread.
        Maybe we can operate the graphics buffer on the raster thread.

David,
Would it be possible to elaborate a bit on the Chrome plans wrt this (and WebView, if it differs)? What does the one copy rasterizer concretely mean?

I created a document that describes the different mechanisms for updating textures in chromium that I'm hoping will be sufficient: https://docs.google.com/a/chromium.org/document/d/1J4lpHqVw9CmIiM3BeVCRT-SIzDcy-EWfUGBjP0yR_S8/edit?usp=sharing

It's pretty high level and it doesn't yet describe how 1-copy can be used without gralloc/SurfaceTexture support. But as Eric mentioned, 1-copy updates without a native GpuMemoryBuffer implementation are similar to using our old CHROMIUM_map_sub extension. There's one important difference: 1-copy updates allow us to control exactly how much memory is used. The memory management is done behind the GLES interface in the map_sub case, while the compositor does the memory management in the 1-copy case. We can increase or decrease the usage as needed, take the current GPU memory limit into account, and even wait for previous updates to complete without blocking if necessary.
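As a rough illustration, the copy step in a 1-copy update can be done with plain GLES2 calls like this (sketch only; the actual implementation differs in detail): the staging texture that the CPU filled through the mapped buffer is copied into the tile texture on the GPU.

#include <GLES2/gl2.h>

// Sketch: copy from a CPU-filled staging texture into the tile texture.
// Assumes tile_texture already has storage allocated.
void CopyStagingIntoTile(GLuint staging_texture, GLuint tile_texture,
                         int width, int height) {
  GLuint fbo = 0;
  glGenFramebuffers(1, &fbo);
  glBindFramebuffer(GL_FRAMEBUFFER, fbo);
  glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D,
                         staging_texture, 0);

  glBindTexture(GL_TEXTURE_2D, tile_texture);
  glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, width, height);

  glBindFramebuffer(GL_FRAMEBUFFER, 0);
  glDeleteFramebuffers(1, &fbo);
}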
 

What is the vision or goal for the mechanism to transfer the sw rasterized bitmaps to the compositor textures?

Would it be preferable to have only one codepath, or do you see acceptable to have few different but "first class" code paths, maybe one for gralloc and one for texture upload code path?

The number of codepaths is effectively reduced to 1 (not counting standard glTexImage2D) in the document mentioned above, as 0-copy and 1-copy are just two slightly different ways of using the GpuMemoryBuffer framework. 90% of the code is the same.

Async uploads, and the complicated logic required on the compositor side to handle these types of texture updates, have been making tile-management-related changes harder to do on the compositor side. They've been a source of bugs and have significantly affected the time it's taken to add proper Ganesh support, imo. I think we can improve the velocity at which we can make changes to this part of the compositor if async uploads were removed.
 

There are already n paths.
 

Or is the vision that all HW should use is gralloc if that's suboptimal for compositing, so be it?

Gralloc is a private API on android, so chrome can't use it (without hacking around the ndk)

For webview, I'd like to reduce differences from chrome. What's fast enough for chrome should be fast enough for webview as well. So I'd push for not using gralloc.

Sgtm. I think we should try using 1-copy updates without gralloc for Webview once we've switched to that on desktop.

David

Roger Yi

Sep 16, 2014, 10:54:02 PM
to graphi...@chromium.org, bo...@chromium.org, kkin...@nvidia.com, jiaw...@gmail.com, dan...@chromium.org, roge...@gmail.com, rev...@google.com


On Wednesday, September 17, 2014 at 3:56:55 AM UTC+8, David Reveman wrote:



On Mon, Sep 15, 2014 at 12:29 PM, Bo Liu <bo...@chromium.org> wrote:


On Mon, Sep 15, 2014 at 12:28 AM, Kimmo Kinnunen <kkin...@nvidia.com> wrote:
On 15.09.2014 02:30, David Reveman wrote:
This is alright for now but we should try moving to the 1-copy
rasterizer once https://codereview.chromium.org/562833004/ lands. 1-copy
rasterizer can use gralloc or shared memory. We can use gralloc when
more efficient and we don't have to worry about file descriptor limits
in this case as gralloc buffers are only used as temporary staging
buffers and we can control how many are created.

I'm hoping that we can remove PixelBufferRasterWorkerPool and async
uploads soon.


        We actually found that the lock/unlock graphic buffer is slow on
        some vendors.
        By definition that the graphic buffer can lock/unlock on
        non-main thread.
        Maybe we can operate the graphics buffer on the raster thread.

David,
Would it be possible to elaborate a bit on the Chrome plans wrt this (and WebView, if it differs)? What does the one copy rasterizer concretely mean?

I created a document that describes the different mechanism for updated textures in chromium that I'm hoping will be sufficient:

Could I have access to the document? Thanks.

Roger Yi

Sep 17, 2014, 5:05:03 AM
to graphi...@chromium.org, bo...@chromium.org, kkin...@nvidia.com, jiaw...@gmail.com, dan...@chromium.org, roge...@gmail.com, rev...@google.com
Thanks for sharing, and I have a question below:

TexImage2D/TexSubImage2D

Standard OpenGL mechanism to initialize or update a texture. This will copy the provided data into the command buffer and perform a matching texture upload on the GPU process side. Essential for WebGL support and sufficient in many use cases. 


Does that mean that when we don't use GpuMemoryBuffer, we actually need to copy twice: first to put the data into the command buffer, and second when the command buffer is flushed?


On Wednesday, September 17, 2014 at 3:56:55 AM UTC+8, David Reveman wrote:

Kimmo Kinnunen

Sep 17, 2014, 7:00:55 AM
to David Reveman, bo...@chromium.org, willy yu, Dana Jansens, Roger Yi, graphics-dev
On 16.09.2014 22:56, 'David Reveman' via Graphics-dev wrote:
> On Mon, Sep 15, 2014 at 12:28 AM, Kimmo Kinnunen
> <kkin...@nvidia.com <mailto:kkin...@nvidia.com>> wrote:
>> David, Would it be possible to elaborate a bit on the Chrome plans
>> wrt this (and WebView, if it differs)? What does the one copy
>> rasterizer concretely mean?
>
>
> I created a document that describes the different mechanism for
> updated textures in chromium that I'm hoping will be sufficient:
> https://docs.google.com/a/chromium.org/document/d/1J4lpHqVw9CmIiM3BeVCRT-SIzDcy-EWfUGBjP0yR_S8/edit?usp=sharing
> It's pretty high level and it doesn't yet describe how 1-copy can be
> used without gralloc/SurfaceTexture support. But as Eric mentioned,
> 1-copy updates without a native GpuMemoryBuffer implementation is
> similar to using our old CHROMIUM_map_sub extension.

If I'm understanding correctly, the CHROMIUM_map_sub
MapTexSubImage2DCHROMIUM access flag "write only" tries to say that
the memory area returned by the function is readable, but the initial
contents are undefined. If this is preserved, I think GpuMemoryBuffer
could be implemented in terms of shared memory and glTexImage2D for
platforms that benefit from texture upload instead of direct access to
the data.
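(For reference, the map/unmap style I mean looks roughly like this; a sketch that assumes Chromium's gl2extchromium.h declarations for the CHROMIUM_map_sub entry points:)

#include <GLES2/gl2.h>
#include "gpu/GLES2/gl2extchromium.h"

void UpdateTileViaMapSub(int width, int height) {
  void* data = glMapTexSubImage2DCHROMIUM(
      GL_TEXTURE_2D, 0 /* level */, 0, 0, width, height,
      GL_RGBA, GL_UNSIGNED_BYTE, GL_WRITE_ONLY);
  // Rasterize into |data| on the CPU. Whether |data| starts out with the
  // texture's current pixels or with undefined contents is the "write only"
  // semantics question discussed above.
  glUnmapTexSubImage2DCHROMIUM(data);
}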

I'm looking at the doc and the GpuMemoryBuffer implementation. Both kind
of imply that this design is for hardware that can do CPU access to
GPU-accessible memory. The design doc says things
like "update textures without having to perform a texture upload". It's
worth noting that even if you have these direct mapping APIs (gralloc,
etc.) and they kind of work, it does not really mean that you necessarily
map the memory directly or that it's particularly efficient. Thus, if
the design explicitly *requires* actual direct access, it'd be nice to
know. E.g. the requirements section could be as specific as you can make it:
- unmap needs to be the equivalent of a no-op (unmap is planned to be
called from a realtime thread?)
- no extra copies are designed to be made (memory bw
optimization reason?)
- etc.

As it stands currently, the design doc is a bit scary for HW that does
not benefit from direct mapping (or does benefit from storing the textures
in GPU-friendly formats or memory). If support for direct mapping is not
an explicit goal, it'd be great to have a comment in
GpuMemoryBuffer::Map regarding the buffer contents (maybe another
variant of the call if the data is sometimes needed).

Kimmo Kinnunen

Sep 17, 2014, 7:10:14 AM
to Eric Penner, bo...@chromium.org, David Reveman, willy yu, Dana Jansens, Roger Yi, graphics-dev
On 16.09.2014 21:54, Eric Penner wrote:
> On Tue, Sep 16, 2014 at 6:06 AM, Kimmo Kinnunen <kkin...@nvidia.com
> <mailto:kkin...@nvidia.com>> wrote:
> Our current generation mobile stuff wouldn't probably benefit of
> switching to PBOs, due to the hw not having full cache coherency.
>
>
>
> What I think would still warrant it's own code-path is if NVidia can
> provide an extension that allows for persistently mapping a PBO in
> another process.
>
>
> How would that work with sandboxing / command buffer? Would
> references to particular PBOs be sent cross-process as file
> descriptors and then maybe mmapped in the raster process? I guess
> that could work, though I'm no driver expert or spec writer. Sounds
> a bit tricky to specify.. Probably at this point, it's not worth the
> complication, since you optimize away only a memcpy, and there's the
> cache issue..
>
>
> Does non-cache coherency imply you do a full CPU copy out of the PBO
> synchronously on the CPU? If so then indeed nothing is going to beat
> texSubImage.

In this case it means that CPU reads of GPU accessible memory are not
cached by CPU caches, rather fetches go directly to memory hardware.
Thus memcpying to a PBO would be somewhat fast (writes), but Skia
blending directly to a PBO would be slow (reads+writes).

David Reveman

Sep 17, 2014, 11:12:29 AM
to Roger Yi, graphics-dev, bo...@chromium.org, Kimmo Kinnunen, willy yu, Dana Jansens
On Wed, Sep 17, 2014 at 5:05 AM, Roger Yi <roge...@gmail.com> wrote:
Thanks for sharing, and I have a question below:

TexImage2D/TexSubImage2D

Standard OpenGL mechanism to initialize or update a texture. This will copy the provided data into the command buffer and perform a matching texture upload on the GPU process side. Essential for WebGL support and sufficient in many use cases. 


Does that mean that when we don't use GpuMemoryBuffer, we actually need to copy twice: first to put the data into the command buffer, and second when the command buffer is flushed?

Correct. Not the most efficient but good enough for a lot of use-cases and you don't have to worry about shared memory usage.

David Reveman

Sep 17, 2014, 11:13:59 AM
to Kimmo Kinnunen, bo...@chromium.org, willy yu, Dana Jansens, Roger Yi, graphics-dev
On Wed, Sep 17, 2014 at 7:00 AM, Kimmo Kinnunen <kkin...@nvidia.com> wrote:
On 16.09.2014 22:56, 'David Reveman' via Graphics-dev wrote:
On Mon, Sep 15, 2014 at 12:28 AM, Kimmo Kinnunen
<kkin...@nvidia.com <mailto:kkin...@nvidia.com>> wrote:
David, Would it be possible to elaborate a bit on the Chrome plans
wrt this (and WebView, if it differs)? What does the one copy
rasterizer concretely mean?


I created a document that describes the different mechanism for
updated textures in chromium that I'm hoping will be sufficient:
https://docs.google.com/a/chromium.org/document/d/1J4lpHqVw9CmIiM3BeVCRT-SIzDcy-EWfUGBjP0yR_S8/edit?usp=sharing
 It's pretty high level and it doesn't yet describe how 1-copy can be
used without gralloc/SurfaceTexture support. But as Eric mentioned,
1-copy updates without a native GpuMemoryBuffer implementation is
similar to using our old CHROMIUM_map_sub extension.

If I'm understanding correctly, CHROMIUM_map_sub
MapTexSubImage2DCHROMIUM call access flag "write only" tries to say that
the memory area returned by the function is readable, but the initial
contents is undefined. If this is preserved, I think GpuMemoryBuffer
could be implemented in terms of shared memory and glTexImage2D for platforms that benefit of texture upload instead of direct access to the data.

That's exactly what we're doing. We use a standard shared memory implementation of GpuMemoryBuffers if no driver specific support is available. It's not yet described in the doc but I'll try to add some more details about this asap.
 

I'm looking at the doc and the GpuMemoryBuffer implementation. Both kind
of give the implication that this design is for hardware that can do cpu
access to gpu accessible memory. The design doc says things
like "update textures without having to perform a texture upload". It's
worth noting that even if you have these direct mapping APIs (gralloc,
etc) and they kind of work, it does not really mean that you necessarily
map the memory directly or that it's particularly efficient. Thus, if
the design explicitly *requires* actual direct access, it'd be nice to
know. Eg. the requirements -section could be specific as you can make it:
- unmap needs to be equivalent of a no-op (unmap is planned to be
called from a realtime thread ?)
 - no extra copies are designed to be made (memory bw
optimization reason?)
 - etc

Good questions. I'll add and try to answer them in more detail in the doc asap. Here are some quick answers.

- map and unmap don't have to be fast as we'll likely do this on a worker thread in the renderer and not on the critical path. if they are fast, that's good of course. however, if all the driver did was move the cost of the upload to unmap that would still be a major improvement.

- binding a gpu-memory-buffer to a texture needs to be efficient as this is done on the critical path. if an implementation is doing a texture upload when being bound to a texture, then we're better off using our own shared memory based gpu-memory-buffer implementation.
 

As it stands currently, the design doc is a bit scary for HW that does
not benefit of direct mapping (or do benefit from storing the textures
in gpu-friendly formats or memory). If support for direct mapping is not an explicit goal, it'd be great to have a comment in GpuMemoryBuffer::Map regarding the buffer contents (maybe another variant of the call if the data is sometimes needed).

Good idea. I'll add something about this to the description of the GpuMemoryBuffer interface.

Eric Penner

Sep 17, 2014, 5:41:39 PM
to David Reveman, Kimmo Kinnunen, bo...@chromium.org, willy yu, Dana Jansens, Roger Yi, graphics-dev
My remaining concern is that all of these things exist to solve a problem, and the result of removing async would be that Android has no solution that actually solves any known problem on Android. The only problem ever caused by uploads that I'm aware of is their impact on critical threads (CC / GPU) when trying to produce frames. All prior solutions to that involved letting frames get produced in the middle of uploading, which always adds some level of complexity. Do you think that's no longer an issue going forward? I could see it being worth a try with Ganesh, but I didn't think we wanted to go that route for software-raster paths.

Roger Yi

Sep 18, 2014, 3:14:24 AM
to graphi...@chromium.org, kkin...@nvidia.com, bo...@chromium.org, jiaw...@gmail.com, dan...@chromium.org, roge...@gmail.com, rev...@google.com
From the docs and discussions above, can I assume that the most efficient and stable way to upload textures for tiles on Android is:

Use GpuMemoryBuffer as the staging buffer with the 1-copy mechanism, and

1. Use gralloc as the backing when the map/unmap (lock/unlock) operations are fast, and use glCopyTexSubImage2D to do the GPU upload;
2. Use ashmem as the backing when gralloc is slow or not stable, and use a normal glTexImage2D to do the CPU upload;

BTW, as far as I know, the 'binding a gpu-memory-buffer to a texture' operation is only needed once, because the buffer does not need to be unbound before it is mapped again, except on Tegra.

Please correct me if I am wrong, thanks.

On Wednesday, September 17, 2014 at 11:13:59 PM UTC+8, David Reveman wrote:

David Reveman

Sep 18, 2014, 8:40:53 AM
to Eric Penner, Kimmo Kinnunen, bo...@chromium.org, willy yu, Dana Jansens, Roger Yi, graphics-dev
On desktop, simply issuing uploads at the same rate that we're able to raster tiles works better than idle uploads and doesn't seem to affect the critical GPU thread in a more negative way than idle uploads. I expect that to also be true on Android in cases where we currently use the idle upload mechanism. I don't know yet how this mechanism compares to async uploads on a separate thread. It might be worse in some cases. We'll have to give it a try and find out. I'm hoping it's good enough.

David

David Reveman

Sep 18, 2014, 8:53:09 AM
to Roger Yi, graphics-dev, Kimmo Kinnunen, bo...@chromium.org, willy yu, Dana Jansens
On Thu, Sep 18, 2014 at 3:14 AM, Roger Yi <roge...@gmail.com> wrote:
From the docs and discussions above, can I assume that the most efficient and stable way to upload textures for tiles on Android is:

Use GpuMemoryBuffer as the staging buffer with the 1-copy mechanism, and

1. Use gralloc as the backing when the map/unmap (lock/unlock) operations are fast, and use glCopyTexSubImage2D to do the GPU upload;
2. Use ashmem as the backing when gralloc is slow or not stable, and use a normal glTexImage2D to do the CPU upload;

(Ignoring whether using the private gralloc API is a good idea or not) Yes, moving forward I'm hoping that the above will be true. Whether 2 is better than async uploads on a different thread still needs to be evaluated. 

Keep in mind that a few more patches need to land before the above is true. Allocate/map/unmap of GpuMemoryBuffers needs to be moved to the worker threads and copy commands need to be issued at the rate of raster.

David