Re: Thoughts on OOP raster model


Adrienne Walker

Apr 20, 2017, 4:14:44 PM
to Victor Miura, Eric Karl, graphics-dev
+graphics-dev

2017-04-20 10:12 GMT-07:00 Victor Miura <vmi...@google.com>:
I've been thinking some more about the differences between the Canvas Command Buffer (streaming) model, like the one I prototyped, and a closer-to-Salamander (deferred) model.

To summarize what I think the shape of each of these looks like:

Streaming
 - As with the current Ganesh model, we use the R-tree to cull and play back display lists in the impl-side raster tasks.
 - The paint-ops that survive culling are serialized to the GPU side via a ring-buffer.
 - For paint-ops with images, paths, text-blobs, and other cache-able objects, we push those objects to the GPU side via a resource cache.
 - Caching happens in an LRU based on the paint order after culling, and display list objects that are culled are never serialized to the GPU (a rough sketch of this flow follows below).
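
To make that concrete, here's a minimal sketch of what an impl-side raster task could look like in the streaming model. Everything in it (CullWithRTree, PushCacheableObjects, RingBuffer) is a hypothetical placeholder for illustration, not an existing API:

#include <cstdint>
#include <vector>

// All types and functions below are hypothetical placeholders for the sketch.
struct PaintOp {};                                  // one recorded paint command
struct TileRect { int x, y, w, h; };
struct RingBuffer {
  void Push(const std::vector<uint8_t>& bytes);     // command-buffer transfer
  void Flush();
};

std::vector<const PaintOp*> CullWithRTree(const std::vector<PaintOp>& list,
                                          const TileRect& tile);  // R-tree query
void PushCacheableObjects(const PaintOp& op);  // images/paths/blobs -> GPU cache
std::vector<uint8_t> Serialize(const PaintOp& op);

// Impl-side raster task in the streaming model: cull first, then serialize
// only the surviving ops; culled ops never reach the GPU.
void RasterTileStreaming(const std::vector<PaintOp>& display_list,
                         const TileRect& tile, RingBuffer& ring) {
  for (const PaintOp* op : CullWithRTree(display_list, tile)) {
    PushCacheableObjects(*op);   // LRU resource cache, keyed in post-cull order
    ring.Push(Serialize(*op));   // stream the op itself through the ring buffer
  }
  ring.Flush();                  // GPU side plays the ops back as they arrive
}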

Deferred
 - We transfer display lists that intersect a tile and run playback for a tile in the GPU-service.
 - We need to do culling on the GPU side, so we either need to serialize the R-tree or build it up on the service side.
 - We would either
  a) have the paint-ops already in a mem-copyable buffer, along with references to other objects saved on the side, or
  b) iterate through all paint-ops and serialize them and their objects similar to the Streaming model, or
  c) some combination.
 - All cache-able objects in the display lists would be pushed to the GPU before drawing a tile.
 - All images needed for a tile would be pushed before drawing the tile.
 - For bad memory cases, we may need an Image Decode Service so the GPU side can pull images during playback. (A rough sketch of the per-tile flow follows.)
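
And for contrast, a minimal sketch of the per-tile playback on the GPU-service side in the deferred model; again, all names except SkCanvas are hypothetical placeholders:

#include <vector>

class SkCanvas;                            // Skia's canvas, forward-declared

// Hypothetical placeholders mirroring the streaming sketch above.
struct PaintOp {};
struct TileRect { int x, y, w, h; };
struct RTree {
  // Serialized from the renderer or rebuilt on the service side.
  std::vector<const PaintOp*> Query(const TileRect& tile) const;
};

void EnsureResourcesResident(const PaintOp& op);      // images/paths/blobs
void RasterOp(const PaintOp& op, SkCanvas* canvas);   // playback of a single op

// GPU-service raster of one tile: the whole display list was transferred
// earlier; culling happens here, against the service-side R-tree.
void RasterTileDeferred(const RTree& rtree, const TileRect& tile,
                        SkCanvas* canvas) {
  for (const PaintOp* op : rtree.Query(tile)) {
    EnsureResourcesResident(*op);   // pushed before the tile is drawn
    RasterOp(*op, canvas);
  }
}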

It seems to me like shipping with the Deferred option depends on completing most of the Salamander data model, and possibly DisplayList deltas and more incremental R-tree computation.

I still feel that we should implement the Streaming model first, while building towards the Deferred one.  Keep playback on the Impl-side, but do the streaming "serialization" and "resource cache" parts in a way that is mostly re-usable for the deferred case.  wdyt?

I think we should do the easiest thing and then evaluate, but I disagree about which that is.


Re: image decoding.  I do not think that there are any differences in image decoding needs between these two approaches.  If there are at-raster images in either case, the raster task can block.  Whether the raster task is sending a "draw display list 3 at this scale and rect" or the raster task is sending fine-grained ops, I'm not sure I see any difference between the two approaches with respect to which images are needed, predecoded, locked, or transported.

ericrk@ also mentioned that for image decodes for gpu raster, Skia already holds onto the lock for all images for the entirety of the raster task, and so there's no incremental locking that occurs to have a lower high water mark of memory during the raster task itself.  (I don't imagine custom display list transport occurring for software raster prior to Salamander.)


Re: performance.  You rightly bring up the difference here in that shipping the entire display list is good for the invalidate the world approach and less good for the blinking cursor case.  There are trade-offs here and this hasn't been measured.  I think in a long term sense, we want to have incremental updates to the display list, and I think a model where we're sending the entire display list (or sub display lists) fits that future approach a lot better than shipping something short term for the sake of shipping it.


Personally, I don't think that custom display list transport is going to ship any time soon.  There's a lot of work that's still left to be designed or is in flight.  Discardable gpu memory, transport caching, fonts are entirely unknown, there's no fuzzing, devirtualizing PaintFlags, cleaning up shaders/loopers, sorting out gpu scheduling and backpressure, etc.  I think this custom display list transport is something we should do and get done, but I'm concerned about doing it right.

I think the most important thing is to get something (anything) working behind a flag, so that we can start flushing out the unknowns of scheduling and fonts and caching and performance.  However, I don't think the set of things remaining to be designed are deeply affected by this decision of how to transport the custom display list.  I subjectively think sending the whole thing is less code to write and will get done faster.  In any case, we should measure, understand performance, and do that in light of how important it is to ship.

Eric Karl

Apr 20, 2017, 8:32:32 PM
to Adrienne Walker, Victor Miura, graphics-dev
I did some additional testing to make sure this behaved as I expected, and it turns out I was wrong on this point. It looks like Skia does allocate and delete textures rapidly when we're in at-raster mode, giving a pretty true at-raster experience. Sorry for the confusion here.

Victor Miura

Apr 21, 2017, 4:50:09 AM
to Adrienne Walker, Eric Karl, graphics-dev
On Thu, Apr 20, 2017 at 1:14 PM Adrienne Walker <en...@chromium.org> wrote:
I'll defer to ericrk@ to confirm, though that seems - bad?  I was under the impression that Skia would flush and wait for a fence when it went over budget.

EDIT: Sounds like ericrk@ confirmed that Skia has the expected behavior of not keeping images locked for the entire raster.

Re: performance.  You rightly bring up the difference here in that shipping the entire display list is good for the invalidate the world approach and less good for the blinking cursor case.  There are trade-offs here and this hasn't been measured.  I think in a long term sense, we want to have incremental updates to the display list, and I think a model where we're sending the entire display list (or sub display lists) fits that future approach a lot better than shipping something short term for the sake of shipping it. 

We do want to have incremental updates in the long term, but taking a dependency on that is concerning.  Plus the image memory concern above only applies in this intermediate state before MUS-Salamander.

I think some of the future things we want to do will make more sense / be easier to do together with the full Impl-side move to GPU process.

Personally, I don't think that custom display list transport is going to ship any time soon.  There's a lot of work that's still left to be designed or is in flight.  Discardable gpu memory, transport caching, fonts are entirely unknown, there's no fuzzing, devirtualizing PaintFlags, cleaning up shaders/loopers, sorting out gpu scheduling and backpressure, etc.  I think this custom display list transport is something we should do and get done, but I'm concerned about doing it right.
 
I think the most important thing is to get something (anything) working behind a flag, so that we can start flushing out the unknowns of scheduling and fonts and caching and performance.  However, I don't think the set of things remaining to be designed are deeply affected by this decision of how to transport the custom display list.  I subjectively think sending the whole thing is less code to write and will get done faster.  In any case, we should measure, understand performance, and do that in light of how important it is to ship.

I think I'm more bullish on the fact that the streaming model is a good step in its own right, which we should aim to ship this year, and I'm advocating that we be focused on that and not block on things like incremental display list updates.

Serializing the whole display list to a buffer may be quicker, I don't know.  I think updating the SkCanvas Command Buffer to PaintCanvas wouldn't take a lot of work.

Regarding shipping and the importance of shipping, it's all a balance of things.  I think there are good reasons to implement and ship this:

1) I think we agree that splitting a milestone from the MUS-Salamander uber-project that delivers earlier impact and validates many things is good.

2) We've shown success doing what this OOP-raster aims to do: reducing CPU time, running the Skia Vulkan back-end, and enabling Skia to use more GL features, e.g. geometry shaders.  I do want to enable this sooner rather than later.

3) The model has good behavior: reduction in transfer data, reduction in latency, and similar memory behavior to Ganesh today.  That's quite a good combo that makes me optimistic we can succeed.

4) As you also mentioned, many of the things remaining to be designed aren't deeply affected by this decision.

This may be large enough that we should do a Chrome design review.

dan...@chromium.org

Apr 21, 2017, 1:14:30 PM
to Victor Miura, Adrienne Walker, Eric Karl, graphics-dev
Are these things tied exclusively to either model? We could build a subset of the display list via rtree (for the set of raster tasks we're scheduling instead of culling for each individual task?) and ship that over IPC instead of sending the entire display list.

Istm that the argument for using a command buffer would be for pipelining if there's significant overhead on the client side to package up what it wants to IPC and we want the raster task to start before the client is done? Is this true?

Vladimir Levin

Apr 21, 2017, 2:54:40 PM
to Dana Jansens, Victor Miura, Adrienne Walker, Eric Karl, graphics-dev
I think both approaches would hit similar challenges when we start working out the details. The main difference I see is exactly in the transfer mechanism:

1. For the full display list transfer, we need to package that up and ship it to the gpu, and then use the raster source id as a reference in the raster tasks so that the gpu knows what we're talking about when we say "use this rect and this raster id".
2. For the raster task serialization (iiuc), the gpu doesn't know or care about the raster source id and is instead getting a stream of display list commands for each raster task, so there is no mapping that needs to happen and the gpu doesn't need to store any extra state.
3. For your proposal to transfer the display list chunk for things that we've scheduled, it would have issues similar to 1, in the sense that we need to package a thing and ship it over. The gpu would either have to have a merge algorithm to build up a single raster source, in which case we can still reference it using a raster source id, or this process would need to generate a different id identifying just the subset that we're sending, and raster tasks could reference that (there might be some overhead from sending overlapping chunks). A rough sketch of how 1 and 2 differ follows below.
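
To make the difference between 1 and 2 concrete, here's a hypothetical sketch of what the per-tile raster request might carry in each case; none of these message types exist today:

#include <cstdint>
#include <vector>

// Option 1: the display list was shipped ahead of time and registered under a
// raster source id, so the per-tile request only references it.
struct RasterTileByIdRequest {
  uint64_t raster_source_id;  // previously transferred display list
  int x, y, w, h;             // tile rect: "use this rect and this raster id"
  float scale;
};

// Option 2: the gpu keeps no mapping; each raster task streams the already
// culled ops it needs, so the request is essentially the serialized payload.
struct RasterTileStreamRequest {
  std::vector<uint8_t> serialized_ops;  // culled ops for this one tile
};

// Option 3 would look like option 1, but with an id naming just the subset of
// the display list that was shipped for the scheduled tiles.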

I'm weakly on the side of either 1 or 3, in that I believe that's going to be the way we're doing this in the future anyway (ie, we _are_ going to send stuff ahead of time that gpu will need to store, right?). 2 does have the nice property that we don't need to worry about that and just deal with other issues like image/font caches, which will come up for any of the approaches anyway.

Victor Miura

Apr 21, 2017, 2:56:28 PM
to dan...@chromium.org, Adrienne Walker, Eric Karl, graphics-dev
 
I think the peak memory behavior is exclusive to a "streaming" model.  The other aspects depend on how much we can optimize and incrementalize the display list updates.

We could build a subset of the display list via rtree (for the set of raster tasks we're scheduling instead of culling for each individual task?) and ship that over IPC instead of sending the entire display list.

Istm that the argument for using a command buffer would be for pipelining if there's significant overhead on the client side to package up what it wants to IPC and we want the raster task to start before the client is done? Is this true?

That's certainly one aspect.  I imagine sub-setting and transferring display lists for all tiles before we schedule any tiles would add more latency & overhead.  Would there be an advantage to doing that compared to doing that sub-setting in the raster task?

In the Salamander world the split is at LTHI, so I think we won't do an approach like analyzing tiles and subsetting the display list on the Renderer side.  I think for that future we'll need the display list deltas approach.

A nice thing about the command buffer approach is that the Compositor code looks identical to Ganesh today.  We just swapped what the display list plays into from the Ganesh SkCanvas to an SkCommandBufferCanvas with fairly few commands (~36 I think?), built on existing command buffer infrastructure, and it can be fuzzed by our GPU fuzzer.  It took a weekend, essentially.

dan...@chromium.org

Apr 21, 2017, 3:05:34 PM
to Victor Miura, Adrienne Walker, Eric Karl, graphics-dev
TBC I'm assuming that we're painting into shared memory and the transfer costs are therefore very small. There's no world where I want to serialize each op into another buffer. A possible advantage to doing this for the set of tiles is that if we were to send over ranges of commands to play back for raster (basically the culling subsets as ranges instead of copies of the subsets), many ops may overlap multiple tiles, so it might be less work to build up a single set of ranges for all tiles, at the expense of rastering more ops that would otherwise be culled per tile. My point is that this is a lever we can move around.
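
A hypothetical sketch of what "ranges of commands" over a shared-memory display list could look like (all names made up for illustration):

#include <cstddef>
#include <cstdint>
#include <vector>

// The renderer paints ops into shared memory and the GPU side maps the same
// buffer, so the request carries indices into it rather than copies of ops.
struct OpRange {
  size_t first_op;   // index of the first op to play back
  size_t op_count;   // number of consecutive ops in the range
};

struct RasterScheduledTilesRequest {
  uint64_t shared_display_list_id;  // shm buffer both processes have mapped
  // A single merged set of ranges covering all scheduled tiles: cheaper to
  // build than per-tile subsets, at the cost of rastering some ops that a
  // per-tile cull would have dropped. This is the lever mentioned above.
  std::vector<OpRange> ranges;
};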
 
In the Salamander world the split is at LTHI, so I think we won't do an approach like analyzing tiles and subsetting the display list on the Renderer side.  I think for that future we'll need the display list deltas approach.

A nice thing about the command buffer approach is that the Compositor code looks identical to Ganesh today.  We just swapped what the display list plays into from the Ganesh SkCanvas to an SkCommandBufferCanvas with fairly few commands (~36 I think?), built on existing command buffer infrastructure, and it can be fuzzed by our GPU fuzzer.  It took a weekend, essentially.

I'm not totally sure here, but the skia/ganesh code tends to depend on the raster and paint APIs being the same thing, which is something I feel strongly we should not reproduce here. We should make paint structures -> raster structures a one-way trip. IOW no Raster(PaintCanvas*), only Raster(SkCanvas*). I care more about this principle than about the approach taken for implementing transport. But I will note that the command buffer requires copying our paint ops by design, whereas sending shm-painted buffers over (with culling ranges if we want) does not.
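
A small sketch of the one-way shape I mean, assuming a hypothetical PaintOpBuffer that holds recorded ops; the only point is that raster consumes an SkCanvas and never hands paint structures back:

#include <vector>

class SkCanvas;   // Skia's raster-side canvas, forward-declared

// Hypothetical recording-side types for illustration.
struct PaintOp {
  void Raster(SkCanvas* canvas) const;   // paint -> raster, one way only
};

class PaintOpBuffer {
 public:
  void push(PaintOp op) { ops_.push_back(op); }

  // One-way trip: there is no Raster(PaintCanvas*) overload, so raster code
  // never feeds back into recording code.
  void Raster(SkCanvas* canvas) const {
    for (const PaintOp& op : ops_)
      op.Raster(canvas);
  }

 private:
  std::vector<PaintOp> ops_;
};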

Victor Miura

Apr 21, 2017, 3:12:47 PM
to dan...@chromium.org, Adrienne Walker, Eric Karl, graphics-dev
Could you dig more into why you think this is an important principle?

In the CDL prototype I maintained layering such that the PaintRecord only needed to know how to store and retrieve PaintOps.  You record PaintOps and get PaintOps back.  The conversion from PaintOps to SkCanvas or something else is in a separate layer.  I totally don't get why we made this asymmetrical.

dan...@chromium.org

Apr 21, 2017, 3:17:25 PM
to Victor Miura, Adrienne Walker, Eric Karl, graphics-dev
Because going back and forth makes for complex code and incorrect APIs.

Look at the methods we've been removing from PaintCanvas such as writePixels. Those don't make sense for recording, but they do for raster. The paint->raster->paint round trip has led us to do things like:
a) Build a display list
b) Raster it into an SkPicture
c) Add that SkPicture into another display list.

When really all you're looking to do is change some metadata about how the ops in the original list work. But a recording API that lets you acknowledge you're building a display list and perform rational operations on it doesn't make sense for a raster API, so right now it pretends that's all it is. If we just want an SkCanvas API we can allow going paint->raster->paint, but I want something better so we get more performant and more understandable code.

Victor Miura

Apr 23, 2017, 3:58:42 PM
to dan...@chromium.org, Adrienne Walker, Eric Karl, graphics-dev
Right, agree with that.  The SkCanvas interface is kind of overloaded between a recording canvas and one that rasters pixels.  But the PaintCanvas can be more clearly "recordable APIs only".
 
The paint->raster->paint round trip has led us to do things like:
a) Build a display list
b) Raster it into an SkPicture
c) Add that SkPicture into another display list.

When really all you're looking to do is change some metadata about how the ops in the original list work.

Partly I guess that's because display lists and SkPictures were different things, and there wasn't a way to reference display list (a) directly in display list (c) other than via (b).

With CDL I think we want "playback" to an SkCanvas to be the end of our pipeline and there will be no way back from SkCanvas to PaintRecords.

Currently, playing back the display list into a PaintCanvas provides an abstraction we depend on beyond what I mentioned above (raster via Skia in-process, or via serializing PaintRecords through a command buffer, can be abstracted).  We use this playback model today for several passes (solid color analysis, extracting image metadata, skipping images for low-res tiles).

We're still doing those passes with an SkCanvas today but will want to do them with a PaintCanvas once we start customizing things.  Playback to PaintCanvas provides a consistent way to iterate ops from multiple sources, and maintain the clip/layer stack and CTM, which the passes need.  In the fullness of time I think we want to change those passes, but I think we'll be thoughtful about what that looks like, and where that sits in the order of things to do.

dan...@chromium.org

Apr 25, 2017, 1:17:57 PM
to Victor Miura, Adrienne Walker, Eric Karl, graphics-dev
Right, and these all seem super suboptimal: having to pretend to "raster" and then intercept draw calls in order to walk over the elements in the list. Once the paint ops are visible we can simply walk through them as needed, or just skip commands (like for low-res tiles).
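
As a tiny illustration of "just skip commands" once the ops are visible (the op model here is hypothetical, not a real cc/paint type):

#include <vector>

class SkCanvas;   // forward-declared Skia canvas

// Hypothetical op model for illustration only.
enum class OpType { kDrawImage, kDrawRect, kDrawTextBlob };
struct PaintOp { OpType type; };
void RasterOp(const PaintOp& op, SkCanvas* canvas);

// Low-res playback can drop image draws just by walking the op list; no
// canvas-style interception layer is needed.
void RasterSkippingImages(const std::vector<PaintOp>& ops, SkCanvas* canvas) {
  for (const PaintOp& op : ops) {
    if (op.type == OpType::kDrawImage)
      continue;               // skipped for low-res tiles
    RasterOp(op, canvas);
  }
}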
 
We're still doing those passes with an SkCanvas today but will want to do them with a PaintCanvas once we start customizing things. 

I don't think this part holds, as this requires us to raster through a PaintCanvas again. They will stay SkCanvas as they are part of raster, until we move the logic earlier and do it before rastering. At that point we won't be intercepting raster and won't be using any canvas-like APIs to intercept/override.

Playback to PaintCanvas provides a consistent way to iterate ops from multiple sources, and maintain the clip/layer stack and CTM, which the passes need.

We'd have to reproduce the SkCanvas logic of maintaining a clip/transform to do this with PaintCanvas, and I think we should just do this outside of canvas in the code that walks through and initiates raster of each operation. I don't see what PaintCanvas is providing here when it's no longer SkCanvas, and it would force the APIs to remain the same(ish)?
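
A rough sketch, under a similarly hypothetical op model, of keeping the clip/transform state outside of any canvas while the pass walks the ops itself:

#include <vector>

// Hypothetical op and state types for illustration only.
enum class OpType { kSave, kRestore, kConcat, kClipRect, kDraw };
struct Matrix { float m[9] = {1, 0, 0, 0, 1, 0, 0, 0, 1}; };  // 3x3, identity
struct Rect { float l = 0, t = 0, r = 0, b = 0; };
struct PaintOp { OpType type; Matrix matrix; Rect rect; };

struct State { Matrix ctm; Rect clip; };   // in practice seeded from the tile

Matrix Concat(const Matrix& a, const Matrix& b);        // matrix multiply
Rect Intersect(const Rect& a, const Rect& b);
void VisitDraw(const PaintOp& op, const State& state);  // the actual pass

// The walker owns the save/restore stack; no canvas-like API is involved.
void WalkOps(const std::vector<PaintOp>& ops) {
  std::vector<State> stack = {State{}};
  for (const PaintOp& op : ops) {
    switch (op.type) {
      case OpType::kSave:
        stack.push_back(stack.back());
        break;
      case OpType::kRestore:
        if (stack.size() > 1) stack.pop_back();
        break;
      case OpType::kConcat:
        stack.back().ctm = Concat(stack.back().ctm, op.matrix);
        break;
      case OpType::kClipRect:
        stack.back().clip = Intersect(stack.back().clip, op.rect);
        break;
      case OpType::kDraw:
        VisitDraw(op, stack.back());  // e.g. solid color analysis, metadata
        break;
    }
  }
}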