best approach for multiple threads handling image requests


Michael Katz

Jun 23, 2023, 8:26:27 AM
to skia-discuss
I have a server application that responds to http requests for images. The images are typically 256x256 map tiles that the server draws dynamically, but also some other dynamic images of various sizes. The server uses a thread for each request.

I have just become aware that multiple threads can't access the same "drawing context". I say "drawing context" because it's not clear to me where in the hierarchy things break. Specifically:

* I am using SDL to get a GL context. I call SDL_Init() once when the program starts.

* I call SDL_CreateWindow() once when the program starts. I am currently making a 4096x4096 window, because that's the maximum size my program could need.

* I call SDL_GL_CreateContext() once when the program starts.

* I call SDL_GL_MakeCurrent() once with that window and that context, when the program starts.

* I call glViewport( 0, 0, 4096, 4096 ) once when the program starts.

* I call GrDirectContext::MakeGL( NULL ) once when the program starts.

* I then call SkSurface::MakeRenderTarget() with that direct context, again just once.

* I call _mySurfaceGL->getCanvas() once to get a canvas for that surface.

Then I do various clearing/drawing on that canvas, and then I call canvas->flush() and canvas->readPixels() to copy out some pixels (usually just a small region on the top-left of that large canvas) into an SkBitmap, and from there I can get PNG data.

That all works great when done from the main thread, but when it's a worker thread bad things happen, like canvas->flush() never returning, and getting no data when trying to get PNG encoded data.

So my question is: When I handle an http request with a worker thread, how much of that stack do I have to recreate for the thread? Is it way up at SDL_CreateWindow() or SDL_GL_CreateContext(), or perhaps at GrDirectContext::MakeGL(), or SkSurface::MakeRenderTarget()?

And if I need to do relatively expensive create functions for each thread, should I have a pool of premade windows/contexts/surfaces to draw from?
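
One way to picture the pooling idea is a fixed set of premade contexts that request threads borrow and return. The sketch below is library-free C++: `RenderContext` and `ContextPool` are hypothetical names, and `RenderContext` merely stands in for whatever bundle (window, GL context, Skia surface) ends up being needed. With real GL, a borrowed context would presumably also have to be made current on the borrowing thread, for example with SDL_GL_MakeCurrent(), before any drawing.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for one window + GL context + surface bundle.
struct RenderContext {
    int id;
};

// Fixed-size pool: threads borrow a context and hand it back when done.
class ContextPool {
public:
    explicit ContextPool(int count) {
        assert(count > 0);
        for (int i = 0; i < count; ++i)
            free_.push_back(std::make_unique<RenderContext>(RenderContext{i}));
    }

    // Blocks until some context is idle, then hands it to the caller.
    std::unique_ptr<RenderContext> acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        available_.wait(lock, [this] { return !free_.empty(); });
        std::unique_ptr<RenderContext> ctx = std::move(free_.back());
        free_.pop_back();
        return ctx;
    }

    // Returns a borrowed context and wakes one waiting thread.
    void release(std::unique_ptr<RenderContext> ctx) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            free_.push_back(std::move(ctx));
        }
        available_.notify_one();
    }

    std::size_t idleCount() {
        std::lock_guard<std::mutex> lock(mutex_);
        return free_.size();
    }

private:
    std::mutex mutex_;
    std::condition_variable available_;
    std::vector<std::unique_ptr<RenderContext>> free_;
};
```

A request thread would call acquire(), draw and read back its pixels, then call release(); the pool size caps how many of the expensive objects ever exist, regardless of how many request threads are running.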

What is the standard pattern for multi-threaded apps?

K. Moon

Jun 23, 2023, 12:55:08 PM
to skia-d...@googlegroups.com
I don't think the OpenGL API supports multithreading. Is there a reason you can't just use a CPU surface? Hardware acceleration typically is meant for clients, not servers.


Michael Katz

Jun 23, 2023, 2:24:54 PM
to skia-discuss
A CPU surface is way too slow for my needs. My app is an Electron desktop application that runs both the client process and server process on the user's machine.

I understand that OpenGL doesn't support multi-threading at some level. My question was asking if there was some level of the stack (see the list above) where I can duplicate the infrastructure for each thread, in a way that works with OpenGL. I understand it's possible that the answer is "No, at no level of duplication of that stack can you get OpenGL to work in threads. Even if you had each thread call its own SDL_CreateWindow(), its own SDL_GL_CreateContext(), its own GrDirectContext::MakeGL(), its own SkSurface::MakeRenderTarget() to get its own surface -- even then, multi-threading is not supported."

It's okay if that's the answer. But I was hoping the answer was that things could work if each thread did all of that, or some subset of that. There must be other programs (Chrome?) that use threads while doing hardware rendering?

K. Moon

Jun 24, 2023, 9:13:25 AM
to skia-d...@googlegroups.com
This varies depending on the platform/OpenGL implementation, but from what limited research I've done, you definitely can't use a single OpenGL context from multiple threads simultaneously, and it's not useful to use multiple contexts:

The reason APIs like Vulkan and DirectX 12 were invented is precisely because they handle multithreading better than the respective legacy APIs. OpenGL, in particular, heavily relies on a state machine that doesn't parallelize well.

Chrome doesn't necessarily use OpenGL for rendering (Skia supports other backends), but in any case, the architecture routes all rendering to a single GPU process. Multithreaded rendering isn't really useful in most cases, as Chrome mostly needs fast compositing (which can be done fast enough on a single thread). Renderer processes batch up drawing commands and send them to the GPU process to be executed. Sometimes it's still more efficient to do most of the work on the CPU, and just move textures around.

The Chrome graphics stack continues to change, so I'd do my own research (I believe there continues to be work to improve Canvas performance, for example), but this is how most rendering still happens today, to my knowledge.



K. Moon

Jun 24, 2023, 9:24:22 AM
to skia-d...@googlegroups.com
A couple of other things to keep in mind:

1. Most high performance graphics architectures (such as games) have used a single thread for a long time. Only the recent adoption of Vulkan, DirectX 12, and Metal has changed that.

2. Reading from GPU memory is an expensive operation (as it requires synchronizing state between the GPU and the CPU), and generally avoided in high performance applications.

YMMV, so it's best to do your own benchmarks with your specific application.
Message has been deleted

Brian Osman

Jun 24, 2023, 10:05:42 AM
to skia-d...@googlegroups.com
However: If you don't need to share any resources (eg, each thread is basically doing independent rendering to service unrelated requests), you can definitely use GL from multiple threads, but each one needs a dedicated GL context.
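
A minimal sketch of the one-context-per-thread pattern, using only the C++ standard library. `ThreadContext` is a hypothetical placeholder for the real per-thread GL context and whatever Skia objects hang off it. The point is only that `thread_local` gives each worker its own instance, created once on first use and reused for later requests on the same thread.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical placeholder for a dedicated per-thread GL context.
struct ThreadContext {
    int id;
};

std::atomic<int> g_contextsCreated{0};

ThreadContext& contextForThisThread() {
    // Initialized on the first call made by each thread, then reused by it.
    thread_local ThreadContext ctx{g_contextsCreated.fetch_add(1)};
    return ctx;
}

// Starts `threads` workers that each serve `requestsPerThread` requests and
// returns how many contexts were created while they ran.
int runWorkers(int threads, int requestsPerThread) {
    assert(threads > 0 && requestsPerThread > 0);
    const int before = g_contextsCreated.load();
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([requestsPerThread] {
            for (int r = 0; r < requestsPerThread; ++r) {
                ThreadContext& ctx = contextForThisThread();
                (void)ctx;  // independent drawing for one request would go here
            }
        });
    }
    for (auto& w : workers) w.join();
    return g_contextsCreated.load() - before;
}
```

Four workers serving ten requests each create four contexts, not forty, so the setup cost is paid per thread rather than per request.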

On Sat, Jun 24, 2023 at 10:03 AM craste...@gmail.com <craste...@gmail.com> wrote:
What if the tile-making logic was in separate processes on the server? Instead of separate threads? 

craste...@gmail.com

Jun 24, 2023, 3:52:20 PM
to skia-discuss
Curious:  Why was my post deleted?

Michael Katz

Jun 25, 2023, 10:31:34 AM
to skia-discuss
Thanks to all of you for the additional information (on a Saturday morning!).

Yes, the tiles are drawn by each http server thread. They access the same geographic data, but it's all read-only so they work in parallel. I want them to work in parallel so they can access disk data and do other preparatory steps in parallel.

So I can see two ways to go:

- Based on Brian Osman's comment, I could create a GL context (which I think also means a hidden SDL window backing each context) for each thread, the thread does its drawing, and then I do a single read-back using  _surface->makeImageSnapshot() at the end of each thread. There could be like 100 threads working at once, so I don't know if that many GL contexts can exist happily.

- Based on K. Moon's comment, I could have each thread "batch up" drawing commands, and at the end of the thread's work it would share those drawing commands with the main thread and do a blocking call. The main thread would pick up these jobs and, one at a time, do the rendering on its one surface, do a read-back with _surface->makeImageSnapshot(), and hand off the resulting image data to the thread. However, I don't know how to "batch up" the drawing commands. For instance, when I call Skia drawLine(), my understanding is that it draws the line on the given canvas/surface. But I don't know how to make it "give out" the gl commands that I would then batch up. My understanding from games is that, for each frame, they batch up drawing commands by actually calling gl functions, but nothing actually happens until they call a submit() function. But I don't know how to translate that in my case.
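
The second option (request threads hand a job to one rendering thread and block until the result comes back) can be sketched with a queue and futures. This is plain standard-library C++ under stated assumptions: `RenderThread` is a hypothetical name, and a job here is just a callable returning a string that stands in for the encoded image bytes. No GL or Skia calls are shown.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>

// Owns the only thread that would be allowed to touch the GPU surface.
class RenderThread {
public:
    using Job = std::function<std::string()>;  // returns stand-in image bytes

    RenderThread() : worker_([this] { run(); }) {}

    ~RenderThread() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_one();
        worker_.join();
    }

    // Callable from any request thread; get() on the future blocks the caller.
    std::future<std::string> submit(Job job) {
        std::packaged_task<std::string()> task(std::move(job));
        std::future<std::string> result = task.get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(task));
        }
        wake_.notify_one();
        return result;
    }

private:
    void run() {
        for (;;) {
            std::packaged_task<std::string()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopping and nothing left to do
                task = std::move(jobs_.front());
                jobs_.pop();
            }
            task();  // jobs run one at a time, always on this thread
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::packaged_task<std::string()>> jobs_;
    bool stopping_ = false;
    std::thread worker_;  // declared last so the members above exist first
};
```

A request thread would do its disk access and preparation in parallel with the others, then call submit() with the drawing step and wait on get() for the pixels.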

(@crasterimage, yes, if a GL context per thread doesn't work for some reason, I could consider each thread handing off work to a new process. But that's a bit drastic so I'll try the others for now.)



K. Moon

Jun 25, 2023, 11:18:54 AM
to skia-d...@googlegroups.com
I'm also unsure how much multiple processes would help over multiple threads, as the limit on multiple context efficiency is on the driver side.

An easy way to batch up rendering work is to record an SkPicture, then play it back using drawPicture(). It depends on which part of your pipeline is compute-intensive, though.
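
To make the record-then-replay idea concrete without depending on Skia, here is a library-free analogy. It is not SkPicture itself: `DrawList` and `Target` are hypothetical names, and `Target` only counts calls instead of rasterizing. It shows the shape of the pattern, where commands are captured on one thread and executed later against a real drawing target.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical drawing target; counts calls instead of rasterizing.
struct Target {
    int linesDrawn = 0;
    void drawLine(float, float, float, float) { ++linesDrawn; }
};

// Captures drawing commands now and replays them later on any Target.
class DrawList {
public:
    void drawLine(float x0, float y0, float x1, float y1) {
        // Copy the arguments into a closure; nothing is drawn yet.
        commands_.push_back([=](Target& t) { t.drawLine(x0, y0, x1, y1); });
    }

    void playback(Target& target) const {
        for (const auto& command : commands_) command(target);
    }

    std::size_t size() const { return commands_.size(); }

private:
    std::vector<std::function<void(Target&)>> commands_;
};
```

In this analogy a worker thread would fill a `DrawList` while preparing a tile, and the thread that owns the surface would call playback() on it.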

I'm wondering at this point to what extent the rendering is actually expensive, and how much is in the stages before the rendering happens. That's something specific to your application, though. If you haven't already, I'd make sure to take detailed profiles to make sure the parts you think are slow really are the slow parts.

K. Moon

Jun 25, 2023, 11:28:25 AM
to skia-d...@googlegroups.com
As an aside, large numbers of parallel threads work well for I/O, but the compute-bound parts of your application are going to be limited by the number of available cores/threads. Unless we're talking workstation-grade hardware, that's likely to be a rather small number like 4-16. You may want to consider passing off the work to a smaller thread pool once I/O is complete.
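
A rough sketch of that suggestion in standard C++: size the compute stage by the number of hardware threads rather than by the number of requests. The function names are hypothetical, and each job is a plain callable returning an int as a stand-in for a rendered tile.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Worker count for compute-bound work: one per hardware thread, at least 1
// (hardware_concurrency() may report 0 when the value is unknown).
unsigned computeWorkerCount() {
    return std::max(1u, std::thread::hardware_concurrency());
}

// Runs every job on a fixed number of workers; results keep job order.
std::vector<int> runComputeJobs(const std::vector<std::function<int()>>& jobs,
                                unsigned workers) {
    assert(workers > 0);
    std::vector<int> results(jobs.size());
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&] {
            // Each worker claims the next unclaimed job until none remain.
            for (std::size_t i = next.fetch_add(1); i < jobs.size();
                 i = next.fetch_add(1)) {
                results[i] = jobs[i]();
            }
        });
    }
    for (auto& t : pool) t.join();
    return results;
}
```

The many I/O threads could stay as they are; only the compute-bound drawing would be funneled through something like runComputeJobs(jobs, computeWorkerCount()).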

Michael Katz

Jun 26, 2023, 8:44:31 AM
to skia-discuss
Thanks for the additional thoughts.

Here are some actual measurements. In all of these cases I am just generating random coordinates, so there is no disk access or processing other than pure drawing.

My benchmark is the CImg library, which is what previous versions of my map program use. It is amazingly fast at CPU rendering. However, it lacks functionality such as drawing lines with width other than 1, and drawing polygons with a fill pattern. I built upon it to add those features, but it got a bit complicated. Hence looking into Skia.

All of these tests used a 256x256 bitmap/surface/canvas.

The first test was drawing 100 random line segments on a white canvas, where the line segments were fully on or intersected the canvas. I did that 10,000 times, so it was 1 million line segments drawn in all. The lines were all width 1, black, and drawn without antialiasing.

// width 1, no antialias, black lines
skia raster -- 5.48 sec
skia GL -- 1.34 sec
cimg -- 3.32 sec
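
For anyone reproducing numbers like these, a minimal timing harness might look like the following. It is only a sketch: `Segment`, `randomSegment`, and `timeSeconds` are hypothetical names, the drawing call itself is left to the caller, and for a GPU backend the timed region presumably needs to include the flush or read-back so that queued work is counted.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <random>

struct Segment {
    float x0, y0, x1, y1;
};

// Random segment with coordinates in [lo, hi); a seeded engine keeps runs repeatable.
Segment randomSegment(std::mt19937& rng, float lo, float hi) {
    assert(lo < hi);
    std::uniform_real_distribution<float> coord(lo, hi);
    return Segment{coord(rng), coord(rng), coord(rng), coord(rng)};
}

// Calls drawFrame `iterations` times and returns elapsed wall time in seconds.
double timeSeconds(int iterations, const std::function<void()>& drawFrame) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) drawFrame();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

The first test would correspond to timeSeconds(10000, ...) with a callback that clears the canvas and draws 100 segments from randomSegment().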

I tried making the lines random colored, just to see if it made a difference:

// width 1, no antialias, multicolor lines
skia raster -- 5.72 sec
skia GL -- 1.50 sec
cimg -- 3.46 sec

Then I tried turning on antialiasing:

// width 1, antialias, black lines
skia raster -- 28.66 sec
skia GL -- 1.72 sec
cimg -- N/A (but time to beat is about 3.40 sec)

Then I tried using paths, where each time I drew the 100 line segments, I added them all to a Path, and then drew the path.

// using path (each set of 100 lines is a path), width 1, antialias, black lines
skia raster -- 4.32 sec
skia GL -- 8.91 sec
cimg -- N/A (but time to beat is about 3.40 sec)

Then I turned off antialiasing:

// using path, width 1, no antialias, black lines
skia raster -- 1.95 sec
skia GL -- 0.28 sec
cimg -- N/A (but time to beat is about 3.40 sec)

Then I tried wide lines:

// using path, width 5, no antialias, black lines
skia raster -- 2.08 sec
skia GL -- 0.39 sec
cimg -- N/A (but time to beat is about 3.40 sec)

At this point I was thinking, you know, now that I know that paths are so efficient, maybe I'll stick with CPU rendering as K. Moon originally suggested. It's 5x slower than GPU, but it beats CImg's 3.40 sec, and CImg can't even do thick lines (my hack to draw thick lines as rectangles is *much* slower than any of these times).

But then I tried images. The map program draws images to represent points of interest. These are usually about 30x30 pixels and include transparency (often the images are circular icons). In this case I drew 100 images in random locations on the canvas. Again I did this 100 times, for a total of 10,000 images drawn. The raster time was disappointing.

// symbol at 50% transparent (draw 100 symbols, 100 times)
skia raster -- 2.54 sec
skia GL -- 0.25 sec
cimg -- 0.27 sec

Perhaps there is a faster way to draw the images? My code looks like:

void DrawSkiaImagesRaster()
{
    NImg phone; // NImg manages a skia raster surface and its canvas
    phone.LoadFromFile( "C:\\Users\\michael.katz\\Pictures\\phone.png" ); // about 30x30
    sk_sp<SkImage> phoneImage = phone._surface->makeImageSnapshot();
    SkCanvas *canvas = _skSurfaceRaster->getCanvas();
    // draw into canvas
    for ( int j = 0 ; j < 100 ; j++ )
    {
        canvas->clear( SK_ColorWHITE );
        for ( int i = 0 ; i < 100 ; i++ )
        {
            SkPoint p0 = SkPoint::Make( (SkScalar)GetRandomNumberInRange( -20, 255 + 20 ),
                                        (SkScalar)GetRandomNumberInRange( -20, 255 + 20 ) );
            SkPaint paint;
            paint.setAlpha( 128 );
            SkSamplingOptions sampling;
            canvas->drawImage( phoneImage, p0.x(), p0.y(), sampling, &paint );
        }
    }
}

Assuming that's the fastest I can get raster images to draw, I think my plan will be to do everything with CPU rasters using paths for polylines, polygons, etc., except when it comes to drawing sets of images, I'll hand off the job to the main thread to do in the GPU.

(By the way, a person could reasonably ask why you'd ever want to draw 10,000 30x30 images on a single 256x256 map tile. But for better or worse that's what the program does when you zoom out from an area with lots of points of interest.)

Brian Osman

Jun 26, 2023, 8:49:34 AM
to skia-d...@googlegroups.com
Yes, those results all look plausible. One question: What SkColorType are you using for your surface? The CPU backend does have faster code if the surface is "N32" (which is always either RGBA_8888 or BGRA_8888). If you're using the other channel order, everything still works, but certain simple operations (like image draws) might be slower than necessary. There is a kN32_SkColorType enum value that will be equal to either kRGBA_8888 or kBGRA_8888, so you can try using that.

Regardless - the GPU backend is going to be substantially faster, particularly for some of the workloads you're describing. However, I'm curious how well it's going to scale, if you're doing lots of rendering in parallel -- there's still only (presumably) one GPU, so having a dozen GL contexts still means they're fighting for the actual hardware.

craste...@gmail.com

Jun 26, 2023, 8:58:25 AM
to skia-discuss
Also, I am curious why these SkImages are being created dynamically, rather than having been loaded up-front and left alone? Is there some aspect of these tiled images that is changing on every draw? Could that difference be performed in some kind of alternate way? Also, have you looked at using an atlas?

K. Moon

Jun 26, 2023, 2:06:40 PM
to skia-d...@googlegroups.com
There was another recent thread where drawAtlas() provided a substantial speedup, if that works well for your use case.

Reference for SkCanvas::drawAtlas(): https://api.skia.org/classSkCanvas.html#ace6d74f7e43162984c184cac4edc4363

If you can write your example as a Skia Fiddle (https://fiddle.skia.org/), that can be helpful for anyone else trying to replicate your results. (Benchmarking isn't going to be a good fit for a shared service like Skia Fiddle, of course, but it represents a consistent execution environment that can be replicated locally.)

Michael Katz

Jun 28, 2023, 8:18:12 AM
to skia-discuss
Thanks for all the feedback.

@brian, yes, I'm using SkImageInfo::MakeN32( 256, 256, kUnpremul_SkAlphaType ) for all raster surfaces.

@craster, yes the actual code uses a cache of created SkImages, so they are not created dynamically. The code above just creates an image (once) to have something to draw.

@K.Moon, drawAtlas() didn't help me, and in fact made the times a bit worse. My guess is that where it helps is when you are drawing a large number of different sprites, and it's better to load them (especially when loading to the GPU) as a single graphic/texture instead of each one being its own. That's not the case for me. The usual case is drawing the same image thousands of times.

To add to the data, I created an irregular, 20-sided (not self-crossing) polygon and drew it 100,000 times at various scales and in various positions (the scaling and positioning was done by me, not by canvas properties, so that part is identical for all three runs).

// 1000 iterations * 100 polygons, alpha 1, no antialias
skia raster -- 2.94 sec
skia GL -- 0.92 sec
cimg -- 5.18 sec

// 1000 iterations * 100 polygons, alpha random, no antialias
skia raster -- 101.26 sec
skia GL -- 0.77 sec
cimg -- 4.59 sec

It's interesting that the GPU actually did a bit better with random alphas. The fact that cimg beat skia raster by so much with random alpha is not a fair comparison because it was doing a replace instead of a blend (so it's basically the same as alpha 1).
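
The replace-versus-blend distinction comes down to whether the destination pixel has to be read and mixed. A one-channel sketch follows, assuming 8-bit values and non-premultiplied alpha. The function names are hypothetical, and this is the textbook source-over formula, not Skia's or CImg's actual code.

```cpp
#include <cassert>
#include <cstdint>

// Replace: the destination value is never read, so alpha has no effect.
std::uint8_t replaceChannel(std::uint8_t src, std::uint8_t /*dst*/) {
    return src;
}

// Source-over blend of one channel: out = (src * a + dst * (255 - a)) / 255.
std::uint8_t blendChannel(std::uint8_t src, std::uint8_t dst, std::uint8_t alpha) {
    // The +127 term rounds to the nearest integer before the division.
    const unsigned mixed = src * alpha + dst * (255u - alpha) + 127u;
    return static_cast<std::uint8_t>(mixed / 255u);
}
```

For a black source over a white destination at alpha 128, blendChannel(0, 255, 128) gives 127 (mid-gray), while replaceChannel(0, 255) gives 0 regardless of alpha. The blend costs a read, two multiplies, and a divide per channel, whereas the replace is a plain store.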

Michael Katz

Jul 12, 2023, 1:25:40 AM
to skia-discuss
I have my code working well, using a separate SDL window and gl context per thread. The code is here: https://stackoverflow.com/questions/76587353/poco-http-server-with-custom-thread-pool-to-have-custom-expensive-thread-data/76667345