image/draw performance on blending operations


Denis Cheremisov

Apr 25, 2012, 1:56:59 PM
to golan...@googlegroups.com
Hello!
I need to write a service that applies layers on top of pictures (it's yet another map service; I need to put a balloon and a watermark on top of a picture).
The images I apply the layers to come from another web service, which can handle hundreds of requests per second with little delay.
Here's the function that does this task:

func PutLayers(data []byte, tile_h, tile_w int) (*bytes.Buffer,error) {
    read := bytes.NewBuffer(data)
    tile, err := png.Decode(read)
    if err != nil {
        return new(bytes.Buffer), LayerError("cannot read input stream")
    }

    // Extract drawable object from the image
    b := tile.Bounds()
    dst := image.NewRGBA(image.Rect(0,0,b.Dx(),b.Dy()))
    draw.Draw(dst, dst.Bounds(), tile, b.Min, draw.Src)

    // Put balloon mark
    b = balloon_img.Bounds()
    draw.Draw(dst, image.Rect(tile_w/2, tile_h/2-b.Dy(), tile_w/2+b.Dx(), tile_h/2), balloon_img, image.ZP, draw.Over)

    // Put watermark
    b = watermark_img.Bounds()
    draw.Draw(dst, image.Rect(tile_w/2-b.Dx()/2, tile_h - b.Dy() - 6, tile_w/2 + 3*b.Dx()/2, tile_h-6), watermark_img, image.ZP, draw.Over)

    result := dst.SubImage(dst.Bounds())
    buffer := new(bytes.Buffer)
    png.Encode(buffer,result)

    return buffer,nil
}

Unfortunately, it turned out to be a bottleneck. It is very CPU-hungry and doesn't seem to be fast enough. Without this function I can handle as many requests as that service can, but with it turned on I can only do 70-80 RPS at most, at 100% CPU usage.
I still hope I have foolishly done something wrong and that a couple of tweaks, or some kind of workaround, will help.
Suggestions?

Kyle Lemons

Apr 25, 2012, 4:56:29 PM
to Denis Cheremisov, golan...@googlegroups.com
On Wed, Apr 25, 2012 at 10:56 AM, Denis Cheremisov <denis.ch...@gmail.com> wrote:
> Hello!
> I need to write a service that applies layers on top of pictures (it's yet another map service; I need to put a balloon and a watermark on top of a picture).
> The images I apply the layers to come from another web service, which can handle hundreds of requests per second with little delay.
> Here's the function that does this task:
>
> func PutLayers(data []byte, tile_h, tile_w int) (*bytes.Buffer,error) {
>     read := bytes.NewBuffer(data)
>     tile, err := png.Decode(read)
>     if err != nil {
>         return new(bytes.Buffer), LayerError("cannot read input stream")

return nil, ...

>     }
>
>     // Extract drawable object from the image
>     b := tile.Bounds()
>     dst := image.NewRGBA(image.Rect(0,0,b.Dx(),b.Dy()))
>     draw.Draw(dst, dst.Bounds(), tile, b.Min, draw.Src)

Why not just draw onto tile?

>     // Put balloon mark
>     b = balloon_img.Bounds()
>     draw.Draw(dst, image.Rect(tile_w/2, tile_h/2-b.Dy(), tile_w/2+b.Dx(), tile_h/2), balloon_img, image.ZP, draw.Over)

For clarity (and maybe performance?) you might want to save w/2, h/2, dx and dy

>     // Put watermark
>     b = watermark_img.Bounds()
>     draw.Draw(dst, image.Rect(tile_w/2-b.Dx()/2, tile_h - b.Dy() - 6, tile_w/2 + 3*b.Dx()/2, tile_h-6), watermark_img, image.ZP, draw.Over)

Are these all using the same colorspaces?  If not, you may not be taking advantage of the fast paths.

>     result := dst.SubImage(dst.Bounds())

Is this necessary?

>     buffer := new(bytes.Buffer)

You may want to pool buffers, especially if your images are about the same size, to minimize garbage

Guillermo Estrada

Apr 25, 2012, 8:31:16 PM
to golan...@googlegroups.com
Kyle pretty much covered it.
Returning nil for the value when the error is non-nil is a common Go idiom: either the returned value is nil or the error is.

Performance-wise, it would be a lot faster to draw into tile. Once you have decoded the PNG, use the decoded image as the destination for each draw.


b := tile.Bounds()
dst := image.NewRGBA(image.Rect(0,0,b.Dx(),b.Dy()))

You could just do
dst := image.NewRGBA(tile.Bounds())

But then again you are duplicating the image in memory (because you are going to copy tile into it anyway).
Instead, try a type assertion on tile to get a draw.Image.

dst, ok := tile.(draw.Image)
if !ok {
   // something is wrong, as png.Decode will most likely return images that implement draw.Image
}

Once you have dst as a draw.Image, just draw the balloon and the watermark on top of it.
draw.Draw(dst, image.Rect(tile_w/2, tile_h/2-b.Dy(), tile_w/2+b.Dx(), tile_h/2), balloon_img, image.ZP, draw.Over)
This will save a lot of memory allocation (which is expensive performance-wise).

And you don't need to pass the tile's width and height to the function, as the tile's bounds already contain them.
size := tile.Bounds().Size()
size.X (width)
size.Y (height)

If your tiles are all the same size (as most map tiles are), this rectangle could be computed once and reused.
image.Rect(tile_w/2-b.Dx()/2, tile_h - b.Dy() - 6, tile_w/2 + 3*b.Dx()/2, tile_h-6)

And once again, if the balloon and watermark are not the same image type as tile (because they come from a JPEG, a GIF, or another kind of PNG),
you are not making use of the fast paths in draw.Draw. Make sure they are of the same type.


Then again, this next line is not necessary:
result := dst.SubImage(dst.Bounds())

So, reviewing: you get a tile image, you copy it into the dst image, and at the end you wrap dst again as result. SubImage over the full bounds returns an image that shares its pixels with dst, so it adds nothing and dst can be encoded directly.
You don't need a separate copy of tile in memory after the function exits, so just do every draw operation
on top of tile. This will perform better because it avoids so much memory allocation.

Hope it helps! Can we see the demo? :D

Nigel Tao

Apr 25, 2012, 10:01:50 PM
to Kyle Lemons, Denis Cheremisov, golan...@googlegroups.com
On 26 April 2012 06:56, Kyle Lemons <kev...@google.com> wrote:
> Are these all using the same colorspaces?  If not, you may not be taking
> advantage of the fast paths.

Specifically, make sure that your balloon and watermark images are
RGBA, to get the fastest draw.Draw code. Decoding a partially
transparent PNG (if that's what your balloon and watermarks are) will
yield NRGBA images, which aren't as fast. You will have to explicitly
convert them to RGBA.

Denis Cheremisov

Apr 26, 2012, 6:13:04 AM
to golan...@googlegroups.com, Kyle Lemons, Denis Cheremisov
Yes, I've converted them to RGBA.
But the problem is actually slow PNG decoding and/or encoding. I tested decoding, and it took 13 seconds just to decode a 250x250 PNG file 1000 times (of course, I measured decode time only). It's really underwhelming. I can handle 150-160 RPS with gevent+PIL on the same machine, using the same algorithm. I'll stick with Python.

On Thursday, April 26, 2012, 6:01:50 UTC+4, Nigel Tao wrote:

minux

Apr 26, 2012, 3:33:38 PM
to Denis Cheremisov, golan...@googlegroups.com, Kyle Lemons
On Thu, Apr 26, 2012 at 6:13 PM, Denis Cheremisov <denis.ch...@gmail.com> wrote:
> Yes, I've converted them to RGBA.
> But the problem is actually slow PNG decoding and/or encoding. I tested decoding, and it took 13 seconds just to decode a 250x250 PNG file 1000 times (of course, I measured decode time only). It's really underwhelming.

This is unbelievable... Could you please make your PNG file available so that I can benchmark it myself?
And on what machine did you get this result?

Denis Cheremisov

Apr 26, 2012, 7:32:30 PM
to golan...@googlegroups.com, Denis Cheremisov, Kyle Lemons
See the "benchmark" and sample picture in the attachment.


On Thursday, April 26, 2012, 23:33:38 UTC+4, minux wrote:
png_open.go
test.png

andrey mirtchovski

Apr 26, 2012, 8:13:51 PM
to Denis Cheremisov, golan...@googlegroups.com, Kyle Lemons
I don't think it's really unbelievable. This benchmark gives ~30ms per
decode/encode cycle of the test.png image, which is about 33 per second:

http://play.golang.org/p/AtWdDqlqXX
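
The linked program is not reproduced here. As a rough illustration of this kind of measurement, the sketch below times repeated PNG decodes of a synthetic 250x250 image that stands in for test.png (the real attachment is not available, so absolute numbers will differ). samplePNG and timeDecode are hypothetical names.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/color"
	"image/png"
	"time"
)

// samplePNG builds a synthetic 250x250 PNG; it is only a stand-in
// for the test.png attachment from the thread.
func samplePNG() ([]byte, error) {
	img := image.NewRGBA(image.Rect(0, 0, 250, 250))
	for y := 0; y < 250; y++ {
		for x := 0; x < 250; x++ {
			img.SetRGBA(x, y, color.RGBA{R: uint8(x), G: uint8(y), B: uint8(x ^ y), A: 255})
		}
	}
	var buf bytes.Buffer
	if err := png.Encode(&buf, img); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// timeDecode decodes data n times and returns the average time per decode.
func timeDecode(data []byte, n int) (time.Duration, error) {
	start := time.Now()
	for i := 0; i < n; i++ {
		if _, err := png.Decode(bytes.NewReader(data)); err != nil {
			return 0, err
		}
	}
	return time.Since(start) / time.Duration(n), nil
}

func main() {
	data, err := samplePNG()
	if err != nil {
		fmt.Println("building sample failed:", err)
		return
	}
	avg, err := timeDecode(data, 100)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println("average decode time:", avg)
}
```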

Guillermo Estrada

Apr 26, 2012, 8:37:31 PM
to golan...@googlegroups.com, Denis Cheremisov, Kyle Lemons
On Thursday, April 26, 2012 2:33:38 PM UTC-5, minux wrote:
> On Thu, Apr 26, 2012 at 6:13 PM, Denis Cheremisov wrote:
>> Yes, I've converted them to RGBA.
>> But the problem is actually slow PNG decoding and/or encoding. I tested decoding, and it took 13 seconds just to decode a 250x250 PNG file 1000 times (of course, I measured decode time only). It's really underwhelming.
>
> This is unbelievable... Could you please make your PNG file available so that I can benchmark it myself?
> And on what machine did you get this result?

I think he meant 13 seconds for 1000 runs, i.e. about 13 ms per decode, which is way better than aam's example of 30 ms per decode/encode cycle. All the other things mentioned previously aside, if the bottleneck is in the PNG decode, then it is going to be hard to improve (short of writing your own decode function). The 1000 requests per second can still be achieved by using a load balancer and many instances of your server, though, so it can be worked out in infrastructure.

andrey mirtchovski

Apr 26, 2012, 9:27:45 PM
to Guillermo Estrada, golan...@googlegroups.com, Denis Cheremisov, Kyle Lemons
It's trivially parallelized. On my 4-CPU MacBook Pro, a set of 1000
with 8 workers gives these wall-clock times (the first column is GOMAXPROCS):

$ for i in 1 4 8 16; do echo -n "$i " ; GOMAXPROCS=$i /usr/bin/time ./decodepara; done
1 30.58 real 30.57 user 0.00 sys
4 11.70 real 38.00 user 0.06 sys
8 9.37 real 59.04 user 0.24 sys
16 9.59 real 61.98 user 0.27 sys

code is here if you want to play around with or improve it (i make no
claims that this is the best way to parallelize):
http://play.golang.org/p/uLknaAcskT

Guillermo Estrada

Apr 26, 2012, 10:28:28 PM
to golan...@googlegroups.com, Guillermo Estrada, Denis Cheremisov, Kyle Lemons

I think parallelizing the PNG-decoding example won't help him much: he is building a web service, so every request that goes through his function is already handled in its own goroutine by the server. He surely has a lot of goroutines running already, each decoding a single PNG (if the service works as I think it does).

But Denis, it wouldn't hurt to check that GOMAXPROCS is set on the machine you're testing on; that can actually help the req/s performance.

Nigel Tao

Apr 27, 2012, 2:04:59 AM
to Denis Cheremisov, golan...@googlegroups.com
On 27 April 2012 09:32, Denis Cheremisov <denis.ch...@gmail.com> wrote:
> See the "benchmark" and sample picture in the attachment.

Ah, the PNG decoder hasn't been optimized yet. As a simple change,
there were some slow loops that were easily replaced by the built-in
copy function (http://codereview.appspot.com/6127051, which has been
submitted). On your program:
Before: 4.79 seconds.
After: 3.08 seconds.
Delta: -36%.

If you've installed Go from source
(http://golang.org/doc/install/source), sync'ing to tip should pick up
this change.

Profiling (http://blog.golang.org/2011/06/profiling-go-programs.html)
suggests that the bulk of the time is now in compress/flate code. I
don't think that package has been optimized yet either. There may be
further easy wins.