html/template: slow performance on big structures


vadi...@gmail.com

Jul 12, 2015, 12:05:36 AM
to golan...@googlegroups.com

Go version is 1.5 beta 1; the same happened on the "stock" 1.3.3 from the Debian repos.
OS: Debian 8.1 x64

I'm stress testing (with loader.io) this kind of code in Go: it creates an array of 100 items, along with some other basic variables, and renders them all through the template:

package main

import (
    "html/template"
    "log"
    "net/http"
)

var templates map[string]*template.Template

// Load templates on program initialisation
func init() {
    if templates == nil {
        templates = make(map[string]*template.Template)
    }

    templates["index.html"] = template.Must(template.ParseFiles("index.html"))
}

func handler(w http.ResponseWriter, r *http.Request) {
    type Post struct {
        Id int
        Title, Content string
    }

    var Posts [100]Post

    // Fill posts
    for i := 0; i < 100; i++ {
        Posts[i] = Post{i, "Sample Title", "Lorem Ipsum Dolor Sit Amet"}
    }

    type Page struct {
        Title, Subtitle string
        Posts [100]Post
    }

    var p Page

    p.Title = "Index Page of My Super Blog"
    p.Subtitle = "A blog about everything"
    p.Posts = Posts

    tmpl := templates["index.html"]

    if err := tmpl.ExecuteTemplate(w, "index.html", p); err != nil {
        log.Println(err)
    }
}

func main() {
    http.HandleFunc("/", handler)
    log.Fatal(http.ListenAndServe(":8888", nil))
}
My loader.io test runs 5k concurrent connections/s for 1 minute. The problem is that just a few seconds after the test starts, the average latency climbs to almost 10s; I end up with about 5k successful responses, and the test stops because it hits the 50% error rate (timeouts).

On the same machine, PHP gives 50k+.

I understand this isn't a Go performance issue as such, but probably something related to html/template. Go easily handles heavy computation far faster than anything like PHP, of course, but when it comes to rendering data through a template, why is it so awful?

Any workarounds? Or am I just doing it wrong (I'm new to Go)?

P.S. Actually, even with 1 item it's exactly the same: 5-6k responses, then the test stops after a huge number of timeouts. But that's probably because the posts array is a fixed [100]Post, so its length stays the same.

Tamás Gulácsi

Jul 12, 2015, 1:05:03 AM
to golan...@googlegroups.com
Can you profile your code?

First, to be sure the template is the culprit, move all your data creation into init and use the same p for each call, so that only ExecuteTemplate remains in the handler. A sketch follows.
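
A minimal sketch of that suggestion (not from the original post; it reuses the Post/Page types from the code above, moved to package scope so both init and the handler can see them):

package main

import (
    "html/template"
    "log"
    "net/http"
)

type Post struct {
    Id             int
    Title, Content string
}

type Page struct {
    Title, Subtitle string
    Posts           [100]Post
}

var (
    tmpl = template.Must(template.ParseFiles("index.html"))
    p    Page // built once at startup, shared read-only by all requests
)

func init() {
    p.Title = "Index Page of My Super Blog"
    p.Subtitle = "A blog about everything"
    for i := range p.Posts {
        p.Posts[i] = Post{i, "Sample Title", "Lorem Ipsum Dolor Sit Amet"}
    }
}

func handler(w http.ResponseWriter, r *http.Request) {
    // Only template execution remains in the hot path.
    if err := tmpl.ExecuteTemplate(w, "index.html", p); err != nil {
        log.Println(err)
    }
}

func main() {
    http.HandleFunc("/", handler)
    log.Fatal(http.ListenAndServe(":8888", nil))
}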

vadi...@gmail.com

Jul 12, 2015, 10:29:50 AM
to golan...@googlegroups.com
Do you mean profiling results or the results of stress testing via loader.io? I can share anything.
As for profiling, these results were collected with github.com/pkg/profile (a sketch of the typical setup follows the listing below); here's the top10 of the CPU profile:

root@Test:~# go tool pprof app /tmp/profile311243501/cpu.pprof
Possible precedence issue with control flow operator at /usr/lib/go/pkg/tool/linux_amd64/pprof line 3008.
Welcome to pprof!  For help, type 'help'.
(pprof) top10
Total: 2054 samples
      97   4.7%   4.7%      726  35.3% reflect.Value.call
      89   4.3%   9.1%      278  13.5% runtime.mallocgc
      85   4.1%  13.2%       86   4.2% syscall.Syscall
      66   3.2%  16.4%       75   3.7% runtime.MSpan_Sweep
      58   2.8%  19.2%     1842  89.7% text/template.(*state).walk
      54   2.6%  21.9%      928  45.2% text/template.(*state).evalCall
      51   2.5%  24.3%       53   2.6% settype
      47   2.3%  26.6%       47   2.3% runtime.stringiter2
      44   2.1%  28.8%      149   7.3% runtime.makeslice
      40   1.9%  30.7%      223  10.9% text/template.(*state).evalField
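
(Aside: the thread doesn't show how the profiler was enabled; with github.com/pkg/profile it is presumably a one-liner in main, something like this sketch, reusing the handler from the first post:)

package main

import (
    "net/http"

    "github.com/pkg/profile"
)

func main() {
    // CPU profiling is pkg/profile's default mode; Stop flushes the
    // profile to a temp directory (the /tmp/profileNNN path above).
    defer profile.Start(profile.CPUProfile).Stop()

    http.HandleFunc("/", handler)
    http.ListenAndServe(":8888", nil)
}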

Then, after refining the code as suggested in a reply here, I ran the test again, but the results were the same:

root@Test:~# go tool pprof app /tmp/profile501566907/cpu.pprof
Possible precedence issue with control flow operator at /usr/lib/go/pkg/tool/linux_amd64/pprof line 3008.
Welcome to pprof!  For help, type 'help'.
(pprof) top10
Total: 2811 samples
     137   4.9%   4.9%      442  15.7% runtime.mallocgc
     126   4.5%   9.4%      999  35.5% reflect.Value.call
     113   4.0%  13.4%      115   4.1% syscall.Syscall
     110   3.9%  17.3%      122   4.3% runtime.MSpan_Sweep
     102   3.6%  20.9%     2561  91.1% text/template.(*state).walk
      74   2.6%  23.6%      337  12.0% text/template.(*state).evalField
      68   2.4%  26.0%       72   2.6% settype
      66   2.3%  28.3%     1279  45.5% text/template.(*state).evalCall
      65   2.3%  30.6%      226   8.0% runtime.makeslice
      57   2.0%  32.7%       57   2.0% runtime.stringiter2

Sorry, I'm not quite sure how to properly move the data to init(). If I move it there, the Go compiler gives this error: undefined: p

On Sunday, July 12, 2015 at 8:05:03 AM UTC+3, Tamás Gulácsi wrote:

Tamás Gulácsi

Jul 12, 2015, 12:16:21 PM
to golan...@googlegroups.com, vadi...@gmail.com
TL;DR: this is a very artificial, very specific test; please don't draw conclusions from its result!
Go does exactly what you write down, and only that: no magic caching, no JIT, and so on.

If you want a comparison, do it on a real application! For this nonsense payload I can cache the whole generated template and write it out unconditionally, which means 5500 req/s instead of 156 req/s; see below.



With GOMAXPROCS=1, 1000 concurrent requests (same machine):

gthomas@waterhouse:/tmp$ boom -n 10000 -c 1000 http://localhost:8888
10000 / 10000 Boooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00 %

Summary:
  Total:        15.3071 secs.
  Slowest:      15.2237 secs.
  Fastest:      0.0022 secs.
  Average:      1.1133 secs.
  Requests/sec: 653.2921

Status code distribution:
  [200] 10000 responses

Response time histogram:
  0.002 [1]     |
  1.524 [8480]  |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  3.046 [495]   |∎∎
  4.569 [749]   |∎∎∎
  6.091 [4]     |
  7.613 [123]   |
  9.135 [13]    |
  10.657 [4]    |
  12.179 [0]    |
  13.702 [0]    |
  15.224 [131]  |

Latency distribution:
  10% in 0.1168 secs.
  25% in 0.2253 secs.
  50% in 0.4019 secs.
  75% in 1.2766 secs.
  90% in 3.0859 secs.
  95% in 3.5142 secs.
  99% in 15.1037 secs.


With GOMAXPROCS=1, 5000 concurrent requests:
 
gthomas@waterhouse:/tmp$ boom -n 10000 -c 5000 http://localhost:8888
10000 / 10000 Boooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00 %

Summary:
  Total:        63.6932 secs.
  Slowest:      63.2481 secs.
  Fastest:      0.0021 secs.
  Average:      7.9291 secs.
  Requests/sec: 157.0026

Status code distribution:
  [200] 10000 responses

Response time histogram:
  0.002 [1]     |
  6.327 [5988]  |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  12.651 [1461] |∎∎∎∎∎∎∎∎∎
  18.976 [1355] |∎∎∎∎∎∎∎∎∎
  25.300 [0]    |
  31.625 [567]  |∎∎∎
  37.950 [592]  |∎∎∎
  44.274 [0]    |
  50.599 [0]    |
  56.923 [0]    |
  63.248 [36]   |

Latency distribution:
  10% in 0.3334 secs.
  25% in 0.9190 secs.
  50% in 1.7652 secs.
  75% in 15.1519 secs.
  90% in 31.2858 secs.
  95% in 31.8165 secs.
  99% in 32.1586 secs.


With GOMAXPROCS=8, 5000 concurrent requests (physical CPUs: 4):

gthomas@waterhouse:/tmp$ boom -n 10000 -c 5000 http://localhost:8888
10000 / 10000 Boooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00 %

Summary:
  Total:        64.2826 secs.
  Slowest:      63.5283 secs.
  Fastest:      0.0020 secs.
  Average:      8.7877 secs.
  Requests/sec: 155.5630

Status code distribution:
  [200] 10000 responses

Response time histogram:
  0.002 [1]     |
  6.355 [6796]  |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  12.707 [855]  |∎∎∎∎∎
  19.060 [956]  |∎∎∎∎∎
  25.413 [0]    |
  31.765 [771]  |∎∎∎∎
  38.118 [352]  |∎∎
  44.470 [0]    |
  50.823 [0]    |
  57.176 [0]    |
  63.528 [269]  |∎

Latency distribution:
  10% in 0.5582 secs.
  25% in 1.2234 secs.
  50% in 3.3164 secs.
  75% in 7.7569 secs.
  90% in 31.3622 secs.
  95% in 32.0004 secs.
  99% in 63.3140 secs.


What I was talking about (the simplest handler): http://play.golang.org/p/p-Yab4yN1U
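
(The playground code isn't reproduced in the thread; a "simplest handler" of this kind is presumably just a constant write, roughly:)

package main

import "net/http"

var body = []byte("hello")

// Baseline handler: no templating, no per-request allocation,
// just a fixed byte slice written to the response.
func handler(w http.ResponseWriter, r *http.Request) {
    w.Write(body)
}

func main() {
    http.HandleFunc("/", handler)
    http.ListenAndServe(":8888", nil)
}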


gthomas@waterhouse:/tmp$ boom -n 10000 -c 5000 http://localhost:8888
10000 / 10000 Boooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00 %

Summary:
  Total:        63.9838 secs.
  Slowest:      63.2345 secs.
  Fastest:      0.0025 secs.
  Average:      6.7039 secs.
  Requests/sec: 156.2896

Status code distribution:
  [200] 10000 responses

Response time histogram:
  0.003 [1]     |
  6.326 [6410]  |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  12.649 [1766] |∎∎∎∎∎∎∎∎∎∎∎
  18.972 [1033] |∎∎∎∎∎∎
  25.295 [0]    |
  31.619 [723]  |∎∎∎∎
  37.942 [12]   |
  44.265 [0]    |
  50.588 [0]    |
  56.911 [0]    |
  63.234 [55]   |

Latency distribution:
  10% in 0.4180 secs.
  25% in 1.1932 secs.
  50% in 3.1853 secs.
  75% in 7.4987 secs.
  90% in 15.7431 secs.
  95% in 31.2491 secs.
  99% in 31.5867 secs.




You should cache as much as possible and avoid repeating work that's already cached: http://play.golang.org/p/F85MmNQi1_ (sketched below)
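
(Again, the linked playground code isn't reproduced here, but the idea is straightforward: render the template once at startup into a buffer and serve the resulting bytes on every request. A minimal sketch, reusing the data from the original post:)

package main

import (
    "bytes"
    "html/template"
    "log"
    "net/http"
)

type Post struct {
    Id             int
    Title, Content string
}

type Page struct {
    Title, Subtitle string
    Posts           []Post
}

var cached []byte // the fully rendered page, built once at startup

func init() {
    tmpl := template.Must(template.ParseFiles("index.html"))
    p := Page{Title: "Index Page of My Super Blog", Subtitle: "A blog about everything"}
    for i := 0; i < 100; i++ {
        p.Posts = append(p.Posts, Post{i, "Sample Title", "Lorem Ipsum Dolor Sit Amet"})
    }
    var buf bytes.Buffer
    if err := tmpl.ExecuteTemplate(&buf, "index.html", p); err != nil {
        log.Fatal(err)
    }
    cached = buf.Bytes()
}

func handler(w http.ResponseWriter, r *http.Request) {
    // Template execution is gone from the hot path entirely.
    w.Write(cached)
}

func main() {
    http.HandleFunc("/", handler)
    log.Fatal(http.ListenAndServe(":8888", nil))
}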


gthomas@waterhouse:/tmp$ boom -n 10000 -c 5000 http://localhost:8888
10000 / 10000 Boooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00 %

Summary:
  Total:        1.8361 secs.
  Slowest:      1.6736 secs.
  Fastest:      0.0015 secs.
  Average:      0.6732 secs.
  Requests/sec: 5446.2187

Status code distribution:
  [200] 10000 responses

Response time histogram:
  0.001 [1]     |
  0.169 [26]    |
  0.336 [202]   |∎
  0.503 [1766]  |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.670 [2754]  |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.838 [4308]  |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  1.005 [412]   |∎∎∎
  1.172 [173]   |∎
  1.339 [3]     |
  1.506 [172]   |∎
  1.674 [183]   |∎

Latency distribution:
  10% in 0.4224 secs.
  25% in 0.5773 secs.
  50% in 0.6800 secs.
  75% in 0.7444 secs.
  90% in 0.8201 secs.
  95% in 1.0264 secs.
  99% in 1.6086 secs.

vadi...@gmail.com

Jul 12, 2015, 1:23:14 PM
to golan...@googlegroups.com, vadi...@gmail.com
Thank you for this extensive answer with examples. They really help a lot.

I understand; PHP really does have some "magic inside" like opcache, though I don't fully understand how it works internally.

Your "caching" approach works flawlessly, except there's one point, if this is a web application, even a simple blog where data is not static, seems like at program start (initialization) all data has already been gathered and processed within a template (cached). So e.g. if I request DB data, any data, it will be a frozen cached stack of some old data, right? It's not going to be ok, of course, - in case of a dynamic blog.

So if I need dynamic data, do I have to process it on every request, meaning run it through the template again, meaning slow performance over and over?

Then again, if one needs caching, it seems even simpler to cache the page on the nginx side; the result is the same.
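
(Not from the thread, but one common middle ground between "render on every request" and "frozen cache" is to re-render the cached page on a timer, or whenever the data changes, so requests still serve pre-rendered bytes. A rough sketch; loadPosts is a hypothetical stand-in for a real DB query:)

package main

import (
    "bytes"
    "html/template"
    "log"
    "net/http"
    "sync/atomic"
    "time"
)

type Post struct {
    Id             int
    Title, Content string
}

// loadPosts stands in for fetching fresh data, e.g. a DB query.
func loadPosts() []Post {
    return []Post{{1, "Sample Title", "Lorem Ipsum Dolor Sit Amet"}}
}

var (
    tmpl   = template.Must(template.ParseFiles("index.html"))
    cached atomic.Value // holds []byte; swapped atomically on re-render
)

// render rebuilds the page from current data and swaps in the result.
func render() {
    var buf bytes.Buffer
    if err := tmpl.ExecuteTemplate(&buf, "index.html", loadPosts()); err != nil {
        log.Println(err)
        return
    }
    cached.Store(buf.Bytes())
}

func handler(w http.ResponseWriter, r *http.Request) {
    w.Write(cached.Load().([]byte))
}

func main() {
    render() // populate the cache before serving
    go func() {
        for range time.Tick(time.Second) { // refresh once a second
            render()
        }
    }()
    http.HandleFunc("/", handler)
    log.Fatal(http.ListenAndServe(":8888", nil))
}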

On Sunday, July 12, 2015 at 7:16:21 PM UTC+3, Tamás Gulácsi wrote:

vadi...@gmail.com

Jul 12, 2015, 5:32:36 PM
to golan...@googlegroups.com, vadi...@gmail.com
Also, why are the results for GOMAXPROCS=1, 5000 and GOMAXPROCS=8, 5000 almost the same?
I've heard Go scales well across CPUs just by tuning this option (given available CPU power, of course).

On Sunday, July 12, 2015 at 7:16:21 PM UTC+3, Tamás Gulácsi wrote:

Matt Harden

Jul 12, 2015, 5:40:25 PM
to vadi...@gmail.com, golan...@googlegroups.com
That just means the task is heavily I/O bound, not CPU bound.


linux...@gmail.com

Jul 14, 2015, 1:59:32 AM
to golan...@googlegroups.com, vadi...@gmail.com
You could also stress-test other Go template engines, for example https://github.com/linuxerwang/goats-html. I'm curious how it would perform.

Jesper Louis Andersen

Jul 14, 2015, 5:49:01 AM
to vadi...@gmail.com, golang-nuts

On Sun, Jul 12, 2015 at 11:32 PM, <vadi...@gmail.com> wrote:
Also, why are the results for GOMAXPROCS=1, 5000 and GOMAXPROCS=8, 5000 almost the same?

That is a surefire indicator: something is amiss in the understanding of the problem space. You expected a higher GOMAXPROCS setting to improve the results, but nothing happened, which suggests the bottleneck lies elsewhere, for instance in I/O, as Matt Harden supposes.

Chances are you aren't really comparing equivalent solutions here. In fact, whenever a performance discrepancy exceeds an order of magnitude, you should question the approach itself, because that is far too large to be explained by mere implementation differences.


--
J.