Realizing SSD random read IOPS

Manish Rai Jain

May 16, 2017, 7:59:50 AM

to golang-nuts

Hey guys,

We wrote this simple program to try to match what fio (a Linux I/O benchmarking tool) does. Fio can easily achieve 100K IOPS on an Amazon i3.large instance with an NVMe SSD. However, with Go we're unable to get anywhere close to that.

https://github.com/dgraph-io/badger-bench/blob/master/randread/main.go

This program should be simple to run. It uses fio-generated files and basically tries three things: 1. random reads in a single goroutine (turned off by default), 2. random reads using a specified number of goroutines, 3. same as 2, but using a channel.

3 is slower than 2 (of course). But 2 is never able to achieve the IOPS that fio can achieve. I've tried other things, with no luck. What I notice is that Go and fio are close to each other as long as the number of goroutines is <= the number of cores. Once you exceed the number of cores, Go stays put, while fio's IOPS keeps improving until it reaches the SSD's limits.

So, how could I change my Go program to realize the true throughput of an SSD? Or is this something that needs further work in Go (I saw a thread about libaio)?

Cheers,

Manish

Dave Cheney

May 16, 2017, 8:26:00 AM

to golang-nuts

I'd start with the execution profile, especially how many goroutines are running concurrently. Your workload may be accidentally sequential due to the interaction between the scheduler and the sysmon background thread.

Ian Lance Taylor

May 16, 2017, 9:36:48 AM

to Manish Rai Jain, golang-nuts

On Tue, May 16, 2017 at 4:59 AM, Manish Rai Jain <manis...@gmail.com> wrote:
>
> 3 is slower than 2 (of course). But, 2 is never able to achieve the IOPS
> that Fio can achieve. I've tried other things, to no luck. What I notice is
> that Go and Fio are close to each other as long as number of Goroutines is
> <= number of cores. Once you exceed cores, Go stays put, while Fio IOPS
> keeps on improving, until it reaches SSD thresholds.

One thing I notice about your program is that each goroutine is
calling rand.Intn and rand.Int63n. Those functions acquire and
release a lock, so that single lock is being contested by every
goroutine. That's an unfortunate and unnecessary slowdown. Give each
goroutine its own source of pseudo-random numbers by using rand.New.
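
A minimal sketch of that suggestion, not the code from the linked benchmark: each goroutine builds a private generator with rand.New, so the lock inside the package-level functions is never touched. The function name randomOffsets, the seeds, and the sizes are illustrative assumptions.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// randomOffsets (hypothetical helper) runs `workers` goroutines. Each one draws
// `perWorker` block-aligned offsets in [0, fileSize) from its own *rand.Rand.
func randomOffsets(workers, perWorker int, fileSize, blockSize int64) [][]int64 {
	out := make([][]int64, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// Private generator: only this goroutine ever uses it.
			rng := rand.New(rand.NewSource(int64(id) + 1))
			offs := make([]int64, perWorker)
			for i := range offs {
				offs[i] = rng.Int63n(fileSize/blockSize) * blockSize
			}
			out[id] = offs // each goroutine writes only its own slot
		}(w)
	}
	wg.Wait()
	return out
}

func main() {
	offs := randomOffsets(4, 3, 1<<20, 4096)
	fmt.Println(len(offs), len(offs[0])) // prints 4 3
}
```

A *rand.Rand created this way is not safe for concurrent use, which is fine here because it never leaves the goroutine that created it.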

You also have a point of contention on the local variable i, which you
are manipulating using atomic functions. It would be cheaper to give
each goroutine a number of operations to do rather than to compute
that dynamically using a contended address.
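
One way to do that, sketched under the assumption that the total number of reads is known up front: divide the work before starting the goroutines, so no shared counter exists at all. splitOps and runWorkers are hypothetical names, not functions from the benchmark.

```go
package main

import (
	"fmt"
	"sync"
)

// splitOps divides `total` operations across `workers` as evenly as possible.
func splitOps(total, workers int) []int {
	counts := make([]int, workers)
	for i := range counts {
		counts[i] = total / workers
		if i < total%workers {
			counts[i]++ // the first total%workers workers take one extra
		}
	}
	return counts
}

// runWorkers starts one goroutine per worker; each runs op exactly as many
// times as it was assigned, without consulting any shared state.
func runWorkers(total, workers int, op func(worker int)) []int {
	counts := splitOps(total, workers)
	var wg sync.WaitGroup
	for w, n := range counts {
		wg.Add(1)
		go func(id, n int) {
			defer wg.Done()
			for i := 0; i < n; i++ {
				op(id)
			}
		}(w, n)
	}
	wg.Wait()
	return counts
}

func main() {
	done := make([]int, 3)
	// Each worker increments only its own element, so no atomics are needed.
	counts := runWorkers(10, 3, func(worker int) { done[worker]++ })
	fmt.Println(counts, done) // prints [4 3 3] [4 3 3]
}
```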

I'll also note that if a program that should be I/O bound shows a
behavior change when the number of parallel goroutines exceeds the
number of CPUs, then it might be interesting to try setting GOMAXPROCS
to be higher. I don't know what effect that would have here, but it's
worth checking.
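
For reference, GOMAXPROCS can be set either through the environment variable of the same name or from inside the program with runtime.GOMAXPROCS. The sketch below shows the in-program form; setProcs is a hypothetical helper, and the multiplier of 4 is an arbitrary starting point rather than a recommendation.

```go
package main

import (
	"fmt"
	"runtime"
)

// setProcs changes GOMAXPROCS and returns the value now in effect.
func setProcs(n int) int {
	runtime.GOMAXPROCS(n)        // returns the previous setting, ignored here
	return runtime.GOMAXPROCS(0) // an argument of 0 only queries
}

func main() {
	fmt.Println("cores:", runtime.NumCPU(), "default GOMAXPROCS:", runtime.GOMAXPROCS(0))
	// Arbitrary factor for an I/O-bound experiment; tune by measurement.
	fmt.Println("now:", setProcs(4*runtime.NumCPU()))
}
```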

Ian

Manish Rai Jain

May 16, 2017, 8:39:22 PM

to Ian Lance Taylor, golang-nuts

So, I fixed the rand and removed the atomics usage (link in my original post).

Setting GOMAXPROCS definitely helped a lot. And now it seems to make sense, because (the following command in) fio spawns 16 threads; and GOMAXPROCS would do the same thing. However, the numbers are still quite a bit off.

I realized fio seems to overestimate, and my Go program seems to underestimate, so we used sar to determine the IOPS.

$ fio --name=randread --ioengine=psync --iodepth=32 --rw=randread --bs=4k --direct=0 --size=2G --numjobs=16 --runtime=120 --group_reporting

This gives around 62K IOPS, measured via sar -d 1 -p, while

$ go build . && GOMAXPROCS=16 ./randread --dir ~/diskfio --jobs 16 --num 2000000 --mode 1

gives around 44K IOPS, also measured via sar. My machine has 4 cores.

Note that this is way better than the earlier 20K with GOMAXPROCS = number of cores, but still leaves much to be desired.

Manish Rai Jain

May 16, 2017, 11:05:49 PM

to Ian Lance Taylor, golang-nuts

On further thought about GOMAXPROCS, and its impact on throughput:

A file::pread would block the OS thread. Go runs one OS thread per core. So, if an OS thread is blocked, no goroutines can be scheduled on this thread, therefore even pure CPU operations can't be run. This would lead to core wastage.

This is probably the reason why increasing GOMAXPROCS improves throughput, and running any number of goroutines >= GOMAXPROCS has little impact on anything. The underlying OS threads are already blocked, so goroutines can't do much.

If this logic is valid, then a complex system that does many random reads while also performing calculations (like Dgraph) would suffer, even if we set GOMAXPROCS to some multiple of the number of cores.

Ideally, the disk reads could be happening via libaio, causing the OS threads to not block, so all goroutines can make progress, increasing the number of read requests that can be made concurrently. This would then also ensure that one doesn't need to set GOMAXPROCS to a value greater than number of cores to achieve higher throughput.

Dave Cheney

May 17, 2017, 12:01:39 AM

to golang-nuts, ia...@golang.org

> So, if an OS thread is blocked, no goroutines can be scheduled on this thread, therefore even pure CPU operations can't be run.

The runtime will spawn a new thread to replace the one that is blocked.

Manish Rai Jain

May 17, 2017, 12:27:52 AM

to Dave Cheney, golang-nuts, Ian Lance Taylor

> The runtime will spawn a new thread to replace the one that is blocked.

Realized that after writing my last mail. And that actually explains some of the other crashes we saw, about "too many threads", if we run tens of thousands of goroutines to do these reads, one goroutine per read.
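
A common way to avoid that failure, sketched here as an assumption rather than what Dgraph does: cap the number of reads that may block in the kernel at once with a buffered channel used as a semaphore. The Go runtime aborts a program that exceeds its thread limit (10,000 by default, adjustable with debug.SetMaxThreads), so bounding in-flight reads keeps the thread count well below it. readAll and makeTempFile are hypothetical helpers.

```go
package main

import (
	"fmt"
	"os"
	"sync"
)

// readAll issues one ReadAt per offset, but allows at most maxInFlight of
// them to be blocked in a syscall at the same time.
func readAll(f *os.File, offsets []int64, blockSize, maxInFlight int) (int, error) {
	sem := make(chan struct{}, maxInFlight)
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		total    int
		firstErr error
	)
	for _, off := range offsets {
		sem <- struct{}{} // waits here while maxInFlight reads are pending
		wg.Add(1)
		go func(off int64) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when this read finishes
			buf := make([]byte, blockSize)
			n, err := f.ReadAt(buf, off)
			mu.Lock()
			total += n
			if err != nil && firstErr == nil {
				firstErr = err
			}
			mu.Unlock()
		}(off)
	}
	wg.Wait()
	return total, firstErr
}

// makeTempFile creates a zero-filled scratch file and a cleanup function.
func makeTempFile(size int) (*os.File, func(), error) {
	f, err := os.CreateTemp("", "randread-*")
	if err != nil {
		return nil, nil, err
	}
	cleanup := func() { f.Close(); os.Remove(f.Name()) }
	if _, err := f.Write(make([]byte, size)); err != nil {
		cleanup()
		return nil, nil, err
	}
	return f, cleanup, nil
}

func main() {
	const blockSize, blocks = 4096, 16
	f, cleanup, err := makeTempFile(blockSize * blocks)
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer cleanup()
	offsets := make([]int64, 1000)
	for i := range offsets {
		offsets[i] = int64(i%blocks) * blockSize
	}
	n, err := readAll(f, offsets, blockSize, 8)
	fmt.Println("bytes read:", n, "error:", err)
}
```

The limit of 8 is arbitrary; a fixed pool of long-lived worker goroutines fed from a channel would bound the thread count in the same way.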

It is obviously a lot more expensive to spawn a new OS thread. It seems like this exact same problem was already solved for network I/O via the netpoller (https://morsmachine.dk/netpoller). Blocking OS threads for disk reads made sense for HDDs, which could only do 200 IOPS; for SSDs we'd need a solution based on async I/O.

--
You received this message because you are subscribed to a topic in the Google Groups "golang-nuts" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/golang-nuts/jPb_h3TvlKE/unsubscribe.
To unsubscribe from this group and all its topics, send an email to golang-nuts+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ian Lance Taylor

May 17, 2017, 1:01:50 AM

to Manish Rai Jain, golang-nuts

On Tue, May 16, 2017 at 8:04 PM, Manish Rai Jain <manis...@gmail.com> wrote:
>
> Ideally, the disk reads could be happening via libaio, causing the OS
> threads to not block, so all goroutines can make progress, increasing the
> number of read requests that can be made concurrently. This would then also
> ensure that one doesn't need to set GOMAXPROCS to a value greater than
> number of cores to achieve higher throughput.

libaio sounds good on paper, but at least on GNU/Linux it's all in
user space. In effect it does exactly what the Go runtime does
already: it hands file I/O operations off to separate threads. The Go
runtime would gain nothing at all by switching to using libaio.

Ian

Ian Lance Taylor

May 17, 2017, 1:03:32 AM

to Manish Rai Jain, Dave Cheney, golang-nuts

On Tue, May 16, 2017 at 9:26 PM, Manish Rai Jain <manis...@gmail.com> wrote:
>> The runtime will spawn a new thread to replace the one that is blocked.
>
> Realized that after writing my last mail. And that actually explains some of
> the other crashes we saw, about "too many threads", if we run tens of
> thousands of goroutines to do these reads, one goroutine per read.
>
> It is obviously lot more expensive to spawn a new OS thread. It seems like
> this exact same problem was already solved for network via netpoller
> (https://morsmachine.dk/netpoller). Blocking OS threads for disk reads made
> sense for HDDs, which could only do 200 IOPS; for SSDs we'd need a solution
> based on async I/O.

Note that in the upcoming Go 1.9 release we now use the netpoller for
the os package as well. However, it's not as effective as one would
hope, because on GNU/Linux you can't use epoll for disk files. It
mainly helps with pipes.

Ian

Dave Cheney

May 17, 2017, 1:25:57 AM

to David Klempner, Ian Lance Taylor, golang-nuts, Manish Rai Jain

Rather than guessing what is going on, I think it's time to break out the profiling tools Manish.

On Wed, 17 May 2017, 15:23 David Klempner <klem...@google.com> wrote:

> There's a not very well documented API to make AIO completions kick an eventfd.

Manish Rai Jain

May 17, 2017, 3:30:47 AM

to Dave Cheney, David Klempner, Ian Lance Taylor, golang-nuts

> libaio sounds good on paper, but at least on GNU/Linux it's all in user space.

I see. That makes sense. Reading a bit more, Linux native I/O sounds like it does exactly what we expect, i.e. save OS threads, and push this to kernel: http://man7.org/linux/man-pages/man2/io_submit.2.html

But, I suppose this can't be part of Go, because it's not portable. Is my understanding correct?

Also, any explanations about why GOMAXPROCS causes throughput to increase, if new OS threads are being spawned by blocked goroutines anyway? I thought I understood it before but now I don't.

Dave, profiler doesn't show any issues with the code itself. It's just blocked waiting on syscalls.

$ go tool pprof randread /tmp/profile398062565/cpu.pprof ~/go/src/github.com/dgraph-io/badger-bench/randread
Entering interactive mode (type "help" for commands)
(pprof) top
19.48s of 19.76s total (98.58%)
Dropped 27 nodes (cum <= 0.10s)
      flat  flat%   sum%        cum   cum%
    19.34s 97.87% 97.87%     19.52s 98.79%  syscall.Syscall6
     0.07s  0.35% 98.23%      0.11s  0.56%  runtime.exitsyscall
     0.03s  0.15% 98.38%     19.56s 98.99%  os.(*File).ReadAt
     0.02s   0.1% 98.48%      0.10s  0.51%  math/rand.(*Rand).Intn
     0.01s 0.051% 98.53%     19.70s 99.70%  main.Conc2.func1
     0.01s 0.051% 98.58%     19.53s 98.84%  syscall.Pread
         0     0% 98.58%      0.13s  0.66%  main.getIndices
         0     0% 98.58%     19.53s 98.84%  os.(*File).pread
         0     0% 98.58%     19.70s 99.70%  runtime.goexit
(pprof)

$ go tool pprof randread /tmp/profile192709852/block.pprof ~/go/src/github.com/dgraph-io/badger-bench/randread
Entering interactive mode (type "help" for commands)
(pprof) top
58.48s of 58.48s total ( 100%)
Dropped 8 nodes (cum <= 0.29s)
      flat  flat%   sum%        cum   cum%
    58.48s   100%   100%     58.48s   100%  sync.(*WaitGroup).Wait
         0     0%   100%     58.48s   100%  main.Conc2
         0     0%   100%     58.48s   100%  main.main
         0     0%   100%     58.48s   100%  runtime.goexit
         0     0%   100%     58.48s   100%  runtime.main
(pprof)

Dave Cheney

May 17, 2017, 3:37:43 AM

to Manish Rai Jain, David Klempner, Ian Lance Taylor, golang-nuts

Can you post the svg versions of those profiles?

Also, I recommend the execution trace profiler for this job, it'll show you a lot of detail about how the runtime is interacting with your program.
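
For anyone following along, a minimal sketch of capturing an execution trace with the standard runtime/trace package. traceWork is a hypothetical helper and the traced workload is a stand-in; in the benchmark it would wrap the random-read loop instead.

```go
package main

import (
	"fmt"
	"os"
	"runtime/trace"
	"sync"
)

// traceWork records an execution trace of work() into a temporary file and
// returns the file's path and size.
func traceWork(work func()) (string, int64, error) {
	f, err := os.CreateTemp("", "randread-*.trace")
	if err != nil {
		return "", 0, err
	}
	defer f.Close()
	if err := trace.Start(f); err != nil {
		return "", 0, err
	}
	work()
	trace.Stop() // flushes all buffered trace data to f
	info, err := f.Stat()
	if err != nil {
		return "", 0, err
	}
	return f.Name(), info.Size(), nil
}

func main() {
	path, size, err := traceWork(func() {
		// Stand-in workload: a few goroutines that start and finish.
		var wg sync.WaitGroup
		for i := 0; i < 8; i++ {
			wg.Add(1)
			go func() { defer wg.Done() }()
		}
		wg.Wait()
	})
	if err != nil {
		fmt.Println("trace failed:", err)
		return
	}
	fmt.Printf("wrote %d bytes; inspect with: go tool trace %s\n", size, path)
}
```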

David Klempner

May 17, 2017, 8:49:55 AM

to Ian Lance Taylor, Dave Cheney, golang-nuts, Manish Rai Jain

There's a not very well documented API to make AIO completions kick an eventfd.

Ian Lance Taylor

May 17, 2017, 10:36:44 AM

to Manish Rai Jain, Dave Cheney, David Klempner, golang-nuts

On Wed, May 17, 2017 at 12:29 AM, Manish Rai Jain <manis...@gmail.com> wrote:
>
>> libaio sounds good on paper, but at least on GNU/Linux it's all in user
>> space.
>
> I see. That makes sense. Reading a bit more, Linux native I/O sounds like it
> does exactly what we expect, i.e. save OS threads, and push this to kernel:
> http://man7.org/linux/man-pages/man2/io_submit.2.html
> But, I suppose this can't be part of Go, because it's not portable. Is my
> understanding correct?

We could use io_submit and friends on GNU/Linux. We want to provide a
consistent API to Go code, but the internal code can be different on
different operating systems. For example the implementations on
Windows and Unix systems are of course quite different.

It's not obvious to me that io_submit would be a win for normal
programs, but if anybody wants to try it out and see that would be
great.

> Also, any explanations about why GOMAXPROCS causes throughput to increase,
> if new OS threads are being spawned by blocked goroutines anyway? I thought
> I understood it before but now I don't.

My guess is that it's the timing. The current runtime doesn't spawn a
new OS thread until an existing thread has been blocked in a syscall
for 20us or more. Having more threads ready to go avoids that delay.

I agree with Dave that looking at the execution tracer is likely to help.

Ian

Manish Rai Jain

May 19, 2017, 6:27:32 AM

to Ian Lance Taylor, Dave Cheney, David Klempner, golang-nuts

Sorry for the delay in replying. Got busy with a presentation at Go meetup.

> I agree with Dave that looking at the execution tracer is likely to help.

I tried to run it, but nothing renders in Chrome (running on Arch Linux). The regular about:tracing page works, but the trace viewer doesn't. And there isn't much documentation to troubleshoot with.

> It's not obvious to me that io_submit would be a win for normal
> programs, but if anybody wants to try it out and see that would be
> great.

Yeah, my hunch is that the cost of threads context switching is going to be a hindrance to achieving the true throughput of SSDs. So, I'd like to try it out. A few guiding pointers would be useful:

- This can be done directly via Syscall and Syscall6, is that right? Or should I use Cgo?

- I see SYS_IO_SUBMIT in syscall package. But, no aio_context_t, or iocbpp structs in the package.

- Similarly, other structs for io_getevents etc.

- What's the best way to generate them, so syscall.Syscall would accept these?

Ian Lance Taylor

May 19, 2017, 3:32:06 PM

to Manish Rai Jain, Dave Cheney, David Klempner, golang-nuts

On Fri, May 19, 2017 at 3:26 AM, Manish Rai Jain <manis...@gmail.com> wrote:
>
>> It's not obvious to me that io_submit would be a win for normal
>> programs, but if anybody wants to try it out and see that would be
>> great.
>
> Yeah, my hunch is that the cost of threads context switching is going to be
> a hindrance to achieving the true throughput of SSDs. So, I'd like to try it
> out. A few guiding pointers would be useful:
>
> - This can be done directly via Syscall and Syscall6, is that right? Or
> should I use Cgo?

You should be able to use syscall.Syscall.

> - I see SYS_IO_SUBMIT in syscall package. But, no aio_context_t, or iocbpp
> structs in the package.
> - Similarly, other structs for io_getevents etc.
> - What's the best way to generate them, so syscall.Syscall would accept
> these?

The simplest way is to get them via cgo. The better way is to add
them to the x/sys/unix package as described at
https://github.com/golang/sys/blob/master/unix/README.md .

Ian

Manish Rai Jain

Aug 7, 2017, 4:40:03 AM

to Ian Lance Taylor, Dave Cheney, golang-nuts

Hey folks,

Just wanted to update the status of this.

During Gophercon, I happened to meet Russ Cox and asked him the same question. If File::Read blocks goroutines, which then spawn new OS threads, in a long running job, there should be plenty of OS threads created already, so the random read throughput should increase over time and stabilize to the maximum possible value. But, that's not what I see in my benchmarks.

And his explanation was that the GOMAXPROCS in a way acts like a multiplexer. From docs, "the GOMAXPROCS variable limits the number of operating system threads that can execute user-level Go code simultaneously." Which basically means, all reads must first be run only via GOMAXPROCS number of goroutines, before switching over to some OS thread (not really a switch, but conceptually speaking). This introduces a bottleneck for throughput.

I re-ran my benchmarks with a much higher GOMAXPROCS and was able to then achieve the maximum throughput. The numbers are here:

https://github.com/dgraph-io/badger-bench/blob/master/randread/maxprocs.txt

To summarize these benchmarks, Linux fio achieves 118K IOPS, and with GOMAXPROCS=64/128, I'm able to achieve 105K IOPS, which is close enough. Win!

Regarding the point about using io_submit etc. instead of goroutines: I managed to find a library which does that, but it performed worse than just using goroutines.

https://github.com/traetox/goaio/issues/3

From what I gather (talking to Russ and Ian), whatever work is going on in user space, the same work has to happen in kernel space; so there's not much benefit here.

Overall, with GOMAXPROCS set to a higher value (as I've done in Dgraph), one can get the advertised SSD throughput using goroutines.
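
Putting the pieces of this thread together, here is a rough sketch of the final approach: raise GOMAXPROCS, then drive random reads from many goroutines and compute reads per second. It is not the badger-bench code; measureIOPS and newScratchFile are hypothetical names, and the small scratch file will be served from the page cache, so the printed figure says nothing about a real SSD. Point it at a large fio-generated file to get a meaningful number.

```go
package main

import (
	"fmt"
	"math/rand"
	"os"
	"runtime"
	"sync"
	"time"
)

// measureIOPS performs about `total` random blockSize reads on f using `jobs`
// goroutines (total/jobs reads each) and returns completed reads per second.
func measureIOPS(f *os.File, fileSize int64, jobs, total, blockSize int) (float64, error) {
	blocks := fileSize / int64(blockSize)
	perJob := total / jobs
	errs := make([]error, jobs)
	var wg sync.WaitGroup
	start := time.Now()
	for j := 0; j < jobs; j++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			rng := rand.New(rand.NewSource(int64(id) + 1)) // private generator
			buf := make([]byte, blockSize)
			for i := 0; i < perJob; i++ {
				off := rng.Int63n(blocks) * int64(blockSize)
				if _, err := f.ReadAt(buf, off); err != nil {
					errs[id] = err
					return
				}
			}
		}(j)
	}
	wg.Wait()
	elapsed := time.Since(start)
	for _, err := range errs {
		if err != nil {
			return 0, err
		}
	}
	return float64(perJob*jobs) / elapsed.Seconds(), nil
}

// newScratchFile creates a zero-filled temporary file and a cleanup function.
func newScratchFile(size int) (*os.File, func(), error) {
	f, err := os.CreateTemp("", "randread-*")
	if err != nil {
		return nil, nil, err
	}
	cleanup := func() { f.Close(); os.Remove(f.Name()) }
	if _, err := f.Write(make([]byte, size)); err != nil {
		cleanup()
		return nil, nil, err
	}
	return f, cleanup, nil
}

func main() {
	runtime.GOMAXPROCS(64) // the higher setting reported above
	const size = 1 << 20
	f, cleanup, err := newScratchFile(size)
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer cleanup()
	iops, err := measureIOPS(f, size, 64, 64000, 4096)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Printf("%.0f reads/sec (page cache, not disk)\n", iops)
}
```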

Thanks to Ian, Russ and the Go community for helping solve this problem!
