IO performance analysis tools


Danny Robson

Jun 15, 2022, 3:38:40 AM
to mlug

Hi all,

I've recently been running some entirely IO-bound workloads, which is
something I don't have a lot of experience with.

Coupled with my unusual storage architecture, I'm having a difficult
time reasoning about the performance of my system.

Does anyone know of tools that are able to provide insight here?

I'd ideally like to attribute IO statistics to lines of code but
recognise this might be difficult; I'd settle for tools that show
latency and throughput by disk over time.

(For the record, I'm using XFS on bcache with a RAID1 NVMe cache and
RAID5 SATA HDD backing, and suspect something funky is happening on my
SSDs given some _very_ high periodic latencies.)

Cheers,
Danny Robson

AJ

Jun 15, 2022, 4:47:37 AM
to mlu...@googlegroups.com

First thing that pops into my head is looking at S.M.A.R.T. stats.
Not sure how you can query them in real time, and/or whether polling
might itself cause bottlenecks.
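
For what it's worth, smartctl (from smartmontools) can be polled on an
interval; a minimal sketch, assuming the drives show up as /dev/sda and
/dev/nvme0 (adjust to your actual devices):

    # SATA drive: dump SMART attributes (watch reallocated/pending sectors)
    smartctl -A /dev/sda

    # NVMe drive: the health log includes temperature and media errors
    smartctl -a /dev/nvme0

    # Crude "real-time" view: re-read the attributes every 30 seconds
    watch -n 30 smartctl -A /dev/sda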

AJ

Jun 15, 2022, 4:57:30 AM
to mlu...@googlegroups.com


https://linoxide.com/linux-iostat-command/
This stuff looks interesting, but it sounds more like you want
mid-level stats, which doesn't sound like something that would be
standardised for your custom setup. Most things are going to be either
too low level or too high level, and that's not going to reveal what
the IO latency culprit is.
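
For per-device latency and throughput over time, something like this is
the usual starting point; a sketch, assuming the sysstat package is
installed (device names below are placeholders):

    # Extended per-device stats refreshed every second:
    # r_await/w_await = average read/write latency (ms), %util = busy time
    iostat -x 1

    # Restrict output to the devices of interest
    iostat -x 1 nvme0n1 nvme1n1 sda sdb sdc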

<non technical> can't you "just" throw better/faster hardware at the
problem? </>


Line-by-line latency stats sound like everything is blocking/sync...
Y U NO USE io_uring/non-blocking/async IO?! Especially with SSDs
having queue depth > 1.
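
If you want to see what the hardware can actually do at real queue
depth, fio can drive it; a rough sketch, assuming a fio build with
io_uring support (the filename is a placeholder):

    # Random 4k reads at queue depth 32 via io_uring; direct IO bypasses
    # the page cache; fio reports latency percentiles and throughput
    fio --name=qd32 --ioengine=io_uring --rw=randread --bs=4k \
        --iodepth=32 --direct=1 --size=1G --runtime=30 --time_based \
        --filename=/path/to/testfile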

zak martell

Jun 15, 2022, 5:04:55 AM
to mlu...@googlegroups.com
Hi Danny,

What kind of IO statistics are you looking for? You can't really do
lines of code; you can likely do function-level attribution though.

It also depends on the underlying operating system if you're looking
for more kernel-level things. I know this is a Linux user group, but
Windows, for example, has much richer stats/reporting capabilities.
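
For function/process-level attribution on Linux, eBPF tooling is
probably the closest you'll get; a sketch, assuming bpftrace and the
bcc tools are installed (paths and tool names vary by distro):

    # Count block IO requests by issuing process
    bpftrace -e 'tracepoint:block:block_rq_issue { @[comm] = count(); }'

    # Block IO latency histograms, one per disk, every 5 seconds
    /usr/share/bcc/tools/biolatency -D 5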


Duncan Roe

Jun 15, 2022, 5:13:04 AM
to mlu...@googlegroups.com
Hi Danny,
Is iotop any help? Home page http://guichaz.free.fr/iotop/
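
A sketch of the flags I find useful, assuming iotop is installed (it
needs root):

    # -o shows only processes actually doing IO, -P groups by process
    # rather than thread, -a accumulates totals since start
    sudo iotop -oPa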

Cheers ... Duncan.

Michael Pope

Jun 15, 2022, 6:14:31 AM
to mlu...@googlegroups.com
Danny,

Maybe using bonnie++ or 'hdparm -t' might help, as they give some stats about the speed/throughput of your drives. A lot of people use these tools so you could compare with other people. I know your setup is unusual, but if something here is really out of whack it should show up: your array ought to be faster than a single drive's stats.
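
For reference, the invocations look something like this (the mount
point is a placeholder; bonnie++ refuses to run as root without -u):

    # Buffered sequential read speed straight off the device
    sudo hdparm -t /dev/sda

    # Add -T to also measure cached reads for comparison
    sudo hdparm -tT /dev/sda

    # bonnie++: run against a directory on the XFS filesystem
    bonnie++ -d /mnt/scratch -u $(whoami)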

from
Mick

zak martell

Jun 15, 2022, 6:26:16 AM
to mlu...@googlegroups.com
Danny, 

Honestly, if you are dealing with "very high latencies" you are likely not even hitting the cache, or the cache is filling up so much that IO falls back to the spinning disks. I would look at how the cache is set up, and check whether your I/O volume exceeds what the NVMe cache can absorb.

Can you also just run the workload on the NVMe RAID 1 pool directly, ignoring the SATA HDDs? Your app might be waiting on the spinning disks at times.
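
bcache exposes hit/miss counters in sysfs, which would confirm this; a
sketch, assuming the device shows up as bcache0:

    # Cache hit ratio (percent) over a couple of time windows
    cat /sys/block/bcache0/bcache/stats_five_minute/cache_hit_ratio
    cat /sys/block/bcache0/bcache/stats_hour/cache_hit_ratio

    # Current cache mode (writethrough/writeback/writearound/none)
    cat /sys/block/bcache0/bcache/cache_mode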



Darren Wurf

Jun 15, 2022, 8:06:08 PM
to mlu...@googlegroups.com
iostat and sar would be my go-to tools here
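
e.g., assuming the sysstat package:

    # Per-device activity every second, ten samples; -p resolves device
    # names (sda, nvme0n1) instead of major/minor numbers
    sar -d -p 1 10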

Richard C

Jun 16, 2022, 11:01:38 PM
to mlug-au
A quick note that high-performance I/O is generally asynchronous these days (unless you plan on rebuilding much of the kernel infrastructure). Assuming it's async IO, it's not possible to relate actual hardware behaviour to lines of code.

strace (and similar tools) will uncover the system calls your application is making, and examining them may shed some light (regardless of sync/async IO).
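
For example, something like this (the PID is a placeholder):

    # -T shows time spent in each syscall, -f follows forks; limit the
    # trace to IO-ish calls and attach to the running process
    sudo strace -f -T -e trace=read,write,pread64,pwrite64,fsync \
        -p 12345

    # Or a summary table of syscall counts and cumulative time
    sudo strace -c -p 12345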

Building little benchmark scripts to simulate different IO patterns (be sure to keep detailed logs of their behaviour) will help uncover what's going on and give you a path forward.
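
As a trivial starting point, even timing dd with direct IO can separate
the cache tier from the backing array; a sketch, with a placeholder
path on the filesystem under test:

    # Sequential 1 GiB writes, bypassing the page cache, repeated so
    # the periodic latency spikes have a chance to show up; dd prints
    # its throughput line on stderr
    for i in 1 2 3 4 5; do
        dd if=/dev/zero of=/mnt/scratch/probe bs=1M count=1024 \
            oflag=direct 2>&1 | tail -n1
    done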

Good luck,

Richard
