Monitoring files on an Amazon EBS-backed device

kimb...@gmail.com

Jun 15, 2017, 12:17:51 PM
to sysdig
Hi,

I am working on a performance issue and want to see a list of files and how many reads are being executed against them. On Linux (CentOS 7) this is surprisingly difficult!

That said, I've been using sysdig for other things and it looks like it will do the job. However, whilst I know my device is being read from (iotop, iostat, etc. show that), sysdig does not seem to report anything for this device. The device is an EBS volume mounted on an Amazon instance. The normal root partition is reported on; trouble is, my Solr index files aren't on that device!

Does anybody know of anything special I need to do to get stats from the EBS backed device or point out something I'm doing wrong?

For some context, the sysdig commands I'm trying are:

sudo sysdig -c topfiles_bytes
sudo sysdig -p "%user.name %proc.name %fd.name %fd.directory" "evt.type=read" -w writetrace.scap
sudo sysdig -G 600 -W 1 -w dump.scap evt.is_io_read=true

None of them ever pick up the device (xvdh), the directory (/data/), or any files being accessed under /data, etc.

Thanks

Mike

kimb...@gmail.com

Jun 21, 2017, 8:42:42 AM
to sysdig, kimb...@gmail.com
Assume silence means I'm on my own!

Gianluca Borello

Jun 21, 2017, 12:15:43 PM
to sysdig
Hi,

Sysdig should clearly be able to see every I/O system call that is made against your system, regardless of where the files live; I myself use it against EBS volumes all the time.

The first thing that comes to mind is: is it possible that your application is doing memory-mapped I/O? In that case, files are read by the kernel while your application directly accesses the memory, as a result of page faults. Those would show up under iotop but not under sysdig, since at the moment we're heavily system-call oriented and memory-mapped I/O is transparent from that point of view. We could probably work on adding more system events that interact directly with the page cache and things like that, but it's not trivial.

So, can you check with lsof? You should be able to see if your files under /data show up as "mem" rather than with a proper fd. If that's not the case, we can proceed with some more troubleshooting (e.g. look for more obscure I/O system calls such as splice/sendfile/...).

Thanks



kimb...@gmail.com

Jun 22, 2017, 5:24:49 AM
to sysdig
Gianluca,

Thanks for getting back to me. Your analysis would seem to be correct:

* I'm seeing writes using sysdig
* "sudo lsof -d mem -p 4080 | grep data" gives me a list of the files I'm looking to see reads for and the ones I have not been able to see read events for using sysdig are memory-mapped

For reference, this is a SolrCloud cluster which is multi-tenanted. I'm trying to debug a performance issue with one tenant that seems to be disk-access related; hence the need to identify the files being read, as they identify the tenant, whereas the process is a generic "java" for all I/O (whether it comes from the disk or the disk cache).

I've been able to identify cache misses using https://github.com/brendangregg/perf-tools/blob/master/examples/cachestat_example.txt but again it's not fine-grained enough to identify the specific Solr tenant that's causing the cache misses.

https://github.com/tobert/pcstat has got me a file-level view of what's cached, but you need to know what files you're looking for first, which again is hard in a multi-tenant environment.

The BCC tools, e.g. https://github.com/iovisor/bcc/blob/master/tools/filetop_example.txt, seem to give me what I want, but I'd need to upgrade to kernel 4.2 to use them, which I don't want to do.

It would be great if sysdig could incorporate some of that functionality at some stage, so we'd have one tool that covers the majority of what we need and we don't have to upgrade our kernel. Unfortunately, I suspect enhancing sysdig myself is beyond my abilities!

As I said in my original post, identifying which files are being hit for reads on Linux seems to be surprisingly difficult.

Again thank you for your help

Mike

Gianluca Borello

Jun 23, 2017, 1:49:05 AM
to sysdig
On Thu, Jun 22, 2017 at 2:24 AM, <kimb...@gmail.com> wrote:

> https://github.com/tobert/pcstat has got me a file-level view of what's cached, but you need to know what files you're looking for first, which again is hard in a multi-tenant environment.


Thank you for sharing this tool. I admit I wasn't at all familiar with the mincore() system call (which is what this tool uses internally), what a treat! But I agree it's not enough to solve your use case, although you could probably script-generate a list of the files in your directory and pass it to the tool.
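
For anyone else who's curious, the trick is small enough to sketch in a few lines of C. This is a minimal, hypothetical illustration rather than pcstat's actual code:

/*
 * residency.c - hypothetical sketch of the mincore() trick: map the
 * file, then ask the kernel which of its pages are resident in the
 * page cache. Build with: gcc -o residency residency.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* mapping alone doesn't fault anything in, so the check is non-intrusive */
    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    long page = sysconf(_SC_PAGESIZE);
    size_t pages = (st.st_size + page - 1) / page;
    unsigned char *vec = malloc(pages);

    /* mincore() sets bit 0 of vec[i] if page i is in the page cache */
    if (mincore(p, st.st_size, vec) < 0) { perror("mincore"); return 1; }

    size_t resident = 0;
    for (size_t i = 0; i < pages; i++)
        resident += vec[i] & 1;

    printf("%s: %zu of %zu pages resident in the page cache\n",
           argv[1], resident, pages);
    return 0;
}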
 
> The BCC tools, e.g. https://github.com/iovisor/bcc/blob/master/tools/filetop_example.txt, seem to give me what I want, but I'd need to upgrade to kernel 4.2 to use them, which I don't want to do.


I am quite familiar with the BCC tools myself, but I don't think they would help you in this instance. In particular, the filetop tool just attaches to vfs_read/vfs_write (https://github.com/iovisor/bcc/blob/master/tools/filetop.py#L157), which is essentially the same as intercepting the I/O system calls: you won't be able to see page-fault-driven I/O like in your case. That being said, if you write a custom script using BCC/eBPF, you will be able to solve your problem (but it might not be necessary; more on this in the next point).
 
> It would be great if sysdig could incorporate some of that functionality at some stage, so we'd have one tool that covers the majority of what we need and we don't have to upgrade our kernel. Unfortunately, I suspect enhancing sysdig myself is beyond my abilities!


I think there is something you can do just by using perf (and potentially sysdig). Here I have a little test program that, when executed, reads a big file sequentially, byte by byte, after memory-mapping it first:

gianluca@sid:~$ ./mmap ./bigfile
Read 0 MB
Read 1 MB
Read 2 MB
Read 3 MB
Read 4 MB
...
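
(I'm not including the program's actual source here, but a minimal C sketch with the same behavior would look something like this; the per-MB progress printing just mirrors the output above:)

/*
 * mmap_read.c - minimal sketch: map a file read-only and touch every
 * byte, printing progress once per MB.
 * Build with: gcc -O2 -o mmap mmap_read.c
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* PROT_READ + MAP_SHARED produces the "r--s" mapping shown in /proc/PID/maps */
    const char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    unsigned long long sum = 0;
    for (off_t i = 0; i < st.st_size; i++) {
        /* no read() happens here: touching an uncached page raises a
         * page fault and the kernel fills it from disk behind our back */
        sum += (unsigned char)p[i];
        if (i % (1024 * 1024) == 0)
            printf("Read %lld MB\n", (long long)(i / (1024 * 1024)));
    }

    munmap((void *)p, st.st_size);
    close(fd);
    return sum == 0; /* use sum so the reads aren't optimized away */
}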

If I inspect the memory mappings of this process, I see the file is 3GB and mapped in memory in the range 7fba37bcc000-7fbb07cfe000 (you could get the same information by running sysdig and looking for the mmap events):

gianluca@sid:~$ cat /proc/29010/maps
...
7fba37bcc000-7fbb07cfe000 r--s 00000000 08:01 1443751                    /home/gianluca/bigfile
...

If I now use perf to trace all the page faults or additions to the page cache generated by my process with:

gianluca@sid:~$ sudo perf trace --no-syscalls --event exceptions:page_fault_user --event filemap:mm_filemap_add_to_page_cache -p 29010

I will be able to distinguish three scenarios in the output:

Case 1) The process is reading the content of the file for the first time, and the content is not present in the page cache. This generates both page faults and page cache events, and this is very likely the case you care about, since that's the workflow that is bottlenecked by disk access:

...
19216.582 exceptions:page_fault_user:address=0x7fbb07c8d000f ip=0x560fcdf0fa91f error_code=0x4)
19216.621 exceptions:page_fault_user:address=0x7fbb07c90000f ip=0x560fcdf0fa91f error_code=0x4)
19216.787 exceptions:page_fault_user:address=0x7fbb07ca0000f ip=0x560fcdf0fa91f error_code=0x4)
19216.916 exceptions:page_fault_user:address=0x7fbb07cac000f ip=0x560fcdf0fa91f error_code=0x4)
19216.935 filemap:mm_filemap_add_to_page_cache:dev 8:1 ino 1607a7 page=0x13b92f pfn=1292591 ofs=3490709504)
19216.943 filemap:mm_filemap_add_to_page_cache:dev 8:1 ino 1607a7 page=0x13b930 pfn=1292592 ofs=3490713600)
19216.945 filemap:mm_filemap_add_to_page_cache:dev 8:1 ino 1607a7 page=0x13b931 pfn=1292593 ofs=3490717696)
19216.948 filemap:mm_filemap_add_to_page_cache:dev 8:1 ino 1607a7 page=0x13b932 pfn=1292594 ofs=3490721792)
...

The output might be a bit cryptic, but it's actually quite insightful. Notice how the page_fault_user events are all at addresses in the range 7fba37bcc000-7fbb07cfe000 (the stray trailing "f" after each address and ip appears to be an artifact of how perf renders the tracepoint's format string, not part of the value), and each one is within 1MB of the previous (take the difference between consecutive addresses), so that's roughly the amount that gets read from disk at every fault. This tells you that the process is doing activity on the /home/gianluca/bigfile file. Notice also how, since the kernel didn't have these pieces of the file in the page cache, it's caching them for the first time (the mm_filemap_add_to_page_cache events). Finally, notice how the inode number of the mm_filemap_add_to_page_cache events matches the one of the file (ino 1607a7 is hexadecimal for 1443751):

gianluca@sid:~$ stat ./bigfile
  File: ./bigfile
  Size: 3490912256      Blocks: 6818280    IO Block: 4096   regular file
Device: 801h/2049d      Inode: 1443751     Links: 1

So, in this case you have two obvious clues about which memory-mapped files your process is actually reading (without having to guess them like with the other tool), and how much of those files is being read. It might take a bit of scripting to correlate the addresses back to files if your process is very busy, but it can be done with a small helper along the lines of the sketch below, and it's very precise (and ultimately we might include such functionality in sysdig).
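
As an example of that scripting, a tiny hypothetical helper like the following (the name and argument handling are illustrative only) looks up which mapping, and therefore which file, a fault address from the perf output falls into:

/*
 * addr2file.c - hypothetical helper: given a pid and a fault address
 * from the perf output, print the /proc/<pid>/maps entry containing
 * it, which names the memory-mapped file being read.
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <hex-address>\n", argv[0]);
        return 1;
    }

    unsigned long long addr = strtoull(argv[2], NULL, 16);
    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/maps", argv[1]);

    FILE *f = fopen(path, "r");
    if (!f) { perror("fopen"); return 1; }

    char line[512];
    while (fgets(line, sizeof(line), f)) {
        unsigned long long start, end;
        /* each maps line starts with "start-end", e.g. 7fba37bcc000-7fbb07cfe000 */
        if (sscanf(line, "%llx-%llx", &start, &end) == 2 &&
            addr >= start && addr < end)
            fputs(line, stdout); /* the mapping (and file, if any) hit by the fault */
    }

    fclose(f);
    return 0;
}

For instance, running ./addr2file 29010 7fbb07c8d000 against the process above would print the /home/gianluca/bigfile mapping line.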

Case 2) The process is reading the content of the file for the first time, and the content is already in the page cache because some other process already read it. This generates just page fault events which, as in the previous case, you can easily correlate to the actual files by looking at the address and matching it against /proc/PID/maps. This case is likely not the one causing your performance bottleneck, because the file content is already in memory:

...
19216.582 exceptions:page_fault_user:address=0x7fbb07c8d000f ip=0x560fcdf0fa91f error_code=0x4)
19216.621 exceptions:page_fault_user:address=0x7fbb07c90000f ip=0x560fcdf0fa91f error_code=0x4)
19216.787 exceptions:page_fault_user:address=0x7fbb07ca0000f ip=0x560fcdf0fa91f error_code=0x4)
...

Case 3) The process is reading the content of the file for the nth time: in this case, no page faults or page cache events are generated, since everything is already in place, in the page cache and mapped in the process memory. This case, too, is likely not the one causing your performance bottleneck.

Thanks 

kimb...@gmail.com

Jun 30, 2017, 6:26:34 AM
to sysdig, kimb...@gmail.com
Gianluca,

Apologies for not replying sooner; I've been in meetings all week. The above makes sense, although it's a tad more complex than good old Windows Performance Monitor :-) . The difficulty suggests that my use case is very much an edge case on Linux, which I do find surprising but can't argue with!

Again, thanks for your help (and time); it's much appreciated.

Mike
