Performance degradation?

321 views
Skip to first unread message

atanamir

unread,
Mar 22, 2018, 3:15:45 AM3/22/18
to WinFsp
Hi Bill,

First off, great work on WinFSP.  I discovered it only recently after trying out (and being disappointed by) Dokany's performance.  I was really intrigued by the writeup you did about WinFSP's performance and how it can beat NTFS (albeit on memory vs NTFS on disk).

Unfortunately in my own testing, I haven't really been able to reproduce your results.  WinFSP is definitely a lot faster than Dokany, but I'm still a couple times slower than NTFS (on an SSD, if that makes a difference).  I'm wondering if the recent patches Microsoft did to address Spectre/Meltdown may have observed this?  Have you observed any degradation in performance in your latest tests?  I'm wondering if I should set up a computer with an unpatched version of Windows to see if performance is better.  

Using memfs and a small C program I wrote that calls CreateFile(OPEN_EXISTING) and CloseFile 10,000 times and times the results on my laptop, it appears NTFS takes roughly 130ms to do the aforementioned test, whereas memfs/winfsp takes 500ms to perform the same test.  If it's due to the patch, that's fine, I understand.  I just wanted to make sure I haven't misconfigured anything.

Thanks!

Bill Zissimopoulos

unread,
Mar 22, 2018, 2:48:14 PM3/22/18
to WinFsp
Interesting. I have not run these benchmark tests in a while and it is quite plausible that something has been messed up recently and performance has degraded.

Can you please provide information:
  • Windows version
  • WinFsp version
  • Your NTFS configuration (see for example http://www.ntfs.com/ntfs_optimization.htm)
  • Memfs command line. Do you use kernel caching? Do you start it as a “disk” drive or a “network” drive?
  • Benchmark that you are using.
Although I have not studied the Spectre/Meltdown vulnerabilities and I only know about them from the mainstream press, I would not expect that these patches make such a difference.

MEMFS performed very well in the original “file_open” test so this would have to be a huge degradation:

Bill



--
You received this message because you are subscribed to the Google Groups "WinFsp" group.
To unsubscribe from this group and stop receiving emails from it, send an email to winfsp+un...@googlegroups.com.
To post to this group, send email to win...@googlegroups.com.
Visit this group at https://groups.google.com/group/winfsp.
For more options, visit https://groups.google.com/d/optout.

atanamir

unread,
Mar 22, 2018, 5:43:51 PM3/22/18
to WinFsp
I'm running Windows 10 (10.0.16299.309), with WinFSP 2018.1 B2 (1.3.18023).  My NTFS configuration should be the default settings across the board as I have not tuned anything.  So I assume that a 4K allocation size with 12.5% MFT reservation?  

memfs is started using "memfs-x64 -m X:" (so that's a 'disk' drive).  I proceed to copy C:\windows\system32\notepad.exe to X:.  I then run the following benchmark program (#includes elided for brevity):

int main(int argc, char** argv)
{
    string count_str(argv[1]);
    istringstream c(count_str);

    uint64_t count;
    c >> count;

    std::string path(argv[2]);

    cout << "Test file: " << path << endl;

    auto start = chrono::steady_clock::now();

    for (uint64_t i = 0; i < count; i++)
    {
        HANDLE hFile = CreateFileA(
            path.c_str(),
            FILE_READ_ATTRIBUTES,
            FILE_SHARE_READ,
            NULL,
            OPEN_EXISTING,
            FILE_ATTRIBUTE_NORMAL,
            NULL);
        if (hFile == INVALID_HANDLE_VALUE)
        {
            cout << "CreateFileA failed: " << GetLastError() << endl;
            break;
        }
        CloseHandle(hFile);
    }

    auto end = chrono::steady_clock::now();
    auto duration = end - start;

    cout << "Took " << chrono::duration_cast<chrono::milliseconds>(duration).count() <<  " ms to create/close " << count << " times." << endl;
}

Compiled using the command line, cl /EHsc /Ox iobench.cpp.  The test is invoked as follows:

> iobench 10000 C:\Windows\system32\notepad.exe
Test file: C:\Windows\system32\notepad.exe
Took 273 ms to create/close 10000 times.

>iobench 10000 X:\notepad.exe
Test file: X:\notepad.exe
Took 833 ms to create/close 10000 times.

The range of results I got for roughly 20 tests for memfs was in the range of 775-915, and the NTFS case (on SSD) was between 270 and 322ms.  

And finally, yes, the performance testing page I had read thoroughly before trying out WinFSP.  It's a great writeup with a thorough performance analysis that dokany currently lacks.  (Trying to find benchmark numbers for dokany is what helped me discover winfsp)

Thanks!

atanamir

unread,
Mar 22, 2018, 5:55:20 PM3/22/18
to WinFsp
Oh, I forgot to mention:  I'm aware my test is a bit different than your file_open test since I'm only hammering a single file over and over again... not sure what the implications of that are, however.

Bill Zissimopoulos

unread,
Mar 22, 2018, 7:37:11 PM3/22/18
to WinFsp
I think this explains the difference!

When opening and closing the same file, NTFS is able to utilize all cached information about the file. It only needs to hit the disk once.

OTOH, WinFsp *always* sends the Open to the user mode file system process. This was a conscious decision that I struggled with a lot in the original design of WinFsp. The primary reason is that I wanted access checks to be done in user mode for maximum flexibility.

When you are issuing CreateFileW/CloseHandle, the following calls may be sent to the FSD:
  • Open
  • GetFileInfo
  • Cleanup
  • Close
In an unoptimized scenario WinFsp will forward all these calls to the user-mode file system incurring two context switches per call! However it is able to optimize away all calls except Open. It satisfies GetFileInfo using cached information from the Open and only sends Cleanup when the file is modified. It also sends Close asynchronously.

(Aside: I have at times contemplated caching some security information for recently open files and performing the access checks in the kernel, if the user mode file system asks the FSD to do so (e.g. using a configuration parameter). However this could potentially introduce security problems and subtle bugs for such file systems and is not worth the added complexity.)

Bill


On 3/22/18, 2:55 PM, win...@googlegroups.com on behalf of atanamir wrote:
Oh, I forgot to mention:  I'm aware my test is a bit different than your file_open test since I'm only hammering a single file over and over again... not sure what the implications of that are, however.

--

atanamir

unread,
Mar 23, 2018, 5:23:46 AM3/23/18
to WinFsp
So I spent some time re-writing my little test harness based on your cache hypothesis.  

I now have a program that goes and creates N files, with a random size between 1 byte and 20 KiB.  It then fills the file with random ASCII chars.  (source is attached as 'create.cpp').  I created 5,000 files in my case on NTFS, then I started the memfs and copied the files over.  

I then have another program that creates/closes the files at random.  My v1 of the second program first enumerated all the files into a vector, then picked files at random from the vector.  I then refined that a bit because I feared FindFirstFile/FindNextFile would also trigger NTFS caching of the file data, so I removed that and just generated the filename randomly.  However, results are still disheartening:

NTFS:
> iobench2.exe 10000 5000
Took 250 ms to create/close 10000 times.

memfs:
> iobench2.exe 10000 5000
Took 2342 ms to create/close 10000 times.

MemFS showed more variance here... about half the time it would be 2200-2300ms, and the other half it would be between 750-800ms.  Not sure why the big gap, nor why there was any result in between 800 and 2200ms.  Perhaps sometimes it hit winfsp caches? 

NTFS was reliably between 200 and 300 on every invocation.

I'm aware you also have test suites as part of the code.  I was just trying to write my own as I feel I have a better sense of exactly what's happening.  

Thanks for your time!
create.cpp
iobench2.cpp

Bill Zissimopoulos

unread,
Mar 23, 2018, 12:22:24 PM3/23/18
to WinFsp
It is hard for me to speculate without actually seeing your tests. And it might very well be that a recent change has introduced some kind of performance issue.

If you make your tests available in a public place (e.g. as a gist, etc.) I may try them myself at some time in the near future. I may also try them against the WinFsp baseline-perf branch (https://github.com/billziss-gh/winfsp/tree/baseline-perf), which is from the time I run the original tests.

(You may also consider making your tests available as a GitHub PR against fsbench (https://github.com/billziss-gh/winfsp/blob/master/tst/fsbench/fsbench.c))

Regarding the variance. This indicates to me that there is other activity in your system. When you are doing testing of this kind there are many variables involved. To minimize such variables you should not run any other programs and you should reboot the machine between each test.

(As a completely hypothetical example, if you run the MEMFS test immediately after the NTFS test, one possibility is that the MEMFS test is running during NTFS doing its Lazy Writes and thus is paying a penalty for NTFS's hidden activity.)

But if you are seeing MEMFS being an order of magnitude slower than NTFS in your tests, that is something that I definitely want to understand why and fix if possible. I note that a lot of WinFsp optimizations stemmed from my effort to improve the results of those original benchmarks.

Bill



--

Bill Zissimopoulos

unread,
Mar 23, 2018, 12:46:19 PM3/23/18
to WinFsp
I am just realizing that you did attach your tests to your earlier post. My apologies for not seeing them earlier.

Bill


--

Bill Zissimopoulos

unread,
Mar 23, 2018, 2:56:28 PM3/23/18
to WinFsp
Having now seen your benchmark code, you are correct that I would expect MEMFS to perform as well as it did in the original file_open_test.

To summarize here are the reasons that I can think of for why it does not:
  • A recent performance regression that has gone undetected. Unfortunately it is hard to test for such regressions as CI testing does not provide reliable timings.
  • Undetected activity in the system used for testing. I base this reason on the variance you have observed.
  • Various cache effects that can be eliminated by rebooting the system between each test run. (E.g. Reboot, run NTFS tests, reboot, run MEMFS tests, etc.)
Finally I am not sure if I have mentioned this before, but I am assuming that you run MEMFS with full caching enabled (i.e. using the –t –1 switch (which I believe is the default)).

Bill


--

atanamir

unread,
Mar 23, 2018, 4:12:28 PM3/23/18
to WinFsp
Thanks for your responsiveness.  I'll give it another go tonight.  I did try it once this morning, and you're probably right -- both the NTFS time and MEMFS time fell; the NTFS got a consistent 150ms or so, whereas MEMFS got consistently around 600ms.  Much better than 2+ seconds, but still quite a bit slower.  I'll try the reboot cycling tests tonight to see if it changes the results in any way.  

atanamir

unread,
Mar 25, 2018, 10:08:14 PM3/25/18
to WinFsp
Hi,

So in an attempt to rule out any potential slowdown that the application could be causing, I wrote a 'nullfs' file system which simply will succeed all open() requests and report that the file is 64 bytes long.  I've attached the source it.  Running the iobench2 test again, I can still see that on my machine it's consistently 4x slower than NTFS -- I've tried your suggestions of rebooting, and ensuring that Windows Update isn't running or there being any other background activity on my machine.  

I put it through the VS2017 profiler, but it doesn't want to load symbols for the .sys driver.  Any ideas on how I can help track down on what's making it go slowly?  

For reference, NTFS is about 150-200ms, whereas the nullfs for 10,000 creates on 5,000 files is between 750-825ms.

Thanks
Stephen

atanamir

unread,
Mar 25, 2018, 10:08:47 PM3/25/18
to WinFsp
Forgot to attach the file...

-Stephen
nullfs.cpp

Bill Zissimopoulos

unread,
Mar 26, 2018, 8:18:38 PM3/26/18
to WinFsp
Stephen:

Sorry for the late response. I am currently very busy with some other work.

The only way that I know to profile a Windows driver is using xperf, which is included in the Windows Performance Toolkit, which is in turn included in the Windows Assessment and Deployment Kit. (I know! Microsoft right? :)


If you decide to go that route be prepared to learn a lot. Xperf is a fantastic tool, but it takes a long time to understand it and make sense of what it is telling you.

Xperf helped me identify and fix a number of performance issues including a very hard problem with thread scheduling that I describe here:


Your findings strongly suggest some kind of performance regression that I would like to address. I am hoping to have time to look into this more closely in 2-3 weeks. In the meantime it will be tremendously helpful if you intend to do some more analysis yourself.

BTW, have you run the fsbench “file_open_test” to see if you get similar results as your iobench2?

Bill


--

atanamir

unread,
Apr 23, 2018, 10:39:20 PM4/23/18
to WinFsp
Hi Bill,

No problem for the late response.  As you can tell, my response is quite delayed too.  

I found some time to take an xperf trace using my nullfs + iobench2 combination.  The same slowdown was observed.  Unfortunately it seems that most of the time is spend in ExAllocatePoolWithTag and ExFreePoolWithTag calls.  They add a significant overhead,  

Looking at nullfs.exe in particular, the entire run totalled 509ms.  Of that, winfsp-x64.dll!FspFileSystemDispatcherThread consumes 507ms (inclusive total).  As we drill down further into the call stack, we see:

  kernel32!DeviceIoControlImplementation 500ms
    kernelbase!DeviceIoControl 499ms
      ntdll!NtFsControlFile 494ms
        ntoskrnl!KiSystemServiceCopyEnd 473ms
          ntoskrnl!NtFsControlFile 473ms
            ntoskrnl!IopXxxControlFile 473ms
              ntoskrnl!IopSynchronousServiceTail 398ms
                ntoskrnl!ofCallDriver 304ms
                  ... (elided for brevity)
                ntoskrnl!opCompleteRequest 92ms
                  ntoskrnl!ExFreePoolWithTag 48ms
                  ntoskrnl!ovFreeIrpPrivate 38ms
                    ntoskrnl!ExFreePoolWithTag 35ms
                  ...
              ntoskrnl!ovAllocateIrp 40ms
                ntoskrnl!VeAllocatePoolWithTag 22ms
                ...
              
So even before we get to nullfs' or WinFSP's code, we've already spent ~100ms allocating and freeing memory.  The story doesn't get better as we drill in, though.  Lots of time is spent in various freepool, allocatepool calls all over the stack trace.  

Not sure how this can easily be fixed...

If you want, I can share the ETL file I have.

Stephen

Bill Zissimopoulos

unread,
Apr 24, 2018, 6:33:33 PM4/24/18
to atanamir, WinFsp
Stephen:

I looked at the main loop of your test again:

    for (uint64_t i = 0; i < count; i++)
    {
        stringstream s;
        s << "file_" << file_rand() << ".txt";
        auto n = s.str();

        HANDLE hFile = CreateFileA(
            n.c_str(),
            FILE_READ_ATTRIBUTES,
            FILE_SHARE_READ,
            NULL,
            OPEN_EXISTING,
            FILE_ATTRIBUTE_NORMAL,
            NULL);
        if (hFile == INVALID_HANDLE_VALUE)
        {
            cout << "CreateFileA(" << n << ") failed: " << GetLastError() << endl;
            break;
        }
        CloseHandle(hFile);
    }

Assuming that you have optimizations enabled (FileInfoTimeout==-1, PostCleanupWhenModifiedOnly=1), this is what is supposed to happen in every loop iteration.
  • CreateFileA: the FSD (File System Driver) will receive IRP_MJ_CREATE, which it will forward to the user mode file system as an Open request. The response will include file metadata that is cached within the FSD and eventually the IRP will be completed and return control to the test program. Result: 2 context switches per call.
  • Any additional information IRP’s (Windows or filters may send some) will be satisfied from the FSD caches. Result: 0 context switches.
  • CloseHandle: will send up to 2 IRP’s: IRP_MJ_CLEANUP and IRP_MJ_CLOSE. The CLEANUP IRP will be handled by the FSD directly, because the file was not modified and because we have PostCleanupWhenModifiedOnly. The CLOSE IRP (if sent) is posted asynchronously and will in all likelihood be piggybacked to the  processing of the CREATE IRP’s. (When the file system thread is switched in to handle the CREATE IRP, it will also find a prior CLOSE and handle it.) Result: 0 context switches.
This means that there should be 2 context switches per loop iteration or count*2 total context switches.

BTW, I observed that your test code opens the file with FILE_READ_ATTRIBUTES, whereas fsbench opens the file with GENERIC_READ | GENERIC_WRITE. I am completely speculating here, but I am wondering if it accounts for the discrepancy and better NTFS performance (presumably with FILE_READ_ATTRIBUTES NTFS has to only check metadata).

Yes, please send me the ETL file if possible. You can send it directly to <billziss at navimatics dot com> if you do not want to make it publicly available.

BTW, the ExAllocatePool/ExFreePool timings may be fine. They may appear excessive but this may simply be the result of the user mode file system calling DeviceIoControl too often while it is attempting to fetch the next IRP to process. The true culprit may be excessive context switching for some reason.

It is possible to configure xperf to capture context switch data, but unfortunately I do not recall how off the top of my head.

Bill

PS: I am not sure if I have mentioned this before, but I am assuming that you have the driver verifier off while running these tests.



--
Reply all
Reply to author
Forward
0 new messages