Optimized filesystem directory scanning


Kristian Rosenvold

Nov 20, 2013, 3:14:04 PM
to mechanica...@googlegroups.com
I have been working on a project that is basically an optimized
version of the Ant/Maven filesystem scanner, which is also used by a
significant number of other Java tools. I have been streamlining and
parallelizing this code based on some overall goals (which can be seen
on the GitHub site). The cool part about this project is how the
overall goals affect the solution space.

Currently the scanner is a tad short of 10x faster than any other
filesystem scanner I have tested (on Linux), and seems to process
about 1 million files a second on what was a high-end SSD 9 months
ago. On OS X there appears to be some JVM/OS X limitation that
flatlines at about 2.5x the competition (I'm considering installing
Linux on my MBP to find out if it's a hardware issue). Even old-school
hard drives seem to be 5-6x faster on Linux.

If anyone else wants to play around with this or discuss the
strategies chosen, feel free to do so :) I am also working on
the APIs, which still need a little TLC. There is also the issue of
establishing sensible defaults that work across all sorts of
platforms, which may turn out to be a tough nut that could require
some discussion :)

The source is at https://github.com/krosenvold/fast-dirscan

I will submit the end result of this project back somewhere at Apache
once I finish the work; this project is ASL 2.0-licensed.

Kristian


Travis Downs

Nov 20, 2013, 7:46:09 PM
to mechanica...@googlegroups.com
I've written some fast directory scanners in the past, and the bottlenecks can vary heavily by platform.  I found that the list() call generally spends most of its time in the underlying native calls, stuff like opendir() and readdir() - and the cost of these calls seems to vary wildly, being much slower on OS X than Linux, for example.  Earlier Linux kernels seemed to have coarser-grained locking, to the point that using more than two threads didn't seem fruitful (this is with a hot inode/block cache), but the situation has improved on recent kernels.

One approach that paid big dividends was to cache (to disk) the results (all files and directory paths) of a particular scan, along with directory modification timestamps. Then on subsequent scans you need only check the timestamps of all the directories in your cache, and rescan only modified directories - since otherwise the set of contained files won't have changed (of course, the file content itself can change arbitrarily without updating the directory timestamp).  Depending on how frequently you are scanning unchanged directory trees, and your ratio of files to directories, this can speed up scanning by an order of magnitude or more. There are edge cases to handle, e.g., when a directory has been modified very recently - that is, in the "current" time slice as defined by the timestamp granularity - these should always be rescanned, since you could otherwise miss a subsequent modification in the same time slice (this case is quite rare in practice with fine-grained timestamps).
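The caching scheme above can be sketched in a few lines of Java. This is just an illustration of the idea, not any particular library's API; the class name and the 1-second "current time slice" guard are my own assumptions:

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

/** Minimal sketch of a scan cache keyed on directory mtimes. */
public class DirCache {
    // Cached listing plus the directory mtime observed when it was taken.
    static final class Entry {
        final long lastModified;
        final String[] names;
        Entry(long lastModified, String[] names) {
            this.lastModified = lastModified;
            this.names = names;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();

    /**
     * Returns the directory's file names, re-listing only when the
     * directory mtime has changed since the cached scan. Directories
     * modified "now" (within the assumed 1s timestamp granularity) are
     * always re-listed, since a later change in the same tick could
     * otherwise be missed.
     */
    public String[] list(File dir) {
        long mtime = dir.lastModified();
        Entry e = cache.get(dir.getPath());
        boolean inCurrentTick = System.currentTimeMillis() - mtime < 1000;
        if (e != null && e.lastModified == mtime && !inCurrentTick) {
            return e.names;               // unchanged: skip the readdir()
        }
        String[] names = dir.list();      // the actual opendir/readdir work
        if (names != null) {
            cache.put(dir.getPath(), new Entry(mtime, names));
        }
        return names;
    }
}
```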

Kristian Rosenvold

Nov 21, 2013, 5:07:21 AM
to mechanica...@googlegroups.com
2013/11/21 Travis Downs <travis...@gmail.com>
I've written some fast directory scanners in the past, and the bottlenecks can vary heavily by platform.  I found that the list() call generally spends most of its time in the underlying native calls, stuff like opendir() and readdir() - and the cost of these calls seems to vary wildly, being much slower on OS X than Linux, for example.  Earlier Linux kernels seemed to have coarser-grained locking, to the point that using more than two threads didn't seem fruitful (this is with a hot inode/block cache), but the situation has improved on recent kernels.

From my testing, 10-12 threads seem to be fairly optimal on my modern Linux machines, regardless of underlying disk type (!). An HDD might have optimal performance around 10 threads, an SSD around 12. As for MacOS, does anyone know what is actually going on there? Is it MacOS itself or Java?


One approach that paid big dividends was to cache (to disk) the results (all files and directory paths) of a particular scan, along with directory modification timestamps. Then on subsequent scans you need 
 
Cool strategy that will probably pay off in some scenarios. Part of the reason for doing my rewrite was also to provide efficient abstractions that avoid repeated IO in layered scenarios, since a lot of the code I see has a tendency to pass these files between layers, where "state" tends to get recreated at different layers, creating inefficiencies. This basically means I need to know details about individual files too, not just their existence.

But given that the contents of a single directory generate a given number of fork/join tasks for each subdirectory, I suppose I could actually avoid list() on a directory with identical attributes, and bypass the entire call to listFiles... Do you know if this is cross-platform reliable?
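For reference, the one-fork/join-task-per-subdirectory structure described above might look roughly like this. A self-contained sketch under my own naming, not the fast-dirscan code:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

/** Counts files in a tree by forking one task per subdirectory. */
public class ScanTask extends RecursiveTask<Long> {
    private final File dir;

    public ScanTask(File dir) { this.dir = dir; }

    @Override
    protected Long compute() {
        File[] entries = dir.listFiles();   // one readdir() per task
        if (entries == null) return 0L;     // unreadable, or not a directory
        long files = 0;
        List<ScanTask> forks = new ArrayList<>();
        for (File f : entries) {
            if (f.isDirectory()) {
                ScanTask t = new ScanTask(f); // subdirectory -> new task
                t.fork();
                forks.add(t);
            } else {
                files++;
            }
        }
        for (ScanTask t : forks) {
            files += t.join();              // gather the subtree counts
        }
        return files;
    }

    public static long countFiles(File root, int threads) {
        return new ForkJoinPool(threads).invoke(new ScanTask(root));
    }
}
```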

Kristian

Peter Lawrey

Nov 21, 2013, 7:05:36 AM
to mechanica...@googlegroups.com

My guess is you are scanning directories in disk cache so the type of drive doesn't matter.


Travis Downs

Nov 21, 2013, 8:28:21 PM
to mechanica...@googlegroups.com
Yeah, my next question was whether you are testing cache-hot or cache-cold behavior.  Having the inodes and directory blocks in the cache makes a couple of orders of magnitude difference in the cases I tested (I tested the cache-cold case using /proc/sys/vm/drop_caches and checked that the results were consistent with a just-after-reboot cold cache). You might also find that the optimal number of threads varies widely between the two scenarios, and in the cache-cold scenario depends further on the hardware.

I'm not sure if the directory timestamp/attribute updating is cross-platform reliable (e.g., defined by POSIX), but my testing showed that timestamps are updated whenever a file is moved (renamed), deleted, created, etc. within a directory.  However, the directory timestamp is not updated when a contained file changes permissions, which is a limitation of the scheme - if a user is given permissions to see a file that he previously couldn't, the technique of caching the directory contents will miss this file on subsequent scans because it won't rescan the directory.  IIRC, though, Java actually returns these "inaccessible" files from list() - because your permissions on the containing directory are what matter for seeing the existence of the file - so you could in theory cache these as well and rescan them to see if their permissions have changed.

Kristian Rosenvold

Nov 26, 2013, 5:43:56 AM
to mechanica...@googlegroups.com
Running with echo 3 > /proc/sys/vm/drop_caches between each test certainly worsens the performance, and maximum
throughput seems to be reachable with far fewer threads, maybe just 3-4. It would still appear that (at least on Linux) performance does not degrade significantly if I use 10 threads for scanning (which seems to provide optimal throughput on my current SSDs). I'm not sure to what extent this observation applies cross-platform (I can clearly envision Windows XP thrashing with 10 threads...)

The use case (build/tooling) normally implies a certain amount of caching, so while fully cached is unrealistic, uncached is equally so (I'd say the likelihood of fully uncached is minimal, whereas the likelihood of fully cached is decent). So before I start testing on Windows, it looks like 10 threads is a decent all-round value that doubles uncached HDD performance, makes uncached SSD around 30% faster, and cached SSD anywhere up to 10x.

@Travis: I tried caching the whole structure and just comparing timestamps on all the directories to verify unchanged directories. As long as I was able to hold the cache in memory, this appears to give a significant performance boost. But the moment I write (and read) the file to/from disk, it appears to lose much of its benefit. Arguably I have not been very efficient in optimizing the read/write of the file itself; I was kind of expecting a much clearer gain. (I forked off the actual write of the file in a separate thread that I kept out of the benchmarking - for obvious reasons the reading of the file must be included in the benchmark...) Does this match your observations, or do you think I've done something inefficient/wrong?
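One possible culprit is the on-disk format of the cache file itself; a flat, buffered binary layout keeps the read path cheap compared to per-entry object serialization. A hypothetical sketch of such a layout (a path-to-mtime map; the class name and format are my own, not the actual fast-dirscan format):

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Flat binary format for a (directory path -> mtime) cache. */
public class CacheIO {

    public static void write(Map<String, Long> mtimes, File out) throws IOException {
        try (DataOutputStream s = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(out), 1 << 16))) {
            s.writeInt(mtimes.size());          // entry count up front
            for (Map.Entry<String, Long> e : mtimes.entrySet()) {
                s.writeUTF(e.getKey());         // directory path
                s.writeLong(e.getValue());      // its last-modified time
            }
        }
    }

    public static Map<String, Long> read(File in) throws IOException {
        try (DataInputStream s = new DataInputStream(
                new BufferedInputStream(new FileInputStream(in), 1 << 16))) {
            int n = s.readInt();
            Map<String, Long> mtimes = new HashMap<>(n * 2);
            for (int i = 0; i < n; i++) {
                mtimes.put(s.readUTF(), s.readLong());
            }
            return mtimes;
        }
    }
}
```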

Kristian



