I've written some fast directory scanners in the past, and the bottlenecks can vary heavily by platform. I found that the list() call generally spends most of its time in the underlying native calls, such as opendir() and readdir(), and the cost of these calls seems to vary wildly; they were much slower on OSX than on Linux, for example. Earlier Linux kernels seemed to have coarser-grained locking, to the point that using more than two threads didn't seem fruitful (this is with a hot inode/block cache), but the situation has improved on recent kernels.
One approach that paid big dividends was to cache (to disk) the results (all file and directory paths) of a particular scan, along with the directory modification timestamps. Then on subsequent scans you need
My guess is that you are scanning directories that are already in the disk cache, so the type of drive doesn't matter.
--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
Yeah, my next question was whether you are testing cache-hot or cache-cold behavior. Having the inodes & directory blocks in the cache makes a couple of orders of magnitude difference in the cases I tested (I tested the cache-cold case using /proc/sys/vm/drop_caches and checked that the results were consistent with a just-after-reboot cold cache). You might also find that the optimal number of threads varies widely between the two scenarios, and in the cache-cold scenario it depends further on the hardware.
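To find the thread count that works best on a given machine, something like the following rough sketch could be timed with different pool sizes. It is not the scanner I actually used; the class name ParallelDirScanner, the method countEntries and the thread counts tried in main are all made up for illustration. It lists each directory as a separate task on a fixed-size pool and counts the entries it sees.

```java
import java.io.File;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: one pool task per directory, with a configurable pool size.
class ParallelDirScanner {

    /** Counts all files and directories below root, using the given number of threads. */
    static long countEntries(File root, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong entries = new AtomicLong();
        AtomicInteger pending = new AtomicInteger(1); // 1 = the task for root itself
        CountDownLatch done = new CountDownLatch(1);
        try {
            submit(pool, root, entries, pending, done);
            done.await(); // released when the last outstanding directory task finishes
        } finally {
            pool.shutdown();
        }
        return entries.get();
    }

    private static void submit(ExecutorService pool, File dir, AtomicLong entries,
                               AtomicInteger pending, CountDownLatch done) {
        pool.execute(() -> {
            try {
                File[] children = dir.listFiles(); // null if dir is unreadable or not a directory
                if (children != null) {
                    for (File child : children) {
                        entries.incrementAndGet();
                        // Note: isDirectory() follows symlinks, so a symlink cycle would loop forever.
                        if (child.isDirectory()) {
                            pending.incrementAndGet(); // register the subtask before this task completes
                            submit(pool, child, entries, pending, done);
                        }
                    }
                }
            } finally {
                if (pending.decrementAndGet() == 0) {
                    done.countDown();
                }
            }
        });
    }

    public static void main(String[] args) throws InterruptedException {
        File root = new File(args.length > 0 ? args[0] : ".");
        for (int threads : new int[] {1, 2, 4, 8}) { // arbitrary thread counts to compare
            long start = System.nanoTime();
            long count = countEntries(root, threads);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println(threads + " thread(s): " + count + " entries in " + millis + " ms");
        }
    }
}
```

The first pass over a tree warms the cache for the later ones, so for cache-cold numbers the caches would have to be dropped before each run.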
I'm not sure if the directory timestamp/attribute updating is reliable across platforms (e.g., defined by POSIX), but my testing showed that a directory's timestamp is updated whenever a file is moved (renamed), deleted, created, etc. within that directory. However, the directory timestamp is not updated when a contained file changes permissions, which is a limitation of the technique: if a user is given permission to see a file that he previously couldn't, the cached directory contents will miss this file on subsequent scans, because the directory won't be rescanned. IIRC, though, Java actually returns these "inaccessible" files from list(), because your permissions on the containing directory are what matter for seeing the existence of the file. So you could in theory cache these as well and recheck them to see if their permissions have changed.
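Here is a minimal sketch of the timestamp-based caching idea, under some assumptions. The class name CachedDirScanner, its fields and the listCalls counter are invented for illustration, and the cache lives in memory only (writing it to disk is left out). A directory is re-listed only when its modification time differs from the cached one; otherwise the cached child names are reused.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: skip list() for directories whose mtime has not changed.
class CachedDirScanner {

    /** Cached state for one directory: its mtime when listed, plus its child names. */
    static final class Entry {
        final long lastModified;
        final List<String> files = new ArrayList<>();
        final List<String> subdirs = new ArrayList<>();
        Entry(long lastModified) { this.lastModified = lastModified; }
    }

    private final Map<String, Entry> cache = new HashMap<>();
    int listCalls = 0; // number of real list() calls, to show what the cache saves

    /** Returns the paths of all files and directories below root. */
    List<String> scan(File root) {
        List<String> out = new ArrayList<>();
        scanInto(root, out);
        return out;
    }

    private void scanInto(File dir, List<String> out) {
        long mtime = dir.lastModified(); // one stat per directory, even on a cache hit
        Entry entry = cache.get(dir.getPath());
        if (entry == null || entry.lastModified != mtime) {
            entry = new Entry(mtime);
            String[] names = dir.list(); // the expensive opendir()/readdir() path
            listCalls++;
            if (names != null) {
                for (String name : names) {
                    // A name switching between file and directory also changes the parent's mtime,
                    // so caching the type alongside the name should be safe.
                    if (new File(dir, name).isDirectory()) {
                        entry.subdirs.add(name);
                    } else {
                        entry.files.add(name);
                    }
                }
            }
            cache.put(dir.getPath(), entry);
        }
        for (String name : entry.files) {
            out.add(new File(dir, name).getPath());
        }
        for (String name : entry.subdirs) {
            File sub = new File(dir, name);
            out.add(sub.getPath());
            scanInto(sub, out); // subdirectories are always checked; their mtimes are independent
        }
    }
}
```

One caveat with this sketch: a change that lands within the same timestamp tick as the previous scan would go unnoticed, so a real implementation might want to treat very recently modified directories as dirty. Permission changes on contained files are missed as well, as described above.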