https://github.com/checkstyle/checkstyle/compare/master...rnveach:issue_138I have started looking at multi-threaded Checkstyle, since it keeps coming up in talks with new features. This thread is my work I have done on it.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------- Designing
Checkstyle's multi-threaded design should be pretty simple. Checkstyle currently takes in a list of files to work on, and then processes one file at a time against certain rules. So the design for multi-threaded should be that each file is its own thread, and there should be a core/master thread that maintains the work the other threads do.
To achieve this, we first have to split out where each part would go. Obviously
FileSets and
TreeWalker would definitely be in the multi-threaded area, since they are individual works on a single file. Since
Checker is already maintaining multiple files and sends the work to each individual
FileSet,
Checker will remain outside the multi-threaded area and be the master of the threads and files. Since
Listeners are so close related to
Checker and since they do different work than
FileSets, they will remain with
Checker.
This design allows us to stay true to the current Checkstyle implementation and achieve multi-threaded status.
-------- Implementing
Java has 2 implementations of multi-threaded designs that we can work with implementing in Checkstyle.
ExecutorService with
Callable or with
ForkJoinPool. You can google them to learn more about each one.
-------- My Implementation
For Checkstyle, I really liked
ForkJoinPool which takes big
tasks, splits them into the same smaller tasks, and then joins them back
up. I could see this working with Checkstyle because FileSets,
TreeWalkers, and Checks are all doing the same thing. They take in a
file, process it, and spit out violations.
I however did not go this
route because I was afraid of copying and maintaining new instances of
these modules everytime we split tasks. Also, checks usually are the
fastest to finish, so I'm not sure giving them their own threads will
improve the speed much or just be a slowdown. For my work, I decided to go with
Callable.
https://github.com/rnveach/checkstyle/commits/issue_138https://github.com/checkstyle/checkstyle/compare/master...rnveach:issue_138Following everything I talked about above, I changed Checkstyle's code to achieve this form of multi-threaded status in the branch at the top of this post. All of my changes are trying to be as isolated as possible. I wanted to clear show changes from non-MT to MT, and still keep Checkstyle's main code intact with as little changes as possible.
The main classes to look at are
MultiThreadedChecker and
MultiThreadedTask, and
AbstractFileSetCheck `clone` method.
MultiThreadedChecker is a class that extends
Checker and only replaces minor parts, for easy integration.
CheckstyleFileResults is used to pass data between the threads in one easy class.MultiThreadedTask's only duty is to clone the configuration to all the threads and maintain communication with the other threads.
MultiThreadTask's main purpose is just to call
Checker.processFile in each thread.
The main thing to note is the use of deep cloning. Each thread is working with its own modules, but all these modules still need to have the same configuration as all the others. The main purpose of cloning is so we can make checks easier to code by maintaining state (instance fields) and so we don't collide with other threads if they are using the same modules. To keep inline with making as little changes as possible to the checks, I used a utility to achieve the cloning.
This multi-threaded design shows that we are just injecting our simple
code right into the middle of checker. Since we are injecting right in
the middle of the normal Checker process, we have a higher guarantee
that we won't break the logic of the checks, especially since the unit
tests still call the normal, non-MT paths. When developing this code, the only real issues that kept cropping up were related to failing to clone the class properly, which is why I use the cloning utility.
I believe the current state of the branch is a fully working version minus test and code style fixes. Running on my work machine of Checkstyle's source dropped runtime from 2 minutes to around 40 seconds.
-------- Testing
MultiThreadedChecker allows us to specify the number of threads. Just testing the cloning and thread execution can easily be tested with just setting the value to 1. It will run exactly the same as using the normal
Checker but still in the separate thread of MT mode. Our current unit tests verify that the logic works single run, and my implementation doesn't require big changes to them.
For truly testing multiple threads, I set the value to 4 and run the utility on CS' code multiple times. Usually if it failed while developing, it failed consistently and in the same area. I will look online more into better testing methods, but I think this might be the only way.
We should add both these tests to travis. A multi-threaded run of 1 and 2+ threads.
-------- Unresolved Issues with My Implementation
1)
FileSets/
Checks that require every file for proper validation. Obviously we are talking about
TranslationCheck.
My current idea is to make a special override for these checks so that they are the same instance in every thread. This way we can still split the files between all the threads, but also keep the files together for these special checks. These type of checks will be the only ones that need to be completely thread safe. They will either need to use a synchronize or ThreadLocal to maintain any type of state.
I have not started coding for this yet, but it seems like an easy fix.
-------- Issues My Changes Touched
https://github.com/checkstyle/checkstyle/issues/138This is a given.
https://github.com/checkstyle/checkstyle/issues/3515This is required to be able to switch from
Checker to
MultiThreadedChecker.
https://github.com/checkstyle/checkstyle/issues/2992FileContentsHolder holds the contents of the file for filters like
SuppressWithNearbyCommentFilter.
FileContentsHolder is more like a hack to send extra data from the
Checks to the
Listeners because Checkstyle currently doesn't support that kind of behavior.
In my current multi-threaded implementation, we are doing multiple files at once and they are completely isolated from the filters so this class can't remain as it is. My change deprecated this class and passed the information around using
FileContents, thereby removing the use of FileText.
https://github.com/checkstyle/checkstyle/issues/3034The previous issue made little use of
FileText as we require
FileContents to be able to pass around the extra information the filters need.