MultiThreaded Checkstyle

128 views
Skip to first unread message

R Veach

unread,
Oct 31, 2016, 1:03:28 PM10/31/16
to checkstyle-devel
https://github.com/rnveach/checkstyle/commits/issue_138
https://github.com/checkstyle/checkstyle/compare/master...rnveach:issue_138

I have started looking at multi-threaded Checkstyle, since it keeps coming up in talks with new features. This thread is my work I have done on it.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

-------- Designing

Checkstyle's multi-threaded design should be pretty simple. Checkstyle currently takes in a list of files to work on, and then processes one file at a time against certain rules. So the design for multi-threaded should be that each file is its own thread, and there should be a core/master thread that maintains the work the other threads do.
To achieve this, we first have to split out where each part would go. Obviously FileSets and TreeWalker would definitely be in the multi-threaded area, since they are individual works on a single file. Since Checker is already maintaining multiple files and sends the work to each individual FileSet, Checker will remain outside the multi-threaded area and be the master of the threads and files. Since Listeners are so close related to Checker and since they do different work than FileSets, they will remain with Checker.

This design allows us to stay true to the current Checkstyle implementation and achieve multi-threaded status.

-------- Implementing

Java has 2 implementations of multi-threaded designs that we can work with implementing in Checkstyle.
ExecutorService with Callable or with ForkJoinPool. You can google them to learn more about each one.

-------- My Implementation

For Checkstyle, I really liked ForkJoinPool which takes big tasks, splits them into the same smaller tasks, and then joins them back up. I could see this working with Checkstyle because FileSets, TreeWalkers, and Checks are all doing the same thing. They take in a file, process it, and spit out violations.
I however did not go this route because I was afraid of copying and maintaining new instances of these modules everytime we split tasks. Also, checks usually are the fastest to finish, so I'm not sure giving them their own threads will improve the speed much or just be a slowdown. For my work, I decided to go with Callable.

https://github.com/rnveach/checkstyle/commits/issue_138
https://github.com/checkstyle/checkstyle/compare/master...rnveach:issue_138

Following everything I talked about above, I changed Checkstyle's code to achieve this form of multi-threaded status in the branch at the top of this post. All of my changes are trying to be as isolated as possible. I wanted to clear show changes from non-MT to MT, and still keep Checkstyle's main code intact with as little changes as possible.

The main classes to look at are MultiThreadedChecker and MultiThreadedTask, and AbstractFileSetCheck `clone` method.

MultiThreadedChecker is a class that extends Checker and only replaces minor parts, for easy integration.
CheckstyleFileResults is used to pass data between the threads in one easy class.
MultiThreadedTask's only duty is to clone the configuration to all the threads and maintain communication with the other threads. MultiThreadTask's main purpose is just to call Checker.processFile in each thread.

The main thing to note is the use of deep cloning. Each thread is working with its own modules, but all these modules still need to have the same configuration as all the others. The main purpose of cloning is so we can make checks easier to code by maintaining state (instance fields) and so we don't collide with other threads if they are using the same modules. To keep inline with making as little changes as possible to the checks, I used a utility to achieve the cloning.

This multi-threaded design shows that we are just injecting our simple code right into the middle of checker. Since we are injecting right in the middle of the normal Checker process, we have a higher guarantee that we won't break the logic of the checks, especially since the unit tests still call the normal, non-MT paths. When developing this code, the only real issues that kept cropping up were related to failing to clone the class properly, which is why I use the cloning utility.

I believe the current state of the branch is a fully working version minus test and code style fixes. Running on my work machine of Checkstyle's source dropped runtime from 2 minutes to around 40 seconds.

-------- Testing

MultiThreadedChecker allows us to specify the number of threads. Just testing the cloning and thread execution can easily be tested with just setting the value to 1. It will run exactly the same as using the normal Checker but still in the separate thread of MT mode. Our current unit tests verify that the logic works single run, and my implementation doesn't require big changes to them.
For truly testing multiple threads, I set the value to 4 and run the utility on CS' code multiple times. Usually if it failed while developing, it failed consistently and in the same area. I will look online more into better testing methods, but I think this might be the only way.

We should add both these tests to travis. A multi-threaded run of 1 and 2+ threads.

-------- Unresolved Issues with My Implementation

1)
FileSets/Checks that require every file for proper validation. Obviously we are talking about TranslationCheck.

My current idea is to make a special override for these checks so that they are the same instance in every thread. This way we can still split the files between all the threads, but also keep the files together for these special checks. These type of checks will be the only ones that need to be completely thread safe. They will either need to use a synchronize or ThreadLocal to maintain any type of state.

I have not started coding for this yet, but it seems like an easy fix.

-------- Issues My Changes Touched

https://github.com/checkstyle/checkstyle/issues/138

This is a given.

https://github.com/checkstyle/checkstyle/issues/3515

This is required to be able to switch from Checker to MultiThreadedChecker.

https://github.com/checkstyle/checkstyle/issues/2992

FileContentsHolder holds the contents of the file for filters like SuppressWithNearbyCommentFilter. FileContentsHolder is more like a hack to send extra data from the Checks to the Listeners because Checkstyle currently doesn't support that kind of behavior.
In my current multi-threaded implementation, we are doing multiple files at once and they are completely isolated from the filters so this class can't remain as it is. My change deprecated this class and passed the information around using FileContents, thereby removing the use of FileText.

https://github.com/checkstyle/checkstyle/issues/3034

The previous issue made little use of FileText as we require FileContents to be able to pass around the extra information the filters need.

R Veach

unread,
Nov 1, 2016, 10:40:42 PM11/1/16
to checkstyle-devel
-------- Additional Unresolved Issues

SuppressWarningsFilter/SuppressWarningsHolder will have to be rewritten too.
We should probably come up with a more common way to send data between filters and filesets/checks.

-------- Additional Issue Touched

https://github.com/checkstyle/checkstyle/issues/820

Because of the passing from thread to master, I don't think this should be a child. Like my design, it should replace FileText.

Oliver Burn

unread,
Nov 2, 2016, 6:22:25 AM11/2/16
to checkstyle-devel
Hi,

This looks like an exciting advancement for Checkstyle that is long overdue. It's something I tinkered with a long time ago and abandoned.

Here are some random thoughts - feel free to disregard.

I like the idea of Checks stating explicitly what type they are and then framework takes care of executing them as appropriate.

The types I can think of are:
  1. Completely stateless - these checks have no state. A single instance can be created and called by any thread for any file. A good example is WhitespaceAfterCheck.
  2. File stateful - these checks have state than needs to be maintained for each file. An instance needs to be created for each thread that is processing files. A good example is MethodCountCheck.
  3. Global stateful - these checks have state that needs to be maintained across all the files being processed. A single instance needs to be created that processes all the files. A good example is TranslationCheck and a subtle one is JavadocPackageCheck.
I would expect that most checks would fall into being type (1) or (2). 

A new method could be added to Check to describe what type of check it is. By default it could return type (3) to guarantee compatibility with current behaviour.

Regards,
Oliver

R Veach

unread,
Nov 2, 2016, 11:46:50 AM11/2/16
to checkstyle-devel
The 3 types of checks are true, however, with the implementation I have going right now using `deep cloning`, point 2 (file stateful) and point 1 (stateless) don't need to be identified separately.
The `deep clone` doesn't care if the class has instances or not, everything is deep cloned, giving all checks brand new instances. This gives us backward compatibility, to allow this checks as they are without needing any changes. It also makes it easier to code these checks for multi-threaded as you don't need to worry if your check meets the requirements for point 1 or 2. This also alleviates the worry about some of these type of threading issues that can occur with each if they are not managed correctly (Ex: multiple threads writing/reading to the same field instance).
When I went forward with my implementation, my goal was to not have to have the check declare which type, 1 or 2, it is.

Type 3 (global stateful) is now supported in this implementation and the fileset/check must be marked with the interface SingleInstance. The implementation does a shallow clone of the check, and not a deep clone.
However, Type 3 checks may not be supported by Checkstyle anymore unless an agreement can be made on how to handle them. See: https://github.com/checkstyle/checkstyle/issues/2221#issuecomment-257188257

I'll add JavadocPackageCheck to my list of single instances as I missed that too. It is not strictly type 3 material where violations are reported in finishProcessing, but it needs to be single instance to avoid examining/reporting the same directory multiple times.


Reply all
Reply to author
Forward
0 new messages