Multi-thread mode for Java files processing


Julian Gamble

Dec 17, 2016, 9:57:42 PM
to checkstyle
Hi Everyone, 

I've been using checkstyle for about a decade on work projects, and made a contribution back when the source was on sourceforge. 

I can see that twice a project called "Multi-thread mode for Java files processing" has been put on the Google Summer of Code project page for checkstyle:

This is a profoundly useful goal, and I imagine the community here has put quite a bit of thought into this. 

My question is - is this thought and aspiration captured in a design anywhere? (I'm hoping for something a little more than the two paragraphs on that page). 

Is there a list of rules that already comply with the multi-thread requirements?

Is there a list of changes that would need to be made to rules that don't comply with the multi-thread requirements?

Is the underlying AST threadsafe?

Are considerations like making the whole operation threadsafe so that sonar can run checkstyle in parallel with a findbugs operation, in scope?

Cheers
Julian

R Veach

Dec 18, 2016, 12:08:48 AM
to checkstyle
Nothing has been set in stone on how we will achieve multi-threaded status.
Different members have different ideas, written down in different places.

One of the more concrete examples and implementations is at https://groups.google.com/forum/#!topic/checkstyle-devel/ay2CxvaKeQo but it will require many API changes, some of which are already slated to start in Checkstyle 8.
That implementation has not been reviewed though.

The AST is thread-safe as long as ANTLR is. http://stackoverflow.com/questions/19325803/are-antlr-parsers-for-java-thread-safe
I haven't run into any kind of issue with this in the implementation I mentioned above, so I assume we use it in a thread-safe way.

Roman Ivanov

Dec 21, 2016, 9:30:40 AM
to R Veach, Julian Gamble, checkstyle
Hi Julian,

>is this thought and aspiration captured in a design anywhere? 

Unfortunately, the GSoC briefs and Richard's recent thoughts and actions are all that we have.
My ideas are still in my mind; I just need some time to organize them all to share with the community.
The biggest problem there is me: I block all innovations until we make Checkstyle stable in its validations. Even though the team has made and is making a lot of effort to resolve problems, the number of issues is growing.

>Is there a list of rules that already comply with the multi-thread requirements?
>Is there a list of changes that would need to be made to rules that don't comply with the multi-thread requirements?

The rules depend on the design. For now we follow one simple rule where we can: make each Check "stateless" (a Check should hold only the properties that the user defines in configuration). Such Checks are the first candidates to run in multi-thread (MT) mode.
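
To make the "stateless" idea concrete, here is a minimal sketch. It does not use the real Checkstyle API: `LineCheck`, `MaxLineLengthCheck` and `runInParallel` are hypothetical names invented for illustration. The only field is a configured property, all working data lives in local variables, and so one instance can be shared by several threads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class StatelessCheckSketch {

    /** Hypothetical stand-in for a Check; NOT the real Checkstyle API. */
    interface LineCheck {
        List<String> validate(String fileName, List<String> lines);
    }

    /** "Stateless": the only field is a user-configured property, set once. */
    static final class MaxLineLengthCheck implements LineCheck {
        private final int max;

        MaxLineLengthCheck(int max) {
            this.max = max;
        }

        @Override
        public List<String> validate(String fileName, List<String> lines) {
            // All working data is local, so concurrent calls cannot interfere.
            List<String> violations = new ArrayList<>();
            for (int i = 0; i < lines.size(); i++) {
                if (lines.get(i).length() > max) {
                    violations.add(fileName + ":" + (i + 1));
                }
            }
            return violations;
        }
    }

    /** Runs one shared check instance over many files, one task per file. */
    static List<String> runInParallel(LineCheck check,
                                      Map<String, List<String>> files,
                                      int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (Map.Entry<String, List<String>> file : files.entrySet()) {
                futures.add(pool.submit(
                        () -> check.validate(file.getKey(), file.getValue())));
            }
            // Collect in submission order so the report is deterministic.
            List<String> all = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                all.addAll(future.get());
            }
            return all;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("validation failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("A.java", Arrays.asList("short", "this line is far too long"));
        files.put("B.java", Arrays.asList("ok"));
        System.out.println(runInParallel(new MaxLineLengthCheck(10), files, 4));
    }
}
```

A Check that accumulates data in fields while walking a file would need either one instance per thread or synchronization, which is why such Checks are not first candidates.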

>Is the underlying AST threadsafe?

The tree is not changed during validation (this is not enforced by the API, but all Checks follow that rule), so it could be thread-safe if we resolve the MT nuance of each node's internal cache: https://github.com/checkstyle/checkstyle/blob/master/src/main/java/com/puppycrawl/tools/checkstyle/api/DetailAST.java#L59.


We need API changes to make DetailAST more stateless and robust for MT mode.
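
As an illustration of the cache nuance only, here is a sketch of one possible approach, assuming a node lazily caches a value derived from its (unchanging) children. `Node`, `NOT_INITIALIZED` and `getDescendantCount` are hypothetical names in this sketch; this is not the real DetailAST code or a proposed patch for it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class CachedNodeSketch {

    /** Hypothetical read-only tree node; NOT the real DetailAST. */
    static final class Node {
        private static final int NOT_INITIALIZED = Integer.MIN_VALUE;

        private final List<Node> children;

        // Lazily computed cache. 'volatile' makes a write by one thread
        // visible to the others. The computation is idempotent, so a race
        // costs at most a duplicate computation, never a wrong value.
        private volatile int descendantCount = NOT_INITIALIZED;

        Node(List<Node> children) {
            // Defensive copy: the structure cannot change after construction.
            this.children = Collections.unmodifiableList(new ArrayList<>(children));
        }

        int getDescendantCount() {
            int cached = descendantCount; // single volatile read
            if (cached == NOT_INITIALIZED) {
                cached = 0;
                for (Node child : children) {
                    cached += 1 + child.getDescendantCount();
                }
                descendantCount = cached; // publish for other threads
            }
            return cached;
        }
    }

    public static void main(String[] args) {
        Node leaf = new Node(Collections.<Node>emptyList());
        Node root = new Node(Arrays.asList(new Node(Arrays.asList(leaf)), leaf));
        System.out.println(root.getDescendantCount());
    }
}
```

The alternative is to compute such values eagerly when the tree is built, which removes the mutable field entirely at the cost of doing the work even when no Check asks for it.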

>Are considerations like making the whole operation threadsafe so that sonar can run checkstyle in parallel with a findbugs operation, in scope?

Sonar is not the only reason. We care first of all about improving our CLI, and then Maven, eclipse-cs, Sonar and all other plugins and usages.

Sonar can run checks in parallel with any other tools, even right now. But Checkstyle cannot currently be MT inside itself.
By the way, we own the Sonar plugin code now: https://github.com/checkstyle/sonar-checkstyle . So you are welcome to improve it; there are a bunch of problems.

But the main problem for us is to give users a reason to use Checkstyle alongside other tools, that is, to provide unique functionality with minimal false positives. As long as we do this, users will put up with waiting extra minutes during validation. MT mode will give users only some speed. Most users demand correct validation without false positives, because resolving false positives costs them more time.

One more problem is that validation is not the biggest performance problem. The biggest problem is parsing. ANTLR 2 (which we use now) is slow in comparison to others (JavaCC). The current ANTLR is version 4.

Richard, can you share some statistics here about how much time Checkstyle spends in parsing and validation? I remember you did something like this.

-------------------------------------

So MT mode demands a lot of time from us for consideration and design. Meanwhile we have other problems:
- Unclear status of the multi-file mode of validation and of cache usage.
- The new ANTLR parser for Javadoc is not ready for massive usage.
- Most Javadoc checks are broken by design and need to be re-implemented.
- The current API is weird and needs to be changed (we have already postponed this change for a long time).
- An old parser library is used that is not supported any more (some of its limitations could make it impossible to support some syntax of new Java, which would mean the DEATH of Checkstyle). Upgrading to a new version of ANTLR is a huge amount of work.
- The team of maintainers is tiny and works in its spare time, so there is no ability to focus on a huge task for a long time.
- No convenient way to suppress violations (XPath, annotations).
- Support for multiple configurations, with inheritance or composition of configurations.
- Getting out of the moratorium period for new Checks that I enforced a few years ago (almost done).
..... and some others....

So for now MT mode is a dream with a small benefit. There are many other issues that users would love to have resolved. That is why I nominated it for GSoC, where somebody will have a few months to think about it with full focus.
I have not been able to afford full focus on one problem in Checkstyle for a few years already, and I doubt that will change tomorrow.

There is a way out of this - growing the team.
Please come back to the project and we will make it :) eventually :).

thanks,
Roman Ivanov

--
You received this message because you are subscribed to the Google Groups "checkstyle" group.
To unsubscribe from this group and stop receiving emails from it, send an email to checkstyle+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

R Veach

Dec 22, 2016, 8:01:23 AM
to checkstyle, rvea...@gmail.com, julian...@gmail.com
> Richard, can you share some statistics here about how much time Checkstyle spends in parsing and validation? I remember you did something like this.

https://github.com/rnveach/checkstyle/commits/more_audits
https://github.com/checkstyle/checkstyle/compare/master...rnveach:more_audits

That is part of my `more_audits` branch.
Its run time will vary depending on the specs of the machine used.
I attached the raw output of the run on my work computer, which is Windows 7 32-bit with a 4-core processor.

The columns, in order, are Number of Runs, Total Time, Min Time, Max Time, Average Time (Total / Number).
It prints stats per file, per fileset, per check, and per parser.
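
For readers who want to collect the same five columns themselves, here is a minimal sketch of such an accumulator. It is not the code from the `more_audits` branch: `TimingStatsSketch`, `Stat`, `record` and `time` are hypothetical names, and nanoseconds are an assumed unit.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TimingStatsSketch {

    /** One row: Number of Runs, Total, Min, Max, Average (Total / Number). */
    static final class Stat {
        private long runs;
        private long totalNanos;
        private long minNanos = Long.MAX_VALUE;
        private long maxNanos = Long.MIN_VALUE;

        synchronized void record(long nanos) {
            runs++;
            totalNanos += nanos;
            minNanos = Math.min(minNanos, nanos);
            maxNanos = Math.max(maxNanos, nanos);
        }

        synchronized long getRuns() { return runs; }
        synchronized long getTotalNanos() { return totalNanos; }
        synchronized long getMinNanos() { return minNanos; }
        synchronized long getMaxNanos() { return maxNanos; }

        synchronized double getAverageNanos() {
            return runs == 0 ? 0.0 : (double) totalNanos / runs;
        }
    }

    // Keyed by whatever is being measured: a file, a fileset, a check, a parser.
    private final Map<String, Stat> stats = new ConcurrentHashMap<>();

    void record(String name, long nanos) {
        stats.computeIfAbsent(name, key -> new Stat()).record(nanos);
    }

    /** Times one task and records the elapsed time under the given name. */
    void time(String name, Runnable task) {
        long start = System.nanoTime();
        try {
            task.run();
        } finally {
            record(name, System.nanoTime() - start);
        }
    }

    Stat get(String name) {
        return stats.get(name);
    }

    public static void main(String[] args) {
        TimingStatsSketch timing = new TimingStatsSketch();
        for (int i = 0; i < 3; i++) {
            timing.time("demo", () -> Math.sqrt(12345.0));
        }
        Stat s = timing.get("demo");
        System.out.printf("demo runs=%d total=%d min=%d max=%d avg=%.1f%n",
                s.getRuns(), s.getTotalNanos(), s.getMinNanos(),
                s.getMaxNanos(), s.getAverageNanos());
    }
}
```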

The entire run took 69.024 seconds on this branch which has checkstyle's latest source. The printed violations can be ignored.

First is information on each file.
68.4 seconds total is spent inside processing each file, which is almost all of our time.
On average a file, from start to end, takes 0.09 seconds. TokenTypes.java takes the longest at 6.8 seconds. After the top 2, the next group takes 1.6 seconds at the longest, and times continue to go down from there.

Next is information on each FileSet.
47.072 seconds total is spent inside processing the FileSets.
The longest FileSet is TreeWalker, which takes 45.492 seconds and is where our parsing is done. The longest single TreeWalker run takes 6.8 seconds, and the average is 0.06 seconds.
All other FileSets hit at most 0.5 seconds.

Next is information on each Check. Note that this is measured after the parser is done.
9 seconds total is spent inside processing the checks.
The longest check is VariableDeclarationUsageDistanceCheck which is 2.04 seconds. Next longest is UnusedImportsCheck at 0.798 seconds.
On average all checks take 0 seconds for each call.

The last information is on each parser. We just have 2, Java and JavaDoc.
3.07 seconds total is spent inside the Java parser.
33.649 seconds total is spent inside the JavaDoc parser.
JavaDoc received 6.6x more calls than the Java parser, which explains why its time is so much greater. If the number of calls is evened out, it is a comparison of 3.07 seconds to 5.034 seconds.
On average, each parser takes less than 0.007 seconds. The longest Java time is 0.365 seconds. The longest JavaDoc time is 1.44 seconds.




R Veach

Dec 22, 2016, 8:02:24 AM
to checkstyle, rvea...@gmail.com, julian...@gmail.com
Forgot to attach the data.

output.txt