| Thanks a lot for your detailed answered. It is very kind of you that you are so open to hear from the open and respond in detail. Thanks, I don’t take it for granted. 1) I totally agree with you that once the file is not changed that all (exact) matching warning can be treated as outstanding, and all the rest as new. If you ask how come there will be new issues for files that was not changed, then I tend to think that it can happen for 2 (rare I must say) reasons.
- Tool changes (such as version upgrade or different configuration as removing suppressions)
- Change in other logically linked files that created a new problem and that the report will point to an old code as part of the problem. In c++ for example (tool: cppcheck) one can see that part of the messages are not “contained in one line”. Meaning that there are problems that are multiline e.g. allocate memory with no free, asking on null pointer after using it.
However since those are quite rare and also make sense from user prospective that will pop us as new for old code, it is quite simple to handle. 2) Now for the interesting part what happens when the file was changed. From the 3 possibilities you mentioned I tend to think that the current behavior is the only one that is valid. From my point of view plagiarism is a totally different issue/problem that is not relevant here, therefore both AST/MOSS will not help us here (see below why I think so). So assuming this is the only valid option, what can be improved in the current behavior to overcome the bug? I tend to think that the only valid change is how to compare an issue and how to fingerprint a message. Therefore my offer is to:
- Calculate line shift through git diff. and/or use git blame to ignore line shift.
- Compare only the exact line (+- shift) of the source code (not range of lines) & warning message id & text.
- If needed compare fingerprint and when match found inform user that this is a fingerprint match and not except match.
Let me explain why I think so. But before that, I want to put it in prospective by adding a disclaimer and some background. First, I am not an expert on any of those issues, so all I can offer is my (somewhat naïve) thoughts on this issue. I googled to understand what is AST/ MOSS and briefed the code and I am not Jenkins expert. Second, my view is derived from my current mission. I am responsible to introduce static code analysis tools (SCA) in the development process for a very big legacy code base. I tend to think that for this reason I tend to prefer one type of solution on the other as I will explain. Ok, so let’s take it one by one.
- Why the current behavior is the only one that is valid?
Well, AFAIU AST/MOSS are plagiarism detection tools that look for copies (with small modifications) for the original code . But for me another copy of a piece of code which was introduced between commits is a new code. Therefore even if I had 2 warning on the first copies and now I have 2 more the 2 new messages should be tagged as new (while the original as outstanding). The is perfectly valid from developer point of view because someone added a copy in another place so this is new message. From developer point of view, if he (or others) added some lines before the line that triggered the warning message line or added some white space in the line then “he did not changed the line”, but if he/others touched the line (in the extreme even if it was white space) then it is new. Look at it this way, let’s define the exact line that created the warning message as the “message line”. So we have three types of file changes:
- Before the line
- In the line
- After the line
Clearly the match algorithm don’t care for (c), the match algorithm & the developer care about (b), and regarding (a) the developer don’t care but the match algorithm should since it needs to understand that it will change the line mentioned in the warning message. So we can use either mechanism that brings better results.
- Sum all git changes up to the current line as line shift and compare reference line + shift to current line
OR
- Compare only git blame SHA1 for reference line to current line (ignore the actual line location and use line content for comparison)
- Why mark real compare different then fingerprint compare ?
As I said I am trying to introduce SCA for large legacy code base. So, for this purpose I am looking for
- Having absolute accurate delta report for “new” issues.
For my needs it is a must since our current code base creates many warnings and I must be able to focus the energy of the organization only on real new issues, otherwise the effort to introduce SCA is doomed to fail .
- Having (almost) accurate delta report for “fixed” issues.
For next phase of the deployment, I would like to data mine all the fixes of the source code that were done during the development of old versions and try to correlate them with SCA messages. Finding old fixes that are correlated with old SCA messages means that those problems were detected by QA or customers and therefore can help me to prove that it is cost effective to fix even legacy messages at least for message categories that will show correlation with old fixes. Therefore I tend to think that letting the user know about what is absolute match and what seems to be a match is very crucial to keep user’s trust in the report categories {new, fixed, outstanding}. |