
"clang" static analysis


Georgi Guninski

May 25, 2009, 3:24:36 AM
to dev-stati...@lists.mozilla.org
hi,

has anyone tested clang for static analysis?
it doesn't seem to support C++ yet (just C and Objective-C).

example report of real application (adium im):
http://trac.adium.im/wiki/StaticAnalysis

David Mandelin

May 26, 2009, 2:22:43 PM
to Georgi Guninski, dev-stati...@lists.mozilla.org
I haven't tried it. From what I understand clang is a good parser for C.
C++ support is planned but doesn't exist yet. Given the difficulty of
parsing C++, I imagine it will take at least a few years to make a
really solid C++ parser.

I'm curious about the analyses that the clang bug-checker runs. The main
doc page doesn't say exactly what it does. From the link above, I see
that it tries to find null dereferences in C programs, which is a really
hard problem, especially for large programs.

Dave

Ted Kremenek

May 26, 2009, 2:35:00 PM
to David Mandelin, Georgi Guninski, dev-stati...@lists.mozilla.org


C++ support in Clang is rapidly progressing, but because the Clang
static analyzer performs its analysis at the source level, simply
having C++ parsing support does not imply immediate support in the
analyzer. Bringing that feature up will likely require active
participation from the open source community.

The clang analyzer currently does mostly local analysis, essentially
operating under the conservative approximation that the implementation
of the callee of functions/methods is unavailable for analysis. The
plan is to add more global analysis over time, hopefully over the next
year (time permitting).

The most polished checks in the clang analyzer have focused on
Objective-C related bugs (memory leaks and so forth), since that was
identified (given that the main analyzer development is funded by
Apple) as the area with the most immediate impact, given the lack of
any analyzer for Objective-C. While doing basic path-sensitive
analysis, however, the analyzer also flags null dereferences, uses of
uninitialized values, and returns of the address of stack memory, all
of which naturally fall out from reasoning about individual paths.

David Mandelin

May 26, 2009, 2:43:30 PM
to Ted Kremenek, Georgi Guninski, dev-stati...@lists.mozilla.org
Ted Kremenek wrote:
> C++ support in Clang is rapidly progressing,
Cool. Is there a page with notes on the design? I'm curious what
approach you are using. Elsa's GLR design seems like a good approach,
but it doesn't cover all the latest complicated template features, and
the latest problems I saw seemed difficult to solve in that design. (It
was long enough ago that I don't remember the exact problem.)

> but because the Clang static analyzer performs static analysis at the
> source level simply having C++ parsing support does not imply
> immediate support in the analyzer. Bringing that feature up will
> likely require active participation from the open source community.
What about running over a language-independent IR instead? That's the
approach we've used in Treehydra and it seems like it would be even
better because I think you have much cleaner IRs in LLVM.

> The clang analyzer currently does mostly local analysis, essentially
> operating under the conservative approximation that the implementation
> of the callee of functions/methods is unavailable for analysis. The
> plan is to add more global analysis over time, hopefully over the next
> year (time permitting).
We generally do unsound analysis instead (assuming the callees do
nothing, or do a little bit we can guess at, like writing to
reference-typed arguments) to cut down on false positives. Maybe the
best possible tool has a dial to tune the level of conservatism. I have
no idea what the best default for general-purpose checking is, though.

Dave

Chris Lattner

May 26, 2009, 3:09:07 PM
to David Mandelin, Georgi Guninski, Doug Gregor, Ted Kremenek, dev-stati...@lists.mozilla.org

On May 26, 2009, at 11:43 AM, David Mandelin wrote:

> Ted Kremenek wrote:
>> C++ support in Clang is rapidly progressing,
> Cool. Is there a page with notes on the design? I'm curious what
> approach you are using. Elsa's GLR design seems like a good
> approach, but it doesn't cover all the latest complicated template
> features, and the latest problems I saw seemed difficult to solve in
> that design. (It was long enough ago that I don't remember the exact
> problem.)

We use a straightforward recursive descent parser (like GCC and
EDG). Is there a specific design point you're interested in?

Clang C++ status is tracked here:
http://clang.llvm.org/cxx_status.html

>
>> but because the Clang static analyzer performs static analysis at
>> the source level simply having C++ parsing support does not imply
>> immediate support in the analyzer. Bringing that feature up will
>> likely require active participation from the open source community.
> What about running over a language-independent IR instead? That's
> the approach we've used in Treehydra and it seems like it would be
> even better because I think you have much cleaner IRs in LLVM.

There definitely is work on static and dynamic analysis built on llvm
(e.g. see http://klee.llvm.org/ ). The major difference between
building it on Clang and on LLVM is that Clang gives dramatically more
source-level information, which makes the user interface possibilities
much richer. The advantage of building on LLVM is that an analysis can
work on much more than just C, including Fortran, Ada, Pure, and
various other languages that target llvm.

-Chris

David Mandelin

May 26, 2009, 3:15:33 PM
to Chris Lattner, Georgi Guninski, Doug Gregor, Ted Kremenek, dev-stati...@lists.mozilla.org
Chris Lattner wrote:
> On May 26, 2009, at 11:43 AM, David Mandelin wrote:
>> Ted Kremenek wrote:
>>> C++ support in Clang is rapidly progressing,
>> Cool. Is there a page with notes on the design? I'm curious what
>> approach you are using. Elsa's GLR design seems like a good approach,
>> but it doesn't cover all the latest complicated template features,
>> and the latest problems I saw seemed difficult to solve in that
>> design. (It was long enough ago that I don't remember the exact
>> problem.)
> We use a straightforward recursive descent parser (like GCC and
> EDG). Is there a specific design point you're interested in?
That was the basic question. I've generally suspected that recursive
descent is better than a parser generator for real compilers, so now I
have yet another point in favor. :-)

>>> but because the Clang static analyzer performs static analysis at
>>> the source level simply having C++ parsing support does not imply
>>> immediate support in the analyzer. Bringing that feature up will
>>> likely require active participation from the open source community.
>> What about running over a language-independent IR instead? That's the
>> approach we've used in Treehydra and it seems like it would be even
>> better because I think you have much cleaner IRs in LLVM.
> There definitely is work on static and dynamic analysis built on llvm
> (e.g. see http://klee.llvm.org/ ). The major difference between
> building it on Clang and on LLVM is that Clang gives dramatically more
> source level information, which makes the user interface possibilities
> more rich. The advantage of building on LLVM is that an analysis can
> work on much more than just C, including Fortran, Ada, Pure, and
> various other languages that target llvm.
I know what you mean about the source loc info. It seems possible to
carry through most of the information, although GCC does not, apparently
because of a desire to save memory and perform early optimizations. Not
having implemented a C++ AST->CFG translation myself, I could imagine
there are other hard problems too.

Dave

Ted Kremenek

May 26, 2009, 3:25:05 PM
to David Mandelin, Georgi Guninski, dev-stati...@lists.mozilla.org

On May 26, 2009, at 11:43 AM, David Mandelin wrote:

> Ted Kremenek wrote:
>> C++ support in Clang is rapidly progressing,
> Cool. Is there a page with notes on the design? I'm curious what
> approach you are using. Elsa's GLR design seems like a good
> approach, but it doesn't cover all the latest complicated template
> features, and the latest problems I saw seemed difficult to solve in
> that design. (It was long enough ago that I don't remember the exact
> problem.)

Hi David,

Clang uses a recursive descent parser design, and this applies to the
C++ portion as well. We have found that the design works really well,
and leads to a fairly clean implementation that is easy to understand
and extend.

As you pointed out, the Clang documentation on the parser is lacking
(and should be improved). The design itself is simple. The parser
handles the grammar of the language, and calls back into an abstract
interface (called 'Actions') that is responsible for building up the
ASTs, performing the type checking, etc. (the implementation of this
is called 'Sema'). The abstract interface allows the parsing logic to
be (relatively) simple, and allows one to swap in a different
implementation of the abstract interface if one didn't want to do full
type-checking, etc.

>> but because the Clang static analyzer performs static analysis at
>> the source level simply having C++ parsing support does not imply
>> immediate support in the analyzer. Bringing that feature up will
>> likely require active participation from the open source community.
> What about running over a language-independent IR instead? That's
> the approach we've used in Treehydra and it seems like it would be
> even better because I think you have much cleaner IRs in LLVM.

We definitely thought about this, and we made a deliberate choice to
analyze source code instead of LLVM IR.

Analyzing at the IR level certainly has its benefits, as complicated
language features are lowered to primitive operations, allowing one to
focus on analyzing those primitives. This is both a significant
blessing and an onerous curse, and it all comes down to the tradeoffs
employed.

Our choice of analyzing source code came down to several key factors
(in no particular order):

a) Understanding high-level interfaces.

Many complicated language features reduce to a large number of LLVM IR
instructions, but ultimately we are interested in the macroscopic
actions (e.g., the invocation of a method, which could span many LLVM
IR instructions). Many bugs have to do with reasoning about
interfaces rather than the specific low-level semantics (which are
also important, but can be approximated), and doing this at the source
level can be much easier.

Further, source often captures the programmer's intent in a far more
recognizable way than a lowered representation. Many bugs sit
somewhere between the poles of syntax and semantics. Sometimes a
potential bug isn't really a bug if it occurs within a macro, or the
code was written in a specific way indicating that the programmer
didn't care about the "bug". For example, consider a dead store:

err = foo();

versus

if (err = foo())

Suppose 'err' is never read after the assignment. According to the
semantics of C, the variable 'err' isn't actually read in either case,
but the first case is more likely to be a programming mistake than the
second (the second can also be an error if they meant to write
'err == foo()', but that is conceptually a different kind of bug).
Certainly distinguishing between these cases can be done at the LLVM
IR level, but it is a little more tricky to do. There are also cases
such as 'i++' and 'i = i + 1' that are potentially indistinguishable
at the LLVM IR level, but could be relevant when determining the
chance that a real bug occurred.

In other words, precisely analyzing semantics isn't always enough, and
understanding the intent of the programmer, which often boils down to
looking at syntax, is often very useful when determining whether or
not a real error is present.

b) Language types

Often high-level language types are essentially completely erased at
the LLVM IR level, being lowered to structs, etc. The high-level type
system is especially useful when one is analyzing a language with a
rich OO-type system such as Objective-C and C++. This is useful both
for reasoning about high-level interfaces (my previous point) and
thinking about virtual function calls, etc.

c) Great diagnostics

Clang's preprocessor and parser are integrated, meaning the ASTs have
full information regarding macros, pragmas, the #include stack, and so
on. Clang also has full source range information, with locations for
individual '{' tokens, etc. This allows the analyzer to report
excellent diagnostics with full column and line information, source
ranges, etc. Such rich location information also allows us to
potentially tie into code refactoring operations that could be used to
either fix bugs or to transform the code in some other useful way.
While it is possible to tie much of the LLVM IR back to the original
source, this isn't always trivial as the lowering could be
architecture independent. Moreover, because some language-level
features (such as an Objective-C method invocation) lower to many
LLVM IR instructions, performing the back mapping in many cases can be
non-trivial and error prone.

d) Sometimes lying gets you closer to the truth

Precisely handling various operations such as sign-extension, bit
masking, etc., when reasoning about symbolic values can be
challenging. Instead of being perfect, I think it is easier to
approximate the truth when analyzing source code than when analyzing
LLVM IR (since operations can be broken up over many instructions).
At a high-level representation, it is often easier to understand what
is important and what is not when it comes to precisely analyzing a
fragment of code. Sometimes not handling certain details just doesn't
really matter, and in certain cases where clang's analyzer currently
doesn't handle something well we can often recover path-sensitivity by
making up new symbolic values, etc., when the result of an operation
is "too complicated" to reason about. I think this kind of cheating
is often easier to do at a high level than when using a lowered
representation, but opinions may differ.


Of course analyzing source code can be hard. One has to reason about
arbitrary casts, short-circuit operations, etc., all of which are
simplified when lowered to the LLVM IR level. However, I argue that
once the core logic to handle such things is implemented, the hard
work in implementing the analyzer is elsewhere (e.g., reasoning about
symbolic values and abstracted program memory, etc.).

>> The clang analyzer currently does mostly local analysis,
>> essentially operating under the conservative approximation that the
>> implementation of the callee of functions/methods is unavailable
>> for analysis. The plan is to add more global analysis over time,
>> hopefully over the next year (time permitting).
> We generally do unsound analysis instead (assuming the callees do
> nothing, or do a little bit we can guess at, like writing to
> reference-typed arguments) to cut down on false positives. Maybe the
> best possible tool has a dial to tune the level of conservatism. I
> have no idea what the best default for general-purpose checking is,
> though.

Ah. By conservative I meant a combination of unsound and sound
approximations designed to reduce the number of false positives and
give a high signal-to-noise ratio from the analyzer. In other words, I
prefer to accept false negatives in exchange for fewer false positives
in order to extract the most useful results.
