Simple C++ Source Parser?

Mike Copeland

Jul 19, 2020, 7:34:14 PM

I am working on a C++ source file analyzer; I had one that worked for
C sources. That program was written many years ago, and I'm attempting
to update it for C++ code, as well as use C++ structures and features.
The code for parsing C++ code is tedious, and I'm looking for a
library or functional code that will (1) parse non-comment code elements
and (2) return token strings.
Is there something I can link to/use that will help me? I've done
some Google searching and have seen references to Clang, Elsa, Metre and
ANTLR - all of which seem much more than I need. I just want source
code tokens and to know which source code line they're from.
TIA

Ian Collins

Jul 19, 2020, 7:41:28 PM
On 20/07/2020 11:34, Mike Copeland wrote:
>
> I am working on a C++ source file analyzer; I had one that worked for
> C sources. That program was written many years ago, and I'm attempting
> to update it for C++ code, as well as use C++ structures and features.
> The code for parsing C++ code is tedious, and I'm looking for a
> library or functional code that will (1) parse non-comment code elements
> and (2) return token strings.
> Is there something I can link to/use that will help me? I've done
> some Google searching and have seen references to Clang, Elsa, Metre and
> ANTLR - all of which seem much more than I need. I just want source
> code tokens and to know which source code line they're from.

https://clang.llvm.org/doxygen/classclang_1_1Parser.html

--
Ian.

Sam

Jul 19, 2020, 8:41:33 PM
Mike Copeland writes:

>
> I am working on a C++ source file analyzer; I had one that worked for
> C sources. That program was written many years ago, and I'm attempting
> to update it for C++ code, as well as use C++ structures and features.
> The code for parsing C++ code is tedious, and I'm looking for a
> library or functional code that will (1) parse non-comment code elements
> and (2) return token strings.
> Is there something I can link to/use that will help me? I've done
> some Google searching and have seen references to Clang, Elsa, Metre and
> ANTLR - all of which seem much more than I need. I just want source
> code tokens and to know which source code line they're from.
> TIA

Maybe for some subset of C++ grammar one could come up with a "Simple C++
Source Parser".

But in order to parse the full C++ syntax, especially C++11 and higher,
that's a project of a lifetime, I'm afraid.

Öö Tiib

Jul 20, 2020, 2:35:24 AM
On Monday, 20 July 2020 02:34:14 UTC+3, Mike Copeland wrote:
> I am working on a C++ source file analyzer; I had one that worked for
> C sources. That program was written many years ago, and I'm attempting
> to update it for C++ code, as well as use C++ structures and features.
> The code for parsing C++ code is tedious, and I'm looking for a
> library or functional code that will (1) parse non-comment code elements
> and (2) return token strings.
> Is there something I can link to/use that will help me? I've done
> some Google searching and have seen references to Clang, Elsa, Metre and
> ANTLR - all of which seem much more than I need. I just want source
> code tokens and to know which source code line they're from.

In the general case it is impossible to make a simple parser for logical
analysis of C++ code, since its grammar is large and full of
complications.

The simplest parsers are made for syntax highlighting or
automatic reformatting, but these are usually not interested in the
meaning of the code, so their results are not suitable for substantive
analysis.
Example: Artistic Style.

There are somewhat more aware parsers made for automatic documentation.
Those are more complex, and you will get somewhat better-tagged results
from them. Such parsers also tend to pay a
lot of attention to the contents of comments, but you can ignore that
aspect.
Example: Doxygen.

Even those tools are relatively far from trivial, but since
we do not know your goals and you say Clang is more than
you need, perhaps try them.

Scott Newman

Jul 20, 2020, 2:42:52 AM
C++ is only a bit more complicated to parse than C.
Try to write a parser on your own.

David Brown

Jul 20, 2020, 4:49:15 AM
On 20/07/2020 01:34, Mike Copeland wrote:
>
> I am working on a C++ source file analyzer; I had one that worked for
> C sources. That program was written many years ago, and I'm attempting
> to update it for C++ code, as well as use C++ structures and features.
> The code for parsing C++ code is tedious, and I'm looking for a
> library or functional code that will (1) parse non-comment code elements
> and (2) return token strings.
> Is there something I can link to/use that will help me? I've done
> some Google searching and have seen references to Clang, Elsa, Metre and
> ANTLR - all of which seem much more than I need. I just want source
> code tokens and to know which source code line they're from.
> TIA
>

I think it is unlikely that you'll get far without using a big project.
Parsing C++ has got more and more difficult - there are more new
syntaxes, context-dependent keywords, even a new operator (<=>) in the
latest version (C++20). I would recommend you look again at existing
parsers, and see
if you can learn to use them.

It might take you time to get the hang of clang as a parser, but that's
a job you do once - and then you can take advantage of all the work they
do and you don't have to update or re-write things for each new C++
version. clang is /designed/ to be usable as a library and as a
parser: for syntax highlighting in IDEs, for making static analysers,
for JIT compilation, and for other tools.

I don't know the other tools you mentioned, but I personally would
definitely concentrate on clang first. I'd start with the existing
clang analyser, and see where that could take me - that could be a very
good starting point for adding the new analysers that interest you.

(gcc might also be worth a look these days. There is an analyser
framework in the latest version, there is support for plugins that can
get access to parsed source information for checking, with existing
plugins for other kinds of static or style checking. There is even a
project underway for making a JIT compiler library of gcc. I don't
think gcc is as far down this path as clang, but maybe it is of use.)

Paavo Helde

Jul 20, 2020, 5:55:42 AM
20.07.2020 02:34 Mike Copeland wrote:
>
> I am working on a C++ source file analyzer; I had one that worked for
> C sources. That program was written many years ago, and I'm attempting
> to update it for C++ code, as well as use C++ structures and features.
> The code for parsing C++ code is tedious, and I'm looking for a
> library or functional code that will (1) parse non-comment code elements
> and (2) return token strings.
> Is there something I can link to/use that will help me? I've done
> some Google searching and have seen references to Clang, Elsa, Metre and
> ANTLR - all of which seem much more than I need. I just want source
> code tokens and to know which source code line they're from.
> TIA

If you just want tokens without any knowledge what they mean, then this
should be pretty straightforward, the C++ preprocessor does exactly
that: removes comments and outputs token strings. It also helpfully adds
extra spaces between tokens which would otherwise appear glued together.
It also outputs file names and line numbers so keeping track of line
numbers should be easy. So if I was given this task, I would start with
getting my toolchain to output preprocessed source files instead of
object files.

In preprocessed source, extracting tokens is simple in general, except
for string literals and especially raw string literals which are a bit
more tricky.
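
As a rough illustration of the approach described above, here is a minimal sketch of a token extractor for preprocessed source. It is not from the thread: the names Token and tokenize_preprocessed are made up for this example, and it assumes GCC/Clang-style line markers of the form # 42 "file.cpp". It deliberately leaves out the tricky case mentioned above (raw string literals) and emits multi-character operators as separate one-character tokens.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// A token string plus the original source line it came from.
struct Token {
    std::string text;
    int line;
};

// Hypothetical helper (not from the thread): splits preprocessed text into
// tokens. Assumes GCC/Clang-style line markers such as:  # 42 "file.cpp"
// Raw string literals are NOT handled, and operators such as "->" come out
// as separate one-character tokens.
std::vector<Token> tokenize_preprocessed(const std::string& src) {
    auto is_ident = [](char ch) {
        return std::isalnum(static_cast<unsigned char>(ch)) || ch == '_';
    };
    std::vector<Token> out;
    int line = 1;
    bool at_line_start = true;
    std::size_t i = 0;
    while (i < src.size()) {
        const char c = src[i];
        if (c == '\n') { ++line; at_line_start = true; ++i; continue; }
        if (std::isspace(static_cast<unsigned char>(c))) { ++i; continue; }
        if (at_line_start && c == '#') {
            std::size_t j = i + 1;
            while (j < src.size() && (src[j] == ' ' || src[j] == '\t')) ++j;
            if (j < src.size() && std::isdigit(static_cast<unsigned char>(src[j]))) {
                // The line after the marker has this number; the newline that
                // ends the marker line will add the missing 1.
                line = std::atoi(src.c_str() + j) - 1;
            }
            while (i < src.size() && src[i] != '\n') ++i;  // skip the directive
            continue;
        }
        at_line_start = false;
        const std::size_t start = i;
        if (std::isdigit(static_cast<unsigned char>(c))) {
            // Crude number scan: digits, letters (suffixes, hex) and '.'.
            while (i < src.size() && (is_ident(src[i]) || src[i] == '.')) ++i;
        } else if (is_ident(c)) {
            while (i < src.size() && is_ident(src[i])) ++i;
        } else if (c == '"' || c == '\'') {
            ++i;
            while (i < src.size() && src[i] != c && src[i] != '\n') {
                if (src[i] == '\\' && i + 1 < src.size()) ++i;  // skip escaped char
                ++i;
            }
            if (i < src.size() && src[i] == c) ++i;  // closing quote
        } else {
            ++i;  // single punctuation character
        }
        out.push_back({src.substr(start, i - start), line});
    }
    return out;
}
```

Feeding it the output of something like g++ -E file.cpp should give token strings tagged with their original line numbers. MSVC-style #line directives would need an extra case, since this sketch only recognizes a number directly after the #.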

Beware though that tokens without meaning do not give you much. If all
you know is that there is a token 'final' on line 5095, it does not even
tell you if this is a C++ keyword or some other name, not to speak about
in which scope, namespace or class it belongs to.

Also, if there are different preprocessor branches, only one of them
survives after preprocessing step.
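
To make that last point concrete, a small sketch (USE_FAST_PATH and compute are hypothetical names, not from the thread): with the macro undefined, only the #else branch reaches the token stream, so an analyzer working on preprocessed output never sees the other definition.

```cpp
// Hypothetical feature macro; it is not defined here, so the preprocessor
// keeps the #else branch and discards the other one.
#ifdef USE_FAST_PATH
int compute() { return 1; }  // dropped before any tokens are extracted
#else
int compute() { return 2; }  // the only definition left after preprocessing
#endif
```

Analyzing both variants would mean preprocessing the file twice, once with the macro defined (for example via the compiler's -D option) and once without.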

Keith Thompson

Jul 20, 2020, 1:53:49 PM
Paavo Helde <ees...@osa.pri.ee> writes:
[...]
> Beware though that tokens without meaning do not give you much. If all
> you know is that there is a token 'final' on line 5095, it does not
> even tell you if this is a C++ keyword or some other name, not to
> speak about in which scope, namespace or class it belongs to.

If you see 'final' on line 5095, you can be sure it's not a C++ keyword.

--
Keith Thompson (The_Other_Keith) Keith.S.T...@gmail.com
Working, but not speaking, for Philips Healthcare
void Void(void) { Void(); } /* The recursive call of the void */

James Kuyper

Jul 20, 2020, 6:43:01 PM
On 7/20/20 1:53 PM, Keith Thompson wrote:
> Paavo Helde <ees...@osa.pri.ee> writes:
> [...]
>> Beware though that tokens without meaning do not give you much. If all
>> you know is that there is a token 'final' on line 5095, it does not
>> even tell you if this is a C++ keyword or some other name, not to
>> speak about in which scope, namespace or class it belongs to.
>
> If you see 'final' on line 5095, you can be sure it's not a C++ keyword.

For the benefit of those who don't already know what Keith is referring
to: the C++ standard does not describe either override or final as
keywords. Instead, what it says about them is:

"The identifiers in Table 4 have a special meaning when appearing in a
certain context. When referred to in the grammar, these identifiers are
used explicitly rather than using the identifier grammar production.
Unless otherwise specified, any ambiguity as to whether a given
identifier has a special meaning is resolved to interpret the token as a
regular identifier."

Note that this is a meaningful distinction. Keywords can never be used
as regular identifiers - they are always parsed as keywords, and if they
appear in a place where that keyword is not permitted, it's a syntax
error. When used anywhere other than the specific context where they're
referred to in the grammar, override and final can be used as ordinary
identifiers.
In principle, the part about ambiguities being resolved in favor of the
regular identifier is another difference, but after a careful review of
all of the relevant grammar rules, I can't figure out any way to create
such an ambiguity - I may have missed something.
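
A short sketch of the distinction (the type, function and variable names are invented for illustration): override and final act as specifiers only in the grammar positions that name them, and remain usable as ordinary identifiers everywhere else.

```cpp
struct Base {
    virtual int value() const { return 1; }
    virtual ~Base() = default;
};

// Here 'final' and 'override' appear where the grammar names them,
// so they carry their special meaning.
struct Derived final : Base {
    int value() const override final { return 2; }
};

// Here the same spellings are ordinary identifiers. A real keyword such
// as 'class' could not be used as a variable name like this.
int override = 10;
int final = 20;

int contextual_sum() { return override + final; }
```

A token-only tool sees the same spelling in both places, so it would need at least some grammar context to tell the two uses apart.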

Keith Thompson

Jul 20, 2020, 8:15:07 PM
James Kuyper <james...@alumni.caltech.edu> writes:
> On 7/20/20 1:53 PM, Keith Thompson wrote:
>> Paavo Helde <ees...@osa.pri.ee> writes:
>> [...]
>>> Beware though that tokens without meaning do not give you much. If all
>>> you know is that there is a token 'final' on line 5095, it does not
>>> even tell you if this is a C++ keyword or some other name, not to
>>> speak about in which scope, namespace or class it belongs to.
>>
>> If you see 'final' on line 5095, you can be sure it's not a C++ keyword.
>
> For the benefit of those who don't already know what Keith is referring
> to: the C++ standard does not describe either override or final as
> keywords. Instead, what it says about them is:
>
> "The identifiers in Table 4 have a special meaning when appearing in a
> certain context. When referred to in the grammar, these identifiers are
> used explicitly rather than using the identifier grammar production.
> Unless otherwise specified, any ambiguity as to whether a given
> identifier has a special meaning is resolved to interpret the token as a
> regular identifier."

The set of "those who don't already know what Keith is referring to"
included me. I knew that "final" isn't a C++ keyword, but I had
forgotten about its special status as an identifier.

[...]