is there a C++ version of the strtok() function?

alexo

Jun 28, 2019, 1:15:09 PM

Hello,

I would like to improve a chemical formula parser that I wrote from
scratch, without using any tokenizing functions. It correctly handles the
formula:

Fe4[Fe(CN)6]3*6H2O

but not the following:

[Be(N(CH3)2)2]3

my question is:
is there a purely C++ function that behaves like the C strtok()?

thank you

alessandro

Thiago Adams

Jun 28, 2019, 1:33:48 PM

Maybe std::regex can help you do what you want.
https://en.cppreference.com/w/cpp/regex

I don't remember how chemical formulas are expressed, but
I believe the best thing to do is to write your own tokenizer
and parser.
If strtok was helping you, that means the tokenizer
you need is a simple one.
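
For the tokenizing step alone, std::regex can be sketched as follows. The token pattern (an upper-case letter with an optional lower-case letter, a run of digits, or one of the bracket and '*' characters) is an assumption taken from the example formulas in this thread, and `regex_tokens` is a hypothetical helper name. Characters that match nothing are silently skipped, and the nesting still has to be handled by a separate parsing step.

```cpp
#include <regex>
#include <string>
#include <vector>

// Split a formula into element symbols, numbers, and bracket/'*' characters.
// The pattern is an assumption based on the formulas quoted in this thread.
std::vector<std::string> regex_tokens(const std::string& formula)
{
    static const std::regex token(R"([A-Z][a-z]?|[0-9]+|[()*]|\[|\])");
    std::vector<std::string> out;
    for (std::sregex_iterator it(formula.begin(), formula.end(), token), end;
         it != end; ++it) {
        out.push_back(it->str());   // text of the current match
    }
    return out;
}
```

Unlike strtok, this keeps the brackets as tokens, so a later step can still see where a group opens and closes.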

Scott Lurndal

Jun 28, 2019, 1:41:35 PM

If it works, why "improve" it? strtok is perfectly legal C++.

Manfred

Jun 28, 2019, 1:47:35 PM

If you want a function that behaves like strtok, why not use strtok
itself? This, like all C standard functions, is allowed in C++.
Besides, I am pretty sure that the (pure) C++ standard library does not
include a function that behaves identically to a C standard function.

That said, from what you are trying to achieve probably strtok is not
the best tokenizer for the purpose - most notably it does not handle
nesting and paired (open/close) parentheses by itself, not to mention
that it overwrites delimiters with 0's.

A long time ago I wrote a math expression parser, but that was pure C,
and not using strtok either.

Looking at C++ and the kind of problem, you probably won't be best served
by a /function/; you might instead use some combination of string_view
with some recursive logic.
Others may give more detailed hints.

>
> thank you
>
> alessandro

Bonita Montero

Jun 28, 2019, 1:54:28 PM

> Maybe std::regex can help you in the way you want to do.
> https://en.cppreference.com/w/cpp/regex

This will also work, but I'm sure it's not nearly as fast
as strtok().

Alf P. Steinbach

Jun 28, 2019, 2:17:16 PM

The requirements are not clear.

But the main problem with `strtok` is that it isn't thread safe.

If thread safety isn't a concern then just (continue to) use it.

Otherwise, consider either a regular expression (standard library
solution) or a parsing framework like Boost Spirit (3rd party library).

Cheers & hth.,

- Alf

alexo

Jun 28, 2019, 3:11:32 PM

On 28/06/19 20:17, Alf P. Steinbach wrote:

> On 28.06.2019 19:14, alexo wrote:
>>
>> I would like to improve a chemical formula parser that I wrote from
>> scratch without using tokenizing functions that correctly handles the
>> formula:
>>
>> Fe4[Fe(CN)6]3*6H2O
>>
>> but not the following:
>>
>> [Be(N(CH3)2)2]3
>>
>> my asking is:
>> is there a purely C++ function that behaves like the C strtok() ?

>
> The requirements are not clear.

What is not clear in my question? I was wondering if there exists
a C++ std function that replaces strtok.

> But the main problem with `strtok` is that it isn't thread safe.

I don't need threads, so strtok is ok.

thank you

James Kuyper

Jun 28, 2019, 3:13:07 PM

On Friday, June 28, 2019 at 2:17:16 PM UTC-4, Alf P. Steinbach wrote:
> On 28.06.2019 19:14, alexo wrote:

...

> > is there a purely C++ function that behaves like the C strtok() ?
>
>
> The requirements are not clear.
>
> But the main problem with `strtok` is that it isn't thread safe.

If an implementation pre#defines __STDC_LIB_EXT1__, you can use
std::strtok_s(), declared in <cstring>, which is thread safe.

Bonita Montero

Jun 28, 2019, 3:22:59 PM

> If an implementation pre#defines __STDC_LIB_EXT1__, you can use
> std::strtok_s(), declared in <cstring>, which is thread safe.

Why don't they simply re-specify strtok() for newer language
versions with internal buffers which are thread-local?

Christian Gollwitzer

Jun 28, 2019, 3:23:09 PM

On 28.06.19 at 19:33, Thiago Adams wrote:

> On Friday, June 28, 2019 at 2:15:09 PM UTC-3, alexo wrote:
>> Hello,
>>
>> I would like to improve a chemical formula parser that I wrote from
>> scratch without using tokenizing functions that correctly handles the
>> formula:
>>
>> Fe4[Fe(CN)6]3*6H2O
>>
>> but not the following:
>>
>> [Be(N(CH3)2)2]3
>

> Maybe std::regex can help you in the way you want to do.
> https://en.cppreference.com/w/cpp/regex
>

A regex cannot express this grammar. Simple proof: there are nested
parentheses, and for that you need a pushdown (stack) automaton. But generally using
a parser generator might be good advice. There are many to choose from,
I like PEG grammars. There is https://github.com/yhirose/cpp-peglib for
example.
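
The stack requirement can be sketched by hand in a few lines. This is only an illustration with hypothetical names (`count_atoms_with_stack`, `read_multiplier`); it assumes balanced brackets, treats '(' and '[' alike, does not validate element symbols, and ignores integer overflow.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

using Counts = std::map<std::string, long>;

// Read an optional decimal multiplier at position i; a missing one means 1.
static long read_multiplier(std::string_view f, std::size_t& i)
{
    if (i >= f.size() || !std::isdigit(static_cast<unsigned char>(f[i]))) return 1;
    long n = 0;
    while (i < f.size() && std::isdigit(static_cast<unsigned char>(f[i]))) {
        n = n * 10 + (f[i++] - '0');
    }
    return n;
}

// Count atoms with an explicit stack: one map per open bracket level.
Counts count_atoms_with_stack(std::string_view f)
{
    std::vector<Counts> stack(1);            // stack[0] is the whole formula
    std::size_t i = 0;
    while (i < f.size()) {
        const char c = f[i];
        if (c == '(' || c == '[') {
            stack.emplace_back();            // open a new level
            ++i;
        } else if (c == ')' || c == ']') {
            ++i;
            if (stack.size() < 2) continue;  // stray close: ignored in this sketch
            const long mult = read_multiplier(f, i);
            Counts inner = std::move(stack.back());
            stack.pop_back();                // fold the group into its parent
            for (const auto& [sym, n] : inner) stack.back()[sym] += n * mult;
        } else if (std::isupper(static_cast<unsigned char>(c))) {
            std::string sym(1, c);
            ++i;
            if (i < f.size() && std::islower(static_cast<unsigned char>(f[i]))) sym += f[i++];
            stack.back()[sym] += read_multiplier(f, i);
        } else {
            ++i;                             // skip anything unrecognised
        }
    }
    return stack.front();
}
```

Each closing bracket multiplies the counts collected since the matching opening bracket, which is exactly the bookkeeping a regular expression cannot do.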

Christian

Alf P. Steinbach

Jun 28, 2019, 3:26:31 PM

This was news to me, and I'm unable to find much info about it.

What I've found, scattered here & there in discussions, is that

* Microsoft submitted their silly *_s bounds checking functions for
standardization in C.
* The C committee fixed the worst problems and created a technical
report, TR 24731-1.
* That TR was included as normative annex K in C11.
* To use it you're apparently supposed to #define __STDC_WANT_LIB_EXT1__
as 1 before including any C library header.

I don't have the C11 standard, unfortunately.

And nothing I've found indicates a connection with threading?

Cheers!,

- Alf

James Kuyper

Jun 28, 2019, 3:37:33 PM

Yes. It is declared in <cstring>, and it's called std::strtok().

If the length of the string which is the second argument to your
strtok() calls is 1, you might want to look into std::getline<>(),
declared in <string>, which takes an argument which is a delimiter
character.
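
A minimal sketch of the std::getline approach for a single delimiter character; `split_on` is a hypothetical helper name. Note one difference from strtok: adjacent delimiters produce empty fields here, whereas strtok skips runs of delimiters.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split text on one delimiter character using std::getline.
std::vector<std::string> split_on(const std::string& text, char delim)
{
    std::vector<std::string> fields;
    std::istringstream in(text);
    std::string field;
    while (std::getline(in, field, delim)) {   // stops when the stream is exhausted
        fields.push_back(field);
    }
    return fields;
}
```

For example, `split_on("CuSO4*5H2O", '*')` gives the two fields `CuSO4` and `5H2O`, and the input string is left unmodified.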

alexo

Jun 28, 2019, 3:45:33 PM

On 28/06/19 19:47, Manfred wrote:

> On 6/28/2019 7:14 PM, alexo wrote:
>> Hello,
>>
>> I would like to improve a chemical formula parser that I wrote from
>> scratch without using tokenizing functions that correctly handles the
>> formula:
>>
>> Fe4[Fe(CN)6]3*6H2O
>>
>> but not the following:
>>
>> [Be(N(CH3)2)2]3
>>
>> my asking is:
>> is there a purely C++ function that behaves like the C strtok() ?

> [...] from what you are trying to achieve probably strtok is not

> the best tokenizer for the purpose - most notably it does not handle
> nesting and paired (open/close) parentheses by itself, not to mention
> that it overwrites delimiters with 0's.

I thought it could help, but if I use something like this:

tokens = strtok("Na[Fe(CN)6]", "()[]*");

I get: Na Fe CN 6

that is a correct but useless decomposition, because as you stated, I
can't match the '6' referring to both the 'Fe' and the 'CN' group.
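
For reference, a call in the form shown above would not work as written in C++: strtok needs a writable buffer (a string literal does not convert to `char*`), and it returns one token per call. A minimal sketch of the usual loop, with `strtok_tokens` as a hypothetical helper name, reproduces the decomposition:

```cpp
#include <cstring>
#include <string>
#include <vector>

// Collect all strtok tokens. `text` is taken by value because strtok
// overwrites the delimiters in the buffer it scans.
std::vector<std::string> strtok_tokens(std::string text, const char* delims)
{
    std::vector<std::string> out;
    for (char* tok = std::strtok(text.data(), delims);
         tok != nullptr;
         tok = std::strtok(nullptr, delims)) {   // nullptr continues the same scan
        out.emplace_back(tok);
    }
    return out;
}
```

Calling `strtok_tokens("Na[Fe(CN)6]", "()[]*")` yields `Na`, `Fe`, `CN`, `6`: the brackets are consumed as delimiters, so the grouping information is gone.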

> A long time ago I wrote a math expression parser, but that was pure C,
> and not using strtok either.
> Looking at C++ and the kind of problem, you probably won't be best off
> with a /function/, possibly you may use some combination of string_view
> with some recursive logic.
> Others may give more detailed hints.
>

The program that I've written uses a 'manual' jump from an opening
parenthesis to the corresponding closing one, but it can't handle trickier
formulas.

for example:

[Be(N(CH3)2)2]3

in my program is seen as having:

3 Be atoms -> correct
6 N atoms -> correct
1 C atom -> it should be 12
3 H atoms -> it should be 36
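
The expected counts (3 Be, 6 N, 12 C, 36 H) fall out naturally if each bracketed group is parsed recursively and multiplied by the number that follows it. The following is only a sketch along the lines of the string_view suggestion quoted above, with hypothetical names throughout; it treats '(' and '[' alike, does not validate element symbols, and does not handle the '*' notation.

```cpp
#include <cctype>
#include <map>
#include <string>
#include <string_view>

using AtomCounts = std::map<std::string, long>;

namespace sketch {

inline bool is_digit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }

// Consume an optional count from the front of `rest`; a missing count means 1.
inline long take_count(std::string_view& rest)
{
    if (rest.empty() || !is_digit(rest.front())) return 1;
    long n = 0;
    while (!rest.empty() && is_digit(rest.front())) {
        n = n * 10 + (rest.front() - '0');
        rest.remove_prefix(1);
    }
    return n;
}

// Parse items until a closing bracket or the end of the input.
inline AtomCounts parse_group(std::string_view& rest)
{
    AtomCounts total;
    while (!rest.empty() && rest.front() != ')' && rest.front() != ']') {
        const char c = rest.front();
        rest.remove_prefix(1);
        if (c == '(' || c == '[') {
            AtomCounts inner = parse_group(rest);        // recurse into the group
            if (!rest.empty()) rest.remove_prefix(1);    // drop the closing bracket
            const long mult = take_count(rest);
            for (const auto& [sym, n] : inner) total[sym] += n * mult;
        } else if (std::isupper(static_cast<unsigned char>(c))) {
            std::string sym(1, c);
            if (!rest.empty() && std::islower(static_cast<unsigned char>(rest.front()))) {
                sym += rest.front();
                rest.remove_prefix(1);
            }
            total[sym] += take_count(rest);
        }
        // any other character is skipped in this sketch
    }
    return total;
}

} // namespace sketch

AtomCounts count_atoms(std::string_view formula)
{
    return sketch::parse_group(formula);
}
```

The multiplier of an outer group is applied to the already-multiplied counts of the inner groups, which is what the 'manual' jump between matching brackets misses.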

thank you,
alessandro

Scott Lurndal

Jun 28, 2019, 3:50:24 PM

Bonita Montero <Bonita....@gmail.com> maliciously elided attributions yet again:

Because it makes much more sense for the caller to provide the
storage for the metadata, as POSIX realized two decades ago:

$ man strtok |head -20
STRTOK(3)              Linux Programmer's Manual              STRTOK(3)

NAME
       strtok, strtok_r - extract tokens from strings

SYNOPSIS
       #include <string.h>

       char *strtok(char *str, const char *delim);

       char *strtok_r(char *str, const char *delim, char **saveptr);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       strtok_r(): _SVID_SOURCE || _BSD_SOURCE || _POSIX_C_SOURCE >= 1 ||
                   _XOPEN_SOURCE || _POSIX_SOURCE

James Kuyper

Jun 28, 2019, 4:35:47 PM

On Friday, June 28, 2019 at 3:26:31 PM UTC-4, Alf P. Steinbach wrote:
> On 28.06.2019 21:13, James Kuyper wrote:

...

> > If an implementation pre#defines __STDC_LIB_EXT1__, you can use
> > std::strtok_s(), declared in <cstring>, which is thread safe.
>
> This was news to me, and I'm unable to find much info about it.
>
> What I've found, scattered here & there in discussions, is that
>
> * Microsoft submitted their silly *_s bounds checking functions for
> standardization in C.
> * The C committee fixed the worst problems and created a technical
> report, TR 24731-1.
> * That TR was included as normative annex K in C11.
> * To use it you're apparently supposed to #define __STDC_WANT_LIB_EXT1__
> as 1 before including any C library header.

That's implementation-specific. The standard does not specify how to enable support for annex K, only how to check whether support has been enabled.

> I don't have the C11 standard, unfortunately.
>
> And nothing I've found indicates a connection with threading?

When I said that strtok_s() is thread-safe, I should instead have said
that it can be used in a thread-safe fashion.

"The strtok function is not required to avoid data races with other
calls to the strtok function.311)" (n1570.pdf 7.24.5.8p3)

311 is a reference to the following footnote:
"The strtok_s function can be used instead to avoid data races."

The fact that the functions described in Annex K can be used in a thread
safe fashion is something you must derive from the descriptions, it's
never said explicitly in Annex K itself.

The key feature of strtok_s() that improves thread safety over strtok()
is that strtok() uses its own data area to store information about the
string it's parsing. With strtok_s(), you define your own char* pointer,
and then pass the address of that pointer to strtok_s() as its fourth
argument. That pointer will contain the information that strtok_s()
needs to continue its parsing when you call it with a null first
argument.

All of the data memory used by strtok_s() is under your control. If you
manage that data in a thread-safe fashion, then your calls to
strtok_s() will be thread safe.

Keith Thompson

Jun 28, 2019, 5:57:56 PM

N1570 is the last pre-standard draft of C11. It's close enough for most
purposes.

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf

Annex K is normative but optional. An implementation that supports it
must pre#define __STDC_LIB_EXT1__.

User code should have
#define __STDC_WANT_LIB_EXT1__ 1
to enable the features of Annex K (or 0 to disable them). It's
implementation-defined whether they're enabled or not if
__STDC_WANT_LIB_EXT1__ is not defined.
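
The two macros can be combined as in the following sketch. `annex_k_available` is a hypothetical helper; the mechanism itself is specified by C11, and whether a C++ implementation honours it is an assumption that has to be checked per toolchain.

```cpp
#define __STDC_WANT_LIB_EXT1__ 1   // request Annex K before any C library header
#include <cstring>

// True only when the implementation advertises Annex K support
// by pre-defining __STDC_LIB_EXT1__.
constexpr bool annex_k_available()
{
#ifdef __STDC_LIB_EXT1__
    return true;
#else
    return false;
#endif
}
```

Code that wants to call strtok_s() can guard the call with the same `#ifdef` and fall back to another tokenizer when the macro is absent.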

> And nothing I've found indicates a connection with threading?

strtok() modifies the string passed to it as an argument and it
maintains internal state, so it can't be used in parallel to parse two
different strings. (That could happen either with separate threads or
with interspersed calls in non-threaded code.) strtok_s() requires the
caller to provide space for any internal state, so two different threads
should be able to use it safely as long as the storage they provide is
distinct.

None of the implementations I use support Annex K, and there's a serious
proposal to remove it from C2X.
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1969.htm
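
Where Annex K is unavailable, the same caller-owned-state idea can be sketched with POSIX strtok_r(), which is an assumption about the platform (it is not part of ISO C or C++). The hypothetical helper `interleave_tokens` scans two strings in alternation, which plain strtok() cannot do because it has only one hidden state.

```cpp
#include <string.h>   // strtok_r is POSIX, not ISO C or ISO C++
#include <string>
#include <vector>

// Take tokens alternately from two strings; each scan keeps its own state.
std::vector<std::string> interleave_tokens(std::string a, std::string b,
                                           const char* delims)
{
    std::vector<std::string> out;
    char* save_a = nullptr;   // scan state for `a`, owned by the caller
    char* save_b = nullptr;   // scan state for `b`, owned by the caller
    char* ta = strtok_r(a.data(), delims, &save_a);
    char* tb = strtok_r(b.data(), delims, &save_b);
    while (ta != nullptr || tb != nullptr) {
        if (ta != nullptr) {
            out.emplace_back(ta);
            ta = strtok_r(nullptr, delims, &save_a);
        }
        if (tb != nullptr) {
            out.emplace_back(tb);
            tb = strtok_r(nullptr, delims, &save_b);
        }
    }
    return out;
}
```

Because the two save pointers are distinct objects, the scans do not interfere with each other, and the same holds for two threads that each own their buffer and save pointer.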

--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

Keith Thompson

Jun 28, 2019, 6:01:56 PM

alexo <ale...@inwind.it> writes:
> On 28/06/19 20:17, Alf P. Steinbach wrote:
>> On 28.06.2019 19:14, alexo wrote:
>>>
>>> I would like to improve a chemical formula parser that I wrote from
>>> scratch without using tokenizing functions that correctly handles the
>>> formula:
>>>
>>> Fe4[Fe(CN)6]3*6H2O
>>>
>>> but not the following:
>>>
>>> [Be(N(CH3)2)2]3
>>>
>>> my asking is:
>>> is there a purely C++ function that behaves like the C strtok() ?
>>
>> The requirements are not clear.
>
> what is not clear in my question? I was wondering if there exists
> a C++ std function that replaces strtok.

What's not clear to me is how strtok() would solve your stated problem.

strtok() splits a string on a specified delimiter. How would that parse
your sample formulas?

I suggest that you have an XY problem.
https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem

lex/flex and yacc/bison could do the job, but might be overkill.

[...]

Keith Thompson

Jun 28, 2019, 6:10:51 PM

James Kuyper <james...@alumni.caltech.edu> writes:
> On Friday, June 28, 2019 at 3:26:31 PM UTC-4, Alf P. Steinbach wrote:

[...]

>> * To use it you're apparently supposed to #define __STDC_WANT_LIB_EXT1__
>> as 1 before including any C library header.
>
> That's implementation-specific. The standard does not specify how to
> enable support for annex K, only how to check whether support has been
> enabled.

It does both. A program can use __STDC_WANT_LIB_EXT1__ for the
former. An implementation pre-defines __STDC_LIB_EXT1__ for the latter.

For details, see K.2p2 and K.3.1.1 in N1570/C11.

Mr Flibble

Jun 28, 2019, 6:48:37 PM

C++ includes strtok as std::strtok but why would you want to use a
demented thing such as strtok in C++? Use std::regex or write your own
sane tokeniser that doesn't modify the input string.
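
A tokeniser with strtok's splitting behaviour that leaves the input untouched can be sketched with std::string_view. `split_any` is a hypothetical name; the returned views point into the caller's string, so that string must outlive them.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Split `text` at any character in `delims`, skipping runs of delimiters
// as strtok does, without modifying or copying the input.
std::vector<std::string_view> split_any(std::string_view text, std::string_view delims)
{
    std::vector<std::string_view> out;
    std::size_t pos = text.find_first_not_of(delims);
    while (pos != std::string_view::npos) {
        const std::size_t end = text.find_first_of(delims, pos);
        const std::size_t len =
            (end == std::string_view::npos) ? text.size() - pos : end - pos;
        out.push_back(text.substr(pos, len));
        pos = text.find_first_not_of(delims, end);   // npos when nothing is left
    }
    return out;
}
```

There is no hidden state, so the function can be used from several threads or in nested loops without the problems discussed for strtok.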

/Flibble


Alf P. Steinbach

Jun 28, 2019, 6:53:26 PM

On further reflection, after looking at the documentation of `strtok`, I
don't see how you could have written a parser that works for the example
formula, using that function. Because it discards the delimiter it stops
at. Which means, in any way I can think of using it, that it discards the
parentheses.

In my response to you I hadn't yet looked at the `strtok` docs, I just
assumed that (as I interpreted it) what you wrote about having a working
parser, was correct, and then I based what I wrote on what I remembered
about the function.

Re the question: part of what's unclear about the requirements is e.g.
the rules for use of square brackets versus round parentheses, and
whether there can be multi-digit integers in there. From what I remember
of chemistry and quantum mechanics I guess the maximum number would be 7
or less, maybe just 4? But guesswork doesn't make up for a clear spec.

Cheers!,

- Alf

Ben Bacarisse

Jun 28, 2019, 6:57:30 PM

alexo <ale...@inwind.it> writes:
<cut>

> I thought it could help, but if I use something like this:
>
> tokens = strtok("Na[Fe(CN)6]", "()[]*");
>
>
> I get: Na Fe CN 6
>
> that is a correct but useless decomposition, because as you stated, I
> can't match the '6' referring to both the 'Fe' and the 'CN' group.

Can you link to a definition of this notation? The examples look simple
but I can imagine that it could get complex and I don't want to suggest
anything that is a dead-end.

Even so I suspect that regexes and parser generators are all a bit
heavy-weight. The tokens appear to be element names (upper-case letter
optionally followed by one lower-case letter though you could limit it
to just the valid ones), numbers and (, ), [, ] and *, so what would be
called the lexical analysis is very simple.
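
The lexical analysis described above fits in one small function. This is a sketch with hypothetical names (`TokenKind`, `FormulaToken`, `lex_formula`); it accepts any upper-case letter plus optional lower-case letter as an element without checking it against the periodic table, and the token texts are views into the caller's string.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>
#include <vector>

enum class TokenKind { element, number, punctuation, invalid };

struct FormulaToken {
    TokenKind kind;
    std::string_view text;   // view into the caller's formula
};

namespace {
bool upper(char c) { return std::isupper(static_cast<unsigned char>(c)) != 0; }
bool lower(char c) { return std::islower(static_cast<unsigned char>(c)) != 0; }
bool digit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
}

std::vector<FormulaToken> lex_formula(std::string_view f)
{
    std::vector<FormulaToken> out;
    std::size_t i = 0;
    while (i < f.size()) {
        const std::size_t start = i;
        const char c = f[i++];
        TokenKind kind = TokenKind::invalid;
        if (upper(c)) {                       // element: upper + optional lower
            kind = TokenKind::element;
            if (i < f.size() && lower(f[i])) ++i;
        } else if (digit(c)) {                // number: run of digits
            kind = TokenKind::number;
            while (i < f.size() && digit(f[i])) ++i;
        } else if (std::string_view("()[]*").find(c) != std::string_view::npos) {
            kind = TokenKind::punctuation;
        }
        out.push_back({kind, f.substr(start, i - start)});
    }
    return out;
}
```

Anything else becomes a one-character `invalid` token, so the parser that consumes the tokens can decide how to report it.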

To know where to go from there one would need to know what the result of
the parse should be.

--
Ben.

James Kuyper

Jun 28, 2019, 7:10:00 PM

On 6/28/19 4:35 PM, James Kuyper wrote:
> On Friday, June 28, 2019 at 3:26:31 PM UTC-4, Alf P. Steinbach wrote:
>> On 28.06.2019 21:13, James Kuyper wrote:
> ...
>>> If an implementation pre#defines __STDC_LIB_EXT1__, you can use
>>> std::strtok_s(), declared in <cstring>, which is thread safe.
>>
>> This was news to me, and I'm unable to find much info about it.
>>
>> What I've found, scattered here & there in discussions, is that
>>
>> * Microsoft submitted their silly *_s bounds checking functions for
>> standardization in C.
>> * The C committee fixed the worst problems and created a technical
>> report, TR 24731-1.
>> * That TR was included as normative annex K in C11.
>> * To use it you're apparently supposed to #define __STDC_WANT_LIB_EXT1__
>> as 1 before including any C library header.
>
> That's implementation-specific. The standard does not specify how to enable support for annex K, only how to check whether support has been enabled.

Strictly speaking, what I said was correct - #defining
__STDC_WANT_LIB_EXT1__ doesn't have any effect unless the implementation
pre#defines __STDC_LIB_EXT1__.

However, the wording of my comment reflected the fact that I was unaware
of the fact that the effects of #defining __STDC_WANT_LIB_EXT1__ are in
fact specified by the C11 standard. I don't make much use of Annex K, so
I hadn't delved into the details of how it works.

alexo

Jun 28, 2019, 11:19:03 PM

On 29/06/19 00:53, Alf P. Steinbach wrote:

> On 28.06.2019 21:11, alexo wrote:
>> On 28/06/19 20:17, Alf P. Steinbach wrote:
>>> On 28.06.2019 19:14, alexo wrote:
>>>>
>>>> I would like to improve a chemical formula parser that I wrote from
>>>> scratch without using tokenizing functions that correctly handles
>>>> the formula:
>>>>
>>>> Fe4[Fe(CN)6]3*6H2O
>>>>
>>>> but not the following:
>>>>
>>>> [Be(N(CH3)2)2]3
>>>>
>>>> my asking is:
>>>> is there a purely C++ function that behaves like the C strtok() ?
>>
>>
>>
>>>
>>> The requirements are not clear.
>>
>> what is not clear in my question? I was wondering if there exists
>> a C++ std function that replaces strtok.
>>
>>> But the main problem with `strtok` is that it isn't thread safe.
>>
>> I don't need threads, so strtok is ok.
>
> On further reflection, after looking at the documentation of `strtok`, I
> don't see how you could have written a parser that works for the example
> formula, using that function. Because it discards the delimiter it stops
> at. Which means, in any way I think of using it, that it discards the
> parentheses.

What I wrote doesn't use the strtok function. I analyzed the formula from
left to right, skipping invalid letter combinations and analyzing
numerical coefficients.
I wrongly supposed that the strtok function could help me with the job, but
that is not so.

> Re the question: part of what's unclear about the requirements is e.g.
> the rules for use of square brackets versus round parentheses, and
> whether there can be multi-digit integers in there. From what I remember
> of chemistry and quantum mechanics I guess the maximum number would be 7
> or less, maybe just 4? But guesswork doesn't make up for a clear spec.

The numerical coefficient can theoretically be of any length, but usually
it is not greater than, say, 15.

>
> Cheers!,
>
> - Alf

Alf P. Steinbach

Jun 29, 2019, 12:08:45 AM

OK. The following is a manual approach to lexing such formulas. I've
assumed that they can contain whitespace.

-------------------------------------- chemical-elements.hpp
#pragma once
#include <string_view>

namespace chemical
{
using std::string_view;

struct Element
{
int number;
string_view symbol;
string_view name;
};

constexpr Element elements[] =
{
{ 1, "H", "Hydrogen" },
{ 2, "He", "Helium" },
{ 3, "Li", "Lithium" },
{ 4, "Be", "Beryllium" },
{ 5, "B", "Boron" },
{ 6, "C", "Carbon" },
{ 7, "N", "Nitrogen" },
{ 8, "O", "Oxygen" },
{ 9, "F", "Fluorine" },
{ 10, "Ne", "Neon" },
{ 11, "Na", "Sodium" },
{ 12, "Mg", "Magnesium" },
{ 13, "Al", "Aluminium" },
{ 14, "Si", "Silicon" },
{ 15, "P", "Phosphorus" },
{ 16, "S", "Sulfur" },
{ 17, "Cl", "Chlorine" },
{ 18, "Ar", "Argon" },
{ 19, "K", "Potassium" },
{ 20, "Ca", "Calcium" },
{ 21, "Sc", "Scandium" },
{ 22, "Ti", "Titanium" },
{ 23, "V", "Vanadium" },
{ 24, "Cr", "Chromium" },
{ 25, "Mn", "Manganese" },
{ 26, "Fe", "Iron" },
{ 27, "Co", "Cobalt" },
{ 28, "Ni", "Nickel" },
{ 29, "Cu", "Copper" },
{ 30, "Zn", "Zinc" },
{ 31, "Ga", "Gallium" },
{ 32, "Ge", "Germanium" },
{ 33, "As", "Arsenic" },
{ 34, "Se", "Selenium" },
{ 35, "Br", "Bromine" },
{ 36, "Kr", "Krypton" },
{ 37, "Rb", "Rubidium" },
{ 38, "Sr", "Strontium" },
{ 39, "Y", "Yttrium" },
{ 40, "Zr", "Zirconium" },
{ 41, "Nb", "Niobium" },
{ 42, "Mo", "Molybdenum" },
{ 43, "Tc", "Technetium" },
{ 44, "Ru", "Ruthenium" },
{ 45, "Rh", "Rhodium" },
{ 46, "Pd", "Palladium" },
{ 47, "Ag", "Silver" },
{ 48, "Cd", "Cadmium" },
{ 49, "In", "Indium" },
{ 50, "Sn", "Tin" },
{ 51, "Sb", "Antimony" },
{ 52, "Te", "Tellurium" },
{ 53, "I", "Iodine" },
{ 54, "Xe", "Xenon" },
{ 55, "Cs", "Caesium" },
{ 56, "Ba", "Barium" },
{ 57, "La", "Lanthanum" },
{ 58, "Ce", "Cerium" },
{ 59, "Pr", "Praseodymium" },
{ 60, "Nd", "Neodymium" },
{ 61, "Pm", "Promethium" },
{ 62, "Sm", "Samarium" },
{ 63, "Eu", "Europium" },
{ 64, "Gd", "Gadolinium" },
{ 65, "Tb", "Terbium" },
{ 66, "Dy", "Dysprosium" },
{ 67, "Ho", "Holmium" },
{ 68, "Er", "Erbium" },
{ 69, "Tm", "Thulium" },
{ 70, "Yb", "Ytterbium" },
{ 71, "Lu", "Lutetium" },
{ 72, "Hf", "Hafnium" },
{ 73, "Ta", "Tantalum" },
{ 74, "W", "Tungsten" },
{ 75, "Re", "Rhenium" },
{ 76, "Os", "Osmium" },
{ 77, "Ir", "Iridium" },
{ 78, "Pt", "Platinum" },
{ 79, "Au", "Gold" },
{ 80, "Hg", "Mercury" },
{ 81, "Tl", "Thallium" },
{ 82, "Pb", "Lead" },
{ 83, "Bi", "Bismuth" },
{ 84, "Po", "Polonium" },
{ 85, "At", "Astatine" },
{ 86, "Rn", "Radon" },
{ 87, "Fr", "Francium" },
{ 88, "Ra", "Radium" },
{ 89, "Ac", "Actinium" },
{ 90, "Th", "Thorium" },
{ 91, "Pa", "Protactinium" },
{ 92, "U", "Uranium" },
{ 93, "Np", "Neptunium" },
{ 94, "Pu", "Plutonium" },
{ 95, "Am", "Americium" },
{ 96, "Cm", "Curium" },
{ 97, "Bk", "Berkelium" },
{ 98, "Cf", "Californium" },
{ 99, "Es", "Einsteinium" },
{ 100, "Fm", "Fermium" },
{ 101, "Md", "Mendelevium" },
{ 102, "No", "Nobelium" },
{ 103, "Lr", "Lawrencium" },
{ 104, "Rf", "Rutherfordium" },
{ 105, "Db", "Dubnium" },
{ 106, "Sg", "Seaborgium" },
{ 107, "Bh", "Bohrium" },
{ 108, "Hs", "Hassium" },
{ 109, "Mt", "Meitnerium" },
{ 110, "Ds", "Darmstadtium" },
{ 111, "Rg", "Roentgenium" },
{ 112, "Cn", "Copernicium" },
{ 113, "Nh", "Nihonium" },
{ 114, "Fl", "Flerovium" },
{ 115, "Mc", "Moscovium" },
{ 116, "Lv", "Livermorium" },
{ 117, "Ts", "Tennessine" },
{ 118, "Og", "Oganesson" }
};
} // namespace chemical

-------------------------------------- chemical_formula-Tokenizer.hpp
#pragma once
#include "chemical-elements.hpp"    // chemical::(Element, elements)
#include <cppx-core/all.hpp>        // <url: https://github.com/alf-p-steinbach/cppx-core>

namespace chemical_formula
{
    $use_std(
        stoi, string_view
        );
    $use_cppx(
        Map_,       // A std::unordered_map with [] indexing of const instance.
        P_,         // P_<T> is an alias for T*. It supports prefix const.
        p_first_of, p_beyond_of
        );
    namespace ascii = cppx::ascii;      // ascii::is_*

    struct Token
    {
        struct Kind{ enum Enum {
            none,
            element,
            number,
            left_parens = '(', right_parens = ')',
            left_bracket = '[', right_bracket = ']'
        }; };

        Kind::Enum kind;
        string_view text;
        int n;      // Only used for `element` and `number`.
    };

    class Tokenizer
    {
        const string_view m_formula;
        const P_<const char> m_p_beyond_formula;

        Token m_current;

        struct Symbols_to_elements_map:
            Map_<string_view, P_<const chemical::Element>>
        {
            Symbols_to_elements_map()
            {
                auto& self = *this;
                for( const chemical::Element& elem : chemical::elements ) {
                    self[elem.symbol] = &elem;
                }
            }
        };

        auto is_in_formula( const P_<const char> p )
            -> bool
        { return p != m_p_beyond_formula; }

        void find_token_that_starts_at( const P_<const char> p_start )
        {
            static const auto symbols = Symbols_to_elements_map();

            const char first_char = *p_start;
            if( ascii::is_digit( first_char ) ) {
                P_<const char> p_beyond = p_start + 1;
                while( is_in_formula( p_beyond ) and ascii::is_digit( *p_beyond ) ) {
                    ++p_beyond;
                }
                const auto text = string_view( p_start, p_beyond - p_start );
                try {
                    m_current = { Token::Kind::number, text, stoi( p_start ) };
                } catch( ... ) {
                    m_current = { Token::Kind::none, text, -1 };
                }
            } else if( ascii::is_uppercase( first_char ) ) {
                P_<const char> p_beyond = p_start + 1;
                while( is_in_formula( p_beyond ) and ascii::is_lowercase( *p_beyond ) ) {
                    ++p_beyond;
                }
                const auto text = string_view( p_start, p_beyond - p_start );
                try {
                    m_current = { Token::Kind::element, text, symbols[text]->number };
                } catch( ... ) {
                    m_current = { Token::Kind::none, text, -1 };
                }
            } else {
                const auto text = string_view( p_start, 1 );
                switch( first_char ) {
                    case '(': [[fallthrough]];
                    case ')': [[fallthrough]];
                    case '[': [[fallthrough]];
                    case ']': {
                        m_current = { Token::Kind::Enum( first_char ), text, -1 };
                        break;
                    }
                    default: {
                        m_current = { Token::Kind::none, text, -1 };
                    }
                }
            }
        }

        void find_next_remaining_token()
        {
            P_<const char> p_start = p_beyond_of( m_current.text );
            while( is_in_formula( p_start ) and ascii::is_whitespace( *p_start ) ) {
                ++p_start;
            }
            if( p_start == m_p_beyond_formula ) {
                m_current = { Token::Kind::none, string_view( p_start, 0 ), -1 };
                return;
            }
            find_token_that_starts_at( p_start );
        }

    public:
        auto current() const
            -> Token
        { return m_current; }

        auto is_at_end() const
            -> bool
        { return p_first_of( m_current.text ) == p_beyond_of( m_formula ); }

        void advance()
        {
            find_next_remaining_token();
        }

        Tokenizer( const string_view& formula ):
            m_formula( formula ),
            m_p_beyond_formula( p_beyond_of( formula ) ),
            m_current{ Token::Kind::none, string_view( formula.data(), 0 ), -1 }
        {
            assert( m_current.text.data() == formula.data() );
            find_next_remaining_token();
        }
    };
}  // namespace chemical_formula

-------------------------------------- main.cpp
#include "chemical_formula-Tokenizer.hpp"
#include <cppx-core/all.hpp>

auto main() -> int
{
    $use_std( cout, endl );

    const auto& formula = "[Be(N(CH3)24)255555555555555555555]3";

    cout << "Tokens:" << endl;
    for( auto tokens = chemical_formula::Tokenizer( formula );
         not tokens.is_at_end();
         tokens.advance() )
    {
        const chemical_formula::Token tok = tokens.current();
        cout << "* ";
        if( tok.kind == chemical_formula::Token::Kind::none ) {
            cout << "<invalid> ";
        }
        cout << "\"" << tok.text << "\"" << endl;
    }
}

Output:

Tokens:
* "["
* "Be"
* "("
* "N"
* "("
* "C"
* "H"
* "3"
* ")"
* "24"
* ")"
* <invalid> "255555555555555555555"
* "]"
* "3"
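
In the output above the 21-digit number is reported as invalid because the conversion to int fails. The same check can be sketched without exceptions using std::from_chars from C++17; `to_count` is a hypothetical helper, and it assumes the tokenizer hands it only digit characters (std::from_chars itself would also accept a leading minus sign).

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Convert a digit token to int; returns an empty optional on overflow
// or when the token contains anything that is not part of the number.
std::optional<int> to_count(std::string_view digits)
{
    int value = 0;
    const char* first = digits.data();
    const char* last = first + digits.size();
    const auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc() || ptr != last) return std::nullopt;
    return value;
}
```

Because the function works on exactly the characters of the token, it does not depend on the formula being NUL-terminated.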

alexo

Jun 29, 2019, 12:21:53 AM

On 29/06/19 05:18, alexo wrote:

For a single central atom the coordination number can be higher than 7,
and the coefficient can be very high in organic chemistry formulas,
where there is no limit to the number of carbon atoms that can be
chained together. They are usually written by grouping methyl or methylene
groups, like this:

CH3(CH2)10CH(CH3)(CH2)3CH3

5-methyl-hexadecane.

For the parentheses, formulas usually involve a pair of square brackets
inside of which there are round parentheses.

The example I reported earlier, namely, [Be(N(CH3)2)2]3 even if simple,
fails to be recognized by my parser.

2(3HC)-N-N-(CH3)2 being the 1,1,2,2 tetramethyl-hydrazine group

alexo

Jun 29, 2019, 12:31:50 AM

On 29/06/19 06:08, Alf P. Steinbach wrote:

> On 29.06.2019 05:18, alexo wrote:
>> On 29/06/19 00:53, Alf P. Steinbach wrote:
>>> On 28.06.2019 21:11, alexo wrote:
>>>> On 28/06/19 20:17, Alf P. Steinbach wrote:
>>>>> On 28.06.2019 19:14, alexo wrote:

> OK. The following is a manual approach to lexing such formulas. I've
> assumed that they can contain whitespace.

No Alf, chemical formulas don't have spaces inside them. They are a list
of the atoms, or group of atoms the molecule contains.

Chris M. Thomasson

Jun 29, 2019, 12:49:14 AM

[...]

>         { 115, "Mc", "Moscovium" },
>         { 116, "Lv", "Livermorium" },
>         { 117, "Ts", "Tennessine" },
>         { 118, "Og", "Oganesson" }

[...]

Where 115 = ununpentium or Lazarium? lol. ;^)

Just having some fun.

Bonita Montero

Jun 29, 2019, 2:40:26 AM

>> Why they don't simply re-specify strtok() for newer language
>> -versions with internal buffers which are thread-local?

> Because it makes much more sense for the caller to provide the
> storage for the metadata, as POSIX realized two decades ago:

I think storing the saveptr yourself is less convenient than a
strtok() with internal thread-local buffers. Moreover, that
solution would be fully backward-compatible.

Manfred

Jun 29, 2019, 10:48:00 AM

Just an extra bit of practical info: glibc, as widespread as it is, does
_not_ support annex K, on purpose.

[...]

Bonita Montero

Jun 29, 2019, 10:51:14 AM

> * Microsoft submitted their silly *_s bounds checking functions for
> standardization in C.

They aren't silly! If you don't use C++ strings but stick with C
char arrays, they make sense.

Manfred

Jun 29, 2019, 12:59:26 PM

Except that they are poorly designed, i.e., in human language, silly.

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1969.htm

alexo

Jun 30, 2019, 12:43:12 PM

On 30/06/19 16:04, Stefan Ram wrote:

Hi Stefan, your parser works correctly. I checked it with other formulas,
such as

[Fe(CO)6][Sb2F11]2

1 * [ 1 * Fe + 6 * ( 1 * C + 1 * O ) ] + 2 * [ 2 * Sb + 11 * F ]

and

HRh(CO)[P[CH2CH2(CF2)5CF]3]3

for which the output is:

1 * H + 1 * Rh + 1 * ( 1 * C + 1 * O ) + 3 * [ 1 * P + 3 * [ 1 * C + 2 *
H + 1 * C + 2 * H + 5 * ( 1 * C + 2 * F ) + 1 * C + 1 * F ] ]

For completeness, the text from which I took this formula reports it as:

HRh(CO)[P{CH2CH2(CF2)5CF}3]3

i.e. with another kind of brackets, '{' and '}'.

I added that to your parser.
You did a great job.

But to be perfect it should also handle the molecules of crystallization,
which are written in this way (I didn't mention them in my post):

CuSO4*5H2O

Here there are 5 molecules of H2O in the composition of the molecule.
I use the '*' character because the proper one is the 'central dot'
character, and I don't know how to type it.
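
One way to support this notation is to split the formula at each '*' first and read the leading coefficient of every part, then hand each part to the existing parser and scale its atom counts. The following is only a sketch of that splitting step, with the hypothetical name `split_hydrate`; it assumes '*' never appears inside brackets.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>
#include <utility>
#include <vector>

// Split "CuSO4*5H2O" into (multiplier, sub-formula) pairs.
// A part without a leading number gets the multiplier 1.
std::vector<std::pair<long, std::string_view>> split_hydrate(std::string_view formula)
{
    std::vector<std::pair<long, std::string_view>> parts;
    while (!formula.empty()) {
        const std::size_t star = formula.find('*');
        std::string_view part = formula.substr(0, star);   // up to '*' or the end
        formula.remove_prefix(star == std::string_view::npos ? formula.size()
                                                             : star + 1);
        long mult = 0;
        std::size_t digits = 0;
        while (digits < part.size() &&
               std::isdigit(static_cast<unsigned char>(part[digits]))) {
            mult = mult * 10 + (part[digits] - '0');
            ++digits;
        }
        part.remove_prefix(digits);                         // drop the coefficient
        if (!part.empty()) parts.emplace_back(digits == 0 ? 1 : mult, part);
    }
    return parts;
}
```

For `CuSO4*5H2O` this gives the pairs (1, `CuSO4`) and (5, `H2O`); the counts of the second part are then multiplied by 5 before being added to the total.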

Here is your program, purged of all those ::std prefixes that made the text hard
to read:

#include <cctype>
#include <iostream>
#include <istream>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

using namespace ::std::literals;

using std::string;
using std::move;
using std::cout;
using std::istringstream;

template< typename source >struct scanner
{ source s;

  string peek_value;

  typename source::int_type eof { source::traits_type::eof() };

  scanner( source && s )noexcept : s{ move( s )}, peek_value{ ""s } {}

  scanner( scanner && s )noexcept : s{ static_cast<source&&>(s.s)},
  peek_value{static_cast<string&&>(s.peek_value)} { }

  string name( char c )
  { string result;
    if( isalpha( c ))
    { int pos {};
      while( isalpha( c )&&( pos == 0 || islower( c )))
      { pos = 1;
        c = static_cast< char >( s.get() );
        result += c;
        auto const i = s.peek();
        if( i == eof )return result;
        c = static_cast< char >( i ); }}
    return result; }

  string number( char c )
  { string result;
    if( isdigit( c ))
    { while( isdigit( c ))
      { c = static_cast< char >( s.get() );
        result += c;
        auto const i = s.peek();
        if( i == eof )return result;
        c = static_cast< char >( i ); }}
    return result; }

  string special( char c )
  { string result;
    if( c == '(' || c == ')' || c == '[' || c == ']' || c == '{' || c == '}')
    { c = static_cast< char >( s.get() );
      result += c;
      return result; }
    return result; }

  string do_get()
  { string result;
    auto const i = s.peek();
    if( i == eof )return result;
    else
    { auto c { static_cast< char >( i )};
      result = name( c ); if( result != ""s )return result;
      result = number( c ); if( result != ""s )return result;
      result = special( c ); if( result != ""s )return result;
      return result; }
    return result; }

  string get()
  { if( peek_value != ""s )
    { string const result{ peek_value };
      peek_value = ""s;
      return result; }
    else return do_get(); }

  string peek()
  { if( peek_value == ""s )peek_value = do_get();
    return peek_value; }};

struct parse_type
{ virtual string surface() const = 0;
virtual string description() const = 0;
parse_type()= default;
parse_type( const parse_type & )= default;
parse_type( parse_type && )= default;
parse_type & operator=( const parse_type & )= default;
parse_type & operator=( parse_type && )noexcept = default;
virtual ~parse_type() = default; };

struct primary_parse_type final : public parse_type
{ parse_type const * primary;
primary_parse_type( const primary_parse_type & )= default;
primary_parse_type& operator=(const primary_parse_type&)= default;
primary_parse_type( parse_type const * primary ):
primary{ primary } {}
string surface() const override { return primary->surface(); }
string description() const override
{ return primary->description(); }};

struct name_parse_type;

struct name_parse_type final : public parse_type
{ string surface_;
name_parse_type( string surface ):surface_{ move(surface) } {}
string surface() const override { return surface_; }
string description() const override { return surface_; }};

struct number_parse_type final : public parse_type
{ string surface_;
number_parse_type( string surface ):surface_{ move(surface) } {}
string surface() const override { return surface_; }
string description() const override { return "number "s + surface_; }};

struct product_parse_type final : public parse_type
{ string surface_;
primary_parse_type * component;
string multiplicity;
product_parse_type(const product_parse_type&)=default;
product_parse_type& operator=(const product_parse_type&)=default;
product_parse_type( string surface, primary_parse_type * component,
string multiplicity ):
surface_{ move(surface) }, component{ component }, multiplicity{
move(multiplicity) } {}
string surface() const override { return surface_; }
string description() const override { return multiplicity + " * " +
component->description(); }};

struct sum_parse_type final : public parse_type
{ string surface_;
::std::vector< product_parse_type >vector;
sum_parse_type( string surface, ::std::vector< product_parse_type >
vector ):
surface_{ move(surface) }, vector{ move(vector) } {}
string const plus = " + "s;
void append_surface( string const & s ){ surface_ += s; }
void append_addend( product_parse_type const & s ){ vector.push_back(
s ); }
string surface() const override { return surface_; }
string description() const override
{ string result;
bool first = true;
for( auto & entry: vector )
if( first ){ result = entry.description(); first = false; }
else result = result + plus + entry.description();
return result; }};

struct paren_parse_type final : public parse_type
{ string surface_;
string lpar;
string rpar;
sum_parse_type sum;
paren_parse_type( string surface, string lpar, string rpar,
sum_parse_type sum ):
surface_{ move(surface) }, lpar{ move(lpar) }, rpar{ move(rpar) },
sum{ move(sum) } {}
string const space{ " "s };
string surface() const override { return surface_; }
string description() const override
{ return lpar + space + sum.description() + space + rpar; }};

template< typename scanner >struct parser_type
{ scanner sc;
explicit parser_type( scanner && sc ): sc{ move( sc )} {}

name_parse_type get_name()
{ string t;
t = sc.get(); return name_parse_type{ t }; }

number_parse_type get_number()
{ string t; t = sc.get(); return number_parse_type{ t }; }

product_parse_type get_product()
{ primary_parse_type const component { get_primary() };
string preview { sc.peek() };
bool const is_product { preview.length() > 0 && isdigit(
preview.at( 0 )) };
string const multiplicity { is_product ? sc.get(): "1"s };
return product_parse_type{ component.surface() +( is_product ?
preview : ""s ), new primary_parse_type { component }, multiplicity }; }

sum_parse_type get_sum()
{ auto result { sum_parse_type{ ""s, {} }};
product_parse_type component { get_product() };
if( component.surface() == ""s )return result;
result.append_surface( component.surface() );
result.append_addend( component );
string preview;
while( true )
{ preview = sc.peek();
bool const is_sum { preview.length() > 0 &&( isalpha(
preview.front() ) || preview.front() == '(' || preview.front() == '[' ||
preview.front() == '{') };
if( !is_sum )return result;
component = get_product();
result.append_surface( component.surface() );
result.append_addend( component ); }
return sum_parse_type{ ""s, {} }; }

paren_parse_type get_paren()
{ string const lpar { sc.get() };
sum_parse_type const sum { get_sum() };
string const rpar { sc.get() };
return paren_parse_type
{ lpar + sum.surface() + rpar, lpar, rpar, sum }; }

primary_parse_type get_primary()
{ auto next { sc.peek() };
if( isalpha( next.front() ))
{ name_parse_type name = get_name();
return new name_parse_type{ name }; }
if( next.front() == '(' || next.front() == '[' || next.front() == '{')
{ paren_parse_type paren = get_paren();
return new paren_parse_type{ paren }; }
return nullptr; }};

int main()
{ istringstream s{ "HRh(CO)[P{CH2CH2(CF2)5CF}3]3" };
scanner< istringstream >sc{ move( s )};
parser_type< scanner< istringstream > >parser{ move( sc )};
sum_parse_type sum_value = parser.get_sum();
cout << sum_value.surface() << '\n';
cout << sum_value.description() << '\n'; }

Heinz Müller

Jun 30, 2019, 1:07:39 PM

> Here is a quick attempt to write a parser for it.
>
> However, this is just a "quick and dirty" solution,
> it is a first draft, a kind of "compilable pseudo code",
> it contains some unnecessary lines.
>
> It can parse the above formula, but has not been
> tested with other formulas, and it allocates memory
> with "new" that never is freed (it is leaking memory).
>
> Maybe better memory management can be achieved by
> overwriting the destructors?
>
> input (see "main"):
>
> [Be(N(CH3)2)2]3
>
> output (written to stdout):
>
> 3 * [ 1 * Be + 2 * ( 1 * N + 2 * ( 1 * C + 3 * H ) ) ]
>
> source code (wide lines!):

>
> #include <cctype>
> #include <iostream>
> #include <istream>
> #include <memory>
> #include <ostream>
> #include <sstream>
> #include <string>
> #include <vector>
>
> using namespace ::std::literals;
>

> template< typename source >struct scanner
> { source s;
>

> ::std::string peek_value;

>
> typename source::int_type eof { source::traits_type::eof() };
>

> scanner( source && s )noexcept : s{ ::std::move( s )}, peek_value{ ""s } {}
>
> scanner( scanner && s )noexcept : s{ static_cast<source&&>(s.s)}, peek_value{static_cast<::std::string&&>(s.peek_value)} { }
>
> ::std::string name( char c )
> { ::std::string result;

> if( isalpha( c ))
> { int pos {};
> while( isalpha( c )&&( pos == 0 || islower( c )))
> { pos = 1;
> c = static_cast< char >( s.get() );
> result += c;
> auto const i = s.peek();
> if( i == eof )return result;
> c = static_cast< char >( i ); }}
> return result; }
>

> ::std::string number( char c )
> { ::std::string result;

> if( isdigit( c ))
> { while( isdigit( c ))
> { c = static_cast< char >( s.get() );
> result += c;
> auto const i = s.peek();
> if( i == eof )return result;
> c = static_cast< char >( i ); }}
> return result; }
>

> ::std::string special( char c )
> { ::std::string result;
> if( c == '(' || c == ')' || c == '[' || c == ']' )

> { c = static_cast< char >( s.get() );
> result += c;
> return result; }
> return result; }
>

> ::std::string do_get()
> { ::std::string result;

> auto const i = s.peek();
> if( i == eof )return result;
> else
> { auto c { static_cast< char >( i )};
> result = name( c ); if( result != ""s )return result;
> result = number( c ); if( result != ""s )return result;
> result = special( c ); if( result != ""s )return result;
> return result; }
> return result; }
>

> ::std::string get()

> { if( peek_value != ""s )

> { ::std::string const result{ peek_value };

> peek_value = ""s;
> return result; }
> else return do_get(); }
>

> ::std::string peek()

> { if( peek_value == ""s )peek_value = do_get();
> return peek_value; }};
>
>
> struct parse_type

> { virtual ::std::string surface() const = 0;
> virtual ::std::string description() const = 0;

> parse_type()= default;
> parse_type( const parse_type & )= default;
> parse_type( parse_type && )= default;
> parse_type & operator=( const parse_type & )= default;
> parse_type & operator=( parse_type && )noexcept = default;
> virtual ~parse_type() = default; };
>
> struct primary_parse_type final : public parse_type
> { parse_type const * primary;
> primary_parse_type( const primary_parse_type & )= default;
> primary_parse_type& operator=(const primary_parse_type&)= default;
> primary_parse_type( parse_type const * primary ):
> primary{ primary } {}

> ::std::string surface() const override { return primary->surface(); }
> ::std::string description() const override

> { return primary->description(); }};
>
> struct name_parse_type;
>
> struct name_parse_type final : public parse_type

> { ::std::string surface_;
> name_parse_type( ::std::string surface ):surface_{ ::std::move(surface) } {}
> ::std::string surface() const override { return surface_; }
> ::std::string description() const override { return surface_; }};

>
> struct number_parse_type final : public parse_type

> { ::std::string surface_;
> number_parse_type( ::std::string surface ):surface_{ ::std::move(surface) } {}
> ::std::string surface() const override { return surface_; }
> ::std::string description() const override { return "number "s + surface_; }};

>
> struct product_parse_type final : public parse_type

> { ::std::string surface_;
> primary_parse_type * component;
> ::std::string multiplicity;

> product_parse_type(const product_parse_type&)=default;
> product_parse_type& operator=(const product_parse_type&)=default;

> product_parse_type( ::std::string surface, primary_parse_type * component, ::std::string multiplicity ):
> surface_{ ::std::move(surface) }, component{ component }, multiplicity{ ::std::move(multiplicity) } {}
> ::std::string surface() const override { return surface_; }
> ::std::string description() const override { return multiplicity + " * " + component->description(); }};

>
> struct sum_parse_type final : public parse_type

> { ::std::string surface_;
> ::std::vector< product_parse_type >vector;
> sum_parse_type( ::std::string surface, ::std::vector< product_parse_type > vector ):
> surface_{ ::std::move(surface) }, vector{ ::std::move(vector) } {}
> ::std::string const plus = " + "s;
> void append_surface( ::std::string const & s ){ surface_ += s; }

> void append_addend( product_parse_type const & s ){ vector.push_back( s ); }

> ::std::string surface() const override { return surface_; }
> ::std::string description() const override
> { ::std::string result;

> bool first = true;
> for( auto & entry: vector )
> if( first ){ result = entry.description(); first = false; }
> else result = result + plus + entry.description();
> return result; }};
>
> struct paren_parse_type final : public parse_type

> { ::std::string surface_;
> ::std::string lpar;
> ::std::string rpar;
> sum_parse_type sum;
> paren_parse_type( ::std::string surface, ::std::string lpar, ::std::string rpar, sum_parse_type sum ):
> surface_{ ::std::move(surface) }, lpar{ ::std::move(lpar) }, rpar{ ::std::move(rpar) }, sum{ ::std::move(sum) } {}
> ::std::string const space{ " "s };
> ::std::string surface() const override { return surface_; }
> ::std::string description() const override

> { return lpar + space + sum.description() + space + rpar; }};
>
>
> template< typename scanner >struct parser_type
> { scanner sc;

> explicit parser_type( scanner && sc ): sc{ ::std::move( sc )} {}
>
> name_parse_type get_name()
> { ::std::string t;

> t = sc.get(); return name_parse_type{ t }; }
>
> number_parse_type get_number()

> { ::std::string t; t = sc.get(); return number_parse_type{ t }; }

>
> product_parse_type get_product()
> { primary_parse_type const component { get_primary() };

> ::std::string preview { sc.peek() };

> bool const is_product { preview.length() > 0 && isdigit( preview.at( 0 )) };

> ::std::string const multiplicity { is_product ? sc.get(): "1"s };

> return product_parse_type{ component.surface() +( is_product ? preview : ""s ), new primary_parse_type { component }, multiplicity }; }
>
> sum_parse_type get_sum()
> { auto result { sum_parse_type{ ""s, {} }};
> product_parse_type component { get_product() };
> if( component.surface() == ""s )return result;
> result.append_surface( component.surface() );
> result.append_addend( component );

> ::std::string preview;

> while( true )
> { preview = sc.peek();

> bool const is_sum { preview.length() > 0 &&( isalpha( preview.front() )|| preview.front() == '(' || preview.front() == '[' )};

> if( !is_sum )return result;
> component = get_product();
> result.append_surface( component.surface() );
> result.append_addend( component ); }
> return sum_parse_type{ ""s, {} }; }
>
> paren_parse_type get_paren()

> { ::std::string const lpar { sc.get() };
> sum_parse_type const sum { get_sum() };
> ::std::string const rpar { sc.get() };

> return paren_parse_type
> { lpar + sum.surface() + rpar, lpar, rpar, sum }; }
>
> primary_parse_type get_primary()
> { auto next { sc.peek() };
> if( isalpha( next.front() ))
> { name_parse_type name = get_name();
> return new name_parse_type{ name }; }

> if( next.front() == '(' || next.front() == '[' )

> { paren_parse_type paren = get_paren();
> return new paren_parse_type{ paren }; }
> return nullptr; }};
>
> int main()

> { ::std::istringstream s{ "[Be(N(CH3)2)2]3" };
> scanner< ::std::istringstream >sc{ ::std::move( s )};
> parser_type< scanner< ::std::istringstream > >parser{ ::std::move( sc )};
> sum_parse_type sum_value = parser.get_sum();
> ::std::cout << sum_value.surface() << '\n';
> ::std::cout << sum_value.description() << '\n'; }

You've won the 2019 obfuscated code contest!

Mr Flibble

Jun 30, 2019, 2:54:42 PM

On 30/06/2019 15:04, Stefan Ram wrote:

> alexo <ale...@inwind.it> writes:
>> but not the following:
>> [Be(N(CH3)2)2]3
>

What's with all the `::std::`? `std::` is fine. Your code is nearly as
bad as Alf's.

Bonita Montero

Jul 1, 2019, 12:56:55 AM

> What's with all the `::std::`? `std::` is fine. Your code is nearly as
> bad as Alf's.

::std:: is because of Stefan's compulsiveness.

Juha Nieminen

Jul 1, 2019, 3:30:50 AM

Scott Lurndal <sc...@slp53.sl.home> wrote:
> If it works, why "improve" it? strtok is perfectly legal C++.

Because strtok has known and acknowledged limitations which make it
unsuitable for many applications. Most prominently, it maintains
an internal singleton state, which means you can't strtok two strings
in alternation, not even if you do it inside one single thread.
(Of course strtok is completely non-thread-safe and cannot be made
thread-safe even by using locking from the outside.)

There exist non-singleton implementations of strtok, where the
internal state is kept in a struct, but these are, so far, non-standard
library extensions.

Scott Lurndal

Jul 1, 2019, 9:09:22 AM

Juha Nieminen <nos...@thanks.invalid> writes:
>Scott Lurndal <sc...@slp53.sl.home> wrote:
>> If it works, why "improve" it? strtok is perfectly legal C++.
>
>Because strtok has known and acknowledged limitations which make it
>unsuitable for many applications.

I'm well aware of strtok and its shortcomings.

However, my question stands. If it works already [in an application], why change it?

The impression I got was that the OP didn't think it was "C++" enough.

alexo

Jul 1, 2019, 9:21:39 AM

On 28/06/19 19:14, alexo wrote:
> Hello,

>
> I would like to improve a chemical formula parser that I wrote from
> scratch without using tokenizing functions that correctly handles the
> formula:
>
> Fe4[Fe(CN)6]3*6H2O
>

> but not the following:
>
> [Be(N(CH3)2)2]3
>

> my asking is:

> is there a purely C++ function that behaves like the C strtok() ?
>

> thank you
>
> alessandro

I see that my question stirred up some of you, so I'm publishing the code
I wrote for my parser. It may appear simple, stupid or whatever; I ask
for your suggestions about it.
Remember that I'm not a professional programmer nor a student, but just
a hobbyist C++ coder.

// it needs a std::vector<string> named 'atoms'

// the parse_formula header is the following:

static void parse_formula(string& f, unsigned int mc_factor = 1);

// parse_formula implementation
// at a certain point of the code it calls itself, using mc_factor
// to remember the multiplicative coefficient of the enclosing
// parentheses.

void WCalc::parse_formula(string& f, unsigned int mc_factor)
{
unsigned int number_of_atoms = 0;

string s {};

if(f.length() == 0 || f.length() > SIZE)
{
error = true;
return;
}

for(unsigned int i = 0; i < f.length(); i++) {
if(islower(f[i])) {
error = true;
break;
}
if(f.length() == 1 and isupper(f[i])) // single letter atom
{
s = f[i];
atoms.push_back(s);
s.clear();
number_of_atoms++;
coefficients.push_back(1 * mc_factor); // set atomic coefficient to 1
break;
}
if(isupper(f[i]) and isupper(f[i + 1])) { // single atom
s = f[i];
atoms.push_back(s);
s.clear();
number_of_atoms++;
coefficients.push_back(1 * mc_factor); // set atomic coefficient to 1
continue;
}
if(isupper(f[i]) and islower(f[i + 1])) { // two letters atom...
if(islower(f[i + 2])) {
cout << "\nwrong input";
error = true;
break;
} else {
s = f[i];
s += f[i + 1];
atoms.push_back(s);
number_of_atoms++;
s.clear();

if(isdigit(f[i + 2])) { // ...and its atomic coefficient
int ks = i + 2;
int ke = ks;
while(isdigit(f[ke])) {
ke++;
}

coefficients.push_back(atoi(f.substr(ks, ke -
ks).c_str()) * mc_factor);

i += ke - ks;
}
else
{
coefficients.push_back(1 * mc_factor); // atomic coefficient 1
i += 1;
}
}
}
if(isupper(f[i]) and isdigit(f[i + 1])) { // one letter atom...
s = f[i];
atoms.push_back(s);
number_of_atoms++;
s.clear();

int ks = i + 1;
int ke = ks;

while(isdigit(f[ke])) { // ... and its atomic coefficient
ke++;
}

coefficients.push_back(atoi(f.substr(ks, ke -
ks).c_str()) * mc_factor);

i += ke - ks;
}
if(isupper(f[i]) and (f[i + 1] == '*')) { // if we reach the crystallization molecules
s = f[i];
atoms.push_back(s);
s.clear();
number_of_atoms++;
coefficients.push_back(1 * mc_factor);
}
if(isupper(f[i]) and (f[i + 1] == '\0')) { // if we have the last atom of the formula
s = f[i];
atoms.push_back(s);
s.clear();
number_of_atoms++;
coefficients.push_back(1 * mc_factor);
}

if(isupper(f[i]) and ((f[i + 1] == '(') || (f[i + 1] ==
'['))) {
s = f[i];
atoms.push_back(s);
s.clear();
number_of_atoms++;
coefficients.push_back(1 * mc_factor);
}
if(isupper(f[i]) and (f[i + 1] == ')' || f[i + 1] == ']') ) {
s = f[i];
atoms.push_back(s);
s.clear();
number_of_atoms++;
coefficients.push_back(1 * mc_factor);
continue;
}
if(ispunct(f[i]) and f[i] != '(' and f[i] != ')' and f[i]
!= '[' and f[i] != ']' and f[i] != '*') {
error = true;
break;
}

if(f[i] == '(') { // handle round parentheses
int k = i + 1;
int cpindex = i + 2;
int index = i + 2;

while(f[cpindex] != ')') {
cpindex++;
}

index = cpindex + 1;

while(isdigit(f[index])) {
index++;
}

s = f.substr(k, cpindex - k);

if(index > cpindex + 1) {
mc_factor *= atoi(f.substr(cpindex + 1, index -
cpindex).c_str());
}
else
{
mc_factor = 1;
}

i += index -k;
parse_formula(s, mc_factor);
mc_factor = 1;
}

if(f[i] == '[') { // handle square parentheses
int k = i + 1;
int csqpindex = i + 1;
int sqindex = i + 2;

while(f[csqpindex] != ']') {
csqpindex++;
}

sqindex = csqpindex + 1;

while(isdigit(f[sqindex])) {
sqindex++;
}

s = f.substr(k, csqpindex - k);

if(sqindex > csqpindex + 1) {
mc_factor = atoi(f.substr(csqpindex + 1, sqindex -
csqpindex).c_str());
}
else
{
mc_factor = 1;
}

i += sqindex - k;
parse_formula(s, mc_factor);
mc_factor = 1;
}

if(f[i] == '*') // crystallization molecules
{
int ks = i+1;

if(isupper(f[ks]))
{
mc_factor = 1;
s = f.substr(ks, strlen(f.c_str()) - ks);
parse_formula(s, mc_factor);

break;
}

int kc, kf, kp;
kc = kf = kp = i+1;

if(isdigit(f[kc]))
{
while(isdigit(f[kc]))
{
kc++;
kf++;
}

kf = kp = kc;

if(f[kp] == '(' or f[kp] == '[' )
{
kf = kp+1;

s = f.substr(kf, strlen(f.c_str()) - (kf+1));
mc_factor = atoi(f.substr(ks, kc - ks).c_str());

parse_formula(s, mc_factor);

break;
}
else
{
kf = kc;

s = f.substr(kf, - ks);
mc_factor = atoi(f.substr(ks, kc - ks).c_str());

parse_formula(s, mc_factor);

break;
}
}
}
}
}

James Kuyper

Jul 1, 2019, 9:45:25 AM

strtok_s() is optional, but is fully specified in the C standard, with
that definition normatively cross-referenced by the C++ standard. I
don't see how you could label it non-standard.

alexo

Jul 2, 2019, 1:59:19 PM

On 02/07/19 19:06, Stefan Ram wrote:

> alexo <ale...@inwind.it> writes:
>> But to be perfect it should handle even the molecules of crystallization
>> that are written in this way (but I didn't talk about them in my post:
>> CuSO4*5H2O
>

> I do not know how crystallization should be parsed,
> therefore, I ignored crystallization.

Take for example CuSO4*5H2O.

The number 5 following the '*' character is the number of molecules
of what follows it, in this case H2O, so for that part the parser
should read

10 Hydrogen atoms
5 Oxygen atoms

> In the meantime, I have developed another idea:
>
> We parse to build a syntax tree in memory and then
> evaluate the syntax tree somehow.
>
> But a formula like "[Fe(CO)6][Sb2F11]2" already /is/ a
> syntax tree from a certain point of view. So, I wonder
> whether a parser is needed at all! Instead, one can just
> decorate the tree with some functions for traversal and
> then start to do the evaluation directly on this tree
> which is given as (for example) "[Fe(CO)6][Sb2F11]2".
>
> How to do this evaluation depends on the kind of evaluation
> wanted here. I don't know what kind of evaluation is wanted
> here.

I'm sorry, but I don't know what a syntax tree is. I know what a tree is
in computer science, but I have never implemented one myself.

I've published the code of my parser in the hope that someone could
suggest improvements, but I think the group takes me for a lazy
student :( and that's not so. I'm not a student anymore. I got my
degree (laurea) in chemistry 19 years ago.

I also have to say that I didn't understand the code you wrote some
days ago. It's full of things I don't know yet. But, as I've already
said, it works.

Anyway a big thank you for your interest.

alessandro

alexo

Jul 2, 2019, 3:52:20 PM

On 02/07/19 20:22, Stefan Ram wrote:

> [.. code omitted ..]

That is amazing! Programming is amazing!

--- Warning --- personal history follows.
If you are not interested in my life, skip the rest.
You've been warned ;)

When I was a kid my grandpa bought me a C64. I live in Italy and at that
time I was about 7 or 8 years old. I remember that, besides the games
that I never finished, I tried to get some BASIC programs working,
but with little or no success. I was always making transcription errors.
Moreover the instructions were in English - I couldn't speak or read it
yet - and those listings were not 'Hello world' programs, but several
pages long, so I never had the satisfaction of getting any of them to
work. So my experience with computers was initially frustrating. When
I chose my university faculty, I opted for chemistry, and it was only
in the 3rd year (in 1997) that I had the chance to study a
programming language. It was Fortran77 - the teacher was not up to date
with the IT world - and that language, even if old, opened up a world
for me. But it was too late. I was halfway through my studies and it was
late to change. What remained from those days was the passion. My first
'hobbyist' programming attempts date back to the very early 2000s,
with Java.

My steps were:

learning a bit of Java -> learning a bit of C -> learning C++ (still in
progress)

Mr Stroustrup writes that it's not necessary to learn C before learning
C++, but to me it made no sense to learn OO programming (which anyway
uses structured statements) without knowing anything about structured
programming. And Java was designed to avoid pointers, which are essential
in C and C++. So, since these languages share the same syntax, I wanted
to know more about C.
Now Java is set aside, but I still feel at a beginner level with both C
and C++.

I know a bit about linked lists and I've read about two sorting
algorithms, merge sort and quicksort.
I've actually written two GUI programs in FLTK and a few more things.

Even though I've been reading about programming for about 20 years, I
have to admit that I'm not consistent in writing code, nor in the level
I've reached.

My parser was a brute-force approach. It reads the formula left to right
like a human does, and recursively calls itself to analyze the blocks
inside parentheses. But something is not working as I expected. Maybe
I'll be able to fix it, or maybe not. It doesn't matter. It's a hobby.

Thanks to you all in the group.