extracting substrings from a string

porp...@gmail.com

Feb 28, 2018, 2:25:07 PM

Hi,
I would like to parse the following string to extract the following parts:
std::string username;
std::string host;
int port;
std::string db;

protocol://[username:password@]host[:port]/db

[...] designates the optional parts

My algorithm uses std::string::find_first_of heavily, and I don't like it: it doesn't look clean.
I wonder whether there is an efficient way of doing this using only standard C++ (C++11 or later is fine) or Boost (standard C++ preferred).
Ideally it would make only a single pass through the string.

Could you please give me some hints or provide some kind of code snippet?

Thanks in advance.

Paavo Helde

Feb 28, 2018, 2:29:19 PM

On 28.02.2018 21:24, porp...@gmail.com wrote:
> Hi,
> I would like to parse the following string to extract the following parts:
> std::string username;
> std::string host;
> int port;
> std::string db;
>
> protocol://[username:password@]host[:port]/db
>
> [...] designates the optional parts
>
> My algorithm uses std::string::find_first_of heavily. In fact I don't like it. It doesn't look clean.
> I wonder whether there is an efficient way of doing this using only a standard C++ (11+ allowed) or boost (C++ standard preferred)
> Ideally would be to have only a single pass through the string.

Take a look at <regex>: http://en.cppreference.com/w/cpp/regex

Richard

Feb 28, 2018, 4:38:08 PM

[Please do not mail me a copy of your followup]

porp...@gmail.com spake the secret code
<6aadabbe-7e8f-494e...@googlegroups.com> thusly:

>I would like to parse the following string to extract the following parts:
                 ^^^^^

This is the key to your whole question. When you need to parse
something, use a parser :).

Options are:
0) Use an existing library
1) find a regex pattern to match your input
2) write a parser by hand
3) use a parser library
4) use a parser generator

My opinions on pros/cons of the above options:

0) Recognize that you're parsing a URL. Lots of people have already
solved this problem, so look for a library that already does this and
use it. Google "c++ url parser".

1) For simple things, a regex is fine, but I would put this problem as
too complicated. The regex is going to look really ugly considering
all the optional parts.

2) Writing a parser by hand for this shouldn't be too hard. In fact,
one shows up on Stack Overflow in the above Google search.

3) You can use Boost.Spirit to write parsers in a pretty succinct
manner. They are fast. However, other people unfamiliar with
Boost.Spirit may find your code hard to understand. If you're
comfortable with Spirit, this is an easy way to get a high quality
parser. If you're uncomfortable with Spirit (or your team is
uncomfortable), then a hand written parser may be better.

4) You could use YACC to generate a parser for this structure.
However, IMO it would be overkill. Debugging through YACC parsers is
pretty disgusting IMO.
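
As an illustration of option 2, here is a minimal sketch of a hand-written parser for the format in the original question. It is not taken from any post in this thread: the struct DbUrl and the functions parse_db_url and check_parse_db_url are hypothetical names. It assumes the user name and password contain no '/' or '@', that the password may be omitted, and that the host is not a bracketed IPv6 address.

```cpp
#include <cassert>
#include <string>

// Hypothetical result type; the field names follow the original question.
struct DbUrl {
    std::string protocol;
    std::string username;  // empty when absent
    std::string password;  // empty when absent
    std::string host;
    int port = -1;         // -1 means "no port given"
    std::string db;
};

// Parses protocol://[username[:password]@]host[:port]/db.
// Returns false on malformed input and leaves `out` unchanged.
bool parse_db_url(const std::string& s, DbUrl& out) {
    DbUrl r;
    const std::string::size_type scheme_end = s.find("://");
    if (scheme_end == std::string::npos || scheme_end == 0) return false;
    r.protocol = s.substr(0, scheme_end);

    // Everything between "://" and the next '/' is [userinfo@]host[:port].
    const std::string::size_type auth_begin = scheme_end + 3;
    const std::string::size_type slash = s.find('/', auth_begin);
    if (slash == std::string::npos || slash + 1 == s.size()) return false;
    std::string hostport = s.substr(auth_begin, slash - auth_begin);
    r.db = s.substr(slash + 1);

    // Optional "username[:password]@" prefix.
    const std::string::size_type at = hostport.find('@');
    if (at != std::string::npos) {
        const std::string userinfo = hostport.substr(0, at);
        hostport.erase(0, at + 1);
        const std::string::size_type colon = userinfo.find(':');
        r.username = userinfo.substr(0, colon);
        if (colon != std::string::npos) r.password = userinfo.substr(colon + 1);
        if (r.username.empty()) return false;
    }

    // Host, then an optional ":port" made of 1 to 5 decimal digits.
    const std::string::size_type colon = hostport.find(':');
    r.host = hostport.substr(0, colon);
    if (r.host.empty()) return false;
    if (colon != std::string::npos) {
        const std::string digits = hostport.substr(colon + 1);
        if (digits.empty() || digits.size() > 5) return false;
        int port = 0;
        for (char c : digits) {
            if (c < '0' || c > '9') return false;
            port = port * 10 + (c - '0');
        }
        if (port > 65535) return false;
        r.port = port;
    }
    out = r;
    return true;
}

// A few positive and negative checks, callable from main() or a test runner.
void check_parse_db_url() {
    DbUrl u;
    assert(parse_db_url("mysql://bob:secret@db.example.com:3306/sales", u));
    assert(u.username == "bob" && u.host == "db.example.com" && u.port == 3306);
    assert(parse_db_url("mysql://db.example.com/sales", u) && u.port == -1);
    assert(!parse_db_url("mysql://db.example.com:http/sales", u));
    assert(!parse_db_url("db.example.com/sales", u));
}
```

Each std::string::find call only scans forward from where the previous one stopped or over a short substring, so the work stays roughly linear in the length of the input, although it is not strictly a single pass.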
--
"The Direct3D Graphics Pipeline" free book <http://tinyurl.com/d3d-pipeline>
The Terminals Wiki <http://terminals-wiki.org>
The Computer Graphics Museum <http://computergraphicsmuseum.org>
Legalize Adulthood! (my blog) <http://legalizeadulthood.wordpress.com>

Jorgen Grahn

Feb 28, 2018, 5:06:05 PM

On Wed, 2018-02-28, porp...@gmail.com wrote:
> Hi,
> I would like to parse the following string to extract the following parts:
> std::string username;
> std::string host;
> int port;
> std::string db;
>
> protocol://[username:password@]host[:port]/db
>
> [...] designates the optional parts

That looks suspiciously like a URL, or URI or whatever it's called.
Can't you rely on the formal definition in some RFC instead of
describing it vaguely yourself? You can add extra limitations if you
want to (e.g. you seem to mandate a cleartext password with the user
name).

> My algorithm uses std::string::find_first_of heavily.

I personally don't like the std::string methods, the string::npos and
all that business. Consider using <algorithm> on begin(s) .. end(s),
which IMO feels more idiomatic.
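
A small sketch of what the <algorithm> style might look like, assuming the input has already been reduced to host[:port]/db. The names HostPortDb, split_host_port_db and check_split_host_port_db are made up for illustration; only std::find and the iterator-range std::string::assign are standard.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical result type for this sketch.
struct HostPortDb {
    std::string host;
    std::string port;  // left empty when no ":port" is present
    std::string db;    // left empty when no '/' is present
};

// Splits "host[:port]/db" using iterator ranges instead of indices and npos.
HostPortDb split_host_port_db(const std::string& s) {
    const auto first = s.cbegin();
    const auto last = s.cend();
    const auto slash = std::find(first, last, '/');
    // Only look for the colon in front of the slash.
    const auto colon = std::find(first, slash, ':');

    HostPortDb r;
    r.host.assign(first, colon);
    if (colon != slash) r.port.assign(colon + 1, slash);
    if (slash != last) r.db.assign(slash + 1, last);
    return r;
}

// Usage example, callable from main() or a test runner.
void check_split_host_port_db() {
    const HostPortDb a = split_host_port_db("testhost.dom:1234/rest");
    assert(a.host == "testhost.dom" && a.port == "1234" && a.db == "rest");
    const HostPortDb b = split_host_port_db("testhost.dom/rest");
    assert(b.host == "testhost.dom" && b.port.empty() && b.db == "rest");
}
```

A "not found" result here is simply the end of the searched range, so the same comparison (colon != slash, slash != last) covers both the found and the not-found case.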

> In fact I don't like it. It doesn't look clean.

A parser doesn't have to look clean. Dump it in a well-documented
function, and write unit tests.

> I wonder whether there is an efficient way of doing this using only
> a standard C++ (11+ allowed) or boost (C++ standard preferred)
> Ideally would be to have only a single pass through the string.
>
> Could you please give me some hints or provide some kind of code snippet.

I second the recommendation of std::regex, or splitting it up a bit by
other means and then using std::regex on some of the parts.

But there are some issues you need to clarify for yourself (e.g. by
using an existing formal definition of the syntax; see above):

- Can the username:password part contain :, @ or /? That would mean
you cannot start by splitting on the third / in the string.

- Can the host contain a :, and how would that work with host:port?
Host names don't contain colons, but IPv6 addresses do. In a URL,
you'd write it like this: [::1]:80.
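
To make the second point concrete, here is a rough sketch of splitting the host[:port] part while allowing the bracketed IPv6 form. The function names split_host_port and check_split_host_port are hypothetical, and the sketch validates neither the address nor the port text; it only decides where the split goes.

```cpp
#include <cassert>
#include <string>

// Splits "host", "host:port", "[v6addr]" or "[v6addr]:port".
// Returns false for an unterminated bracket, junk after ']', an empty
// host, or a ':' that is not followed by any port text.
bool split_host_port(const std::string& s, std::string& host, std::string& port) {
    host.clear();
    port.clear();
    if (!s.empty() && s[0] == '[') {
        // Bracketed form: colons inside the brackets belong to the address.
        const std::string::size_type close = s.find(']');
        if (close == std::string::npos) return false;
        host = s.substr(1, close - 1);
        if (close + 1 == s.size()) return !host.empty();
        if (s[close + 1] != ':') return false;
        port = s.substr(close + 2);
        return !host.empty() && !port.empty();
    }
    // Plain form: the first colon separates host and port.
    const std::string::size_type colon = s.find(':');
    host = s.substr(0, colon);
    if (colon != std::string::npos) {
        port = s.substr(colon + 1);
        if (port.empty()) return false;
    }
    return !host.empty();
}

// Usage example, callable from main() or a test runner.
void check_split_host_port() {
    std::string host, port;
    assert(split_host_port("[::1]:80", host, port) && host == "::1" && port == "80");
    assert(split_host_port("example.com:5432", host, port) && port == "5432");
    assert(split_host_port("example.com", host, port) && port.empty());
    assert(!split_host_port("[::1", host, port));
}
```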

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Christiano

Feb 28, 2018, 8:50:04 PM

I would use regex as well.

Ralf Goertz

Mar 1, 2018, 4:05:46 AM

On Wed, 28 Feb 2018 21:37:57 +0000 (UTC),
legaliz...@mail.xmission.com (Richard) wrote:

> [Please do not mail me a copy of your followup]
>
> porp...@gmail.com spake the secret code
> <6aadabbe-7e8f-494e...@googlegroups.com> thusly:
>
> >I would like to parse the following string to extract the following
> >parts:
> ^^^^^
>
> This is the key to your whole question. When you need to parse
> something, use a parser :).
>
> Options are:
> 0) Use an existing library
> 1) find a regex pattern to match your input
> 2) write a parser by hand
> 3) use a parser library
> 4) use a parser generator
>
> My opinions on pros/cons of the above options:
>

> 1) For simple things, a regex is fine, but I would put this problem as
> too complicated. The regex is going to look really ugly considering
> all the optional parts.

>> protocol://[username:password@]host[:port]/db

Hm, ugliness is in the eye of the beholder [quick & dirty]:

#include <iostream>
#include <regex>
#include <string>

using namespace std;

int main() {
    string test[]=
        {"http://testhost.dom/rest",
         "http://us...@testhost.dom/rest",
         "http://us...@testhost.dom:1234/rest",
         "http://user:pas...@testhost.dom:4353/rest",
         "http://user:pas...@testhost.dom/rest"};
    // Capture groups: 1 protocol, 4 user, 6 password, 7 host, 9 port, 10 path.
    // Groups 2, 3, 5 and 8 only wrap the optional parts and their separators.
    string url_string("([^:]+)://((([^:@]+)(:([^@]+))?@))?([^:/]+)(:([0-9]+))?/(.+)");
    regex url(url_string);
    cout<<"The regex "<<url_string<<endl<<endl;
    for (auto &s:test) {
        smatch sm;
        auto matches=regex_match(s,sm,url);
        if (!matches) {
            cout<<"doesn't match "<<s<<endl;
        } else {
            // Submatches of optional groups that did not take part are empty.
            cout<<"matches "<<s<<endl;
            cout<<"Protokoll: "<<sm[1]<<endl;
            cout<<"User: "<<sm[4]<<endl;
            cout<<"Password: "<<sm[6]<<endl;
            cout<<"Host: "<<sm[7]<<endl;
            cout<<"Port: "<<sm[9]<<endl;
            cout<<"File: "<<sm[10]<<endl<<endl;
        }
    }
}

Of course, care must be taken if user and/or password contain one of the
special characters.
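
Since the original question asked for an int port, the following sketch shows one way the optional port submatch could be converted. It reuses the regex from the program above unchanged. The function names port_of and check_port_of and the -1 "no port" convention are assumptions made for this example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns the port of `url`, or -1 if the URL does not match, carries no
// port, or the port is outside 0..65535.
int port_of(const std::string& url) {
    static const std::regex re(
        "([^:]+)://((([^:@]+)(:([^@]+))?@))?([^:/]+)(:([0-9]+))?/(.+)");
    std::smatch sm;
    if (!std::regex_match(url, sm, re)) return -1;
    // Group 9 is optional; `matched` tells whether it took part in the match.
    if (!sm[9].matched) return -1;
    const std::string digits = sm[9].str();
    if (digits.size() > 5) return -1;  // keeps std::stoi well inside int range
    const int port = std::stoi(digits);
    return port <= 65535 ? port : -1;
}

// Usage example, callable from main() or a test runner.
void check_port_of() {
    assert(port_of("http://user:secret@testhost.dom:4353/rest") == 4353);
    assert(port_of("http://testhost.dom:1234/rest") == 1234);
    assert(port_of("http://testhost.dom/rest") == -1);
}
```

Checking sm[9].matched before converting avoids calling std::stoi on an empty string, which would throw std::invalid_argument.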

James R. Kuyper

Mar 1, 2018, 8:33:07 AM

On 03/01/2018 04:05 AM, Ralf Goertz wrote:
...

> cout<<"The regex "<<url_string<<endl<<endl;

...

> cout<<"doesn't match "<<s<<endl;
> } else {
> cout<<"matches "<<s<<endl;
> cout<<"Protokoll: "<<sm[1]<<endl;

Protocol? The rest of your text uses English spellings.

Ralf Goertz

Mar 1, 2018, 9:46:15 AM

On Thu, 1 Mar 2018 08:32:36 -0500,
"James R. Kuyper" <james...@verizon.net> wrote:

> On 03/01/2018 04:05 AM, Ralf Goertz wrote:
> ...
> > cout<<"The regex "<<url_string<<endl<<endl;
> ...
> > cout<<"doesn't match "<<s<<endl;
> > } else {
> > cout<<"matches "<<s<<endl;
> > cout<<"Protokoll: "<<sm[1]<<endl;
>
> Protocol? The rest of your text uses English spellings.

Well, I am German and similar words can make me switch languages. Sorry
about that. Yes, Protocol, which -- used in the middle of a sentence
like I just did -- should be spelled with a small p. But we Germans (big
G?) like to capitali[sz]e not only „God“ but all nouns (#atheism). At
least we are humble enough to lowercase „I“. :-)

Richard

Mar 1, 2018, 1:03:10 PM

[Please do not mail me a copy of your followup]

Ralf Goertz <m...@myprovider.invalid> spake the secret code
<20180301100...@delli.fritz.box> thusly:

>Hm, ugliness is in the eye of the beholder [quick & dirty]:
>

> string
>url_string("([^:]+)://((([^:@]+)(:([^@]+))?@))?([^:/]+)(:([0-9]+))?/(.+)");

Even as a regular user of regexes I consider that truly barfworthy :)

Jorgen Grahn

Mar 1, 2018, 3:24:33 PM

I couldn't get an alternative to work in a few minutes (shame on me!),
but some criticism anyway:

- You assume the port has to be numerical, but it costs nothing to let
the user give standard service names. (This was one of the
weaknesses in the OP's specification.)

- You could probably simplify the expression by using the non-greedy
modifier here and there e.g. .+?: instead of [^:]+: to say "some
text before a colon".

- More weaknesses from the specification: I suspect a valid protocol
name is matched by \w+ ("one or more word characters") but you
accept for example a single space.

- The OP should add plenty of negative tests too.
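
A short sketch of the two regex points above, written as checks that include negative tests. All function names are hypothetical. It assumes std::regex with its default ECMAScript grammar, which supports both the non-greedy +? quantifier and \w.

```cpp
#include <cassert>
#include <regex>
#include <string>

// "Some text before a colon", written with a negated character class.
std::string before_colon_class(const std::string& s) {
    static const std::regex re("([^:]+):");
    std::smatch m;
    return std::regex_search(s, m, re) ? m[1].str() : std::string();
}

// The same idea written with the non-greedy quantifier.
std::string before_colon_lazy(const std::string& s) {
    static const std::regex re("(.+?):");
    std::smatch m;
    return std::regex_search(s, m, re) ? m[1].str() : std::string();
}

// Protocol name as "one or more word characters".
bool protocol_ok_word(const std::string& s) {
    static const std::regex re(R"(\w+)");
    return std::regex_match(s, re);
}

// Protocol name as in the posted regex: anything but a colon.
bool protocol_ok_class(const std::string& s) {
    static const std::regex re("[^:]+");
    return std::regex_match(s, re);
}

// Positive and negative checks, callable from main() or a test runner.
void check_regex_variants() {
    // On ordinary input both spellings capture the same text.
    assert(before_colon_class("user:pa:ss") == "user");
    assert(before_colon_lazy("user:pa:ss") == "user");
    // They differ when the input starts with a colon, because '.' also
    // matches ':' and the lazy group must consume at least one character.
    assert(before_colon_class(":x:") == "x");
    assert(before_colon_lazy(":x:") == ":x");
    // Negative tests for the protocol name.
    assert(!protocol_ok_word(" "));
    assert(protocol_ok_class(" "));
    assert(!protocol_ok_word(""));
}
```

The non-greedy form reads more easily, but the colon case shows that it is not a drop-in replacement for the character class, so any rewritten expression needs its own set of negative tests.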

...

> Of course, care must be taken if user and/or password contain one of the
> special characters.

Ralf Goertz

Mar 2, 2018, 3:49:04 AM

On 1 Mar 2018 20:24:14 GMT,
Jorgen Grahn <grahn...@snipabacken.se> wrote:

> On Thu, 2018-03-01, Ralf Goertz wrote:
> > On Wed, 28 Feb 2018 21:37:57 +0000 (UTC),
> > legaliz...@mail.xmission.com (Richard) wrote:
> >

> >> porp...@gmail.com spake the secret code
> >> <6aadabbe-7e8f-494e...@googlegroups.com> thusly:
> >>

> >>> protocol://[username:password@]host[:port]/db
> >
> > Hm, ugliness is in the eye of the beholder [quick & dirty]:
> >
> > #include <iostream>
> > #include <regex>
> > #include <string>
> >
> > using namespace std;
> >
> > int main() {
> > string test[]=
> > {"http://testhost.dom/rest",
> > "http://us...@testhost.dom/rest",
> > "http://us...@testhost.dom:1234/rest",
> > "http://user:pas...@testhost.dom:4353/rest",
> > "http://user:pas...@testhost.dom/rest"};
> > string
> > url_string("([^:]+)://((([^:@]+)(:([^@]+))?@))?([^:/]+)(:([0-9]+))?/(.+)");
>
> I couldn't get an alternative to work in a few minutes (shame on me!),
> but some criticism anyway:
>
> - You assume the port has to be numerical, but it costs nothing to let
> the user give standard service names. (This was one of the
> weaknesses in the OP's specification.)

Hm, https://en.wikipedia.org/wiki/URL states explicitly: "An
optional port *number*, separated from the hostname by a colon"
[emphasis mine].

> - You could probably simplify the expression by using the non-greedy
> modifier here and there e.g. .+?: instead of [^:]+: to say "some
> text before a colon".

The reason I don't use non-greedy regexes is that I extensively program
using „flex“ and that does not support non-greediness.

> - More weaknesses from the specification: I suspect a valid protocol
> name is matched by \w+ ("one or more word characters") but you
> accept for example a single space.

That's right, of course. But the goal was to show how to extract the
various fields of a well-formed URL.

> - The OP should add plenty of negative tests too.

ACK

Jorgen Grahn

Mar 2, 2018, 4:43:57 PM

Oh. Makes sense when you think about it: if you have to specify
anything, it's because you want to use a non-standard number. I
didn't really look at the RFC, because I was uncertain whether the OP
really wanted to parse a URL.

But I still want to point out that in general, users expect to be able
to enter service names (just as they expect to be able to enter domain
names rather than IP addresses).

>> - You could probably simplify the expression by using the non-greedy
>> modifier here and there e.g. .+?: instead of [^:]+: to say "some
>> text before a colon".
>
> The reason I don't use non-greedy regexes is that I extensively program
> using „flex“ and that does not support non-greediness.

I recommend getting familiar with them: in Perl and Python regular
expressions I tend to use them a lot. Those REs tend to be easier to
read, IMO.

But I'm happy to hear someone still uses flex.

>> - More weaknesses from the specification: I suspect a valid protocol
>> name is matched by \w+ ("one or more word characters") but you
>> accept for example a single space.
>
> That's right of course. But the goal was to show how to extract the
> various fields of a well-formed url.
>
>> - The OP should add plenty of negative tests too.
>
> ACK

Richard

Mar 2, 2018, 5:04:49 PM

[Please do not mail me a copy of your followup]

Jorgen Grahn <grahn...@snipabacken.se> spake the secret code
<slrnp9jhce.e...@frailea.sa.invalid> thusly:

>But I still want to point out that in general, users expect to be able
>to enter service names (just as they expect to be able to enter domain
>names rather than IP addresses).

In the case of URLs I wouldn't start accepting service names instead
of port numbers because it confuses the issue of what is a URL.

Jorgen Grahn

Mar 3, 2018, 3:01:55 AM

On Fri, 2018-03-02, Richard wrote:
> [Please do not mail me a copy of your followup]
>
> Jorgen Grahn <grahn...@snipabacken.se> spake the secret code
> <slrnp9jhce.e...@frailea.sa.invalid> thusly:
>
>>But I still want to point out that in general, users expect to be able
>>to enter service names (just as they expect to be able to enter domain
>>names rather than IP addresses).
>
> In the case of URLs I wouldn't start accepting service names instead
> of port numbers because it confuses the issue of what is a URL.

Me neither, for the reason you said, and also (see upthread) because
it wouldn't make sense.

Ralf Goertz

Mar 3, 2018, 3:40:52 AM

On 2 Mar 2018 21:43:42 GMT,
Jorgen Grahn <grahn...@snipabacken.se> wrote:

> But I'm happy to hear someone still uses flex.

I think it is the very best tool available for my task, which is
extracting information from lots and lots of similarly structured tables
in text files. It would be an even better tool if yytext were a vector
like smatch, so that I would have access to the submatches.