Problem with finding multiple matches using RegEx


Sanje²v

Jun 17, 2010, 1:54:55 AM
to wx-dev
Hi all,
I am having a problem using regular expressions in my wxWidgets
application. The application needs to find all e-mail addresses
within a string.

Here's the code:
...
wxString text = _T("This is a sample text and my e-mail address is someth...@hotmail.com, what...@hotmail.com, Exa...@yahoo.com. Hello! There wh@t's your's? I suppose anon...@mail.com?");

wxRegEx reEmail(_T("[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"), wxRE_ADVANCED | wxRE_ICASE);

reEmail.Matches(text);

wcout << reEmail.GetMatchCount() << _T('\n'); // Returns just 1

wcout << reEmail.GetMatch(text, 0).wchar_str().data(); // Why is 'text' required to be passed again here?
wcout << reEmail.GetMatch(text, 1).wchar_str().data(); // Fails
...

How do I extract multiple matches? Do I have to use Mid()? If so,
wxWidgets must really consider redesigning this confusing class.

Thanks,
Sanje2v

Vadim Zeitlin

Jun 17, 2010, 5:24:58 AM
to wx-...@googlegroups.com
On Wed, 16 Jun 2010 22:54:55 -0700 (PDT) Sanje²v <swt...@gmail.com> wrote:

S> wxRegEx reEmail(_T("[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"), wxRE_ADVANCED | wxRE_ICASE);
S>
S> reEmail.Matches(text);
S>
S> wcout << reEmail.GetMatchCount() << _T('\n'); // Returns just 1

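Well, yes: there are no capture groups in your regex, so it returns 1 for
the match of the whole expression. GetMatchCount() returns the number of
subexpressions plus one for the expression itself, not the number of times
the regex matches in the text. Please read the GetMatchCount() documentation.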
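
For readers who want to see the counting convention in isolation, here is a minimal sketch. It uses the C++11 standard library's std::regex as a stand-in, which is an assumption of this illustration and not wxRegEx itself; std::smatch::size() follows the same convention as GetMatchCount(), counting the whole match plus one entry per capture group. SubMatchCount is a hypothetical helper name.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: number of sub-expressions reported for the FIRST
// match of `pattern` in `text`, i.e. the whole match plus one per capture
// group. Returns 0 when the pattern does not match at all.
// Uses std::regex as a stand-in for wxRegEx (assumption of this sketch).
std::size_t SubMatchCount(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.size() : 0;
}
```

With this helper, a pattern without parentheses reports 1 and a pattern such as ([a-z]+)@([a-z]+) reports 3, regardless of how many addresses the text contains.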

S> wcout << reEmail.GetMatch(text, 0).wchar_str().data(); // Why is 'text' required to be passed again here?

Because wxRegEx doesn't copy the (potentially big) string unnecessarily.

S> How do I extract multiple matches? Do I have to use Mid()? If so,
S> wxWidgets must really consider redesigning this confusing class.

I think you need to understand that a regex simply matches or doesn't
match. If you want to apply it again, you need to do it yourself. I.e. in
this case you can get the end of the first match (from GetMatch()) and call
Matches() again starting at this offset. And again and again until it
doesn't match any more.

Regards,
VZ
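
The "match, then search again from the end of the match" approach described above can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard library's std::regex rather than wxRegEx so that it is self-contained, and FindAll is a hypothetical helper name. The guard for empty matches is an addition of this sketch; it is not needed for the e-mail pattern in this thread but keeps the loop from spinning on patterns that can match nothing.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect every non-overlapping match of `re` in `text`
// by repeatedly searching from the end of the previous match.
// Uses std::regex as a stand-in for wxRegEx (assumption of this sketch).
std::vector<std::string> FindAll(const std::string& text, const std::regex& re)
{
    std::vector<std::string> found;
    std::smatch m;
    auto pos = text.cbegin();
    const auto end = text.cend();

    while (std::regex_search(pos, end, m, re))
    {
        found.push_back(m.str());
        pos = m[0].second;      // one past the end of this match
        if (m.length() == 0)    // empty match: step forward or stop
        {
            if (pos == end)
                break;
            ++pos;
        }
    }
    return found;
}
```

Because the search restarts at an iterator into the original string, no substring copies are made between iterations.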

Sanje²v

Jun 17, 2010, 9:56:24 AM
to wx-dev
> If you want to apply it again, you need to do it yourself. I.e. in
> this case you can get the end of the first match (from GetMatch()) and call
> Matches() again starting at this offset. And again and again until it
> doesn't match any more.

Thanks VZ for your reply. I finally came up with this:

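size_t start = 0;     // offset of the match within the substring searched
size_t len = 0;       // length of the match
size_t prevstart = 0; // offset in 'text' where the next search begins

while (reEmail.Matches(text.Mid(prevstart)) && reEmail.GetMatch(&start, &len))
{
    wcout << text.Mid(prevstart + start, len).wchar_str().data();
    prevstart += start + len; // advance past the end of this match
}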

Though the code does the job, it is hard to grasp the logic. I
really think wxWidgets should redesign the wxRegEx class. It's filled with
misnamed functions such as 'GetMatchCount()' (should it be called
GetCapturedExpressionCount(), maybe?) and has a confusing function
interface. It's sad that for what the wxRegEx class is mostly used for, i.e.
extracting multiple data matches from input, we programmers have to make
do with cryptic code like the above. Is there any way to file an improvement
proposal to the wxWidgets heads?

-Sanje2v

Vadim Zeitlin

Jun 17, 2010, 6:09:08 PM
to wx-...@googlegroups.com
On Thu, 17 Jun 2010 06:56:24 -0700 (PDT) Sanje²v <swt...@gmail.com> wrote:

S> size_t start = 0;
S> size_t len = 0;
S> size_t prevstart = 0;
S>
S> while(reEmail.Matches(text.Mid(prevstart)) && reEmail.GetMatch(&start, &len))
S> {
S>     wcout << text.Mid(prevstart + start, len).wchar_str().data();
S>     prevstart += start + len;
S> }
S>
S> Though the code does the job, it is hard to get grasp of the logic.

Really? How much simpler can it be when this is exactly the C++
translation of the following pseudo-code:

    while match found:
        show matching string
        advance past its end

I can't help wondering how else this could be written.

S> I really think wxWidgets should redesign wxRegEx class.

FWIW I disagree.

S> It's filled with misnamed functions such as 'GetMatchCount()' (should be
S> called GetCapturedExpressionCount() may be?)

First, "one" != "filled with". Second, GetMatchCount() might be slightly
unclear but IMHO it takes a big effort to not understand what it does after
reading its documentation.

S> and has a confusing function interface.

What do you mean by this? It seems pretty logical to me.

S> It's sad that for what wxRegEx class is used mostly, i.e.
S> extracting multiple data matches from input,

I question the assumption that it's mostly used for this. For instance, I
have never used it for this so far.

S> we programmers have to do away with cryptic code as above.

Please propose a simpler version.

S> Is there anyway to file a improvement proposal to wxWidgets heads?

This should be done on our Trac (http://trac.wxwidgets.org/) but in this
particular case I don't think it should be done at all.

Regards,
VZ

pete_b

Jun 18, 2010, 5:44:54 AM
to wx-dev
You can always write a thin wrapper around wxRegEx to suit your needs,
or maybe write a wxRegExMatchIterator class.
The interface of wxRegEx is minimal but complete, so it is not likely
to be difficult.
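
As one point of reference for what such an iterator could look like, the C++11 standard library ships std::sregex_iterator, which walks over successive matches in a string. The sketch below is not wxWidgets code and wxRegExMatchIterator remains hypothetical; it only shows the iterator-style interface, with CollectMatches as an illustrative helper name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Illustrative helper: gather all matches using the standard library's
// match iterator. A wxRegEx-based iterator (hypothetical) could expose a
// similar begin/end style interface on top of Matches() and GetMatch().
std::vector<std::string> CollectMatches(const std::string& text,
                                        const std::regex& re)
{
    std::vector<std::string> found;
    // A default-constructed std::sregex_iterator acts as the end marker.
    for (std::sregex_iterator it(text.begin(), text.end(), re), last;
         it != last; ++it)
    {
        found.push_back(it->str());
    }
    return found;
}
```

The appeal of this design is that the bookkeeping of offsets lives inside the iterator, so calling code reduces to a plain loop over matches.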