Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Results of the wildcard-matchikng exercise

288 views
Skip to first unread message

Anton Shepelev

unread,
Aug 16, 2018, 2:56:50 PM8/16/18
to
Hello, ladies and laddies!

I hereby commence this new thread for the submission and
discussinon of entries to our little wildcard-matching
exercise announced in this post:

Subject: A small programming exercise: wildcard matching
Date: Sat, 11 Aug 2018 15:49:53 +0300
Message-ID: <20180811154953.ac18...@gmail.com>

Please, publish your entries as immediate replies to this
message in order to facilitate cohesion.

--
() ascii ribbon campaign -- against html e-mail
/\ http://preview.tinyurl.com/qcy6mjc [archived]

Anton Shepelev

unread,
Aug 16, 2018, 3:57:00 PM8/16/18
to
Stefan Ram:

> I did not wish to read literatur about the implementation
> of wildcards, and instead wanted to follow my own star.

You meant your own asterisk?

Anton Shepelev

unread,
Aug 16, 2018, 4:29:42 PM8/16/18
to
Since Stefan has started posting his solutions,
I will now violate the rules in order to set an example of
a submission. This is from a colleague of mine who has not
had the gut to install and configure a newsreader, yet is
participating as a special guest. I will forward your
comments back to him.

I have omitted his surname in the subject because I forgot
to ask his permission, and the name alone is harmless. I don't
think it is convenient to separate the description and
implementation of an algorithm, so please post them together,
for having every second-level post contain one solution is very
handy. If you want to write about your "background", as Stefan
has done, please do it in at the start of the article with your
first solution.

The only comment I have about the following is that it works
by splitting the pattern at asterskses and matching the
split sections sequentially against the target string.

My colleague is a C++ manical, a master of templates and magister
of all compile-time magick (none of which may show in C)

const char * findSignature(const char * a_pcString, int a_nStringLen, const char * a_pcSignature, int a_nSignatureLen)
{
const char * pcString = a_pcString;
const char * pcStringEnd = a_pcString + a_nStringLen;
const char * pcSignature = a_pcSignature;
const char * pcSignatureEnd = a_pcSignature + a_nSignatureLen;
const char * pcFirstMatch = 0;
while(pcString != pcStringEnd)
{
if(pcSignature == pcSignatureEnd)
{
return pcFirstMatch;
}

if((*pcSignature == '?') || (*pcSignature == *pcString))
{
if(pcFirstMatch == 0)
{
pcFirstMatch = pcString;
}
++pcSignature;
++pcString;
}
else
{
if(pcFirstMatch == 0)
{
++pcString;
}
else
{
pcSignature = a_pcSignature;
pcFirstMatch = 0;
}
}
}

if(pcSignature == pcSignatureEnd)
{
return pcFirstMatch;
}

return 0;
}

const char * nextToken(const char * a_pszString, const char a_cDelim, int * a_nTokenLen)
{
static char * pcString = 0;

if(a_pszString != 0)
{
pcString = (char *) a_pszString;
}

if(!pcString)
{
return 0;
}

char * pcBegin = pcString;
while(*pcString)
{
if(*pcString == a_cDelim)
{
break;
}
++pcString;
}

*a_nTokenLen = pcString - pcBegin;

if(!*pcString)
{
pcString = 0;
}
else
{
++pcString;
}

if(*a_nTokenLen > 0)
{
return pcBegin;
}
else
{
return "";
}
}

int wcmatch(const char * a_pszPattern, const char * a_pszString)
{
int nPatternLen = strlen(a_pszPattern);
int nStringLen = strlen(a_pszString);

//P("") for S("") = 1
if((nPatternLen == 0) && (nStringLen == 0))
{
return 1;
}

//P("*") for S("") = 1
if((nPatternLen == 1) && (*a_pszPattern == '*') && (nStringLen == 0))
{
return 1;
}

//Another empty variants
if((nPatternLen == 0) || (nStringLen == 0))
{
return 0;
}

//Check each token (splitted by star) match in string one by one

const char * pcString = a_pszString;
int nTokenLen = 0;
const char * pcToken = nextToken(a_pszPattern, '*', &nTokenLen);
int nLastTokenLen = -1;
const char * pcLastToken = 0;
int nTokenCount = 0;
while(pcToken)
{
++nTokenCount;

//Stars produces empty tokens => we can ignore them
if(nTokenLen > 0)
{
const char * pcMatch = findSignature(pcString, nStringLen, pcToken, nTokenLen);
if(!pcMatch)
{
return 0;
}

//First token without preceding star must be only at first position in string!
if(nLastTokenLen == -1)
{
if(pcMatch != pcString)
{
return 0;
}
}

nStringLen -= pcMatch - pcString + nTokenLen;
pcString = pcMatch + nTokenLen;
}

nLastTokenLen = nTokenLen;
pcLastToken = pcToken;

pcToken = nextToken(0, '*', &nTokenLen);
}

//Last token
if(nLastTokenLen > 0)
{
//If there is only one token in pattren, it must be the same length as string (P("n") for S("nn") = 0)
if((nTokenCount == 1) && (nStringLen > 0))
{
return 0;
}

//Last token with preceding star must be checked until string end (P("?*n") for S("nnc") = 0)
while(nStringLen > 0)
{
const char * pcMatch = findSignature(pcString, nStringLen, pcLastToken, nLastTokenLen);
if(!pcMatch)
{
return 0;
}

nStringLen -= pcMatch - pcString + nLastTokenLen;
pcString = pcMatch + nLastTokenLen;
}
}

return 1;
}

Anton Shepelev

unread,
Aug 16, 2018, 5:33:29 PM8/16/18
to
Stefan Ram:

> I believed that "Please, publish your entries as immediate
> replies to this message in order to facilitate cohesion."
> was supposed to supersede the "Friday at your location
> rule".

It was an unchary usage of "immediate". I meant direct
replies to that message rather than replies to replies.
Sorry for the confusion.

Anton Shepelev

unread,
Aug 16, 2018, 5:54:32 PM8/16/18
to
It being 00:53 here and I being drowsy, I will start from a
small recursive solution, which is not entirely my own, for
I had not thought about recursion until I saw it employed in
a buggy and inefficient function I found in the internet. I
therefore do not consider it my primary contribution and
assign it the number two.

I fixed it and removed one redundant recursive invocation by
way of tail-call optimisation described here:

https://en.wikipedia.org/wiki/Tail_call

I believe everybody's recursive solutions are small and
similar, so there is very little to say. In mine, the
recursive call is generous (skips the asterisk) and the
iterative continuation is greedy (eats a symbol from the
string). After I came up with a working function, I have
tried to make it as small as possible, whence the tricky use
++ and --. The result is quite different from, and I belive
better than, the original.

I have not tested this conjecture, but meseems this function
will perform the worst when there are many asterisks but few
symbols for them to match, because then the recursion will
have to traverse the largest tree until before reaching a
correct fit.

/* Check whether wildcard pattern <pattern> matches string <text> */
char wcmatch /* returns 1 upon match and 0 upon mismatch */
( char const * pattern, /* the pattern, e.g. "*.txt" */
char const * text /* the string to match, e.g. "readme.txt" */
)
{ while( 1 == 1 )
{ switch( *pattern )
{ default : if( *text != *pattern ) return 0; break;
case '?' : if( !*text ) return 0; break;
case '\0': return !*text;
case '*' :
if( wcmatch( pattern + 1, text ) ) return 1;
if( *text ) --pattern;
else return 0;
break;
}
++text;
++pattern;

Anton Shepelev

unread,
Aug 16, 2018, 5:56:32 PM8/16/18
to
[I accidentally posted this on behalf of Nikita, so disregard
the wrong post and reply to this one]

Anton Shepelev

unread,
Aug 16, 2018, 5:59:32 PM8/16/18
to
I wrote:

> It being 00:53 here and I being drowsy [...]

I should have waited till morning. Please, ignore this sub-
thread and refer instead to the one titled

Anton: wildcard matching exercise: solution 2

Anton Shepelev

unread,
Aug 16, 2018, 6:29:22 PM8/16/18
to
This is a solution from one Kirk J. Krauss:
https://preview.tinyurl.com/ydecbdjp
https://web.archive.org/web/20180101110925/http://www.drdobbs.com/architecture-and-design/matching-wildcards-an-empirical-way-to-t/240169123
found by Stefan Ram:

It is C++, but right on spot for our exercise:

// This function compares text strings, one of which can have wildcards
// ('*' or '?').
//
bool WildTextCompare(
char *pTameText, // A string without wildcards
char *pWildText // A (potentially) corresponding string with wildcards
)
{
// These two values are set when we observe a wildcard character. They
// represent the locations, in the two strings, from which we start once
// we've observed it.
//
char *pTameBookmark = (char *) 0;
char *pWildBookmark = (char *) 0;

// Walk the text strings one character at a time.
while (1)
{
// How do you match a unique text string?
if (*pWildText == '*')
{
// Easy: unique up on it!
while (*(++pWildText) == '*')
{
} // "xy" matches "x**y"

if (!*pWildText)
{
return true; // "x" matches "*"
}

if (*pWildText != '?')
{
// Fast-forward to next possible match.
while (*pTameText != *pWildText)
{
if (!(*(++pTameText)))
return false; // "x" doesn't match "*y*"
}
}

pWildBookmark = pWildText;
pTameBookmark = pTameText;
}
else if (*pTameText != *pWildText && *pWildText != '?')
{
// Got a non-match. If we've set our bookmarks, back up to one
// or both of them and retry.
//
if (pWildBookmark)
{
if (pWildText != pWildBookmark)
{
pWildText = pWildBookmark;

if (*pTameText != *pWildText)
{
// Don't go this far back again.
pTameText = ++pTameBookmark;
continue; // "xy" matches "*y"
}
else
{
pWildText++;
}
}

if (*pTameText)
{
pTameText++;
continue; // "mississippi" matches "*sip*"
}
}

return false; // "xy" doesn't match "x"
}

pTameText++;
pWildText++;

// How do you match a tame text string?
if (!*pTameText)
{
// The tame way: unique up on it!
while (*pWildText == '*')
{
pWildText++; // "x" matches "x*"
}

if (!*pWildText)
{
return true; // "x" matches "x"
}

return false; // "x" doesn't match "xy"
}
}
}

fir

unread,
Aug 16, 2018, 6:33:49 PM8/16/18
to
fir's typical code; co called 'no brainer' i guess - as someone previously did that thios obvious way, split wildcard parretn over the '*'s and check if all its parts (words) are just in text, one against another (only you must use strcomp version that returns chars equal is some are ?)

i use my own library functions, mainly one that split, one that is analog tos strcmp, one that is analog to strfind (i dont want to include it here as there are library ones and somewhat strightforward
(though to be hones when one would like to write split there is rishy to make faults, at least for me i dont know why ;c)


int check_match(chunk txt, chunk pattern)
{
chunks words = SplitOnCharacter(pattern, '*');
int words_len = ChunksLength(words);

if(pattern.beg[0]!='*')
if(!compare_left_WITHASKSIGNS(txt, words.first[0]))
return -1;

if(pattern.end[0]!='*')
if(!compare_right_WITHASKSIGNS(txt, words.last[0]))
return -2;


chunk input_txt = txt;
for(int n=0; n<words_len; n++)
{
chunk found = find_WITHASKSIGNS(input_txt, words.first[n]);
if(found.beg)
input_txt.beg = found.end + 1;
else
return -3;
}

return 1;
}




Ben Bacarisse

unread,
Aug 16, 2018, 9:00:54 PM8/16/18
to
My first version was this recursive implementation. It is probably the
same as several others, though I could not resist a small optimisation.
Since multiple *s are the same as one, I skip * sequences rather than
single * characters.

int wcmatch(const char *pat, const char *text)
{
switch (*pat) {
case '?':
return *text != 0 && wcmatch(pat + 1, text + 1);
case '*':
do if (wcmatch(pat + strspn(pat, "*"), text))
return true;
while (*text++);
return false;
default:
return *pat == *text && (*pat == 0 || wcmatch(pat + 1, text + 1));
}
}

In order to put less demand on the stack, this can be turned into a
looping version that uses recursive calls only for matching the *:

int wcmatch(const char *pat, const char *text)
{
for (; *pat; pat++, text++) {
switch (*pat) {
case '?':
if (*text == 0) return false;
break;
case '*':
do if (wcmatch(pat + strspn(pat, "*"), text))
return true;
while (*text++);
return false;
default:
if (*pat != *text) return false;
break;
}
}
return *text == 0;
}

I'll give another, more "industrial", solution in another post.

--
Ben.

Ben Bacarisse

unread,
Aug 16, 2018, 9:01:15 PM8/16/18
to
Here is a version that tries to be more respectful of C's resource
priorities. As such, it does not use recursion, and it uses malloc to
allocate some temporary data and not, as I originally, planned, a VLA.

The plan is a simple one that I am sure will be seen in other
submissions. The pattern is split at the *s (well, again, as in my
other versions, at sequences of *s) and the resulting "fixed width"
pieces are matched in sequence.

A couple of details... The match part that comes at the end always
includes the null byte. It is matched like any other character and give
a simply way to ensure that that part of the pattern matches only at the
end of the subject string. Because there is no automatic way to
similarly force a match at the start of a string, that piece has to be
handled differently.

A pattern with a * is always considered to have as least two fixed width
parts. Thus "*" is treated as a zero-length match at the start and a
1-byte match of a null byte.

In addition, space for an extra part is allocated to act as a sentinel
so as to avoid having to keep a count of the number of parts. The parts
are processed until one is found with a null pattern.

This version was supposed to have a "compile" phase where the pattern
could be prepared, so as to be applied to many subject strings. Since
this design could make use of more sophisticated string matching
algorithms like Boyer-Moore or KMP, it could well be worth having such a
separate phase. Needless to say, that never happened, but I think that
separating out finding the parts from matching them does add a little to
the clarity.

Finally, I acknowledge that I use a style the many here find too dense.
Running the code though 'indent' will only help a little as I find
myself drawn to writing things like

(++part)->pattern = pat += strspn(pat, "*");

All I can say in my defence is that I wrote the code for fun and I
learned C in the era of while (*dst++ = *src++);.


struct match_part {
const char *pattern;
ptrdiff_t length;
};

static bool match_char(char pattern, char text)
{
return pattern == text || pattern == '?' && text;
}

static bool matches_here(const struct match_part *part, const char *text)
{
for (ptrdiff_t i = 0; i < part->length; i++)
if (!match_char(part->pattern[i], text[i]))
return false;
return true;
}

static ptrdiff_t next_match(const struct match_part *part, const char *text)
{
const char *start = text;
do if (matches_here(part, start))
return start - text;
while (*start++);
return -1;
}

static size_t count_parts(const char *pat)
{
size_t n = 1;
while (*pat)
if (*pat++ == '*') {
n += 1;
pat += strspn(pat, "*");
}
return n;
}

static struct match_part *find_parts(const char *pat)
{
struct match_part *parts = malloc((count_parts(pat) + 1) * sizeof *parts);
if (parts) {
struct match_part *part = parts;
part->pattern = pat;
while (*pat)
if (*pat++ == '*') {
part->length = pat - part->pattern - 1;
(++part)->pattern = pat += strspn(pat, "*");
}
part->length = pat - part->pattern + 1;
part[1] = (struct match_part){0, 0};
}
return parts;
}

int wcmatch(const char *pat, const char *text)
{
struct match_part *parts = find_parts(pat);
ptrdiff_t d = -1;
if (parts && matches_here(parts, text)) {
const struct match_part *part = parts;
d = 0;
text += part->length;
while ((++part)->pattern && (d = next_match(part, text)) >= 0)
text += part->length + d;
}
free(parts);
return d >= 0;
}

--
Ben.

Scott

unread,
Aug 16, 2018, 9:37:13 PM8/16/18
to
On Thu, 16 Aug 2018 21:56:42 +0300, Anton Shepelev
<anto...@gmail.com> wrote:

>Hello, ladies and laddies!
>
>I hereby commence this new thread for the submission and
>discussinon of entries to our little wildcard-matching
>exercise announced in this post:
>
> Subject: A small programming exercise: wildcard matching
> Date: Sat, 11 Aug 2018 15:49:53 +0300
> Message-ID: <20180811154953.ac18...@gmail.com>
>
>Please, publish your entries as immediate replies to this
>message in order to facilitate cohesion.

Oh, in that case I won't wait.

int wcmatch(const char const *pat, const char const *text) {
int i;
while (*pat) {
if (*pat == '*') {
pat++;
if (!*pat) return 1; /* '*' at end of pat matches any */
/* Otherwise we bump and go again */
while (*text) {
i = wcmatch(pat, text);
if (i) return i;
text++;
}
} else if (*pat == '?') { /* Match any except nul */
if (!*text) return 0;
text++;
pat++;
} else if (*pat == *text) { /* Literal match */
text++;
pat++;
} else { /* No match */
return 0;
}
}
/* If target and pattern are both done, that's a match */
return !*text;
}

Joe Pfeiffer

unread,
Aug 16, 2018, 10:04:43 PM8/16/18
to
Ah, OK -- I'd posted my solution earlier, but not in this thread. Here
we go:

char wcmatch(char const * const pat, char const * const txt)
{
// empty strings match
if ((*pat == '\0') && (*txt == '\0'))
return 1;

// if first char of pattern is a *, try consuming it
// or consuming first char of txt unless txt is empty
if (*pat == '*')
return wcmatch(pat+1, txt) || ((*txt != '\0') && wcmatch(pat, txt+1));

// if only one string is empty, no match
if ((*pat == '\0') != (*txt == '\0'))
return 0;

// if first char of pattern matches first char of text,
// test rest of string
if ((*pat == '?') || (*pat == *txt))
return wcmatch(pat+1, txt+1);

// otherwise failure
return 0;
}

I expect this will look very similar to other recursive solutions.

Joe Pfeiffer

unread,
Aug 16, 2018, 11:19:20 PM8/16/18
to
Decided to do a little optimization; now it only drops into recursion
when processing asterisks

char wcmatch(char const * const pat, char const * const txt)
{
char const *p = pat;
char const *t = txt;

// run down matching prefix
while ((*p != '\0') &&
(*t != '\0') &&
(*p != '*') &&
((*p == '?') ||
(*p == *t))) {
p++;
t++;
}

// if first char is * collapse multiple *'s, then try consuming
// either the * or the first txt char unless txt is empty
if (*p == '*') {
while (*(p+1) == '*')
p++;
return wcmatch(p+1, t) || ((*t != '\0') && wcmatch(p, t+1));
}

// if there's anything left in either string we fail
return ((*p == '\0') && (*t == '\0'));
}

Richard Damon

unread,
Aug 17, 2018, 12:15:15 AM8/17/18
to
On 8/16/18 2:56 PM, Anton Shepelev wrote:
> Hello, ladies and laddies!
>
> I hereby commence this new thread for the submission and
> discussinon of entries to our little wildcard-matching
> exercise announced in this post:
>
> Subject: A small programming exercise: wildcard matching
> Date: Sat, 11 Aug 2018 15:49:53 +0300
> Message-ID: <20180811154953.ac18...@gmail.com>
>
> Please, publish your entries as immediate replies to this
> message in order to facilitate cohesion.
>

Posted on the other thread, so reposting here.

My minimal pattern match code combining cases a lot to be tight. Note
that I hope that the final recursive call will be tail optimized, so the
only actual recursion is to handle *
Without tail recursion elimination, recursion depth is length of txt +
number of * in pat

char wcmatch(char const* const pat, char const* const txt) {
if(pat[0] == '*') return wcmatch(pat+1, str) || (txt[0] &&
wcmatch(pat, str+1));
if(pat[0] == 0 || txt[0]) return pat[0] == txt[0];
return (pat[0] == '?' || pat[0] == txt[0]) && wcmatch(pat+1, txt+1);
}

Rosario19

unread,
Aug 17, 2018, 1:03:38 AM8/17/18
to
On Fri, 17 Aug 2018 00:54:24 +0300, Anton Shepelev wrote:

>/* Check whether wildcard pattern <pattern> matches string <text> */
>char wcmatch /* returns 1 upon match and 0 upon mismatch */
>( char const * pattern, /* the pattern, e.g. "*.txt" */
> char const * text /* the string to match, e.g. "readme.txt" */
>)
>{ while( 1 == 1 )
> { switch( *pattern )
> { default : if( *text != *pattern ) return 0; break;
> case '?' : if( !*text ) return 0; break;
> case '\0': return !*text;
> case '*' :
> if( wcmatch( pattern + 1, text ) ) return 1;
> if( *text ) --pattern;
> else return 0;
> break;
> }
> ++text;
> ++pattern;
> }
>}

i think if pattern is "" it goes out of the string (so seg fault
because goes out the array if pc is ok)
*pattern can be 0 in the call wcmatch(pattern+1, text) too...

Rosario19

unread,
Aug 17, 2018, 1:29:32 AM8/17/18
to
i see it wrong can not be because switch...

Ike Naar

unread,
Aug 17, 2018, 3:27:43 AM8/17/18
to
On 2018-08-16, Anton Shepelev <anto...@gmail.com> wrote:
> Hello, ladies and laddies!
>
> I hereby commence this new thread for the submission and
> discussinon of entries to our little wildcard-matching
> exercise announced in this post:
>
> Subject: A small programming exercise: wildcard matching
> Date: Sat, 11 Aug 2018 15:49:53 +0300
> Message-ID: <20180811154953.ac18...@gmail.com>
>
> Please, publish your entries as immediate replies to this
> message in order to facilitate cohesion.

The short version (as promised)

int match(char const*,char const*);
int star(char const*p,char const*t){return match(p,t)||(*t&&star(p,t+1));}
int match(char const*p,char const*t){return(*p==0&&*t==0)||(*p=='*'&&
star(p,t))||(*p=='?'&&*t&&match(p+1,t+1))||(*p==*t&&match(p+1,t+1));}

The slightly more readable version:

int match(char const *pat, char const *txt);

static int star(char const *pat, char const *txt)
{
return match(pat, txt) || (*txt && star(pat, txt + 1));
}

static int match(char const *pat, char const *txt)
{
return (*pat == 0 && *txt == 0)
|| (*pat == '*' && star(pat + 1, txt))
|| (*pat == '?' && *txt && match(pat + 1, txt + 1))
|| (*pat == *txt && match(pat + 1, txt + 1));
}

Anton Shepelev

unread,
Aug 17, 2018, 4:27:25 AM8/17/18
to
I wrote:

>In mine, the recursive call is generous (skips the
>asterisk) and the iterative continuation is greedy (eats a
>symbol from the string). After I came up with a working
>function, I have tried to make it as small as possible,
>whence the tricky use ++ and --. The result is quite
>different from, and I belive better than, the original.
>
>I have not tested this conjecture, but meseems this
>function will perform the worst when there are many
>asterisks but few symbols for them to match, because then
>the recursion will have to traverse the largest tree until
>before reaching a correct fit.

It should be the other way round: few astersks with many
symbols to swallow. I am at work now an shan't be able to
most my own solution until the evening.

--
() ascii ribbon campaign - against html e-mail
/\ http://preview.tinyurl.com/qcy6mjc [archived]

fir

unread,
Aug 17, 2018, 5:13:14 AM8/17/18
to
lulz this is chunk solution... (hovever note that as to chunks the value is in
founding hov well it /generalizes/ fot text processing bazed work.. and how one can build library on it

(thets why one should name this structure
wthi general name not "match_part",

struct chunk { char *beg; int len; };
struct chunk2 { char *start; int length; };

(or will use for short ->
struct chunk2 { char *s; int l; }; )

i name it temporarely as chunk2 as usually i use char* beg; char* end;
(hovever se my notes on this 2nd form
it may be more friendly by most users accustomed to c(maybe its general more firendly)

for me most interesting here is this find_parts function which is worth noting general "split" function (you could
make o lot of uses reusing this function - lik split loaded text file on lines, then split each of this line on words , or c syntax atoms, the value is such general usage)

i would change few thins here (addopting more to my style and inviting this generalisation)

i revrite this find_parts function wanting to see how it would look in more my style (but not compiling and testing for mistakes i made rewriting this)

four quite important notes (in addition to one or two previous on those generalisations)

1) you dont need to precompute numbers of final chunks, i just call reallock in loop with now worry (year ago when begining to use it i also was precounting but later realized that reallock has internal mechanics to prealloc more for you so in fact you may use it)

2) i dont like to use -> and it is no
need to avoid stating chunk2 part instead of chunk2* part and sticking to ->

3) better use static 'seed' for this reallock then there is automatic guard for memory leaks (you can only left on not dellocked (which is not bad) but still next usage will reuse it

4) i also do not usually just ret a pointer to this list anded zero (i found ended zero arrays as often a violations) but wrap it up ich formal chunk2s (chunks2) structure but here it may stay as it is

(hovever changing it i probably make mistakes rrewritting ;c as i dont want focus myself to realy check it )

[well changed my mind i will revrite it a bit l8er]

Rick C. Hodgin

unread,
Aug 17, 2018, 5:52:45 AM8/17/18
to
On 8/16/2018 2:56 PM, Anton Shepelev wrote:
> Please, publish your entries as immediate replies to this
> message in order to facilitate cohesion.

Solution #1.

My first solution is a non-recursive one, and should deal
with strings of any length and any combination of * ? and
regular letters without any consumption of RAM. It is al-
gorithmic, and proceeds logically through the wildcard
pattern to find the match.

Note: I've edited it down so it fits on these posts formatted.
It normally is simply wide / extended well beyond 80 columns in
places:

int wcmemcmp(const char *l, const char *r, size_t len)
{
size_t i;

// Iterate through each character, recognizing l[n] can have
// question marks to allow anything
for (i = 0; i < len; ++i)
{
// We skip ? characters, and fail on mismatches
if (l[i] != '?' && l[i] != r[i])
return(-1); // Fail
}

// If we get here, it matched
return(0);
}

// The logic / reasons why is in the comments. It goes through
// the process systematically, accounting for each thing as it
// goes. I pass in needleLength and haystackLength to allow for
// non-string-only compares.
int wcmatch1(const char *needle, const char *haystack,
int needleLength, int haystackLength)
{
// ni=needle iterator, hi=haystack iterator
int ni, hi, matchLength, lnResult;

// Make sure our environment is sane
if (!needle || !haystack)
return(0);

// Iterate through the needle, matching as long as we match
for (ni = 0, hi = 0; ni < needleLength; ++ni, ++hi)
{
// Grab this part of the needle we're searching for
switch (needle[ni])
{
case '*':
// Are we at the end of the needle?
if (ni == needleLength - 1)
return(1); // Everything after can match,
// so it's a match

// Look after the * to see what's there
for (++ni; ni < needleLength; )
{
// Repeating asterisks?
if (needle[ni] == '*')
{
// Skip repeating asterisks
++ni;
continue;
}

// We're past wildcards, so we must match
// whatever's from here until the next wc
if (haystack[hi] == 0)
return(0); // Failure, looking for ?
// beyond end of haystack

// Count the length of non-wildcard characters
// in the needle
for (matchLength = 0;
ni + matchLength < needleLength
&& needle[ni + matchLength] != '*'; )
++matchLength;

// Was this the end of the needle?
if (ni + matchLength == needleLength)
{
// We have to find the tail, it's like
// "*n" matching "anton"
if (hi + matchLength > haystackLength)
return(0); // There isn't enough
// room for it to match

// See if it's there at the end
return(wcmemcmp(needle + ni,
haystack
+ haystackLength
- matchLength,
matchLength) == 0);
}

// We must find this many matching characters
for (lnResult = -1;
hi <= haystackLength - matchLength;
++hi)
{
// Is this a match?
if ((lnResult = wcmemcmp(needle + ni,
haystack + hi,
matchLength)) == 0)
break; // Yes
}

// Did we find it?
if (lnResult != 0)
return(0); // No

// Skip past it in both
ni += matchLength - 1;
hi += matchLength - 1;

// We're done at this level, continue on
break;
}

// If we've reached the end of the needle string, and
// the last one was an asterisk, we accept anything
if (needle[ni] == 0 && needle[ni - 1] == '*')
return(1);

// Keep going
break;

case '?':
// This character is always a match if there's room
// in the haystack
if (haystack[hi] == 0)
return(0); // Failure, looking for ? beyond
// end of haystack
break;

default:
// Must match exactly
if (haystack[hi] == 0 || needle[ni] != haystack[hi])
return(0);
break;
}
}

// If we get here, it matched all the way through so far
return(needle[ni] == 0 && haystack[hi] == 0);
}

--
Rick C. Hodgin

Rick C. Hodgin

unread,
Aug 17, 2018, 5:56:10 AM8/17/18
to
On 8/16/2018 2:56 PM, Anton Shepelev wrote:
> Please, publish your entries as immediate replies to this
> message in order to facilitate cohesion.

Solution #2.

My second solution was my attempt at an original recursive algorithm.
It fails on any wildcards that have something like "*n" when searching
for "anton" (because it can't handle the leading anything, followed by
the solid ending.

If you single-step through, it matches properly until the last step.
I haven't spent enough time figuring out what to do to correct it.

// I use a flag to indicate how we're searching, so it
// will respond different on non-match "failures". As
// such, it is initially called with 0 for the 3rd param.
int wcmatch2(const char *needle, const char *haystack,
int searchingForwardForMatch)
{
for (char c; (c = *needle); ++haystack)
{
if (c == '*') {
return(wcmatch2(needle + 1, haystack, 1));

} else if (c == '?') {
if (!*haystack) return 0;
searchingForwardForMatch = 0;

} else if (c != *haystack) {
if (!searchingForwardForMatch) return 0;
else continue;
}

// Increment only sometimes
++needle;
}
return (*(needle-1) == '*' || *needle == *haystack);
}

--
Rick C. Hodgin

Ben Bacarisse

unread,
Aug 17, 2018, 7:05:01 AM8/17/18
to
r...@zedat.fu-berlin.de (Stefan Ram) writes:

> r...@zedat.fu-berlin.de (Stefan Ram) writes:
>>Anton Shepelev <anto...@gmail.com> writes:
>>>The only comment I have about the following is that it works
>>>by splitting the pattern at asterskses and matching the
>>>split sections sequentially against the target string.
>>Trying to understand how this works (I do not understand
>>it at the moment), lead me to
>
> I now believe that I might understand how the split-pattern
> algorithms work: It is never wrong to match the /first/
> segment from the text that matches the current segment from
> the splitted pattern, because, doing this, one never obstructs
> possibilities to successfully match the following pattern segments.
> Therefore, in this case, no kind of backtracking is needed
> to try other segments than the first matching segment from a text.
>
> This would not be true anymore, I believe, when the pattern
> language becomes richer.

That's right. Even with this simple pattern language, * would require
backtracking if the algorithm were greedy, but as you say, picking the
first match never prevents a full match.

--
Ben.

fir

unread,
Aug 17, 2018, 9:10:49 AM8/17/18
to
whan addig those generalisations
this

> static struct match_part *find_parts(const char *pat)
> {
> struct match_part *parts = malloc((count_parts(pat) + 1) * sizeof *parts);
> if (parts) {
> struct match_part *part = parts;
> part->pattern = pat;
> while (*pat)
> if (*pat++ == '*') {
> part->length = pat - part->pattern - 1;
> (++part)->pattern = pat += strspn(pat, "*");
> }
> part->length = pat - part->pattern + 1;
> part[1] = (struct match_part){0, 0};
> }
> return parts;
> }
>

become more like


Chunks2 split_after_char( Chunk2 txt, char split)
{
static Chunks2 table = {0};

Chunk2 word = txt;

table = AddChunk(table, word);

for(int i=0; i<txt.length; i++)
{
if(txt.start[i] == split)
{
table.chunk_[table.length-1].length -= (txt.length-(i+1));

if(!(i+1==txt.length)) //dont open if its last
{
word = {txt.start+(i+1), txt.length-(i+1)};
table = AddChunk(table, word);
}
}
}
return table;
}

this up may possibly be revriten to make shoprter, (possibly not defining word but reusing txt also one if just jecks if this is turn of the loop so its sily and may be rewritten probbly, but i got no time to rewrite it)

(this above function btw splits after each finding of char including that on the end, (which is complete in a sense it breaks input chunk on chunks and summ of lengths == length of input chunk) though here in that wildcard it would be better need of not include them so second form may be written (got no time/energy to do it)


table = AddChunk(table, word);

is just a wraper on realloc for easier usage if i just want to resize one up and add at the end

struct Chunk2 {char* start; int length;};

struct Chunks2 {Chunk2* chunk_; int length;};

Chunks2 resize_Chunks(Chunks2 chunkz, int new_size)
{
chunkz.chunk_ = (Chunk2*) realloc(chunkz.chunk_, sizeof(Chunk2)*new_size);
chunkz.length = new_size;
return chunkz;
}

Chunks2 resize_Chunks_one_up(Chunks2 czunkz)
{
czunkz = resize_Chunks(czunkz, czunkz.length+1);
return czunkz;

}

Chunks2 AddChunk(Chunks2 czunkz, Chunk2 added_one)
{
czunkz = resize_Chunks_one_up(czunkz);
czunkz.chunk_[czunkz.length-1] = added_one;
return czunkz;
}


some may wrote whole small library of such functions for convenience


the strength of that splitting lays in it that it justt produces a table of pointers pointing to oryginal ram but not reads anything to it (and that btw is a reasom why zero ended c strings are bad concept, same as pascal strings denoting length before string contents) you cant split on zero-ended strigs by not touching oryginal where you can do it by splitting on
raw chunks - so this obwiously mean raw chunks are much better and that realization pointed me to a direction to explore them )

fir

unread,
Aug 17, 2018, 9:21:05 AM8/17/18
to
W dniu piątek, 17 sierpnia 2018 15:10:49 UTC+2 użytkownik fir napisał:
>
> Chunks2 split_after_char( Chunk2 txt, char split)
> {
> static Chunks2 table = {0};
>
> Chunk2 word = txt;
>
> table = AddChunk(table, word);
>
> for(int i=0; i<txt.length; i++)
> {
> if(txt.start[i] == split)
> {
> table.chunk_[table.length-1].length -= (txt.length-(i+1));
>
> if(!(i+1==txt.length)) //dont open if its last
> {
> word = {txt.start+(i+1), txt.length-(i+1)};
> table = AddChunk(table, word);
> }
> }
> }
> return table;
> }
>

this above recipe is bazically


chunks Split( chunk txt, char split)
{
chunks table = 0;

for(int i=0; i < txt.length; i++)
{
// chunk word =
table = AddChunk(table, word);
}

return table;
}

i mean adding chunks to table in a loop, (you dont even need to track a 'itereator'
to its top) one just need to add some criterium code that sets those words properly, but ta the moment im not ina mood for that - siometimes those kind of
routines warp/wrap my head to much (though if written are very good to use)

Joe Pfeiffer

unread,
Aug 17, 2018, 9:39:25 AM8/17/18
to
Anton Shepelev <anton.txt@g{oogle}mail.com> writes:

> I wrote:
>
>>In mine, the recursive call is generous (skips the
>>asterisk) and the iterative continuation is greedy (eats a
>>symbol from the string). After I came up with a working
>>function, I have tried to make it as small as possible,
>>whence the tricky use ++ and --. The result is quite
>>different from, and I belive better than, the original.
>>
>>I have not tested this conjecture, but meseems this
>>function will perform the worst when there are many
>>asterisks but few symbols for them to match, because then
>>the recursion will have to traverse the largest tree until
>>before reaching a correct fit.
>
> It should be the other way round: few astersks with many
> symbols to swallow. I am at work now an shan't be able to
> most my own solution until the evening.

My intuition is that given any solution, a pattern-text pair can be
constructed that will drive the solution into exploring the whole space
of possible matches and hence will be exponential. What that pair looks
like will depend on the solution you're trying to drive into poor
behavior.

Anton Shepelev

unread,
Aug 17, 2018, 9:41:01 AM8/17/18
to
Joe Pfeiffer:

>My intuition is that given any solution, a pattern-text
>pair can be constructed that will drive the solution into
>exploring the whole space of possible matches and hence
>will be exponential. What that pair looks like will depend
>on the solution you're trying to drive into poor behavior.

I think it doesn't happen with my iterative solution, which
I will post in the evening.

fir

unread,
Aug 17, 2018, 10:04:33 AM8/17/18
to
alright this works

Chunks2 split_after_char( Chunk2 text, char split)
{
static Chunks2 table = {0};

for(Chunk2 word = {text.start, 0};text.length--; text.start++)
{
word.length++;

if(*text.start==split)
{
table = AddChunk(table, word);
word = {text.start+1, 0};
}
}
return table;
}

(spo this is thosse simpler version i was talking to. its not only less code it
allows to expres thoughts more naturally than in that weird old c form

for example you may try

if(word.length>5)
table = AddChunk(table, word);

add only those that are longer than 5 -
its expresses naturelely

Malcolm McLean

unread,
Aug 17, 2018, 10:35:12 AM8/17/18
to
A dynamic programming solution should be O(N^2) and should solve all possible
tame / wild pairs. So grossly inefficient for the common case, but useful
if you have to cope with malicious input.

It works by creating a rectangle with the tame and wild strings as the
axes. You then start at the top left corner and trace a path through
the possible matches. Since we just need a match rather than the most
efficient match, if there is a path to the bottom right we pass and if
it is all blocked off by illegal moves we fail.

Ben Bacarisse

unread,
Aug 17, 2018, 10:38:39 AM8/17/18
to
Matching every non-* part at the first possible location will produce a
linear-time algorithm.

--
Ben.

fir

unread,
Aug 17, 2018, 10:41:20 AM8/17/18
to
duh, there was yet small bug in it
(it only was adding word when it noticed split char so if it not ended bysplit char it was ignored

but then word.length after a loop is > 0
so need to add

if(word.length>0)
table = AddChunk(table, word);

i hope it is ok


Chunks2 split_after_char( Chunk2 text, char split)
{
static Chunks2 table = {0};

Chunk2 word = {text.start, 0};

for(;text.length--; text.start++)
{
word.length++;
// printf("\n %c %d", *text.start, word.length);

if(*text.start==split)
{
table = AddChunk(table, word);
word = {text.start+1, 0};
}
}

if(word.length>0)

Ben Bacarisse

unread,
Aug 17, 2018, 10:52:29 AM8/17/18
to
What do you mean by "malicious input" and the "the most efficient
match"? Can you give examples and, ideally, an implementation? It does
not sound as efficient as the fast solutions already posted, but it
sounds like it would be a fundamentally different algorithm.

--
Ben.

Anton Shepelev

unread,
Aug 17, 2018, 10:53:00 AM8/17/18
to
Ben Bacarisse:

>Matching every non-* part at the first possible location
>will produce a linear-time algorithm.

Indeed, and my solution has also special handling of the
last such section in order to avoid a full-string scan in
cases such as "*b" for "aaaaaaaaaaabbbbbbbbb".

Joe Pfeiffer

unread,
Aug 17, 2018, 10:53:22 AM8/17/18
to
Yes, you are correct.

Malcolm McLean

unread,
Aug 17, 2018, 10:57:55 AM8/17/18
to
With a greedy match

"*ab" matches "xab", but not "xabab".

so you have to try again if you've got spare tame characters.

Anton Shepelev

unread,
Aug 17, 2018, 11:13:21 AM8/17/18
to
Malcolm McLean to Ben Bacarisse:

>>Matching every non-* part at the first possible location
>>will produce a linear-time algorithm.
>
>With a greedy match
>
>"*ab" matches "xab", but not "xabab".
>
>so you have to try again if you've got spare tame
>characters.

By trying again you mean resuming the search along the
target string. This algorithm *is* linear because it has no
backtracking.

fir

unread,
Aug 17, 2018, 11:18:14 AM8/17/18
to
specifically to turn this "split after " routine on "split around" it takes two lines of addition

Chunks2 split_around_character( Chunk2 text, char split)
{
static Chunks2 table = {0};

Chunk2 word = {text.start, 0};

for(;text.length--; text.start++)
{
word.length++;

if(*text.start==split)
{
word.length--; //last is split character so may shrink one
if(word.length>0) //dont add empty ones
table = AddChunk(table, word);

word = {text.start+1, 0};
}
}

if(word.length>0)
table = AddChunk(table, word);

return table ;

}



TESTING "aa**przykladowy*tekst***dla**testu*test*this**6"

debbuging Chunks2

0 start 0040B1E0 len 3 "aa*"
1 start 0040B1E3 len 1 "*"
2 start 0040B1E4 len 12 "przykladowy*"
3 start 0040B1F0 len 6 "tekst*"
4 start 0040B1F6 len 1 "*"
5 start 0040B1F7 len 1 "*"
6 start 0040B1F8 len 4 "dla*"
7 start 0040B1FC len 1 "*"
8 start 0040B1FD len 6 "testu*"
9 start 0040B203 len 5 "test*"
10 start 0040B208 len 5 "this*"
11 start 0040B20D len 1 "*"
12 start 0040B20E len 1 "6"

debbuging Chunks2

0 start 0040B1E0 len 2 "aa"
1 start 0040B1E4 len 11 "przykladowy"
2 start 0040B1F0 len 5 "tekst"
3 start 0040B1F8 len 3 "dla"
4 start 0040B1FD len 5 "testu"
5 start 0040B203 len 4 "test"
6 start 0040B208 len 4 "this"
7 start 0040B20E len 1 "6"



i talk about this split routine so much becouse it is for me much more general and important that those 'wildcards' (which are weird usecase and not quite important)

Tim Rentsch

unread,
Aug 17, 2018, 11:23:48 AM8/17/18
to
("Problems worthy of attack
prove their worth by hitting back." - Piet Hein)


Good morning, Mr. Phelps.

Intelligence sources have recovered an ingenious solution to the
wcmatch() programming challenge. The solution consists of 20 C
functions, whose prototypes are shown below.

typedef const char *const Ss;
typedef unsigned long const Zz;

extern char wcmatch( Ss, Ss );

static char wc0( Ss, Zz, Ss, Ss, Zz );
static char wc0a( Ss, Zz, Ss, Ss, Zz );
static char wc0b( Ss, Zz, Ss, Zz, Ss, Ss, Zz );
static char wc0c( Ss, Zz, Ss, Zz, Ss, Zz );

static char wc1( Ss, Zz, Ss, Ss, Zz, Ss, Zz );
static char wc1a( Ss, Zz, Ss, Ss, Zz, Ss, Zz );
static char wc1b( Ss, Ss, Zz, Ss, Zz );

static char wc2( Ss, Ss, Ss, Zz );
static char wc2a( Ss, Ss, Ss, Ss, Zz );
static char wc2b( Ss, Zz, Ss, Ss, Ss, Zz );
static char wc2c( Ss, Zz, Ss, Zz );

static char wc3( Ss, Zz, Ss, Ss, Ss, Zz );
static char wc3a( Ss, Ss, Ss, Zz, Ss );

static Ss wcf( Ss, Zz, Ss, Zz );
static char wc( Ss, Zz, Ss );

static Ss ss( Ss );
static Ss fsr( Ss, Zz );
static Ss fs( Ss );
static Zz sl( Ss, Zz );

Unfortunately much of the main body of code was lost in the
recovery process. All that remains are the contents of each
function, shown below in an unknown ordering.

return pp ? wc0a( p, pp-p, ss( pp ), w, n ) : k == n && wc( p, k, w );

return k > n ? 0 : wc( p, k, w ) ? 1 : wc2c( p, k, w+1, n-1 );

return n == 0 ? 0 : s[n-1] == '*' ? s+n-1 : fsr( s, n-1 );

return !r ? wc0c( p, k, q, m, w, n ) : wc1( p, k, q, r+1, q+m-r-1, w, n );

return k < 1 ? 1 : *p == *w || *p == '?' ? wc( p+1, k-1, w+1 ) : 0;

return wc0b( p, k, q, sl( q, 0 ), fsr( q, sl( q, 0 ) ), w, n );

return k > n ? 0 : wc( p, k, w ) ? w+k : wcf( p, k, w+1, n-1 );

return wc( t, m, w+n-m ) && wc2( q, t, w, n-m );

return wc0( p, sl( p, 0 ), fs( p ), w, sl( w, 0 ) );

return x ? wc2( p, t, x, w+n-x ) : 0;

return k+m > n ? 0 : wc1a( p, k, q, t, m, w, n );

return wc3a( q, t, w, n, wcf( p, k, w, n ) );

return *p ? sl( p+1, k+1 ) : k;

return k+m <= n && wc( p, k, w ) && wc( q, m, w+n-m );

return wc2b( p, pp-p, ss( pp ), t, w, n );

return q == t ? wc2c( p, k, w, n ) : wc3( p, k, q, t, w, n );

return *s == '*' ? ss( s+1 ) : s;

return wc( p, k, w ) && wc1b( q, t, m, w+k, n-k );

return wc2a( p, fs( p ), t, w, n );

return *s == '*' ? s : *s ? fs( s+1 ) : 0;


Your mission, should you decide to accept it, is to reconstruct
the missing functions, and produce a working wcmatch() function.
As usual should you or any of your IM force be caught or killed
the Secretary will disavow any knowledge of your actions. This
tape will self-destruct in five seconds. Good luck Jim.

Malcolm McLean

unread,
Aug 17, 2018, 11:23:56 AM8/17/18
to
The dynamic programming method is similar to the protein alignment
algorithms I used in bioinformatics. There you can have a perfect match
(the identical amino acid) a reasonable match (a chemically similar
amino acid, like leucine and isoleucine) or a total mismatch (tryptophan
and glycine). The best alignment minimises gap insertions or deletions,
and also the poor matches, according to some score function.

With the wildcard problem, we don't really have a "bad match" or a "good
match", just a match or a non-match. But you could say that a match
which has fewer wildcarded characters at the ends (on input like "*a*a*"
and "aaaaaaaaaaaaaaaa") is "better" than a match which matches it to
the first two as. But in fact that's not aconsideration, so we don't need
to score our matrix, we just flag whether cells are accessible or not
from the three cells to the left, above, and to the top left.

By malicious input, I mean input contrived to cause worst-case performance,
not normal input you would expect from typical use.

Anton Shepelev

unread,
Aug 17, 2018, 11:38:11 AM8/17/18
to
Tim Rentsch:
That belongs to the obfuscated C contest:

https://ioccc.org/

Malcolm McLean

unread,
Aug 17, 2018, 11:40:48 AM8/17/18
to
On Friday, August 17, 2018 at 4:13:21 PM UTC+1, Anton Shepelev wrote:
> Malcolm McLean to Ben Bacarisse:
>
> >>Matching every non-* part at the first possible location
> >>will produce a linear-time algorithm.
> >
> >With a greedy match
> >
> >"*ab" matches "xab", but not "xabab".
> >
> >so you have to try again if you've got spare tame
> >characters.
>
> By trying again you mean resuming the search along the
> target string. This algorithm *is* linear because it has no
> backtracking.
>
You're right.

I was thinking that on input like

"*ab" and "ababababababababab" you have to do N searches of an N length
string, but of course the search returns quickly.

fir

unread,
Aug 17, 2018, 11:52:16 AM8/17/18
to
W dniu piątek, 17 sierpnia 2018 17:13:21 UTC+2 użytkownik Anton Shepelev napisał:
> By trying again you mean resuming the search along the
> target string. This algorithm *is* linear because it has no
> backtracking.
>

one may say say this matching is linear but it is said a bit roughly - when edge words stated and dont fit it ten to be constant (saing rougly ;c linear to lenght of that words, saing roghly, coz its mostly linear to lengths of the parts that "nearly fit") - and if this not the case it may be worse than linear becouse this may work more on those attampts when places nearly fit (to thiose lengts that nearly fit).. besides its possibly not worth to go more into it ;c

Bart

unread,
Aug 17, 2018, 11:58:05 AM8/17/18
to
With an example like this, I assume the strings are as follows:

"*ab<eos>"
"ababababababababab<eos>"

With the <eos> end of string being considered part of the pattern and
part of the text. (And of course C strings already have a 0 byte for
that purpose.)

So the "ab" in pattern doesn't match any earlier ones, only the last. (I
haven't been able to think of examples where there are multiple matches
for what follows "*", and choosing the first one gives the wrong result.)

--
bart

fir

unread,
Aug 17, 2018, 12:00:07 PM8/17/18
to
W dniu piątek, 17 sierpnia 2018 09:27:43 UTC+2 użytkownik Ike Naar napisał:
>
> The slightly more readable version:
>
> int match(char const *pat, char const *txt);
>
> static int star(char const *pat, char const *txt)
> {
> return match(pat, txt) || (*txt && star(pat, txt + 1));
> }
>
> static int match(char const *pat, char const *txt)
> {
> return (*pat == 0 && *txt == 0)
> || (*pat == '*' && star(pat + 1, txt))
> || (*pat == '?' && *txt && match(pat + 1, txt + 1))
> || (*pat == *txt && match(pat + 1, txt + 1));
> }

i proclaim this as a winner... if i remember those posted forms this seem to looks simplest (if im not wrong) - if this works (im not goin into analysing those recursive forms)

is it linear like this non recursive
forms (those with findstring matched parts in a loop) or is it some higher order of cpu 'encumberance?'?

Malcolm McLean

unread,
Aug 17, 2018, 12:12:25 PM8/17/18
to
Exactly, yes, so you've got to call the modified strstr (accommodating
the question mark) repeatedly. But it returns quickly - it doesn't
search the whole of the target string.

Malcolm McLean

unread,
Aug 17, 2018, 12:20:16 PM8/17/18
to
However what about this pair?

"a*a*a*a*a*a*a*a" "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"

are we still OK? If we retry the last match then we should be, but
if we retry the first we'll have a cascade of retries as each match
gets moved along one.

Anton Shepelev

unread,
Aug 17, 2018, 12:31:35 PM8/17/18
to
Bart:

>With an example like this, I assume the strings are as
>follows:
>
> "*ab<eos>"
> "ababababababababab<eos>"
>
>With the <eos> end of string being considered part of the
>pattern and part of the text. (And of course C strings
>already have a 0 byte for that purpose.)
>
>So the "ab" in pattern doesn't match any earlier ones, only
>the last.

Great idea.

Ben Bacarisse

unread,
Aug 17, 2018, 12:34:28 PM8/17/18
to
Anton Shepelev <anton.txt@g{oogle}mail.com> writes:

> Ben Bacarisse:
>
>>Matching every non-* part at the first possible location
>>will produce a linear-time algorithm.
>
> Indeed, and my solution has also special handling of the
> last such section in order to avoid a full-string scan in
> cases such as "*b" for "aaaaaaaaaaabbbbbbbbb".

Yes, I considered checking last part early but it meant a little more
logic and finding the length of the subject string which I don't
otherwise need to do.

It would almost certainly be worth it in practice though, as would
dropping the last part (at least the way I do it) when it is certain to
match as in "b*".

--
Ben.

Anton Shepelev

unread,
Aug 17, 2018, 12:36:27 PM8/17/18
to
Malcolm McLean:

>However what about this pair?
>
>"a*a*a*a*a*a*a*a"
>"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
>
>are we still OK? If we retry the last match then we should
>be, but if we retry the first we'll have a cascade of
>retries as each match gets moved along one.

No 'a' match need be retried but the last. All the previous
ones may happily rest at the earliest possible location:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaa ... retrying ............ a

But it is better explicitly to request the last match at the
end!

Ben Bacarisse

unread,
Aug 17, 2018, 12:42:50 PM8/17/18
to
You are using terms in an odd way. Greedy usually refer to wild matches
and a greedy * matches as much as possible. I think you are using the
terms to mean "match as soon as possible".

But, no, with my algorithm (well, it's hardly mine) "*ab" does match
both "xab" and "xabab". It only fails to match "xabab if the program
has a bug! The pattern "*ab" means that ab can only match at the end of
the subject string, just as "ab*" means that ab much match at the start.

The patterns in this problem are anchored. "*ab" means /^.*ab$/ in the
usual regular expression syntax.

> so you have to try again if you've got spare tame characters.

Not if you use the right algorithm.

--
Ben.

fir

unread,
Aug 17, 2018, 12:46:54 PM8/17/18
to
W dniu piątek, 17 sierpnia 2018 00:33:49 UTC+2 użytkownik fir napisał:
> fir's typical code; co called 'no brainer' i guess - as someone previously did that thios obvious way, split wildcard parretn over the '*'s and check if all its parts (words) are just in text, one against another (only you must use strcomp version that returns chars equal is some are ?)
>
> i use my own library functions, mainly one that split, one that is analog tos strcmp, one that is analog to strfind (i dont want to include it here as there are library ones and somewhat strightforward
> (though to be hones when one would like to write split there is rishy to make faults, at least for me i dont know why ;c)
>
>
> int check_match(chunk txt, chunk pattern)
> {
> chunks words = SplitOnCharacter(pattern, '*');
> int words_len = ChunksLength(words);
>
> if(pattern.beg[0]!='*')
> if(!compare_left_WITHASKSIGNS(txt, words.first[0]))
> return -1;
>
> if(pattern.end[0]!='*')
> if(!compare_right_WITHASKSIGNS(txt, words.last[0]))
> return -2;
>
>
> chunk input_txt = txt;
> for(int n=0; n<words_len; n++)
> {
> chunk found = find_WITHASKSIGNS(input_txt, words.first[n]);
> if(found.beg)
> input_txt.beg = found.end + 1;
> else
> return -3;
> }
>
> return 1;
> }

so getting back to the previous topic whan i was siang to write such routine is fairly simple (when some suggested its not ;c ) and than i couldnt say why i think its simple

what do you need here split pattern on parts by *

[DIGRESSION ON SPLIT

(you need only this split function written, (i showed code below (named split_aroud_character) this split function is of medicore complexity though i must say it weirdly wraped my head when was written - i wrote it beck then, working ok) but now wnated to rewrite it in more shoprt form and it wraped my head again ;c

but i say it was unnecesary to acomplsh this task as i had one... this function is btw almost like a fathet to text processing imo.. its just of great importance, ]

after split you got array of words comepare first and last to edges of text
(its trivial compare function is so trivial not worth mentioning.. if it passes search for rest words one after another (findst function is as much trivial as compares)

so you got involved:
- medicore split(),
- trivial compare(),
- trivial finstr(),
- and also sorta trivial final check_match()

the idea of writing it this way is also quite simple.. (and i wrote the above
in fact quite quick*, sorta like tyme you need to type it in and slight debugging [this debugging is the main work here so i mentioned it as a bigger part of the problem])

(sorta quick if you exclude my personal pains with rewritting spliter function,
my own chunk form is stylistical addition as you can write it also without chunks,
though most people probably would buildin splitting logic into the the check_match
function, - separating it as a pre pass is
ime clearer, (then i guess many from those (probably minority) who would do pre-pass itprobably would use only array of pointers, minority of them may use non-generalized chunks like ben bacrise.. i suggest use genaralized chunks idea (bcouse its wery nice to use)


Bart

unread,
Aug 17, 2018, 12:49:57 PM8/17/18
to
If you rewrite the strings so that each 'a' is numbered, and shown in
upper case in the text:

"a1*a2*a3*a4*a5*a6*a7*a8"

"A1A2A3...A35" (I've guessed the number...)

Then a1 matches A1. Then comes * which allows you to skip any characters
until you match a2, which is matched immediately with A2. That continues
until a7 matches A7, but then the next pattern to match is "a8<eos>"
which only matches "A35<eos>". That last * skips A8 to A34.

(I'm talking about my approach here, which I think is mostly linear. I
don't know about those short recursive solutions.

Mine only becomes non-linear when trying to match "abcde" with text of
"aababcabcdabcde" when it has to keep restarting the matching process
("aaaaaaaaaaaaaabcde" is a simpler text example).)

--
bart

fir

unread,
Aug 17, 2018, 1:14:33 PM8/17/18
to
yet maybe note some thing:

speaking on "key words"

"split" is one word which i could name key word (some wordst that i use like keywords as i fint their importance) (othar are "chunk", "chunks", "dll" (as i like them a lot) and yet "container" ("resizable chunks-based container" )

this container is also very important topic (like chunk idea and split function)
- i mean chunk table packed in a form of chunk defines quite precisely some kind of
basic container which imo is of much higher importance than c++ tresh containers python dirctories etc.. it just something that physicaly grows on idea of chunks mixed with reallock idea and blends well

as showed here in simple example you may define AddChunk) function that allows you to store chunks and write algorithms without even tracking index on records
(you dont need to track its 'top' as its length is its field as it should be)

it also quite neturalelly gives you a feel what you can do with that and whiat api
you may or need to write to get it handy

(which is great value becouse normally not necessarely you mak know is sane to expect form container and how to design it - here it is born naturalely, and it answers many hard questions)

few days ago i was writing text editor using this container (doubly resizable) i mean list of chunks is born on reallock
and its contained chunkas are also born on reallock and i may say i additionaly think its quite good and natural design for holding text files contents (thougn not only text you ay store what you want)

(more on this container probably in future.. when mu experience with that will grow up)

Bart

unread,
Aug 17, 2018, 1:16:54 PM8/17/18
to
On 17/08/2018 17:49, Bart wrote:

> (I'm talking about my approach here, which I think is mostly linear. I
> don't know about those short recursive solutions.

My approach does things left to right. This seems rather arbitrary, and
you hope it would work the same way right to left.

By running Stefan's tests with all the strings reversed, I still get the
same results, so it looks like it doesn't matter which end you start from.

"*n" matches "anton", and "n* matches "notna".

--
bart

Rick C. Hodgin

unread,
Aug 17, 2018, 1:23:01 PM8/17/18
to
I had thoughts on this when I was developing and debugging my
algorithm. I concluded that there really needs to be a different
algorithm for the terminal *n cases, where it does look to the
right-side for the match.

For all other types, it looks from the left. If there is an
intermediate *n in there somewhere (ab*n*k) then it would search
left-to-right for that component, finding the first match, and
then on the last *k it would look for something ending in k only.

Seems to be the way the wildcard match should work.

My longer solution #1 does this. My shorter solution #2 does not
handle *n searches. It fails on them all I believe.

--
Rick C. Hodgin

Ben Bacarisse

unread,
Aug 17, 2018, 1:24:42 PM8/17/18
to
Anton Shepelev <anton.txt@g{oogle}mail.com> writes:

> Bart:
>
>>With an example like this, I assume the strings are as
>>follows:
>>
>> "*ab<eos>"
>> "ababababababababab<eos>"
>>
>>With the <eos> end of string being considered part of the
>>pattern and part of the text. (And of course C strings
>>already have a 0 byte for that purpose.)
>>
>>So the "ab" in pattern doesn't match any earlier ones, only
>>the last.
>
> Great idea.

I guess people are not reading the explanations we post! :-(

--
Ben.

Ben Bacarisse

unread,
Aug 17, 2018, 1:32:41 PM8/17/18
to
fir <profes...@gmail.com> writes:

> W dniu piątek, 17 sierpnia 2018 09:27:43 UTC+2 użytkownik Ike Naar napisał:
>>
>> The slightly more readable version:
>>
>> int match(char const *pat, char const *txt);
>>
>> static int star(char const *pat, char const *txt)
>> {
>> return match(pat, txt) || (*txt && star(pat, txt + 1));
>> }
>>
>> static int match(char const *pat, char const *txt)

(small technical matter -- you have a static definition and a non-static
declaration.)

>> {
>> return (*pat == 0 && *txt == 0)
>> || (*pat == '*' && star(pat + 1, txt))
>> || (*pat == '?' && *txt && match(pat + 1, txt + 1))
>> || (*pat == *txt && match(pat + 1, txt + 1));
>> }
>
> i proclaim this as a winner... if i remember those posted forms this
> seem to looks simplest (if im not wrong) - if this works (im not goin
> into analysing those recursive forms)
>
> is it linear like this non recursive
> forms

No, it's exponential in the worst cases like the other recursive forms.
(Non exponential recursive versions are possible but they don't use this
simple algorithm.)

--
Ben.

Ben Bacarisse

unread,
Aug 17, 2018, 2:00:13 PM8/17/18
to
<snip>
>> What do you mean by "malicious input" and the "the most efficient
>> match"? Can you give examples and, ideally, an implementation? It does
>> not sound as efficient as the fast solutions already posted, but it
>> sounds like it would be a fundamentally different algorithm.
<snip>
> By malicious input, I mean input contrived to cause worst-case performance,
> not normal input you would expect from typical use.

I thought that's what you meant. I asked because, since there is a
liner algorithm[1], I could not contrive any cases where an O(N^2)
solution would be useful. But maybe you wrote the above thinking that
the only other algorithms were exponential.

[1] Exactly what I means by this is a bit vague. There are at least two
"sizes" to consider -- the pattern length, p, and the subject length s.
I think that if you do the fixed-size matches with Boyer-Moore (+Galil)
you get O(p+s) even for these * patterns.

--
Ben.

luser droog

unread,
Aug 17, 2018, 2:11:17 PM8/17/18
to
On Thursday, August 16, 2018 at 1:56:50 PM UTC-5, Anton Shepelev wrote:
> Hello, ladies and laddies!
>
> I hereby commence this new thread for the submission and
> discussinon of entries to our little wildcard-matching
> exercise announced in this post:
>
> Subject: A small programming exercise: wildcard matching
> Date: Sat, 11 Aug 2018 15:49:53 +0300
> Message-ID: <20180811154953.ac18...@gmail.com>
>
> Please, publish your entries as immediate replies to this
> message in order to facilitate cohesion.


Sigh. It took much longer to find my bug than it did to write
in the first place. What happened was when I sat down to get
started over the weekend, my internet connectivity dropped out.
And I had the order of the arguments reversed in the prototype.
When I ran the tests, I went ahead and fixed all the function
prototypes ... but neglected to fix all the recursive calls.

But my small one works with the 9 test cases and so I'm not
trying to torture test it just yet.


int
wcmatch2( char const *pat, char const *str ){
return !str || !pat ? 0 :
*str == *pat || *pat == '?' ? !*pat || wcmatch2( pat+1, str+1 ) :
*pat == '*' ? wcmatch2( pat+1, str ) || wcmatch2( pat, str+1 ) :
0;
}



I also have this variant which expresses the same logic
but with if/else statements instead of expressions. I also
changed the pointer-wise expressions to array notation just
to make it even more different.


int
wcmatch6( char const *pat, char const *str ){
if( str == NULL || pat == NULL ) return 0;

// printf( "%c.%c\n", *pat, *str );
if( str[0] == pat[0] || pat[0] == '?' )
if( pat[0] == '\0' )
return 1;
else
return wcmatch6( &pat[1], &str[1] );

if( pat[0] == '*' )
return wcmatch6( &pat[1], str ) || wcmatch6( pat, &str[1] );

return 0;
}

fir

unread,
Aug 17, 2018, 2:18:36 PM8/17/18
to
if so i think it is winner in sjrt forms

in more computatively proper my chunk based is my favourite ;c (though i m not much tested it)

BYW when i would wrote any recursion im like always find it just as bug becouse there are non popular methods of checking
the distance to the stack overflow case

(they probably can be found but i incidentally not heard) (and unles
using it to at least call assertion using recurential codes seem jus like intentionally coded bug ;c)

fir

unread,
Aug 17, 2018, 2:24:03 PM8/17/18
to
if all that short form writers would like
strongly to 'win' i suggest turning label names into 1-letters (or just count labal as 1) and count characters or operators (hovever visueal simplicity also amtters and there should be taken into consideration) also there should be checked for some corecctnes.. i dont much care to that level, at tke moment this "ike naar" form seems winner to me

fir

unread,
Aug 17, 2018, 2:32:54 PM8/17/18
to
W dniu piątek, 17 sierpnia 2018 20:18:36 UTC+2 użytkownik fir napisał:
> if so i think it is winner in sjrt forms
>

i meant "short forms"

> in more computatively proper my chunk based is my favourite ;c (though i m not much tested it)
>
> BYW when i would wrote any recursion im like always find it just as bug becouse there are non popular methods of checking
> the distance to the stack overflow case
>
> (they probably can be found but i incidentally not heard) (and unles
> using it to at least call assertion using recurential codes seem jus like intentionally coded bug ;c)

obviously this is absolutely silly as
normally stack is just array of ram
where you got stack pointer pointed somewhere inside... nothing easier to
know (and eventually mamnago those two waluses, size of that array and this pointer reletive to begining)

c std comitee instead of making many
bad things should make some work and standarize at least a function that gives
the distance to the bottom (and put that in clib to me to get it in next iteration of msvcrt.dll ;c

fir

unread,
Aug 17, 2018, 2:41:40 PM8/17/18
to
also there is probably a technical option to just realloc stack ob overflow
(at least ynder windows)...
i must say im not even sure how thet practically do it now, im suffice to know i may set stack sieze in PE file
and typically it is 1 or 2 MB (i knowed more bit honestly forgot*)

(*which is obviously bad, i should learn it)

Rosario19

unread,
Aug 17, 2018, 3:05:40 PM8/17/18
to
On Thu, 16 Aug 2018 21:56:42 +0300, Anton Shepelev wrote:

>Hello, ladies and laddies!
>
>I hereby commence this new thread for the submission and
>discussinon of entries to our little wildcard-matching
>exercise announced in this post:
>
> Subject: A small programming exercise: wildcard matching
> Date: Sat, 11 Aug 2018 15:49:53 +0300
> Message-ID: <20180811154953.ac18...@gmail.com>
>
>Please, publish your entries as immediate replies to this
>message in order to facilitate cohesion.

this is the my, afther some test...

#include <stdio.h>
#include <time.h>

#define F for
#define R return
#define P printf
#define GG(a,b) if(a)goto b

/*
Input: a, b pointer to C array 0 terminated (strings)
Search the string "a" that can contain too wildcard:
"*" as generic substring that can be null too
"?" as generic substring of 1 char
in the string "b";
if error it return -1
(to many recursive call or too long strings or too much time)
if "a" match "b", it return 1 else
if "a" do not match "b", it return 0
NB: For Big O it could be exponential in the wrost case;
If find one calculus take > 10 seconds time => return -1
*/
wcmatchR(char* a, char* b)
{static time_t ti, tf;
static int nc,r; // nc means number of call of this function
unsigned ns;
char *pa, *pb, ch;

if(nc==0)tf=ti=time(0); //nc==0 at start
++nc; // if recursivly call >900 times => error
if(nc>= 900){re: --nc;R -1;}// return error
if(a== 0){r1: --nc;R 1;}// return match result
if(b== 0){r0: --nc;R 0;}// return not match result
tf=time(0); // if calculate too much(e.g. exponential)return -1
GG(difftime(tf,ti)>10.0, re); //10 seconds of calculum:too much
F(ns=0;;++a,++b,++ns)
{//P("%c %c, %d %d\n", *a, *b, *a, *b);
GG(ns>=0xFFFFFF, re); //string too much long R -1
if(*a=='*'){F(;*a=='*';++a); --a;} //if*a=='*'goto the last '*'
if(!*b){GG(*a=='*'&&a[1]==0, r1);--nc;R !*a;}// if *b==0 =>exit
switch(*a)
{case '?': break;
case '*': ++a; GG(*a==0, r1); //in the case of "..*" => match
F(ch=*a,pb=b,pa=0;*pb;++pb) //search ch in b until\0
{if(*pb==ch||ch=='?')
{pa=pb;
if(a[1])
{GG((r=wcmatchR(a+1,pb+1))==1, r1);
GG(r==-1,re); //stack overflow
} //exponential calc
}
}
GG(pa==0, r0); // ch not found
b=pa;
break; // [here *a!=0]
default: GG(*a!=*b, r0); //here *b!=0;
} //if*a==0=>*a=0!=*b return not match
}
}

Rosario19

unread,
Aug 17, 2018, 3:15:48 PM8/17/18
to
possibly wmatch(a,b)==wmatch(reverse(a), reverse(b))

Anton Shepelev

unread,
Aug 17, 2018, 3:23:21 PM8/17/18
to
This is my primary solution, and the only one that I wrote
entirely by myself. It is iterative, does not allocate
memory, and operates with indices into the text and pattern
arrays.

The basic idea is to split the pattern at compacts of
astersks so that the components, which I call sections,
contain only normal characters or question marks. The
function wcnext() finds treats the pattern as a stream of
sections and returns, upon every call, one section with the
additional "boundary" infomation: whether the match must be
fixed at either end of the text, or whether it can be
located anywhere within the unscanned remainder of the text.
It returns the flag indicating that the just read section
was the last in the pattern. For example, the pattern
a**b?c*d* will be split into the following sections:

+-----------------------------+
|section start end last |
+-----------------------------+
|a Y N N |
|b?c N N N |
|d N N Y |
+-----------------------------+

I have included two alternative implementations of this
function: as a state automaton and as three scanning loops.
The loops version turned much simpler, but I had thought
otherwise and had started with the automaton.

For all sections, the function wcmatch() finds the earliest
match in the target string satisfying the boundary
conditions shown above. Only the middle sections -- those
not anchored either to the start or the end of the
text -- require a search. For anchored sectinos a direct
match at a fixed offset is sufficient.

My comment declares O(n) performance. That implies the
length of the target string with a fixed pattern. The
actual performance is approximately O(mn), with m denoting
pattern length. It could be futher improved by using a
optimised substring search algorithm, such as Knuth-Morris-
Pratt.

I am not satisfied with my code, and shall welcome
refactoring advise.

/* ----------------- Efficient wildcard matching routine ----------------------
( by Anton Shepelev )

char wcmatch( char const * pattern, char const * text );

Tests whether wildcard pattern <pattern> matches string <text> and returns 1
upon a match and 0 upon a mismatch. wcmatch() performs in O(n) time and
O(1) memory, and that exclusively on the stack.

Macro parameter SCAN_TYPE selects the pattern scanner implementation:
SC_STATEAUTO: finite-state automaton
SC_SEQUENTIAL: sequential loop-based
----------------------------------------------------------------------------- */
#include <string.h>

#define SC_STATEAUTO 1
#define SC_SEQUENTIAL 2

#define SCAN_TYPE SC_SEQUENTIAL
/* ---------------------------- Pattern scanner ----------------------------- */

typedef struct /* Pattern state: */
{ const char const * _; /* pattern string */
int start; /* index of start charcacter */
int end; /* index of end character */
int pos; /* index of current character */
} pattern_t;

typedef enum /* type of match required for a section: */
{ EsMid, /* match in anywere in the middle */
EsBeg, /* match at the start */
EsEnd, /* match at the end */
EsExact, /* match at both ends */
} PatMatch; /* (match-at-end help optimise cases like *n against nnnnnnnnn) */

typedef enum /* Scanner state: */
{ StSeek, /* seeking for start of section */
StRead, /* reading a section */
StSkip /* skiping trailing asterisks */
} NsState;

/* Acquire the next section from pattern <pat>. The pattern is interpreted */
/* as a stream of sections separated by compacts of asteriskses. Sets <res> */
/* to the type of the required match operation, sets <pat->start> and
/* <pat->end> to the start and end of the section. Returns zero if there are */
/* more sections to read and 1 otherwise. */
static char wcnextsec( pattern_t * pat, PatMatch * res )
{ int pos;
char c, last, match_beg, match_end;

#if SCAN_TYPE == SC_SEQUENTIAL
pos = pat->pos - 1; /* for a one-liner initial loop */

while( (c = pat->_[++pos]) == '*' ); /* skip initial astersks */
pat->start = pos; /* mark start of section */
for(;c != '\0' && c != '*';c = pat->_[++pos]); /* read section */
pat->end = pos - 1; /* mark of section */
for(;c == '*'; c = pat->_[++pos] ); /* skip trailing astersks */

Done:
last = c == '\0';

#elif SCAN_TYPE == SC_STATEAUTO
NsState state;

last = 0;
state = StSeek;
pos = pat->pos;

while( 1 == 1 )
{ c = pat->_[pos];
switch( c )
{ case '\0':
switch( state )
{ case StSeek: pat->start = pos; /* fallthrough * */
case StRead: pat->end = pos - 1;
}
last = 1;
goto SA_DONE; /* the quickest way to leave a loop from a switch */
break;
case '*':
if( state == StRead )
{ pat->end = pos - 1;
state = StSkip;
}
break;
default:
switch( state )
{ case StSkip: goto SA_DONE;
case StSeek:
state = StRead;
pat->start = pos;
break;
}
break;
}
pos = pos + 1;
}
SA_DONE:
#else
#error Set SCAN_TYPE either to SC_STATEAUTO or to SC_SEQUENTIAL
#endif

pat->pos = pos;

/* OPT: set match_beg> and <match_end> within the loop, only when required */
match_beg = pat->start == 0;
match_end = last && pat->end == pos - 1;

/* Encoding the boundary conditions into a single value */
/* so as to swtich upon it */
*res = match_end << 1 | match_beg; /* HACK: relies on the ordering of enum */

return last;
}

/* ---------------------------- Section matching ---------------------------- */

/* OBS: very similar to <pattern_t>: how to avoid repetition without */
/* implementing inheritance? -- Store the <start> and <end> members of */
/* <pattern_t? in a separate structure? */

typedef struct /* Text scanner: */
{ const char const * _; /* the text */
int len; /* length of text */
int pos; /* index of current character */
} text_t;

/* Match section <sec> against string <txt> at offset <ofs>. */
/* Returns 1 upon match and 0 otherwise. In case of a match, */
/* txt->pos points the the character following the matched substring, */
static char wcmatchsec( pattern_t * sec, text_t* txt, int ofs )
{ int i, secln, tpos;
char match, patc, txtc;

if( ofs < 0 ) return 0;

/* no match if section longer that text: */
secln = sec->end - sec->start + 1;
if( secln > txt->len - txt->pos - ofs ) return 0;

tpos = txt->pos + ofs;
match = 1;
for( i = sec->start; i <= sec->end; i++ )
{ txtc = txt->_[tpos];
patc = sec->_[ i ];
if( patc != '?' && txtc != patc )
{ match = 0;
break;
}
tpos++;
}
if( match ) txt->pos = tpos;
return match;
}

/* Find section <sec> in text <txt> starting from its current character. */
/* Returns 1 if found and 0 if not. */
static char wcfindsec( pattern_t * sec, text_t * txt )
{ int ofs, ofsmax;
char found;

found = 0;
ofsmax = txt->len + sec->start - sec->end - txt->pos;
for( ofs = 0; ofs <= ofsmax; ofs++ )
{ if( wcmatchsec( sec, txt, ofs ) )
{ found = 1;
break;
}
}
return found;
}

/* ----------------------- Wildcard matching function ----------------------- */

/* Match wildcard <pattern> against <text> */
char wcmatch /* returns 1 upon match and 0 upon mismatch */
( char const * pattern, /* the pattern, e.g. "*.txt" */
char const * text /* the string to match, e.g. "redame.txt" */
)
{ pattern_t pat;
text_t txt;
char match;
char last;
PatMatch end;
int ofs;

pat._ = pattern ; pat.pos = 0;
txt._ = text; txt.pos = 0; txt.len = strlen( text );

while( 1 == 1 )
{ last = wcnextsec( &pat, &end );
switch( end )
{ case EsBeg: match = wcmatchsec( &pat, &txt, 0 ); break;
case EsMid: match = wcfindsec ( &pat, &txt ); break;
case EsEnd:
ofs = txt.len - txt.pos - pat.end + pat.start - 1;
match = wcmatchsec( &pat, &txt, ofs );
break;
case EsExact:
match = wcmatchsec( &pat, &txt, 0 ) && pat.pos == txt.len;
break;
}
/* Terminate upon mismatch or when the pattern has been scanned: */
if( !match || last ) break;
}

return match;
}

Anton Shepelev

unread,
Aug 17, 2018, 3:24:51 PM8/17/18
to
Stefan Ram:

> I now believe that I might understand how the split-
> pattern algorithms work: It is never wrong to match the
> /first/ segment from the text that matches the current
> segment from the splitted pattern, because, doing this,
> one never obstructs possibilities to successfully match
> the following pattern segments. Therefore, in this case,
> no kind of backtracking is needed to try other segments
> than the first matching segment from a text.

Exactly.

--
() ascii ribbon campaign -- against html e-mail
/\ http://preview.tinyurl.com/qcy6mjc [archived]

Keith Thompson

unread,
Aug 17, 2018, 3:27:37 PM8/17/18
to
Rosario19 <R...@invalid.invalid> writes:
[...]
> #define F for
> #define R return
> #define P printf
> #define GG(a,b) if(a)goto b

No!

(Don't worry, I'm done.)

--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Anton Shepelev

unread,
Aug 17, 2018, 3:32:37 PM8/17/18
to
Stefan Ram:

> Can Nikita's code detect that *ccd matches abcccd?

No. (but it passes your tests). Well noticed.

Tim Rentsch

unread,
Aug 17, 2018, 3:39:15 PM8/17/18
to
Here is my first solution.


char
wcmatch( const char * const p, const char * const s ){
return
*p == 0 ? *s == 0 :
*s == 0 ? *p == '*' ? wcmatch( p+1, s ) : 0 :
*p == '?' ? wcmatch( p+1, s+1 ) :
*p == '*' ? wcmatch( p+1, s ) ? 1 : wcmatch( p, s+1 ) :
*p == *s ? wcmatch( p+1, s+1 ) :
/* otherwise */ 0;
}


All calls are tail-call optimized (gcc -O2) except for the one that
can't be since it isn't a tail call.

Horrible behavior on the large test patterns. :)

Anton Shepelev

unread,
Aug 17, 2018, 3:39:48 PM8/17/18
to
Rosario:

> this is the my, afther some test...

Your wcmatchR() works, although I don't understand how,
for the life of me. Try posting it to ioccc.org .

Anton Shepelev

unread,
Aug 17, 2018, 3:47:20 PM8/17/18
to
Tim Rentsch:

> Here is my first solution.
>
> char
> wcmatch( const char * const p, const char * const s ){
> return
> *p == 0 ? *s == 0 :
> *s == 0 ? *p == '*' ? wcmatch( p+1, s ) : 0 :
> *p == '?' ? wcmatch( p+1, s+1 ) :
> *p == '*' ? wcmatch( p+1, s ) ? 1 : wcmatch( p, s+1 ) :
> *p == *s ? wcmatch( p+1, s+1 ) :
> /* otherwise */ 0;
> }

An interesting abuse of the ternary operator.

> All calls are tail-call optimized (gcc -O2) except for the
> one that can't be since it isn't a tail call.

Can there be more than one tail call in a function?

Anton Shepelev

unread,
Aug 17, 2018, 3:50:45 PM8/17/18
to
Ben Bacarisse:

> Non exponential recursive versions are possible but they
> don't use this simple algorithm.

How?

fir

unread,
Aug 17, 2018, 4:02:29 PM8/17/18
to
W dniu piątek, 17 sierpnia 2018 21:39:15 UTC+2 użytkownik Tim Rentsch napisał:
> Here is my first solution.
>
>
> char
> wcmatch( const char * const p, const char * const s ){
> return
> *p == 0 ? *s == 0 :
> *s == 0 ? *p == '*' ? wcmatch( p+1, s ) : 0 :
> *p == '?' ? wcmatch( p+1, s+1 ) :
> *p == '*' ? wcmatch( p+1, s ) ? 1 : wcmatch( p, s+1 ) :
> *p == *s ? wcmatch( p+1, s+1 ) :
> /* otherwise */ 0;
> }
>
>
lets compare to

int match(char const *pat, char const *txt);

static int star(char const *pat, char const *txt)
{
return match(pat, txt) || (*txt && star(pat, txt + 1));
}

static int match(char const *pat, char const *txt)
{
return (*pat == 0 && *txt == 0)
|| (*pat == '*' && star(pat + 1, txt))
|| (*pat == '?' && *txt && match(pat + 1, txt + 1))
|| (*pat == *txt && match(pat + 1, txt + 1));
}

imo its about to equal (though i personally more prefer maybe ors and and than ?: here, but thats more stylistical i dont go into it)

as ike naar version was posted first in situation where its about equal i would stand my opinion, you may get second prize ;c in category of short forms (yawn)

fir

unread,
Aug 17, 2018, 4:40:21 PM8/17/18
to
W dniu piątek, 17 sierpnia 2018 03:01:15 UTC+2 użytkownik Ben Bacarisse napisał:
> Here is a version that tries to be more respectful of C's resource
> priorities. As such, it does not use recursion, and it uses malloc to
> allocate some temporary data and not, as I originally, planned, a VLA.
>
> The plan is a simple one that I am sure will be seen in other
> submissions. The pattern is split at the *s (well, again, as in my
> other versions, at sequences of *s) and the resulting "fixed width"
> pieces are matched in sequence.
>
> A couple of details... The match part that comes at the end always
> includes the null byte. It is matched like any other character and give
> a simply way to ensure that that part of the pattern matches only at the
> end of the subject string. Because there is no automatic way to
> similarly force a match at the start of a string, that piece has to be
> handled differently.
>
> A pattern with a * is always considered to have as least two fixed width
> parts. Thus "*" is treated as a zero-length match at the start and a
> 1-byte match of a null byte.
>
> In addition, space for an extra part is allocated to act as a sentinel
> so as to avoid having to keep a count of the number of parts. The parts
> are processed until one is found with a null pattern.
>
> This version was supposed to have a "compile" phase where the pattern
> could be prepared, so as to be applied to many subject strings. Since
> this design could make use of more sophisticated string matching
> algorithms like Boyer-Moore or KMP, it could well be worth having such a
> separate phase. Needless to say, that never happened, but I think that
> separating out finding the parts from matching them does add a little to
> the clarity.
>
> Finally, I acknowledge that I use a style the many here find too dense.
> Running the code though 'indent' will only help a little as I find
> myself drawn to writing things like
>
> (++part)->pattern = pat += strspn(pat, "*");
>
> All I can say in my defence is that I wrote the code for fun and I
> learned C in the era of while (*dst++ = *src++);.
>
>
> struct match_part {
> const char *pattern;
> ptrdiff_t length;
> };
>
> static bool match_char(char pattern, char text)
> {
> return pattern == text || pattern == '?' && text;
> }
>
> static bool matches_here(const struct match_part *part, const char *text)
> {
> for (ptrdiff_t i = 0; i < part->length; i++)
> if (!match_char(part->pattern[i], text[i]))
> return false;
> return true;
> }
>
> static ptrdiff_t next_match(const struct match_part *part, const char *text)
> {
> const char *start = text;
> do if (matches_here(part, start))
> return start - text;
> while (*start++);
> return -1;
> }
>
> static size_t count_parts(const char *pat)
> {
> size_t n = 1;
> while (*pat)
> if (*pat++ == '*') {
> n += 1;
> pat += strspn(pat, "*");
> }
> return n;
> }
>
> static struct match_part *find_parts(const char *pat)
> {
> struct match_part *parts = malloc((count_parts(pat) + 1) * sizeof *parts);
> if (parts) {
> struct match_part *part = parts;
> part->pattern = pat;
> while (*pat)
> if (*pat++ == '*') {
> part->length = pat - part->pattern - 1;
> (++part)->pattern = pat += strspn(pat, "*");
> }
> part->length = pat - part->pattern + 1;
> part[1] = (struct match_part){0, 0};
> }
> return parts;
> }
>
> int wcmatch(const char *pat, const char *text)
> {
> struct match_part *parts = find_parts(pat);
> ptrdiff_t d = -1;
> if (parts && matches_here(parts, text)) {
> const struct match_part *part = parts;
> d = 0;
> text += part->length;
> while ((++part)->pattern && (d = next_match(part, text)) >= 0)
> text += part->length + d;
> }
> free(parts);
> return d >= 0;
> }
>
yet again watching at this - this is interesting to me coz its the same what
i do in my code or about but with different names

those lib functions are practically the same but i get netter names - more general names are netter string_compare is better than 'matches_here' also find_string is better than next_match (or something like that, also split is
possily better than find_parts (a bit coz find_parts also sounds general, though ts not only find its more like collect and parts is a bit enigmmatic)

the body of wcmatch is also abit not so clear (btw rile of not using abbreviations is quite strong imo so that name wcmatch is just bad better would be to name it wildcard_match though according to rule of use verps it should be check_wildcard_match... i would only need to check in net what wildcard exactly mean (coz if it is sepcific it should be named specifically,

like check_doslike_wildcard_match(char* text, char* wildcard_pattern)

im not personally afraid of such long names only make short some names that are very often used but not such are called not so often (this as to names)

Ben Bacarisse

unread,
Aug 17, 2018, 5:14:44 PM8/17/18
to
Anton Shepelev <anto...@gmail.com> writes:

> Tim Rentsch:
<snip>
>> All calls are tail-call optimized (gcc -O2) except for the
>> one that can't be since it isn't a tail call.
>
> Can there be more than one tail call in a function?

Yes.

--
Ben.

Tim Rentsch

unread,
Aug 17, 2018, 7:35:28 PM8/17/18
to
fir <profes...@gmail.com> writes:

> W dniu pi?tek, 17 sierpnia 2018 21:39:15 UTC+2 u?ytkownik Tim Rentsch napisa?:
>
>> Here is my first solution.
>>
>>
>> char
>> wcmatch( const char * const p, const char * const s ){
>> return
>> *p == 0 ? *s == 0 :
>> *s == 0 ? *p == '*' ? wcmatch( p+1, s ) : 0 :
>> *p == '?' ? wcmatch( p+1, s+1 ) :
>> *p == '*' ? wcmatch( p+1, s ) ? 1 : wcmatch( p, s+1 ) :
>> *p == *s ? wcmatch( p+1, s+1 ) :
>> /* otherwise */ 0;
>> }
>
> lets compare to
>
> int match(char const *pat, char const *txt);
>
> static int star(char const *pat, char const *txt)
> {
> return match(pat, txt) || (*txt && star(pat, txt + 1));
> }
>
> static int match(char const *pat, char const *txt)
> {
> return (*pat == 0 && *txt == 0)
> || (*pat == '*' && star(pat + 1, txt))
> || (*pat == '?' && *txt && match(pat + 1, txt + 1))
> || (*pat == *txt && match(pat + 1, txt + 1));
> }
>
> imo its about to equal (though i personally more prefer maybe ors and
> and than ?: here, but thats more stylistical i dont go into it)

I used ?: rather than && or || so the recursive calls would be
tail-call optimized into branches. The gcc compiler (and probably
others also, but I'm sure about gcc) is smart enough to do TCO
with ?: but not with the && or || operators. No reason that has
to be true, but since it is I adjust my code accordingly.

> as ike naar version was posted first in situation where its about
> equal i would stand my opinion, you may get second prize ;c in
> category of short forms (yawn)

I can't tell you how thrilled I am to hear that.

By the way, don't assume posting first is the same as being
written first. My version was done more than six days ago, just
hours after the initial posting announcing the exercise. Not
that that should matter either, it was just the circumstances
of when I happened to see the posting.

Tim Rentsch

unread,
Aug 17, 2018, 7:50:02 PM8/17/18
to
Anton Shepelev <anto...@gmail.com> writes:

> Tim Rentsch:
>
>> Here is my first solution.
>>
>> char
>> wcmatch( const char * const p, const char * const s ){
>> return
>> *p == 0 ? *s == 0 :
>> *s == 0 ? *p == '*' ? wcmatch( p+1, s ) : 0 :
>> *p == '?' ? wcmatch( p+1, s+1 ) :
>> *p == '*' ? wcmatch( p+1, s ) ? 1 : wcmatch( p, s+1 ) :
>> *p == *s ? wcmatch( p+1, s+1 ) :
>> /* otherwise */ 0;
>> }
>
> An interesting abuse of the ternary operator.

The ?:'s in the "center" are there so that gcc will optimized
the tail calls into branches; gcc is smart enough to do TCO
for ?: but not for the && or || operators.

The outer structure of ?:'s is a common pattern in functional
programming, and behaves like 'cond' in Lisp. Once you get used
to it I think you'll find it has some nice properties. Besides,
wasn't part of the point of proposing the exercise to expand
people's horizons? You wouldn't want to see everyone just post
the same old tired control structures, would you?

>> All calls are tail-call optimized (gcc -O2) except for the
>> one that can't be since it isn't a tail call.
>
> Can there be more than one tail call in a function?

Sure. The function definition above has four tail calls, one on
each of the four lines that call wcmatch(). The first call to
wcmatch() on the line where p=='*' is not a tail call, but all
the others are.

fir

unread,
Aug 17, 2018, 7:54:37 PM8/17/18
to
well i dont treat this topic who is winning seriously... though obviously someone if wants do that for fun could
measure and compare that in some metrics
(like counting characyters (assuming id length is 1) counting operators not charactes, or counting those physical atoms i ddefined

even some may compare assembly length, someonline gcc compiler showed that etc

Tim Rentsch

unread,
Aug 17, 2018, 8:05:52 PM8/17/18
to
Anton Shepelev <anton.txt@g{oogle}mail.com> writes:

> Tim Rentsch:
>
>> [.. code summary/"mission" ..]
>
> That belongs to the obfuscated C contest:

I thought it would be interesting to try to write a solution with
only return statements, and no other kinds of statements or
declarations. This idea isn't just some capricious whim - it
relates to looking at C from a functional programming perspective.
For the sake of visual appearance I also decided to limit myself
to single-line return statements, with a strict limit of 79
columns. Once these restrictions are accepted, it's pretty much
inevitable that most of the names used will have to be short.

Not counting the short names though this is a very practical
algorithm. It's pretty fast, uses O(1) space, and all the
recursive calls are tail-call optimized so it can work on
arbitrarily large inputs. I ran test cases up to 50 billion
characters, and it handled them all with no problem. I know
the appearance gives it the look of an IOCCC entry, but don't let
that fool you. There is some interesting substance there for
people who take the time to work through it.

Tim Rentsch

unread,
Aug 17, 2018, 8:18:17 PM8/17/18
to
Anton Shepelev <anto...@gmail.com> writes:

> Ben Bacarisse:
>
>> Non exponential recursive versions are possible but they
>> don't use this simple algorithm.
>
> How?

The first solution I posted (ie, the longer one, with 20
single-line functions) uses recursion exclusively, and it
is non-exponential. Its basic pattern-match component (ie,
for sub-parts of the pattern that don't have *'s in them) is
quadratic in the length of the sub-pattern, like regular
string search, and the rest of the algorithm is all linear.

To be a little more specific, it uses a find-the-first-match-
for-the-next-sub-pattern, like the more elaborate solution
that Ben posted, but just does that recursively rather than
iteratively. And all the recursion happens through tail
calls that get turned into branches by TCO. In fact the
generated code has only two actual functions in it, one
that deals with the simple cases like no * or one *, and
a second for all the cases with more than one *, which
need to be done "iteratively" rather than just if()-like
control flow.

Tim Rentsch

unread,
Aug 17, 2018, 8:25:19 PM8/17/18
to
Malcolm McLean <malcolm.ar...@gmail.com> writes:

> On Friday, August 17, 2018 at 5:12:25 PM UTC+1, Malcolm McLean wrote:
>> Exactly, yes, so you've got to call the modified strstr (accommodating
>> the question mark) repeatedly. But it returns quickly - it doesn't
>> search the whole of the target string.
>
> However what about this pair?
>
> "a*a*a*a*a*a*a*a" "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
>
> are we still OK? If we retry the last match then we should be, but
> if we retry the first we'll have a cascade of retries as each match
> gets moved along one.

I can tell you that I ran my second solution on patterns
just like this, 50 billion characters long, and it completed
in about three minutes. And this isn't just an accident of
this particular pattern - the algorithm does things a piece
at a time, and once a piece is done, it's done: there is
never any "going back" that could cause the kind of explosive
slowdown alluded to above.

Tim Rentsch

unread,
Aug 17, 2018, 8:27:24 PM8/17/18
to
Ben Bacarisse <ben.u...@bsb.me.uk> writes:

> Anton Shepelev <anton.txt@g{oogle}mail.com> writes:
>
>> Ben Bacarisse:
>>
>>> Matching every non-* part at the first possible location
>>> will produce a linear-time algorithm.
>>
>> Indeed, and my solution has also special handling of the
>> last such section in order to avoid a full-string scan in
>> cases such as "*b" for "aaaaaaaaaaabbbbbbbbb".
>
> Yes, I considered checking last part early [...]

I did this in my second solution, because it helped simplify
the loop for all the "middle" pieces.

Ben Bacarisse

unread,
Aug 17, 2018, 8:34:13 PM8/17/18
to
Anton Shepelev <anto...@gmail.com> writes:

> Ben Bacarisse:
>
>> Non exponential recursive versions are possible but they
>> don't use this simple algorithm.
>
> How?

By implementing the split on *s algorithm recursively.

That answer makes me think we may be talking at cross-purposes because
you almost certainly now that any algorithm can be written recursively.

--
Ben.

Tim Rentsch

unread,
Aug 17, 2018, 8:53:02 PM8/17/18
to
Ben Bacarisse <ben.u...@bsb.me.uk> writes:

> Malcolm McLean <malcolm.ar...@gmail.com> writes:
>
>> By malicious input, I mean input contrived to cause worst-case
>> performance, not normal input you would expect from typical use.
>
> I thought that's what you meant. I asked because, since there is a
> liner algorithm[1], I could not contrive any cases where an O(N^2)
> solution would be useful. But maybe you wrote the above thinking
> that the only other algorithms were exponential.
>
> [1] Exactly what I means by this is a bit vague. There are at least
> two "sizes" to consider -- the pattern length, p, and the subject
> length s. I think that if you do the fixed-size matches with
> Boyer-Moore (+Galil) you get O(p+s) even for these * patterns.

I share your assessment. Also, as I just now discovered, since
Boyer-Moore there is now Two-way string matching, which makes it
more likely that the overall result is O(p+s). Two-way string
matching also looks a little better (per the information on the
wikipedia page) than Boyer-Moore in setup time and how much
space is needed.

https://en.wikipedia.org/wiki/String_searching_algorithm
http://www.quretec.com/u/vilo/edu/2002-03/Tekstialgoritmid_I/
Articles/Exact/Two-way-p650-crochemore.pdf

I don't know which approach is more amenable to dealing with ?
wild-character pattern characters. I don't know enough to say
for sure that the presence of ? pattern characters won't muck
up the work, but it seems pretty likely that they don't change
things in any significant way.

Ben Bacarisse

unread,
Aug 17, 2018, 9:44:11 PM8/17/18
to
Tim Rentsch <t...@alumni.caltech.edu> writes:

> Ben Bacarisse <ben.u...@bsb.me.uk> writes:
>
>> Malcolm McLean <malcolm.ar...@gmail.com> writes:
>>
>>> By malicious input, I mean input contrived to cause worst-case
>>> performance, not normal input you would expect from typical use.
>>
>> I thought that's what you meant. I asked because, since there is a
>> liner algorithm[1], I could not contrive any cases where an O(N^2)
>> solution would be useful. But maybe you wrote the above thinking
>> that the only other algorithms were exponential.
>>
>> [1] Exactly what I means by this is a bit vague. There are at least
>> two "sizes" to consider -- the pattern length, p, and the subject
>> length s. I think that if you do the fixed-size matches with
>> Boyer-Moore (+Galil) you get O(p+s) even for these * patterns.
>
> I share your assessment. Also, as I just now discovered, since
> Boyer-Moore there is now Two-way string matching, which makes it
> more likely that the overall result is O(p+s). Two-way string
> matching also looks a little better (per the information on the
> wikipedia page) than Boyer-Moore in setup time and how much
> space is needed.
>
> https://en.wikipedia.org/wiki/String_searching_algorithm
> http://www.quretec.com/u/vilo/edu/2002-03/Tekstialgoritmid_I/
> Articles/Exact/Two-way-p650-crochemore.pdf

Interesting, thanks.

> I don't know which approach is more amenable to dealing with ?
> wild-character pattern characters. I don't know enough to say
> for sure that the presence of ? pattern characters won't muck
> up the work, but it seems pretty likely that they don't change
> things in any significant way.

Nor I. But I would consider splitting the pattern on sequences made up
of *s and ?s so the parts that need to matched fast are all literal
strings. The number of ?s in the */? sequence being used to skip an
initial segment of the subject before looking for the next literal part.

It's possible that this makes things worse, but at least it would be a
way to proceed without trying to fix BM, KMP or this two-way algorithm
to accommodate ?s. Mind you, there must surely be literature about
matching with single-character wild patterns.

--
Ben.

Ben Bacarisse

unread,
Aug 17, 2018, 9:51:00 PM8/17/18
to
Tim Rentsch <t...@alumni.caltech.edu> writes:

> Anton Shepelev <anton.txt@g{oogle}mail.com> writes:
>
>> Tim Rentsch:
>>
>>> [.. code summary/"mission" ..]
>>
>> That belongs to the obfuscated C contest:
<snip>
> I know
> the appearance gives it the look of an IOCCC entry, but don't let
> that fool you.

To paraphrase Marx (Groucho, not Karl) "It may look like obfuscated
code, but don't let that fool you, it really is obfuscated code".

> There is some interesting substance there for
> people who take the time to work through it.

Just a data point: I stopped reading about three lines in when it became
clear it was just another of your cryptic "puzzle" posts.

--
Ben.

Scott

unread,
Aug 18, 2018, 4:50:30 AM8/18/18
to
So Rosario mentioned this pathological case:

"a*a*a*a*a*a*a*a*a*a*a*a*a*a*c"
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab"

(Or someone else did, Ros's was the first I saw, credit due etc)

...and I somehow started to get annoyed that my match function choked
on it. But then, so do almost everyone else's recursive solutions.

For fun, I ran this case (and the test suite) against BSD's
not-quite-POSIX fnmatch(3), a reasonably popular library function. I
watched fnmatch fail only one of the suite tests (all printable ASCII,
probably because of the '['). And, interestingly, it too choked on the
above pathological case. Brief inspection of the source reveals that
fnmatch, like many of us here in clc, also uses recursion.

So I guess a lot of us are in pretty good company after all.

I'll say also that it's been interesting seeing the variety of other
solutions posted here. There have been some approaches that I'd have
dismissed out of hand, or never even thought of at all. I don't know
if that's good or bad, though.

Rosario19

unread,
Aug 18, 2018, 6:23:42 AM8/18/18
to
the couple of patological string in few i know, appear at end not
match...
so one has to find one loop that eliminate each input certaly not
match (and this could be ok even for recursive ones)

the question it is right what i say?
exist one couple of patological strings that instead match?

Bart

unread,
Aug 18, 2018, 7:59:33 AM8/18/18
to
On 16/08/2018 19:56, Anton Shepelev wrote:

> Please, publish your entries as immediate replies to this
> message in order to facilitate cohesion.

Lots of recursive solutions so I've posted one that is non-recursive,
and is still fairly short (although it has more 'horizontal' code than
I'd normally write).

It was done this morning so there is some influence from the results
I've already glanced at (mainly, treating * and ? separately rather than
combining consecutive */? sequences as I did in my sprawling original
effort).

It passes Stefan Ram's test data.

And it is self-contained in one function and requires no C library
functions.

There are some gotos, but I'm unapologetic about those. (The code was
manually transcribed from my non-C language, and the original didn't
need gotos as it had better loop control statements.). There are
multiple return points too.


-----------------------------------

int wcmatch(char* p, char* t) {
char *q, *u;

while (1) {
switch (*p) {
case '*':
do ++p; while (*p == '*');

next2: while (2) {
q = p; u = t;
while (3) {
if (*q == '*') goto exit2;
if (*q == 0 && *u == 0) return 1;
if (*q == 0 && *u) {++t; goto next2;}
if (*u == 0) return 0;
if (*q == '?')
{++q; ++u;}
else
if (*q++ != *u++) {++t; goto next2;}
}
}
exit2:
break;

case '?':
if (*t++ == 0) return 0;
++p;
break;

case 0:
return *t == 0;

default:
if (*p++ != *t++) return 0;
}
}
}

-----------------------------------

Ben Bacarisse

unread,
Aug 18, 2018, 8:27:01 AM8/18/18
to
nob...@example.org (Scott) writes:

> So Rosario mentioned this pathological case:
>
> "a*a*a*a*a*a*a*a*a*a*a*a*a*a*c"
> "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab"
>
> (Or someone else did, Ros's was the first I saw, credit due etc)
>
> ...and I somehow started to get annoyed that my match function choked
> on it. But then, so do almost everyone else's recursive solutions.

Yes. The trouble is that these solutions are trying to find every
match. With a*a*c matching against aaaab they try

aaaab
aa--c
a-a-c
a--ac

before they declare failure. If the exercise had been to count the
number of ways a pattern can match a string, these recursive solutions
would be the obvious candidates.

This pattern of searching is useful when the pattern language is a
full-blown regular expression language, so most RE engines have
pathological cases that you just have to live with, though the good ones
have a large number of tricks to avoid trouble.

However any algorithm can be written recursively, so there is nothing
inherently expensive about it. It's not the recursion that's causing
the problem, it's the pattern of recursion.

Maybe the two classes of algorithm seen here should be called "match
all" and "match first".

> For fun, I ran this case (and the test suite) against BSD's
> not-quite-POSIX fnmatch(3), a reasonably popular library function. I
> watched fnmatch fail only one of the suite tests (all printable ASCII,
> probably because of the '['). And, interestingly, it too choked on the
> above pathological case. Brief inspection of the source reveals that
> fnmatch, like many of us here in clc, also uses recursion.

Or try your shell. zsh starts to consume absurd quantities of memory
when given ls a*a*a*a*a*a*a*a*a*a*a*a*a*a*c if there is a file with the
"bad" name in the directory. grep, however (at least on my system),
happily handles the equivalent "a.*a.* ... c" pattern.

(By the way, ls a*a*a*a*a*a*a*c is probably enough to see a noticeable
delay. No need to tie-up your system with the longer pattern.)

--
Ben.

Bart

unread,
Aug 18, 2018, 8:27:24 AM8/18/18
to
On 18/08/2018 12:59, Bart wrote:
> On 16/08/2018 19:56, Anton Shepelev wrote:
>
>> Please, publish your entries as immediate replies to this
>> message in order to facilitate cohesion.
>
> Lots of recursive solutions so I've posted one that is non-recursive,
> and is still fairly short (although it has more 'horizontal' code than
> I'd normally write).

Regarding metrics, then Ike Naar's 4-line solution is 251 characters,
while mine can be reduced to 389 characters by removing extraneous
spaces, and renaming the labels, but still retaining most of the lines.

Then it is only 55% bigger. (This is assuming lf line-endings; on
Windows they will be cr-lf and the difference is 64%.) Effectively it
could be squeezed into 6 lines.

It wasn't however specifically written to be all that small; only much
more compact than my original.

---------------------------------------

int wcmatch(char*p,char*t){
char*q,*u;
while(1){
switch(*p){
case'*':
do++p;while(*p=='*');
n:while(2){
q=p;u=t;
while(3){
if(*q=='*')goto e;
if(*q==0 && *u==0)return 1;
if(*q==0 && *u) {++t;goto n;}
if(*u==0)return 0;
if(*q=='?')
{++q;++u;}
else
if(*q++!=*u++){++t;goto n;}
}}
e:
break;
case'?':
if(*t++==0)return 0;
++p;
break;
case 0:
return*t==0;
default:
if(*p++!=*t++) return 0;
}}}

---------------------------------------

--
bart

Bart

unread,
Aug 18, 2018, 8:34:53 AM8/18/18
to
Huh. The more discerning might see some spaces still in, that don't need
to be there (I thought I'd searched for them all).

Still, that only makes it marginally smaller compared with Ike's. Just
edging towards 50% bigger, if a couple more lines are lost.

--
bart

Ben Bacarisse

unread,
Aug 18, 2018, 8:58:52 AM8/18/18
to
Bart <b...@freeuk.com> writes:
<snip>
> Huh. The more discerning might see some spaces still in, that don't
> need to be there (I thought I'd searched for them all).
>
> Still, that only makes it marginally smaller compared with Ike's. Just
> edging towards 50% bigger, if a couple more lines are lost.

As I said before, I don't think counting lines or characters is a very
useful measure. For example, in this cost measure you are paying a
price for using a switch because of the "break;"s (and "return <exp>;",
the other usual to get out, is worse). And you also pay a price for
using goto. The label next2 can go if you replace the two gotos with
breaks.

--
Ben.

Bart

unread,
Aug 18, 2018, 10:03:42 AM8/18/18
to
On 18/08/2018 13:58, Ben Bacarisse wrote:
> Bart <b...@freeuk.com> writes:
> <snip>
>> Huh. The more discerning might see some spaces still in, that don't
>> need to be there (I thought I'd searched for them all).
>>
>> Still, that only makes it marginally smaller compared with Ike's. Just
>> edging towards 50% bigger, if a couple more lines are lost.
>
> As I said before, I don't think counting lines or characters is a very
> useful measure.

Neither did I. But some people seem to think so. And I noticed my
version wasn't that much more complex that what seemed to be considered
the shortest solution, usually associated with recursive methods.

Although if you apply the same 'compression' to some other answers,
there's not much between them either.

For example, in this cost measure you are paying a
> price for using a switch because of the "break;"s (and "return <exp>;",
> the other usual to get out, is worse). And you also pay a price for
> using goto. The label next2 can go if you replace the two gotos with
> breaks.

Those are multi-level breaks.


--
bart

fir

unread,
Aug 18, 2018, 11:21:03 AM8/18/18
to
W dniu piątek, 17 sierpnia 2018 16:41:20 UTC+2 użytkownik fir napisał:
>
> Chunks2 split_after_char( Chunk2 text, char split)
> {
> static Chunks2 table = {0};
>
> Chunk2 word = {text.start, 0};
>
> for(;text.length--; text.start++)
> {
> word.length++;
>
> if(*text.start==split)
> {
> table = AddChunk(table, word);
> word = {text.start+1, 0};
> }
> }
>
> if(word.length>0)
> table = AddChunk(table, word);
>
> return table ;
>
> }

yet getting back to it

it is not so long and works but it is still headwrapping/(confusing)(or how to call it, i mean state when it is not so easy to clerly think/work on it - wery wery unpleasant state ;c)

hovever i later get thinked that writing
the logic in loop like here may be some reason and if not to try to code utside the loop instead of outside

this was leading to something ike that


chunks SplitAfterCharacter_New(chunk txt, char character)
{
static chunks table = {0};
////////////////////////////////////////
for(;;)
{
chunk left_part = GiveLeftSubchunk_TillCharacter_Inclusive(txt, character);

table = Add_Chunk(table, left_part);

if(left_part.end==txt.end) break;

// if(left_part==txt) break; //not compiles

txt = {left_part.end + 1, txt.end};
}


////////////////////////////////////
return table ;
}



and this is imo far more ok,

this code just gets left part od subchunk (until character met, inclusively) [i should find maybe shorter namie like BiteLeft() ]
then adds it to container, repeats it and exits if this left part is last one that can be found

at least it is better, coz this split in those previous form OMG,

Ben Bacarisse

unread,
Aug 18, 2018, 11:29:50 AM8/18/18
to
I know. I said that one of your labels can be removed (next2) and the
corresponding gotos replaced by breaks. I said that knowing what the
code does. While the gotos currently exit two loops it is only to
restart the outer one with no change in state. If you move the label to
just before the } of while (2), you will see that the gotos are really
just breaks out of the inner loop.

--
Ben.

Richard Damon

unread,
Aug 18, 2018, 3:45:44 PM8/18/18
to
On 8/17/18 9:39 AM, Joe Pfeiffer wrote:
> Anton Shepelev <anton.txt@g{oogle}mail.com> writes:
>
>> I wrote:
>>
>>> In mine, the recursive call is generous (skips the
>>> asterisk) and the iterative continuation is greedy (eats a
>>> symbol from the string). After I came up with a working
>>> function, I have tried to make it as small as possible,
>>> whence the tricky use ++ and --. The result is quite
>>> different from, and I belive better than, the original.
>>>
>>> I have not tested this conjecture, but meseems this
>>> function will perform the worst when there are many
>>> asterisks but few symbols for them to match, because then
>>> the recursion will have to traverse the largest tree until
>>> before reaching a correct fit.
>>
>> It should be the other way round: few astersks with many
>> symbols to swallow. I am at work now an shan't be able to
>> most my own solution until the evening.
>
> My intuition is that given any solution, a pattern-text pair can be
> constructed that will drive the solution into exploring the whole space
> of possible matches and hence will be exponential. What that pair looks
> like will depend on the solution you're trying to drive into poor
> behavior.
>

I first thought the same thing, but now I believe that if you divide the
search string into the pieces separated by *, the each piece only needs
to be found at its first possible location, the key factor is that the
first string is anchored at the beginning of the string and the last
piece anchored at the end of the string (which can be done by including
the trailing nul as part of it). There is no case where any other than
the first match for a whole * delimited piece will be needed. There may
be alternate match that use later matches, but their existance does not
affect the output of the match function.

A match function that uses this feature will probably have a lower max
'O' complexity level. I suspect that many of the routines that didn't
take this into account may find matches in a similar time, but may take
a LOT longer to return a No Match result, as they will try a lot of
combinations that can't actually help.

Bart

unread,
Aug 18, 2018, 4:40:37 PM8/18/18
to
OK, I see what you mean. If I make that mod, and change to an if-else
chain to see what that looks like, I get the new version below.

(I noticed a small bug: although this doesn't stop it working properly,
it means that if it just successfully matched the "abc" in "*abc*" with
"aaabc...", then p still points to "abc*", and t still points to
"abc...", so this part is checked again in the main loop.

The assignments next to exit2 take care of that, and makes it marginally
faster.)


-----------------------------------------

int wcmatch(char* p, char* t) {
char *q, *u;

while (1) {
if (*p == '*') {
do ++p; while (*p == '*');

while (2) {
q = p; u = t;
while (3) {
if (*q == '*') goto exit2;
if (*q == 0 && *u == 0) return 1;
if (*q == 0 && *u) {++t; break;}
if (*u == 0) return 0;
if (*q == '?')
{++q; ++u;}
else
if (*q++ != *u++) {++t; break;}
}
}
exit2: p=q; t=u;
}
else if (*p=='?') {
if (*t++ == 0) return 0;
++p;
}
else if (*p==0) return *t == 0;
else if (*p++ != *t++) return 0;
}
}

--
bart

Bart

unread,
Aug 18, 2018, 5:00:14 PM8/18/18
to
On 17/08/2018 08:27, Ike Naar wrote:
> On 2018-08-16, Anton Shepelev <anto...@gmail.com> wrote:
>> Hello, ladies and laddies!
>>
>> I hereby commence this new thread for the submission and
>> discussinon of entries to our little wildcard-matching
>> exercise announced in this post:
>>
>> Subject: A small programming exercise: wildcard matching
>> Date: Sat, 11 Aug 2018 15:49:53 +0300
>> Message-ID: <20180811154953.ac18...@gmail.com>
>>
>> Please, publish your entries as immediate replies to this
>> message in order to facilitate cohesion.
>
> The short version (as promised)
>
> int match(char const*,char const*);
> int star(char const*p,char const*t){return match(p,t)||(*t&&star(p,t+1));}
> int match(char const*p,char const*t){return(*p==0&&*t==0)||(*p=='*'&&
> star(p,t))||(*p=='?'&&*t&&match(p+1,t+1))||(*p==*t&&match(p+1,t+1));}

The start of that last line should be star(p+1,t) to match your longer code.

Note that this uses 'match' rather the 'wcmatch' that everyone else
does; in this program, that would add 8 more characters. With the "+1"
that was missing, it appears to be 10 characters shorter that it ought
to be.

Offsetting that, however, is the fact that you can get rid of all those
consts, saving 36 characters, and it would still work perfectly well**.
So a net reduction in size of 10%.

(**Or as well as it can with the longer tests, where I assumes it runs
out of stack or something.)

--
bart
It is loading more messages.
0 new messages