Currently I just walk thru the list comparing 'by hand' - just
wondered if there was a better way.
Best regards
Helmut Giese
Helmut Giese wrote:
>
> Hello out there,
> is it somehow possible to do [lsearch] ignoring case?
You can use the -regexp option and
precede the pattern with (?i)
lsearch [string tolower $a] [string tolower $b]
maybe?
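Putting the two suggestions together, a minimal sketch (the list contents here are made up for illustration):

```tcl
set colors {Red GREEN blue}

# Suggestion 1: -regexp with the embedded (?i) case-insensitivity flag
puts [lsearch -regexp $colors {(?i)^green$}]    ;# -> 1

# Suggestion 2: lowercase both the list and the pattern
puts [lsearch [string tolower $colors] blue]    ;# -> 2
```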
Great! Works like a charm. Thanks
Helmut Giese
lsearch -option [string tolower $list] [string tolower $seeker]
It may be a little slower than a -yescase search because of the list representation
shimmering, but it should give the same result.
--
Derk Gwen http://derkgwen.250free.com/html/index.html
Where do you get those wonderful toys?
Yeah, why didn't I think of this old trick?
Mumble, mumble ... lists are large - would make a 'lower case copy' of
the original for [lsearch]ing and use the original for generating
messages to the user. Done this stuff lots of times.
Thanks for reminding me :)
Helmut Giese
PS: Roy's suggestion is, however, more pleasing on the intellectual
level (look Ma, used an RE with lsearch) - will have to check
performance on *really* large lists.
>hgi...@ratiosoft.com (Helmut Giese) wrote:
># Hello out there,
># is it somehow possible to do [lsearch] ignoring case?
>
>lsearch -option [string tolower $list] [string tolower $seeker]
>
>It may be a little slower than a -yescase search because of the list representation
>shimmering, but it should give the same result.
Thanks.
Would you care to elaborate on 'shimmering'?
Best regards
Helmut Giese
Of course, that means that the strings have to be handled at least 3
times. Could not lsearch be written in such a way that they are handled
only once?
--
Join us at the Tenth Annual Tcl/Tk Conference <URL: http://mini.net/tcl/6274 >
Even if explicitly stated to the contrary, nothing in this posting
should be construed as representing my employer's opinions.
<URL: mailto:lvi...@yahoo.com > <URL: http://www.purl.org/NET/lvirden/ >
It sure is possible, and not extremely hard. Just about 20
strcmp/memcmp/DictionaryCompare calls in Tcl_LsearchObjCmd() would
have to be changed, plus the extra option added.
Michael Schlenker
With small list items (one character each; 1,000,000 items in the list)
the lsearch -regexp approach is much faster than the [string tolower]
approach (about a factor of 40000 faster with regexp). This may be
different with larger items.
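A rough way to check this kind of claim yourself is [time]; the sketch below uses a smaller, made-up list, and the absolute numbers will vary by machine and Tcl version. Note that part of the -regexp advantage is early termination: it can stop at the first match, while the tolower variant lowercases the whole list on every call.

```tcl
# Build a large list of one-character items
set biglist {}
for {set i 0} {$i < 100000} {incr i} {
    lappend biglist [format %c [expr {97 + ($i % 26)}]]
}

# -regexp keeps the list's internal representation and stops at the first hit
puts [time {lsearch -regexp $biglist {(?i)^Z$}} 10]

# tolower must produce a lowered copy of the whole list, which lsearch
# then has to re-parse as a list, on every call
puts [time {lsearch [string tolower $biglist] z} 10]
```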
Michael Schlenker
Hello,
this means ...
1. the string representation of the list will be used by the "string
tolower" command.
2. the "string tolower" command returns a string object (a "lowered"
copy of the list, without the underlying Tcl list data structure).
3. lsearch then has to convert this string object back into a list
object (with all the underlying Tcl list data structure).
4. after the successful conversion into a list object the string
object doesn't exist anymore.
5. at the end of lsearch this "temporary" list object will be deleted.
If the list is very large (in memory), then it could be better to
use -regexp, as mentioned in the first reply!
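On modern Tcl (8.6 and later) the unsupported diagnostic command tcl::unsupported::representation can make these steps visible; a sketch (the command's output format is not guaranteed across versions):

```tcl
# Requires Tcl 8.6+; tcl::unsupported::representation is diagnostic only.
set lst [list a b c]
puts [tcl::unsupported::representation $lst]       ;# has a list internal rep

set lowered [string tolower $lst]                  ;# step 2: string result
puts [tcl::unsupported::representation $lowered]   ;# list rep is gone

lsearch $lowered b                                 ;# step 3: re-parse as a list
puts [tcl::unsupported::representation $lowered]   ;# list rep again
```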
Greetings,
Martin Lemburg
>> Would you care to elaborate on 'shimmering'?
>> Best regards
>> Helmut Giese
>
>Hello,
>
>this means ...
>
>1. the string representation of the list will be used by the "string
>tolower" command.
>
>2. the "string tolower" command returns a string object (a "lowered"
>copy of the list, without the underlying Tcl list data structure).
>
>3. lsearch then has to convert this string object back into a list
>object (with all the underlying Tcl list data structure).
>
>4. after the successful conversion into a list object the string
>object doesn't exist anymore.
>
>5. at the end of lsearch this "temporary" list object will be deleted.
>
>If the list is very large (in memory), then it could be better to
>use -regexp, as mentioned in the first reply!
Hello Martin,
thanks for the explanation. Now - as often in the past - I wonder
about the term 'shimmering'. Is it just a synonym for 'conversion' or
does it carry more meaning?
Helmut Giese
Shimmer: a wavering, sometimes distorted visual image usually resulting
from heat-induced changes in atmospheric refraction.
So when people say the type is shimmering, it means it's switching back
and forth between types several times, which is inefficient.
Ok, I got it now. Thanks
Helmut Giese
Background:
I have a program that generates a list of regular expressions to match
text files and as part of that process it checks to see if it already
has generated a particular regular expression...
Question:
I presume that the -regexp with (?i) technique is meant for use with
normal characters, NOT when you want to search for the existence of a
regular expression itself? In this case I would use (?iq), I presume?
Example:
set alist "george fred"
set apattern ".*"
# lsearch returns -1 as no match
lsearch [string toupper $alist] [string toupper $apattern]
set ipattern "(?i)$apattern"
# lsearch returns 0 as a regexp match NOT a string match
lsearch -regexp $alist $ipattern
set iqpattern "(?iq)$apattern"
# lsearch now returns -1 as it correctly does a literal match
lsearch -regexp $alist $iqpattern
In general it simply is _wrong_ to assume that:
regexp [string tolower $exp] [string tolower $string]
<=>
regexp $exp $string
So be prepared for trouble if you go this route.
Michael Schlenker
What is it about searching for the existence of a regular
expression? This is not a concept that is present in Tcl
code. I suspect there is a simplification waiting to
be brought to your code in which you find out if an RE
has been built by doing [info exists] or some sort of
array lookup. Using an RE to test what your program did
or did not already do sounds a bit off.
Care to clarify?
Cheers,
Roy
Michael Schlenker wrote:
>
> John Taylor wrote:
> > Roy Terry <royt...@earthlink.net> wrote in message news:<3E8B04BF...@earthlink.net>...
> >
> >>Helmut Giese wrote:
> >>
> >>>Hello out there,
> >>>is it somehow possible to do [lsearch] ignoring case?
> >>
> >>You can use the -regexp option and
> >>precede the pattern with (?i)
> >
> >
> > Background:
> > I have a program that generates a list of regular expressions to match
> > text files and as part of that process it checks to see if it already
> > has generated a particular regular expression...
> >
> > Question:
> > I presume that the -regexp with (?i) technique is meant for use with
> > normal characters, NOT when you want to search for the existence of a
> > regular expression itself? In this case I would use (?iq), I presume?
> Yes. If you use the regexp trick you have to use the (?iq) switch, but
> this may bite you quite badly if you use any of the shorthand escapes in
> regular expressions that have quite a different meaning in lowercase
> (think about \s+ and \S+ or \m and \M).
Are you talking about analyzing an re by running another re over it?
>
> In general it simply is _wrong_ to assume that:
>
> regexp [string tolower $exp] [string tolower $string]
> <=>
> regexp $exp $string
Do you have evidence that this is so? I would consider
it a bug so please post it.
>
> Michael Schlenker
It's not a bug:
% set exp {\Sa\S+sentence\S+with\S+whitespace}
\Sa\S+sentence\S+with\S+whitespace
% set string { a sentence with whitespace}
a sentence with whitespace
% regexp $exp $string
0
% regexp [string tolower $exp] [string tolower $string]
1
Michael Schlenker
Roy,
I'll try to clarify what I'm doing and apologise in advance that it's a
bit long. I am a Database Administrator and therefore need to check
database and other "log" files. I don't want to do this manually. I
observe that I could "grep" for known errors etc., but I would need a
very good list of error message types, and I observe that when things
go wrong, non-documented forms of errors are generated.
So my approach is to get a log (with not many errors in it) and get my
program to automatically generate a set of regular expressions (REs)
that match the log's lines. Then when I want to check a new log, I
output the log file lines which don't match any of the REs (i.e. the
interesting lines, including errors). As part of this technique I wish
to optionally adjust the automatically generated REs by including
user-generated REs, by excluding some, etc. I therefore need to check
that the user is not duplicating existing REs, etc. So at times the
program searches the generated REs to check the consistency of the user
requests. I hold the REs as a list, which I admit may NOT be the best
way to store them, but the program evolved from test code to check if
the technique works in the real world and so has the scars and bruises
of experimentation. Maybe I should hold them as an array (although
searching for a RE might well still involve issues of case and the
need to turn off the function of meta characters to achieve literal
searches). However, they are currently in a list, and I case fold the
non-RE characters so that they are held in a canonical form to make it
easier to check inconsistent user RE manipulations. I don't case fold
all expressions, to allow the user to have case-sensitive REs for
completeness.
Please understand that the basis of the program was in the Tcl 7.5 era,
when the REs had less power, and I'm changing it to Tcl 8.4.1+. As
an aside, I therefore think I will think more about the (?i) prefix in
the REs rather than using regexp -nocase.
Anyway until I change things, I need a method of case insensitive
searching of a list. I understand that my current method of using
lsearch with string toupper works but involves "shimmering" so was
interested when I saw the lsearch -regexp with (?i) RE prefix
technique. However it was apparent to me that the pattern in this new
efficient technique was no longer a string but a RE and since my
patterns are REs and I want them to be taken just as literal strings,
I think that this means using (?iq) RE prefix.
I hope this explains somewhat why one might want to auto generate REs
rather than handcraft them and also why you might want to semi
literally search a data structure containing REs.
I remain open to any suggestions, whether it be overall approach or
Tcl specific e.g.s neural network to check logs, how to search a Tcl
data structure that contains REs, how in Tcl to check the validity of
user supplied REs and RE auto generation ideas including efficiency
issues.
Also I don't really understand the "[info exists]" suggestion in regard
to RE existence checking?! Can you please explain?! (NB my program's
RE generation phase does not necessarily involve passing them to the
Tcl interpreter so I don't see how the interpreter would know about
them until they are actually used.)
Can you also elaborate on an array based RE data structure (e.g. would
you use the RE as the key, with say a dummy value?).
To clarify the example pattern: it was simplistic, in that (?i) with
a RE, say one involving ".*", is NOT important; however, in a broader
context, when combined with alphabetic characters, it is.
Thanks for the help so far!
Regards John Taylor
I gather then that the log files are quite heterogeneous. It also seems
to follow that you are really detecting just "new looking" lines and can
only identify them as errors by human inspection.
>
> So my approach is to get a log (with not many errors in it) and get my
> program to automatically generate a set of regular expressions (REs)
> that match the log's lines. Then when I want to check a new log, I
> output log file lines which don't match any of the REs (i.e.
> interesting lines including errors). As part of this technique I wish
> to optionally adjust the automatically generated REs by including user
> generated REs and by excluding some etc.
You mean users will write RE's by hand to be mixed with auto-generated?
And the program will exclude some of them (either user generated or
program generated)? It's not clear from what you write.
> I therefore need to check
> that the user is not duplicating existing REs etc.
I don't buy the "therefore" assertion, but obviously this
is important to your goals so don't bother about that. Anyhow, what
seems still curious is the notion of comparing RE's in order to
reject "duplicates". Seems to me that for some lines *many* different
looking RE's will produce hits and there is no practical way to detect
this. So all the trouble you're going to with case-less matching appears
to address only some of the duplicates. In other words, seems like
a lot of concern to build an umbrella with a large hole in it. :*)
This probably means I just still don't understand and that's OK.
Never mind what I said about [info exists], etc which doesn't apply.
Cheers and good luck.
Roy
> However they are currently in a list and I case fold the
> non RE characters so that they are held in a canonical form to make it
> easier to check inconsistent user RE manipulations. I don't case fold
> all expressions to allow the user to have case sensitive REs for
> completeness.
I don't understand this paragraph. Do you state here that you have a
_correct_ proc for case folding a regexp, or that you just use
[string tolower]?
>
> Anyway until I change things, I need a method of case insensitive
> searching of a list. I understand that my current method of using
> lsearch with string toupper works but involves "shimmering" so was
> interested when I saw the lsearch -regexp with (?i) RE prefix
> technique. However it was apparent to me that the pattern in this new
> efficient technique was no longer a string but a RE and since my
> patterns are REs and I want them to be taken just as literal strings,
> I think that this means using (?iq) RE prefix.
>
> I hope this explains somewhat why one might want to auto generate REs
> rather than handcraft them and also why you might want to semi
> literally search a data structure containing REs.
>
> I remain open to any suggestions, whether it be overall approach or
> Tcl specific e.g.s neural network to check logs, how to search a Tcl
> data structure that contains REs, how in Tcl to check the validity of
> user supplied REs and RE auto generation ideas including efficiency
> issues.
Some simple validation proc to check if it compiles OK:
proc checkregexp {exp} {
    # "--" ends option parsing, guarding against patterns that start with "-"
    if {[catch {regexp -- $exp "dummy"} msg]} {
        return 0
    } else {
        return 1
    }
}
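A quick sanity check of the idea, with the proc repeated so the snippet runs standalone (the invalid pattern is made up):

```tcl
proc checkregexp {exp} {
    # 1 if the pattern compiles, 0 if [regexp] raises a compile error;
    # "--" guards against patterns beginning with "-"
    if {[catch {regexp -- $exp "dummy"}]} {
        return 0
    }
    return 1
}

puts [checkregexp {^[0-9]+$}]    ;# -> 1
puts [checkregexp {[a-z}]        ;# -> 0 (brackets not balanced)
```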
Hmm, your description sounds a bit overly complex for the task at hand,
but I may be missing something.
Some questions:
How important is maximum performance for your use case?
As I see it your program does log parsing, maybe run by cron, but no
user will add his private regexps at runtime of the program, so this can
be factored out easily.
I assume you have all regexps (generated and user added) in a file each.
I would do it like this:
1. Check if the auto-generated regexp file and the user file have changed.
2. If no change go to 5.
3. Validate all supplied regexps.
4. Create a cache of the valid regexps in a file.
5. Load the cache.
6. Match the log file.
7. Present output.
Step 3, which involves validating the regexps, is only reached if changes
happen, which is probably far less common than the unchanged case.
Validation:
You seem to have some rules for when a user-added regexp is valid to add
(autogenerated ones should always be valid, otherwise the generator seems
broken to me):
(1) no exact duplicates
(2) no duplicates after ignoring case
(3) some other requirements
(1) is trivial with either lsearch or an array and using [info exists
array($key)].
(2) is harder because two problems appear:
a) what is a correct lowercase version of a regexp? (As you cannot
use [string tolower] directly, you have to write a [regexptolower] proc
that has to identify the literals in a regexp.)
b) there seem to be exceptions to the rule that regexps that only
differ in the case of their literals are considered equal.
Michael
> Some simple validation proc to check if it compiles ok:
>
> proc checkregexp {exp} {
>     if {[catch {regexp $exp "dummy"} msg]} {
>         return 0
>     } else {
>         return 1
>     }
> }
>
My program has two fundamental modes, a RE generation mode and a RE
usage mode.
Currently I "catch" user-generated REs at the usage stage; it would be
better to catch them in the generation mode.
So a RE question: does Tcl guarantee, now and in the future, that all RE
syntax faults can be caught regardless of the string, such that, as in
your technique, one simple, single presentation is sufficient?
I wonder whether it only checks what it needs of the RE OR
fully parses/compiles it, such that your checking technique works.
Further, you give the string as "dummy"; what is the most efficient
string that works - is it ""?
> Some questions:
> How important is maximum performance for your use case?
Much more important than the generation phase as it is run regularly
and the RE generation phase is run very rarely.
> As i see it your program does log parsing, maybe run by cron, but no
> user will add his private regexps at runtime of the program, so this can
> be factored out easily.
In general this is TRUE, BUT the program in generation phase simply
outputs an editable text file of REs; since these can/could be
modified, it still does some checking when they are read back in in the
normal usage phase to protect the user and itself from human errors.
Further, when generating REs I run the
generation and usage phases interactively to allow the application of
human intelligence to RE production, and therefore need to check for
accidental human-generated RE mistakes, including direct tweaks to the
auto-generated REs.
> I assume you have all regexp (generated and user added) in a file each.
What happens is a training file is input that the program generates
REs for. However optionally a user supplied file of inclusion REs and
a user supplied file of exclusion REs is permitted. Then out the end
comes a single file consisting of user supplied REs plus system
generated REs less any REs the user said they didn't want.
I am not aware that usage-mode runtime RE validation, done on RE
"cache" file load, is at all significant for performance.
The thing I don't understand is how to semantically compare two REs
for functional equivalence. A very simple example: how does a
program know that ^[0-9][0-9]$ is functionally equivalent to
^[0-9]{2,2}$ ?
NB my program does NOT accidentally generate semantically equivalent REs
to user-supplied ones BUT is unable to determine whether two
user-supplied REs are semantically equivalent; it is only able to
syntactically/literally determine equivalence of user-supplied REs.
So question: how does one determine whether two REs are semantically
equivalent programmatically, especially in Tcl?
Finally now fully back on topic:
I do lsearch -exact [string toupper $alist] [string toupper $apattern]
in my program. I don't think the suggestion of lsearch -regexp with
(?i), or my idea of (?iq), is fully functionally equivalent, especially
when one wants to check for the existence of REs case-insensitively, so I
suppose I'll stick to the above "shimmering" one?! Interesting idea
though!
Thanks!
John Taylor
> > I therefore need to check
> > that the user is not duplicating existing REs etc.
> I don't buy the "therefore" assertion, but obviously this
> is important to your goals so don't bother about that. Anyhow, what
> seems still curious is the notion of comparing RE's in order to
> reject "duplicates". Seems to me that for some lines *many* different
> looking RE's will produce hits and there is no practical way to detect
> this. So all the trouble you're going to with case-less matching appears
> to address only some of the duplicates. In other words, seems like
> a lot of concern to build an umbrella with a large hole in it. :*)
> This probably means I just still don't understand and that's OK.
> Never mind what I said about [info exists], etc which doesn't apply.
>
> Cheers and good luck.
> Roy
...
I think it's best to catch as many user errors as is easy. I have
asked Michael Schlenker, when I replied to him, whether there is a
method of determining whether two regular expressions are semantically
equivalent. I gather you're saying that there is not a known method?!
NB I don't accidentally generate REs semantically equivalent to user
ones; I just don't know how to check if the user has supplied two or
more semantically equivalent REs (silly user).
Thanks
It cannot give any guarantees; you could read the regexp part of the
core yourself to find out as much detail as you want, but the regexp
part is rather complex.
The regexp is compiled and then executed, so any compile-time syntax
errors should show themselves with this proc.
I do not know what the most efficient dummy string is; mostly it doesn't
matter much. (If I wanted a maximum-performance regexp syntax checker I
would try to extract the part from the Tcl regexp code and create an
appropriate C-coded extension.)
>>Some questions:
>>How important is maximum performance for your use case?
>
> Much more important than the generation phase as it is run regularly
> and the RE generation phase is run very rarely.
So check as much as possible in the generation phase. Don't bother with
too many checks in the run phase, if they can safely be done at
generation time.
>
>>As i see it your program does log parsing, maybe run by cron, but no
>>user will add his private regexps at runtime of the program, so this can
>>be factored out easily.
>
> In general this is TRUE BUT the program in generation phase simply
> outputs an editable text file of REs, since these can/could be
> modified it still does some checking when they are read back in in the
> normal usage phase to protect the user and itself from human errors.
Basically some MD5 or CRC32 checksum of your text file should be enough
to detect editing after generation.
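A sketch of that idea using tcllib's md5 package (assuming tcllib is available; the file name and the surrounding procs are made up):

```tcl
package require md5

# Compute a checksum for a regexp file; store it next to the cache and
# compare on the next run to detect hand edits.
proc fileChecksum {path} {
    set f [open $path r]
    fconfigure $f -translation binary   ;# so line endings don't change the sum
    set data [read $f]
    close $f
    return [md5::md5 -hex $data]
}

# e.g. rebuild the regexp cache only when the file was edited:
# if {[fileChecksum regexps.txt] ne $storedChecksum} { rebuildCache }
```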
>>I would do it like this:
>>1. Check if the auto-generated regexp file and the user file have changed.
>>2. If no change go to 5.
>>3. Validate all supplied regexps.
>>4. Create a cache of the valid regexps in a file.
>>5. Load the cache.
>>6. Match the log file.
>>7. Present output.
> I am not aware that usage mode runtime RE validation done on RE
> "cache" file load is at all significant re performance.
It moves the checks to generation time or to the first run of the script.
> The thing I don't understand is how to semantically compare two REs
> for functional equivalence. Very simple example case is how does a
> program know that ^[0-9][0-9]$ is functionally equivalent to
> ^[0-9]{2,2}
> NB my program does NOT accidently generate semantically equivalent REs
> to user supplied ones BUT is unable to determine whether two user
> supplied REs are semantically equivalent, it is only able to
> syntactically/literally determine equivalence of user supplied REs.
> So question: How does one determine whether two REs are semantically
> equivalent programatically especially in Tcl?
Build the state machine for each regexp and check if both are equivalent.
I'm not enough into state-machine theory to tell you how hard this can
get; read about it in your favorite book on CS. Doing it in pure Tcl:
possible, but be prepared to write your own regexp parser/compiler in Tcl.
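Short of building the state machines, a cheap heuristic is to run both patterns over a corpus of sample strings: a disagreement proves the REs differ, while agreement proves nothing. A sketch (the proc name and corpus are made up):

```tcl
# Heuristic only: returns 0 if the two patterns provably differ on the
# corpus, 1 if they agree everywhere (which does NOT prove equivalence).
proc maybeEquivalent {re1 re2 corpus} {
    foreach s $corpus {
        if {[regexp -- $re1 $s] != [regexp -- $re2 $s]} {
            return 0
        }
    }
    return 1
}

set corpus {12 7 123 ab 99 4x}
puts [maybeEquivalent {^[0-9][0-9]$} {^[0-9]{2,2}$} $corpus]   ;# -> 1 (agree on all samples)
puts [maybeEquivalent {^[0-9][0-9]$} {^[0-9]+$} $corpus]       ;# -> 0 ("7" distinguishes them)
```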
> Finally now fully back on topic:
> I do lsearch -exact [string toupper $alist] [string toupper $apattern]
> in my program, I don't think the suggestion of lsearch -regexp with
> (?i) or my idea of (?iq) is fully functionally equivalent, especially
> when one wants to check for the existence of REs case-insensitively, so I
> suppose I'll stick to the above "shimmering" one?! Interesting idea
> though!
You can speed up the lsearch in two ways, obviously:
1. cache the [string toupper $alist]; this gets interesting if you have
many lsearches on the same list in a loop.
2. sort the list and use lsearch -sorted
example:
set alist {some large list of regexps}
set clist {some list of regexp to check}
# setup a sorted list for lookup
set ALIST [lsort -ascii [string toupper $alist]]
# match (the probe must be uppercased too, since ALIST is uppercase)
foreach exp $clist {
    if {[lsearch -sorted -ascii -exact $ALIST [string toupper $exp]] != -1} {
        puts "Found \"$exp\" in alist"
    } else {
        puts "Did not find \"$exp\" in alist"
    }
}
Probably even faster:
# setup an array for lookup
array set exps [list]
foreach exp $alist {
    set exps([string toupper $exp]) 1
}
# match
foreach exp $clist {
    if {[info exists exps([string toupper $exp])]} {
        puts "Found \"$exp\" in alist"
    } else {
        puts "Did not find \"$exp\" in alist"
    }
}
Michael
But for your usage domain you probably define "duplicate" in a broader sense.
Consider these two regexps:
1. {an example\s}
2. {an example\s?.*}
1. matches a subset of 2. A semantic comparison would say they're
different, which they are. For your application it would be nice to
remove 1. from the list, as it only duplicates matches from 2. (matches
a true subset).
This can get arbitrarily complex, so I would not bother with it.
Performance is important, but machines get faster every month, so unless
you match really huge numbers of regexps this should not be a problem
(and if you do, just make sure to order/group them intelligently to get
an early hit).
Michael Schlenker
Michael,
Thanks for all your suggestions, especially:
1. that regular expression equivalence for an application may be more
involved than even semantic equivalence, which itself is hard to do
2. that one can probably catch all syntax errors in regular expressions
by presenting each once, as they are fully parsed/compiled regardless of
the supplied string.
John Taylor