Currently I just walk thru the list comparing 'by hand' - just
wondered if there was a better way.
Best regards
Helmut Giese
Helmut Giese wrote:
>
> Hello out there,
> is it somehow possible to do [lsearch] ignoring case?
You can use the -regexp option and
precede the pattern with (?i)
lsearch [string tolower $a] [string tolower $b]
maybe?
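Putting the two suggestions together, a minimal sketch (the list contents here are made up for illustration):

```tcl
set colors {Red GREEN blue}

# Suggestion 1: -regexp with the embedded (?i) case-insensitivity flag
puts [lsearch -regexp $colors {(?i)^green$}]    ;# -> 1

# Suggestion 2: lowercase both the list and the pattern
puts [lsearch [string tolower $colors] blue]    ;# -> 2
```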
Great! Works like a charm. Thanks
Helmut Giese
lsearch -option [string tolower $list] [string tolower $seeker]
It may be a little slower than a -yescase search because of the list representation
shimmering, but it should give the same result.
--
Derk Gwen http://derkgwen.250free.com/html/index.html
Where do you get those wonderful toys?
Yeah, why didn't I think of this old trick?
Mumble, mumble ... lists are large - would make a 'lower case copy' of
the original for [lsearch]ing and use the original for generating
messages to the user. Done this stuff lots of times.
Thanks for reminding me :)
Helmut Giese
PS: Roy's suggestion is, however, more pleasing on the intellectual
level (look Ma, used an RE with lsearch) - will have to check
performance on *really* large lists.
>hgi...@ratiosoft.com (Helmut Giese) wrote:
># Hello out there,
># is it somehow possible to do [lsearch] ignoring case?
>
>lsearch -option [string tolower $list] [string tolower $seeker]
>
>It may be a little slower than a -yescase search because of the list representation
>shimmering, but it should give the same result.
Thanks.
Would you care to elaborate on 'shimmering'?
Best regards
Helmut Giese
Of course, that means that the strings have to be handled at least 3
times. Could not lsearch be written in such a way that they are handled
only once?
--
Join us at the Tenth Annual Tcl/Tk Conference <URL: http://mini.net/tcl/6274 >
Even if explicitly stated to the contrary, nothing in this posting
should be construed as representing my employer's opinions.
<URL: mailto:lvi...@yahoo.com > <URL: http://www.purl.org/NET/lvirden/ >
It sure is possible, and not extremely hard. Just about 20
strcmp/memcmp/DictionaryCompare calls in Tcl_LsearchObjCmd() would
have to be changed, plus the extra option added.
Michael Schlenker
With small list items (one character each; 1,000,000 items in the list)
the lsearch -regexp approach is much faster than the [string tolower]
approach (about a factor of 40000 faster with regexp). This may be
different with larger items.
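A rough way to check this kind of claim yourself is [time]; the sketch below uses a smaller, made-up list, and the absolute numbers will vary by machine and Tcl version. Note that part of the -regexp advantage is early termination: it can stop at the first match, while the tolower variant lowercases the whole list on every call.

```tcl
# Build a large list of one-character items
set biglist {}
for {set i 0} {$i < 100000} {incr i} {
    lappend biglist [format %c [expr {97 + ($i % 26)}]]
}

# -regexp keeps the list's internal representation and stops at the first hit
puts [time {lsearch -regexp $biglist {(?i)^Z$}} 10]

# tolower must produce a lowered copy of the whole list, which lsearch
# then has to re-parse as a list, on every call
puts [time {lsearch [string tolower $biglist] z} 10]
```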
Michael Schlenker
Hello,
this means ...
1. the string representation of the list will be used by the "string
tolower" command.
2. the "string tolower" command returns a string object (a "lowered"
copy of the list, without the underlying Tcl list data structure).
3. lsearch then has to convert this string object back into a list
object (with all the underlying Tcl list data structure).
4. after the successful conversion into a list object the string
object doesn't exist anymore.
5. at the end of lsearch this "temporary" list object will be deleted.
If the list is very large (in memory), then it could be better to
use -regexp, as mentioned in the first reply!
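On modern Tcl (8.6 and later) the unsupported diagnostic command tcl::unsupported::representation can make these steps visible; a sketch (the command's output format is not guaranteed across versions):

```tcl
# Requires Tcl 8.6+; tcl::unsupported::representation is diagnostic only.
set lst [list a b c]
puts [tcl::unsupported::representation $lst]       ;# has a list internal rep

set lowered [string tolower $lst]                  ;# step 2: string result
puts [tcl::unsupported::representation $lowered]   ;# list rep is gone

lsearch $lowered b                                 ;# step 3: re-parse as a list
puts [tcl::unsupported::representation $lowered]   ;# list rep again
```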
Greetings,
Martin Lemburg
>> Would you care to elaborate on 'shimmering'?
>> Best regards
>> Helmut Giese
>
>Hello,
>
>this means ...
>
>1. the string representation of the list will be used by the "string
>tolower" command.
>
>2. the "string tolower" command returns a string object (a "lowered"
>copy of the list, without the underlying Tcl list data structure).
>
>3. lsearch then has to convert this string object back into a list
>object (with all the underlying Tcl list data structure).
>
>4. after the successful conversion into a list object the string
>object doesn't exist anymore.
>
>5. at the end of lsearch this "temporary" list object will be deleted.
>
>If the list is very large (in memory), then it could be better to
>use -regexp, as mentioned in the first reply!
Hello Martin,
thanks for the explanation. Now - as often in the past - I wonder
about the term 'shimmering'. Is it just a synonym for 'conversion' or
does it carry more meaning?
Helmut Giese
Shimmer: a wavering, sometimes distorted visual image usually resulting
from heat-induced changes in atmospheric refraction.
So when people say the type is shimmering, it means it's switching back
and forth between types several times, which is inefficient.
Ok, I got it now. Thanks
Helmut Giese
Background:
I have a program that generates a list of regular expressions to match
text files and as part of that process it checks to see if it already
has generated a particular regular expression...
Question:
I presume that the -regexp with (?i) technique is meant for use with
normal characters, NOT when you want to search for the existence of a
regular expression itself? In this case I would use (?iq), I presume?
Example:
set alist "george fred"
set apattern ".*"
# lsearch returns -1 as no match
lsearch [string toupper $alist] [string toupper $apattern]
set ipattern "(?i)$apattern"
# lsearch returns 0 as a regexp match NOT a string match
lsearch -regexp $alist $ipattern
set iqpattern "(?iq)$apattern"
# lsearch now returns -1 as it correctly does a literal match
lsearch -regexp $alist $iqpattern
In general it simply is _wrong_ to assume that:
regexp [string tolower $exp] [string tolower $string]
<=>
regexp $exp $string
So be prepared for trouble if you go this route.
Michael Schlenker
What is it about searching for the existence of a regular
expression? This is not a concept that is present in Tcl
code. I suspect there is a simplification waiting to
be brought to your code in which you find out if an RE
has been built by doing [info exists] or some sort of
array lookup. Using an RE to test what your program did
or did not already do sounds a bit off.
Care to clarify?
Cheers,
Roy
Michael Schlenker wrote:
>
> John Taylor wrote:
> > Roy Terry <royt...@earthlink.net> wrote in message news:<3E8B04BF...@earthlink.net>...
> >
> >>Helmut Giese wrote:
> >>
> >>>Hello out there,
> >>>is it somehow possible to do [lsearch] ignoring case?
> >>
> >>You can use the -regexp option and
> >>precede the pattern with (?i)
> >
> >
> > Background:
> > I have a program that generates a list of regular expressions to match
> > text files and as part of that process it checks to see if it already
> > has generated a particular regular expression...
> >
> > Question:
> > I presume that the -regexp with (?i) technique is meant for use with
> > normal characters, NOT when you want to search for the existence of a
> > regular expression itself? In this case I would use (?iq), I presume?
> Yes. If you use the regexp trick you have to use the (?iq) switch, but
> this may bite you quite badly if you use any of the shorthand escapes in
> regular expressions that have quite a different meaning in lowercase
> (think about \s+ and \S+ or \m and \M).
Are you talking about analyzing an re by running another re over it?
>
> In general it simply is _wrong_ to assume that:
>
> regexp [string tolower $exp] [string tolower $string]
> <=>
> regexp $exp $string
Do you have evidence that this is so? I would consider
it a bug so please post it.
>
> Michael Schlenker
It's not a bug:
% set exp {\Sa\S+sentence\S+with\S+whitespace}
\Sa\S+sentence\S+with\S+whitespace
% set string { a sentence with whitespace}
a sentence with whitespace
% regexp $exp $string
0
% regexp [string tolower $exp] [string tolower $string]
1
Michael Schlenker
Roy,
I'll try to clarify what I'm doing and apologise in advance that it's a
bit long. I am a Database Administrator and therefore need to check
database and other "log" files. I don't want to do this manually. I
observe that I could "grep" for known errors etc., but I would need a
very good list of error message types, and I observe that when things
go wrong, non-documented forms of errors are generated.
So my approach is to get a log (with not many errors in it) and get my
program to automatically generate a set of regular expressions (REs)
that match the log's lines. Then when I want to check a new log, I
output the log file lines which don't match any of the REs (i.e. the
interesting lines, including errors). As part of this technique I wish
to optionally adjust the automatically generated REs by including
user-generated REs, by excluding some, etc. I therefore need to check
that the user is not duplicating existing REs, etc. So at times the
program searches the generated REs to check the consistency of the user
requests. I hold the REs as a list, which I admit may NOT be the best
way to store them, but the program evolved from test code to check if
the technique works in the real world and so has the scars and bruises
of experimentation. Maybe I should hold them as an array (although
searching for a RE might well still involve issues of case and the
need to turn off the function of meta characters to achieve literal
searches). However, they are currently in a list, and I case fold the
non-RE characters so that they are held in a canonical form to make it
easier to check inconsistent user RE manipulations. I don't case fold
all expressions, to allow the user to have case-sensitive REs for
completeness.
Please understand that the basis of the program was in the Tcl 7.5 era,
when the REs had less power, and I'm changing it to Tcl 8.4.1+. As
an aside, I therefore think I will think more about the (?i) prefix in
the REs rather than using regexp -nocase.
Anyway until I change things, I need a method of case insensitive
searching of a list. I understand that my current method of using
lsearch with string toupper works but involves "shimmering" so was
interested when I saw the lsearch -regexp with (?i) RE prefix
technique. However it was apparent to me that the pattern in this new
efficient technique was no longer a string but a RE and since my
patterns are REs and I want them to be taken just as literal strings,
I think that this means using (?iq) RE prefix.
I hope this explains somewhat why one might want to auto generate REs
rather than handcraft them and also why you might want to semi
literally search a data structure containing REs.
I remain open to any suggestions, whether it be overall approach or
Tcl specific e.g.s neural network to check logs, how to search a Tcl
data structure that contains REs, how in Tcl to check the validity of
user supplied REs and RE auto generation ideas including efficiency
issues.
Also I don't really understand the "[info exists]" suggestion in regard
to RE existence checking?! Can you please explain?! (NB my program's
RE generation phase does not necessarily involve passing them to the
Tcl interpreter so I don't see how the interpreter would know about
them until they are actually used.)
Can you also elaborate on an array based RE data structure (e.g. would
you use the RE as the key, with say a dummy value?).
To clarify the example pattern: it was simplistic, in that (?i) with
a RE, say one involving ".*", is NOT important; however, in a broader
context, when combined with alphabetic characters, it is.
Thanks for the help so far!
Regards John Taylor
I gather then that the log files are quite heterogeneous. It also seems
to follow that you are really detecting just "new looking" lines and can
only identify them as errors by human inspection.
>
> So my approach is to get a log (with not many errors in it) and get my
> program to automatically generate a set of regular expressions (REs)
> that match the log's lines. Then when I want to check a new log, I
> output log file lines which don't match any of the REs (i.e.
> interesting lines including errors). As part of this technique I wish
> to optionally adjust the automatically generated REs by including user
> generated REs and by excluding some etc.
You mean users will write RE's by hand to be mixed with auto-generated?
And the program will exclude some of them (either user generated or
program generated)? It's not clear from what you write.
> I therefore need to check
> that the user is not duplicating existing REs etc.
I don't buy the "therefore" assertion, but obviously this
is important to your goals so don't bother about that. Anyhow, what
seems still curious is the notion of comparing RE's in order to
reject "duplicates". Seems to me that for some lines *many* different
looking RE's will produce hits and there is no practical way to detect
this. So all the trouble you're going to with case-less matching appears
to address only some of the duplicates. In other words, seems like
a lot of concern to build an umbrella with a large hole in it. :*)
This probably means I just still don't understand and that's OK.
Never mind what I said about [info exists], etc which doesn't apply.
Cheers and good luck.
Roy
> However they are currently in a list and I case fold the
> non RE characters so that they are held in a canonical form to make it
> easier to check inconsistent user RE manipulations. I don't case fold
> all expressions to allow the user to have case sensitive REs for
> completeness.
I don't understand this paragraph. Do you state here that you have a
_correct_ proc for case folding a regexp, or that you just use
[string tolower]?
>
> Anyway until I change things, I need a method of case insensitive
> searching of a list. I understand that my current method of using
> lsearch with string toupper works but involves "shimmering" so was
> interested when I saw the lsearch -regexp with (?i) RE prefix
> technique. However it was apparent to me that the pattern in this new
> efficient technique was no longer a string but a RE and since my
> patterns are REs and I want them to be taken just as literal strings,
> I think that this means using (?iq) RE prefix.
>
> I hope this explains somewhat why one might want to auto generate REs
> rather than handcraft them and also why you might want to semi
> literally search a data structure containing REs.
>
> I remain open to any suggestions, whether it be overall approach or
> Tcl specific e.g.s neural network to check logs, how to search a Tcl
> data structure that contains REs, how in Tcl to check the validity of
> user supplied REs and RE auto generation ideas including efficiency
> issues.
Some simple validation proc to check if it compiles OK:
proc checkregexp {exp} {
    # "--" ends option parsing, guarding against patterns that start with "-"
    if {[catch {regexp -- $exp "dummy"} msg]} {
        return 0
    } else {
        return 1
    }
}
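A quick sanity check of the idea, with the proc repeated so the snippet runs standalone (the invalid pattern is made up):

```tcl
proc checkregexp {exp} {
    # 1 if the pattern compiles, 0 if [regexp] raises a compile error;
    # "--" guards against patterns beginning with "-"
    if {[catch {regexp -- $exp "dummy"}]} {
        return 0
    }
    return 1
}

puts [checkregexp {^[0-9]+$}]    ;# -> 1
puts [checkregexp {[a-z}]        ;# -> 0 (brackets not balanced)
```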
Hmm, your description sounds a bit overly complex for the task at hand,
but I may be missing something.
Some questions:
How important is maximum performance for your use case?
As I see it your program does log parsing, maybe run by cron, but no
user will add his private regexps at runtime of the program, so this can
be factored out easily.
I assume you have all regexps (generated and user added) in a file each.
I would do it like this:
1. Check if the auto-generated regexp file and the user file have changed.
2. If no change go to 5.
3. Validate all supplied regexps.
4. Create a cache of the valid regexps in a file.
5. Load the cache.
6. Match the log file.
7. Present output.
Step 3, which involves validating the regexps, is only reached if changes
happen, which is probably far less common than the unchanged case.
Validation:
You seem to have some rules for when a user-added regexp is valid to add
(autogenerated ones should always be valid, otherwise the generator seems
broken to me):
(1) no exact duplicates
(2) no duplicates after ignoring case
(3) some other requirements
(1) is trivial with either lsearch or an array and using [info exists
array($key)].
(2) is harder because two problems appear:
a) what is a correct lowercase version of a regexp? (As you cannot
use [string tolower] directly, you have to write a [regexptolower] proc
that has to identify the literals in a regexp.)
b) there seem to be exceptions to the rule that regexps that only
differ in the case of their literals are considered equal.
Michael
> Some simple validation proc to check if it compiles ok:
>
> proc checkregexp {exp} {
>     if {[catch {regexp $exp "dummy"} msg]} {
>         return 0
>     } else {
>         return 1
>     }
> }
>
My program has two fundamental modes, a RE generation mode and a RE
usage mode.
Currently I "catch" user-generated REs at the usage stage; it would be
better to catch them in the generation mode.
So a RE question: does Tcl guarantee, now and in the future, that all RE
syntax faults can be caught regardless of the string, such that, as in
your technique, one simple, single presentation is sufficient?
I wonder whether it only checks what it needs of the RE OR
fully parses/compiles it, such that your checking technique works.
Further, you give the string as "dummy"; what is the most efficient
string that works - is it ""?
> Some questions:
> How important is maximum performance for your use case?
Much more important than the generation phase as it is run regularly
and the RE generation phase is run very rarely.
> As i see it your program does log parsing, maybe run by cron, but no
> user will add his private regexps at runtime of the program, so this can
> be factored out easily.
In general this is TRUE, BUT the program in generation phase simply
outputs an editable text file of REs; since these can/could be
modified, it still does some checking when they are read back in in the
normal usage phase to protect the user and itself from human errors.
Further, when generating REs I run the
generation and usage phases interactively to allow the application of
human intelligence to RE production, and therefore need to check for
accidental human-generated RE mistakes, including direct tweaks to the
auto-generated REs.
> I assume you have all regexp (generated and user added) in a file each.
What happens is a training file is input that the program generates
REs for. However optionally a user supplied file of inclusion REs and
a user supplied file of exclusion REs is permitted. Then out the end
comes a single file consisting of user supplied REs plus system
generated REs less any REs the user said they didn't want.
I am not aware that usage-mode runtime RE validation, done on RE
"cache" file load, is at all significant for performance.
The thing I don't understand is how to semantically compare two REs
for functional equivalence. A very simple example: how does a
program know that ^[0-9][0-9]$ is functionally equivalent to
^[0-9]{2,2}$ ?
NB my program does NOT accidentally generate semantically equivalent REs
to user-supplied ones BUT is unable to determine whether two
user-supplied REs are semantically equivalent; it is only able to
syntactically/literally determine equivalence of user-supplied REs.
So question: how does one determine whether two REs are semantically
equivalent programmatically, especially in Tcl?
Finally now fully back on topic:
I do lsearch -exact [string toupper $alist] [string toupper $apattern]
in my program. I don't think the suggestion of lsearch -regexp with
(?i), or my idea of (?iq), is fully functionally equivalent, especially
when one wants to check for the existence of REs case-insensitively, so I
suppose I'll stick to the above "shimmering" one?! Interesting idea
though!
Thanks!
John Taylor
> > I therefore need to check
> > that the user is not duplicating existing REs etc.
> I don't buy the "therefore" assertion, but obviously this
> is important to your goals so don't bother about that. Anyhow, what
> seems still curious is the notion of comparing RE's in order to
> reject "duplicates". Seems to me that for some lines *many* different
> looking RE's will produce hits and there is no practical way to detect
> this. So all the trouble you're going to with case-less matching appears
> to address only some of the duplicates. In other words, seems like
> a lot of concern to build an umbrella with a large hole in it. :*)
> This probably means I just still don't understand and that's OK.
> Never mind what I said about [info exists], etc which doesn't apply.
>
> Cheers and good luck.
> Roy
...
I think it's best to catch as many user errors as is easy. I have
asked Michael Schlenker, when I replied to him, whether there is a
method of determining whether two regular expressions are semantically
equivalent. I gather you're saying that there is not a known method?!
NB I don't accidentally generate REs semantically equivalent to user
ones; I just don't know how to check if the user has supplied two or
more semantically equivalent REs (silly user).
Thanks
It cannot give any guarantees; you could read the regexp part of the
core yourself to find out as much detail as you want, but the regexp
part is rather complex.
The regexp is compiled and then executed, so any compile-time syntax
errors should show themselves with this proc.
I do not know what the most efficient dummy string is; mostly it doesn't
matter much. (If I wanted a maximum-performance regexp syntax checker I
would try to extract the part from the Tcl regexp code and create an
appropriate C-coded extension.)
>>Some questions:
>>How important is maximum performance for your use case?
>
> Much more important than the generation phase as it is run regularly
> and the RE generation phase is run very rarely.
So check as much as possible in the generation phase. Don't bother with
too many checks in the run phase, if they can safely be done at
generation time.
>
>>As i see it your program does log parsing, maybe run by cron, but no
>>user will add his private regexps at runtime of the program, so this can
>>be factored out easily.
>
> In general this is TRUE BUT the program in generation phase simply
> outputs an editable text file of REs, since these can/could be
> modified it still does some checking when they are read back in in the
> normal usage phase to protect the user and itself from human errors.
Basically some MD5 or CRC32 checksum of your text file should be enough
to detect editing after generation.
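A sketch of that idea using tcllib's md5 package (assuming tcllib is available; the file name and the surrounding procs are made up):

```tcl
package require md5

# Compute a checksum for a regexp file; store it next to the cache and
# compare on the next run to detect hand edits.
proc fileChecksum {path} {
    set f [open $path r]
    fconfigure $f -translation binary   ;# so line endings don't change the sum
    set data [read $f]
    close $f
    return [md5::md5 -hex $data]
}

# e.g. rebuild the regexp cache only when the file was edited:
# if {[fileChecksum regexps.txt] ne $storedChecksum} { rebuildCache }
```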
>>I would do it like this:
>>1. Check if the auto-generated regexp file and the user file have changed.
>>2. If no change go to 5.
>>3. Validate all supplied regexps.
>>4. Create a cache of the valid regexps in a file.
>>5. Load the cache.
>>6. Match the log file.
>>7. Present output.
> I am not aware that usage mode runtime RE validation done on RE
> "cache" file load is at all significant re performance.
It moves the checks to generation time or to the first run of the script.
> The thing I don't understand is how to semantically compare two REs
> for functional equivalence. Very simple example case is how does a
> program know that ^[0-9][0-9]$ is functionally equivalent to
> ^[0-9]{2,2}
> NB my program does NOT accidently generate semantically equivalent REs
> to user supplied ones BUT is unable to determine whether two user
> supplied REs are semantically equivalent, it is only able to
> syntactically/literally determine equivalence of user supplied REs.
> So question: How does one determine whether two REs are semantically
> equivalent programatically especially in Tcl?
Build the state machine for each regexp and check if both are equivalent.
I'm not enough into state-machine theory to tell you how hard this can
get; read about it in your favorite book on CS. Doing it in pure Tcl:
possible, but be prepared to write your own regexp parser/compiler in Tcl.
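Short of building the state machines, a cheap heuristic is to run both patterns over a corpus of sample strings: a disagreement proves the REs differ, while agreement proves nothing. A sketch (the proc name and corpus are made up):

```tcl
# Heuristic only: returns 0 if the two patterns provably differ on the
# corpus, 1 if they agree everywhere (which does NOT prove equivalence).
proc maybeEquivalent {re1 re2 corpus} {
    foreach s $corpus {
        if {[regexp -- $re1 $s] != [regexp -- $re2 $s]} {
            return 0
        }
    }
    return 1
}

set corpus {12 7 123 ab 99 4x}
puts [maybeEquivalent {^[0-9][0-9]$} {^[0-9]{2,2}$} $corpus]   ;# -> 1 (agree on all samples)
puts [maybeEquivalent {^[0-9][0-9]$} {^[0-9]+$} $corpus]       ;# -> 0 ("7" distinguishes them)
```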
> Finally now fully back on topic:
> I do lsearch -exact [string toupper $alist] [string toupper $apattern]
> in my program, I don't think the suggestion of lsearch -regexp with
> (?i) or my idea of (?iq) is fully functionally equivalent, especially
> when one wants to check for the existence of REs case-insensitively, so I
> suppose I'll stick to the above "shimmering" one?! Interesting idea
> though!
You can speed up the lsearch in two ways, obviously:
1. cache the [string toupper $alist]; this gets interesting if you have
many lsearches on the same list in a loop.
2. sort the list and use lsearch -sorted
example:
set alist {some large list of regexps}
set clist {some list of regexp to check}
# setup a sorted list for lookup
set ALIST [lsort -ascii [string toupper $alist]]
# match (the probe must be uppercased too, since ALIST is uppercase)
foreach exp $clist {
    if {[lsearch -sorted -ascii -exact $ALIST [string toupper $exp]] != -1} {
        puts "Found \"$exp\" in alist"
    } else {
        puts "Did not find \"$exp\" in alist"
    }
}
Probably even faster:
# setup an array for lookup
array set exps [list]
foreach exp $alist {
    set exps([string toupper $exp]) 1
}
# match
foreach exp $clist {
    if {[info exists exps([string toupper $exp])]} {
        puts "Found \"$exp\" in alist"
    } else {
        puts "Did not find \"$exp\" in alist"
    }
}
Michael
But for your usage domain you probably define "duplicate" in a broader sense.
Consider these two regexps:
1. {an example\s}
2. {an example\s?.*}
1. matches a subset of 2. A semantic comparison would say they're
different, which they are. For your application it would be nice to
remove 1. from the list, as it only duplicates matches from 2. (matches
a true subset).
This can get arbitrarily complex, so I would not bother with it.
Performance is important, but machines get faster every month, so unless
you match really huge numbers of regexps this should not be a problem
(and if you do, just make sure to order/group them intelligently to get
an early hit).
Michael Schlenker
Michael,
Thanks for all your suggestions, especially:
1. that regular expression equivalence for an application may be more
involved than even semantic equivalence, which itself is hard to do
2. that one can probably catch all syntax errors in regular expressions
by presenting each once, as they are fully parsed/compiled regardless of
the supplied string.
John Taylor