Is there a way to remove all punctuation from a given string? I would
like to remove apostrophes, commas, periods, and extra spaces
completely, then remove any other non-alphanumeric character, but
only after treating it as a word boundary.
Thanks!
set myString "this string has. some weird thing's in it"
# Collect every run of alphanumerics. Using + instead of * means no
# empty matches are returned, so no filtering loop is needed.
set newData [regexp -all -inline -- {[a-zA-Z0-9]+} $myString]
puts $newData
Taken literally, this means translating "foo,bar" to "foobar", while
preserving the boundary in "foo/bar" as "foo bar". If that is really
what you want, with two categories of metacharacters (one wiped
entirely, the other collapsed to a boundary), then
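# Note: this class also contains whitespace, so words separated only
# by spaces or newlines get joined; drop " \t\n\r" to keep them apart.
regsub -all {[',. \t\n\r]+} $s {} s
regsub -all {[^A-Za-z0-9]+} $s { } s
foreach w $s {...}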
(indeed the result is guaranteed to be a proper Tcl list, and the
boundaries are embodied by spaces, hence the simple iteration on words
with [foreach])
If, however, you meant to keep all boundaries, then of course you need
only the second regsub.
-Alex
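To make the two readings concrete, here is a minimal sketch that wraps each one in a proc. The proc names stripThenSplit and splitOnly and the sample input are made up for illustration. The wiped class is deliberately narrowed to [',.] (whitespace left out) so that ordinary spaces still separate words; that is an assumption about the intent, not the class from the post.

```tcl
# Reading 1: delete apostrophes, commas and periods outright, then
# collapse every other non-alphanumeric run into one space.
proc stripThenSplit {s} {
    regsub -all {[',.]+} $s {} s
    regsub -all {[^A-Za-z0-9]+} $s { } s
    return [string trim $s]
}

# Reading 2: every non-alphanumeric run is a word boundary.
proc splitOnly {s} {
    regsub -all {[^A-Za-z0-9]+} $s { } s
    return [string trim $s]
}

puts [stripThenSplit "foo,bar foo/bar thing's"]   ;# foobar foo bar things
puts [splitOnly      "foo,bar foo/bar thing's"]   ;# foo bar foo bar thing s
```

Either result contains only alphanumerics and single spaces, so it can be fed straight to [foreach] as a list.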
Yes, it does. I added the single quote mark to the regexp pattern and
it does exactly what I want now. Thanks!
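For reference, a sketch of what adding the single quote to the character class might look like. The exact pattern used is not shown in the thread, so this one is a guess; it uses + so that no empty matches are produced.

```tcl
set myString "this string has. some weird thing's in it"
# The apostrophe is part of the class, so "thing's" stays one token.
set words [regexp -all -inline -- {[a-zA-Z0-9']+} $myString]
puts $words   ;# this string has some weird thing's in it
```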
> If, however, you meant to keep all boundaries, then of course you need
> only the second regsub.
You are raising a good point about the boundaries. My input consists
of paragraphs of regular English text. It is a long string with no
line breaks, but it may contain punctuation characters. So, for me,
the second regsub is more useful as I just want the individual
words.
Thanks!
I was going to suggest enhancing the previous answer by adding that
first [regsub] you put there, as you say, to strip out totally ignored
punctuation. I think that would be the best solution.
Relying on constructing a string that looks like a list, when there's a
function already that'll generate a list directly, is IMHO bad form.
Although knowing your present interest for pushing the in-place cause,
does [regsub] actually work in-place, and does it have a benefit here?
Fredderic
It may be bad style today, but for years it was the only way, i.e.
before [regexp -all].
Of course there's no point in ignoring progress. But at the same time
I don't think old idioms should be entirely forgotten.
An extra argument, a posteriori, is that in this specific case
(text manipulation), staying at the string level lets you continue
with further processing such as phrase matching. Hence, once the
[regsub]s guarantee a single space between words, you may
[regsub -all {As far as I know} $s AFAIK s] etc...
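A small sketch of that string-level follow-up, with a made-up sample sentence. The \m and \M word-boundary escapes and the -nocase switch are additions here, so that the phrase does not match inside longer words and capitalisation does not matter.

```tcl
set s "Well...   as far as I know, it works; As far as I know."
# Normalise: every non-alphanumeric run becomes a single space.
regsub -all {[^A-Za-z0-9]+} $s { } s
set s [string trim $s]
# Still a plain string, so multi-word phrases can be rewritten in place.
regsub -all -nocase {\mas far as I know\M} $s AFAIK s
puts $s   ;# Well AFAIK it works AFAIK
```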
> Although knowing your present interest for pushing the in-place cause,
> does [regsub] actually work in-place, and does it have a benefit here?
Hehe, no, there was no hidden message there. My in-place quest did not
pollute that simple attempt of mine to be helpful ;-)
Now about performance: I've just done a cursory check and, not
surprisingly, the two methods take about the same time overall.
Reassuringly, if you split the work into [regexp processing] and
[list construction/conversion], the costs even balance out. For my
specific 170k-token string [lrepeat 1000 {*}[glob *]], I found:
1.8s for [regexp -all ...]
1.4s for [regsub -all ...]
1.8s for [regsub -all;llength]
My interpretation is that [regexp -all] has extra work building the
list (especially allocating individual objects for the items), work
that in the third case is postponed to the list conversion inside
[llength].
Bottom line to the OP: use Russell's idiom. It is modern. But don't be
surprised to encounter the older one. It is not invalid, nor even
ridiculously slow ;-)
-Alex
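A rough way to repeat this kind of comparison with [time]. The input is a synthetic string rather than the [glob] output used above, the proc names viaRegexp and viaRegsub are invented for the sketch, and the absolute numbers will differ by machine and Tcl version.

```tcl
# Synthetic input (an assumption, not the data from the post).
set s [string repeat "alpha beta-gamma delta.epsilon " 20000]

# Modern idiom: build the word list directly.
proc viaRegexp {s} {
    return [regexp -all -inline -- {[A-Za-z0-9]+} $s]
}

# Older idiom: produce a space-separated string that is also a valid list.
proc viaRegsub {s} {
    regsub -all {[^A-Za-z0-9]+} $s { } s
    return $s
}

puts "regexp -all -inline: [time {viaRegexp $s} 5]"
puts "regsub -all:         [time {viaRegsub $s} 5]"
puts "regsub + llength:    [time {llength [viaRegsub $s]} 5]"
```

The third measurement forces the string-to-list conversion, which is where the deferred cost should show up.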
Can I add the *worst* way of doing this? :-)
It's probably the most time-consuming approach, but it does not modify
the input string, and you know the offsets (as if you had used -indices
with regexp...):
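## Text holds the data to be tokenised...
## We have to find the first word: it starts at index 0 unless the
## text begins with a non-word character.
## (assumes [string is wordchar] agrees with tcl_wordchars)
set start 0
if {![string is wordchar -strict [string index $Text 0]]} {
    set start [tcl_startOfNextWord $Text 0]
}
## And now iterate over all words...
while {$start >= 0} {
    set end [tcl_endOfWord $Text $start]
    ## -1 means no non-word character follows: the word runs to the end.
    if {$end < 0} { set end [string length $Text] }
    ## Do whatever you want with the offsets; the word itself is
    ## [string range $Text $start [expr {$end - 1}]]
    set start [tcl_startOfNextWord $Text $start]
}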
George
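For comparison, a minimal sketch of the -indices variant mentioned above, which also leaves the input untouched and yields offsets. The proc name wordSpans and the sample text are made up here, and the word definition ([A-Za-z0-9]+) is the one used earlier in the thread rather than tcl_wordchars.

```tcl
# Return a list of {start end} pairs, one per alphanumeric run.
proc wordSpans {text} {
    return [regexp -all -inline -indices -- {[A-Za-z0-9]+} $text]
}

set Text "this string has. some"
foreach span [wordSpans $Text] {
    lassign $span start end   ;# end is the index of the last character
    puts "$start-$end: [string range $Text $start $end]"
}
```

Unlike tcl_endOfWord, the end index here points at the last character of the word, so it can be passed to [string range] directly.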
This looks similar to my original way of doing it. Except I didn't
know about tcl_startOfNextWord and tcl_endOfWord commands and I was
doing it with string indexes and comparisons.
In the end, I am going with the regexp/regsub way because my input
strings are relatively small (far smaller than Alexandre's test of
170k words) and it is OK for the string to be duplicated.
Thanks, I appreciate the input.