
String to words


newtop...@yahoo.com

Feb 2, 2008, 3:25:11 PM
Hello,

Is there a way to remove all punctuation from a given string? I would
like to remove apostrophes, commas, periods, and extra spaces
completely, then treat any other non-alphanumeric character as a word
boundary before removing it.


Thanks!

Russell Treleaven

Feb 2, 2008, 4:31:50 PM
Does this do what you want?

set myString "this string has. some weird thing's in it"
set data [regexp -all -inline -- {[a-zA-Z0-9]*} $myString]
set newData {}
foreach token $data {
    # The * quantifier also yields empty matches; skip them.
    if {$token ne {}} { lappend newData $token }
}
puts $newData
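
A possible simplification of the snippet above, offered as a sketch and not as Russell's own code: with + instead of * the pattern can no longer match the empty string, so the filtering loop should not be needed. The proc name "words" is made up for illustration.

```tcl
# Hypothetical helper: return the alphanumeric runs of a string as a list.
proc words {str} {
    # "+" requires at least one character, so no empty matches come back.
    return [regexp -all -inline -- {[a-zA-Z0-9]+} $str]
}

set myString "this string has. some weird thing's in it"
puts [words $myString]
```

Note that the apostrophe is not in the character class, so "thing's" comes back as the two tokens "thing" and "s".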

Alexandre Ferrieux

Feb 2, 2008, 4:36:48 PM

Taken literally, this means translating "foo,bar" to "foobar",
while preserving the boundary in "foo/bar" as "foo bar". If this is
really what you want, with two categories of metacharacters (one
entirely wiped, the other collapsed to a boundary), then

regsub -all {[',. \t\n\r]+} $s {} s
regsub -all {[^A-Za-z0-9]+} $s { } s
foreach w $s {...}

(The result is guaranteed to be a proper Tcl list, since only
alphanumerics and single spaces remain; the spaces mark the boundaries,
hence the simple iteration over words with [foreach].)

If, however, you meant to keep all boundaries, then of course you need
only the second regsub.

-Alex
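
To make the two readings concrete, here is a small sketch. The sample string and the variable names "literal" and "kept" are assumptions for illustration. The first pair of calls is the literal version from the post; because its character class includes space, tab and newline, whitespace is deleted along with the punctuation. The second pair is an assumed variant that deletes only apostrophes, commas and periods and leaves whitespace to act as a boundary.

```tcl
set s "foo,bar foo/bar  it's"

# Literal reading: whitespace is in the first class, so it is wiped too.
regsub -all {[',. \t\n\r]+} $s {} literal
regsub -all {[^A-Za-z0-9]+} $literal { } literal
puts $literal

# Assumed variant: wipe only apostrophes, commas and periods, then let
# the second pass collapse whitespace and other characters to one space.
regsub -all {[',.]+} $s {} kept
regsub -all {[^A-Za-z0-9]+} $kept { } kept
puts $kept

foreach w $kept { puts "word: $w" }
```

With the literal version, words separated only by spaces run together, which is probably not what "extra spaces" was meant to ask for.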

newtop...@yahoo.com

Feb 2, 2008, 8:55:25 PM
to newtop...@yahoo.com
On Feb 2, 4:31 pm, Russell Treleaven <r...@else.net> wrote:
> Does this do what you want?
>
> set myString "this string has. some weird thing's in it"
> set data [ regexp -all -inline -- {[a-zA-Z0-9]*} $myString ]
> foreach token $data {
> if { $token != {} } { lappend newData $token }}
>
> puts $newData


Yes, it does. I added the single quote mark to the regexp pattern and
it does exactly what I want now. Thanks!
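
The exact pattern used here is not shown in the thread. As a guess, the change presumably looks something like the following sketch, with the apostrophe added to the bracket expression (and + used so that no empty tokens appear).

```tcl
set myString "this string has. some weird thing's in it"

# Assumed pattern: with the apostrophe in the class, "thing's" stays
# one token instead of being split at the quote.
set data [regexp -all -inline -- {[a-zA-Z0-9']+} $myString]
puts $data
```

This keeps the apostrophe inside the token; if it should be dropped afterwards, something like [string map {' {}} $token] on each token would do that.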


newtop...@yahoo.com

Feb 2, 2008, 9:02:07 PM
to newtop...@yahoo.com
On Feb 2, 4:36 pm, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:

> Taking this literally implies to translate "foo,bar" to "foobar",
> while preserving the boundary in "foo/bar" as "foo bar". If this is
> really what you want, with two categories of metacharacters (one
> entirely wiped, the other collapsed to a boundary), then
>
> regsub -all {[',. \t\n\r]+} $s {} s
> regsub -all {[^A-Za-z0-9]+} $s { } s
> foreach w $s {...}
>
> (indeed the result is guaranteed to be a proper Tcl list, and the
> boundaries are embodied by spaces, hence the simple iteration on words
> with [foreach])
>
> If, however, you meant to keep all boundaries, then of course you need
> only the second regsub.


You are raising a good point about the boundaries. My input consists
of paragraphs of regular English text. It is a long string with no
line breaks, but it may contain punctuation characters. So, for me,
the second regsub is more useful as I just want the individual
words.


Thanks!

> -Alex

Fredderic

Feb 2, 2008, 11:13:55 PM
On Sat, 2 Feb 2008 13:36:48 -0800 (PST),
Alexandre Ferrieux <alexandre...@gmail.com> wrote:

> On Feb 2, 9:25 pm, newtophp2...@yahoo.com wrote:
>> Is there a way to remove all punctuation from a given string?  I
>> would like to remove apostrophes, commas, periods, extra spaces,
>> completely.  Then remove any other non alpha-numeric character but
>> after treating it as a word boundary.
> Taking this literally implies to translate "foo,bar" to "foobar",
> while preserving the boundary in "foo/bar" as "foo bar". If this is
> really what you want, with two categories of metacharacters (one
> entirely wiped, the other collapsed to a boundary), then
> regsub -all {[',. \t\n\r]+} $s {} s
> regsub -all {[^A-Za-z0-9]+} $s { } s
> foreach w $s {...}
> (indeed the result is guaranteed to be a proper Tcl list, and the
> boundaries are embodied by spaces, hence the simple iteration on words
> with [foreach])

I was going to suggest enhancing the previous answer by adding that
first [regsub] you put there, as you say, to strip out totally ignored
punctuation. I think that would be the best solution.

Relying on constructing a string that looks like a list, when there's a
function already that'll generate a list directly, is IMHO bad form.

Although, knowing your present interest in pushing the in-place cause,
does [regsub] actually work in-place, and does it have a benefit here?


Fredderic

Alexandre Ferrieux

Feb 3, 2008, 5:03:07 AM
On Feb 3, 5:13 am, Fredderic <my-name-h...@excite.com> wrote:
>
> Relying on constructing a string that looks like a list, when there's a
> function already that'll generate a list directly, is IMHO bad form.

It may be bad style today, but for years it was the only way, i.e.
before [regexp -all].
Of course there's no point in ignoring progress, but at the same time
I don't think old idioms should be entirely forgotten.

An extra argument, in hindsight, is that in this specific case
(text manipulation), staying at the string level lets you continue
with further processing such as phrase matching. Hence, once the
[regsub]s guarantee a single space between words, you can do
[regsub -all {As far as I know} $s AFAIK s] and so on.
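
A minimal sketch of that idea, with a made-up sample sentence: the text is first normalized so that words are separated by exactly one space, and the phrase substitution then works directly on the string.

```tcl
set s "As far as I know, this works."

# Normalize: every run of non-alphanumerics becomes a single space.
regsub -all {[^A-Za-z0-9]+} $s { } s

# Phrase-level substitution is now an ordinary string-level regsub.
regsub -all {As far as I know} $s AFAIK s
puts [string trim $s]
```

The [string trim] is only there because the final period leaves a trailing space after normalization.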

> Although knowing your present interest for pushing the in-place cause,
> does [regsub] actually work in-place, and does it have a benefit here?

Hehe, no, there was no hidden message there. My in-place quest did not
pollute that simple attempt of mine to be helpful ;-)

Now about performance: I've just done a cursory check, and not
surprisingly, the two methods perform the same end to end. Reassuringly,
if you split the work into regexp processing and list
construction/conversion, the costs balance out. For my specific
170k-token string [lrepeat 1000 {*}[glob *]], I found:

1.8s for [regexp -all ...]
1.4s for [regsub -all ...]
1.8s for [regsub -all;llength]

My interpretation is that [regexp -all] does extra work building the
list (in particular, allocating an individual object for each item);
in the third case that same work is merely postponed to the list
conversion inside [llength].

Bottom line to the OP: use Russell's idiom. It is modern. But don't be
surprised to encounter the older one. It is not invalid, nor even
ridiculously slow ;-)

-Alex
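
For anyone who wants to repeat a comparison along these lines, here is a rough sketch using [time]. The sample text, the iteration count and the proc names viaRegexp and viaRegsub are assumptions, not what was used for the numbers above, and the list conversion is forced with [list {*}...] where the post used [llength].

```tcl
# Assumed sample input; the post used [lrepeat 1000 {*}[glob *]].
set text [string repeat "some words, with punctuation/and-more. " 1000]

# Approach 1: let regexp build the list of words directly.
proc viaRegexp {str} {
    return [regexp -all -inline -- {[A-Za-z0-9]+} $str]
}

# Approach 2: normalize with regsub, then have Tcl parse the
# resulting string as a list.
proc viaRegsub {str} {
    regsub -all {[^A-Za-z0-9]+} $str { } str
    return [list {*}$str]
}

puts "regexp: [time {viaRegexp $text} 10]"
puts "regsub: [time {viaRegsub $text} 10]"
```

Absolute timings will depend on the machine and the Tcl version, so only the relative figures are meaningful.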

Georgios Petasis

Feb 3, 2008, 8:05:20 AM
to newtop...@yahoo.com
newtop...@yahoo.com wrote:

Can I add the *worst* way of doing this? :-)
It's probably the most time-consuming approach, but it does not modify
the input string, and you get the offsets (as if you had used -indices
with regexp...):

## Text holds the data to be tokenised...
## We have to find the first word...
set start 0; set end [tcl_endOfWord $Text $start]
if {[tcl_startOfNextWord $Text $start] < $end} {
    set start [tcl_startOfNextWord $Text $start]
}

## And now iterate over all other words...
while {$start >= 0} {
    set start [tcl_startOfNextWord $Text $start]
    set end [tcl_endOfWord $Text $start]
    if {$start != -1 && $end != -1} {
        ## Do whatever you want with the offsets...
    }
}

George
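
For comparison, the -indices route mentioned in passing above could look roughly like this. It is a sketch with a made-up sample string, and it uses the alphanumeric pattern from earlier in the thread, not the word definition behind tcl_endOfWord.

```tcl
set Text "Hello, world: it's Tcl"

# Each element of $ranges is a {first last} pair of character indices
# into $Text; the input string itself is left untouched.
set ranges [regexp -all -inline -indices -- {[A-Za-z0-9]+} $Text]

foreach range $ranges {
    lassign $range first last
    puts "$first-$last: [string range $Text $first $last]"
}
```

Both indices are inclusive, so they can be passed straight to [string range].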

newtop...@yahoo.com

Feb 3, 2008, 10:51:12 AM
to newtop...@yahoo.com
On Feb 3, 8:05 am, Georgios Petasis <peta...@iit.demokritos.gr> wrote:
> Can I add the *worst* way for doing this? :-)
> Its probably the most time consuming approach, but it does not modify
> the input string and you know the offsets (as if you have used -indices
> with regexp...):


This looks similar to my original way of doing it, except that I didn't
know about the tcl_startOfNextWord and tcl_endOfWord commands and was
doing it with string indexes and comparisons.

In the end, I am going with the regexp/regsub way because my input
strings are relatively small (way smaller than Alexandre's test of
170k words) and it is OK for the string to be duplicated.


Thanks, I appreciate the input.
