
Why isn't "dash" a wordchar?


Bob Binder

Mar 25, 2006, 9:41:48 AM
string is wordchar "-"

returns false, but

string is wordchar "_"

and

string is punct "-"

return true.

Tcl string man says wordchar is "Any Unicode word character. That is any
alphanumeric character, and any Unicode connector punctuation characters
(e.g. underscore)."

The Unicode standard indicates all forms of the dash (aka minus, hyphen,
en-dash ...) are in the Unicode punctuation group. There is no mention of
"connectors" as a special class.

Why isn't "dash" a wordchar?
Bob


Adrian Ho

Mar 26, 2006, 5:06:26 AM
On 2006-03-25, Bob Binder <nos...@domain.com> wrote:
> Tcl string man says wordchar is "Any Unicode word character. That is any
> alphanumeric character, and any Unicode connector punctuation characters
> (e.g. underscore)."
>
> The Unicode standard indicates all forms of the dash (aka minus, hyphen,
> en-dash ...) are in the Unicode punctuation group. There is no mention of
> "connectors" as a special class.

There is, for some definition of "special class". See
<http://www.unicode.org/reports/tr18/> and search for "connector"
(there are 3 matches, all of which are significant). Also see
<http://www.fileformat.info/info/unicode/category/Pc/list.htm> for the
list of 10 characters that fall in the "Punctuation, Connector" category.
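
A minimal sketch of how the two categories show up in Tcl, assuming a Tcl 8.x or later shell. The choice of sample characters (the undertie U+203F as a second connector, the en dash U+2013 as a second dash) is mine, not from the TR18 text:

```tcl
# Compare connector punctuation (Pc) with dash punctuation (Pd).
# The labels and the sample characters are illustrative choices.
foreach {label ch} [list underscore _ undertie \u203f hyphen-minus - en-dash \u2013] {
    puts [format "%-12s wordchar=%d punct=%d" $label \
        [string is wordchar $ch] [string is punct $ch]]
}
```

The connectors should report as word characters and the dashes should not, while the dashes still count as punctuation.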

> Why isn't "dash" a wordchar?

Ummmm, because Unicode says it's not? 8-)

One explanation is given by the third match for "connector" in
<http://www.unicode.org/reports/tr18/>, i.e. that underscore and related
"connector punctuation" characters are generally used as connectors for
identifiers in many (most?) programming languages. It just so happens
that Tcl is *not* one of them, but we can't all be conformists. 8-)

While we're on this subject, can the Tcl Core Unicode deity
(Jeff?) confirm whether "Unicode word character" includes all characters
under the Mark category umbrella, as listed against the "word" property
in Annex C of <http://www.unicode.org/reports/tr18/>? Not that I'm in
a position to use any of those particular characters, just curious. 8-)

- Adrian

Donald Arseneau

Mar 26, 2006, 7:24:25 AM
Adrian Ho <t...@03s.net> writes:

> On 2006-03-25, Bob Binder <nos...@domain.com> wrote:
> "connector punctuation" characters are generally used as connectors for
> identifiers in many (most?) programming languages. It just so happens
> that Tcl is *not* one of them, but we can't all be conformists. 8-)

On the contrary! While any character may be used as part of a
variable or command name, different characters interact with the
syntax in different ways. The underscore is unusual in that it
behaves just like an alphanumeric.

% puts $a.foo
Yes.foo
% puts $a,foo
Yes,foo
% puts $a*foo
Yes*foo
% puts $a@foo
Yes@foo
% puts $a_foo
can't read "a_foo": no such variable
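
The transcript above presumably starts from a variable a holding "Yes"; that setup is an assumption here. A self-contained sketch of the same effect, plus the usual brace workaround:

```tcl
# Assumed setup: the session above never shows how "a" was set.
set a Yes

# "." and "-" are not part of a $name, so substitution stops after "a".
puts $a.foo
puts $a-foo

# "_" is part of a $name, so Tcl looks for a variable called "a_foo".
if {[catch {puts $a_foo} msg]} {
    puts $msg
}

# Braces mark where the name ends.
puts ${a}_foo
```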

--
Donald Arseneau as...@triumf.ca

Adrian Ho

Mar 26, 2006, 10:07:51 AM
On 2006-03-26, Donald Arseneau <as...@triumf.ca> wrote:
> Adrian Ho <t...@03s.net> writes:
>> On 2006-03-25, Bob Binder <nos...@domain.com> wrote:

Just so it's clear, I wrote the following nonsense, not Bob. 8-)

>> "connector punctuation" characters are generally used as connectors for
>> identifiers in many (most?) programming languages. It just so happens
>> that Tcl is *not* one of them, but we can't all be conformists. 8-)
>
> On the contrary! While any character may be used as part of a
> variable or command name, different characters interact with the
> syntax in different ways. The underscore is unusual in that it
> behaves just like an alphanumeric.

Sorry. You're right that it's completely nonsensical because I screwed
up my edit. The above originally read:

"connector punctuation" characters are generally used as connectors for
identifiers in many (most?) programming languages. It just so happens
that Tcl is *not* one of them (in that it's not so restrictive, but
allows just about any character in an identifier with careful bracing),
but we can't all be conformists. 8-)

but in a fit of rapid keyboard-pounding, I accidentally deleted the
entire parenthetical comment. Guess which paren-heavy language I've
been dabbling with lately. (It rhymes with "beam". 8-)

- Adrian

Donal K. Fellows

Mar 26, 2006, 4:13:55 PM
Adrian Ho wrote:
> Guess which paren-heavy language I've
> been dabbling with lately. (It rhymes with "beam". 8-)

PL/1? ;-)

Donal.

Bruce Hartweg

Mar 27, 2006, 10:45:03 AM

Now, having spent some time in the UK recently, I am
trying to remember which regional dialect would have
those rhyming. Maybe Welsh ;)

Bruce

Bob Binder

Mar 27, 2006, 6:12:36 PM

"Donal K. Fellows" <donal.k...@man.ac.uk> wrote in message
news:1143407635.6...@i39g2000cwa.googlegroups.com...

Did you mean "PL/I" ?

Bob



Bob Binder

Mar 27, 2006, 6:17:48 PM
Thanks for the clarification.

Although this behavior is apparently consistent with the cited Unicode
standard, it is inconvenient. I'd like a built-in string class that
recognizes all valid Tcl names, including option identifiers with their
leading dash, the dollar sign, etc.

Bob

"Adrian Ho" <t...@03s.net> wrote in message
news:7bnif3-...@rover.03s.net...

sleb...@gmail.com

Mar 27, 2006, 7:32:26 PM
Bob Binder wrote:
> "Adrian Ho" <t...@03s.net> wrote in message
> news:7bnif3-...@rover.03s.net...
> > On 2006-03-26, Donald Arseneau <as...@triumf.ca> wrote:
> >> Adrian Ho <t...@03s.net> writes:
> >>> On 2006-03-25, Bob Binder <nos...@domain.com> wrote:
> >
> > Just so it's clear, I wrote the following nonsense, not Bob. 8-)
> >
> >>> "connector punctuation" characters are generally used as connectors for
> >>> identifiers in many (most?) programming languages. It just so happens
> >>> that Tcl is *not* one of them, but we can't all be conformists. 8-)
> >>
> >> On the contrary! While any character may be used as part of a
> >> variable or command name, different characters interact with the
> >> syntax in different ways. The underscore is unusual in that it
> >> behaves just like an alphanumeric.
> >
> > Sorry. You're right that it's completely nonsensical because I screwed
> > up my edit. The above originally read:
> >
> > "connector punctuation" characters are generally used as connectors for
> > identifiers in many (most?) programming languages. It just so happens
> > that Tcl is *not* one of them (in that it's not so restrictive, but
> > allows just about any character in an identifier with careful bracing),
> > but we can't all be conformists. 8-)
> >
> > but in a fit of rapid keyboard-pounding, I accidentally deleted the
> > entire parenthetical comment. Guess which paren-heavy language I've
> > been dabbling with lately. (It rhymes with "beam". 8-)
> >
> Thanks for the clarification.
>
> Although this behavior is apparently consistent with the cited Unicode
> standard, it is inconvenient. I'd like a built-in string class that
> recognizes all valid Tcl names, including option identifiers with their
> leading dash, the dollar sign, etc.
>

If that is what you want, then it IS convenient, as the dash is not valid
for $ substitution. Hence, for your usage, the dash is not valid in a Tcl
variable name*.

% set foo-bar "test"
% puts $foo-bar
can't read "foo": no such variable
% set -bat "test2"
% puts $-bat
$-bat

* Note: If you consider the dash as valid variable name in Tcl (which
technically it is) then the function to check for valid Tcl variable
names is:

proc validVarname {name} {return 1}

since the [set] command does not restrict which characters can be used
in a variable name. You can even use the null character \0 or the
character 255 (0xff) in variable names:

% set \000\017 "test3"
% set \000\017
test3

My favourite is of course the empty variable name which is exactly zero
bytes long:

% set {} "test4"
% set {}
test4

Similarly, there is actually no restriction on which characters can be
used in proc names.
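
If the real goal is the stricter test, i.e. "can this name be written after a bare $ without braces", a rough sketch follows. The proc name isBareSubstName is made up for illustration, and the pattern assumes the documented $name rule: ASCII letters, digits, underscores, and namespace separators (two or more colons). It does not cover the $name(index) array form.

```tcl
# Hypothetical helper: returns 1 if "name" survives bare $-substitution,
# i.e. consists only of ASCII letters, digits, underscores, and "::"
# namespace separators. This is a sketch, not a built-in string class.
proc isBareSubstName {name} {
    return [regexp {^(?:[A-Za-z0-9_]|::+)+$} $name]
}

foreach name {foo_bar foo-bar -bat a::b} {
    puts [format "%-8s %d" $name [isBareSubstName $name]]
}
```

By this test foo_bar and a::b pass, while foo-bar and -bat do not, which matches the session above.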

Adrian Ho

Mar 27, 2006, 7:38:38 PM
On 2006-03-27, Bob Binder <nos...@domain.com> wrote:
> Although this behavior is apparently consistent with the cited Unicode
> standard, it is inconvenient. I'd like a built-in string class that
> recognizes all valid Tcl names, including option identifiers with their
> leading dash, the dollar sign, etc.

If by "valid Tcl names" you mean "valid Tcl identifiers", then pretty
much *all* characters (including whitespace) are legal in one (some form
of escaping may of course be necessary in certain cases).
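
As a small sketch of what that escaping looks like (the names "two words" and "odd name!" are made up for illustration):

```tcl
# A variable name containing a space: brace it when setting and reading.
set {two words} 1
puts ${two words}

# A proc name containing a space and punctuation: brace the command word.
proc {odd name!} {} { return ok }
puts [{odd name!}]
```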

Perhaps if you could describe the problem you're actually trying to
solve, we may be able to suggest a different approach.

- Adrian

Adrian Ho

Mar 27, 2006, 7:41:59 PM
On 2006-03-27, Bob Binder <nos...@domain.com> wrote:

Them pesky Romans!

"IV, stat!"
"Four what?"

- Adrian

Andreas Leitgeb

Mar 29, 2006, 12:41:12 PM
Adrian Ho <t...@03s.net> wrote:
> On 2006-03-27, Bob Binder <nos...@domain.com> wrote:
>> Although this behavior is apparently consistent with the cited Unicode
>> standard, it is inconvenient. I'd like a built-in string class that
>> recognizes all valid Tcl names, including option identifiers with their
>> leading dash, the dollar sign, etc.

> If by "valid Tcl names" you mean "valid Tcl identifiers", then pretty
> much *all* characters (including whitespace) are legal in one (some form
> of escaping may of course be necessary in certain cases).

Exceptions: a pair of parentheses "(...)", whose closing
one is right at the end of the "varname",
and two subsequent colons "::"
may induce some meta-meaning.
These *can* be contained in valid varnames, but not arbitrarily.

e.g.
"abc)))(((())))(((" is a valid (non-array) varname
"abc)))(((())))((()" denotes an array element "((())))(((" of
an array named "abc)))"
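
A minimal sketch to check the two names above in a fresh interpreter (the values "scalar" and "element" are arbitrary placeholders):

```tcl
# No trailing ")" here, so the whole string is one scalar variable name.
set {abc)))(((())))(((} scalar

# Trailing ")" here, so Tcl splits at the first "(": array name plus index.
set {abc)))(((())))((()} element

puts [array exists {abc)))(((())))(((}]
puts [array exists {abc)))}]
puts [array names {abc)))}]
```

The first array exists call checks the scalar, the second checks the array that the second set created as a side effect.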
