8.2 Truncating strings


Robin Becker

Sep 14, 1999, 3:00:00 AM
I'm having problems truncating strings in the DLL caller. Some APIs
require a buffer of known length to be passed in so in addition we have
a length.

Under 8.0.x I used

Tcl_SetObjLength(sObj, strlen(Tcl_GetStringFromObj(sObj,0)));

but this fails under 8.1/8.2 so how do I truncate strings?

I might add that I allocated the buffers with

Tcl_SetByteArrayLength( objPtr, length )

it seems to me that I might need to distinguish two different string
types in the dll caller.
--
Robin Becker

Paul Duffin

Sep 14, 1999, 3:00:00 AM
Robin Becker wrote:
>
> I'm having problems truncating strings in the DLL caller. Some APIs
> require a buffer of known length to be passed in so in addition we have
> a length.
>
> Under 8.0.x I used
>
> Tcl_SetObjLength(sObj, strlen(Tcl_GetStringFromObj(sObj,0)));
>

That would truncate the string at the first NUL byte but not a specific
length.

> but this fails under 8.1/8.2 so how do I truncate strings?
>

Of course that will fail because strings in 8.1/2 are UTF-8 and as such
do not contain any embedded NULs, hence no truncation can occur.
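
A minimal C sketch of why strlen() no longer stops at an embedded NUL, assuming (as Tcl 8.1 and later do internally) that U+0000 is stored as the two bytes 0xC0 0x80 rather than as a zero byte. The helper name count_embedded_nuls is made up for illustration and is not part of the Tcl API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count embedded NUL characters in a Tcl-style UTF-8 C string, where
 * U+0000 is stored as the byte pair 0xC0 0x80 (an assumption stated in
 * the text above). The string itself still ends at the first zero byte. */
size_t count_embedded_nuls(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; p++) {
        if (p[0] == 0xC0 && p[1] == 0x80) {
            n++;
            p++;    /* skip the second byte of the pair */
        }
    }
    return n;
}
```

With such a string, strlen() reports the full byte length, so a call like Tcl_SetObjLength(obj, strlen(...)) leaves the value unchanged.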

I am confused as to exactly how you want the truncation to occur, could
you give some more information like a C API and some Tcl code to call
it.

> I might add that I allocated the buffers with
>
> Tcl_SetByteArrayLength( objPtr, length )
>
> it seems to me that I might need to distinguish two different string
> types in the dll caller.

Almost certainly, in fact probably at least three 'character' types.

A buffer of bytes.
A string of UTF-8.
A string of Unicode.

--
Paul Duffin
DT/6000 Development Email: pdu...@hursley.ibm.com
IBM UK Laboratories Ltd., Hursley Park nr. Winchester
Internal: 7-246880 International: +44 1962-816880

Robin Becker

Sep 14, 1999, 3:00:00 AM
In article <37DE2F...@mailserver.hursley.ibm.com>, Paul Duffin
<pdu...@mailserver.hursley.ibm.com> writes

>Robin Becker wrote:
>>
>> I'm having problems truncating strings in the DLL caller. Some APIs
>> require a buffer of known length to be passed in so in addition we have
>> a length.
>>
>> Under 8.0.x I used
>>
>> Tcl_SetObjLength(sObj, strlen(Tcl_GetStringFromObj(sObj,0)));
>>
>
>That would truncate the string at the first NUL byte but not a specific
>length.
>
>> but this fails under 8.1/8.2 so how do I truncate strings?
>>
>
>Of course that will fail because strings in 8.1/2 are UTF-8 and as such
>do not contain any embedded NULs, hence no truncation can occur.
>
>I am confused as to exactly how you want the truncation to occur, could
>you give some more information like a C API and some Tcl code to call
>it.
>
>> I might add that I allocated the buffers with
>>
>> Tcl_SetByteArrayLength( objPtr, length )
>>
>> it seems to me that I might need to distinguish two different string
>> types in the dll caller.
>
>Almost certainly, in fact probably at least three 'character' types.
>
> A buffer of bytes.
> A string of UTF-8.
> A string of Unicode.
>
Old-style C strings were all that 8.0.x supported. There is no problem
with the bytes style, as we don't have to truncate it. The user can
specify *256 for an arg; this means we pass the name of a Tcl variable,
which will be forced to a length of 256 bytes.

Under 8.0.x you could say *256s, meaning this was supposed to be a
string variable. As a convenience only, I truncated the buffer at the
first null byte. This was easy under the old Tcl API. Now that we can
have both wide chars and simple chars, I need to distinguish which kind
of length applies. The simple C string version of my original
truncation will be
Allocation
Tcl_SetByteArrayLength( sObj, 256+1);

and the truncation will be
Tcl_SetByteArrayLength(sObj, strlen(Tcl_GetByteArrayFromObj(sObj,0)));

so the wide character version ought to be

Allocation
Tcl_SetByteArrayLength( sObj, 2*256+1);

and the truncation will be
Tcl_SetObjLength(sObj, wcslen(Tcl_GetStringFromObj(sObj,0)));

but I'm not really certain about these last.
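
A plain C sketch of the two length calculations, independent of Tcl, may help separate the issues. The helper names are hypothetical. The sketch assumes the external function may leave the buffer without a terminator, so both searches are bounded by the buffer capacity, and it counts wide strings in characters rather than bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Length in bytes up to the first zero byte, never reading past cap. */
size_t byte_string_length(const unsigned char *buf, size_t cap)
{
    assert(buf != NULL);
    const unsigned char *nul = (const unsigned char *)memchr(buf, 0, cap);
    return nul ? (size_t)(nul - buf) : cap;
}

/* Length in wide characters up to the first L'\0', never reading past
 * cap elements. */
size_t wide_string_length(const wchar_t *buf, size_t cap)
{
    size_t n = 0;
    assert(buf != NULL);
    while (n < cap && buf[n] != L'\0') {
        n++;
    }
    return n;
}

/* Bytes needed for max_chars wide characters plus one wide terminator. */
size_t wide_buffer_bytes(size_t max_chars)
{
    return (max_chars + 1) * sizeof(wchar_t);
}
```

Note that with 16-bit wide characters, 256 characters plus a terminator need 2*(256+1) bytes, not 2*256+1, and the result of wide_string_length() must be multiplied by sizeof(wchar_t) before it is used as a byte count, for example as the argument to Tcl_SetByteArrayLength.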
--
Robin Becker

Paul Duffin

Sep 14, 1999, 3:00:00 AM

You need to use the new Unicode functions to manage the Unicode form
of the string.

> but I'm not really certain about these last.

Windows APIs can take any of the following different 'string' types.
Some of the lengths may be fixed by the API and not require a distinct
length argument.

Buffer of bytes with length (input/output)
A NUL terminated byte string (input)
A NUL terminated UTF-8 string (input) (whatever UTF-8 NUL is)
A UTF-8 string with length (input/output)
A Unicode string with length (input/output)

Your DLL caller needs to be able to differentiate between these types
of string so that it can format the input data correctly, set the
length correctly and format the output data correctly.
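
One way to picture that differentiation is a small tag type, sketched here in C. All names are hypothetical, and the sketch assumes that "Unicode" means 16-bit code units, as in the Win32 wide-character APIs.

```c
#include <assert.h>
#include <stddef.h>

/* The five kinds of 'string' argument listed above. */
typedef enum {
    STR_BYTES_WITH_LEN,    /* buffer of bytes with length (input/output) */
    STR_BYTES_NUL,         /* NUL-terminated byte string (input) */
    STR_UTF8_NUL,          /* NUL-terminated UTF-8 string (input) */
    STR_UTF8_WITH_LEN,     /* UTF-8 string with length (input/output) */
    STR_UNICODE_WITH_LEN   /* Unicode string with length (input/output) */
} StringKind;

/* Size of one storage unit in bytes (assumption: 16-bit Unicode units). */
size_t unit_size(StringKind kind)
{
    return kind == STR_UNICODE_WITH_LEN ? 2 : 1;
}

/* Nonzero if the callee relies on a terminator instead of a length. */
int has_terminator(StringKind kind)
{
    return kind == STR_BYTES_NUL || kind == STR_UTF8_NUL;
}

/* Bytes to allocate for `units` storage units of the given kind. */
size_t buffer_bytes(StringKind kind, size_t units)
{
    assert(kind >= STR_BYTES_WITH_LEN && kind <= STR_UNICODE_WITH_LEN);
    return (units + (has_terminator(kind) ? 1 : 0)) * unit_size(kind);
}
```

The caller would pick the kind from the declared argument type of the DLL function, never from the data it happens to be given.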

Robin Becker

Sep 14, 1999, 3:00:00 AM
In article <37DE4E...@mailserver.hursley.ibm.com>, Paul Duffin
<pdu...@mailserver.hursley.ibm.com> writes
>Robin Becker wrote:
...
What are the new Unicode functions? I had expected to see functions of
the form Tcl_GetUTF8FromObj etc., but it seems that UTF-8 is the
expected default, so which are the 'String' functions and which aren't?
I had assumed that someone reading this could help.

>
>> but I'm not really certain about these last.
>
>Windows APIs can take any of the following different 'string' types.
>Some of the lengths may be fixed by the API and not require a distinct
>length argument.
>
> Buffer of bytes with length (input/output)
> A NUL terminated byte string (input)
> A NUL terminated UTF-8 string (input) (whatever UTF-8 NUL is)
> A UTF-8 string with length (input/output)
> A Unicode string with length (input/output)
looking in VC++ help I see for

strlen, wcslen, _mbslen, _mbstrlen
Get the length of a string.

size_t strlen( const char *string );

size_t wcslen( const wchar_t *string );

size_t _mbslen( const unsigned char *string );

size_t _mbstrlen( const char *string );

....
Parameter

string: Null-terminated string


i.e. I guess that M$ (at least) assumes all these string types have an
explicit zero terminator. Indeed, parts of the Win32 API are
inconsistent in that they return lengths in characters, allowing for
the fact that the text may mix ASCII and Unicode or be DBCS, e.g.
GetWindowTextLength.

So what is the UTF8 API in Tcl?


>
>Your DLL caller needs to be able to differentiate between these types
>of string so that it can format the input data correctly, set the
>length correctly and format the output data correctly.
>

--
Robin Becker

Robin Becker

Sep 14, 1999, 3:00:00 AM
OK, I guess I can narrow my confusion down to the following. It seems
that Tcl now has three string-like types:

the ByteArray, String and Unicode thingies. What is used for UTF-8?
Plus there is a whole mess in the stuff about encodings.

I need to be able to allocate sufficient space for a particular type and
truncate in some sensible way when an external function may not fill the
whole thing. I am very confused by statements like the following from
the ByteArrayObj help.

Obtaining the string representation of a byte-array object (by calling
Tcl_GetStringFromObj) produces a properly formed UTF-8 sequence with a
one-to-one mapping between the bytes in the internal representation and
the UTF-8 characters in the string representation.
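
Read literally, that statement says each byte value 0..255 becomes the character U+0000..U+00FF, which is then written out as UTF-8. A hedged C sketch of that mapping follows; it is an illustration, not Tcl's own code, the function name is hypothetical, and it assumes a zero byte is written as 0xC0 0x80 as Tcl does internally.

```c
#include <assert.h>
#include <stddef.h>

/* Encode n raw bytes as UTF-8, treating each byte as the code point of
 * the same value. `out` must have room for 2*n bytes. Returns the number
 * of bytes written. */
size_t bytes_to_utf8(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t w = 0;
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        unsigned char b = in[i];
        if (b != 0 && b < 0x80) {
            out[w++] = b;                                   /* ASCII: 1 byte */
        } else {
            out[w++] = (unsigned char)(0xC0 | (b >> 6));    /* lead byte */
            out[w++] = (unsigned char)(0x80 | (b & 0x3F));  /* continuation */
        }
    }
    return w;
}
```

So the string representation has one character per original byte, but its length in bytes can be up to twice the byte-array length.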

Do Unicode-producing functions have to set a length somehow? I mean, is
there no easy way to tell the length of these things? Or is it UTF-8
that's tough?

Since UTF-8 is Tcl's internal representation, it's unlikely to appear in
external APIs unless these come from Tcl, or am I being real stupid?
--
Robin Becker

Heribert Dahms

Sep 14, 1999, 3:00:00 AM
In <u6JqtDAl...@jessikat.demon.co.uk> ro...@jessikat.demon.co.uk writes:

: I need to be able to allocate sufficient space for a particular type and
: truncate in some sensible way when an external function may not fill the
: whole thing.

I'd just zero-out the whole buffer before calling the function.
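
A small C sketch of that approach, under the assumption that the external function writes text containing no zero bytes of its own. Both function names are hypothetical, and sample_fill is only a stand-in for a real DLL entry point.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an external API that writes a short string
 * into the caller's buffer without reporting how much it wrote. */
void sample_fill(char *buf, size_t cap)
{
    const char text[] = "ok";
    if (cap >= sizeof text - 1) {
        memcpy(buf, text, sizeof text - 1);   /* deliberately no terminator */
    }
}

/* Zero the buffer, call the function, then measure what it wrote. */
size_t call_and_measure(void (*fill)(char *, size_t), char *buf, size_t cap)
{
    assert(buf != NULL && cap > 0);
    memset(buf, 0, cap);     /* every byte the callee skips stays zero */
    fill(buf, cap - 1);      /* keep the last byte as a guaranteed NUL */
    return strlen(buf);
}
```

This only helps for NUL-terminated text; for genuine binary data the length has to come from the API itself.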


: Do unicode producing functions have to set a length somehow? I mean is
: there no easy way to tell the length of these things. Or is it utf8
: that's tough.

IIRC it is the 16-bit Unicode form that is terminated by two NUL bytes;
a UTF-8 C string ends with a single NUL byte.


Bye, Heribert (da...@ifk20.mach.uni-karlsruhe.de)

Johan Bengtsson

Sep 15, 1999, 3:00:00 AM
I'm tearing my hair out because of this... could someone please tell me
why this code isn't working? When I run the script with this code it just
hangs without an error message. It seems to be a problem with using grid
in a proc... or maybe not??

set entrylist {title subtitle date importfile songs imagefile outputfile}

set r 0
set re 0
global r re

proc cr_ent {name} {
    global r re
    set lb lb
    incr r
    incr re
    label .one.$name$lb -text [eval string toupper $name]
    grid .one.$name$lb -row "$r" -column 1 -padx 5 -pady 2
    entry .one.$name -background white -width 20 -textvariable $name
    grid .one.$name -row "$re" -column 2 -padx 5 -pady 2

    # this works....but grid would be better!
    # pack [eval label .one.$name$lb -text [eval string toupper $name] ] -side top -padx 5 -pady 2
    # pack [eval entry .one.$name -background white -width 20 -textvariable $name] -side top -padx 5 -pady 2
}

foreach value $entrylist {
    cr_ent $value
}

???????????????????:-((((((!!!!!!!

Thanks for your help,
Johan B.

Gerhard Hintermayer

Sep 15, 1999, 3:00:00 AM
Johan Bengtsson wrote:

I tried the code under 8.0 and it works fine (apart from the missing
frame definition for .one): no hangs, no errors.
Maybe insert some puts calls to see where the program hangs.
Gerhard

Donal K. Fellows

Sep 15, 1999, 3:00:00 AM
In article <u6JqtDAl...@jessikat.demon.co.uk>,
Robin Becker <ro...@jessikat.demon.co.uk> wrote:
> OK I guess I can narrow my confusion down to the following. It seems
> that Tcl now has three string like types.
>
> The ByteArray, String and Unicode thingies. What is used for UTF-8? Plus
> a whole mess in the stuff about encodings.

There are two types, ByteArray and String (Unicode), and one
pseudo-type - the string rep in the Tcl_Obj structure.

Desired type of data | Function to get data    | Function to get length
---------------------+-------------------------+------------------------
ByteArray            | Tcl_GetByteArrayFromObj | Tcl_GetByteArrayFromObj
Unicode              | Tcl_GetUnicode          | Tcl_GetCharLength
UTF-8                | Tcl_GetStringFromObj    | Tcl_GetStringFromObj

Don't attempt to work out which is the most appropriate from the data
being passed in to you; just go from the definition of what the
functions in question want as input.

When going the other way, you need to know the length of the data you
want to bring into Tcl (if you don't know it and can't work it out, you
really are SOL), but knowing that, you can quite easily use:

Tcl_NewByteArrayObj (for byte sequences)
Tcl_NewStringObj (for UTF-8 strings)
Tcl_SetUnicodeObj (for Unicode strings)

(All this is working from the 8.2.0 distribution sources.)
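
The table has separate length functions because the byte length of a UTF-8 string (what Tcl_GetStringFromObj reports through its length argument) and its character count (what Tcl_GetCharLength returns) differ as soon as a non-ASCII character appears. A standalone C sketch of the difference, assuming well-formed UTF-8; the function name is hypothetical and this is not how Tcl itself is implemented:

```c
#include <assert.h>
#include <stddef.h>

/* Count characters in nbytes of well-formed UTF-8 by counting every byte
 * that is not a continuation byte (continuation bytes look like 10xxxxxx). */
size_t utf8_char_count(const char *s, size_t nbytes)
{
    size_t chars = 0;
    assert(s != NULL);
    for (size_t i = 0; i < nbytes; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            chars++;
        }
    }
    return chars;
}
```

For example, the six bytes of "na\xC3\xAFve" hold five characters, so a buffer size derived from one count cannot be used as the other.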

HTH!

Donal.
--
Donal K. Fellows http://www.cs.man.ac.uk/~fellowsd/ fell...@cs.man.ac.uk
-- The small advantage of not having California being part of my country would
be overweighed by having California as a heavily-armed rabid weasel on our
borders. -- David Parsons <o r c @ p e l l . p o r t l a n d . o r . u s>

Donal K. Fellows

Sep 15, 1999, 3:00:00 AM
In article <937355712....@news.lls.se>,
Johan Bengtsson <joh...@lls.se> wrote:
> I'm tearing my hair out because of this ...could someone please tell
> why this code isn't working. When I run the script with this code
> it just hangs without an error message. It seems to be a problem
> with using grid in a proc...or maybe not??

As it stands, the code should work exactly as advertised. *However*
if you happen to have anything [pack]ed in .one as well as all your
[grid]ded widgets then you will be in trouble (since the two geometry
managers end up fighting over window sizes, which manifests itself as
a hang.) The fact that your code works when you use [pack] instead of
[grid] is also consistent with this diagnosis.

Solutions:
 a) Put the widgets to be gridded into a frame that you then pack
    into .one - this is probably the best solution for you, and it is
    really easy to do.
 b) Convert to either using [grid] or [pack] throughout - fine if you
    can do this, but it does change the look of your GUI.
 c) Turn off propagation for either [grid] or [pack] on .one - almost
    certainly not what you want to do, but if it works (i.e. stops
    the hang), you know that my diagnosis is correct...

Robin Becker

Sep 15, 1999, 3:00:00 AM
In article <7ro35r$7rj$1...@m1.cs.man.ac.uk>, Donal K. Fellows
<fell...@cs.man.ac.uk> writes
thanks Donal; this is the expertise I was hoping to find. I guess
there's a missing strlen in the top right cell.
--
Robin Becker

Alexandre Ferrieux

Sep 15, 1999, 3:00:00 AM
Donal K. Fellows wrote:
>
> There are two types, ByteArray and String (Unicode), and one
> pseudo-type - the string rep in the Tcl_Obj structure.
> ...
> (All this is working from the 8.2.0 distribution sources.)

Also, please keep in mind that Paul has raised valid points about how
unsatisfactory this (8.2) setup is: namely, the fact that
String(Unicode) is a Tcl type per se, which leads to heavy shimmering at
seemingly innocuous points (like string manipulations for debugging).

I know that Jeff has more or less expressed agreement, and that
something better was in the works. All this to say: Robin, once you're
done with 8.2, be prepared to spend yet another major amount of energy
for 8.3 or 8.4.

Personal feeling: that transparent introduction of unicode-awareness in
routines that were traditionally used also for binary (like [string
range]) was a BAD DECISION. Reason: before that anyway, non-latin
characters were not at all usable. So why not have defined an entire set
of explicitly-unicode string-handling routines (like [unicode
range]...) ??? Jeff ?

Now it looks like we're stuck with a mess. Should we stick to it and
spend 10x the energy to work around every gotcha, or should we take the
courageous option of cutting that rotten branch (it may not be too late,
but time is running) ? Jeff ?

-Alex

Robin Becker

Sep 15, 1999, 3:00:00 AM
In article <37DF99...@cnet.francetelecom.fr>, Alexandre Ferrieux
<alexandre...@cnet.francetelecom.fr> writes
I seem to have some of the same attitudes Alex; I guess I am just
resisting change. The principal proponent of the Unicode &
Internationalisation things is M$. Looking at how they've approached
this problem might be instructive. It seems they introduced another base
type ie the Unicode string and then the stuff to support it in the form
of codepages and the like. Having got the new type they've set about
duplicating the API for all types which have ascii string arguments.

I guess this is impossible for tcl as one of the primitive types was a
string. What I can't figure out is why an intermediate was chosen which
wasn't in one of the two categories. I guess for performance reasons and
perhaps to allow for stuff which I can't think of like Koreans writing
scripts in Sanskrit.

I guess the main fault I have with the tcl mess is that there remain
some very commonly used API calls which have changed quite subtly. For
me they may work for a while until I accidentally embed a null character
and then all hell breaks loose. Moan whinge whine; mumble mumble etc etc
:)
--
Robin Becker

Jeffrey Hobbs

Sep 15, 1999, 3:00:00 AM
to Alexandre Ferrieux
Alexandre Ferrieux wrote:
> Donal K. Fellows wrote:
> >
> > There are two types, ByteArray and String (Unicode), and one
> > pseudo-type - the string rep in the Tcl_Obj structure.
> > ...
> > (All this is working from the 8.2.0 distribution sources.)
>
> Also, please keep in mind that Paul has raised valid points about how
> unsatisfactory this (8.2) setup is: namely, the fact that
> String(Unicode) is a Tcl type per se, which leads to heavy shimmering at
> seemingly innocuous points (like string manipulations for debugging).

Donal provided a very good, concise view into the state of 8.2.
While it is an improvement over 8.1 in the cases where Unicode can
bite you, it was just an interim hack (no, not mine) to improve
things.

> done with 8.2, be prepared to spend yet another major amount of energy
> for 8.3 or 8.4.

This could be the case. However, we are looking at the Feather stuff
to improve this to the point that old stuff will magically be more
efficient, although there might also be a newer way to make it even
better.

> Personal feeling: that transparent introduction of unicode-awareness in
> routines that were traditionally used also for binary (like [string
> range]) was a BAD DECISION. Reason: before that anyway, non-latin
> characters were not at all usable. So why not have defined an entire set
> of explicitly-unicode string-handling routines (like [unicode
> range]...) ??? Jeff ?

I argued for making the binary command match the string command, so
there was a [binary range|replace|index|...], but ... This might change
later (upwards compatible). We don't want to separate the Unicode ops,
this would be a step backwards. We need to just improve how things are
handled under the covers.

--
Jeffrey Hobbs The Tcl Guy
jeffrey.hobbs at scriptics.com Scriptics Corp.

Alexandre Ferrieux

Sep 16, 1999, 3:00:00 AM
Jeffrey Hobbs wrote:
>
> Alexandre Ferrieux wrote:
> > [Robin, once you're]
> > done with 8.2, be prepared to spend yet another major amount of energy
> > for 8.3 or 8.4.
>
> This could be the case. However, we are looking at the Feather stuff
> to improve this to the point that old stuff will magically be more
> efficient, although there might also be a newer way to make it even
> better.

Very nice if it is possible. However, at each such "success" the overall
internal tension of Tcl increases. What about next time we face a
similar problem ? Deflating the bubble early takes courage, but is an
excellent investment.

-Alex

joh...@lls.se

Sep 16, 1999, 3:00:00 AM
In article <7ro3ml$824$1...@m1.cs.man.ac.uk>,
fell...@cs.man.ac.uk (Donal K. Fellows) wrote:
> In article <937355712....@news.lls.se>,
> Johan Bengtsson <joh...@lls.se> wrote:
> > I'm tearing my hair out because of this ...could someone please tell
> > why this code isn't working. When I run the script with this code
> > it just hangs without an error message. It seems to be a problem
> > with using grid in a proc...or maybe not??
>
> As it stands, the code should work exactly as advertised. *However*
> if you happen to have anything [pack]ed in .one as well as all your
> [grid]ded widgets then you will be in trouble (since the two geometry
> managers end up fighting over window sizes, which manifests itself as
> a hang.) The fact that your code works when you use [pack] instead of
> [grid] is also consistent with this diagnosis.
>
> Solutions:
> a) Put the widgets to be gridded into a frame that you then pack
> into .one - this is probably the best solution for you, and it is
> really easy to do.
> b) Convert to either using [grid] or [pack] throughout - fine if you
> can do this, but it does change the look of your GUI.
> c) Turn off propagation for either [grid] or [pack] on .one - almost
> certainly not what you want to do, but if it works (i.e. stops
> the hang,) you know that my diagnosis is correct...

Your diagnosis is absolutely correct! Thank you!:-))

Regards,
Johan B.



Sent via Deja.com http://www.deja.com/
Share what you know. Learn what you don't.

Jeffrey Hobbs

Sep 16, 1999, 3:00:00 AM
to Alexandre Ferrieux

I think we could keep that tension running until 9, when we have a
better opportunity to fiddle with the APIs and backwards compatibility.

lvi...@cas.org

Sep 16, 1999, 3:00:00 AM

According to Robin Becker <ro...@jessikat.demon.co.uk>:
:resisting change. The principal proponent of the Unicode &
:Internationalisation things is M$. Looking at how they've approached

Perhaps principal due to sheer mass, but Java is Unicode incarnate. And
i18n and threads as well I believe. So it seems to me that if we are
going to continue to communicate and be able to be embedded in future
apps, we need to deal with the situation once and for all - in a compatible
manner.

--
<URL: mailto:lvi...@cas.org> Quote: Save us from the snobs.
<*> O- <URL: http://www.purl.org/NET/lvirden/>
Unless explicitly stated to the contrary, nothing in this posting
should be construed as representing my employer's opinions.
