PLEAC: translation of Perl Cookbook

Peter Lewerin

Aug 5, 2001, 4:29:07 PM

There is a project at <URL: http://pleac.sourceforge.net/> that aims
to provide translations of the snippets in the Perl Cookbook into
other programming languages.

There are no Tcl translations yet. Has anyone considered doing them?
I suppose that the Cookbook solutions primarily scratch Perl itches,
but shouldn't Tcl be in on the fun anyway?

-- Peter "Not Drunk This Time, Honest" Lewerin

Peter Lewerin

Aug 6, 2001, 4:49:58 AM

BTW, is this translation from Perl idiomatic?

# split at five byte boundaries (Perl)
@fivers = unpack("A5" x (length($string)/5), $string);

# split at five byte boundaries
set fivers [list]
while {[binary scan $string A5 group]} {
    lappend fivers $group
    set string [string range $string 5 end]
}

Richard.Suchenwirth

Aug 6, 2001, 5:45:37 AM

...or this?

set fivers [list]
while {[regexp (.{1,5}) $string -> group]} {
    lappend fivers $group
    set string [string range $string 5 end]
}

--
Schoene Gruesse/best regards, Richard Suchenwirth - +49-7531-86 2703
Siemens Dematic AG, PA RC D2, Buecklestr.1-5, 78467 Konstanz,Germany
Personal opinions expressed only unless explicitly stated otherwise.

Cameron Laird

Aug 6, 2001, 9:54:42 AM

In article <3B6E6741...@kst.siemens.de>,
Richard.Suchenwirth <Richard.S...@kst.siemens.de> wrote:
>Peter Lewerin wrote:
>>
>> BTW, is this translation from Perl idiomatic?
>>
>> # split at five byte boundaries (Perl)
>> @fivers = unpack("A5" x (length($string)/5), $string);
>>
>> # split at five byte boundaries
>> set fivers [list]
>> while {[binary scan $string A5 group]} {
>> lappend fivers $group
>> set string [string range $string 5 end]
>> }
>...or this?
>
>set fivers [list]
>while {[regexp (.{1,5}) $string -> group]} {
> lappend fivers $group
> set string [string range $string 5 end]
>}

.
.
.
Dangerously so, is my reaction. Note that the Tcl
suggestions modify $string, unlike the Perl one.
--

Cameron Laird <cla...@NeoSoft.com>
Business: http://www.Phaseit.net
Personal: http://starbase.neosoft.com/~claird/home.html

mig

Aug 6, 2001, 12:41:06 PM

Cameron Laird wrote:

> In article <3B6E6741...@kst.siemens.de>,
> Richard.Suchenwirth <Richard.S...@kst.siemens.de> wrote:
> >Peter Lewerin wrote:
> >>
> >> BTW, is this translation from Perl idiomatic?
> >>
> >> # split at five byte boundaries (Perl)
> >> @fivers = unpack("A5" x (length($string)/5), $string);
> >>
> >> # split at five byte boundaries
> >> set fivers [list]
> >> while {[binary scan $string A5 group]} {
> >> lappend fivers $group
> >> set string [string range $string 5 end]
> >> }
> >...or this?
> >
> >set fivers [list]
> >while {[regexp (.{1,5}) $string -> group]} {
> > lappend fivers $group
> > set string [string range $string 5 end]
> >}
> .
> .
> .
> Dangerously so, is my reaction. Note that the Tcl
> suggestions modify $string, unlike the Perl one.

How about

set fivers [regexp -all -inline {.{1,5}} $string]

This will split at 5 *character* boundaries; I do not quite know
if/when/how some byte sequences may be interpreted by the regexp engine
as multibyte characters ...

For that matter, I do not know how [binary scan] handles multibyte
characters either. Does the first version really split at 5 byte
boundaries? It isn't too difficult to change it so as to avoid changing
the string: either using the positioner for binary (@), or else by doing
the whole thing on a temporary copy of the string.

Miguel Sofer
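
A minimal sketch of the positioner idea mentioned above, assuming plain 8-bit data: the @ specifier moves the [binary scan] cursor to an absolute byte offset, so the loop can walk through $string without modifying it. The variable name pos and the sample string are illustrative only, and a5 is used instead of A5 so that trailing blanks in a group are kept.

```tcl
# Split $string into 5-byte groups without modifying it.
set string "abcdefghijklmno"
set fivers [list]
set pos 0

# "@$pos" positions the scan cursor at byte offset $pos; "a5" then reads
# five bytes. [binary scan] returns 0 once fewer than five bytes remain.
while {[binary scan $string "@${pos}a5" group] == 1} {
    lappend fivers $group
    incr pos 5
}

puts $fivers   ;# the three groups; $string itself is untouched
```

Like the Perl original, this drops a trailing group shorter than five bytes.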

Cameron Laird

Aug 6, 2001, 12:14:09 PM

.
.
.
It's going to amuse me no end if Tcl recipes end up
being more succinct than the corresponding ones in
Perl, the language notorious for its brevity.

Peter Lewerin

Aug 6, 2001, 12:45:22 PM

>...or this?

>while {[regexp (.{1,5}) $string -> group]} {

Well, yes. I tried to keep the translation as close as possible;
unpack() -- [binary scan]. [string range] could probably do the job
too, otherwise.

It just worried me that I couldn't create a pack format "A5 A5..." and
dump the result in a list, similar to the Perl code.

Peter Lewerin

Aug 6, 2001, 12:52:31 PM

>It's going to amuse me no end if Tcl recipes end up
>being more succinct than the corresponding ones in
>Perl, the language notorious for its brevity.

Currently, fewer than 270 lines of Perl code and comments translate to
375 lines of Tcl code and comments, so I think Tcl is going to lose
that one. I suppose a more experienced Tcl'er could trim my code a
bit, but not *that* much.

OTOH, the Tcl code wins the clarity award easily.

Richard.Suchenwirth

Aug 6, 2001, 1:02:30 PM

[string repeat A5 [expr ([string length $string]+4)/5]] does the first,
and [regexp -all -inline ...] is even conciser, as Cameron noted.

Jeffrey Hobbs

Aug 6, 2001, 2:46:05 PM

Peter Lewerin wrote:
>
> BTW, is this translation from Perl idiomatic?
>
> # split at five byte boundaries (Perl)
> @fivers = unpack("A5" x (length($string)/5), $string);

Remember that it isn't necessary for us to always do one-liners
to match Perl (although I like the regexp version). What is key
is that we are technically better - easier to read or more
"correct" code. We can achieve this by truly improving on the
original Perl.

The above case does state "byte boundaries", but then the sample
that follows it says "chop string into individual characters"
and still uses unpack A1. Perhaps adding 'use utf8;' will make
that work, but not as written.

Thus, you should provide both solutions - breaking into bytes
and into chars, noting for the non-i18n-sensitive crowd that
there really is a difference nowadays. :)

--
Jeff Hobbs The Tcl Guy
Senior Developer http://www.ActiveState.com/
Tcl Support and Productivity Solutions

Cameron Laird

Aug 6, 2001, 3:02:54 PM

In article <3B6EE681...@ActiveState.com>,
Jeffrey Hobbs <Je...@ActiveState.com> wrote:
.
.
.

>Remember that it isn't necessary for us to always do one-liners
>to match Perl (although I like the regexp version). What is key
>is that we are technically better - easier to read or more
>"correct" code. We can achieve this by truly improving on the
>original Perl.
>
>The above case does state "byte boundaries", but then the sample
>that follows it says "chop string into individual characters"
>and still uses unpack A1. Perhaps adding 'use utf8;' will make
>that work, but not as written.
>
>Thus, you should provide both solutions - breaking into bytes
>and into chars, noting for the non-i18n-sensitive crowd that
>there really is a difference nowadays. :)

.
.
.
Seconded.

Peter Lewerin

Aug 6, 2001, 3:08:50 PM

>> It just worried me that I couldn't create a pack format "A5 A5..." and
>> dump the result in a list, similar to the Perl code.
>
>[string repeat A5 [expr ([string length $string]+4)/5]] does the first,
>and [regexp -all -inline ...] is even conciser, as Cameron noted.

Yes, the format building part can be done; I think this is most
similar to the original code:

set numGroups [expr {[string length $string]/5}]
set formatStr [string repeat "A5 " $numGroups]

(Yes, I know that it leaves out any character group of <5 characters
in the tail.)

This is what I used in my original attempt. *Then* comes the hard
part:

binary scan $string $formatStr WHAT?

That's what stumped me. I tried building a string of variable names,
but it looked weird and didn't work anyway. I tried to find a way to
set up a list to receive the scanned groups, no joy.
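
One way the missing piece could be sketched, offered as an assumption rather than an established idiom: generate one variable name per group, let [eval] splice the names into the [binary scan] call, and then read the variables back into a list. The names g0, g1, ... and the sample string are hypothetical.

```tcl
set string "abcdefghijklmnopq"   ;# 17 characters: three full groups plus "pq"
set numGroups [expr {[string length $string] / 5}]
set formatStr [string repeat "A5 " $numGroups]

# One receiving variable per group: g0, g1, g2, ...
set varNames [list]
for {set i 0} {$i < $numGroups} {incr i} {
    lappend varNames g$i
}

# [eval] concatenates its arguments, so each name in $varNames becomes a
# separate argument; [list] keeps $string and $formatStr intact as words.
eval [list binary scan $string $formatStr] $varNames

# Gather the scanned groups into a list.
set fivers [list]
foreach name $varNames {
    lappend fivers [set $name]
}
puts $fivers
```

This stays close to the Perl unpack() call, at the cost of a throwaway variable per group; the incomplete tail is dropped, as noted above.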

Peter Lewerin

Aug 6, 2001, 3:43:52 PM

>The above case does state "byte boundaries", but then the sample
>that follows it says "chop string into individual characters"
>and still uses unpack A1. Perhaps adding 'use utf8;' will make
>that work, but not as written.

IMHO the Perl world has a -huge- problem with inconsistent and
obsolete code and documentation lying around. The Cookbook itself is
hardly up-to-date, and important parts of it (e.g. bareword
filehandles) will even be incompatible with Perl6. This is one of the
reasons I gave up teaching Perl.

Anyways, I'm a bit confused by your remark. In my understanding, "c"
codes scan bytes, and "a"/"A" codes scan characters. Is that wrong?

>Thus, you should provide both solutions - breaking into bytes
>and into chars, noting for the non-i18n-sensitive crowd that
>there really is a difference nowadays. :)

Um... *shifts uneasily* ...would you perhaps ...mean something like
this...?

# split at five-character boundaries
set fivers [list]
set temp $string
while {[binary scan $temp A5A* group tail]} {
    lappend fivers $group
    set temp $tail
}

# split at five-byte boundaries
set fivers [list]
set temp $string
while {[binary scan $temp c5c* group tail]} {
    lappend fivers $group
    set temp [binary format c* $tail]
}

# chop string into individual characters
set chars [list]
set temp $string
while {[binary scan $temp A1A* group tail]} {
    lappend chars $group
    set temp $tail
}

# chop string into individual bytes
set bytes [list]
set temp $string
while {[binary scan $temp c1c* group tail]} {
    lappend bytes $group
    set temp [binary format c* $tail]
}

I don't think I'm there yet, because there's no difference for
characters like åäö...

Jeffrey Hobbs

Aug 6, 2001, 4:41:39 PM

Peter Lewerin wrote:
>
> >The above case does state "byte boundaries", but then the sample
> >that follows it says "chop string into individual characters"
> >and still uses unpack A1. Perhaps adding 'use utf8;' will make
> >that work, but not as written.

...

> Anyways, I'm a bit confused by your remark. In my understanding, "c"
> codes scan bytes, and "a"/"A" codes scan characters. Is that wrong?

That's what they would like you to think, but that doesn't
appear to work (using perl 5.6.1):

$string = "H\x{2082}O";
@chars = unpack("A1" x length($string), $string);
print @chars;

That should be H-subscript2-O (three chars), but A really doesn't
scan unicode chars (which actually exposes a bug in the latest
Perl book where they don't handle this correctly). The same is
true for Tcl's binary, except that it says a "character string" of
some length is scanned, which is a little more accurate (but still
could be improved in definition, as it is referring to the C char*
definition). The 'scan' command does handle chars right with %c.

BTW, for the A1 case, you should be able to use binary scan with
A*, assuming you really only want bytes.

> # split at five-character boundaries

The 5-char boundary solution is the -inline regexp.

> # split at five-byte boundaries

That was OK.

> # chop string into individual characters

This is simply [split $string {}].

> I don't think I'm there yet, because there's no difference for
> characters like åäö...

Yeah, it depends on the translation of the problem. You might
have to preparse that down with [encoding convertto utf-8 $string]
first.
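
A small sketch of the distinction discussed above, assuming UTF-8 is the byte encoding of interest: [split] and [scan %c] work on characters, while the result of [encoding convertto utf-8] can be inspected byte by byte. The sample string is the one from the Perl test; the variable names are illustrative.

```tcl
set string "H\u2082O"

# Character view: [split] with an empty separator yields one element
# per character.
set chars [split $string {}]
set nchars [llength $chars]              ;# 3 characters

# [scan] with %c gives the full code point of a character.
scan [string index $string 1] %c code    ;# 8322, i.e. 0x2082

# Byte view: convert to UTF-8 first, then look at the bytes.
set utf8 [encoding convertto utf-8 $string]
set nbytes [string length $utf8]         ;# 1 + 3 + 1 = 5 bytes
binary scan $utf8 H* hex                 ;# the bytes as hex digits

puts "$nchars characters, $nbytes bytes: $hex"
```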

Peter Lewerin

Aug 6, 2001, 5:26:54 PM

>BTW, for the A1 case, you should be able to use binary scan with
>A*, assuming you really only want bytes.

Yep, stupid me. I stared myself blind trying to reconcile the
solutions.

>> # split at five-character boundaries
>
>The 5-char boundary solution is the -inline regexp.

Yes, except I really want to use [binary] here. The point of the
snippets isn't to solve problems but to demonstrate techniques. I am
using [regexp] in other places to match Perl regexps.

I may be stupid, but I'm not daft :-)

>> # chop string into individual characters
>
>This is simply [split $string {}].

...which I use to translate split(//, $string) in other places.

>> I don't think I'm there yet, because there's no difference for
>> characters like åäö...
>
>Yeah, it depends on the translation of the problem. You might
>have to preparse that down with [encoding convertto utf-8 $string]
>first.

That does, indeed, do the trick. I think I'll get some sleep now;
I'll put this in the vice tomorrow and see if I can polish it into
shape.

Jeffrey Hobbs

Aug 6, 2001, 5:40:04 PM

Peter Lewerin wrote:
> >> # split at five-character boundaries
> >
> >The 5-char boundary solution is the -inline regexp.
>
> Yes, except I really want to use [binary] here. The point of the
> snippets isn't to solve problems but to demonstrate techniques. I am
> using [regexp] in other places to match Perl regexps.
>
> I may be stupid, but I'm not daft :-)
>
> >> # chop string into individual characters
> >
> >This is simply [split $string {}].
>
> ...which I use to translate split(//, $string) in other places.

Really, I think that you want to go with the best solutions when
presented with the problem. The Perl Cookbook is about showing
solutions, not really demonstrating techniques. I think you want
to use the regexp and split solutions, because those are what
people should use when they want to solve that problem. If you
still want to go the other way, make sure to note that this is
just a direct translation of Perl, and not the ideal Tcl solution.

Peter Lewerin

Aug 7, 2001, 7:53:57 AM

How about this (I've kept the Perl versions for comparison)?

## get a 5-byte string, skip 3, then grab 2 8-byte strings, then the rest
#($leading, $s1, $s2, $trailing) =
# unpack("A5 x3 A8 A8 A*", $data);

# Strictly translated, this becomes:
binary scan $data "A5 x3 A8 A8 A*" leading s1 s2 trailing

# but nowadays, characters and bytes aren't necessarily the same
# size. In Tcl, strings are encoded using 16-bit Unicode characters.
# The above unpack/scan works for strings containing only character
# codes in the range 0--255, but distorts other strings by truncating
# all codes to 8 bits.

# To ensure that this distortion is avoided, the input string can be
# converted to an 8-bit encoding before scanning:

set data "H\u2082O is the chemical formula for water"
set utf8data [encoding convertto utf-8 $data]

## split at five-byte boundaries
#@fivers = unpack("A5" x (length($string)/5), $string);

set fivers [list]
set temp $utf8data

while {[binary scan $temp A5A* group tail]} {
    lappend fivers $group
    set temp $tail
}

# temp holds the leftover (fewer than five bytes), even if the loop never ran
if {[string length $temp]} { lappend fivers $temp }

# To split at five-character boundaries, this is much more convenient:
set fivers [regexp -all -inline {.{1,5}} $data]

# (Tcl regular expressions are Unicode aware, so the encoding doesn't
# have to be changed.)

## chop string into individual characters
#@chars = unpack("A1" x length($string), $string);

# Strict translation (needs to change encoding):
set chars [list]
set temp $utf8data
while {[binary scan $temp A1A* ch tail]} {
    lappend chars $ch
    set temp $tail
}

# Idiomatic translation (Unicode-aware):
set chars [split $data {}]

(Code by Miguel Sofer, Jeffrey Hobbs, and me.)

Stuart Cassoff

Aug 7, 2001, 10:09:14 AM

Peter Lewerin wrote:
> set fivers [list]
> while {[binary scan $string A5 group]} {
> lappend fivers $group
> set string [string range $string 5 end]
> }

What advantage do you get from using
set fivers [list]
as opposed to
set fivers {}

Is it that:
- There will be overhead making 'fivers' a list (internally) in the
first lappend.

- Using [list] is even more documentative since the variable's intended
use is stated more clearly. (Ugh, I have to improve my articulation)

- Nobody expects the Spanish Inquisition.

- Something else?

I like it, and if there's no reason not to, I will update my programming
methods (in my brain) so as to do this from now on.

OT: I sometimes think of Tcl as the Rodney Dangerfield of languages
'I'm Tcl, and I don't get no respect - no respect I tell ya' ;-)

Stu

Jeffrey Hobbs

Aug 7, 2001, 11:11:30 AM

Stuart Cassoff wrote:
> What advantage do you get from using
> set fivers [list]
> as opposed to
> set fivers {}

Nobody expects the Spanish Inquisition! More cushions!

> Is it that:
> - There will be overhead making 'fivers' a list (internally) in the
> first lappend.

No, they will both actually end up as empty objects (truly empty,
no defined type).

> - Using [list] is even more documentative since the variable's intended
> use is stated more clearly. (Ugh, I have to improve my articulation)

People who know the above still use it for this reason.

Peter Lewerin

Aug 7, 2001, 11:37:51 AM

>- Using [list] is even more documentative since the variable's intended
>use is stated more clearly. (Ugh, I have to improve my articulation)

This one, in my case. I even use

array set [list]

Sad, isn't it?

>- Nobody expects the Spanish Inquisition.

Better not. If we Expect-ed it we might find some errors here and
there.

Neil Madden

Aug 7, 2001, 6:58:22 AM

"Richard.Suchenwirth" wrote:
>
> Peter Lewerin wrote:
> >
> > BTW, is this translation from Perl idiomatic?
> >
> > # split at five byte boundaries (Perl)
> > @fivers = unpack("A5" x (length($string)/5), $string);
> >
> > # split at five byte boundaries
> > set fivers [list]
> > while {[binary scan $string A5 group]} {
> > lappend fivers $group
> > set string [string range $string 5 end]
> > }
> ...or this?
>
> set fivers [list]
> while {[regexp (.{1,5}) $string -> group]} {
> lappend fivers $group
> set string [string range $string 5 end]
> }

Does that always split on 5 bytes, or 5 characters? Are there any cases
where 16-bit character sets (UTF-16) would actually result in this
producing 10 bytes? (I don't actually know much about encodings, just
curious).

Jeffrey Hobbs

Aug 7, 2001, 4:06:08 PM

Neil Madden wrote:
> "Richard.Suchenwirth" wrote:
...

> > while {[regexp (.{1,5}) $string -> group]} {
> > lappend fivers $group
> > set string [string range $string 5 end]
> > }
>
> Does that always split on 5 bytes, or 5 characters? Are there any cases
> where 16-bit character sets (UTF-16) would actually result in this
> producing 10 bytes? (I don't actually know much about encodings, just
> curious).

5 chars, which could be as many as 15 bytes in Tcl (where a char
can be up to 3 bytes in UTF). In fact, there are specs for up
to 6 byte UTF-8 chars, but those aren't really supported in most
places yet because it requires moving beyond the UCS-2 16-bit
size of a unicode char.
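
The byte counts can be checked with a short sketch, assuming UTF-8 is the encoding of interest: split into five-character groups with the -inline regexp, then measure each group after [encoding convertto utf-8]. The sample string (five euro signs followed by "hello") is made up for illustration.

```tcl
# Five euro signs (3 bytes each in UTF-8) followed by five ASCII letters.
set string "[string repeat \u20ac 5]hello"

set groups [regexp -all -inline {.{1,5}} $string]

# Record the UTF-8 byte size of each five-character group.
set sizes [list]
foreach group $groups {
    lappend sizes [string length [encoding convertto utf-8 $group]]
}

puts $sizes   ;# 15 bytes for the euro group, 5 for "hello"
```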

Richard.Suchenwirth

Aug 8, 2001, 4:30:17 AM

Peter Lewerin wrote:
>
> >- Using [list] is even more documentative since the variable's intended
> >use is stated more clearly. (Ugh, I have to improve my articulation)
>
> This one, in my case. I even use
>
> array set [list]

YM: array set arrayname [list] ?

Peter Lewerin

Aug 8, 2001, 6:04:31 AM

>YM: array set arrayname [list] ?

Um, yes. :*)