There are no Tcl translations yet. Has anyone considered doing them?
I suppose that the Cookbook solutions primarily scratch Perl itches,
but shouldn't Tcl be in on the fun anyway?
-- Peter "Not Drunk This Time, Honest" Lewerin
# split at five byte boundaries (Perl)
@fivers = unpack("A5" x (length($string)/5), $string);
# split at five byte boundaries
set fivers [list]
while {[binary scan $string A5 group]} {
    lappend fivers $group
    set string [string range $string 5 end]
}
set fivers [list]
while {[regexp (.{1,5}) $string -> group]} {
    lappend fivers $group
    set string [string range $string 5 end]
}
--
Schoene Gruesse/best regards, Richard Suchenwirth - +49-7531-86 2703
Siemens Dematic AG, PA RC D2, Buecklestr.1-5, 78467 Konstanz,Germany
Personal opinions expressed only unless explicitly stated otherwise.
Cameron Laird <cla...@NeoSoft.com>
Business: http://www.Phaseit.net
Personal: http://starbase.neosoft.com/~claird/home.html
> In article <3B6E6741...@kst.siemens.de>,
> Richard.Suchenwirth <Richard.S...@kst.siemens.de> wrote:
> >Peter Lewerin wrote:
> >>
> >> BTW, is this translation from Perl idiomatic?
> >>
> >> # split at five byte boundaries (Perl)
> >> @fivers = unpack("A5" x (length($string)/5), $string);
> >>
> >> # split at five byte boundaries
> >> set fivers [list]
> >> while {[binary scan $string A5 group]} {
> >> lappend fivers $group
> >> set string [string range $string 5 end]
> >> }
> >...or this?
> >
> >set fivers [list]
> >while {[regexp (.{1,5}) $string -> group]} {
> > lappend fivers $group
> > set string [string range $string 5 end]
> >}
> .
> .
> .
> Dangerously so, is my reaction. Note that the Tcl
> suggestions modify $string, unlike the Perl one.
How about
set fivers [regexp -all -inline {.{1,5}} $string]
This will split at 5 *character* boundaries; I do not quite know
if/when/how some byte sequences may be interpreted by the regexp engine
as multibyte characters ...
For that matter, I do not know how [binary scan] handles multibyte
characters either. Does the first version really split at 5 byte
boundaries? It isn't too difficult to change it so as to avoid changing
the string: either using the positioner for binary (@), or else by doing
the whole thing on a temporary copy of the string.
Miguel Sofer
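
A minimal sketch of the positioner idea mentioned above, assuming the
input contains only characters in the range 0-255. The variable name
pos is made up for illustration. The @ specifier moves the scan cursor,
so $string itself is never modified:

```tcl
# Split into five-byte groups without touching $string.
# "@N" moves the [binary scan] cursor to absolute byte offset N.
set string "abcdefghijklmno"
set fivers [list]
set pos 0
while {[binary scan $string "@${pos}A5" group] == 1} {
    lappend fivers $group
    incr pos 5
}
puts $fivers
```

Like the original loop, this drops a trailing group shorter than five
bytes, and the A specifier strips trailing blanks and nulls from each
group.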
Well, yes. I tried to keep the translation as close as possible;
unpack() -- [binary scan]. [string range] could probably do the job
too, otherwise.
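
For comparison, a sketch of the [string range] route hinted at above.
It works on characters rather than bytes, leaves $string alone, and
keeps a short final group. The loop variables are illustrative only:

```tcl
# Split into groups of up to five characters with [string range].
set string "abcdefghijklm"
set fivers [list]
set len [string length $string]
for {set i 0} {$i < $len} {incr i 5} {
    lappend fivers [string range $string $i [expr {$i + 4}]]
}
puts $fivers
```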
It just worried me that I couldn't create a pack format "A5 A5..." and
dump the result in a list, similar to the Perl code.
Currently, fewer than 270 lines of Perl code and comments translate to
375 lines of Tcl code and comments, so I think Tcl is going to lose
that one. I suppose a more experienced Tcl'er could trim my code a
bit, but not *that* much.
OTOH, the Tcl code wins the clarity award easily.
[string repeat A5 [expr {([string length $string]+4)/5}]] does the first,
and [regexp -all -inline ...] is even more concise, as Cameron noted.
Remember that it isn't necessary for us to always do one-liners
to match Perl (although I like the regexp version). What is key
is that we are technically better - easier to read or more
"correct" code. We can achieve this by truly improving on the
original Perl.
The above case does state "byte boundaries", but then the sample
that follows it says "chop string into individual characters"
and still uses unpack A1. Perhaps adding 'use utf8;' will make
that work, but not as written.
Thus, you should provide both solutions - breaking into bytes
and into chars, noting for the non-i18n-sensitive crowd that
there really is a difference nowadays. :)
--
Jeff Hobbs The Tcl Guy
Senior Developer http://www.ActiveState.com/
Tcl Support and Productivity Solutions
Yes, the format building part can be done; I think this is most
similar to the original code:
set numGroups [expr {[string length $string]/5}]
set formatStr [string repeat "A5 " $numGroups]
(Yes, I know that it leaves out any character group of <5 characters
in the tail.)
This is what I used in my original attempt. *Then* comes the hard
part:
binary scan $string $formatStr WHAT?
That's what stumped me. I tried building a string of variable names,
but it looked weird and didn't work anyway. I tried to find a way to
set up a list to receive the scanned groups, no joy.
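
One possible way around the "WHAT?" problem, offered as a sketch rather
than a recommendation: generate one array element name per group and
expand that list into the [binary scan] call. The array name group is
invented here, and {*} assumes Tcl 8.5 or later (older versions would
need eval instead):

```tcl
# Build the format string and a matching list of variable names,
# then collect the scanned groups back into a list.
set string "abcdefghijklmno"
set numGroups [expr {[string length $string] / 5}]
set formatStr [string repeat "A5" $numGroups]

set varNames [list]
for {set i 0} {$i < $numGroups} {incr i} {
    lappend varNames group($i)
}

# {*} expands the list into separate variable-name arguments.
binary scan $string $formatStr {*}$varNames

set fivers [list]
for {set i 0} {$i < $numGroups} {incr i} {
    lappend fivers $group($i)
}
puts $fivers
```

This mirrors the Perl code closely, but it takes noticeably more
machinery than the loop or regexp versions.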
IMHO the Perl world has a -huge- problem with inconsistent and
obsolete code and documentation lying around. The Cookbook itself is
hardly up-to-date, and important parts of it (e.g. bareword
filehandles) will even be incompatible with Perl6. This is one of the
reasons I gave up teaching Perl.
Anyways, I'm a bit confused by your remark. In my understanding, "c"
codes scan bytes, and "a"/"A" codes scan characters. Is that wrong?
>Thus, you should provide both solutions - breaking into bytes
>and into chars, noting for the non-i18n-sensitive crowd that
>there really is a difference nowadays. :)
Um... *shifts uneasily* ...would you perhaps ...mean something like
this...?
# split at five-character boundaries
set fivers [list]
set temp $string
while {[binary scan $temp A5A* group tail]} {
    lappend fivers $group
    set temp $tail
}
# split at five-byte boundaries
set fivers [list]
set temp $string
while {[binary scan $temp c5c* group tail]} {
    lappend fivers $group
    set temp [binary format c* $tail]
}
# chop string into individual characters
set chars [list]
set temp $string
while {[binary scan $temp A1A* group tail]} {
    lappend chars $group
    set temp $tail
}
# chop string into individual bytes
set bytes [list]
set temp $string
while {[binary scan $temp c1c* group tail]} {
    lappend bytes $group
    set temp [binary format c* $tail]
}
I don't think I'm there yet, because there's no difference for
characters like åäö...
That's what they would like you to think, but that doesn't
appear to work (using perl 5.6.1):
$string = "H\x{2082}O";
@chars = unpack("A1" x length($string), $string);
print @chars;
This should be H-subscript2-O (three chars), but A really doesn't
scan unicode chars (which actually exposes a bug in the latest
Perl book where they don't handle this correctly). The same is
true for Tcl's binary, except that it says a "character string" of
some length is scanned, which is a little more accurate (but still
could be improved in definition, as it is referring to the C char*
definition). The 'scan' command does handle chars right with %c.
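
A small sketch of the [scan] remark, using the same H-subscript2-O
string: %c stores the integer code of one character, and it reports the
full Unicode code point rather than a truncated byte.

```tcl
# Collect the character codes of "H<subscript 2>O" with [scan] %c.
set string "H\u2082O"
set codes [list]
foreach ch [split $string {}] {
    scan $ch %c code
    lappend codes $code
}
puts $codes   ;# 72 8322 79 (8322 is 0x2082)
```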
BTW, for the A1 case, you should be able to use binary scan with
A*, assuming you really only want bytes.
> # split at five-character boundaries
The 5-char boundary solution is the -inline regexp.
> # split at five-byte boundaries
That was OK.
> # chop string into individual characters
This is simply [split $string {}].
> I don't think I'm there yet, because there's no difference for
> characters like åäö...
Yeah, it depends on the translation of the problem. You might
have to preparse that down with [encoding convertto utf-8 $string]
first.
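
To illustrate what that conversion changes, here is a sketch that
compares the character count with the byte count after
[encoding convertto utf-8]. The masking with 0xff is only there
because the c specifier yields signed values:

```tcl
# Three characters become five bytes once encoded as UTF-8.
set string "H\u2082O"
set utf8 [encoding convertto utf-8 $string]
set nchars [string length $string]
set nbytes [string length $utf8]

# Inspect the individual bytes as unsigned integers.
binary scan $utf8 c* signed
set bytes [list]
foreach b $signed {
    lappend bytes [expr {$b & 0xff}]
}
puts "$nchars chars, $nbytes bytes: $bytes"
# 3 chars, 5 bytes: 72 226 130 130 79
```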
Yep, stupid me. I stared myself blind trying to reconcile the
solutions.
>> # split at five-character boundaries
>
>The 5-char boundary solution is the -inline regexp.
Yes, except I really want to use [binary] here. The point of the
snippets isn't to solve problems but to demonstrate techniques. I am
using [regexp] in other places to match Perl regexps.
I may be stupid, but I'm not daft :-)
>> # chop string into individual characters
>
>This is simply [split $string {}].
...which I use to translate split(//, $string) in other places.
>> I don't think I'm there yet, because there's no difference for
>> characters like åäö...
>
>Yeah, it depends on the translation of the problem. You might
>have to preparse that down with [encoding convertto utf-8 $string]
>first.
That does, indeed, do the trick. I think I'll get some sleep now;
I'll put this in the vice tomorrow and see if I can polish it into
shape.
Really, I think that you want to go with the best solutions when
presented with the problem. The Perl Cookbook is about showing
solutions, not really demonstrating techniques. I think you want
to use the regexp and split solutions, because those are what
people should use when they want to solve that problem. If you
still want to go the other way, make sure to note that this is
just a direct translation of Perl, and not the ideal Tcl solution.
## get a 5-byte string, skip 3, then grab 2 8-byte strings, then the rest
#($leading, $s1, $s2, $trailing) =
# unpack("A5 x3 A8 A8 A*", $data);
# Strictly translated, this becomes:
binary scan $data "A5 x3 A8 A8 A*" leading s1 s2 trailing
# but nowadays, characters and bytes aren't necessarily the same
# size. In Tcl, strings are encoded using 16-bit Unicode characters.
# The above unpack/scan works for strings containing only character
# codes in the range 0--255, but distorts other strings by truncating
# all codes to 8 bits.
# To ensure that this distortion is avoided, the input string can be
# converted to an 8-bit encoding before scanning:
set data "H\u2082O is the chemical formula for water"
set utf8data [encoding convertto utf-8 $data]
## split at five-byte boundaries
#@fivers = unpack("A5" x (length($string)/5), $string);
set fivers [list]
set temp $utf8data
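while {[binary scan $temp A5A* group tail]} {
    lappend fivers $group
    set temp $tail
}
# keep any leftover group shorter than five bytes ($temp is always
# defined, whereas $tail is unset if the loop body never ran)
if {[string length $temp]} { lappend fivers $temp }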
# To split at five-character boundaries, this is much more convenient:
set fivers [regexp -all -inline {.{1,5}} $data]
# (Tcl regular expressions are Unicode aware, so the encoding doesn't
# have to be changed.)
## chop string into individual characters
#@chars = unpack("A1" x length($string), $string);
# Strict translation (needs to change encoding):
set chars [list]
set temp $utf8data
while {[binary scan $temp A1A* ch tail]} {
    lappend chars $ch
    set temp $tail
}
# Idiomatic translation (Unicode-aware):
set chars [split $data {}]
(Code by Miguel Sofer, Jeffrey Hobbs, and me.)
What advantage do you get from using
set fivers [list]
as opposed to
set fivers {}
Is it that:
- There will be overhead making 'fivers' a list (internally) in the
first lappend.
- Using [list] is even more documentative since the variable's intended
use is stated more clearly. (Ugh, I have to improve my articulation)
- Nobody expects the Spanish Inquisition.
- Something else?
I like it, and if there's no reason not to, I will update my programming
methods (in my brain) so as to do this from now on.
OT: I sometimes think of Tcl as the Rodney Dangerfield of languages
'I'm Tcl, and I don't get no respect - no respect I tell ya' ;-)
Stu
Nobody expects the Spanish Inquisition! More cushions!
> Is it that:
> - There will be overhead making 'fivers' a list (internally) in the
> first lappend.
No, they will both actually end up as empty objects (truly empty,
no defined type).
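
A quick sketch of what that means at the script level: the two forms
produce values that compare equal and behave identically under list
commands. (The internal representation is not visible from plain Tcl
code, so this only shows the observable behaviour.)

```tcl
# [list] and {} both yield the empty string / empty list.
set a [list]
set b {}
set same [expr {$a eq $b}]
set lengths [list [llength $a] [llength $b]]

# lappend works the same way on either one.
lappend a first
lappend b first
puts "$same $lengths $a $b"
```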
> - Using [list] is even more documentative since the variable's intended
> use is stated more clearly. (Ugh, I have to improve my articulation)
People who know the above still use it for this reason.
>- Using [list] is even more documentative since the variable's intended
>use is stated more clearly. (Ugh, I have to improve my articulation)
This one, in my case. I even use
array set [list]
Sad, isn't it?
>- Nobody expects the Spanish Inquisition.
Better not. If we Expect-ed it we might find some errors here and
there.
"Richard.Suchenwirth" wrote:
>
> Peter Lewerin wrote:
> >
> > BTW, is this translation from Perl idiomatic?
> >
> > # split at five byte boundaries (Perl)
> > @fivers = unpack("A5" x (length($string)/5), $string);
> >
> > # split at five byte boundaries
> > set fivers [list]
> > while {[binary scan $string A5 group]} {
> > lappend fivers $group
> > set string [string range $string 5 end]
> > }
> ...or this?
>
> set fivers [list]
> while {[regexp (.{1,5}) $string -> group]} {
> lappend fivers $group
> set string [string range $string 5 end]
> }
Does that always split on 5 bytes, or 5 characters? Are there any cases
where 16-bit character sets (UTF-16) would actually result in this
producing 10 bytes? (I don't actually know much about encodings, just
curious).
5 chars, which could be as many as 15 bytes in Tcl (where a char
can be up to 3 bytes in UTF-8). In fact, there are specs for up
to 6 byte UTF-8 chars, but those aren't really supported in most
places yet because it requires moving beyond the UCS-2 16-bit
size of a unicode char.
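
The 15-byte case can be checked with a short sketch. It measures the
UTF-8 size of each five-character group produced by the -inline regexp.
The sample string is invented for the purpose:

```tcl
# Five ASCII characters followed by five three-byte characters.
set string "abcde[string repeat \u2082 5]"
set sizes [list]
foreach group [regexp -all -inline {.{1,5}} $string] {
    set nchars [string length $group]
    set nbytes [string length [encoding convertto utf-8 $group]]
    lappend sizes [list $nchars $nbytes]
}
puts $sizes   ;# {5 5} {5 15}
```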
YM: array set arrayname [list] ?
Um, yes. :*)