how to gets with an arbitrary "newline" character


Eric Mahurin

Jun 11, 2006, 5:06:47 PM
Could somebody tell me a fast way to get a sequence of characters from
an I/O channel that ends in an arbitrary character? In other languages
I use (perl, ruby, C++) the get line routines have some control over
what the "newline" character is (exactly). Other than controlling
whether you want \n, \r\n, or \r as a "newline", I don't see this in
tcl. I know I could do this by reading one character at a time, but
this would be slow. Specifically, I'm wanting to use the null
character (from a pipe) in a protocol to delimit records. My current
solution is to use \n to delimit the records and \0 to mean newline
within each record (and use string map to translate). Seems kind of
silly to jump through these hoops.
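(For concreteness, that workaround might look something like the sketch
below - the proc names are just illustrative, and it assumes a record
never contains a literal NUL of its own:)

```tcl
# Illustrative sketch of the \n-delimits-records workaround.
# Assumes the record payload never contains a literal NUL itself.
proc putRecord {chan record} {
    # encode embedded newlines as NUL, end the record with \n
    puts $chan [string map [list \n \0] $record]
}
proc getRecord {chan} {
    # read one \n-terminated record, restore embedded newlines
    return [string map [list \0 \n] [gets $chan]]
}
```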

Any help would be appreciated.

SM Ryan

Jun 11, 2006, 7:07:43 PM
"Eric Mahurin" <eric.m...@gmail.com> wrote:
# Could somebody tell me a fast way to get a sequence of characters from
# an I/O channel that ends in an arbitrary character? In other languages
# I use (perl, ruby, C++) the get line routines have some control over
# what the "newline" character is (exactly). Other than controlling
# whether you want \n, \r\n, or \r as a "newline", I don't see this in
# tcl. I know I could do this by reading one character at a time, but
# this would be slow. Specifically, I'm wanting to use the null
# character (from a pipe) in a protocol to delimit records. My current
# solution is to use \n to delimit the records and \0 to mean newline
# within each record (and use string map to translate). Seems kind of
# silly to jump through these hoops.

How about something like
proc getn {channel args} {
    if {[llength $args]} {
        upvar 1 [lindex $args 0] line
        set count 1
    } else {
        set count 0
    }
    if {[llength $args] >= 2} {
        set term [lindex $args 1]
    } else {
        set term \x00
    }
    set line ""
    while 1 {
        set character [read $channel 1]
        if {[string length $character] == 0} {
            if {$count} {
                if {[string length $line] == 0} {
                    return -1
                } else {
                    return [string length $line]
                }
            } else {
                return $line
            }
        } elseif {$character eq $term} {
            if {$count} {
                return [string length $line]
            } else {
                return $line
            }
        } else {
            append line $character
        }
    }
}

--
SM Ryan http://www.rawbw.com/~wyrmwif/
No pleasure, no rapture, no exquisite sin greater than central air.

Eric Mahurin

Jun 11, 2006, 8:11:55 PM
SM Ryan wrote:
> "Eric Mahurin" <eric.m...@gmail.com> wrote:
> # Could somebody tell me a fast way to get a sequence of characters from
> # an I/O channel that ends in an arbitrary character? In other languages
> # I use (perl, ruby, C++) the get line routines have some control over
> # what the "newline" character is (exactly). Other than controlling
> # whether you want \n, \r\n, or \r as a "newline", I don't see this in
> # tcl. I know I could do this by reading one character at a time, but
> # this would be slow. Specifically, I'm wanting to use the null
> # character (from a pipe) in a protocol to delimit records. My current
> # solution is to use \n to delimit the records and \0 to mean newline
> # within each record (and use string map to translate). Seems kind of
> # silly to jump through these hoops.
>
> How about something like
> proc getn {channel args} {
>     if {[llength $args]} {
>         upvar 1 [lindex $args 0] line
>         set count 1
>     } else {
>         set count 0
>     }
>     if {[llength $args] >= 2} {
>         set term [lindex $args 1]
>     } else {
>         set term \x00
>     }
>     set line ""
>     while 1 {
>         set character [read $channel 1]
>         if {[string length $character] == 0} {
>             if {$count} {
>                 if {[string length $line] == 0} {
>                     return -1
>                 } else {
>                     return [string length $line]
>                 }
>             } else {
>                 return $line
>             }
>         } elseif {$character eq $term} {
>             if {$count} {
>                 return [string length $line]
>             } else {
>                 return $line
>             }
>         } else {
>             append line $character
>         }
>     }
> }

I took the proc above (with one correction) and did this:

foreach n {0 1 10 100} {
    puts "line size : $n"
    set s [string repeat "X" $n]
    set f [open "test" w]
    puts "puts : [time {puts $f $s} 100000]"
    close $f
    set f [open "test" r]
    puts "gets : [time {gets $f line} 100000]"
    close $f
    set f [open "test" r]
    puts "getn : [time {getn $f line "\n"} 100000]"
    close $f
}

and here is the output on my machine:

line size : 0
puts : 2 microseconds per iteration
gets : 3 microseconds per iteration
getn : 6 microseconds per iteration
line size : 1
puts : 2 microseconds per iteration
gets : 3 microseconds per iteration
getn : 8 microseconds per iteration
line size : 10
puts : 2 microseconds per iteration
gets : 3 microseconds per iteration
getn : 24 microseconds per iteration
line size : 100
puts : 3 microseconds per iteration
gets : 4 microseconds per iteration
getn : 175 microseconds per iteration

getn looks to be about O(n), compared to about O(1) for gets (over
these line sizes). Too slow.

Maybe there is some kind of extension to help out? Or some other trick
I don't know about?

EKB

Jun 11, 2006, 9:37:21 PM

If I were to try to attack this (I haven't), I'd start by using "read"
to get a block of text in from the I/O channel into a buffer. Then I'd
use split to split the buffered text on \0, then rebuild the pieces
from the blocks I read in and send along strings as delimited by \0.

I'd wrap all that logic up into a proc, so I wouldn't have to think
about it anymore, something like "gets0".

I hope that helps, and isn't too vague. It also may not be preferable
to string map.
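(A sketch of that buffered approach - the name gets0, the global
buffer array, and the 4096-byte block size are all just illustrative
choices, and it assumes a blocking, binary-mode channel:)

```tcl
# Per-channel leftover buffer for partial records.
array set ::buf {}

proc gets0 {chan} {
    # Return the next \0-terminated record; at EOF, return whatever
    # trailing (possibly empty) data remains.
    if {![info exists ::buf($chan)]} { set ::buf($chan) "" }
    while {[string first \0 $::buf($chan)] < 0} {
        set chunk [read $chan 4096]
        if {$chunk eq ""} {
            # EOF: hand back the unterminated tail, if any
            set rest $::buf($chan)
            set ::buf($chan) ""
            return $rest
        }
        append ::buf($chan) $chunk
    }
    set idx [string first \0 $::buf($chan)]
    set record [string range $::buf($chan) 0 [expr {$idx - 1}]]
    set ::buf($chan) [string range $::buf($chan) [expr {$idx + 1}] end]
    return $record
}
```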

Michael A. Cleverly

Jun 11, 2006, 9:44:50 PM
On Sun, 11 Jun 2006, Eric Mahurin wrote:

> Specifically, I'm wanting to use the null character (from a pipe) in a
> protocol to delimt records.

Something like this may work for you--if you use \0 to delimit a record
(and allow \n to mean newlines within any given record). Then you can use
[fconfigure channel -eofchar \0] to set the null character to be the eof.
Then you can read in an entire record at once, clear the EOF flag, and
read another one. (Assuming your protocol never sends completely blank
records you'll know you were really at the EOF on your pipe when a read
returns 0 bytes.)

Here is an example. Two separate Tcl scripts involved, "client.tcl" and
"pipe.tcl":

##### contents of "pipe.tcl"
#!/bin/sh
#\
exec tclsh "$0" ${1+"$@"}

fconfigure stdout -translation binary -buffering none
for {set i 1} {$i <= 3} {incr i} {
    puts -nonewline [format "Record %d\nData %d\nMore Data%d%s" \
        $i $i $i \0]
}

##### contents of "client.tcl"
#!/bin/sh
#\
exec tclsh "$0" ${1+"$@"}

set fp [open "|./pipe.tcl"]
fconfigure $fp -eofchar \0 -buffering full

proc resetEOF {fp} {
    # Changing the eof character to something different resets
    # the EOF flag on the channel
    #
    # Note: if this were a regular file that we [seek] in then
    # a mere [seek $fp 1 current] would clear the EOF flag and
    # move us beyond the \0, but we can't [seek] on a pipe ...
    fconfigure $fp -eofchar {}
    # we read the null byte (to get past it) then reset \0 to
    # be the eofchar
    read $fp 1
    fconfigure $fp -eofchar \0
}

set counter 0
while 1 {
    set record [read $fp]
    if {[string length $record] == 0} then break else {
        # process the record ...
        puts "Read #[incr counter] returned:\n$record\n"
        resetEOF $fp
    }
}

Michael

Eric Mahurin

Jun 11, 2006, 10:49:39 PM

This is exactly the kind of thing I'm looking for - but it doesn't seem
to work for me. I ran the above and got:

Read #1 returned:
Record 1
Data 1
More Data1

It only got the first record. It didn't seem able to reset the eof
status. I also tried the above client.tcl with fp set to stdin and
using some arbitrary eof character like "~". After it reached the
first "end of file", resetEOF only allowed reading what was left in the
input buffer.

This seems really close. Just need to completely reset the eof status,
so that it can keep reading.

Michael A. Cleverly

Jun 12, 2006, 1:42:54 AM

I get:

Read #1 returned:
Record 1
Data 1
More Data1

Read #2 returned:
Record 2
Data 2
More Data2

Read #3 returned:
Record 3
Data 3
More Data3

I'm running 8.4 on OS X. What platform are you on?

Michael

SM Ryan

Jun 12, 2006, 3:03:21 AM
"Eric Mahurin" <eric.m...@gmail.com> wrote:

# getn looks like about O(n) compared to about O(1) for gets (for these
# line sizes). Too slow.

Sometimes you can speed up append with something like

proc K varname {
    upvar 1 $varname var
    set result $var
    set var ""
    set result
}
...
set string "[K string]$character"
...

I love the smell of commerce in the morning.

suchenwi

Jun 12, 2006, 3:20:50 AM

SM Ryan schrieb:

> proc K varname {
>     upvar 1 $varname var
>     set result $var
>     set var ""
> }

This name is misleading - K is of course the basic functional
combinator defined as
proc K {a b} {set a}
I'd rather call the above "destructive-read" or so :^)

Donal K. Fellows

Jun 12, 2006, 8:30:54 AM
Eric Mahurin wrote:
> Could somebody tell me a fast way to get a sequence of characters from
> an I/O channel that ends in an arbitrary character?
[...]

> Specifically, I'm wanting to use the null
> character (from a pipe) in a protocol to delimt records.

While we don't support arbitrary record separator characters, you can
instead use the -eofchar option to get the same effect:

fconfigure $pipe -eofchar \u0000
while {1} {
    # Read up to the next eof char, as configured
    set record [read $pipe]
    if {[string length $record]} {
        # process $record here
    } else {
        break
    }
    # skip over the eof char
    seek $pipe 1 current
}

Equivalently:

for {fconfigure $pipe -eofchar \u0000} {
    [string length [set record [read $pipe]]]
} {seek $pipe 1 current} {
    # process $record here
}

If you don't like using [seek] to skip the char, [fconfigure] the
channel to clear the -eofchar temporarily and [read] the char instead.

proc foreachRecord {var channel separator body} {
    upvar 1 $var v
    while {1} {
        fconfigure $channel -eofchar $separator
        set v [read $channel]
        uplevel 1 $body
        fconfigure $channel -eofchar {}
        read $channel 1
        if {[eof $channel]} {
            break
        }
    }
}
foreachRecord record $pipe \u0000 {
    # process $record here
}

I've no idea which option is fastest.

(If you can, the fastest option might be to load the whole contents of
the stream into memory and then [split] on the separator character, but
that might not work with the protocol you're using.)
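(That slurp-and-split variant might look like the sketch below - the
proc name is made up, and it only works when the data is finite and
fits in memory, e.g. a file or a pipe the producer closes:)

```tcl
# Read the whole stream, then split it on NUL. Not suitable for a
# long-lived pipe that delivers records incrementally.
proc splitRecords {chan} {
    fconfigure $chan -translation binary
    set records [split [read $chan] \0]
    # a stream that ends in \0 leaves a trailing empty element
    if {[lindex $records end] eq ""} {
        set records [lrange $records 0 end-1]
    }
    return $records
}
```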

Donal.

Eric Mahurin

Jun 12, 2006, 10:02:38 AM
Michael A. Cleverly wrote:
> > This is exactly the kind of thing I'm looking for - but it doesn't seem
> > to work for me. I ran the above and got:
> >
> > Read #1 returned:
> > Record 1
> > Data 1
> > More Data1
>
> I get:
>
> Read #1 returned:
> Record 1
> Data 1
> More Data1
>
> Read #2 returned:
> Record 2
> Data 2
> More Data2
>
> Read #3 returned:
> Record 3
> Data 3
> More Data3
>
> I'm running 8.4 on OS X. What platform are you on?
>
> Michael

Here's what I'm on:

# uname -a
Linux localhost 2.6.8.1 #10 Tue Sep 21 12:10:29 CDT 2004 i686 Intel(R)
Pentium(R) M processor 1.70GHz unknown GNU/Linux
# rpm -q tcl
tcl-8.4.5-6mdk


I was able to get the above results if instead of using -eofchar {}, I
used -eofchar <any-non-null-char>. But, I still could use stdin (which
has big pauses in the stream) with a typable -eofchar. Also, if I add
a "after 1000" in the pipe.tcl loop, I'm back to one record again. It
is almost as though hitting an eofchar makes it go into non-blocking
mode. I tried adding -blocking 1 to the fconfigure commands, but it
didn't seem to help.

Does it work for you if you add some delay to your loop in pipe.tcl?

If I don't hear a better solution, I think I'll go with using \n to be
my record separator and \0 to mean newline within the record (my
original solution). I'll just use something like this: [string map
[list \0 \n] [gets $fp]]. It sounds like anything else will be slower
and likely have compatibility issues.

Eric Mahurin

Jun 12, 2006, 10:05:47 AM

Eric Mahurin wrote:
> But, I still could use stdin (which has big pauses in the stream) with a typable -eofchar.

Sorry, I meant "could not".

Donal K. Fellows

Jun 13, 2006, 6:44:15 AM
Donal K. Fellows wrote:
> I've no idea which option is fastest.

But I should have thought a bit harder and noted that the [seek] version
doesn't work with pipes. D'oh! Use the other one.

Donal.

Eric Mahurin

Jun 13, 2006, 10:51:10 AM
FYI, the protocol I was setting up was simply a way to eval/catch tcl
commands in a separate (possibly remote) process. The solution I went
with was to use CR(\r) to separate the commands (so that each command
could be a complex script with newlines). Here is the basic client
code:

fconfigure stdout -buffering none
fconfigure stdin -buffering none -translation cr
while {1} {
    puts -nonewline stdout \0[catch [gets stdin] result]$result\0
}


I also made a slight optimization to the loop above assuming that the
majority of commands won't have exceptions. This gave a little speedup
with the same functionality:

while {1} {
    puts -nonewline stdout \0[catch {
        while {1} {
            puts -nonewline stdout \0000[eval [gets stdin]]\0
        }
    } result]$result\0
}


When I initially implemented the server in TCL, I put an extra \n in
the stdout pipe to make it easier/faster. I now have the server in C++
(since it needs to talk with C++ code anyways) where there is no reason
for the extra character. Although I felt like my hands were tied with
TCL, I was surprised to see that it wasn't much slower implemented in
pure TCL.

BTW, if you are wondering how the client ever terminates, it dies when
the stdout pipe is closed on the other end and puts fails. I'm
allowing this instead of having the client check stdin and
gracefully exit upon EOF.

Donal K. Fellows

Jun 13, 2006, 10:58:31 AM
Eric Mahurin wrote:
> FYI, the protocol I was setting up was simply a way to eval/catch tcl
> commands in a separate (possibly remote) process. The solution I went
> with was to use CR(\r) to separate the commands (so that each command
> could be a complex script with newlines). Here is the basic client
> code:

I've experimented with these sorts of things, and I find I prefer to
use counted strings instead. By that, I mean that I send the number of
chars in the string as a fixed-width binary value, followed by the
chars themselves. This turns out to admit a very fast implementation in
multiple languages while still allowing arbitrary binary data in the
payloads, which can be useful.
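(One way to sketch such counted-string framing - the 4-byte big-endian
length prefix and the proc names here are arbitrary illustrative
choices, not a fixed wire format from the thread. Both ends must put
the channel in binary translation mode:)

```tcl
# Frame = 4-byte big-endian byte count, then that many payload bytes.
proc putFrame {chan payload} {
    set bytes [encoding convertto utf-8 $payload]
    puts -nonewline $chan [binary format Ia* [string length $bytes] $bytes]
    flush $chan
}
proc getFrame {chan} {
    set header [read $chan 4]
    if {[string length $header] < 4} {
        return ""    ;# EOF (or truncated stream)
    }
    binary scan $header I len
    return [encoding convertfrom utf-8 [read $chan $len]]
}
```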

> When I initially implemented the server in TCL, I put an extra \n in
> the stdout pipe to make it easier/faster. I now have the server in C++
> (since it needs to talk with C++ code anyways) where there is no reason
> for the extra character. Although I felt like my hands were tied with
> TCL, I was surprised to see that it wasn't much slower implemented in
> pure TCL.

The bottleneck is probably the pipe handling and context switching, and
not the data marshalling on either side. Given that, Tcl will hold its
own just fine against C++.

Donal.

Eric Mahurin

Jun 13, 2006, 12:36:08 PM
Donal K. Fellows wrote:
> Eric Mahurin wrote:
> > FYI, the protocol I was setting up was simply a way to eval/catch tcl
> > commands in a separate (possibly remote) process. The solution I went
> > with was to use CR(\r) to separate the commands (so that each command
> > could be a complex script with newlines). Here is the basic client
> > code:
>
> I've experimented with these sorts of things, and I find I prefer to
> use counted strings instead. By that, I mean that I send the number of
> chars in the string as a fixed-width binary value, followed by the
> chars themselves. This turns out to admit a very fast implementation in
> multiple languages while still allowing arbitrary binary data in the
> payloads, which can be useful.

Yep, I was thinking of doing this initially, but decided not to because,
for the stdout coming from the command, I didn't know what its length
would be up front. I wanted to just let the command put what it wanted
on stdout and then terminate it. For the stdin pipe, what you suggest
would be perfectly reasonable. With a protocol of <length><eol><data>,
this would work

read stdin [gets stdin]

although it wouldn't be as space efficient as sending a fixed-width
binary value as you suggest (although probably faster in tcl).
