I'd like to read the last 100 lines from a text file. Currently I use the
following method:
1. Move the file descriptor (fd) to the end of the file.
2. Move the fd 2 bytes backwards.
3. Read a byte.
4. If the char is '\n', increment a counter, and repeat from step 2 until
counter == 100.
Are there any better methods to do the job?
I am coding in C++, Perl or Tcl.
Thanks in advance!
If you are on a *nix type system, you may be much better off using
exec and tail -100 to do the job. This will cost you in terms of
portability.
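Something along these lines (just a sketch; it assumes tail is on the PATH,
and the proc name is my own):

proc tailViaExec {file n} {
    # let tail do the seeking, then split its output into a list of lines
    return [split [exec tail -n $n $file] \n]
}

# e.g.: puts [join [tailViaExec access.log 100] \n]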
If you are not, and you can use Tcl, try reading, say, a 16K buffer from
the end, then using split and llength, and reading zero or more further
blocks closer to the beginning as needed. This will probably be faster
than working byte by byte.
Dave
How big is the file? If it fits in memory, then the simplest and
quickest way will be:
lrange [split [read $fd] \n] end-99 end
If it doesn't fit into memory, then you can at least read in much bigger
chunks than 1 byte and use:
incr count [regexp -all {\n} $buffer]
to count the number of newlines in each block. Reading 1-byte blocks
will be sloooooow.
-- Neil
Here is a proc that explains what Neil and I were talking about:
proc tail {f want} {
    set fd [open $f]
    set gotCount 0
    set buffSize 4096
    while {$gotCount < $want} {
        seek $fd -$buffSize end
        set buff [read $fd $buffSize]
        set gotCount [regexp -all {\n} $buff]
        if {$gotCount < $want} {
            set factor [expr {1 + int(ceil(double($want)/$gotCount))}]
            set buffSize [expr {$buffSize * $factor}]
        }
    }
    return [lrange [split $buff \n] end-[expr {$want-1}] end]
}
set output [tail access.log 100]
puts stdout [join $output "\n"]
Note: My feeling is that the file I/O subsystem will buffer things
enough that there is no point in using Tcl's append etc. to build one big
buffer out of lots of small buffers. This is the reason I chose an
expanding buffer. Also, a better initial guess than the 4K guesstimate
may help, as will checks that the file exists, and that the buffer does
not explode (no EOLs in the file)!!
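For completeness, here is roughly what those checks might look like bolted
on (just a sketch; tailSafe is my name for it, and the growth factors are
guesses):

proc tailSafe {f want} {
    if {![file exists $f]} {
        error "no such file: $f"
    }
    set fd [open $f]
    set fileSize [file size $f]
    set buffSize 4096
    while {1} {
        # never seek back past the beginning of the file
        if {$buffSize > $fileSize} {
            set buffSize $fileSize
        }
        seek $fd -$buffSize end
        set buff [read $fd $buffSize]
        set gotCount [regexp -all {\n} $buff]
        if {$gotCount >= $want || $buffSize == $fileSize} {
            break
        }
        if {$gotCount == 0} {
            # no EOLs seen yet: just double rather than divide by zero
            set buffSize [expr {$buffSize * 2}]
        } else {
            set buffSize [expr {$buffSize * (1 + int(ceil(double($want)/$gotCount)))}]
        }
    }
    close $fd
    return [lrange [split $buff \n] end-[expr {$want-1}] end]
}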
time ./xx.tcl >/dev/null
real 0m0.006s
user 0m0.008s
sys 0m0.000s
time tail -100 access.log >/dev/null
real 0m0.001s
user 0m0.000s
sys 0m0.004s
Dave
proc LastKLines {k} {
    set fh stdin
    set LinesRead 0
    set ModCnt 0
    while {1} {
        set CharsRead [gets $fh line]
        set ModCnt [expr {$LinesRead % $k}]
        if {$CharsRead >= 0} {
            set Lines($ModCnt) $line
        } else {
            break
        }
        incr LinesRead
    }
    set LastIndex $ModCnt
    set FirstIndex [expr {($ModCnt - 1) % $k}]
    # Handle the case in which the file does not contain as many lines
    # as desired.
    if {$LinesRead > $k} {
        set Limit $k
    } else {
        set Limit $LinesRead
    }
    for {set i $LastIndex} {$i < $Limit} {incr i} {
        puts $Lines($i)
    }
    for {set i 0} {$i < $LastIndex} {incr i} {
        puts $Lines($i)
    }
}
# Test: LastKLines 100
I'd write that as:
package require struct::queue ;# from tcllib
proc lastNlines {chan n} {
    # see http://wiki.tcl.tk/8555 for unique_name generator
    set q [::struct::queue [::tcl::tmp::unique_name]]
    while {[gets $chan line] != -1} {
        $q put $line
        if {[$q size] > $n} {
            $q get
        }
    }
    set lines [$q peek $n]
    $q destroy
    return $lines
}
Then:
set fid [open some.test.file r]
set lines [lastNlines $fid 10]
close $fid
or:
set lines [lastNlines stdin 10]
--
Glenn Jackman
Write a wise saying and your name will live forever. -- Anonymous
Here is a command line version of this, also using a list instead of
an array (but note that these examples add a newline to the output):
#!/usr/bin/env tclsh
global argv
set lines [lindex $argv 0]
set lineList [list]
set ifd stdin
set ofd stdout
set i 0
while {![eof $ifd]} {
    if {$i >= $lines} {
        lset lineList [expr {$i % $lines}] [gets $ifd]
    } else {
        lappend lineList [gets $ifd]
    }
    incr i
}
set begin [expr {$i % $lines}]
set end [expr {($i % $lines) - 1}]
foreach line [lrange $lineList $begin end] {
    puts $ofd $line
}
foreach line [lrange $lineList 0 $end] {
    puts $ofd $line
}
Yes, there are methods that isolate the last X lines of a file by
starting at the front of the file, but that does not address the
original poster's question: what is the fastest and most robust way of
doing it when you start at the end of the file?
Logic and my speed tests show that:
a) Nothing is as fast as tail under *nix.
b) Starting at the front on a line-by-line basis is always slower, and
can be made even slower by throwing a bigger file at it.
Dave
On what operating system will the program run? On most unix systems
there isn't an operation to do this sort of thing automatically, and
unless you can find a specialty file system or set of functions, there
isn't an easy way to do it.
When you say "better", are you thinking "algorithmically more
correct", "faster", or something else?
What you describe is basically the way to do literally what you have
asked for. To improve speed, you could try moving and reading in
blocks of some size.
But this is not a real programming task, it is an academic exercise,
at least I hope so ;)
If you look at my example, it takes stdin, which could be a pipe:
$ cat myfile | ./lastlines 25 > last25
In this case myfile exists and the tcl shell script 'lastlines' reads
it and outputs the last 25, once EOF is reached.
$ ./lastlines 25 > last25
In this case lastlines loops over stdin reading each line that I type.
You can't start at the end, since the end doesn't exist. Hit Ctrl-D
and the program reaches EOF and prints at most 25 lines.
Of course, we haven't talked about what a line is, what constitutes a
line break, whether we should use the same line break to recreate the
last number of lines, etc. If you try the unix command 'tail', this is
what it does:
$ tail - > output
Type more than ten lines in, and you get the last ten into output after
the Ctrl-D.
If you're taking this as an academic exercise then it's better to
have a clear definition of the problem you're trying to solve. The
challenge here is to speed up reading the last N lines from a text
(presumably ASCII; utf-8, anyone?) file. The key requirements are:
1. it's definitely a file.
2. we're optimising for speed, not generality.
So for this particular academic exercise you can make the assumption
that you may start at the end of the file.
> $ tail - > output
>
> Type over ten lines in, you get the last ten into output,
> after the ctrl-D.
Actually, most implementations of tail seek from the end of the file if
possible and fall back to a line-at-a-time strategy only when that can't
be done (I suppose it switches mode based on what device stdin points to,
kind of like how 'ls' switches mode based on what device stdout points to).
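The same test is easy to do in Tcl, since [seek] on a pipe or terminal
raises an error. A sketch (tailSeekable and tailStreaming stand in for the
end-seeking and line-at-a-time procs shown earlier in the thread):

proc lastLines {chan n} {
    if {[catch {seek $chan 0 end}]} {
        # not seekable (pipe, tty): fall back to reading line by line
        return [tailStreaming $chan $n]
    }
    # seekable: rewind and use the end-seeking strategy
    seek $chan 0 start
    return [tailSeekable $chan $n]
}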
If the goal is to optimise for speed (and not memory usage), then the
clear winner must be the simplest approach:
proc tail {file n} {
set in [open $file]
set tl [lrange [split [read $in] \n] end-99 end]
close $in
return $tl
}
If your file's too big, then buy more RAM (we are optimising for speed,
after all). :-)
-- Neil
> set in [open $file]
> set tl [lrange [split [read $in] \n] end-99 end]
Of course, that should be:
set tl [lrange [split [read $in] \n] end-$n end]
^^
> close $in
> return $tl
Yes but if the file's really big *and* fits into RAM, the chunked
approach proposed by Dave and you will most likely be faster than the
jumbo [split] which will uselessly scan the whole file (and more than
double the memory usage for individual mallocs of all these strings,
which affects the threshold for "fits into RAM"). And a reasonable
hypothesis on line length may do it in one chunk most of the time :-)
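E.g. something like this for the initial guess (the 80 bytes per line
figure is pure assumption):

proc guessChunk {file want {avgLineLen 80}} {
    set size [file size $file]
    set guess [expr {$want * $avgLineLen}]
    # never ask for more than the file actually holds
    return [expr {$guess > $size ? $size : $guess}]
}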
-Alex
As Glenn pointed out, this should be [lrange ... end-$n end]. Also
better to use [read $in [file size $file]].
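Combined, that would be something like this (a sketch; the exact lrange
index gets corrected once more further down):

proc tail {file n} {
    set in [open $file]
    # pre-sizing the read with the file size lets Tcl allocate the buffer once
    set tl [lrange [split [read $in [file size $file]] \n] end-$n end]
    close $in
    return $tl
}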
>>
>> If your file's too big, then buy more RAM (we are optimising for speed,
>> after all). :-)
>
> Yes but if the file's really big *and* fits into RAM, the chunked
> approach proposed by Dave and you will most likely be faster than the
> jumbo [split] which will uselessly scan the whole file (and more than
> double the memory usage for individual mallocs of all these strings,
> which affects the threshold for "fits into RAM"). And a reasonable
> hypothesis on line length may do it in one chunk most of the time :-)
Eventually, yes. I suspect [split] will be faster in most cases, and by
the time it starts getting slow you really shouldn't be using a text
file for your data! (i.e. it's time to start thinking about SQLite).
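(For the record, with the sqlite3 Tcl package that could look something
like the following; the log table schema is entirely made up:)

package require sqlite3
sqlite3 db access.db
# hypothetical schema: CREATE TABLE log(id INTEGER PRIMARY KEY, line TEXT)
set last100 [lreverse [db eval {SELECT line FROM log ORDER BY id DESC LIMIT 100}]]
db close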
-- Neil
I think it should be end-($n-1) end
My speed tests show that the 'read whole file and split' approach is
slower (as Alex thought it might be). The test file is not even
particularly big...
1201284 2008-06-26 17:49 access.log
cat zz.tcl
#!/usr/bin/tclsh
proc tail {file n} {
    set in [open $file]
    set tl [lrange [split [read $in] \n] end-[expr {$n-1}] end]
    close $in
    return $tl
}
set output [tail access.log 100]
puts stdout [join $output "\n"]
nexus:/home/httpd/logs# time ./zz.tcl >/dev/null
real 0m0.022s
user 0m0.016s
sys 0m0.004s
Read from end....
nexus:/home/httpd/logs# time ./xx.tcl >/dev/null
real 0m0.005s
user 0m0.004s
sys 0m0.000s
nexus:/home/httpd/logs# time tail -100 access.log >/dev/null
real 0m0.001s
user 0m0.000s
sys 0m0.000s
Dave
You're right. I stand corrected. Timing over 10 runs on a similarly
sized file:
tail/split: 29010.383299999998 microseconds per iteration
tail/chunk: 987.8046 microseconds per iteration
So, [split] is quite a lot slower.
-- Neil
> At 2008-06-26 09:06AM, "Neil Madden" wrote:
>> proc tail {file n} {
> ^
>
>> set in [open $file]
>> set tl [lrange [split [read $in] \n] end-99 end]
>
> Of course, that should be:
> set tl [lrange [split [read $in] \n] end-$n end]
> ^^
Actually, that should be:
set tl [lrange [split [read $in] \n] end-[incr n -1] end]
^^^^^^^^^^^
>> close $in
>> return $tl
>> }
--
-Kaitzschu
set s TCL\ ;wh 1 {pu [set s [st ra $s 1 3][st in $s 0]]\033\[1A;af 99}
"Good thing programmers don't build bridges
[the same way as Kaitzschu writes code]."
--John Kelly in comp.lang.tcl
I won't use shorthand in comments next time! (If you look at the code
itself you will see what I actually wrote (both in the original code
and the alternate code).)
As an aside, I found myself shocked at the thought that one could use
incr that way! Funny how one gets in a rut. On the other hand, some
poor soul could blindly copy and paste the code inline instead of
keeping it in a proc, and then wonder why his code fails further on.
Dave
> On Jun 30, 12:35 pm, Kaitzschu
>> set tl [lrange [split [read $in] \n] end-[incr n -1] end]
>
> As an aside, I found myself shocked at the thought that one could use
> incr that way! Funny how one gets in a rut. On the other hand, some poor
> soul could blindly copy and paste the code inline instead of keeping it
> in a proc, and then wonder why his code fails further on.
Or worse, copy these lines sequentially in the same scope. Ohh, have I felt
that pain!
There has been some tangential discussion of such "inline return values" on
the tclcore mailing list regarding [chan pipe], but I guess this doesn't
really apply here, since using a "mutator" command this way is most likely
going to result in an error, as you point out with your scenario.
But I like it! Even if it is going to put a nail to my foot sometimes :)
Ah yes, for "casual" programming tasks I also like (ab)using return
values of commands. I especially like that [set] always returns a
value. My favourite (ab)use is the "slurp" one-liner:
set data [read [set f [open $filename]]][close $f]
Of course, in production code it's not a good idea to do this.
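The boring production spelling would of course be something like:

proc slurp {filename} {
    set f [open $filename]
    set data [read $f]
    close $f
    return $data
}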
The slurp one-liner is still very readable, and the only thing that can
confuse a reader is if (s)he doesn't know that [close] returns an empty
string (which is phased out). But once we have a -unique -sorted $list to
which an $item needs to be added while ${exclude}ing another, and which
must stay -unique -sorted in case $item was already there, then we have
one ugly one-liner:
[code]
set list [lreplace [set list [lsort -unique [lappend list $item]]] [set ind [lsearch -sorted $list $exclude]] $ind]
[/code]
..which mutates $list once, twice, three times (plus $ind is calculated
somewhere in the middle of it), and does something different on each step
(which, obviously, is needed, since otherwise there would be no need to
mutate the value, the return value could be used directly; true, the list
is [lappend]ed for nothing, but [linsert] is uglier).
[code]
lappend list $item
set list [lsort -unique $list]
set ind [lsearch -sorted $list $exclude]
set list [lreplace $list $ind $ind]
# lreplace will bail out when $ind isn't in range (-1 = not found)
# woohoo, 4-fold increase in LOC! plus vague documentation!
[/code]
> Of course, in production code it's not a good idea to do this.
So _that_ must be the reason I'm not being considered a productive member
of society :P