
How to read the last 100 lines from a text file


zhou...@gmail.com

Jun 22, 2008, 1:59:00 AM
Hello,

I'd like to read the last 100 lines from a text file. I use the method
below:
1. Move the file descriptor (fd) to the end of the file.
2. Move fd 2 bytes backwards.
3. Read a byte.
4. If the char is '\n', increase a counter. Repeat from step 2 until
counter == 100.

Are there any better methods to do the job?

I am coding in C++, Perl or tcl.
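
For concreteness, here is a rough Tcl sketch of that byte-by-byte idea
(illustrative only; it assumes a seekable file and '\n' line endings):

proc tailBytewise {path want} {
    set fd [open $path r]
    fconfigure $fd -translation binary  ;# byte-accurate seeks, no newline munging
    seek $fd 0 end
    set size [tell $fd]
    set pos $size
    set count 0
    # Walk backwards one byte at a time, counting newlines; a trailing
    # newline at the very end of the file is not counted as a separator.
    while {$pos > 0 && $count < $want} {
        incr pos -1
        seek $fd $pos start
        if {[read $fd 1] eq "\n" && $pos != $size - 1} {
            incr count
        }
    }
    # Read everything after the newline we stopped at (or the whole file).
    set start [expr {$count == $want ? $pos + 1 : 0}]
    seek $fd $start start
    set data [read $fd]
    close $fd
    return [split [string trimright $data "\n"] "\n"]
}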

Thanks in advance!

dave.j...@googlemail.com

Jun 22, 2008, 4:41:25 AM

If you are on a *nix type system, you may be much better off using
exec and tail -100 to do the job. This will cost you in terms of
portability.

If you are not, and you can use Tcl, try reading, say, a 16K buffer from
the end, then using split and llength, and reading zero or more further
blocks closer to the beginning as needed. This will probably be faster
than working byte by byte.

Dave

Neil Madden

Jun 22, 2008, 9:49:15 AM

How big is the file? If it fits in memory, then the simplest and
quickest way will be:

lrange [split [read $fd] \n] end-99 end

If it doesn't fit into memory, then you can at least read in much bigger
chunks than 1 byte and use:

incr count [regexp -all {\n} $buffer]

to count the number of newlines in each block. Reading 1-byte blocks
will be sloooooow.

-- Neil

dave.j...@googlemail.com

Jun 22, 2008, 1:19:24 PM
On Jun 22, 2:49 pm, Neil Madden <n...@cs.nott.ac.uk> wrote:

Here is a proc that explains what Neil and I were talking about:
proc tail {f want} {
    set fd [open $f]
    set gotCount 0
    set buffSize 4096

    while {$gotCount < $want} {
        # Read a buffer off the end of the file and count its newlines.
        seek $fd -$buffSize end
        set buff [read $fd $buffSize]
        set gotCount [regexp -all {\n} $buff]
        if {$gotCount < $want} {
            # Not enough lines yet: grow the buffer proportionally and retry.
            set factor [expr {1 + int(ceil(double($want)/$gotCount))}]
            set buffSize [expr {$buffSize * $factor}]
        }
    }
    return [lrange [split $buff \n] end-[expr {$want-1}] end]
}
set output [tail access.log 100]
puts stdout [join $output "\n"]

Note: My feeling is that the file IO subsystem will buffer things well
enough that there is no point in using Tcl's append etc. to build one big
buffer out of lots of small buffers. This is the reason I chose an
expanding buffer. Also, a good initial guess rather than a fixed 4K
guesstimate may help, as will checks that the file exists and that the
buffer does not explode (no EOLs in the file)!!

time ./xx.tcl >/dev/null
real 0m0.006s
user 0m0.008s
sys 0m0.000s

time tail -100 access.log >/dev/null
real 0m0.001s
user 0m0.000s
sys 0m0.004s
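
For what it's worth, here is a sketch of what those extra guards might look
like bolted onto the proc above (illustrative only, not benchmarked; the
avgLine parameter is just an assumed average line length used for the
initial guess):

proc tailGuarded {f want {avgLine 80}} {
    if {![file exists $f]} {
        error "no such file: $f"
    }
    set size [file size $f]
    set fd [open $f]
    # Initial guess: room for $want lines of roughly $avgLine bytes each.
    set buffSize [expr {$want * $avgLine}]
    if {$buffSize > $size} { set buffSize $size }
    while {1} {
        seek $fd -$buffSize end
        set buff [read $fd $buffSize]
        set gotCount [regexp -all {\n} $buff]
        # Stop once we have enough newlines or the buffer is the whole file.
        if {$gotCount >= $want || $buffSize >= $size} { break }
        if {$gotCount == 0} {
            # No line ends seen yet: just double, don't divide by zero.
            set buffSize [expr {$buffSize * 2}]
        } else {
            set factor [expr {1 + int(ceil(double($want) / $gotCount))}]
            set buffSize [expr {$buffSize * $factor}]
        }
        if {$buffSize > $size} { set buffSize $size }
    }
    close $fd
    return [lrange [split $buff \n] end-[expr {$want - 1}] end]
}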

Dave

bill...@alum.mit.edu

Jun 22, 2008, 5:25:00 PM
A slower approach that uses much less memory than slurping the entire
file at once and then splitting it is to read one line at a time and
keep the lines in a circular buffer. One way to implement a circular
buffer is to use an array indexed on the line number modulo the
desired number of lines.

proc LastKLines {k} {
    set fh stdin
    set LinesRead 0
    set ModCnt 0
    while {1} {
        set CharsRead [gets $fh line]
        set ModCnt [expr {$LinesRead % $k}]
        if {$CharsRead >= 0} {
            set Lines($ModCnt) $line
        } else {
            break
        }
        incr LinesRead
    }
    set LastIndex $ModCnt
    set FirstIndex [expr {($ModCnt - 1) % $k}]
    # Handle the case in which the file does not contain as many lines
    # as desired.
    if {$LinesRead > $k} {
        set Limit $k
    } else {
        set Limit $LinesRead
    }
    for {set i $LastIndex} {$i < $Limit} {incr i} {
        puts $Lines($i)
    }
    for {set i 0} {$i < $LastIndex} {incr i} {
        puts $Lines($i)
    }
}

#Test: LastKLines 100

bill...@alum.mit.edu

Jun 22, 2008, 5:27:22 PM
The line initializing ModCnt and the line setting FirstIndex above are
superfluous.

Glenn Jackman

Jun 23, 2008, 12:02:05 PM

I'd write that as:

package require struct::queue ;# from tcllib

proc lastNlines {chan n} {
    # see http://wiki.tcl.tk/8555 for unique_name generator
    set q [::struct::queue [::tcl::tmp::unique_name]]

    while {[gets $chan line] != -1} {
        $q put $line
        if {[$q size] > $n} {
            $q get
        }
    }

    set lines [$q peek $n]
    $q destroy
    return $lines
}

Then:
set fid [open some.test.file r]
set lines [lastNlines $fid 10]
close $fid

or:
set lines [lastNlines stdin 10]

--
Glenn Jackman
Write a wise saying and your name will live forever. -- Anonymous

bill...@alum.mit.edu

Jun 23, 2008, 3:24:03 PM
The documentation for struct::queue is completely incomprehensible if
you don't already know how a queue is supposed to behave. It really
needs an initial paragraph explaining this.
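
For anyone else in the same boat: the behaviour is plain first-in,
first-out. A minimal illustration, using the same tcllib API that Glenn's
proc relies on:

package require struct::queue   ;# from tcllib

::struct::queue q
q put a b c       ;# enqueue at the tail
puts [q size]     ;# 3
puts [q get]      ;# a  -- dequeue from the head: first in, first out
puts [q peek]     ;# b  -- look at the head without removing it
q destroy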

tom.rmadilo

Jun 24, 2008, 7:46:33 PM

Here is a command line version of this, also using a list instead of
an array (but note that these examples add a newline to the output):

#!/usr/bin/env tclsh

global argv

set lines [lindex $argv 0]
set lineList [list]

set ifd stdin
set ofd stdout

set i 0

while {![eof $ifd]} {
    if {$i >= $lines} {
        lset lineList [expr {$i % $lines}] [gets $ifd]
    } else {
        lappend lineList [gets $ifd]
    }
    incr i
}

set begin [expr {$i % $lines}]
set end [expr {($i % $lines) - 1}]

foreach line [lrange $lineList $begin end] {
    puts $ofd $line
}

foreach line [lrange $lineList 0 $end] {
    puts $ofd $line
}

dave.j...@googlemail.com

Jun 25, 2008, 4:40:10 AM
I am concerned about the way this thread is turning out.

Yes, there are methods that isolate the last X lines of a file by
starting at the front of the file, but that does not address the
original poster's question. What is the fastest and most robust way of
doing it when you start at the end of the file ?

Logic and my speed tests show that:
a) Nothing is as fast as tail under *nix
b) Starting at the front on a line by line basis is always slower, and
can be made even slower by throwing a bigger file at it.

Dave

Larry W. Virden

Jun 25, 2008, 5:57:22 AM
On Jun 22, 1:59 am, zhouv...@gmail.com wrote:
> Hello,
>
> I'd like to read the last 100 lines from a text file. I use the method
> below:
> 1. Move the file descriptor (fd) to the end of the file.
> 2. Move fd 2 bytes backwards.
> 3. Read a byte.
> 4. If the char is '\n', increase a counter. Repeat from step 2 until
> counter == 100.
>
> Are there any better methods to do the job?

On what operating system will the program run? On most Unix systems
there isn't an operation to do this sort of thing automatically, and
unless you can find a specialty file system or set of functions, there
isn't an easy way to do it.

When you say "better", are you thinking "algorithmically more
correct", "faster", or something else?

What you describe is basically the way to do literally what you have
asked for. To improve speed, you could try moving and reading in
blocks of some size.

tom.rmadilo

Jun 25, 2008, 3:16:14 PM
On Jun 25, 1:40 am, "dave.joub...@googlemail.com"

But this is not a real programming task, it is an academic exercise,
at least I hope so ;)

If you look at my example, it takes stdin, which could be a pipe:

$ cat myfile | ./lastlines 25 > last25

In this case myfile exists and the tcl shell script 'lastlines' reads
it and outputs the last 25, once EOF is reached.

$ ./lastlines 25 > last25

In this case lastlines loops over stdin reading each line that I type.
You can't start at the end, since the end doesn't exist. Hit Ctrl-D
and the program reaches EOF and prints at most 25 lines.

Of course we haven't talked about what is a line, what constitutes a
line break, if we should use the same line break to recreate the last
number of lines, etc. If you try the unix command 'tail', this is what
it does:

$ tail - > output

Type over ten lines in, you get the last ten into output, after the
ctrl-D.

sleb...@gmail.com

Jun 25, 2008, 3:45:27 PM
On Jun 26, 3:16 am, "tom.rmadilo" <tom.rmad...@gmail.com> wrote:
> On Jun 25, 1:40 am, "dave.joub...@googlemail.com"
>
> <dave.joub...@googlemail.com> wrote:
> > I am concerned about the way this thread is turning out.
>
> > Yes, there are methods that isolate the last X lines of a file by
> > starting at the front of the file, but that does not address the
> > original poster's question. What is the fastest and most robust way of
> > doing it when you start at the end of the file ?
>
> > Logic and my speed tests show that:
> > a) Nothing is as fast as tail under *nix
> > b) Starting at the front on a line by line basis is always slower, and
> > can be made even slower by throwing a bigger file at it.
>
> But this is not a real programming task, it is an academic exercise,
> at least I hope so ;)
>
> If you look at my example, it takes stdin, which could be a pipe:
>
> $ cat myfile | ./lastlines 25 > last25
>
> In this case myfile exists and the tcl shell script 'lastlines' reads
> it and outputs the last 25, once EOF is reached.
>
> $ ./lastlines 25 > last25
>
> In this case lastlines loops over stdin reading each line that I type.
> You can't start at the end, since the end doesn't exist. Hit Ctrl-D
> and the program reaches EOF and prints at most 25 lines.
>

If you're treating this as an academic exercise then it's better to
have a clear definition of the problem you're trying to solve. The
challenge here is to speed up reading the last N lines from a text
(presumably ASCII; UTF-8, anyone?) file. The key requirements are:
1. it's definitely a file.
2. we're optimising for speed, not generality.
So for this particular academic exercise you can make the assumption
that you may start at the end of the file.

> $ tail - > output
>
> Type over ten lines in, you get the last ten into output,
> after the ctrl-D.

Actually, most implementations of tail seek from the end of the file if
possible and fall back to a line-at-a-time strategy only when that can't
be done (I suppose it switches mode based on what device stdin points to,
kind of like how 'ls' switches mode based on what device stdout points to).
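
To illustrate that kind of switching in Tcl terms, here's a rough sketch
(not from any of the earlier posts): probe the channel with [catch {seek}]
and pick a strategy accordingly.

proc lastN {chan n} {
    if {[catch {seek $chan 0 end}]} {
        # Not seekable (pipe, terminal): line at a time, keep a sliding window.
        set buf {}
        while {[gets $chan line] >= 0} {
            lappend buf $line
            if {[llength $buf] > $n} {
                set buf [lrange $buf 1 end]
            }
        }
        return $buf
    }
    # Seekable: slurp-and-split here for brevity; a chunked read from the
    # end (as in the earlier procs) would be the faster choice for big files.
    seek $chan 0 start
    return [lrange [split [string trimright [read $chan] \n] \n] \
                end-[expr {$n - 1}] end]
}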

Neil Madden

Jun 26, 2008, 9:06:47 AM
sleb...@yahoo.com wrote:
[...]

> 1. it's definitely a file.
> 2. we're optimising for speed, not generality.

If the goal is to optimise for speed (and not memory usage), then the
clear winner must be the simplest approach:

proc tail {file n} {
set in [open $file]
set tl [lrange [split [read $in] \n] end-99 end]
close $in
return $tl
}

If your file's too big, then buy more RAM (we are optimising for speed,
after all). :-)

-- Neil

Glenn Jackman

Jun 26, 2008, 10:00:16 AM
At 2008-06-26 09:06AM, "Neil Madden" wrote:
> proc tail {file n} {
^

> set in [open $file]
> set tl [lrange [split [read $in] \n] end-99 end]

Of course, that should be:
set tl [lrange [split [read $in] \n] end-$n end]
^^

> close $in
> return $tl

Alexandre Ferrieux

Jun 26, 2008, 10:06:39 AM
On Jun 26, 3:06 pm, Neil Madden <n...@cs.nott.ac.uk> wrote:

Yes but if the file's really big *and* fits into RAM, the chunked
approach proposed by Dave and you will most likely be faster than the
jumbo [split] which will uselessly scan the whole file (and more than
double the memory usage for individual mallocs of all these strings,
which affects the threshold for "fits into RAM"). And a reasonable
hypothesis on line length may do it in one chunk most of the time :-)

-Alex

Neil Madden

Jun 26, 2008, 10:41:25 AM
Alexandre Ferrieux wrote:
> On Jun 26, 3:06 pm, Neil Madden <n...@cs.nott.ac.uk> wrote:
>> slebet...@yahoo.com wrote:
>>
>> [...]
>>
>>> 1. it's definitely a file.
>>> 2. we're optimising for speed, not generality.
>> If the goal is to optimise for speed (and not memory usage), then the
>> clear winner must be the simplest approach:
>>
>> proc tail {file n} {
>> set in [open $file]
>> set tl [lrange [split [read $in] \n] end-99 end]
>> close $in
>> return $tl
>>
>> }

As Glenn pointed out, this should be [lrange ... end-$n end]. Also
better to use [read $in [file size $file]].

>>
>> If your file's too big, then buy more RAM (we are optimising for speed,
>> after all). :-)
>
> Yes but if the file's really big *and* fits into RAM, the chunked
> approach proposed by Dave and you will most likely be faster than the
> jumbo [split] which will uselessly scan the whole file (and more than
> double the memory usage for individual mallocs of all these strings,
> which affects the threshold for "fits into RAM"). And a reasonable
> hypothesis on line length may do it in one chunk most of the time :-)

Eventually, yes. I suspect [split] will be faster in most cases, and by
the time it starts getting slow you really shouldn't be using a text
file for your data! (i.e. it's time to start thinking about SQLite).
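
Purely to illustrate that last point, and assuming a hypothetical log table
with an id and a line column, the "last 100 lines" question becomes a
one-liner from Tcl's sqlite3 binding:

package require sqlite3
sqlite3 db access.db    ;# hypothetical database file
# newest 100 entries; rows come back newest-first, so reverse for file order
set lines [db eval {SELECT line FROM log ORDER BY id DESC LIMIT 100}]
db close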

-- Neil

dave.j...@googlemail.com

Jun 26, 2008, 1:48:29 PM
On Jun 26, 3:41 pm, Neil Madden <n...@cs.nott.ac.uk> wrote:
> Alexandre Ferrieux wrote:
> > On Jun 26, 3:06 pm, Neil Madden <n...@cs.nott.ac.uk> wrote:
> >> slebet...@yahoo.com wrote:
>
> >> [...]
>
> >>> 1. it's definitely a file.
> >>> 2. we're optimising for speed, not generality.
> >> If the goal is to optimise for speed (and not memory usage), then the
> >> clear winner must be the simplest approach:
>
> >> proc tail {file n} {
> >> set in [open $file]
> >> set tl [lrange [split [read $in] \n] end-99 end]
> >> close $in
> >> return $tl
>
> >> }
>
> As Glenn pointed out, this should be [lrange ... end-$n end]. Also
> better to use [read $in [file size $file]].
>

I think it should be end-($n-1) end

My speed tests show that the 'read whole file and split' approach is
slower (as Alex thought it might be). The test file is not even
particularly big...
1201284 2008-06-26 17:49 access.log

cat zz.tcl
#!/usr/bin/tclsh

proc tail {file n} {
    set in [open $file]
    set tl [lrange [split [read $in] \n] end-[expr {$n-1}] end]
    close $in
    return $tl
}

set output [tail access.log 100]
puts stdout [join $output "\n"]

nexus:/home/httpd/logs# time ./zz.tcl >/dev/null

real 0m0.022s
user 0m0.016s
sys 0m0.004s


Read from end....
nexus:/home/httpd/logs# time ./xx.tcl >/dev/null

real 0m0.005s
user 0m0.004s
sys 0m0.000s


nexus:/home/httpd/logs# time tail -100 access.log >/dev/null

real 0m0.001s
user 0m0.000s
sys 0m0.000s

Dave

dave.j...@googlemail.com

Jun 26, 2008, 1:52:34 PM
On Jun 26, 6:48 pm, "dave.joub...@googlemail.com"
<dave.joub...@googlemail.com> wrote:
.....
and
    set tl [lrange [split [read $in [file size $file]] \n] end-[expr {$n-1}] end]
makes no speed difference that I can see.

Dave

Neil Madden

Jun 26, 2008, 2:49:18 PM
dave.j...@googlemail.com wrote:
> On Jun 26, 3:41 pm, Neil Madden <n...@cs.nott.ac.uk> wrote:
>> Alexandre Ferrieux wrote:
>>> On Jun 26, 3:06 pm, Neil Madden <n...@cs.nott.ac.uk> wrote:
>>>> slebet...@yahoo.com wrote:
>>>> [...]
>>>>> 1. it's definitely a file.
>>>>> 2. we're optimising for speed, not generality.
>>>> If the goal is to optimise for speed (and not memory usage), then the
>>>> clear winner must be the simplest approach:
>>>> proc tail {file n} {
>>>> set in [open $file]
>>>> set tl [lrange [split [read $in] \n] end-99 end]
>>>> close $in
>>>> return $tl
>>>> }
>> As Glenn pointed out, this should be [lrange ... end-$n end]. Also
>> better to use [read $in [file size $file]].
>>
>
> I think it should be end-($n-1) end
>
> My speed test show that the 'read whole file and split' is slower (as
> Alex thought it might be). The test file is not even particularly
> big...
> 1201284 2008-06-26 17:49 access.log

You're right. I stand corrected. Timing over 10 runs on a similarly
sized file:

tail/split: 29010.383299999998 microseconds per iteration
tail/chunk: 987.8046 microseconds per iteration

So, [split] is quite a lot slower.
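
A harness like the following reproduces that kind of measurement with Tcl's
[time] command (tailSplit and tailChunk are placeholder names for the two
procs being compared, access.log the test file):

puts "tail/split: [time {tailSplit access.log 100} 10]"
puts "tail/chunk: [time {tailChunk access.log 100} 10]"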

-- Neil

Kaitzschu

Jun 30, 2008, 7:35:29 AM
On Thu, 26 Jun 2008, Glenn Jackman wrote:

> At 2008-06-26 09:06AM, "Neil Madden" wrote:
>> proc tail {file n} {
> ^
>
>> set in [open $file]
>> set tl [lrange [split [read $in] \n] end-99 end]
>
> Of course, that should be:
> set tl [lrange [split [read $in] \n] end-$n end]
> ^^

Actually, that should be:
set tl [lrange [split [read $in] \n] end-[incr n -1] end]
                                         ^^^^^^^^^^^

>> close $in
>> return $tl
>> }

--
-Kaitzschu
set s TCL\ ;wh 1 {pu [set s [st ra $s 1 3][st in $s 0]]\033\[1A;af 99}

"Good thing programmers don't build bridges
[the same way as Kaitzschu writes code]."
--John Kelly in comp.lang.tcl

dave.j...@googlemail.com

Jul 1, 2008, 4:42:16 AM
On Jun 30, 12:35 pm, Kaitzschu

I won't use shorthand in comments next time! (If you look at the code
itself you will see what I actually wrote (both in the original code
and the alternate code).)

As an aside, I found myself shocked at the thought that one could use
incr that way! Funny how one gets in a rut. On the other hand, some
poor soul could blindly copy and paste the code inline instead of
keeping it in a proc, and then wonder why his code fails further on.

Dave

Kaitzschu

Jul 1, 2008, 7:44:48 AM
On Tue, 1 Jul 2008, dave.j...@googlemail.com wrote:

> On Jun 30, 12:35 pm, Kaitzschu

>> set tl [lrange [split [read $in] \n] end-[incr n -1] end]
>

> As an aside, I found myself shocked at the thought that one could use
> incr that way! Funny how one gets in a rut. On the other hand, some poor
> soul could blindly copy and paste the code inline instead of keeping it
> in a proc, and then wonder why his code fails further on.

Or worse, copy these lines sequentially in the same scope. Ohh, have I felt
that pain!

There has been some tangential discussion about using "inline return values"
on the tclcore mailing list regarding [chan pipe], but I guess this doesn't
really apply, since using a "mutator" command this way is most likely going
to result in an error, as you point out with your scenario.

But I like it! Even if it is going to put a nail through my foot sometimes :)

sleb...@gmail.com

Jul 1, 2008, 9:25:52 PM
On Jul 1, 7:44 pm, Kaitzschu
<kaitzs...@kaitzschu.cjb.net.nospam.plz.invalid> wrote:

> On Tue, 1 Jul 2008, dave.joub...@googlemail.com wrote:
> > On Jun 30, 12:35 pm, Kaitzschu
> >> set tl [lrange [split [read $in] \n] end-[incr n -1] end]
>
> > As an aside, I found myself shocked at the thought that one could use
> > incr that way!
>
> <snip>

>
> There has been some tangential discussion using "inline return values" on
> tclcore mailing list regarding [chan pipe], but I guess this doesn't
> really apply since using a "mutator" command this way is most likely going
> to result in an error, as you point out with your scenario.
>
> But I like it! Even if it is going to put a nail to my foot sometimes :)
>

Ah yes, for "casual" programming tasks I also like (ab)using return
values of commands. I especially like that [set] always returns a
value. My favourite (ab)use is the "slurp" one-liner:

set data [read [set f [open $filename]]][close $f]

Of course, in production code it's not a good idea to do this.

Kaitzschu

Jul 2, 2008, 7:43:55 AM

That is still very readable, and the only thing that can confuse a reader is
if (s)he doesn't know that [close] returns an empty string (which is phased
out). But once we have a -unique -sorted $list to which an $item needs to be
added while ${exclude}ing another, and which must still be kept -unique
-sorted in case $item already was there, then we have one ugly one-liner:

[code]
set list [lreplace [set list [lsort -unique [lappend list $item]]] [set ind [lsearch -sorted $list $exclude]] $ind]
[/code]

..which mutates list once, twice, three times (plus $ind is calculated in
the mist). And it does different things on each pass (which, obviously, is
needed, since otherwise there would be no need to mutate the value and the
return value could be used directly; true, the list is [lappend]ed for
nothing, but [linsert] is uglier).

[code]
lappend list $item
set list [lsort -unique $list]
set ind [lsearch -sorted $list $exclude]
set list [lreplace $list $ind $ind]
# lreplace will bail out when $ind isn't in range (-1 = not found)
# woohoo, 4-fold increase in LOC! plus vague documentation!
[/code]

> Of course, in production code it's not a good idea to do this.

So _that_ must be the reason I'm not being considered a productive member
of society :P
