I have a file which is very large (e.g. over 200 MB), and I am going to use
Python to code a "tail" command to get the last few lines of the file.
What is a good algorithm for this kind of task in Python for very big files?

Initially I thought of reading everything from the file into an array and
just getting the last few elements (lines), but since it's a very big file,
I don't think that is efficient.

Thanks
So the simplest is to just read and drop lines until you reach the ones you want:

for x in f:
    if x_is_what_I_want: something

If you really want, you can do the reverse lookup like this:

f.seek(0, 2)  # 2 is os.SEEK_END; a bare EOF name isn't defined in Python
x = f.tell()

then loop byte by byte backward until you find your stuff. This is quite
cumbersome and may not be faster, depending on your content.
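A minimal sketch of that backward byte-by-byte scan (my own illustration of the idea, not the poster's code; the function name and details are assumptions):

```python
import os

def tail_bytewise(filename, nlines):
    """Scan backward one byte at a time, counting newlines."""
    with open(filename, 'rb') as f:
        f.seek(0, os.SEEK_END)
        size = pos = f.tell()
        if size == 0:
            return b''
        # A trailing newline terminates the last line; don't count it.
        f.seek(size - 1)
        if f.read(1) == b'\n':
            pos -= 1
        seen = 0
        while pos > 0:
            pos -= 1
            f.seek(pos)
            if f.read(1) == b'\n':
                seen += 1
                if seen == nlines:
                    break  # file position is now just past this newline
        else:
            f.seek(0)  # fewer than nlines lines: return the whole file
        return f.read()
```

As the poster says, this does one seek and one read per byte, so it is correct but slow; it mainly illustrates the bookkeeping.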
Well, 200 MB isn't all that big these days. But it's easy to code:

# untested code
input = open(filename)
tail = input.readlines()[-tailcount:]   # negative slice: keep the *last* lines
input.close()

and you're done. However, it will go through a lot of memory. Fastest
is probably working through it backwards, but that may take multiple
tries to get everything you want:
# untested code
input = open(filename)
input.seek(0, 2)                    # 2 is os.SEEK_END
filesize = input.tell()
blocksize = tailcount * expected_line_length
tail = []
while len(tail) < tailcount:
    blocksize = min(blocksize, filesize)   # can't seek before the start
    input.seek(filesize - blocksize)
    tail = input.read().split('\n')
    if blocksize == filesize:
        break                       # whole file read; nothing more to find
    blocksize *= 2
input.close()
tail = tail[-tailcount:]            # last lines, not first
It would probably be more efficient to read blocks backwards and paste
them together, but I'm not going to get into that.
<mike
--
Mike Meyer <m...@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
The following code snippet may be useful:
reportFile = open( filename )
length = os.fstat( reportFile.fileno() ).st_size
if length == 0:
    # Don't map zero length files, windows will barf
    continue    # (this snippet sits inside a loop over files)
try:
    mapping = mmap.mmap( reportFile.fileno(), length,
                         mmap.MAP_PRIVATE, mmap.PROT_READ )
except AttributeError:
    # Windows lacks MAP_PRIVATE/PROT_READ; fall back to ACCESS_READ
    mapping = mmap.mmap(
        reportFile.fileno(),
        0, None,
        mmap.ACCESS_READ )
Then you can use

mapping.rfind( os.linesep )

to find the end of the second-to-last line, and so on.
This is very fast, because nearly all the work is done by rfind, which
is implemented in C, and by the OS's paging logic.
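On Python versions where mmap actually has an rfind method (it was added in 2.6), the repeated-rfind idea can be sketched like this; the function name and structure are my own illustration, not Gerald's code:

```python
import mmap
import os

def tail_rfind(filename, nlines):
    """Step back over the last nlines newlines with mmap.rfind."""
    with open(filename, 'rb') as f:
        size = os.fstat(f.fileno()).st_size
        if size == 0:
            return b''  # don't map zero-length files
        m = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)
        try:
            end = size
            if m[size - 1:size] == b'\n':
                end -= 1  # ignore the newline terminating the last line
            pos = end
            for _ in range(nlines):
                pos = m.rfind(b'\n', 0, pos)
                if pos < 0:
                    return m[:size]  # fewer than nlines lines: whole file
            return m[pos + 1:size]
        finally:
            m.close()
```

Note this searches for b'\n' rather than os.linesep, for the reason Bengt raises further down: files with Unix endings won't match a Windows os.linesep.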
HTH,
Gerald
bon...@gmail.com wrote:
First of all, what makes you think that tail on your system isn't
already optimized for this? Devil's advocate here... I have no clue
really.

Anyway, if you must roll your own:

Determine some reasonable max line size, multiply this by a few more
than the number of lines that you want to tail, seek to EOF and then
back to the position in the file where this chunk would start, read and
split that chunk into lines, and now output the last 3 or however many
you need.

If the read fails to hold enough lines, seek back a bit further and do
the same, but you'll have to be prepared to concat the second and
Nth-last chunks together.
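That recipe, sketched out (only lightly tested; the function name, the default max line length, and the exact doubling fallback are my own guesses at the intent):

```python
import os

def tail_chunked(filename, nlines, max_line_len=200):
    """Seek back a guessed chunk from EOF; double it until enough lines appear."""
    with open(filename, 'rb') as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        chunk = max_line_len * nlines
        while True:
            chunk = min(chunk, size)
            f.seek(size - chunk)
            data = f.read(chunk)
            if data.endswith(b'\n'):
                data = data[:-1]  # a trailing newline isn't an extra line
            lines = data.split(b'\n')
            # lines[0] may be a partial line unless the whole file was read,
            # so demand one spare line before trusting the split.
            if len(lines) > nlines or chunk == size:
                break
            chunk *= 2
        return lines[-nlines:]
```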
Have fun
--
-------------------------------------------------------------------------------
Jerry Sievers   305 854-3001 (home)     WWW ECommerce Consultant
                305 321-1144 (mobile)   http://www.JerrySievers.com/
Excellent idea.
It'll blow up for large (>2 GB) files on a 32-bit OS, though.
--
Nick Craig-Wood <ni...@craig-wood.com> -- http://www.craig-wood.com/nick
Actually mmap doesn't appear to have an rfind method :-(
Here is a tested solution using mmap, based on your code. It is
inefficient if the number of lines to be tailed is too big.
import os
import sys
import mmap

def main(nlines, filename):
    reportFile = open( filename )
    length = os.fstat( reportFile.fileno() ).st_size
    if length == 0:
        # Don't map zero length files, windows will barf
        return
    try:
        mapping = mmap.mmap( reportFile.fileno(), length,
                             mmap.MAP_PRIVATE, mmap.PROT_READ )
    except AttributeError:
        mapping = mmap.mmap(
            reportFile.fileno(),
            0, None,
            mmap.ACCESS_READ )
    search = 1024
    lines = []
    while 1:
        if search > length:
            search = length
        tail = mapping[length-search:]
        lines = tail.split(os.linesep)
        if len(lines) >= nlines or search == length:
            break
        search *= 2
    lines = lines[-nlines-1:]
    print "\n".join(lines)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print "Syntax: %s n file" % sys.argv[0]
    else:
        main(int(sys.argv[1]), sys.argv[2])
Ok, I'll have a go (only tested slightly ;-)

>>> def frsplit(fname, nitems=10, splitter='\n', chunk=8192):
...     f = open(fname, 'rb')
...     f.seek(0, 2)
...     bufpos = f.tell()  # pos from file beg == file length
...     buf = ['']
...     for nl in xrange(nitems):
...         while len(buf) < 2:
...             chunk = min(chunk, bufpos)
...             bufpos = bufpos - chunk
...             f.seek(bufpos)
...             buf = (f.read(chunk) + buf[0]).split(splitter)
...             if buf == ['']: break
...             if bufpos == 0: break
...         if len(buf) > 1: yield buf.pop(); continue
...         if bufpos == 0: yield buf.pop(); break
...
20 lines from the tail of November's python-dev archive:
>>> print '\n'.join(reversed(list(frsplit(r'v:\temp\clp\2005-November.txt', 20))))
lives in the mimelib project's hidden CVS on SF, but that seems pretty
silly.
Basically I'm just going to add the test script, setup.py, generated
html docs and a few additional unit tests, along with svn:external refs
to pull in Lib/email from the appropriate Python svn tree. This way,
I'll be able to create standalone email packages from the sandbox (which
I need to do because I plan on fixing a few outstanding email bugs).
-Barry
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 307 bytes
Desc: This is a digitally signed message part
Url : http://mail.python.org/pipermail/python-dev/attachments/20051130/e88db51d/attachment.pgp
You might want to throw away the first item returned by frsplit, unless it
is != '' (indicating a last line with no \n). Splitting on os.linesep is a
problematic default, since e.g. it wouldn't work with the above archive:
it has Unix line endings, and I didn't download it in a manner that would
convert them.
Regards,
Bengt Richter
To read the last x bytes of a file, you could do:

>>> import os
>>> x = 2000  # or whatever...
>>> f = open('my_big_file', 'rb')
>>> l = os.fstat(f.fileno()).st_size
>>> f.seek(l - x)
>>> f.read()

Maybe that's a start. I didn't try it on anything bigger than 16 MB,
but it was more or less instantaneous for 16 megs.
If you want the last X lines and know that lines are no more than N
chars, f.seek(l - X*N); f.readlines()[-X:] should give you what you
need... (I think... I didn't test it.)
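Spelled out and actually tested (binary mode keeps the byte arithmetic honest; the max line length N is an assumption the caller must supply, and the function name is mine):

```python
import os

def tail_oneshot(filename, X, N=200):
    """Last X lines, assuming no line is longer than N bytes."""
    with open(filename, 'rb') as f:
        size = os.fstat(f.fileno()).st_size
        f.seek(max(size - X * N, 0))
        # readlines() from here may start with a partial line,
        # but the [-X:] slice discards it when enough lines follow.
        return f.readlines()[-X:]
```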
This would appear to give you a way of working with just the last, say, 1 MB.
Colin W.
You don't need fstat/st_size; you can ask seek to move to an offset
relative to the end of the file:

>>> x = 2000
>>> f = open('my_big_file', 'rb')   # binary mode, so seeking from the end is safe
>>> f.seek(-x, 2)
>>> f.read()