I have a file which is very large (e.g. over 200 MB), and I am going to use
Python to code a "tail" command to get the last few lines of the file.
What is a good algorithm for this kind of task in Python for very big files?

Initially I thought of reading everything from the file into an array and
just getting the last few elements (lines), but since it's a very big file,
I don't think that is efficient.

Thanks
So the simplest is to just read and drop lines until you reach the ones you want:

for x in f:
    if x_is_what_I_want: something

If you really want, you can do the reverse lookup like this:

f.seek(0, 2)  # 2 is os.SEEK_END; a bare EOF name isn't defined in Python
x = f.tell()

then loop byte by byte backward until you find your stuff. This is quite
cumbersome and may not be faster, depending on your content.
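A minimal sketch of that backward byte-by-byte scan (my own illustration of the idea, not the poster's code; the function name and details are assumptions):

```python
import os

def tail_bytewise(filename, nlines):
    """Scan backward one byte at a time, counting newlines."""
    with open(filename, 'rb') as f:
        f.seek(0, os.SEEK_END)
        size = pos = f.tell()
        if size == 0:
            return b''
        # A trailing newline terminates the last line; don't count it.
        f.seek(size - 1)
        if f.read(1) == b'\n':
            pos -= 1
        seen = 0
        while pos > 0:
            pos -= 1
            f.seek(pos)
            if f.read(1) == b'\n':
                seen += 1
                if seen == nlines:
                    break  # file position is now just past this newline
        else:
            f.seek(0)  # fewer than nlines lines: return the whole file
        return f.read()
```

As the poster says, this does one seek and one read per byte, so it is correct but slow; it mainly illustrates the bookkeeping.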
Well, 200 MB isn't all that big these days. But it's easy to code:

# untested code
input = open(filename)
tail = input.readlines()[-tailcount:]   # negative slice: keep the *last* lines
input.close()

and you're done. However, it will go through a lot of memory. Fastest
is probably working through it backwards, but that may take multiple
tries to get everything you want:
# untested code
input = open(filename)
input.seek(0, 2)                    # 2 is os.SEEK_END
filesize = input.tell()
blocksize = tailcount * expected_line_length
tail = []
while len(tail) < tailcount:
    blocksize = min(blocksize, filesize)   # can't seek before the start
    input.seek(filesize - blocksize)
    tail = input.read().split('\n')
    if blocksize == filesize:
        break                       # whole file read; nothing more to find
    blocksize *= 2
input.close()
tail = tail[-tailcount:]            # last lines, not first
It would probably be more efficient to read blocks backwards and paste
them together, but I'm not going to get into that.
<mike
--
Mike Meyer <m...@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
The following code snippet may be useful:
reportFile = open( filename )
length = os.fstat( reportFile.fileno() ).st_size
if length == 0:
    # Don't map zero length files, windows will barf
    continue    # (this snippet sits inside a loop over files)
try:
    mapping = mmap.mmap( reportFile.fileno(), length,
                         mmap.MAP_PRIVATE, mmap.PROT_READ )
except AttributeError:
    # Windows lacks MAP_PRIVATE/PROT_READ; fall back to ACCESS_READ
    mapping = mmap.mmap(
        reportFile.fileno(),
        0, None,
        mmap.ACCESS_READ )
Then you can use

mapping.rfind( os.linesep )

to find the end of the second-to-last line, and so on.
This is very fast, because nearly all the work is done by rfind, which
is implemented in C, and by the OS's paging logic.
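On Python versions where mmap actually has an rfind method (it was added in 2.6), the repeated-rfind idea can be sketched like this; the function name and structure are my own illustration, not Gerald's code:

```python
import mmap
import os

def tail_rfind(filename, nlines):
    """Step back over the last nlines newlines with mmap.rfind."""
    with open(filename, 'rb') as f:
        size = os.fstat(f.fileno()).st_size
        if size == 0:
            return b''  # don't map zero-length files
        m = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)
        try:
            end = size
            if m[size - 1:size] == b'\n':
                end -= 1  # ignore the newline terminating the last line
            pos = end
            for _ in range(nlines):
                pos = m.rfind(b'\n', 0, pos)
                if pos < 0:
                    return m[:size]  # fewer than nlines lines: whole file
            return m[pos + 1:size]
        finally:
            m.close()
```

Note this searches for b'\n' rather than os.linesep, for the reason Bengt raises further down: files with Unix endings won't match a Windows os.linesep.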
HTH,
Gerald
bon...@gmail.com wrote:
First of all, what makes you think that tail on your system isn't
already optimized for this? Devil's advocate here... I have no clue
really.

Anyway, if you must roll your own:

Determine some reasonable max line size, multiply this by a few more
than the number of lines that you want to tail, seek to EOF and then
back to the position in the file where this chunk would start, read and
split that chunk into lines, and now output the last 3 or however many
you need.

If the read fails to hold enough lines, seek back a bit further and do
the same, but you'll have to be prepared to concat the second and
Nth-last chunks together.
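That recipe, sketched out (only lightly tested; the function name, the default max line length, and the exact doubling fallback are my own guesses at the intent):

```python
import os

def tail_chunked(filename, nlines, max_line_len=200):
    """Seek back a guessed chunk from EOF; double it until enough lines appear."""
    with open(filename, 'rb') as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        chunk = max_line_len * nlines
        while True:
            chunk = min(chunk, size)
            f.seek(size - chunk)
            data = f.read(chunk)
            if data.endswith(b'\n'):
                data = data[:-1]  # a trailing newline isn't an extra line
            lines = data.split(b'\n')
            # lines[0] may be a partial line unless the whole file was read,
            # so demand one spare line before trusting the split.
            if len(lines) > nlines or chunk == size:
                break
            chunk *= 2
        return lines[-nlines:]
```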
Have fun
--
-------------------------------------------------------------------------------
Jerry Sievers   305 854-3001 (home)     WWW ECommerce Consultant
                305 321-1144 (mobile)   http://www.JerrySievers.com/
Excellent idea.
It'll blow up for large (>2 GB) files on a 32-bit OS, though.
--
Nick Craig-Wood <ni...@craig-wood.com> -- http://www.craig-wood.com/nick
Actually mmap doesn't appear to have an rfind method :-(
Here is a tested solution using mmap, based on your code. It is
inefficient if the number of lines to be tailed is too big.
import os
import sys
import mmap

def main(nlines, filename):
    reportFile = open( filename )
    length = os.fstat( reportFile.fileno() ).st_size
    if length == 0:
        # Don't map zero length files, windows will barf
        return
    try:
        mapping = mmap.mmap( reportFile.fileno(), length,
                             mmap.MAP_PRIVATE, mmap.PROT_READ )
    except AttributeError:
        mapping = mmap.mmap(
            reportFile.fileno(),
            0, None,
            mmap.ACCESS_READ )
    search = 1024
    lines = []
    while 1:
        if search > length:
            search = length
        tail = mapping[length-search:]
        lines = tail.split(os.linesep)
        if len(lines) >= nlines or search == length:
            break
        search *= 2
    lines = lines[-nlines-1:]
    print "\n".join(lines)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print "Syntax: %s n file" % sys.argv[0]
    else:
        main(int(sys.argv[1]), sys.argv[2])
Ok, I'll have a go (only tested slightly ;-)

>>> def frsplit(fname, nitems=10, splitter='\n', chunk=8192):
...     f = open(fname, 'rb')
...     f.seek(0, 2)
...     bufpos = f.tell()  # pos from file beg == file length
...     buf = ['']
...     for nl in xrange(nitems):
...         while len(buf) < 2:
...             chunk = min(chunk, bufpos)
...             bufpos = bufpos - chunk
...             f.seek(bufpos)
...             buf = (f.read(chunk) + buf[0]).split(splitter)
...             if buf == ['']: break
...             if bufpos == 0: break
...         if len(buf) > 1: yield buf.pop(); continue
...         if bufpos == 0: yield buf.pop(); break
...
20 lines from the tail of November's python-dev archive:
>>> print '\n'.join(reversed(list(frsplit(r'v:\temp\clp\2005-November.txt', 20))))
lives in the mimelib project's hidden CVS on SF, but that seems pretty
silly.
Basically I'm just going to add the test script, setup.py, generated
html docs and a few additional unit tests, along with svn:external refs
to pull in Lib/email from the appropriate Python svn tree. This way,
I'll be able to create standalone email packages from the sandbox (which
I need to do because I plan on fixing a few outstanding email bugs).
-Barry
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 307 bytes
Desc: This is a digitally signed message part
Url : http://mail.python.org/pipermail/python-dev/attachments/20051130/e88db51d/attachment.pgp
You might want to throw away the first item returned by frsplit, unless it
is != '' (indicating a last line with no \n). Splitting on os.linesep is a
problematic default, since e.g. it wouldn't work with the above archive:
it has Unix line endings, and I didn't download it in a manner that would
convert them.
Regards,
Bengt Richter
To read the last x bytes of a file, you could do:

>>> import os
>>> x = 2000  # or whatever...
>>> f = open('my_big_file', 'rb')
>>> l = os.fstat(f.fileno()).st_size
>>> f.seek(l - x)
>>> f.read()

Maybe that's a start. I didn't try it on anything bigger than 16 MB,
but it was more or less instantaneous for 16 megs.
If you want the last X lines and know that lines are no more than N
chars, f.seek(l - X*N); f.readlines()[-X:] should give you what you
need... (I think... I didn't test it.)
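Spelled out and actually tested (binary mode keeps the byte arithmetic honest; the max line length N is an assumption the caller must supply, and the function name is mine):

```python
import os

def tail_oneshot(filename, X, N=200):
    """Last X lines, assuming no line is longer than N bytes."""
    with open(filename, 'rb') as f:
        size = os.fstat(f.fileno()).st_size
        f.seek(max(size - X * N, 0))
        # readlines() from here may start with a partial line,
        # but the [-X:] slice discards it when enough lines follow.
        return f.readlines()[-X:]
```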
This would appear to give you a way of working with just the last, say, 1 MB.
Colin W.
You don't need fstat/st_size; you can ask seek to move to an offset
relative to the end of the file:

>>> x = 2000
>>> f = open('my_big_file', 'rb')   # binary mode, so seeking from the end is safe
>>> f.seek(-x, 2)
>>> f.read()