first, second, etc line of text file


Daniel Nogradi

Jul 25, 2007, 3:44:39 PM
to pytho...@python.org
A very simple question: I currently use a cumbersome-looking way of
getting the first, second, etc. line of a text file:

for i, line in enumerate( open( textfile ) ):
    if i == 0:
        print 'First line is: ' + line
    elif i == 1:
        print 'Second line is: ' + line
    .......
    .......

I thought about f = open( textfile ) and then f[0], f[1], etc but that
throws a TypeError: 'file' object is unsubscriptable.

Is there a simpler way?

Jeff

Jul 25, 2007, 4:00:28 PM
Files should be iterable on their own:

filehandle = open('/path/to/foo.txt')
for line in filehandle:
    pass  # do something with line...

You could also call lines = filehandle.readlines(), which returns a
list of all lines in the file, but that holds the whole file in memory
at once.

George Sakkis

Jul 25, 2007, 4:12:30 PM

If all you need is sequential access, you can use the next() method of
the file object:

nextline = open(textfile).next
print 'First line is: %r' % nextline()
print 'Second line is: %r' % nextline()
...

For random access, the easiest way is to slurp all the file in a list
using file.readlines().
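
A minimal sketch of the sequential approach, written with the next()
builtin (Python 2.6 and later) instead of the .next() method, and with a
default so a short file does not raise StopIteration. The name
read_first_two and the sample text are made up for illustration.

```python
import io

def read_first_two(lines):
    """Return the first two lines from an iterator of lines (None if missing)."""
    first = next(lines, None)    # the default avoids StopIteration on a short file
    second = next(lines, None)
    return first, second

# Demo on an in-memory file; an object returned by open(...) iterates the same way.
sample = io.StringIO(u"alpha\nbeta\ngamma\n")
print(read_first_two(sample))
```

Only the lines actually requested are read; the rest of the file is left
untouched.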

HTH,
George

Jeff McNeil

Jul 25, 2007, 4:13:49 PM
to Daniel Nogradi, pytho...@python.org
Depending on the size of your file, you can just use file.readlines.
Note that file.readlines is going to read the entire file into memory,
so don't use it on your plain-text version of War and Peace.

>>> f = open("/etc/passwd")
>>> lines = f.readlines()
>>> lines[5]
'# lookupd DirectoryServices \n'
>>>

You can also check out the fileinput module. That ought to be slightly
more efficient and provides some additional functionality. I think
there are some restrictions on accessing lines out of order, though.
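
A rough sketch of what the fileinput route could look like, assuming you
only need to walk the file front to back. fileinput.FileInput and its
lineno() method are part of the standard library; numbered_lines, the
temporary file and its contents are invented for the example.

```python
import fileinput
import os
import tempfile

def numbered_lines(path, wanted):
    """Return {line_number: text} for the 1-based line numbers in `wanted`."""
    found = {}
    stream = fileinput.FileInput(files=[path])
    try:
        for line in stream:
            n = stream.lineno()      # 1-based count of lines read so far
            if n in wanted:
                found[n] = line
    finally:
        stream.close()
    return found

# Demo on a throwaway file (hypothetical contents).
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as out:
    out.write("first\nsecond\nthird\n")
print(numbered_lines(demo_path, set([1, 3])))
os.remove(demo_path)
```

Access is strictly sequential here: lines come back in file order, and
there is no way to jump backwards without reopening the file.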

-Jeff

On 7/25/07, Daniel Nogradi <nog...@gmail.com> wrote:
> A very simple question: I currently use a cumbersome-looking way of
> getting the first, second, etc. line of a text file:
>
> for i, line in enumerate( open( textfile ) ):
>     if i == 0:
>         print 'First line is: ' + line
>     elif i == 1:
>         print 'Second line is: ' + line
>     .......
>     .......
>
> I thought about f = open( textfile ) and then f[0], f[1], etc but that
> throws a TypeError: 'file' object is unsubscriptable.
>
> Is there a simpler way?


Daniel Nogradi

Jul 25, 2007, 4:33:36 PM
to pytho...@python.org
Thanks all! I think I will stick to my original method because the
files can be quite large, and without reading the whole file into
memory, enumerate( open( textfile ) ) is probably the only way to
access an arbitrary Nth line.
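
The enumerate approach can be wrapped in a small helper so the loop
stops as soon as the wanted line turns up. This is only a sketch;
nth_line is a made-up name and it takes any iterable of lines, such as
an open file.

```python
def nth_line(lines, n):
    """Return line number n (0-based) from an iterable of lines, or None."""
    for i, line in enumerate(lines):
        if i == n:
            return line          # stop reading as soon as the line is found
    return None                  # the input had fewer than n + 1 lines

# Any iterable of lines works; a list stands in for a file here.
print(nth_line(["zero\n", "one\n", "two\n"], 1))
```

Nothing beyond line n is read, and no earlier line is kept around.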

Grant Edwards

Jul 25, 2007, 4:48:57 PM
On 2007-07-25, Jeff McNeil <je...@jmcneil.net> wrote:

> Depending on the size of your file, you can just use
> file.readlines. Note that file.readlines is going to read the
> entire file into memory, so don't use it on your plain-text
> version of War and Peace.

I don't think that would actually be a problem for any recent
machine.

The Project Gutenberg version of W&P is 3.1MB of text in 67403
lines. I just did an f.readlines() on it and it was pretty
much instantaneous, and the python interpreter instance that
contains that list of 67403 lines is using a bit less than 8MB
of RAM. An "empty" interpreter uses about 2.7MB. So, doing
f.readlines() on War and Peace requires a little over 5MB of RAM
-- not really much of a concern on any machine that's likely to
be running Python.
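
The arithmetic is roughly 8MB - 2.7MB = 5.3MB for 3.1MB of text, so about
2MB goes to per-string and list overhead. A rough way to estimate that
from inside Python is sketched below. sys.getsizeof (Python 2.6 and
later) reports object sizes only, not allocator overhead, and the lines
are synthetic stand-ins sized to match the figures above, so treat the
result as a ballpark. list_footprint is a made-up name.

```python
import sys

def list_footprint(lines):
    """Rough bytes held by a list of strings: the list plus each string object."""
    return sys.getsizeof(lines) + sum(sys.getsizeof(s) for s in lines)

# Synthetic stand-in for the file: 67403 lines of 46 characters each
# (about 3.1MB in total). The contents are invented; only the sizes matter.
fake_lines = [("x" * 45) + "\n" for _ in range(67403)]
text_bytes = sum(len(s) for s in fake_lines)
print("text: %.1f MB, as list: %.1f MB"
      % (text_bytes / 1e6, list_footprint(fake_lines) / 1e6))
```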

--
Grant Edwards                   grante             Yow! Now I understand the
                                  at               meaning of "THE MOD SQUAD"!
                               visi.com

Jeff

Jul 25, 2007, 4:54:00 PM
Grant,

That might be a memory problem if you are running multiple processes
regularly, such as on a webserver.

Bjoern Schliessmann

Jul 25, 2007, 5:07:18 PM
Grant Edwards wrote:
> On 2007-07-25, Jeff McNeil <je...@jmcneil.net> wrote:

>> Depending on the size of your file, you can just use
>> file.readlines. Note that file.readlines is going to read the
>> entire file into memory, so don't use it on your plain-text
>> version of War and Peace.
>
> I don't think that would actually be a problem for any recent
> machine.
>
> The Project Gutenberg version of W&P is 3.1MB of text in 67403
> lines. I just did an f.readlines() on it and it was pretty
> much instantaneous, and the python interpreter instance that
> contains that list of 67403 lines is using a bit less than 8MB
> of RAM.

YMMD :)

Regards,


Björn

--
BOFH excuse #335:

the AA battery in the wallclock sends magnetic interference

Grant Edwards

Jul 25, 2007, 5:13:38 PM
On 2007-07-25, Jeff <jeff...@gmail.com> wrote:

> That might be a memory problem if you are running multiple processes
> regularly, such as on a webserver.

I suppose if you did it in 50 parallel processes, you could use
up 250MB of RAM. Still not a big deal on many servers. A
decent OS will swap regions that aren't being used to disk, so
it's likely not to be a problem.

If you're talking several hundred instances, you could start to
use up serious amounts of VM. Still, I say do it the simple,
obvious way first, and optimize it _after_ you've determined
you have a performance problem (and determined where the
bottleneck is). Premature optimization...

--
Grant Edwards                   grante             Yow! This PORCUPINE knows
                                  at               his ZIPCODE ... And he has
                               visi.com            "VISA"!!

James Stroud

Jul 25, 2007, 6:14:28 PM

This is the same logic but less cumbersome, if that's what you mean:

to_get = [0, 3, 7, 11, 13]
got = dict((i,s) for (i,s) in enumerate(open(textfile)) if i in to_get)
print got[3]

This would probably be the best way for really big files and if you know
all of the lines you want ahead of time. If you need to access the file
multiple times at arbitrary positions, you may need to seek(0), cache
lines already read, or slurp the whole thing, which has already been
suggested.
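
For repeated access at arbitrary positions, one variation on the caching
idea is to remember where each line starts instead of the lines
themselves, then seek() straight to the one you want. This is a
different technique from the dict above, sketched under the assumption
that the file is opened in binary mode so tell() and seek() work in
plain byte offsets. build_offsets and line_at are made-up names.

```python
import io

def build_offsets(f):
    """Scan a binary file once and return the byte offset of each line start."""
    offsets = []
    f.seek(0)
    while True:
        pos = f.tell()
        if not f.readline():     # empty result means end of file
            break
        offsets.append(pos)
    return offsets

def line_at(f, offsets, n):
    """Jump to the start of line n (0-based) and read just that line."""
    f.seek(offsets[n])
    return f.readline()

# Demo on an in-memory binary file; open(path, "rb") would behave the same.
data = io.BytesIO(b"zero\none\ntwo\nthree\n")
index = build_offsets(data)
print(line_at(data, index, 2))
print(line_at(data, index, 0))
```

The index costs one integer per line rather than the full text, at the
price of one complete pass over the file up front.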

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/

Jay Loden

Jul 25, 2007, 6:38:45 PM
to Grant Edwards, pytho...@python.org
Grant Edwards wrote:
> On 2007-07-25, Jeff <jeff...@gmail.com> wrote:
>
>> That might be a memory problem if you are running multiple processes
>> regularly, such as on a webserver.
>
> I suppose if you did it in 50 parallel processes, you could use
> up 250MB of RAM. Still not a big deal on many servers. A
> decent OS will swap regions that aren't being used to disk, so
> it's likely not to be a problem.

Or, you might be reading from a text file dramatically larger than a 3MB
copy of War and Peace. I regularly deal with log files that are many
times that size, including some well over a GB. Trust me, you don't want
to read in the entire file when it's a 1.5GB text file. It's true that
readlines() will often work fine, but there are certainly cases where it
isn't acceptable for memory and performance reasons.

-Jay

Paul Rubin

Jul 25, 2007, 7:50:44 PM
"Daniel Nogradi" <nog...@gmail.com> writes:

> A very simple question: I currently use a cumbersome-looking way of
> getting the first, second, etc. line of a text file:
>
> for i, line in enumerate( open( textfile ) ):
>     if i == 0:
>         print 'First line is: ' + line
>     elif i == 1:
>         print 'Second line is: ' + line
>     .......
>     .......

from itertools import islice
first_five_lines = list(islice(open(textfile), 5))

print 'first line is', first_five_lines[0]
print 'second line is', first_five_lines[1]
...
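
One property of islice worth noting: if the file has fewer lines than
requested, you simply get a shorter list and no exception. A small
sketch, with first_lines as a made-up name and an in-memory file
standing in for a real one:

```python
import io
from itertools import islice

def first_lines(lines, count):
    """Return up to `count` leading lines from an iterable of lines."""
    return list(islice(lines, count))

short_file = io.StringIO(u"only\ntwo lines\n")
head = first_lines(short_file, 5)
print(len(head))
```

Indexing the result still needs a length check, since head[4] would
raise IndexError for a two-line file.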

Gabriel Genellina

Jul 25, 2007, 9:56:53 PM
to pytho...@python.org
En Wed, 25 Jul 2007 19:14:28 -0300, James Stroud <jst...@mbi.ucla.edu>
escribió:

> Daniel Nogradi wrote:
>> A very simple question: I currently use a cumbersome-looking way of
>> getting the first, second, etc. line of a text file:
>

> to_get = [0, 3, 7, 11, 13]
> got = dict((i,s) for (i,s) in enumerate(open(textfile)) if i in to_get)
> print got[3]
>
> This would probably be the best way for really big files and if you know
> all of the lines you want ahead of time.

But it still has to read the complete file (although it does not keep the
unwanted lines).
Combining this with Paul Rubin's suggestion of itertools.islice I think we
get the best solution:


got = dict((i,s) for (i,s) in
           enumerate(islice(open(textfile),max(to_get)+1)) if i in to_get)
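
The same idea as a reusable function, as a sketch only. select_lines is
a made-up name; it takes any iterable of lines and turns to_get into a
set so each membership test is a hash lookup.

```python
from itertools import islice

def select_lines(lines, to_get):
    """Return {index: line} for the 0-based indexes in `to_get`."""
    wanted = set(to_get)
    if not wanted:
        return {}
    limit = max(wanted) + 1      # nothing past the last wanted line is read
    return dict((i, s) for i, s in enumerate(islice(lines, limit))
                if i in wanted)

# A generator stands in for an open file here.
source = ("line %d\n" % i for i in range(100))
print(select_lines(source, [0, 3, 7]))
```

Reading stops right after the highest requested line, so the remaining
lines of a large file are never touched.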

--
Gabriel Genellina

Daniel Nogradi

Jul 26, 2007, 8:46:02 AM
to pytho...@python.org

Thanks! This looks the best. I only need the first couple of lines
sequentially, so I never need to read in the whole file.

Neil Cerutti

Jul 26, 2007, 9:23:15 AM
On 2007-07-25, George Sakkis <george...@gmail.com> wrote:
> For random access, the easiest way is to slurp all the file in
> a list using file.readlines().

A lazy evaluation scheme might be useful for random access that
only slurps as much as you need.

class LazySlurper(object):
    r""" Lazily read a file using readline, allowing random access to the
    results with __getitem__.

    >>> import StringIO
    >>> infile = StringIO.StringIO(
    ...     "Line 0\n"
    ...     "Line 1\n"
    ...     "Line 2\n"
    ...     "Line 3\n"
    ...     "Line 4\n"
    ...     "Line 5\n"
    ...     "Line 6\n"
    ...     "Line 7\n")
    >>> slurper = LazySlurper(infile)
    >>> print slurper[0],
    Line 0
    >>> print slurper[5],
    Line 5
    >>> print slurper[1],
    Line 1
    >>> infile.close()
    """
    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.upto = 0
        self.lines = []
        self._readupto(0)
    def _readupto(self, n):
        while self.upto <= n:
            line = self.fileobj.readline()
            if line == "":
                break
            self.lines.append(line)
            self.upto += 1
    def __getitem__(self, n):
        self._readupto(n)
        return self.lines[n]

--
Neil Cerutti
Eddie Robinson is about one word: winning and losing. --Eddie Robinson's agent
Paul Collier

Mike

Jul 26, 2007, 9:42:26 AM

if you only ever need the first few lines of a file, why not keep it
simple and do something like this?

mylines = open("c:\\myfile.txt","r").readlines()[:5]

that will give you the first five lines of the file. Replace 5 with
whatever number you need. next will work, too, obviously, but won't
that use of next hold the file open until you are done with it? Or,
more specifically, since you do not have a file object at all, won't
you have to wait until the function goes out of scope to release the
file? Would that be a problem? Or am I just being paranoid?
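
On the question of the file staying open: one way to sidestep it is a
with block, which closes the file as soon as the block ends (built in
from Python 2.6; 2.5 needs from __future__ import with_statement). The
sketch below pairs it with itertools.islice so only the leading lines
are read. The name head, the temporary file and its contents are
invented for the example.

```python
from itertools import islice
import os
import tempfile

def head(path, count=5):
    """Return up to `count` leading lines; the file is closed before returning."""
    with open(path) as f:
        return list(islice(f, count))

# Demo on a throwaway file (hypothetical contents).
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as out:
    out.write("".join("row %d\n" % i for i in range(1000)))
print(head(demo_path, 3))
os.remove(demo_path)
```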

Steve Holden

Jul 26, 2007, 12:32:16 PM
to pytho...@python.org
Unfortunately the expression

f.readlines()[:5]

reads the whole file in and generates a list of the lines just so it can
slice the first five off. Compare that, on a large file, with something like

[f.next() for _ in range(5)]

and I think you will see that the latter is significantly better in
almost all respects.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
--------------- Asciimercial ------------------
Get on the web: Blog, lens and tag the Internet
Many services currently offer free registration
----------- Thank You for Reading -------------

Scott David Daniels

Apr 23, 2009, 5:50:06 PM

or even faster:
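from itertools import islice
wanted = set([0, 3, 7, 11, 13])
with open(textfile) as src:
    # start enumerate at min(wanted) so keys stay real line numbers
    got = dict((i, s) for (i, s) in enumerate(islice(src,
                        min(wanted), max(wanted) + 1), min(wanted))
               if i in wanted)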
Of course that could just as efficiently create a list as a dict.
Note that using a list rather than a set for wanted takes len(wanted)
comparisons on misses, and about len(wanted)/2 on hits, but most likely
a single comparison for a set whether it is a hit or a miss.
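
One detail to watch when islice is given a start offset: enumerate
counts from 0 unless told otherwise, so the keys only line up with real
line numbers if enumerate starts at the same offset. A sketch under that
assumption, with select_range as a made-up name (the start argument to
enumerate needs Python 2.6 or later):

```python
from itertools import islice

def select_range(lines, wanted):
    """Return {index: line} for 0-based indexes, skipping ahead to the first one."""
    wanted = set(wanted)
    if not wanted:
        return {}
    lo, hi = min(wanted), max(wanted)
    # enumerate starts at lo so each key is the line's true position
    return dict((i, s) for i, s in enumerate(islice(lines, lo, hi + 1), lo)
                if i in wanted)

# A generator stands in for an open file here.
source = ("line %d\n" % i for i in range(20))
print(select_range(source, [3, 7, 11]))
```

When line 0 is among the wanted lines, as in the example above, the
offset is 0 and the two forms give the same result.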

--Scott David Daniels
Scott....@Acm.Org

Gabriel Genellina

Apr 24, 2009, 1:12:08 PM
to pytho...@python.org
En Thu, 23 Apr 2009 18:50:06 -0300, Scott David Daniels
<Scott....@acm.org> escribió:

> Gabriel Genellina wrote:
>> En Wed, 25 Jul 2007 19:14:28 -0300, James Stroud <jst...@mbi.ucla.edu>
>> escribió:

[nice recipe to retrieve only certain lines of a file]

I think your time machine needs an adjustment, it spits things almost two
years later :)

--
Gabriel Genellina
