
Python 3 read() function


Cro

Dec 4, 2008, 11:13:41 AM
to
Good day.
I have installed Python 3 and I have a problem with the built-in read() function.

[code]
huge = open ( 'C:/HUGE_FILE.pcl', 'rb', 0 )
import io
vContent = io.StringIO()
vContent = huge.read() # This line takes hours to process !!!
vSplitContent = vContent.split( 'BIN;SP1;PW0.3,1;PA100,700;PD625,700;PU;' ) # This one I have never tried...
[/code]

The same thing, in Python 2.5 :

[code]
huge = open ( 'C:/HUGE_FILE.pcl', 'rb', 0 )
import StringIO
vContent = StringIO.StringIO()
vContent = huge.read() # This line takes 2 seconds !!!
vSplitContent = vContent.split( 'BIN;SP1;PW0.3,1;PA100,700;PD625,700;PU;' ) # This takes a few seconds...
[/code]

My "HUGE_FILE" is about 900 MB...
I know this is not the best method to open the file and split the content on that marker...
Can anyone please suggest a good method to split the file on that marker very fast, in Python 3?
Memory is not a concern for me; I have 4 GB of RAM and rarely use more than 300 MB of it.
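For reference, one way to avoid a single giant read() is to scan the file in bounded chunks and split incrementally as you go. A minimal sketch, using the delimiter from the code above; the helper name and chunk size are illustrative, not anything from this thread:

```python
def split_stream(path, delim, chunk_size=1 << 20):
    """Split a binary file on `delim` while reading it in bounded chunks."""
    parts = []
    buf = b''
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            pieces = buf.split(delim)
            parts.extend(pieces[:-1])  # complete pieces are done
            buf = pieces[-1]           # tail may hold a partial delimiter
    parts.append(buf)                  # whatever remains after the last delimiter
    return parts
```

Because the unsplit tail is carried over between chunks, a delimiter that straddles a chunk boundary is still found on the next iteration.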

Thank you very very much.

Christian Heimes

Dec 4, 2008, 11:48:27 AM
to pytho...@python.org
Cro wrote:
> Good day.
> I have installed Python 3 and i have a problem with the builtin read()
> function.
>
> [code]
> huge = open ( 'C:/HUGE_FILE.pcl', 'rb', 0 )
> import io
> vContent = io.StringIO()
> vContent = huge.read() # This line takes hours to process !!!
> vSplitContent = vContent.split( 'BIN;SP1;PW0.3,1;PA100,700;PD625,700;PU;' ) # This one I have never tried...
> [/code]

Do you really mean io.StringIO? I guess you want io.BytesIO() ..

Christian

Cro

Dec 4, 2008, 11:57:31 AM
to
> Do you really mean io.StringIO? I guess you want io.BytesIO() ..
>
> Christian

Mmm... I don't know.
I also tried:

[code]
IDLE 3.0
>>> import io
>>> vContent = io.BytesIO()
>>> huge = io.open("C:\HUGE_FILE.pcl",'r+b',0)
>>> vContent = huge.read()
[/code]

It still takes very long... I don't have the patience to wait for the file to load completely!

Thank you for your reply.

Jerry Hill

Dec 4, 2008, 12:00:49 PM
to Christian Heimes, pytho...@python.org
On Thu, Dec 4, 2008 at 11:48 AM, Christian Heimes <li...@cheimes.de> wrote:
> Cro wrote:
>> vContent = io.StringIO()
>> vContent = huge.read() # This line takes hours to process !!!
>
> Do you really mean io.StringIO? I guess you want io.BytesIO() ..

I don't think it matters. Here's a quick comparison between 2.5 and
3.0 on a relatively small 17 meg file:

C:\>c:\Python30\python -m timeit -n 1
"open('C:\\work\\temp\\bppd_vsub.csv', 'rb').read()"
1 loops, best of 3: 36.8 sec per loop

C:\>c:\Python25\python -m timeit -n 1
"open('C:\\work\\temp\\bppd_vsub.csv', 'rb').read()"
1 loops, best of 3: 33 msec per loop

That's 3 orders of magnitude slower on python3.0!

--
Jerry

Chris Rebert

Dec 4, 2008, 12:42:31 PM
to Cro, pytho...@python.org
On Thu, Dec 4, 2008 at 8:57 AM, Cro <pra...@gmail.com> wrote:
>> Do you really mean io.StringIO? I guess you want io.BytesIO() ..
>>
>> Christian
>
> Mmm... i don't know.
> I also tried :
>
> [code]
> IDLE 3.0
>>>> import io
>>>> vContent = io.BytesIO()

You do realize that the previous line is completely pointless, right?
Later you rebind vContent to the results of huge.read() without ever
having used it between that line and the above line.

Cheers,
Chris

--
Follow the path of the Iguana...
http://rebertia.com

>>>> huge = io.open("C:\HUGE_FILE.pcl",'r+b',0)
>>>> vContent = huge.read()
> [/code]
>
> It still waits a lot... i don't have the patience to wait for the file
> to load completely... it takes a lot!
>
> Thank you for your reply.

> --
> http://mail.python.org/mailman/listinfo/python-list
>

sk...@pobox.com

Dec 4, 2008, 12:49:34 PM
to Cro, pytho...@python.org

>>> huge = io.open("C:\HUGE_FILE.pcl",'r+b',0)

Why do you want to disable buffering? From the io.open help:

open(file, mode='r', buffering=None, encoding=None, errors=None,
newline=None, closefd=True)
Open file and return a stream. Raise IOError upon failure.
...
buffering is an optional integer used to set the buffering policy. By
default full buffering is on. Pass 0 to switch buffering off (only
allowed in binary mode), 1 to set line buffering, and an integer > 1
for full buffering.

I think you will get better performance if you open the file without the
third arg:

huge = io.open("C:\HUGE_FILE.pcl",'r+b')
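A quick way to see what the third argument changes is to check what io.open actually hands back. A small sketch against a throwaway temp file (the file itself is illustrative):

```python
import io
import os
import tempfile

# With buffering=0 you get the raw, unbuffered stream; with the default
# you get a BufferedReader wrapper, which batches OS-level reads.
fd, path = tempfile.mkstemp()
os.write(fd, b'x' * 1024)
os.close(fd)

raw = io.open(path, 'rb', 0)
buffered = io.open(path, 'rb')
print(type(raw).__name__)       # FileIO
print(type(buffered).__name__)  # BufferedReader

raw.close()
buffered.close()
os.remove(path)
```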

--
Skip Montanaro - sk...@pobox.com - http://smontanaro.dyndns.org/

MRAB

Dec 4, 2008, 12:51:46 PM
to pytho...@python.org
Can't you read it without StringIO?

huge = open('C:/HUGE_FILE.pcl', 'rb', 0)
vContent = huge.read()
vSplitContent = vContent.split(b'BIN;SP1;PW0.3,1;PA100,700;PD625,700;PU;')

vContent will contain a bytestring (bytes), so I think you need to split
on a bytestring b'...' (in Python 3 unmarked string literals are Unicode).
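The bytes-vs-str point is easy to check at the prompt; a tiny sketch:

```python
# In Python 3, reading a file opened in 'rb' mode yields bytes, and
# bytes.split() insists on a bytes separator.
data = b'one;two;three'
parts = data.split(b';')
print(parts)  # [b'one', b'two', b'three']

try:
    data.split(';')  # a str separator against bytes
except TypeError:
    print('str separator raises TypeError')
```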

Istvan Albert

Dec 4, 2008, 1:20:00 PM
to
I can confirm this,

I am getting very slow read performance when reading a smaller 20 MB
file.

- Python 2.5 takes 0.4 seconds
- Python 3.0 takes 62 seconds

fname = "dmel-2R-chromosome-r5.1.fasta"
data = open(fname, 'rt').read()
print ( len(data) )

Terry Reedy

Dec 4, 2008, 1:31:01 PM
to pytho...@python.org
Jerry Hill wrote:
> On Thu, Dec 4, 2008 at 11:48 AM, Christian Heimes <li...@cheimes.de> wrote:
>> Cro wrote:
>>> vContent = io.StringIO()
>>> vContent = huge.read() # This line takes hours to process !!!
>> Do you really mean io.StringIO? I guess you want io.BytesIO() ..
>
> I don't think it matters. Here's a quick comparison between 2.5 and
> 3.0 on a relatively small 17 meg file:
>
> C:\>c:\Python30\python -m timeit -n 1
> "open('C:\\work\\temp\\bppd_vsub.csv', 'rb').read()"
> 1 loops, best of 3: 36.8 sec per loop
>
> C:\>c:\Python25\python -m timeit -n 1
> "open('C:\\work\\temp\\bppd_vsub.csv', 'rb').read()"
> 1 loops, best of 3: 33 msec per loop
>
> That's 3 orders of magnitude slower on python3.0!

Timing of os interaction may depend on os. I verified above on WinXp
with 4 meg Pythonxy.chm file. Eye blink versus 3 secs, duplicated. I
think something is wrong that needs fixing in 3.0.1.

http://bugs.python.org/issue4533

tjr

Istvan Albert

Dec 4, 2008, 1:40:41 PM
to
On Dec 4, 1:31 pm, Terry Reedy <tjre...@udel.edu> wrote:

> Jerry Hill wrote:

> > That's 3 orders of magnitude slower on python3.0!
>
> Timing of os interaction may depend on os.  I verified above on WinXp
> with 4 meg Pythonxy.chm file.  Eye blink versus 3 secs, duplicated.  I
> think something is wrong that needs fixing in 3.0.1.
>
> http://bugs.python.org/issue4533

I believe that the slowdowns are even more substantial when opening
the file in text mode.

Дамјан Георгиевски

Dec 4, 2008, 2:01:11 PM
to

> I don't think it matters. Here's a quick comparison between 2.5 and
> 3.0 on a relatively small 17 meg file:
>
> C:\>c:\Python30\python -m timeit -n 1
> "open('C:\\work\\temp\\bppd_vsub.csv', 'rb').read()"
> 1 loops, best of 3: 36.8 sec per loop
>
> C:\>c:\Python25\python -m timeit -n 1
> "open('C:\\work\\temp\\bppd_vsub.csv', 'rb').read()"
> 1 loops, best of 3: 33 msec per loop
>
> That's 3 orders of magnitude slower on python3.0!

Isn't this because you have the file cached in memory on the second run?

--
дамјан ( http://softver.org.mk/damjan/ )

"The moment you commit and quit holding back, all sorts of unforseen
incidents, meetings and material assistance will rise up to help you.
The simple act of commitment is a powerful magnet for help." -- Napoleon

George Sakkis

Dec 4, 2008, 2:19:53 PM
to
On Dec 4, 2:01 pm, Дамјан Георгиевски <gdam...@gmail.com> wrote:
> > I don't think it matters.  Here's a quick comparison between 2.5 and
> > 3.0 on a relatively small 17 meg file:
>
> > C:\>c:\Python30\python -m timeit -n 1
> > "open('C:\\work\\temp\\bppd_vsub.csv', 'rb').read()"
> > 1 loops, best of 3: 36.8 sec per loop
>
> > C:\>c:\Python25\python -m timeit -n 1
> > "open('C:\\work\\temp\\bppd_vsub.csv', 'rb').read()"
> > 1 loops, best of 3: 33 msec per loop
>
> > That's 3 orders of magnitude slower on python3.0!
>
> Isn't this because you have the file cached in memory on the second run?

That's probably it; I see a much more modest slowdown (2-3x) if I repeat
each run many times.

George

Terry Reedy

Dec 4, 2008, 2:25:48 PM
to pytho...@python.org
Дамјан Георгиевски wrote:
>> I don't think it matters. Here's a quick comparison between 2.5 and
>> 3.0 on a relatively small 17 meg file:
>>
>> C:\>c:\Python30\python -m timeit -n 1
>> "open('C:\\work\\temp\\bppd_vsub.csv', 'rb').read()"
>> 1 loops, best of 3: 36.8 sec per loop
>>
>> C:\>c:\Python25\python -m timeit -n 1
>> "open('C:\\work\\temp\\bppd_vsub.csv', 'rb').read()"
>> 1 loops, best of 3: 33 msec per loop
>>
>> That's 3 orders of magnitude slower on python3.0!
>
> Isn't this because you have the file cached in memory on the second run?

In my test, I read Python25.chm with 2.5 and Python30.chm with 3.0.

Rereading Python30.chm without closing *is* much faster.
>>> f=open('Doc/Python30.chm','rb')
>>> d=f.read()
>>> d=f.read()
>>> d=f.read()
Closing, reopening, and rereading is slower.

MRAB

Dec 4, 2008, 2:40:49 PM
to pytho...@python.org
Terry Reedy wrote:

> Дамјан Георгиевски wrote:
>>> I don't think it matters. Here's a quick comparison between 2.5 and
>>> 3.0 on a relatively small 17 meg file:
>>>
>>> C:\>c:\Python30\python -m timeit -n 1
>>> "open('C:\\work\\temp\\bppd_vsub.csv', 'rb').read()"
>>> 1 loops, best of 3: 36.8 sec per loop
>>>
>>> C:\>c:\Python25\python -m timeit -n 1
>>> "open('C:\\work\\temp\\bppd_vsub.csv', 'rb').read()"
>>> 1 loops, best of 3: 33 msec per loop
>>>
>>> That's 3 orders of magnitude slower on python3.0!
>>
>> Isn't this because you have the file cached in memory on the second run?
>
> In my test, I read Python25.chm with 2.5 and Python30.chm with 3.0.
>
> Rereading Python30.chm without closing *is* much faster.
> >>> f=open('Doc/Python30.chm','rb')
> >>> d=f.read()
> >>> d=f.read()
> >>> d=f.read()
> Closing, reopening, and rereading is slower.
>
It certainly is faster if you're already at the end of the file. :-)

Jean-Paul Calderone

Dec 4, 2008, 2:42:32 PM
to pytho...@python.org
On Thu, 04 Dec 2008 14:25:48 -0500, Terry Reedy <tjr...@udel.edu> wrote:
> [snip]

>
>In my test, I read Python25.chm with 2.5 and Python30.chm with 3.0.
>
>Rereading Python30.chm without closing *is* much faster.
> >>> f=open('Doc/Python30.chm','rb')
> >>> d=f.read()
> >>> d=f.read()
> >>> d=f.read()

Did you think about what this does?

Jean-Paul

Albert Hopkins

Dec 4, 2008, 2:46:10 PM
to pytho...@python.org
On Thu, 2008-12-04 at 20:01 +0100, Дамјан Георгиевски wrote:
> > I don't think it matters. Here's a quick comparison between 2.5 and
> > 3.0 on a relatively small 17 meg file:
> >
> > C:\>c:\Python30\python -m timeit -n 1
> > "open('C:\\work\\temp\\bppd_vsub.csv', 'rb').read()"
> > 1 loops, best of 3: 36.8 sec per loop
> >
> > C:\>c:\Python25\python -m timeit -n 1
> > "open('C:\\work\\temp\\bppd_vsub.csv', 'rb').read()"
> > 1 loops, best of 3: 33 msec per loop
> >
> > That's 3 orders of magnitude slower on python3.0!
>
> Isn't this because you have the file cached in memory on the second run?

Even on different files of identical size it's ~3x slower:

$ dd if=/dev/urandom of=file1 bs=1M count=70
70+0 records in
70+0 records out
73400320 bytes (73 MB) copied, 14.8693 s, 4.9 MB/s
$ dd if=/dev/urandom of=file2 bs=1M count=70
70+0 records in
70+0 records out
73400320 bytes (73 MB) copied, 16.1581 s, 4.5 MB/s
$ python2.5 -m timeit -n 1 'open("file1", "rb").read()'
1 loops, best of 3: 5.26 sec per loop
$ python3.0 -m timeit -n 1 'open("file2", "rb").read()'
1 loops, best of 3: 14.8 sec per loop


Terry Reedy

Dec 4, 2008, 2:55:08 PM
to pytho...@python.org

Whoops ;-)
f.seek(0) first and it is maybe a bit faster, but not 'much'.
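The EOF effect is easy to see with a throwaway file; a small sketch:

```python
import os
import tempfile

# After a full read() the file position sits at EOF, so a second read()
# returns b'' immediately -- which is why rereading without seek(0)
# looks "much faster".
fd, path = tempfile.mkstemp()
os.write(fd, b'payload')
os.close(fd)

f = open(path, 'rb')
first = f.read()    # full contents
second = f.read()   # b'' -- already at EOF
f.seek(0)
third = f.read()    # full contents again
f.close()
os.remove(path)

print(first, second, third)
```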

Istvan Albert

Dec 4, 2008, 3:40:34 PM
to
Turns out write performance is also slow!

The program below takes

3 seconds on python 2.5
17 seconds on python 3.0

yes, 17 seconds! Tested many times, in various orders. I believe the
slowdowns are not constant (3x) but some sort of nonlinear function
(quadratic?); play with N to see it.

===================================

import time

start = time.time()

N = 10**6
fp = open('testfile.txt', 'wt')
for n in range(N):
    fp.write( '%s\n' % n )
fp.close()

end = time.time()

print (end-start)
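For comparison, the per-call overhead can be taken out of the loop by building the text first and issuing a single write(). A sketch along the lines of the test above (the temp-directory path is illustrative):

```python
import os
import tempfile
import time

N = 10**6
path = os.path.join(tempfile.gettempdir(), 'testfile.txt')

start = time.time()
with open(path, 'wt') as fp:
    # One big write instead of N small ones.
    fp.write('\n'.join(str(n) for n in range(N)) + '\n')
print(time.time() - start)
```

Comparing this against the loop version separates the cost of write() calls themselves from the cost of formatting the data.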

Christian Heimes

Dec 4, 2008, 5:39:04 PM
to pytho...@python.org
Terry Reedy wrote:
> Timing of os interaction may depend on os. I verified above on WinXp
> with 4 meg Pythonxy.chm file. Eye blink versus 3 secs, duplicated. I
> think something is wrong that needs fixing in 3.0.1.
>
> http://bugs.python.org/issue4533

I've attached a patch to the bug. Reading was so slow because
fileio_readall() was increasing the buffer by 8 kB in each iteration. The
new code doubles the buffer until it reaches 512 kB; from there it is
increased in 512 kB blocks. Python 2.x has the same growth rate.
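The difference between the two growth policies can be sketched by counting buffer resizes for a roughly 900 MB read, the file size from the original post. This is a toy model of the description above, ignoring allocator details:

```python
def count_resizes(total, grow):
    """Count how many reallocations it takes to grow a buffer to `total` bytes."""
    size, resizes = 8 * 1024, 0
    while size < total:
        size = grow(size)
        resizes += 1
    return resizes

total = 900 * 1024 * 1024  # roughly the file size from the original post

# Old policy: grow by a fixed 8 kB every iteration.
fixed = count_resizes(total, lambda s: s + 8 * 1024)
# Patched policy: double up to 512 kB, then grow in 512 kB blocks.
doubled = count_resizes(total, lambda s: s * 2 if s < 512 * 1024 else s + 512 * 1024)

print(fixed, doubled)  # the fixed-increment policy needs vastly more resizes
```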

Christian

Terry Reedy

Dec 4, 2008, 9:00:29 PM
to pytho...@python.org

Thank you. Too bad this was not caught a week ago, but the timing of
each method of each class is a bit hard to test for. Perhaps in the future.

tjr
