[code]
huge = open ( 'C:/HUGE_FILE.pcl', 'rb', 0 )
import io
vContent = io.StringIO()
vContent = huge.read() # This line takes hours to process !!!
vSplitContent = vContent.split( 'BIN;SP1;PW0.3,1;PA100,700;PD625,700;PU;' ) # This one I have never tried...
[/code]
The same thing, in Python 2.5:
[code]
huge = open ( 'C:/HUGE_FILE.pcl', 'rb', 0 )
import StringIO
vContent = StringIO.StringIO()
vContent = huge.read() # This line takes 2 seconds !!!
vSplitContent = vContent.split( 'BIN;SP1;PW0.3,1;PA100,700;PD625,700;PU;' ) # This takes a few seconds...
[/code]
My "HUGE_FILE" is about 900 MB...
I know this is not the best way to open the file and split its
content on that delimiter...
Can anyone please suggest a good, fast method to split the file on
that delimiter in Python 3?
Memory is not a concern for me: I have 4 GB of RAM and rarely use
more than 300 MB of it.
Thank you very very much.
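(For what it's worth, one way to avoid holding all 900 MB in memory at once is to read in chunks and split incrementally. This is only a sketch, not code from the thread; the chunk size is arbitrary, and the commented usage below assumes the poster's file path and delimiter:)

```python
def split_stream(f, delim, chunk_size=1 << 20):
    """Yield the pieces of f's contents separated by delim,
    reading chunk_size bytes at a time."""
    buf = b''
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        parts = buf.split(delim)
        # The tail may end mid-delimiter; keep it for the next round.
        buf = parts.pop()
        for part in parts:
            yield part
    yield buf

# Small self-check on in-memory data:
import io
pieces = list(split_stream(io.BytesIO(b'aXXbXXc'), b'XX', chunk_size=3))
# pieces == [b'a', b'b', b'c']

# Hypothetical usage on the real file:
# with open('C:/HUGE_FILE.pcl', 'rb') as huge:
#     for piece in split_stream(huge, b'BIN;SP1;PW0.3,1;PA100,700;PD625,700;PU;'):
#         ...
```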
Do you really mean io.StringIO? I guess you want io.BytesIO() ..
Christian
Mmm... I don't know.
I also tried:
[code]
IDLE 3.0
>>> import io
>>> vContent = io.BytesIO()
>>> huge = io.open("C:\HUGE_FILE.pcl",'r+b',0)
>>> vContent = huge.read()
[/code]
It still waits a long time... I don't have the patience to wait for
the file to load completely!
Thank you for your reply.
I don't think it matters. Here's a quick comparison between 2.5 and
3.0 on a relatively small 17 meg file:
C:\>c:\Python30\python -m timeit -n 1
"open('C:\\work\\temp\\bppd_vsub.csv', 'rb').read()"
1 loops, best of 3: 36.8 sec per loop
C:\>c:\Python25\python -m timeit -n 1
"open('C:\\work\\temp\\bppd_vsub.csv', 'rb').read()"
1 loops, best of 3: 33 msec per loop
That's 3 orders of magnitude slower on Python 3.0!
--
Jerry
You do realize that the vContent = io.BytesIO() line is completely
pointless, right? You later rebind vContent to the result of
huge.read() without ever having used it in between.
Cheers,
Chris
--
Follow the path of the Iguana...
http://rebertia.com
>>>> huge = io.open("C:\HUGE_FILE.pcl",'r+b',0)
>>>> vContent = huge.read()
> [/code]
>
> It still waits a lot... i don't have the patience to wait for the file
> to load completely... it takes a lot!
>
> Thank you for your reply.
Why do you want to disable buffering? From the io.open help:
open(file, mode='r', buffering=None, encoding=None, errors=None,
newline=None, closefd=True)
Open file and return a stream. Raise IOError upon failure.
...
buffering is an optional integer used to set the buffering policy. By
default full buffering is on. Pass 0 to switch buffering off (only
allowed in binary mode), 1 to set line buffering, and an integer > 1
for full buffering.
I think you will get better performance if you open the file without the
third arg:
huge = io.open("C:\HUGE_FILE.pcl",'r+b')
--
Skip Montanaro - sk...@pobox.com - http://smontanaro.dyndns.org/
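(Skip's suggestion is easy to check directly. A quick sketch, not from the thread: test.bin is a placeholder file the script creates itself, and the timings will of course vary with the OS and disk cache:)

```python
import time

# Create a small placeholder file to read back (stand-in for the
# real 900 MB file):
with open('test.bin', 'wb') as f:
    f.write(b'x' * (1 << 20))   # 1 MB of data

def time_read(**kwargs):
    """Read test.bin fully; return (elapsed seconds, bytes read)."""
    start = time.time()
    with open('test.bin', 'rb', **kwargs) as f:
        data = f.read()
    return time.time() - start, len(data)

unbuffered = time_read(buffering=0)   # buffering switched off
buffered = time_read()                # default (full) buffering
```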
huge = open('C:/HUGE_FILE.pcl', 'rb', 0)
vContent = huge.read()
vSplitContent = vContent.split(b'BIN;SP1;PW0.3,1;PA100,700;PD625,700;PU;')
vContent will contain a bytestring (bytes), so I think you need to split
on a bytestring b'...' (in Python 3 unmarked string literals are Unicode).
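(A small illustration of that bytes/str distinction, with made-up data:)

```python
data = b'HEADER;BIN;SP1;PAYLOAD'

# Splitting bytes on a bytes literal works:
parts = data.split(b'BIN;SP1;')   # [b'HEADER;', b'PAYLOAD']

# Splitting bytes on a str literal raises TypeError, because bytes
# and str never mix implicitly in Python 3:
try:
    data.split('BIN;SP1;')
    mixed_ok = True
except TypeError:
    mixed_ok = False
```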
I am getting very slow read performance when reading a smaller 20 MB
file.
- Python 2.5 takes 0.4 seconds
- Python 3.0 takes 62 seconds
fname = "dmel-2R-chromosome-r5.1.fasta"
data = open(fname, 'rt').read()
print(len(data))
Timing of os interaction may depend on os. I verified above on WinXp
with 4 meg Pythonxy.chm file. Eye blink versus 3 secs, duplicated. I
think something is wrong that needs fixing in 3.0.1.
http://bugs.python.org/issue4533
tjr
> Jerry Hill wrote:
> > That's 3 orders of magnitude slower on python3.0!
>
> Timing of os interaction may depend on os. I verified above on WinXp
> with 4 meg Pythonxy.chm file. Eye blink versus 3 secs, duplicated. I
> think something is wrong that needs fixing in 3.0.1.
>
> http://bugs.python.org/issue4533
I believe that the slowdowns are even more substantial when opening
the file in text mode.
Isn't this because you have the file cached in memory on the second run?
--
дамјан ( http://softver.org.mk/damjan/ )
"The moment you commit and quit holding back, all sorts of unforseen
incidents, meetings and material assistance will rise up to help you.
The simple act of commitment is a powerful magnet for help." -- Napoleon
That's probably it; I see much more modest slowdown (2-3X) if I repeat
many times each run.
George
In my test, I read Python25.chm with 2.5 and Python30.chm with 3.0.
Rereading Python30.chm without closing *is* much faster.
>>> f=open('Doc/Python30.chm','rb')
>>> d=f.read()
>>> d=f.read()
>>> d=f.read()
Closing, reopening, and rereading is slower.
Did you think about what this does? After the first read() the file
position is at EOF, so the later read() calls just return an empty
bytes object instantly.
Jean-Paul
Even on different files of identical size it's ~3x slower:
$ dd if=/dev/urandom of=file1 bs=1M count=70
70+0 records in
70+0 records out
73400320 bytes (73 MB) copied, 14.8693 s, 4.9 MB/s
$ dd if=/dev/urandom of=file2 bs=1M count=70
70+0 records in
70+0 records out
73400320 bytes (73 MB) copied, 16.1581 s, 4.5 MB/s
$ python2.5 -m timeit -n 1 'open("file1", "rb").read()'
1 loops, best of 3: 5.26 sec per loop
$ python3.0 -m timeit -n 1 'open("file2", "rb").read()'
1 loops, best of 3: 14.8 sec per loop
Whoops ;-)
With f.seek(0) first it is maybe a bit faster, but not 'much'.
The program below takes
3 seconds on python 2.5
17 seconds on python 3.0
yes, 17 seconds! Tested many times, in various orders. I believe the
slowdown is not a constant factor (3x) but some sort of nonlinear
function (quadratic?); play with N to see it.
===================================
import time
start = time.time()
N = 10**6
fp = open('testfile.txt', 'wt')
for n in range(N):
    fp.write('%s\n' % n)
fp.close()
end = time.time()
print(end - start)
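(The read side can be timed the same way; this companion sketch, not from the thread, recreates testfile.txt first so it runs standalone:)

```python
import time

# Recreate the test file as in the program above:
with open('testfile.txt', 'wt') as fp:
    for n in range(10**6):
        fp.write('%s\n' % n)

# Time reading it back in one go:
start = time.time()
with open('testfile.txt', 'rt') as fp:
    data = fp.read()
print(time.time() - start)
```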
I've attached a patch to the bug. Reading was so slow because
fileio_readall() increased the buffer by only 8 kB on each iteration.
The new code doubles the buffer until it reaches 512 kB; from then on
it grows in 512 kB blocks. Python 2.x has the same growth rate.
Christian
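(The difference between the two growth policies is easy to see by counting reallocations. This is only an illustration of the strategies described above, in Python rather than the actual C code:)

```python
KB = 1024

def count_resizes(total, grow):
    """Count buffer reallocations needed to reach total bytes
    under a given growth policy."""
    size = resizes = 0
    while size < total:
        size = grow(size)
        resizes += 1
    return resizes

def old_policy(size):
    # 3.0 behaviour: grow by a fixed 8 kB every iteration.
    return size + 8 * KB

def new_policy(size):
    # Patched behaviour: double (starting at 8 kB), capped at
    # 512 kB of growth per step.
    return min(max(size * 2, 8 * KB), size + 512 * KB)

total = 900 * 1024 * KB   # a 900 MB file, as in the original post
# count_resizes(total, old_policy) -> 115200 reallocations
# count_resizes(total, new_policy) -> 1806 reallocations
```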
Thank you. Too bad this was not caught a week ago, but the timing of
each method of each class is a bit hard to test for. Perhaps in the future.
tjr