
python download


monkeys paw

Feb 23, 2010, 2:42:05 PM
I used the following code to download a PDF file, but the
file was invalid after running the code. Is there a problem
with the write operation?

import urllib2
url = 'http://www.whirlpoolwaterheaters.com/downloads/6510413.pdf'
a = open('adobe.pdf', 'w')
for line in urllib2.urlopen(url):
    a.write(line)

John Bokma

Feb 23, 2010, 3:09:50 PM
monkeys paw <mon...@joemoney.net> writes:

PDF is /not/ text. You're processing it as if it were a text file (and
storing it as text, which on Windows is most likely a no-no).

import urllib2

url = 'http://www.whirlpoolwaterheaters.com/downloads/6510413.pdf'
response = urllib2.urlopen(url)
fh = open('adobe.pdf', 'wb')
fh.write(response.read())
fh.close()
response.close()
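
Why text mode matters here can be sketched without a network connection. The snippet below is written for Python 3 (the thread's code is Python 2), and `simulate_text_mode_write` is a hypothetical helper, not a library function: it only imitates the newline translation that Windows text mode applies on write. The payload bytes are made up.

```python
# Python 3 sketch. simulate_text_mode_write is a hypothetical helper that
# imitates Windows text-mode output, where every "\n" byte becomes "\r\n".
def simulate_text_mode_write(data):
    return data.replace(b"\n", b"\r\n")

# Made-up bytes standing in for a slice of a PDF. The byte 0x0A appears both
# as a line ending and inside the binary payload.
payload = b"%PDF-1.4\n\x00\x0a\xff\x0a binary stream \n"

translated = simulate_text_mode_write(payload)

# Every 0x0A byte gained an extra 0x0D in front of it, so the output is
# longer than the input and no longer byte-identical to it.
print(len(payload), len(translated))
print(translated == payload)  # False
```

Opening the output file with 'wb' skips this translation, so the bytes on disk match the bytes received.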

--
John Bokma j3b

Hacking & Hiking in Mexico - http://johnbokma.com/
http://castleamber.com/ - Perl & Python Development

Tim Chase

Feb 23, 2010, 3:17:11 PM
to monkeys paw, pytho...@python.org
monkeys paw wrote:
> I used the following code to download a PDF file, but the
> file was invalid after running the code, is there problem
> with the write operation?
>
> import urllib2
> url = 'http://www.whirlpoolwaterheaters.com/downloads/6510413.pdf'
> a = open('adobe.pdf', 'w')

Sure you don't need this to be 'wb' instead of 'w'?

> for line in urllib2.urlopen(url):
>     a.write(line)

I also don't know if this "for line...a.write(line)" loop is
doing newline translation. If it's a binary file, you should use
.read() (perhaps with a modest-sized block-size, writing it in a
loop if the file can end up being large.)

-tkc


Jerry Hill

Feb 23, 2010, 3:17:19 PM
to monkeys paw, pytho...@python.org

Two guesses:

First, you need to call a.close() when you're done writing to the file.

This will happen automatically when you have no more references to the
file, but I'm guessing that you're running this code in IDLE or some
other IDE, and a is still a valid reference to the file after you run
that snippet.

Second, you're treating the pdf file as text (you're assuming it has
lines, you're not writing the file in binary mode, etc.). I don't
know if that's correct for a pdf file. I would do something like this
instead:

Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit
(Intel)] on win32
IDLE 2.6.4

>>> import urllib2
>>> url = 'http://www.whirlpoolwaterheaters.com/downloads/6510413.pdf'
>>> a = open('C:/test.pdf', 'wb')
>>> data = urllib2.urlopen(url).read()
>>> a.write(data)
>>> a.close()

That seems to work for me, in that it downloads a 16 page pdf
document, and that document opens without error or any other obvious
problems.
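
One way to avoid depending on when the last reference to the file goes away is a `with` block, which closes the file when the block ends. This is a minimal sketch in Python 3 syntax (the `with` statement is also available in Python 2.6); the data and the temporary output path are made up so that it runs without a network connection.

```python
import os
import tempfile

# Made-up bytes standing in for the downloaded PDF data.
data = b"%PDF-1.4\n example bytes \n"

# Hypothetical output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "test.pdf")

with open(path, "wb") as out:
    out.write(data)

# The file is closed once the with block ends, even if write() had raised.
print(out.closed)

# Read the file back to check that every byte was flushed to disk.
with open(path, "rb") as check:
    print(check.read() == data)
```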

--
Jerry

David Robinow

Feb 23, 2010, 3:20:54 PM
to monkeys paw, pytho...@python.org
On Tue, Feb 23, 2010 at 2:42 PM, monkeys paw <mon...@joemoney.net> wrote:

If you're running Windows, try
a = open('adobe.pdf', 'wb')

[Works for me]

sste...@gmail.com

Feb 23, 2010, 3:22:39 PM
to monkeys paw, pytho...@python.org

On Feb 23, 2010, at 2:42 PM, monkeys paw wrote:

> I used the following code to download a PDF file, but the
> file was invalid after running the code, is there problem
> with the write operation?
>
> import urllib2
> url = 'http://www.whirlpoolwaterheaters.com/downloads/6510413.pdf'
> a = open('adobe.pdf', 'w')

Try 'wb', just in case.

S

> for line in urllib2.urlopen(url):
>     a.write(line)


monkeys paw

Feb 23, 2010, 6:08:19 PM
On 2/23/2010 3:17 PM, Tim Chase wrote:
> monkeys paw wrote:
>> I used the following code to download a PDF file, but the
>> file was invalid after running the code, is there problem
>> with the write operation?
>>
>> import urllib2
>> url = 'http://www.whirlpoolwaterheaters.com/downloads/6510413.pdf'
>> a = open('adobe.pdf', 'w')
>
> Sure you don't need this to be 'wb' instead of 'w'?

'wb' does the trick. Thanks all!

Here is the final working code. I used an index (i)
to see how many reads took place; I have to assume there is
a default buffer size:

import urllib2
a = open('adobe.pdf', 'wb')
i = 0
for line in urllib2.urlopen('http://www.whirlpoolwaterheaters.com/downloads/6510413.pdf'):
    i = i + 1
    a.write(line)

print "Number of reads: %d" % i
a.close()


NEW QUESTION if y'all are still reading:

Is there an integer increment operation in Python? I tried
using i++ but had to revert to 'i = i + 1'

Wes James

Feb 23, 2010, 6:19:41 PM
to pytho...@python.org
<snip>

>
>
> NEW QUESTION if y'all are still reading:
>
> Is there an integer increment operation in Python? I tried
> using i++ but had to revert to 'i = i + 1'

i+=1

<snip>

Ethan Furman

Feb 23, 2010, 6:34:02 PM
to pytho...@python.org
monkeys paw wrote:
> NEW QUESTION if y'all are still reading:
>
> Is there an integer increment operation in Python? I tried
> using i++ but had to revert to 'i = i + 1'

Nope, but try i += 1.
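
A short illustration of the point, in Python 3 syntax (the variable names are arbitrary). Python has no `++` operator: `i++` is a syntax error, and `++i` parses but only applies unary plus twice.

```python
i = 0
i += 1      # augmented assignment, same effect as i = i + 1
i += 1
print(i)    # 2

# ++j is legal syntax, but it is just two unary plus operators,
# so j is left unchanged.
j = 5
k = ++j
print(j, k)  # 5 5
```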

~Ethan~


Diez B. Roggisch

Feb 24, 2010, 4:39:44 PM
On 24.02.10 00:08, monkeys paw wrote:


Instead, use enumerate:

for i, line in enumerate(...):
    ...
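
Filled in, the enumerate approach might look like the sketch below. It is written for Python 3 and uses an in-memory `io.BytesIO` with made-up bytes in place of the `urlopen()` response, so the names `response` and `count` are assumptions for illustration only.

```python
import io

# Made-up bytes standing in for the urlopen() response in the thread.
response = io.BytesIO(b"first\nsecond\nthird\n")

count = 0  # stays 0 if the response yields no lines at all
for count, line in enumerate(response, 1):  # second argument: start at 1
    pass  # a.write(line) would go here

print("Number of reads: %d" % count)  # Number of reads: 3
```

Starting the count at 1 means the loop variable already holds the total when the loop finishes, so no separate counter update is needed.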


Diez

Aahz

Feb 27, 2010, 7:47:30 PM
In article <2fWdnXOfjat-whnW...@insightbb.com>,

monkeys paw <mon...@joemoney.net> wrote:
>On 2/23/2010 3:17 PM, Tim Chase wrote:
>>
>> Sure you don't need this to be 'wb' instead of 'w'?
>
>'wb' does the trick. Thanks all!
>
>import urllib2
>a = open('adobe.pdf', 'wb')
>i = 0
>for line in urllib2.urlopen('http://www.whirlpoolwaterheaters.com/downloads/6510413.pdf'):
>    i = i + 1
>    a.write(line)

Using a for loop here is still a BAD IDEA -- line could easily end up
megabytes in size (though that is statistically unlikely).
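
The concern can be seen with in-memory data. In this Python 3 sketch the payloads are made up; the point is only that line iteration splits on the byte 0x0A, so the size of each "line" depends entirely on where that byte happens to occur in the binary data.

```python
import io

# Made-up payloads; a real PDF's newline layout is not known in advance.
with_newlines = io.BytesIO(b"ab\ncdef\n" + b"\x00" * 10)
no_newlines = io.BytesIO(b"\x00" * (1024 * 1024))  # 1 MiB, no 0x0A byte

# Chunks end wherever a newline byte falls, so their sizes vary.
sizes = [len(line) for line in with_newlines]
print(sizes)  # [3, 5, 10]

# With no newline byte at all, the whole payload arrives as one "line".
big = [len(line) for line in no_newlines]
print(big)    # [1048576]
```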
--
Aahz (aa...@pythoncraft.com) <*> http://www.pythoncraft.com/

"Many customs in this life persist because they ease friction and promote
productivity as a result of universal agreement, and whether they are
precisely the optimal choices is much less important." --Henry Spencer

Tim Chase

Feb 27, 2010, 9:19:03 PM
to Aahz, pytho...@python.org
Aahz wrote:
> monkeys paw <mon...@joemoney.net> wrote:
>> On 2/23/2010 3:17 PM, Tim Chase wrote:
>>> Sure you don't need this to be 'wb' instead of 'w'?
>> 'wb' does the trick. Thanks all!
>>
>> import urllib2
>> a = open('adobe.pdf', 'wb')
>> i = 0
>> for line in urllib2.urlopen('http://www.whirlpoolwaterheaters.com/downloads/6510413.pdf'):
>>     i = i + 1
>>     a.write(line)
>
> Using a for loop here is still a BAD IDEA -- line could easily end up
> megabytes in size (though that is statistically unlikely).

Just so the OP has it, dealing with binary files without reading
the entire content into memory would look something like

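from urllib2 import urlopen
CHUNK_SIZE = 1024*4 # 4k, why not?
# URL taken from the original post
URL = 'http://www.whirlpoolwaterheaters.com/downloads/6510413.pdf'
OUT_NAME = 'out.pdf'
a = open(OUT_NAME, 'wb')
u = urlopen(URL)
bytes_read = 0
while True:
    # read at most CHUNK_SIZE bytes; an empty string means end of data
    data = u.read(CHUNK_SIZE)
    if not data: break
    a.write(data)
    bytes_read += len(data)
a.close()
u.close()
print "Wrote %i bytes to %s" % (
    bytes_read, OUT_NAME)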
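
The same read-a-chunk, write-a-chunk loop is available in the standard library as `shutil.copyfileobj`. The sketch below is Python 3 and uses in-memory `io.BytesIO` objects with made-up bytes as stand-ins for the URL response and the output file, so `source` and `dest` are assumptions for illustration; with real objects they would be the opened URL and a file opened with 'wb'.

```python
import io
import shutil

CHUNK_SIZE = 1024 * 4  # same 4k block size as the loop above

# Stand-ins for urlopen(URL) and open(OUT_NAME, 'wb'); the bytes are made up.
source = io.BytesIO(b"%PDF" + b"\x00\x0a\xff" * 5000)
dest = io.BytesIO()

# Copies from source to dest in CHUNK_SIZE pieces until source is exhausted,
# so the whole payload is never held in memory at once.
shutil.copyfileobj(source, dest, CHUNK_SIZE)

print("Copied %d bytes" % len(dest.getvalue()))  # Copied 15004 bytes
```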

-tkc
