Download the "head" of a large file?

erikcw

Jul 27, 2009, 4:38:25 PM
I'm trying to figure out how to download just the first few lines of a
large (50 MB) text file from a server to save bandwidth. Can Python do
this?

Something like the Python equivalent of curl http://url.com/file.xml |
head -c 2048

Thanks!
Erik

Ben Charrow

Jul 27, 2009, 6:33:51 PM
to erikcw, pytho...@python.org
erikcw wrote:
> ...download just the first few lines of a large (50 MB) text file from a
> server to save bandwidth..... Something like the Python equivalent of curl
> http://url.com/file.xml | head -c 2048

If you're OK calling curl and head from within python:

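from subprocess import Popen, PIPE
url = "http://docs.python.org/"
# -s silences curl's progress meter, so no unread stderr pipe is needed
p1 = Popen(["curl", "-s", url], stdout=PIPE)
p2 = Popen(["head", "-c", "1024"], stdin=p1.stdout, stdout=PIPE)
p1.stdout.close()  # lets curl receive SIGPIPE once head exits
data = p2.communicate()[0]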

If you want a pure python approach:

import urllib2
url = "http://docs.python.org/"
req = urllib2.Request(url)
f = urllib2.urlopen(req)
f.read(1024)
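
The urllib2 module used above exists only in Python 2; in Python 3 the same calls live in urllib.request. A minimal sketch of the equivalent follows, where the helper name read_head is made up for illustration:

```python
from urllib.request import Request, urlopen


def read_head(url, nbytes=1024):
    """Return at most the first nbytes bytes of the response body.

    read_head is a hypothetical helper name, not part of any library.
    """
    req = Request(url)
    # Leaving the with-block closes the response, so this code never
    # reads the rest of the body.
    with urlopen(req) as response:
        return response.read(nbytes)
```

Calling read_head("http://docs.python.org/") would return up to 1024 bytes. The server may already have pushed more than that into network buffers before the connection is closed, so this bounds the transfer only approximately.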

HTH,
Ben

John Yeung

Jul 27, 2009, 6:40:25 PM

urllib.urlopen gives you a file-like object, which you can then read
line by line or in fixed-size chunks. For example:

import urllib
chunk = urllib.urlopen('http://url.com/file.xml').read(2048)

At that point, chunk is just bytes, which you can write to a local
file, print, or whatever it is you want.
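
Since the original question asked for the first few lines rather than a byte count, the line-by-line route can be sketched as below. This is written for Python 3 (urllib.request rather than urllib), and first_lines is a hypothetical helper name; it accepts any binary file-like object, which includes the response returned by urlopen.

```python
from itertools import islice


def first_lines(stream, count):
    """Return up to count lines from a binary file-like object.

    first_lines is a hypothetical helper; iteration stops after count
    lines, so this code never asks the stream for anything further.
    """
    return list(islice(stream, count))


# Usage against a real server (URL taken from the question, not fetched here):
#
#     from urllib.request import urlopen
#     with urlopen("http://url.com/file.xml") as response:
#         for line in first_lines(response, 10):
#             print(line.decode("utf-8", errors="replace"), end="")
```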

John

Gabriel Genellina

Jul 27, 2009, 10:16:42 PM
to pytho...@python.org
On Mon, 27 Jul 2009 19:40:25 -0300, John Yeung
<gallium....@gmail.com> wrote:

As the OP wants to save bandwidth, it's better to request exactly the amount
of data needed. That is, add a Range header field [1] to the request, and
inspect the response for a corresponding Content-Range header [2].

py> import urllib2
py> url = "http://www.python.org/"
py> req = urllib2.Request(url)
py> req.add_header('Range', 'bytes=0-10239') # first 10K
py> f = urllib2.urlopen(req)
py> data = f.read()
py> print repr(data[-30:]), len(data)
'\t <a href="http://www.zope.' 10240
py> f.headers['Content-Range']
'bytes 0-10239/18196'
py> f.getcode()
206 # 206=Partial Content
py> f.close()

[1] http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35

[2] http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.16
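
The session above is Python 2. A hedged Python 3 sketch of the same idea follows; the helper names range_request, parse_content_range, and fetch_first_bytes are invented for illustration. A server that ignores Range answers 200 with the full body, so the sketch caps the read at nbytes either way.

```python
import re
from urllib.request import Request, urlopen


def range_request(url, nbytes):
    """Build a request for bytes 0..nbytes-1 (byte ranges are inclusive)."""
    req = Request(url)
    req.add_header("Range", "bytes=0-%d" % (nbytes - 1))
    return req


def parse_content_range(value):
    """Parse 'bytes first-last/total' into ints; total is None if it is '*'."""
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+|\*)", value.strip())
    if match is None:
        raise ValueError("unrecognised Content-Range: %r" % value)
    first, last, total = match.groups()
    return int(first), int(last), None if total == "*" else int(total)


def fetch_first_bytes(url, nbytes):
    """Return (status, data, content_range) for the first nbytes of url."""
    with urlopen(range_request(url, nbytes)) as response:
        # 206 means the range was honoured; 200 means the server ignored it,
        # so cap the read here rather than pulling the whole body.
        data = response.read(nbytes)
        header = response.headers.get("Content-Range")
        content_range = parse_content_range(header) if header else None
        return response.status, data, content_range
```

For the response shown in the session, parse_content_range('bytes 0-10239/18196') gives (0, 10239, 18196): the first and last byte offsets returned, and the full size of the resource.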

--
Gabriel Genellina

Message has been deleted

Ben Charrow

Jul 28, 2009, 2:18:37 AM
to Dennis Lee Bieber, pytho...@python.org
Dennis Lee Bieber wrote:
> On Mon, 27 Jul 2009 13:38:25 -0700 (PDT), erikcw
> <erikwi...@gmail.com> declaimed the following in
> gmane.comp.python.general:

>> Something like the Python equivalent of curl http://url.com/file.xml |
>> head -c 2048
>>
> Presuming that | is a shell pipe operation, then doesn't that
> command line use "curl" to download the entire file, and "head" to
> display just the first 2k?

No, the entire file is not downloaded. My understanding (which could be wrong)
is that curl's output is piped to head, and once head has read the first 2k it
exits, closing its end of the pipe. When curl next tries to write to the pipe,
it is sent SIGPIPE and exits.
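
That account can be checked without touching the network. The sketch below assumes a POSIX system with the yes and head utilities installed, and substitutes yes (which writes forever) for curl; pipeline_head is a made-up helper name. A negative returncode from subprocess means the child was killed by the signal with that number.

```python
import signal
from subprocess import Popen, PIPE


def pipeline_head(nbytes=2048):
    """Run `yes | head -c nbytes`; return (output, producer_got_sigpipe)."""
    producer = Popen(["yes"], stdout=PIPE)
    consumer = Popen(["head", "-c", str(nbytes)],
                     stdin=producer.stdout, stdout=PIPE)
    # Drop the parent's copy of the pipe's read end; otherwise the pipe
    # stays open after head exits and the producer never sees SIGPIPE.
    producer.stdout.close()
    output = consumer.communicate()[0]
    producer.wait()
    return output, producer.returncode == -signal.SIGPIPE
```

With curl in place of yes the effect is the same, though curl may already have received somewhat more than nbytes from the server, since data is buffered in the pipe and the TCP stack before head exits.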

Cheers,
Ben