Web Scraping - Output File

SMac...@comcast.net

Apr 26, 2012, 1:54:27 PM

Hello,

I am having some difficulty generating the output I want from web
scraping. Specifically, the script I wrote, while it runs without any
errors, is not writing to the output file correctly. It runs, and
creates the output .txt file; however, the file is blank (ideally it
should be populated with a list of names).

I took the base of a program that I had before for a different data
gathering task, which worked beautifully, and edited it for my
purposes here. Any insight as to what I might be doing wrong would be
highly appreciated. Code is included below. Thanks!

import os
import re
import urllib2

outfile = open("Skadden.txt","w")

A = 1
Z = 26

for letter in range(A,Z):

    for line in urllib2.urlopen("http://www.skadden.com/Index.cfm?contentID=44&alphaSearch="+str(letter)):

        x = line
        if '">' in line:
            start=x.find('">"')
            end= x.find('</A></td>',start)
            name=x[start:end]
            outfile.write(name+"\n")
            print name

Kiuhnm

Apr 26, 2012, 2:19:02 PM

On 4/26/2012 19:54, SMac...@comcast.net wrote:
> Hello,
>
> I am having some difficulty generating the output I want from web
> scraping. Specifically, the script I wrote, while it runs without any
> errors, is not writing to the output file correctly. It runs, and
> creates the output .txt file; however, the file is blank (ideally it
> should be populated with a list of names).
>
> I took the base of a program that I had before for a different data
> gathering task, which worked beautifully, and edited it for my
> purposes here. Any insight as to what I might be doing wrong would be
> highly appreciated. Code is included below. Thanks!
>
> import os
> import re
> import urllib2
>
> outfile = open("Skadden.txt","w")
>
> A = 1
> Z = 26
>
> for letter in range(A,Z):
>
> for line in urllib2.urlopen("http://www.skadden.com/Index.cfm?contentID=44&alphaSearch="+str(letter)):

You need
alphaSearch=a
but you're using
alphaSearch=1
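
To illustrate the difference, here is a minimal sketch that builds the query values from letters instead of numbers. It is written for Python 3, and `query_values` is a hypothetical helper name, not part of the original script:

```python
import string

BASE_URL = "http://www.skadden.com/Index.cfm?contentID=44&alphaSearch="

def query_values():
    """Return the 26 lowercase letters, one per alphaSearch request."""
    return list(string.ascii_lowercase)

# str(1) gives "1", which is what the original loop appended to the URL;
# a letter such as "a" is needed instead.
urls = [BASE_URL + letter for letter in query_values()]
print(urls[0])
print(len(urls))
```

Iterating over the letters directly also avoids the off-by-one risk of converting between numbers and characters.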

> x = line
> if '">' in line:

You should search for ' >'.

> start=x.find('">"')

Ditto.

> end= x.find('</A></td>',start)
> name=x[start:end]

You should use start+5 to skip ' >'.
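
The reason for the offset: `find` returns the index where the marker begins, not where it ends, so the slice has to start that many characters later. A small sketch with made-up sample data follows (Python 3). The marker `'">'`, the sample line, and the helper name `extract_between` are assumptions for illustration, not the actual page markup:

```python
def extract_between(line, start_marker, end_marker):
    """Return the text between the two markers, or None if either is missing."""
    start = line.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)  # move past the marker itself
    end = line.find(end_marker, start)
    if end == -1:
        return None
    return line[start:end]

sample = '<td><A HREF="bio.cfm?id=1">Jane Doe</A></td>'
print(extract_between(sample, '">', '</A></td>'))   # Jane Doe
print(extract_between(sample, '">"', '</A></td>'))  # None
```

Using `len(start_marker)` instead of a hard-coded number keeps the offset correct if the marker changes, and checking for -1 avoids silently slicing from the wrong position when the marker is absent.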

> outfile.write(name+"\n")
> print name

Your code is bound to break over and over (you should do some smarter parsing), but here's a working version:

--->

import os
import re
import urllib2

outfile = open("Skadden.txt","w")

A = ord('a')
Z = ord('z')

# range() excludes its end value, so use Z + 1 to include 'z'
for letter in range(A, Z + 1):
    for line in urllib2.urlopen("http://www.skadden.com/Index.cfm?contentID=44&alphaSearch="+chr(letter)):

        x = line
        if ' >' in line:
            start=x.find(' >')
            end= x.find('</A></td>',start)

            # skip past the marker to the start of the name
            name=x[start+5:end]

            outfile.write(name+"\n")
            print name

# flush the buffered names to disk
outfile.close()

<---
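
As a sketch of what "smarter parsing" could look like, the following uses the standard library's HTML parser instead of string searches. It is written for Python 3, where `HTMLParser` lives in `html.parser`. The class name `LinkTextParser` and the sample markup are assumptions for illustration, since the real page structure is not shown here:

```python
from html.parser import HTMLParser

class LinkTextParser(HTMLParser):
    """Collect the text found inside every <a>...</a> element."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <A> and <a> both arrive as "a"
        if tag == "a":
            self.in_link = True
            self.names.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the current entry
        if self.in_link:
            self.names[-1] += data

# Hypothetical markup resembling a table of linked names
sample_html = (
    '<table><tr><td><A HREF="x.cfm">Jane Doe</A></td></tr>'
    '<tr><td><A HREF="y.cfm">John Roe</A></td></tr></table>'
)

parser = LinkTextParser()
parser.feed(sample_html)
parser.close()
print(parser.names)
```

A parser like this does not depend on how the markup is split across lines or on the exact characters around each name, which is where the string-search approach tends to break.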

Kiuhnm

SMac...@comcast.net

Apr 26, 2012, 4:47:00 PM

On Apr 26, 2:19 pm, Kiuhnm <kiuhnm03.4t.yahoo.it> wrote:

Great, thanks so much for your help!
