Address data


Richard.Williams.20

Aug 1, 2009, 6:52:46 PM
to biterscripting
I was asked the following question (reworded).

> I have a string value in a variable $address of the
> following form.
>
> "XXX-YYYa New ABC DEF Rd Double Bay NSW 2028"

> can i divide it further while extracting like
> Street--302-304a New South Head Rd
> Suburb--Double Bay
> State--NSW
> Postcode--2028
> into columns of excel?


What are the criteria that separate the street, suburb, state and
postcode? I can see from your example that the state and postcode
are separated by multiple spaces. So the postcode can be separated
as follows.

var string postcode
stex -r "^ ;^[" $address > $postcode

Basically, we are extracting the string after a space and any number
of unprintable characters from $address and saving it into $postcode.

If all the states have only one word (no spaces), then extracting the
last word from the remaining $address will give us the state.
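
For readers working outside biterscripting, the same "peel words off the end" idea can be sketched in Python. This is only an illustration under the assumptions stated above: the postcode is the last word and the state is the single word before it. The function name split_state_postcode is made up for this example.

```python
def split_state_postcode(address):
    """Split an address into (rest, state, postcode).

    Assumes the postcode is the last whitespace-separated word and the
    state is the single word before it, as in the example address.
    """
    # rsplit with no separator treats any run of spaces as one separator,
    # so multiple spaces between state and postcode are handled.
    parts = address.rsplit(None, 2)
    if len(parts) != 3:
        raise ValueError("expected at least three words in the address")
    rest, state, postcode = parts
    return rest, state, postcode


rest, state, postcode = split_state_postcode(
    "302-304a New South Head Rd Double Bay NSW   2028"
)
print(state, postcode)
```

The remaining text still holds the street and the suburb together, which is the harder part of the problem.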

Separating the suburb will be a little difficult because it can be one
word, two words, three words, etc.

If you are able to get a complete list of suburbs from some web site,
then it can be done.
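
A rough Python sketch of the suburb-list idea, assuming you already have such a list: try the longest suburb names first and see whether the text ends with one of them. The function name split_suburb and the sample list are hypothetical, not taken from any real data set.

```python
def split_suburb(street_and_suburb, known_suburbs):
    """Return (street, suburb) by matching a known suburb at the end.

    known_suburbs is an assumed, caller-supplied list of suburb names.
    Returns (text, "") when no known suburb matches.
    """
    text = street_and_suburb.strip()
    # Longest names first, so "Double Bay" wins over a shorter "Bay".
    for suburb in sorted(known_suburbs, key=len, reverse=True):
        if text.lower().endswith(" " + suburb.lower()):
            street = text[: -len(suburb)].rstrip()
            return street, text[-len(suburb):]
    return text, ""


street, suburb = split_suburb(
    "302-304a New South Head Rd Double Bay",
    ["Bay", "Double Bay", "Bondi"],  # hypothetical list
)
print(street, "|", suburb)
```

Matching is case-insensitive, and an address whose suburb is missing from the list is returned unsplit so it can be checked by hand.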

Richard




Bharath Ram

Aug 6, 2009, 5:59:32 AM
to biterscripting
Thanks for the reply. One more thing: is there an instruction called
next, so that I can extract entries from 10 or 15 pages at once? Just a
doubt. I have already extracted the entries using the script you gave.


Richard.Williams.20

Aug 6, 2009, 11:37:45 AM
to biterscripting

Can you please post the script you are using? It will be good to have
that script in front of us for this discussion.

I am not aware of an instruction/command called next. The usual way
is to extract the URL for the NEXT link on a page and call the script
again for that link. Typically, this is done using 2 scripts.

Script1: Parses one page, gets data and gets the URL for the NEXT
link.
Script2: Repeatedly calls Script1 for the URL for the NEXT link.

Sometimes, this can also be done using an index number in the URL. If
you post your Script1, I can help write Script2.
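
The two-script arrangement can be sketched in Python as follows. This is an illustration only: it assumes the page marks its next link as an anchor whose text is "Next", and it takes the page-fetching function as a parameter so the loop can be tried without a network. The names find_next_url and crawl are made up for this example.

```python
import re
from urllib.parse import urljoin

# Assumed markup: <a ... href="...">Next</a>. Real pages may differ.
NEXT_LINK = re.compile(
    r'<a\s[^>]*?href="([^"]+)"[^>]*>\s*Next\s*</a>', re.IGNORECASE
)


def find_next_url(page_url, html):
    """Role of Script1's last step: return the NEXT link's URL, or None."""
    match = NEXT_LINK.search(html)
    return urljoin(page_url, match.group(1)) if match else None


def crawl(start_url, fetch, max_pages=15):
    """Role of Script2: keep following NEXT links up to max_pages."""
    pages = []
    url = start_url
    while url is not None and len(pages) < max_pages:
        html = fetch(url)  # fetch is any callable mapping URL -> HTML text
        pages.append(html)
        url = find_next_url(url, html)
    return pages


# Usage with an in-memory stand-in for a web site.
site = {
    "http://example.com/p1": '<a href="p2">Next</a>',
    "http://example.com/p2": "last page",
}
print(len(crawl("http://example.com/p1", site.__getitem__)))
```

The max_pages cap guards against a site whose last page links back to the first.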

Richard

Bharath Ram

Aug 6, 2009, 8:45:55 PM
to bitersc...@googlegroups.com
# Script page.txt
# Get page contents into a string variable.
echo -e "DEBUG: Reading web page"
var str content ; cat "http://yellowpages.com.au/search/postSearchEntry.do?clueType=0&clue=Real+Estate&locationClue=New+south+wales&x=55&y=14" > $content
 
echo -e "DEBUG: Extracting entries"
# Successively extract portions between <span id="listing-name-xxx"> and the following 3rd instance of </span>.
while ( {sen -c -r "^<span id=\"listing-name-&\"\>^" $content } > 0 )
do
    # Discard the portion up to the <span id="listing-name-xxx">
    stex -c -r "^<span id=\"listing-name-&\"\>^]" $content > null
 
    # Collect the portion up to the 3rd instance of </span>.
    var str entry ; stex -c "]^</span>^3" $content > $entry
 
    # $entry now contains one entry. Portion up to 1st </span> is the name.
    var str name ; stex -c "]^</span>^" $entry > $name
 
    # Discard the portion up to <span class="address">.
    stex -c "^<span class=\"address\">^]" $entry > null
 
    # Portion up to 1st </span> is the address.
    var str address ; stex -c "]^</span>^" $entry > $address
 
    # Discard the portion up to <span class="phoneNumber">ph: .
    stex -c "^<span class=\"phoneNumber\">ph: ^]" $entry > null
 
    # The remaining portion is the phone.
    var str phone ; set $phone = $entry
 
    # Output the name, address and phone in tab separated values format.
    echo $name "\t" $address "\t" $phone
 
done
# End of script page.txt


This is the script. Thanks for your reply.
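
For comparison, the extraction loop in page.txt can be approximated in Python with one regular expression. This is a sketch, not a translation of the stex commands: it assumes each listing has the three spans named in the script's comments (listing-name-xxx, address, phoneNumber) in that order. The function name extract_entries and the sample markup are hypothetical.

```python
import re

# One listing = name span, then address span, then phone span.
ENTRY = re.compile(
    r'<span id="listing-name-[^"]*">(?P<name>.*?)</span>'
    r'.*?<span class="address">(?P<address>.*?)</span>'
    r'.*?<span class="phoneNumber">ph: (?P<phone>.*?)</span>',
    re.DOTALL,
)


def extract_entries(html):
    """Return a list of (name, address, phone) tuples found in html."""
    return [
        (
            m.group("name").strip(),
            m.group("address").strip(),
            m.group("phone").strip(),
        )
        for m in ENTRY.finditer(html)
    ]


# Made-up sample markup, shaped like the spans the script looks for.
sample = (
    '<span id="listing-name-1">Acme Realty</span>'
    '<span class="address">1 High St Sydney NSW 2000</span>'
    '<span class="phoneNumber">ph: (02) 5550 0000</span>'
)
for name, address, phone in extract_entries(sample):
    print(name, address, phone, sep="\t")
```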

Richard.Williams.20

Aug 7, 2009, 12:31:02 PM
to biterscripting

I think in this case, it will be better to use an index for pageNumber
instead of getting the URL for each Next button. (Sometimes, the other
approach works.)

I would pass a pageNumber as an argument to page.txt by changing the

var str content ; cat "whatever" > $content

to

var int pageNumber
var str content
cat ("whatever&pageNumber="+makestr(int($pageNumber))) > $content
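
The same URL-building step, sketched in Python for illustration. It assumes the site accepts a pageNumber query parameter, as the change above does; the helper name page_url and the example.com address are placeholders.

```python
from urllib.parse import urlencode


def page_url(base_url, page_number):
    """Append pageNumber=<n> to base_url as a query parameter."""
    # Use "&" when the URL already has a query string, "?" otherwise.
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + urlencode({"pageNumber": page_number})


print(page_url("http://example.com/search?clue=Real+Estate", 3))
```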


Then the following outer script will get multiple pages.

# Script pageloop.txt
var int max
var int index
set $index = 1
while ($index < $max)
do
    echo -e "DEBUG: Script pageloop.txt: Getting page " $index
    script page.txt pageNumber($index)
    set $index=$index+1
done


Let me know if this works for you.

Richard

(When using biterscripting or any other language to get data from a
web page, it is always a good idea to make sure it is ok to do so with
the site's owner.)

Richard.Williams.20

Aug 7, 2009, 12:49:17 PM
to biterscripting

Call the pageloop.txt script as follows.

script pageloop.txt max(5)

or with whatever number of pages you want to get in place of 5.

Richard

Bharath Ram

Aug 7, 2009, 9:03:45 PM
to biterscripting
The program works fine. I tried it and it gave me the desired
output. But how do I save all the pages into Excel?
I tried
script pageloop.txt max(5) > output.xls
system start output.xls

But it saves only the first page into Excel. How do I save the other pages?


Richard.Williams.20

Aug 8, 2009, 12:17:42 PM
to biterscripting

It is saving 160 entries into one Excel spreadsheet, output.xls.

Compare the spreadsheet with what you see when browsing manually. You
will find it is saving the first four (4) pages. That is because in
script pageloop.txt the while condition only lets $index go up to 4.
That was my oversight. I will let you figure it out and fix it so it
will go up to 5 pages.

Also, there seems to be one extraneous entry. Just manually remove it
from Excel.
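
The spreadsheet in this thread is simply tab-separated text that Excel opens. As an illustration, here is how the same kind of output could be produced in Python with the standard csv module. The function name rows_to_tsv and the sample row are made up.

```python
import csv
import io


def rows_to_tsv(rows):
    """Format (name, address, phone) rows as tab-separated text."""
    buffer = io.StringIO()
    # csv.writer quotes any field that itself contains a tab or newline.
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical row, shaped like the script's output.
text = rows_to_tsv([("Acme Realty", "1 High St Sydney NSW 2000", "(02) 5550 0000")])
print(text, end="")
```

Collecting the rows from every page first and writing them once keeps all pages in a single file.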

If you make any of these scripts better, please post them here or some
other forum, so others can benefit from it.

Richard

Bharath Ram

Aug 8, 2009, 1:01:43 PM
to bitersc...@googlegroups.com
For sure. I am already telling people about your awesome application wherever I can.

Richard.Williams.20

Aug 9, 2009, 12:49:49 PM
to biterscripting

All I did was write a small script for you. Once you know how to
parse a web page with biterscripting, any new page/format takes about
10 minutes of thinking and 2 mins of scripting. But, thanks for
appreciating even that small effort. It made my day.

Richard

Bharath Ram

Aug 10, 2009, 10:17:30 PM
to bitersc...@googlegroups.com
Hi there. You gave this program for saving multiple pages into Excel.

# Script pageloop.txt
var int max
var int index
set $index = 1
while ($index < $max)
do
   echo -e "DEBUG: Script pageloop.txt: Getting page " $index
   script page.txt pageNumber($index)
   set $index=$index+1
done

To save into Excel, I used the instructions


script pageloop.txt max(5) > output.xls
system start output.xls

But the program is saving only the first page, 4 times. It is not saving the 2nd, 3rd or 4th page. Can this be rectified?

Richard.Williams.20

Aug 11, 2009, 10:24:24 AM
to biterscripting

Did you remember to make the following change in the page.txt script?
(See the earlier posting.)



from

var str content ; cat "whatever" > $content

to

var int pageNumber
var str content
cat ("whatever&pageNumber="+makestr(int($pageNumber))) > $content


Without this change, the value of input argument $pageNumber would not
be passed as part of the URL to the cat command, and only the
"whatever" page will be extracted 4 times.

Also, please change in pageloop.txt

from

while ($index < $max)

to

while ($index <= $max)

That will properly get 5 pages instead of 4.
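
To see the off-by-one concretely, here is a small Python sketch of the same loop with both conditions. It only mimics the counting in pageloop.txt; the function name pages_visited is made up.

```python
def pages_visited(max_pages, inclusive):
    """Return the page numbers the loop visits, starting from 1.

    inclusive=False mimics `while ($index < $max)`,
    inclusive=True mimics `while ($index <= $max)`.
    """
    visited = []
    index = 1
    while (index <= max_pages) if inclusive else (index < max_pages):
        visited.append(index)
        index += 1
    return visited


print(pages_visited(5, inclusive=False))  # stops one page short
print(pages_visited(5, inclusive=True))   # includes page 5
```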

Richard


Bharath Ram

Aug 12, 2009, 11:23:38 AM
to bitersc...@googlegroups.com
Yes, I have changed it.


# Script page.txt
# Get page contents into a string variable.
echo -e "DEBUG: Reading web page"
var int pageNumber
var str content
cat ("http://yellowpages.com.au/search/listingsSearch.do?region=australia&ul.street=&headingCode=23167&userFreeFormAddress=Street+Number+%26+Name&sortByAlphabetical=false&rankType=1&userState=select+---%3E&sortByDistance=false&pageNumber=31&locationForSortBySelected=false&locationText=Queensland&userSuburb=Suburb%2C+Town%2C+City&adPs=&adPs=&adPs=&adPs=&adPs=&ul.streetNumber=&sortByDetail=true&ul.suburb=&businessType=realestate&sortByClosestMatch=false&rankWithTolls=true&sortBy=mostInfo&stateId=4&safeLocationClue=Queensland&currentLetter=&locationClue=Queensland&serviceArea=true&suburbPostcode=&pageNumber="+makestr(int($pageNumber))) > $content

                                             
echo -e "DEBUG: Extracting entries"
# Successively extract portions between <span id="listing-name-xxx"> and the following 3rd instance of </span>.
while ( {sen -c -r "^<span id=\"listing-name-&\"\>^" $content } > 0 )
do
    # Discard the portion up to the <span id="listing-name-xxx">
    stex -c -r "^<span id=\"listing-name-&\"\>^]" $content > null
 
    # Collect the portion up to the 3rd instance of </span>.
    var str entry ; stex -c "]^</span>^3" $content > $entry
 
    # $entry now contains one entry. Portion up to 1st </span> is the name.
    var str name ; stex -c "]^</span>^" $entry > $name
 
    # Discard the portion up to <span class="address">.
    stex -c "^<span class=\"address\">^]" $entry > null
 
    # Portion up to 1st </span> is the address.
    var str address ; stex -c "]^</span>^" $entry > $address
 
    # Discard the portion up to <span class="phoneNumber">ph: .
    stex -c "^<span class=\"phoneNumber\">ph: ^]" $entry > null
 
    # The remaining portion is the phone.
    var str phone ; set $phone = $entry
 
    # Output the name, address and phone in tab separated values format.
    echo $name "\t" $address "\t" $phone
 
done
# End of script page.txt


This is the program I am using. But it extracts the same page (I mean the first page) 5 times. It doesn't save any other pages into Excel. How can this be rectified?

Richard.Williams.20

Aug 12, 2009, 2:00:18 PM
to biterscripting

I see, the URL has changed. (I like the new URL - it is more
explicit.)

Anyway, there is an extraneous &pageNumber=31 in the URL that comes
before pageNumber="+makestr(). So, I suspect that it is always showing
you page 31. When I removed &pageNumber=31, I got 5 pages.
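
One way to guard against a stray duplicate like this, sketched in Python: drop every existing pageNumber parameter from the URL before adding the one you want. This is an illustration using the standard urllib.parse module; the function name set_page_number and the shortened example URL are made up.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def set_page_number(url, page_number):
    """Return url with exactly one pageNumber parameter, set to page_number."""
    parts = urlsplit(url)
    # Keep every parameter except existing pageNumber entries.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "pageNumber"
    ]
    params.append(("pageNumber", str(page_number)))
    # Note: the query string is re-encoded, so escapes may be normalized.
    return urlunsplit(parts._replace(query=urlencode(params)))


print(set_page_number(
    "http://example.com/search?pageNumber=31&locationClue=Queensland", 2
))
```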

Richard