
How to extract all links/URLs from a web page?

learnyourabc

May 5, 2007, 8:59:37 AM
A web crawler needs to extract every link from a web page. Links in
normal HTML anchor tags, or in the src or href attribute of any tag,
can be extracted easily using IHTMLDocument.
But what about links inside a JavaScript function, like the one below?

<HEAD>
<SCRIPT language="JavaScript">
<!--hide

function newwindow()
{
window.open('jex5.htm','jav','width=300,height=200,resizable=yes');
}
//-->
</SCRIPT>
</HEAD>

<A HREF="javascript:newwindow()" >Click Here!</A>

or a JavaScript function like the following:
function newwindow()
{
.....
window.location = 'http://www.google.com';
}

<input type=button onclick="javascript:newwindow()" >Click Here!

How can the links be extracted from these JavaScript functions?

Any help would be much appreciated. Can a crawler extract such links,
and if so, how?
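
For reference, the easy case described above (links held directly in href and src attributes) can be sketched as follows. This uses Python's standard html.parser rather than IHTMLDocument, a swap made only to keep the example self-contained. The names AttributeLinkCollector and extract_attribute_links, and the file logo.gif, are hypothetical.

```python
from html.parser import HTMLParser


class AttributeLinkCollector(HTMLParser):
    """Collects the values of href and src attributes on any tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser lowercases names.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def extract_attribute_links(html):
    """Return every href/src value found in the given HTML text."""
    collector = AttributeLinkCollector()
    collector.feed(html)
    return collector.links


# logo.gif is a made-up file name used only for illustration.
page = '<A HREF="javascript:newwindow()">Click Here!</A><img src="logo.gif">'
print(extract_attribute_links(page))
# prints ['javascript:newwindow()', 'logo.gif']
```

The javascript: pseudo-URL comes back unchanged, which is exactly the case the question is about: the attribute names a function, not a destination.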

vincen...@gmail.com

May 5, 2007, 9:09:24 AM

Regular expressions are the best way to go. Store the entire HTML
contents in a string and search it for pattern matches. You can find
a ton of RegEx tutorials online.
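
A minimal sketch of that idea, assuming Python and its standard re module. The pattern is only a rough heuristic, not a complete URL grammar: it picks up quoted string literals that either start with http:// or https://, or end in one of a few page extensions. The extension list and the name extract_literal_links are assumptions made for illustration.

```python
import re

# Quoted literals that look like an absolute URL or a page name.
# The extension list is an assumed, deliberately narrow heuristic.
URL_LITERAL = re.compile(
    r"""["']((?:https?://[^"'\s]+)|(?:[^"'\s]+\.(?:html?|php|asp|jsp)))["']""",
    re.IGNORECASE,
)


def extract_literal_links(source):
    """Return URL-like string literals found anywhere in the page source."""
    return [match.group(1) for match in URL_LITERAL.finditer(source)]


source = """
function newwindow()
{
window.open('jex5.htm','jav','width=300,height=200,resizable=yes');
}
function other()
{
window.location = 'http://www.google.com';
}
"""
print(extract_literal_links(source))
# prints ['jex5.htm', 'http://www.google.com']
```

This only finds links written out in full as string literals. It ignores the window name ('jav') and the feature string because neither looks like a URL.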

learnyourabc

May 6, 2007, 5:13:31 AM
Regular expressions can only extract a link if it appears inside the
JavaScript as clear text. How can I automatically extract every link
that is built inside JavaScript, say from a combination of variables?
Do I have to execute the script behind the button's onclick, etc., to
get the link? Does anyone have any suggestions?
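
In general, a link assembled at run time can only be recovered reliably by executing the script in a JavaScript engine. Short of that, one partial approach is to resolve simple string concatenations statically. The sketch below assumes Python and handles only assignments built from quoted literals and previously assigned variables joined with "+"; anything else is skipped. The function name and the example.com URL are hypothetical.

```python
import re

# name = expr;  with an optional leading "var". "==" comparisons are excluded.
ASSIGNMENT = re.compile(r"(?:var\s+)?([A-Za-z_$][\w$]*)\s*=(?!=)\s*([^;]+);")
# A whole term that is a single- or double-quoted string literal.
STRING = re.compile(r"""^(["'])(.*)\1$""")


def resolve_string_variables(script):
    """Return {variable: value} for variables built from literals and earlier variables."""
    values = {}
    for name, expression in ASSIGNMENT.findall(script):
        parts = []
        # Naive split: a literal that itself contains "+" is not handled.
        for term in expression.split("+"):
            term = term.strip()
            literal = STRING.match(term)
            if literal:
                parts.append(literal.group(2))
            elif term in values:
                parts.append(values[term])
            else:
                parts = None  # unknown term (call, number, ...): give up
                break
        if parts is not None:
            values[name] = "".join(parts)
    return values


script = """
var host = 'http://www.example.com/';
var page = 'jex5' + '.htm';
var target = host + page;
"""
print(resolve_string_variables(script)["target"])
# prints http://www.example.com/jex5.htm
```

The resolved values can then be filtered with a URL pattern. Links that depend on loops, function calls, or user input are out of reach for this approach and still require running the script.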
