Parsing for urls


cengineer

Nov 23, 2009, 9:31:52 AM
to DotNetDevelopment, VB.NET, C# .NET, ADO.NET, ASP.NET, XML, XML Web Services,.NET Remoting
I recently learned how to crawl web pages and parse the HTML source
code using VB.NET. This has worked well so far, and my code runs
error free. I recently came across a website that contains links I
would like to capture, but the links as they appear on the web page are
not visible in the HTML source. I tried to extract them a different way,
by visiting the articles and posts, which worked, but some of those
links were not included.

My question is whether this is unusual or whether there is something I
need to read up on. If you hover over a link or click on it, it goes to
the linked site as expected, so why is it not in the HTML source? Does
this have something to do with other scripts running on the page?

Any advice or suggestions appreciated.

Processor Devil

Nov 23, 2009, 11:41:56 AM
to dotnetde...@googlegroups.com
Links can be created dynamically using JavaScript, or maybe they are in some frameset or iframe... :D

2009/11/23 cengineer <ceng...@hushmail.com>

Cerebrus

Nov 23, 2009, 11:30:17 AM
to DotNetDevelopment, VB.NET, C# .NET, ADO.NET, ASP.NET, XML, XML Web Services,.NET Remoting
Your hunch is spot on... the links are probably created dynamically
using client-side script (read "JavaScript").

Adam Lee

Nov 23, 2009, 11:45:37 PM
to DotNetDevelopment, VB.NET, C# .NET, ADO.NET, ASP.NET, XML, XML Web Services,.NET Remoting
cengineer, would you mind posting some code or sending me the basic
code to do web crawling?

Processor Devil

Nov 24, 2009, 5:24:04 AM
to dotnetde...@googlegroups.com
Here is a simple function to extract links from HTML source code...
You need to import the namespaces System.Text.RegularExpressions and System.Collections.Generic.

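        public static List<string> getLinks(string data)
        {
            // Match href anywhere inside the <a> tag, ignoring case, so tags
            // such as <A class="x" HREF="..."> are captured as well.
            Regex rx = new Regex("<a\\s[^>]*?href\\s*=\\s*\"([^\"]*)\"", RegexOptions.IgnoreCase);
            List<string> forReturn = new List<string>();
            // Group 1 holds the quoted href value of each match.
            foreach (Match m in rx.Matches(data))
                forReturn.Add(m.Groups[1].Value);
            return forReturn;
        }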
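
Following up on the frameset/iframe suggestion earlier in the thread: if the missing links live in a framed document, they will not show up in the outer page's source, so each frame has to be downloaded and parsed on its own. Below is a rough sketch, not a tested crawler, that collects the src values of frame and iframe tags. The names FrameScanner and GetFrameSources are made up for illustration. Like the function above, it relies on a regular expression, so it will miss unquoted attributes and anything generated by script at run time.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FrameScanner
{
    // Hypothetical helper (not from the original post): returns the src
    // attribute of every <frame> and <iframe> tag found in the HTML, so
    // that each framed document can be fetched and scanned separately.
    public static List<string> GetFrameSources(string html)
    {
        // "<i?frame\b" matches <frame and <iframe but not <frameset.
        // The value must be wrapped in single or double quotes.
        Regex rx = new Regex(
            "<i?frame\\b[^>]*?\\ssrc\\s*=\\s*[\"']([^\"']+)[\"']",
            RegexOptions.IgnoreCase);
        List<string> sources = new List<string>();
        foreach (Match m in rx.Matches(html))
            sources.Add(m.Groups[1].Value);
        return sources;
    }
}
```

The returned values may be relative URLs, so resolve each one against the address of the page it came from (for example with new Uri(baseUri, src)) before downloading it, then pass the downloaded HTML to getLinks.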

2009/11/24 iwork iwork <iwork....@gmail.com>
Me too.
