C# - Sample code to demonstrate web crawling?

Dixie Normous

Dec 28, 2008, 2:26:15 AM
to DotNetDevelopment, VB.NET, C# .NET, ADO.NET, ASP.NET, XML, XML Web Services,.NET Remoting
I'm researching "screen scraping" or "web crawling": basically,
creating a tool that can activate controls on a particular web page
to simulate user activity. Specifically, I want to access an external
customer website that has various dropdowns for drilling down through
selections until you reach a specific set of criteria, and then
retrieve the data those criteria return.

I already have the credentials and authorization to access the site
manually. I'd just like to automate the process, both to make things
easier for me and to help me learn how crawling is accomplished
through .NET.

If someone could please help me find some sample code or help me with
some additional info I'd appreciate it greatly. Thanks!!

Cerebrus

Dec 28, 2008, 9:58:36 AM
I have done this sort of thing pretty effectively in the past. It's
all about understanding how pages post data and the technology (ASP/
ASP.NET/JSP/PHP...) those pages use to validate or process that
submitted data.

If you have a specific issue, I'd be most happy to help out.

Dixie Normous

Dec 28, 2008, 2:57:51 PM
Thank you Cerebrus!!

Ok, say you have a page where you search for a Customer. It's got
radio buttons where you choose what search criteria to use; for
example, one might let you search by First and Last Name, another
might let you search by Address, yet another by an AccountNumber.

To automate this, I'd want to be able to identify the controls I need
to manipulate to choose the search method, and then be able to input
the search criteria accordingly. So if I choose the AccountNumber
search method, I'd want to be able to take some value stored elsewhere
in my code and input that into the AccountNumber field, then
programmatically hit the Search button and let the site do its work.
Then, when the search has been completed, I'd need to be able to grab
the result set, if any, and do whatever else I need to with it.

TechOwl

Dec 29, 2008, 6:53:24 AM
Dixie...

You are being a little too vague and not providing enough information
for us to really help you. I believe what you are wanting to do is
"doable" but we need more detailed information.

For example, is the external site a .NET site or something else? The
distinction is important because with an ASP.NET site, for example,
you have to deal with things like ViewState.

I have done this recently, and am also doing it now for a current
project, so if you want to share additional details and get help, I
can certainly try to help if you want to e-mail me or something.

This is not very straightforward, and it is also not really a
recommended way of developing solutions, because your code ends up
depending on the remote site's markup and behavior staying the same.

For starters...

1) Get Fiddler & learn some about it: http://www.fiddlertool.com/fiddler/
2) Look into the System.Net namespace (if you're dealing with .NET) :
http://msdn.microsoft.com/en-us/library/system.net.aspx
[specifically HttpWebRequest, WebResponse, as well as
System.IO.StreamReader]
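
As a rough illustration of how those three classes fit together, here
is a minimal sketch that downloads a page with a plain GET request.
The URL and the FetchPage, Fetch and ReadBody names are placeholders
made up for this example, not anything from the site discussed in
this thread.

```csharp
using System;
using System.IO;
using System.Net;

public static class FetchPage
{
    // Reads an entire response stream into a string
    // (StreamReader decodes as UTF-8 by default).
    public static string ReadBody(Stream body)
    {
        using (StreamReader reader = new StreamReader(body))
        {
            return reader.ReadToEnd();
        }
    }

    // Issues a GET request and returns the HTML of the page.
    public static string Fetch(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";

        using (WebResponse response = request.GetResponse())
        {
            return ReadBody(response.GetResponseStream());
        }
    }

    static void Main()
    {
        // Hypothetical URL; substitute a page you are allowed to access.
        string html = Fetch("http://example.com/search.aspx");
        Console.WriteLine(html.Length + " characters received");
    }
}
```

Comparing the HTML this returns with what Fiddler shows for a browser
request is a quick way to check that the two line up.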

Hope this helps some!

Milo

Dec 29, 2008, 4:13:26 AM
Download an application called "Fiddler". It will capture everything
that is posted when you fill in and submit the form manually.

You then just need to mimic that post programmatically.

Dixie Normous

Dec 29, 2008, 7:24:26 PM
TechOwl, thanks for the links, let me try to present some more
specifics about my situation.

The external site is an internet site which my site has been granted
access to. A group of users at my site shares a username and
password, which they use to manually browse the site to search for
customer info, kind of like a corporate B2B account a reseller might
have to a vendor's site. That site uses ASP.NET 1.1, but I've got no
access beyond the browser to this site per an agreement between our
company and theirs.

I'm trying to build a tool which would basically involve me creating a
"wrapper" site on my corporate intranet to convey search queries to
the remote site. For example, a site on my intranet might show the
various search fields available, the user would input their criteria,
and when they submit this, my backend code (maybe running as a WCF
service or something on one of the servers on my intranet) would
package up and execute a browsing session (crawl?) to the remote site,
and retrieve any results of said query.

Since the remote site doesn't expose any kind of API or anything that
I could query more directly (or less indirectly?!), automating the
web browsing process seems to be the next best option, at least in my
limited experience with this kind of thing so far. I hope this helps
clarify what I'm talking about. If any further details would help, ask
away. Meanwhile, I'll definitely check out Fiddler, as you and Milo
suggested.

Cerebrus

Dec 30, 2008, 12:06:05 PM
I was scraping 4 completely different sites simultaneously with the
same code, so I had a pretty object-oriented approach (read:
polymorphic inheritance) and was running my object in multiple
threads. But the basic principle remains the same. There may be many
ways to do it... Here's how I did it, and I'd suggest you take the
following steps:

1. Understand how the request-response system works for the web and
how your browser works internally.
2. Study the HttpWebRequest and HttpWebResponse classes in the .NET
Framework and understand how they can be used to mimic any web
traffic.
3. Use Fiddler (v2 parses HTTPS!) to analyze the HTTP POST that takes
place when you submit data on the target site. That will give you a
list of form parameters that are passed to the server with each
request. Since this is an ASP.NET site, the ViewState would be one of
the required parameters and must be passed back unmodified.
4. Create a (preferably configurable) list of the form parameters that
are submitted with the request. An XML file that is dynamically loaded
by the application was how I did it.
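5. Load this list and create a string in query string format
(name=value pairs joined by "&") containing all those parameter names
and the values they expect. The values here would be substituted by
the values obtained from users of your intranet site. You will most
likely need to URL-encode the ViewState value, since its Base64 text
can contain characters such as "+" and "=".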
6. Convert the string to a byte array and write this byte array to the
RequestStream exposed by your HttpWebRequest object.
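7. Set as many properties of the HttpWebRequest object as you can,
such as ProtocolVersion, Method, ContentType, Timeout, and any other
headers you need. This information can usually be found by studying
the request that Fiddler captured. In the actual code, set these
before you open the request stream from step 6; Method in particular
has to be POST before GetRequestStream() will work.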
8. Use the HttpWebRequest.GetResponse() method to obtain a
HttpWebResponse object that is the response returned by the server. If
all works well, this would be the next page that you see when manually
submitting data.
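
To make steps 5 to 8 a bit more concrete, here is a minimal sketch
rather than the code I actually used. The URL and the field names
txtAccountNumber and btnSearch are hypothetical; the real ones come
from your Fiddler capture. It also assumes .NET 4.5 or later for
WebUtility.UrlEncode (on older frameworks, HttpUtility.UrlEncode in
System.Web does the same job), and it builds the field list in code
instead of loading it from an XML file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public static class FormPoster
{
    // Pulls the __VIEWSTATE hidden field out of a fetched page. Assumes
    // value="..." follows name="__VIEWSTATE" inside the same tag; returns
    // null when the field is not found.
    public static string ExtractViewState(string html)
    {
        Match m = Regex.Match(html, "name=\"__VIEWSTATE\"[^>]*value=\"([^\"]*)\"");
        return m.Success ? m.Groups[1].Value : null;
    }

    // Step 5: joins name/value pairs into a URL-encoded form body.
    public static string BuildBody(IEnumerable<KeyValuePair<string, string>> fields)
    {
        StringBuilder sb = new StringBuilder();
        foreach (KeyValuePair<string, string> field in fields)
        {
            if (sb.Length > 0) sb.Append('&');
            sb.Append(WebUtility.UrlEncode(field.Key));
            sb.Append('=');
            sb.Append(WebUtility.UrlEncode(field.Value));
        }
        return sb.ToString();
    }

    // Steps 6 to 8: sends a GET when body is null, otherwise a form POST,
    // and returns the HTML of the response.
    public static string Send(string url, string body)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Timeout = 30000; // milliseconds

        if (body != null)
        {
            byte[] payload = Encoding.UTF8.GetBytes(body);
            // Method and ContentType are set before the request stream is opened.
            request.Method = "POST";
            request.ContentType = "application/x-www-form-urlencoded";
            request.ContentLength = payload.Length;
            using (Stream stream = request.GetRequestStream())
            {
                stream.Write(payload, 0, payload.Length);
            }
        }

        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // Hypothetical URL and field names; read the real ones from Fiddler.
        string url = "http://example.com/search.aspx";
        string viewState = ExtractViewState(Send(url, null)) ?? "";

        List<KeyValuePair<string, string>> fields = new List<KeyValuePair<string, string>>();
        fields.Add(new KeyValuePair<string, string>("__VIEWSTATE", viewState));
        fields.Add(new KeyValuePair<string, string>("txtAccountNumber", "12345"));
        fields.Add(new KeyValuePair<string, string>("btnSearch", "Search"));

        string resultHtml = Send(url, BuildBody(fields));
        Console.WriteLine(resultHtml.Length + " characters received");
    }
}
```

The page is fetched with a GET first so that the ViewState sent back
is the one the server just issued, which is what "untampered" comes
down to in practice.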

You can then analyze this page for any data you want.
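
How you analyze it depends entirely on the markup that comes back. As
one hedged example, assuming the results arrive in a simple HTML table
with no nested tables, a sketch like this collects the text of every
cell. ResultParser and ExtractCells are names invented for the
example, and a proper HTML parser would be more robust than regular
expressions on real pages.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class ResultParser
{
    // Returns the text of each <td> cell in document order.
    public static List<string> ExtractCells(string html)
    {
        List<string> cells = new List<string>();
        MatchCollection matches = Regex.Matches(
            html, @"<td[^>]*>(.*?)</td>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        foreach (Match m in matches)
        {
            // Drop any tags nested inside the cell, then decode entities.
            string inner = Regex.Replace(m.Groups[1].Value, "<[^>]+>", "");
            cells.Add(WebUtility.HtmlDecode(inner).Trim());
        }
        return cells;
    }

    static void Main()
    {
        // Made-up sample markup standing in for a real result page.
        string html = "<table><tr><td>1001</td>" +
                      "<td class=\"n\"><b>Smith &amp; Co</b></td></tr></table>";

        foreach (string cell in ExtractCells(html))
        {
            Console.WriteLine(cell);
        }
    }
}
```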

Hope that helps!