
Webcrawler / Search engine


Noni

Apr 7, 2004, 5:27:42 PM
Hi

Need help !!

I've embarked on a journey to develop a webcrawler / search engine.
It will eventually get late holiday deals and info from the web and
display them in an orderly fashion.

I have to implement it in Java.
Since I have very little experience in Java, I don't even know where
to start.
Please help... my livelihood depends on it.

Regards

Andrew Thompson

Apr 7, 2004, 6:10:31 PM
On 7 Apr 2004 14:27:42 -0700, Noni wrote:

> Need help !!
....


> Since I have very little experience in Java, I don't even know where
> to start.

Both those statements scream..
<http://www.physci.org/codes/javafaq.jsp#cljh>

> Please help... my livelihood depends on it.

That sounds like a definite problem.

I estimate it might take 6 months,
depending upon your proficiency at
taking in information, and prior
knowledge of OO design, to successfully
attempt the project you describe.

--
Andrew Thompson
http://www.PhySci.org/ Open-source software suite
http://www.PhySci.org/codes/ Web & IT Help
http://www.1point1C.org/ Science & Technology

Noni

Apr 8, 2004, 9:47:55 AM
Andrew Thompson <SeeMy...@www.invalid> wrote in message news:<1aqerg40qmt3o.bf0pwffn97te$.d...@40tude.net>...

> On 7 Apr 2004 14:27:42 -0700, Noni wrote:
>
> > Need help !!
> ....
> > Since I have very little experience in Java, I don't even know where
> > to start.
>
> Both those statements scream..
> <http://www.physci.org/codes/javafaq.jsp#cljh>
>
> > Please help... my livelihood depends on it.
>
> That sounds like a definite problem.
>
> I estimate it might take 6 months,
> depending upon your proficiency at
> taking in information, and prior
> knowledge of OO design, to successfully
> attempt the project you describe.

I don't have 6 months. I have until the 20th of April!!!
Please please please please HELP !!!!

Christophe Vanfleteren

Apr 8, 2004, 10:10:20 AM
Noni wrote:

> I don't have 6 months. I have until the 20th of April!!!
> Please please please please HELP !!!!
Not to be pessimistic or anything, but that just isn't going to happen by
then.

--
Kind regards,
Christophe Vanfleteren

Chris Uppal

Apr 8, 2004, 11:17:21 AM
Christophe Vanfleteren wrote in reply to Noni:

> > I don't have 6 months. I have until the 20th of April!!!
> > Please please please please HELP !!!!
>
> Not to be pessimistic or anything, but that just isn't going to happen by
> then.

To be more accurate, someone who knew Java well could write a fully functional
prototype in a few days -- say, a week. To do that they would use Java's
networking classes like java.net.URL and java.net.HttpURLConnection to do
the downloading, and javax.swing.text.html.parser.DocumentParser to parse the
webpages and find the URLs embedded in them.

Of course, it would *only* be a prototype -- it would not handle all valid
HTML, for example (let alone the *masses* of invalid HTML you find). It would
not even start to handle links that were generated by / embedded in JavaScript
(been there, done that, got the scars). It would probably not scale very well
to downloading large numbers of pages at once. It would lack many features
that a production-quality crawler would want (e.g. management interface). Etc,
etc. But it *would* be a start.
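A minimal sketch of the parsing half of what Chris describes, using javax.swing's ParserDelegator (a convenience front-end that drives the javax.swing.text.html.parser.DocumentParser he names). The class name and the HTML string in main() are made up for illustration:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class LinkExtractor {
    /** Returns the href target of every <a> tag in the given HTML. */
    public static List<String> extractLinks(String html) throws Exception {
        final List<String> links = new ArrayList<String>();
        new ParserDelegator().parse(new StringReader(html),
            new HTMLEditorKit.ParserCallback() {
                public void handleStartTag(HTML.Tag t,
                        MutableAttributeSet a, int pos) {
                    if (t == HTML.Tag.A) {
                        Object href = a.getAttribute(HTML.Attribute.HREF);
                        if (href != null) links.add(href.toString());
                    }
                }
            }, true);
        return links;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><a href=\"http://java.sun.com/\">Java</a>"
                    + "<a href=\"docs/index.html\">Docs</a></body></html>";
        System.out.println(extractLinks(page));
    }
}
```

Note that the Swing parser is forgiving about missing tags, which is exactly why Chris suggests it over hand-rolled parsing for a prototype; it still won't cope with links buried in JavaScript.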

To Noni:

The problem with that is that you have to know Java first. I'm sorry, but I
have to agree with the others that you are unlikely to be able to learn Java
well enough to write a fairly sophisticated program, *and* use it to do so, in
12 days.

This is the point where you go discuss matters with your manager. They have
given you an impossible job, and that's not *your* fault. The important thing
is to make constructive suggestions instead of just saying "I can't do it in
the time". E.g. would your time be better spent trying to find an
off-the-shelf solution? Is there someone else who could do it while you
take over their job for a while? And so on...

Do you know any other programming languages? Scripting languages such as
Perl/Python/Ruby would be good. If you already know one of those then you
could almost certainly hack together a simple web crawler in it faster than I
could in Java. Maybe that would do for the time being?

(BTW, if this is coursework, rather than a real job, then much the same points
hold -- if you've reached the point where you know you can't complete your
assignment, then the sooner you ask for help from your teachers the better.
The earlier you ask for help the more likely they are to be able to *give* you
help, and the more likely you are to get credit for the work you *have* been
able to do.)

-- chris

Noni

Apr 8, 2004, 7:37:48 PM
"Chris Uppal" <chris...@metagnostic.REMOVE-THIS.org> wrote in message news:<noKdnd4nAdW...@nildram.net>...

Guys, the above is much appreciated.

Chris, you're right that this is my coursework. I have left it to the
last minute through no one's fault but my own.

How about I not build a full crawler, but something that will go to
maybe a couple of sites and grab what I need and display it???

I cannot stress how much I appreciate the help !!!!

Thank you
Regards
Noni

Andrew Thompson

Apr 8, 2004, 9:17:48 PM
On 8 Apr 2004 16:37:48 -0700, Noni wrote:

> You're right that this is my coursework. I have left it to the
> last minute through no one's fault but my own.
>
> How about I not build a full crawler, but something that will go to
> maybe a couple of sites and grab what I need and display it???

How about you vacate the position at
the educational institution whose time
(and your own) you are obviously wasting,
so that someone who deserves
the position might have the opportunity.

> I cannot stress how much I appreciate the help !!!!

...is that supposed to get us beyond the
fact that you are lying to people publicly,
cheating on your homework, and lazy besides???

Chris Uppal

Apr 9, 2004, 6:29:35 AM
Noni,

> How about I not build a full crawler but something that will go to
> maybe a couple of sites and grab what i need and display it ???

I think you'll find that you have to write *more* code to restrict the area
that the crawler will trawl.

I'd break it down into three bits. Think about them separately (the order
doesn't matter), and see how far you can get with each. I don't know your
course or your teachers, but I imagine that you'd get *some* credit for solving
any one of the bits, and if you manage all three then you're home and dry.

+ If you've been given an URL like "http://java.sun.com/" how do you download
it from the net?

+ If you have a String (or a file) containing the text of a webpage, how do
you parse it to find the URLs in it?

+ If you imagine that you've magically solved the first two problems, how do
you structure your code to make a whole crawler ? You'll need to keep a list
of URLs to download as you loop: {take an URL off the list; download the
webpage; find the URLs in that, add them to the list; repeat}
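The loop in that third bullet can be sketched in Java as follows. The Fetcher interface and the in-memory map standing in for real page downloads are assumptions made up for illustration, so the control flow can be followed without any networking code:

```java
import java.util.*;

public class CrawlerSkeleton {
    /** Stand-in for the download step; a real one would use HttpURLConnection. */
    interface Fetcher { String fetch(String url); }

    /** Breadth-first crawl from seed; returns every URL visited, in order. */
    public static List<String> crawl(String seed, Fetcher fetcher) {
        Deque<String> toVisit = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> visited = new ArrayList<>();
        toVisit.add(seed);
        seen.add(seed);
        while (!toVisit.isEmpty()) {
            String url = toVisit.removeFirst();     // take an URL off the list
            String page = fetcher.fetch(url);       // download the webpage
            if (page == null) continue;
            visited.add(url);
            for (String link : findUrls(page)) {    // find the URLs in it
                if (seen.add(link)) toVisit.add(link); // add new ones; repeat
            }
        }
        return visited;
    }

    /** Crude href scan; a real crawler would use an HTML parser instead. */
    static List<String> findUrls(String page) {
        List<String> urls = new ArrayList<>();
        java.util.regex.Matcher m =
            java.util.regex.Pattern.compile("href=\"(.*?)\"").matcher(page);
        while (m.find()) urls.add(m.group(1));
        return urls;
    }

    public static void main(String[] args) {
        // Tiny in-memory "web" standing in for real HTTP downloads.
        final Map<String, String> web = new HashMap<>();
        web.put("a", "<a href=\"b\">b</a><a href=\"c\">c</a>");
        web.put("b", "<a href=\"a\">back</a>");
        web.put("c", "");
        System.out.println(crawl("a", web::get));
    }
}
```

The "seen" set is what keeps the loop from revisiting pages forever; forgetting it is the classic first bug in a crawler.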

My earlier post had pointers to the Java library classes which will help you do
the first two things, or maybe you've already done something in your classes
which will help.

But I repeat: go talk to your teachers. They *want* to help you learn (or they
wouldn't be teaching). Also they know *what* help you need (I can only guess),
and can give it without cheating, whereas I think *I've* said all that I can
without cheating.

So. Don't panic (very important!). Break the problem down into small bits.
Do as much as you can of as many of the bits as you can manage. Ask for help
in the right place (school).

And stop wasting time on Usenet ;-)

-- chris


Noni

Apr 9, 2004, 9:06:00 AM
Andrew Thompson <SeeMy...@www.invalid> wrote in message news:<bdtssxlqbrh3$.12xn3bzge2wze$.d...@40tude.net>...

> On 8 Apr 2004 16:37:48 -0700, Noni wrote:
>
> > You're right that this is my coursework. I have left it to the
> > last minute through no one's fault but my own.
> >
> > How about I not build a full crawler, but something that will go to
> > maybe a couple of sites and grab what I need and display it???
>
> How about you vacate the position at
> the educational institution whose time
> (and your own) you are obviously wasting,
> so that someone who deserves
> the position might have the opportunity.
>
> > I cannot stress how much I appreciate the help !!!!
>
> ...is that supposed to get us beyond the
> fact that you are lying to people publicly,
> cheating on your homework, and lazy besides???


I guess people such as Andrew Thompson are here to kick people while
they're down and laugh at them, rather than lend a helping hand !!!!

Andrew, you don't know anything about me. It's rude to judge people like
you have above!!!

Everybody else!! Is it safe for me to turn away from here, or do I have
any chance of some assistance??

Regards
Noni

Andrew Thompson

Apr 9, 2004, 10:58:43 AM
On 9 Apr 2004 06:06:00 -0700, Noni wrote:

> Andrew Thompson <SeeMy...@www.invalid> wrote in message news:<bdtssxlqbrh3$.12xn3bzge2wze$.d...@40tude.net>...
>> On 8 Apr 2004 16:37:48 -0700, Noni wrote:

...


>>> I cannot stress how much I appreciate the help !!!!
>>
>> ...is that supposed to get us beyond the
>> fact that you are lying to people publicly,
>> cheating on your homework, and lazy besides???
>
>
> I guess people such as Andrew Thompson are here to kick people while
> they're down and laugh at them, rather than lend a helping hand !!!!

Your suppositions, like your efforts
thus far, are piss-poor.

You ignored the advice I gave you in..
<http://groups.google.com/groups?th=caf1c7b9b30e3d2b>
which pointed you to c.l.j.h., the group
I advised you to head for when I realised
you were yet another student who wanted
us to do your homework.

Posters on c.l.j.h. get some of the same people
answering questions as appear here or on c.l.j.gui,
but those people are altogether more patient
there than when they reply on the other groups.

> Andrew, you don't know anything about me. It's rude to judge people like
> you have above!!!

It's rude to lie to people.

> Everybody else!! Is it safe for me to turn away from here, or do I have
> any chance of some assistance??

Yes, indeed turn away from here!

Get yourself over to c.l.j.help.

But do that _only_ if you actually intend to put
some effort in, hack out some code, however bad,
and actually try to learn Java.

Otherwise you will achieve nothing (nobody there
intends to do your homework for you) and waste
everybody's time and bandwidth.

It is up to _you_.

Christophe Vanfleteren

Apr 9, 2004, 11:01:14 AM
Noni wrote:

> Everybody else!! Is it safe for me to turn away from here, or do I have
> any chance of some assistance??

You'll have assistance if you ask specific questions.
You've already been told which classes are useful for implementing what you
need. I suggest you start working with them and see how far you get.
If you have a specific question, feel free to ask, but don't expect
anyone to do your homework for you.

mroma...@rogers.com

Apr 18, 2004, 11:19:55 AM

import java.io.*;
import java.net.*;
import java.util.*;
import java.util.regex.*;

public class PullUrl3
{
    static final boolean DEBUG = false;
    // URLs already seen, so each link is reported and followed at most once.
    static Set<String> urls = new HashSet<String>();

    public static void main(String[] args)
    {
        String rootString = "http://etext.lib.virginia.edu/koran.html";
        List<String> baseListing = getLinks(rootString, rootString);
        if (!baseListing.isEmpty())
        {
            driller(rootString, baseListing);
        }
        System.out.println("Done");
    }

    // Recursively follow every link found on each page.
    public static void driller(String theBase, List<String> urlListing)
    {
        for (String singleURL : urlListing)
        {
            // Derive the new base ("http://host/") from the link itself;
            // skip links that aren't absolute http URLs.
            Matcher matcher =
                Pattern.compile("http://.*?/", Pattern.DOTALL).matcher(singleURL);
            if (!matcher.find())
            {
                continue;
            }
            String newBaseString = matcher.group();
            List<String> newBase = getLinks(newBaseString, singleURL);
            if (!newBase.isEmpty())
            {
                driller(newBaseString, newBase);
            }
        }
    }

    // Download one page and return the previously unseen links on it.
    public static List<String> getLinks(String baseString, String theUrl)
    {
        List<String> returnThis = new ArrayList<String>();
        StringBuffer strBuffer = new StringBuffer();
        try
        {
            URL u = new URL(baseString);
            HttpURLConnection huc = (HttpURLConnection) u.openConnection();
            huc.setRequestMethod("GET");
            huc.setDoInput(true);
            huc.setDoOutput(false);
            huc.setUseCaches(false);
            huc.connect();
            InputStream bis = new BufferedInputStream(huc.getInputStream());
            int cint;
            while ((cint = bis.read()) != -1)
            {
                strBuffer.append((char) cint);
            }
            huc.disconnect();
            // Crude link extraction: grab every href="..." attribute.
            Matcher matcher =
                Pattern.compile("href=\".*?\"", Pattern.DOTALL).matcher(strBuffer);
            while (matcher.find())
            {
                String fullUrl = fullURL(baseString, removeHref(matcher.group()));
                // Skip in-page anchors and anything we've already seen.
                if (fullUrl.indexOf('#') == -1 && urls.add(fullUrl))
                {
                    System.out.println(fullUrl);
                    returnThis.add(fullUrl);
                }
            }
        }
        catch (IOException e)
        {
            System.out.println("Error : " + e);
        }
        return returnThis;
    }

    // Resolve a (possibly relative) href against the base URL. Handles:
    // "#anchor", "/path", "/~user/path", absolute "http://...",
    // "~user/path", and plain relative paths.
    public static String fullURL(String baseString, String value)
    {
        // Make sure the base ends with a slash before appending to it.
        if (baseString.charAt(baseString.length() - 1) != '/')
        {
            baseString = baseString + "/";
        }
        String returnVal = "";
        value = value.trim();
        if (value.length() > 1)
        {
            switch (value.charAt(0))
            {
                case '#':
                    // In-page anchor: nothing to resolve.
                    if (DEBUG) System.out.print("#\n");
                    break;
                case '/':
                    if (value.charAt(1) == '~')
                    {
                        // Strip the "/~user/" virtual-directory prefix.
                        value = value.replaceFirst("/~.*?/", "");
                        returnVal = baseString + value;
                        if (DEBUG) System.out.print("/1\n");
                        break;
                    }
                    // Root-relative path: drop the leading slash.
                    returnVal = baseString + value.substring(1);
                    if (DEBUG) System.out.print("/2\n");
                    break;
                case 'h':
                    if (value.startsWith("http://"))
                    {
                        returnVal = value; // already absolute
                        if (DEBUG) System.out.print("http\n");
                        break;
                    }
                    // Not absolute after all: fall through, treat as relative.
                case '~':
                    value = value.replaceFirst("~.*?/", "");
                    returnVal = baseString + value;
                    if (DEBUG) System.out.print("~\n");
                    break;
                default:
                    returnVal = baseString + value;
                    if (DEBUG) System.out.print("def\n");
            }
        }
        return returnVal;
    }

    // Strip the leading 'href="' and the trailing quote.
    public static String removeHref(String value)
    {
        return value.substring(6, value.length() - 1);
    }
}
