infinite loop with http requests

yawnmoth

Nov 20, 2006, 12:37:12 PM

I'm trying to write something that'll let me output the contents of a
given webpage while skipping over the headers. Since I'm trying to
learn raw HTTP, I'm using Sockets and not URL.

Anyway, the header of an HTTP response ends when you have "\r\n\r\n".
BufferedReader's readLine treats that as two lines since it considers
"\r\n" to be a line terminating character. Since it also strips off
the line terminating characters, readLine should return the second line
as "".

Per that, I've written a program that will loop, continuously, until ""
is encountered. Unfortunately, "" never appears to be encountered and
thus I have an infinite loop.

Here's my code:

import java.net.*;
import java.io.*;

public class HttpRequestor
{
    public static void main(String[] args) {
        try {
            Socket sock = new Socket("www.google.com", 80);
            String httpRequest = "GET / HTTP/1.0\r\nHost: www.google.com\r\n\r\n";
            sock.getOutputStream().write(httpRequest.getBytes());
            BufferedReader text = new BufferedReader(new InputStreamReader(sock.getInputStream()));

            String line, output = "";
            while (text.readLine() != "");
            while ((line = text.readLine()) != null) {
                System.out.println("\r\n'"+URLEncoder.encode(line)+"'\r\n");
            }
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}

To confirm that I was indeed getting "" back from readLine, I wrote the
following:

import java.net.*;
import java.io.*;

public class HttpRequestor
{
    public static void main(String[] args) {
        try {
            Socket sock = new Socket("www.google.com", 80);
            String httpRequest = "GET / HTTP/1.0\r\nHost: www.google.com\r\n\r\n";
            sock.getOutputStream().write(httpRequest.getBytes());
            BufferedReader text = new BufferedReader(new InputStreamReader(sock.getInputStream()));

            String line, output = "";
            while ((line = text.readLine()) != null) {
                System.out.println("\r\n'"+URLEncoder.encode(line)+"'\r\n");
            }
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}

This shows that "" is indeed being returned by readLine. So why
doesn't the while loop in the first program terminate when "" is
received?

Any insights would be appreciated - thanks!

Robert Klemme

Nov 20, 2006, 12:41:39 PM

Because you compare strings with == (identity) instead of with equals()
(equivalence).

robert
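
A minimal sketch of that difference, added for illustration and not taken from the thread. The class name StringCompareDemo and the canned response text are assumptions; a StringReader stands in for the socket so it runs offline. The == comparison normally prints false here because readLine builds a fresh String object, but that is an implementation detail, so no output is promised for that line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of the original program.
public class StringCompareDemo {
    // Returns the second line of a tiny canned response:
    // the blank line that ends the headers.
    static String readBlankLine() {
        try {
            BufferedReader reader =
                new BufferedReader(new StringReader("Host: example\r\n\r\nbody"));
            reader.readLine();        // "Host: example"
            return reader.readLine(); // "" (line terminator stripped)
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String line = readBlankLine();
        // == asks "same object?"; the String built by readLine is
        // normally a different object from the literal "".
        System.out.println("line == \"\"      : " + (line == ""));
        // equals() and isEmpty() ask "same characters?", which is
        // what the header-skipping loop actually needs.
        System.out.println("line.equals(\"\") : " + line.equals(""));
        System.out.println("line.isEmpty()  : " + line.isEmpty());
    }
}
```

The last two lines print true, which is the check the first program needed.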

Oliver Wong

Nov 20, 2006, 12:55:05 PM

"yawnmoth" <terr...@yahoo.com> wrote in message
news:1164044232.1...@k70g2000cwa.googlegroups.com...

> I'm trying to write something that'll let me output the contents of a
> given webpage while skipping over the headers. Since I'm trying to
> learn raw HTTP, I'm using Sockets and not URL.

[snip most of the code]

> Socket sock = new Socket("www.google.com", 80);

I recommend against using google as your test server. Google does some
funky stuff when it detects that Java is connecting to it, which may give
you unexpected results.

- Oliver

Daniel Pitts

Nov 20, 2006, 1:22:38 PM

Good suggestion, except for two things. First, he isn't using Java's URL
API, which is what's responsible for setting the User-Agent string.
Second, you can override the User-Agent string, and Google couldn't
possibly know the difference.

In any case, the OP's problem is that he is comparing with line == ""
when he should use line.equals("") or, better yet, line.length() == 0.

HTH,
Daniel.
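
For illustration, here is one way the header-skipping loop could look with the comparison fixed. This is a sketch under assumptions, not the OP's final code: the class name HeaderSkipper and the method readBody are hypothetical, and a StringReader replaces the socket so it can run without a network. It also stops on null, since a stream that ends before the blank line would otherwise keep the loop spinning.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper class; not part of the original program.
public class HeaderSkipper {
    // Consumes the status line and headers, then returns the
    // remaining body lines, each terminated by '\n'.
    static String readBody(Reader in) {
        try {
            BufferedReader text = new BufferedReader(in);
            String line;
            // Stop at the blank line that ends the headers,
            // or at end of stream (readLine returns null).
            while ((line = text.readLine()) != null && !line.isEmpty()) {
                // header line, ignored
            }
            StringBuilder body = new StringBuilder();
            while ((line = text.readLine()) != null) {
                body.append(line).append('\n');
            }
            return body.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Canned response standing in for what a server might send.
        String response =
            "HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<html>\r\n</html>\r\n";
        System.out.print(readBody(new StringReader(response)));
    }
}
```

With a real connection, the same method could be handed new InputStreamReader(sock.getInputStream()) instead of the StringReader.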

yawnmoth

Nov 20, 2006, 1:27:47 PM

Robert Klemme wrote:
> <snip>

> Because you compare strings with == (identity) instead of with equals()
> (equivalence).

That was it - thanks! :)

Chris Uppal

Nov 20, 2006, 2:29:37 PM

Daniel Pitts wrote:

> Oliver Wong wrote:
> > I recommend against using google as your test server. Google does
> > some funky stuff when it detects that Java is connecting to it, which
> > may give you unexpected results.

[...]

> Good suggestion, except for two things. First, he isn't using Java's URL
> API, which is what's responsible for setting the User-Agent string.
> Second, you can override the User-Agent string, and Google couldn't
> possibly know the difference.

I agree with Oliver's advice. Google is perfectly at liberty to treat requests
differently depending on how they /appear/ to have been submitted.

If I were them I would group requests into at least three categories: ones that
appear to be legit (as far as we can tell from the various meta-info in a
request); those that appear to come from frequently abused clients (such as the
Java stuff); and those where we can't tell much. I would be less aggressive
about -- say -- shutting off an over-eager client IP address if the requests
appeared to be from a normal browser than if they appeared to come from
uncontrolled code. And I'd put the "can't tell" ones somewhere in the middle.

But the bottom line is not that Google /can/ treat requests differently
depending on apparently immaterial meta stuff, but that it /does/ do so --
which makes it a very poor example domain for a beginner (to HTTP) to test
against.

-- chris

Daniel Pitts

Nov 20, 2006, 3:29:29 PM

Okay, while my point was that you can "trick" Google into thinking that
the request probably comes from a legit client, your point is well taken.

I suppose a good way to learn HTTP is to set up a web server in your own
development environment (such as Apache or Resin) and use it instead of
a third-party website. That way you also have control over the content
being produced.

- Daniel.
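
As a rough sketch of that idea, the JDK's built-in com.sun.net.httpserver.HttpServer can stand in for Apache or Resin when all you need is something local to send raw requests to. This is a swapped-in convenience, not what the thread used. The class name LocalHttpTest, the method names, and the response text are assumptions, and the port is whatever free loopback port the OS hands out.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical test harness; not part of the original program.
public class LocalHttpTest {
    // Starts a throwaway server on a free loopback port that
    // answers every request with the given body.
    static HttpServer startServer(String body) throws IOException {
        HttpServer server = HttpServer.create(
            new InetSocketAddress(InetAddress.getLoopbackAddress(), 0), 0);
        byte[] payload = body.getBytes(StandardCharsets.UTF_8);
        server.createContext("/", exchange -> {
            exchange.sendResponseHeaders(200, payload.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(payload);
            }
        });
        server.start();
        return server;
    }

    // Sends a raw HTTP/1.0 GET over a socket and returns the body,
    // skipping the status line and headers.
    static String fetchBody(int port) throws IOException {
        try (Socket sock = new Socket(InetAddress.getLoopbackAddress(), port)) {
            sock.setSoTimeout(5000); // fail instead of hanging forever
            String request =
                "GET / HTTP/1.0\r\nHost: localhost\r\nConnection: close\r\n\r\n";
            OutputStream out = sock.getOutputStream();
            out.write(request.getBytes(StandardCharsets.US_ASCII));
            out.flush();
            BufferedReader text = new BufferedReader(
                new InputStreamReader(sock.getInputStream(), StandardCharsets.UTF_8));
            String line;
            while ((line = text.readLine()) != null && !line.isEmpty()) {
                // header line, ignored
            }
            StringBuilder body = new StringBuilder();
            while ((line = text.readLine()) != null) {
                body.append(line).append('\n');
            }
            return body.toString();
        }
    }

    // Starts the server, fetches "/" once, and shuts the server down.
    static String roundTrip(String body) {
        try {
            HttpServer server = startServer(body);
            try {
                return fetchBody(server.getAddress().getPort());
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(roundTrip("hello from localhost\n"));
    }
}
```

Because the content comes from your own handler, you know exactly what bytes to expect back, which is harder to say of a third-party site.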