Getting text from a URL


mic...@gmail.com

Oct 21, 2006, 7:11:43 PM
I am trying to read the text of a website using a URL object and a data
stream.
It works well on CNN.com, for example, but doesn't work well on:
http://www.collegehumor.com:80/video:1674301

How should I interpret the stream I'm getting?


I'm using the following code:

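URL u;
InputStream is = null;
DataInputStream dis;
String s;

try {
    u = new URL("http://www.collegehumor.com:80/video:1674301");
    is = u.openStream(); // throws an IOException
    dis = new DataInputStream(new BufferedInputStream(is));
    // Note: DataInputStream.readLine() is deprecated; it does not
    // convert bytes to characters properly.
    while ((s = dis.readLine()) != null) {
        System.out.println(s);
    }
} catch (MalformedURLException mue) {
    mue.printStackTrace(); // report a bad URL instead of ignoring it
} catch (IOException ioe) {
    ioe.printStackTrace(); // report read failures instead of ignoring them
} finally {
    try {
        if (is != null) { // openStream() may have failed before assignment
            is.close();
        }
    } catch (IOException ioe) {
        // nothing useful to do if close() fails
    }
} // end of 'finally' clause

} // end of main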

Régis Décamps

Oct 21, 2006, 7:40:29 PM

On Oct 22, 1:11 am, mic...@gmail.com wrote:
> I am trying to read the text of a website using a URL object and a data
> stream
> It works well on CNN.com for example, but doesn't work well on: http://www.collegehumor.com:80/video:1674301
>

What makes you think it does not work?

> How should I interpret the stream I'm getting?

As HTML?

I don't get exactly what you want to do, but have you considered
Jakarta HttpClient?
--
Régis

mic...@gmail.com

Oct 21, 2006, 8:05:49 PM

Régis Décamps wrote:
> On Oct 22, 1:11 am, mic...@gmail.com wrote:
> > I am trying to read the text of a website using a URL object and a data
> > stream
> > It works well on CNN.com for example, but doesn't work well on: http://www.collegehumor.com:80/video:1674301
> >
>
> What makes you think it does not work?
The fact that instead of normal HTML text I'm getting gibberish like this:
<?s?6²¿w¦? ??E?9 ¿$J´-e ?I|/N|¶?^s???$$1¦ ??l«???· ? IQ²?v??¼d ? X` ?? ?~8?tr??? ?e??\~~????hm]??>????S??÷7 ??1?MB?4?B ?H×?>jD?e??@×???;÷v?'S??J @X&vV??¬?³ d?6??»#| ¿x?h
¯?,£ ?¶?o??n¨??8cq?¾Y-?F|y7?2??? ???3??, ?)o =·m
? RL?l¨?e6?I©7

>
> > How should I interpret the stream I'm getting?
>
> As HTML?
>
> I don't get exactly what you want to do, but have you considered
> Jakarta HttpClient?
Thanks for the tip - will give it a shot
> --
> Régis

Arne Vajhøj

Oct 21, 2006, 9:10:31 PM
mic...@gmail.com wrote:
> The fact that instead of normal HTML text I'm getting gibberish like this:
> <?s?6²¿w¦? ??E?9 ¿$J´-e ?I|/N|¶?^s???$$1¦ ??l«???· ? IQ²?v??¼d ? X` ?? ?~8?tr??? ?e??\~~????hm]??>????S??÷7 ??1?MB?4?B ?H×?>jD?e??@×???;÷v?'S??J @X&vV??¬?³ d?6??»#| ¿x?h
> ¯?,£ ?¶?o??n¨??8cq?¾Y-?F|y7?2??? ???3??, ?)o =·m
> ? RL?l¨?e6?I©7

It looks as if that URL is returning its content gzipped.

Try wrapping the InputStream in a GZIPInputStream.
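
A minimal sketch of that suggestion, not a definitive implementation. The class name GzipAwareFetch and the helpers maybeGunzip and readText are made up for illustration. It assumes the page text is UTF-8 (the thread does not say which charset the site uses) and it only decompresses when the stream really starts with the gzip magic bytes 0x1f 0x8b, so uncompressed pages such as the CNN.com case still pass through unchanged.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

class GzipAwareFetch {

    // Peeks at the first two bytes; wraps the stream in a GZIPInputStream
    // only if they are the gzip magic number (0x1f 0x8b).
    static InputStream maybeGunzip(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 2);
        int b1 = in.read();
        int b2 = in.read();
        // Push the bytes back in reverse order so they are read again as b1, b2.
        if (b2 != -1) in.unread(b2);
        if (b1 != -1) in.unread(b1);
        boolean gzipped = (b1 == 0x1f && b2 == 0x8b);
        return gzipped ? new GZIPInputStream(in) : in;
    }

    // Reads the whole stream as text, one line at a time.
    // Assumption: the content is UTF-8 encoded.
    static String readText(InputStream raw) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(maybeGunzip(raw), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java GzipAwareFetch <url>");
            return;
        }
        try (InputStream is = new URL(args[0]).openStream()) {
            System.out.print(readText(is));
        }
    }
}
```

Using a BufferedReader here also avoids the deprecated DataInputStream.readLine() from the original post.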

Arne

Tor Iver Wilhelmsen

Oct 22, 2006, 5:23:43 AM
mic...@gmail.com writes:

> http://www.collegehumor.com:80/video:1674301
>
> How should I interpret the stream I'm getting?

I guess it's a video stream, so you should read it as binary and pass
it to a media library if you want to show it.

> while ((s = dis.readLine()) != null) {

Last I checked, video formats were not line-oriented.

Andrew Thompson

Oct 22, 2006, 5:41:54 AM
mic...@gmail.com wrote:
> I am trying to read the text of a website using a URL object and a data
> stream
> It works well on CNN.com for example, but doesn't work well on:
> http://www.collegehumor.com:80/video:1674301

This source loads and displays (crudely) the web page
at that address.

<sscce>
import javax.swing.*;
import java.net.URL;

public class ShowURL {
    public static void main(String[] args) {
        String address = null;
        if (args.length==0) {
            address = JOptionPane.showInputDialog(null, "URL?");
        } else {
            address = args[0];
        }
        JEditorPane jep = null;
        try {
            URL url = new URL(address);
            jep = new JEditorPane(url);
        } catch(Exception e) {
            jep = new JEditorPane();
            jep.setText( e.toString() );
        }
        JScrollPane jsp = new JScrollPane(jep);
        jsp.setPreferredSize(new java.awt.Dimension(400,300));
        JOptionPane.showMessageDialog(null, jsp);
    }
}
</sscce>

..so the data is readable, and it is a web-page.

Andrew T.

William Brogden

Oct 22, 2006, 11:03:52 AM
On Sat, 21 Oct 2006 19:05:49 -0500, <mic...@gmail.com> wrote:

>
> Régis Décamps wrote:
>> On Oct 22, 1:11 am, mic...@gmail.com wrote:
>> > I am trying to read the text of a website using a URL object and a data
>> > stream
>> > It works well on CNN.com for example, but doesn't work well on: http://www.collegehumor.com:80/video:1674301
>> >
>>
>> What makes you think it does not work?
> The fact that instead of normal HTML text I'm getting gibberish like this:
> <?s?6²¿w¦? ??E?9 ¿$J´-e ?I|/N|¶?^s???$$1¦ ??l«???·? IQ²?v??¼d ? X` ?? ?~8?tr??? ?e??\~~????hm]??>????S??÷7 ??1?MB?4?B ?H×?>jD?e??@×???;÷v?'S??J @X&vV??¬?³ d?6??»#| ¿x?h
> ¯?,£ ?¶?o??n¨??8cq?¾Y-?F|y7?2??? ???3??, ?)o =·m
> ? RL?l¨?e6?I©7
>>

As another poster already said, this is gzip-encoded.

When I do this sort of thing I just grab the data stream to a byte[] -
then take a look at the headers to see what the encoding is when I have
the whole message.

I found that it is necessary to search for the GZIP signature bytes
to locate the start of the gzip stream after the headers.
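
The signature search described above might look roughly like this. It is a sketch under stated assumptions, not the poster's actual code: the class name GzipSignatureScan and both helper methods are hypothetical, and it assumes the whole raw response (headers plus body) is already in a byte[], that the body is a single gzip member using deflate (signature bytes 0x1f 0x8b 0x08), and that the response is not sent with chunked transfer encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

class GzipSignatureScan {

    // Returns the index of the first gzip signature (0x1f 0x8b 0x08)
    // in the buffer, or -1 if none is found.
    static int findGzipStart(byte[] data) {
        for (int i = 0; i + 2 < data.length; i++) {
            if ((data[i] & 0xff) == 0x1f
                    && (data[i + 1] & 0xff) == 0x8b
                    && data[i + 2] == 0x08) {
                return i;
            }
        }
        return -1;
    }

    // Decompresses the gzip stream that starts at the given offset.
    static byte[] gunzipFrom(byte[] data, int offset) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(data, offset, data.length - offset))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```

Locating the blank line that ends the headers and checking the Content-Encoding header is probably more robust than scanning for the signature, since the same three bytes could in principle occur elsewhere in the data.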

Bill
