
Parsing HTML


James Gralton

Oct 21, 2002, 11:52:43 AM
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<!-- saved from url=(0038)http://www.mydigiguide.com/dgx/wbl.dll -->
<HTML><HEAD><TITLE>DigiGuide: The Best TV Guide - myDigiGuide Online
Listings</TITLE>
<META content="text/html; charset=windows-1252" http-equiv=Content-Type>
<META content="Copyright © 2001 GipsyMedia Limited, All Rights Reserved."
name=Copyright>
<META content="DigiGuide - the only television guide to use!"
name=description>

Above is the start of an HTML file I am trying to parse using Java's
HTMLEditorKit. When I run the program, I get the following output:

At position:64 we have the comment: saved from
url=(0038)http://www.mydigiguide.com/dgx/wbl.dll
At position: 134 We have start Tag: html
At position: 140 We have start Tag: head
At position: 146 We have start Tag: title
At position: 153 We have text: DigiGuide: The Best TV Guide - myDigiGuide
Online Listings
At position: 211 We have end Tag: title
At position: 294 We have error msg: req.att contentmeta?
At position: 221 We have end Tag: head
At position: 221 We have end Tag: html
At position: 221 We have start Tag: html
Attribute: _implied_
At position: 221 We have start Tag: head
Attribute: _implied_
At position: 221 We have end Tag: head
At position: 221 We have start Tag: body
Attribute: _implied_
At position: 294 We have error msg: ioexception???
At position: 221 We have end Tag: body
At position: 221 We have end Tag: html
IO Exception

Can anyone tell me why I am getting this exception or how I can fix it?

Thank you

If you need any further info or the code, please ask.

Christian Kaufhold

Oct 22, 2002, 11:46:35 AM
Hello!

James Gralton <jim...@yahoo.com> wrote:

[...]


> At position: 134 We have start Tag: html
> At position: 140 We have start Tag: head
> At position: 146 We have start Tag: title
> At position: 153 We have text: DigiGuide: The Best TV Guide - myDigiGuide
> Online Listings
> At position: 211 We have end Tag: title
> At position: 294 We have error msg: req.att contentmeta?
> At position: 221 We have end Tag: head
> At position: 221 We have end Tag: html
> At position: 221 We have start Tag: html

Now this looks as if the HTML is severely broken.

Please post the complete HTML document (or a shortened version that still
shows the error).


> Attribute: _implied_
> At position: 221 We have start Tag: head
> Attribute: _implied_
> At position: 221 We have end Tag: head
> At position: 221 We have start Tag: body
> Attribute: _implied_
> At position: 294 We have error msg: ioexception???

What is position "294"? Where does the Reader come from?
What is the IOException's stack trace?

The HTML parser also gets quite confused if any of the handleXXX methods
throw (unchecked) exceptions.


Christian

James Gralton

Oct 24, 2002, 8:14:49 AM
 
James Gralton <jim...@yahoo.com> wrote in message news:...
Here is the error with the stack trace:

C:\jbuilder5\jdk1.3\bin\javaw -classpath "C:\Documents and Settings\James Gralton\My Documents\Work\Project\htmlEditor\htmlEditor\classes;C:\jbuilder5\jdk1.3\demo\jfc\Java2D\Java2Demo.jar;C:\jbuilder5\jdk1.3\jre\lib\i18n.jar;C:\jbuilder5\jdk1.3\jre\lib\jaws.jar;C:\jbuilder5\jdk1.3\jre\lib\rt.jar;C:\jbuilder5\jdk1.3\jre\lib\sunrsasign.jar;C:\jbuilder5\jdk1.3\lib\dt.jar;C:\jbuilder5\jdk1.3\lib\tools.jar" htmleditor.Editor

At position:64 we have the comment:  saved from
url=(0038)http://www.mydigiguide.com/dgx/wbl.dll
At position: 134 We have start Tag: html
At position: 140 We have start Tag: head
At position: 146 We have start Tag: title
At position: 153 We have text: DigiGuide: The Best TV Guide - myDigiGuide
Online Listings
At position: 211 We have end Tag: title
At position: 294 We have error msg: req.att contentmeta?
At position: 221 We have end Tag: head
At position: 221 We have end Tag: html
At position: 221 We have start Tag: html
At position: 221 We have start Tag: head
At position: 221 We have end Tag: head
At position: 221 We have start Tag: body
At position: 294 We have error msg: ioexception???
At position: 221 We have end Tag: body

At position: 221 We have end Tag: html
IO Exception
javax.swing.text.ChangedCharSetException
    at javax.swing.text.html.parser.DocumentParser.handleEmptyTag(DocumentParser.java:172)
    at javax.swing.text.html.parser.Parser.startTag(Parser.java:327)
    at javax.swing.text.html.parser.Parser.parseTag(Parser.java:1786)
    at javax.swing.text.html.parser.Parser.parseContent(Parser.java:1821)
    at javax.swing.text.html.parser.Parser.parse(Parser.java:1980)
    at javax.swing.text.html.parser.DocumentParser.parse(DocumentParser.java:109)
    at javax.swing.text.html.parser.ParserDelegator.parse(ParserDelegator.java:74)
    at htmleditor.Editor.main(Editor.java:32)

The full HTML file is long, so I have attached it here. The error is
occurring on about the third line. Also, I am not sure what you mean by
"where does the Reader come from", so I have posted the Java files.

Position 294 is the 294th character in the HTML file; it is the first
instance of the word META.

Thank you for your help; it is much appreciated.


Christian Kaufhold

Oct 24, 2002, 11:21:03 AM
Hello!

James Gralton <jim...@yahoo.com> wrote:

> At position: 134 We have start Tag: html
> At position: 140 We have start Tag: head
> At position: 146 We have start Tag: title
> At position: 153 We have text: DigiGuide: The Best TV Guide - myDigiGuide
> Online Listings
> At position: 211 We have end Tag: title
> At position: 294 We have error msg: req.att contentmeta?
> At position: 221 We have end Tag: head
> At position: 221 We have end Tag: html
> At position: 221 We have start Tag: html

There still must be something incorrect so that "html" is closed and
reopened again.

> IO Exception
> javax.swing.text.ChangedCharSetException

Pass "true" as the third argument (ignoreCharSet) to ParserDelegator.parse().


> The full html file is long so I have attached it here. But the error is

I don't find it ...?


> And position 294 is the 294th character in the HTML file it is the first
> instance of the word META.

<meta http-equiv="Content-Type" content="text/html; charset=XXX">

causes the parser to throw ChangedCharSetException.
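
As a minimal sketch of the suggestion above (not the original poster's code): the class name, the SAMPLE page, and the parseTags helper are all made up for illustration. It feeds a page with a charset META tag to ParserDelegator.parse() and only varies the third argument, ignoreCharSet.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import javax.swing.text.ChangedCharSetException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class IgnoreCharSetDemo {
    // Hypothetical sample page with the kind of META tag discussed above.
    static final String SAMPLE =
        "<html><head>"
        + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=windows-1252\">"
        + "<title>Demo</title></head><body>hello</body></html>";

    /**
     * Parses the given HTML and returns the names of the tags seen,
     * or an empty Optional if the parser aborted with ChangedCharSetException.
     */
    static Optional<List<String>> parseTags(String html, boolean ignoreCharSet) {
        List<String> tags = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                tags.add(tag.toString());
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Empty elements such as META are reported here.
                tags.add(tag.toString());
            }
        };
        try {
            // Third argument: true tells the parser to ignore the charset directive.
            new ParserDelegator().parse(new StringReader(html), callback, ignoreCharSet);
            return Optional.of(tags);
        } catch (ChangedCharSetException e) {
            // Thrown when ignoreCharSet is false and a charset META tag is found.
            return Optional.empty();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("ignoreCharSet=false: " + parseTags(SAMPLE, false));
        System.out.println("ignoreCharSet=true:  " + parseTags(SAMPLE, true));
    }
}
```

When reading from a file rather than a String, the Reader should still be created with the encoding the file actually uses, since ignoring the directive means the parser will not switch encodings for you.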


Christian

Craig Raw

Oct 24, 2002, 11:28:43 AM
Try this:


//Get the kit
HTMLEditorKit kit = getKit();

// Create default doc
HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
doc.putProperty( "IgnoreCharsetDirective", Boolean.TRUE );

Reader r = new InputStreamReader( new FileInputStream( .....
kit.read( r, doc, 0 );


The important line here is the one that sets the IgnoreCharsetDirective
property on the document.
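
For reference, a self-contained version of the same idea might look like the sketch below. It is only an illustration: the class name, the SAMPLE string, and the loadTitle helper are invented here, and a StringReader stands in for the file-based Reader in the snippet above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public class IgnoreCharsetDirectiveDemo {
    // Hypothetical sample page containing a charset META tag.
    static final String SAMPLE =
        "<html><head>"
        + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=windows-1252\">"
        + "<title>Demo</title></head><body>hello</body></html>";

    /** Loads the HTML into an HTMLDocument and returns its title. */
    static String loadTitle(String html) {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Without this property, kit.read() lets ChangedCharSetException propagate.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try {
            kit.read(new StringReader(html), doc, 0);
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException("Could not load HTML", e);
        }
        return (String) doc.getProperty(Document.TitleProperty);
    }

    public static void main(String[] args) {
        System.out.println("Title: " + loadTitle(SAMPLE));
    }
}
```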

hth,
Craig


"James Gralton" <jim...@yahoo.com> wrote in message

news:QwRt9.648$Ao.58838@newsfep2-gui...

James Gralton

Oct 24, 2002, 11:28:09 AM
Thank you for your help; it has now sorted the problem.

Much appreciated

James

Christian Kaufhold <use...@chka.de> wrote in message
news:3t3db80eb...@simia.chka.de...

Chris Smith

Oct 24, 2002, 1:03:15 PM
I'd love to help, but your post is so long that my newsreader refuses to
open it. In general, it's best to post the simplest possible code that
will demonstrate the problem you are having. Most people won't run
arbitrary code from USENET that they don't understand, and your code is
far too long to read through and understand. If you must post a longer
piece of code, put it on a web site or something.

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
