As my first task in my first job I have to check the well-formedness
of about 1000 HTML files ...
I assume there must already be some Java classes/packages/libs on the
net that do this? I can't be the first one who has to do
this ...
So, does anybody know of any libs available online that do this?
Thanks!
It will be a console program, so I need classes that accept an HTML
file and check it, I guess.
See hiwa's reply, and also consider JTidy.
- Oliver
The original HTML Tidy is a C command-line utility, but there are Java
and Perl versions (JTidy is one of them), all referenced from the
project. It's worth a visit: there are other useful things too, such as
HTML editors which integrate HTML Tidy.
--
martin@ | Martin Gregorie
gregorie. | Essex, UK
org |
For example, this is a sample of the text in the HTML files:
<table border=1 width="100%" >
<tr>
<td width=20%><noindex>Betreft :</noindex></td>
<td colspan=3>
<betreft><P><A NAME="b_betreft"></A>Kinderrechten: implementatie van
het VN-verdrag<BR>Jaarlijkse verslaggeving van de Vlaamse regering aan
het Vlaams Parlement en aan de kinderrechtencommissaris omtrent de
implementatie van het VN-verdrag van 20 november 1989 inzake de rechten
van het kind<BR>Tweede verslag d.d. 29 september 2000 <A
NAME="e_betreft"></A></betreft>
</td></tr>
Per HTML file I need to extract the contents of these special tags,
<betreft> (and others), and create XML files out of them. Is it
possible to read an HTML file as an XML file and do some XPath stuff on
it?
Or just extract the tags from a simple text file ...
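For the "just extract tags from a text file" route, here is a minimal sketch using only java.util.regex. The class name TagExtractor and the method extract are made up for illustration, and it assumes the special tags are never nested inside themselves. It returns whatever sits between the opening and closing tag, inner markup included, so the <P>, <A> and <BR> pieces would still have to be cleaned up before the result can go into an XML file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class TagExtractor {

    // Returns the raw text between <tagName ...> and </tagName>, one entry per match.
    // Assumption: the tag is not nested inside itself.
    static List<String> extract(String html, String tagName) {
        String name = Pattern.quote(tagName);
        Pattern pattern = Pattern.compile(
                "<" + name + "(?:\\s[^>]*)?>(.*?)</" + name + "\\s*>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL); // DOTALL: content may span lines
        List<String> contents = new ArrayList<>();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            contents.add(matcher.group(1).trim());
        }
        return contents;
    }

    public static void main(String[] args) {
        String sample = "<td colspan=3>\n"
                + "<betreft><P><A NAME=\"b_betreft\"></A>Kinderrechten: implementatie van\n"
                + "het VN-verdrag<BR>Tweede verslag <A NAME=\"e_betreft\"></A></betreft>\n"
                + "</td>";
        for (String content : extract(sample, "betreft")) {
            System.out.println(content);
        }
    }
}
```

A regex like this is fine for pulling out a handful of known tags, but it does not tell you anything about whether the file as a whole is well-formed.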
" JTidy provides a DOM interface to the document that is being
processed, which effectively makes you able to use JTidy as a DOM
parser for real-world HTML."
but nowhere can I find a good reference for JTidy ...
I still don't know how I'm going to do it; maybe I'll write it all myself ...
Greetings
> As my first task in my first job I have to check the well-formedness
> of about 1000 HTML files ...
Why use Java? The usual tool for this is HTML Tidy, which you can
drive perfectly adequately from the command line with a couple of lines
of shell script.
Have a look at the JavaCC help files and documentation.
This URL will help you...
https://javacc.dev.java.net/servlets/ProjectDocumentList?folderID=110
Regards,
Sachin
The HTMLEditorKit contains a parser I used as the basis for a URL
checker. This extracts <A> tags from HTML pages, sets up a URL instance
from the href attribute and sees if it is accessible. Access failures
are reported for manual examination and fixes.
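A rough sketch of the tag-extraction half of that idea, using the parser that ships with Swing (ParserDelegator plus an HTMLEditorKit.ParserCallback). This is not the poster's actual code: the class name LinkExtractor and the method hrefs are made up for illustration, and the step that opens each URL to see whether it is reachable is left out.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of any library.
final class LinkExtractor {

    // Collects the href attribute of every <A> start tag, in document order.
    static List<String> hrefs(Reader html) {
        final List<String> found = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) { // <A NAME="..."> anchors have no href
                        found.add(href.toString());
                    }
                }
            }
        };
        try {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(html, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        String sample = "<html><body><A HREF=\"http://example.com/\">home</A>"
                + "<A NAME=\"b_betreft\"></A></body></html>";
        for (String href : hrefs(new StringReader(sample))) {
            System.out.println(href);
        }
    }
}
```

Note that this parser is forgiving by design: it reads real-world HTML without complaining about unclosed tags, so it suits extraction better than a strict well-formedness check.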
This is possible if and only if the HTML file actually is an XML file
(the HTML file format and the XML file format overlap, but are not identical
to each other). Otherwise, first you'll need something like "XMLTidy" (a
fictional product I just made up) to fix the broken XML -- things like
making sure every open tag is balanced by a closing tag, etc. In your
example, for instance, the <table>, <P> and <BR> tags are never closed.
- Oliver
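To see which of the files already qualify as XML, one option is to feed each one to the JDK's own XML parser and report the first error. The following is a minimal sketch under that assumption; the class name WellFormedCheck and the method firstError are made up for illustration. The entity resolver returns an empty DTD so the parser does not try to download anything named in a DOCTYPE, which also means HTML entities such as &nbsp; will be reported as errors.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of any library.
final class WellFormedCheck {

    // Returns null if the input parses as well-formed XML,
    // otherwise a description of the first problem found.
    static String firstError(InputStream in) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Substitute an empty DTD instead of fetching external ones.
            builder.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader("")));
            // Rethrow errors instead of letting the parser print them to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(in);
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return e.toString();
        }
    }

    // Usage: java WellFormedCheck file1.html file2.html ...
    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            try (InputStream in = Files.newInputStream(Paths.get(arg))) {
                String error = firstError(in);
                System.out.println(arg + ": " + (error == null ? "well-formed" : error));
            }
        }
    }
}
```

Files that pass can be loaded as a DOM and queried with XPath directly; the ones that fail would first need a repair pass with something like HTML Tidy or JTidy.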