Issues when defining a personalized schema class.

sebastien...@gmail.com

May 2, 2014, 11:05:06 AM
to tagsoup...@googlegroups.com

Hello all,

I would like to adapt TagSoup to repair some input files which are not HTML files.
Here is an example of how the files are structured:
<fulltext-document>
<bibliographic-data>
<publication-reference>
<document-id>
<country>IT</country>
<doc-number>0000000</doc-number>    <kind>some data</kind>
<date>20000920</date>
</document-id>
</publication-reference>
<application-reference doc-id="0000000001">
<document-id>
<country>IT</country>
<doc-number>00000000</doc-number>    <kind>some data</kind>
<date>20000215</date>
</document-id>
</application-reference>
</bibliographic-data>
<description id="desc" lang="it" section="000000071" source="002">
<p num="0001" id="p0001">Some text .</p>
<p num="0008" id="p0008">Some text<!-- comment <DP n="4"> --></p>
</description>
</fulltext-document>

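I understood that I had to write a class that extends Schema and implements HTMLModels.
So I created this class:
package tagsoupExt;

import org.ccil.cowan.tagsoup.HTMLModels;
import org.ccil.cowan.tagsoup.Schema;

public class MiniSchema extends Schema implements HTMLModels {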
    public MiniSchema() {
        // Start of Schema calls
        setURI("");
        setPrefix("");

        /** elementType is from Schema: Add or replace an element type for this schema.
         @param name Name (Qname) of the element
         @param model Models of the element's content as a vector of bits
         @param memberOf Models the element is a member of as a vector of bits
         @param flags Flags for the element **/
        elementType("<pcdata>", M_EMPTY, M_PCDATA, 0);
        elementType("<root>", M_ROOT, M_EMPTY, 0);
        elementType("fulltext-document",  M_HTML, M_ROOT, 0);
        elementType("bibliographic-data",  M_PCDATA | M_INLINE | M_BLOCK, M_HTML | M_BODY,0);
        elementType("publication-reference", M_PCDATA | M_INLINE | M_BLOCK, M_INLINE | M_NOLINK,0);
        elementType("document-id",  M_PCDATA | M_INLINE | M_BLOCK, M_INLINE | M_NOLINK,0);
        elementType("country",  M_PCDATA | M_INLINE | M_BLOCK, M_INLINE | M_NOLINK,0);
        elementType("doc-number",  M_PCDATA | M_INLINE | M_BLOCK, M_INLINE | M_NOLINK,0);
        elementType("kind",  M_PCDATA | M_INLINE | M_BLOCK, M_INLINE | M_NOLINK,0);
        elementType("date",  M_PCDATA | M_INLINE | M_BLOCK, M_INLINE | M_NOLINK,0);

        elementType("application-reference", M_PCDATA | M_INLINE | M_BLOCK, M_INLINE | M_NOLINK,0);
        elementType("description", M_PCDATA | M_INLINE | M_BLOCK, M_HTML | M_BODY, 0);

        elementType("p", M_PCDATA | M_INLINE | M_BLOCK, M_INLINE | M_NOLINK, 0);

        /** parent is from Schema :  Specify natural parent of an element in this schema.
         @param name Name of the child element
         @param parentName Name of the parent element
         **/
        //parent("<pcdata>", "<root>"); //parent name was "spin"
        //parent("fulltext-document", "<root>");//parent name was "spin"
        parent("<pcdata>", "fulltext-document");
        parent("bibliographic-data", "fulltext-document");
        parent("description", "fulltext-document");
        parent("publication-reference", "bibliographic-data");
        parent("document-id", "publication-reference");
        parent("country", "document-id");
        parent("doc-number", "document-id");
        parent("kind", "document-id");
        parent("date", "document-id");


        parent("application-reference", "bibliographic-data");
        parent("document-id", "application-reference");

        parent("p", "description");

        attribute("application-reference", "doc-id", "NMTOKEN", null); /* string ok */

        attribute("description", "id", "NMTOKEN", null); /* string ok */
        attribute("description", "lang", "NMTOKEN", null); /* string ok */
        attribute("description", "section", "NMTOKEN", null); /* string ok */
        attribute("description", "source", "NMTOKEN", null); /* string ok */

        attribute("p", "num", "NMTOKEN", null); /* string ok */
        attribute("p", "id", "NMTOKEN", null); /* string ok */
    }
}

I also wrote this class to load the test file and parse it:
import java.io.*;


import org.ccil.cowan.tagsoup.Parser;
import org.ccil.cowan.tagsoup.XMLWriter;
import org.xml.sax.InputSource;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import tagsoupExt.MiniSchema;

public class Main {
    public static void main(String[] args) {

        try{
            FileReader input = new FileReader("inputFile.txt");
            FileWriter output = new FileWriter("outputFile.txtx");
            //Attempt to initialize parser
            XMLReader r = new Parser();
            MiniSchema theSchema = new MiniSchema();
            try {
                r.setProperty(Parser.schemaProperty, theSchema);
                InputSource s = new InputSource(input);
                s.setEncoding("UTF-8");
                XMLWriter x = new XMLWriter(output);
                x.setOutputProperty(XMLWriter.ENCODING, "UTF-8");
                x.setPrefix(theSchema.getURI(), "");
                r.setContentHandler(x);
                try {
                    r.parse(s);
                }
                catch (SAXException saxExcept) {
                    System.out.println("Error : "+saxExcept.getMessage());
                }
                catch(IOException ioexcept) {
                    System.out.println("Error : "+ioexcept.getMessage());
                }
            }
            catch(SAXNotRecognizedException notRecongnizedExcept){
                System.out.println("Error : "+notRecongnizedExcept.getMessage());
            }
            catch (SAXNotSupportedException notSupportExcept){
                System.out.println("Error : "+notSupportExcept.getMessage());
            }
        }
        catch (FileNotFoundException fnfe) {
            System.out.println("File not found exception : "+fnfe.getMessage());

        }
        catch (IOException ioexcept){
            System.out.println("Error ioexcept2 : "+ioexcept.getMessage());
        }
    }
}


When I feed this class a well-formed file, the output is almost identical to the input file, except that the comment is discarded.
When I feed it an ill-formed file, the output contains the missing closing tag, but not in the correct place.
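
A note on the discarded comment, as a sketch and not a verified TagSoup answer: in SAX, comments are never delivered through ContentHandler; they go to an optional LexicalHandler registered under the standard property "http://xml.org/sax/properties/lexical-handler". The example below demonstrates this with the JDK's built-in SAX parser, not with TagSoup. The class name CommentSketch and the method collectComments are made up for illustration, and whether registering the XMLWriter under the same property on the TagSoup Parser restores the comment in the output is an assumption that would need to be checked.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical demo class; uses the JDK SAX parser, not TagSoup.
class CommentSketch {

    // Parses the given XML and returns the text of every comment reported
    // through the LexicalHandler callback.
    static List<String> collectComments(String xml) throws Exception {
        final List<String> comments = new ArrayList<>();

        // DefaultHandler2 implements both ContentHandler and LexicalHandler.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void comment(char[] ch, int start, int length) {
                comments.add(new String(ch, start, length));
            }
        };

        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        // Without this line the comment callback is never invoked.
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return comments;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<p num=\"0008\" id=\"p0008\">Some text<!-- comment <DP n=\"4\"> --></p>";
        System.out.println(collectComments(xml));
    }
}
```

If the same mechanism applies to TagSoup, the comment would not need an entry in the schema class at all, since it is a lexical event and not an element.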


So I have the following questions:
* Since comments are standard HTML, do I have to declare them in my schema class?

* Is there a way to fix the bad tag placement?
Here is an example.
My ill-formed input file looks like this (the closing </kind> is missing):
<fulltext-document>
<bibliographic-data>
<publication-reference>
<document-id>
<country>IT</country>
<doc-number>0000000</doc-number>    <kind>some data
<date>20000920</date>
</document-id>
</publication-reference>

</fulltext-document>

The output looks like this:
<fulltext-document>
<bibliographic-data>
<publication-reference>
<document-id>
<country>IT</country>
<doc-number>0000000</doc-number>    <kind>some data
<date>20000920</date> </kind>
</document-id>
</publication-reference>

</fulltext-document>

* I built my Schema class through a lot of trial and error, and the elementType method is still very obscure to me. I could not find any documentation on how the M_... bit fields are handled by the TagSoup parser, even by looking at the source code.
I would be very grateful if someone could help me with this topic; maybe the cause of the misplaced closing tag lies here.
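
One possible reading of the two bit fields, offered as a sketch and not as documented behavior: an element seems to be allowed inside another when the parent's model shares at least one bit with the child's memberOf. The class below is self-contained and does not use TagSoup; the bit values are hypothetical stand-ins for the real HTMLModels constants, and canContain is an illustrative helper name. Under this reading, "kind" (model M_PCDATA | M_INLINE | M_BLOCK) is allowed to contain "date" (memberOf M_INLINE | M_NOLINK), so the parser would have no reason to close "kind" before "date" starts.

```java
// Hypothetical demo class; it only mimics the bit test, it is not TagSoup code.
class ContentModelSketch {

    // Illustrative bit values only; the real constants are defined in HTMLModels.
    static final int M_PCDATA = 1 << 0;
    static final int M_INLINE = 1 << 1;
    static final int M_BLOCK  = 1 << 2;
    static final int M_NOLINK = 1 << 3;

    // Assumed rule: the parent may contain the child when the parent's
    // content model and the child's membership have a bit in common.
    static boolean canContain(int parentModel, int childMemberOf) {
        return (parentModel & childMemberOf) != 0;
    }

    public static void main(String[] args) {
        // Values as declared for "kind" and "date" in MiniSchema.
        int kindModel = M_PCDATA | M_INLINE | M_BLOCK;
        int dateMemberOf = M_INLINE | M_NOLINK;
        System.out.println("kind (as declared) can contain date: "
                + canContain(kindModel, dateMemberOf));

        // A stricter model that accepts character data only.
        int textOnlyModel = M_PCDATA;
        System.out.println("kind (text only) can contain date: "
                + canContain(textOnlyModel, dateMemberOf));
    }
}
```

If this reading is right, declaring the leaf elements (country, doc-number, kind, date) with a model of M_PCDATA only might make the parser close an unterminated <kind> as soon as <date> opens, but that remains to be tested against the real parser.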

Thank you very much for any advice.
