Issues when defining a personalized schema class.

sebastien...@gmail.com

May 2, 2014, 11:05:06 AM

to tagsoup...@googlegroups.com

Hello all,

I would like to adapt TagSoup to repair some input files that are not HTML files.
Here is an example of how the files are structured:
<fulltext-document>
<bibliographic-data>
<publication-reference>
<document-id>
<country>IT</country>
<doc-number>0000000</doc-number> <kind>some data</kind>
<date>20000920</date>
</document-id>
</publication-reference>
<application-reference doc-id="0000000001">
<document-id>
<country>IT</country>
<doc-number>00000000</doc-number> <kind>some data</kind>
<date>20000215</date>
</document-id>
</application-reference>
</bibliographic-data>
<description id="desc" lang="it" section="000000071" source="002">
<p num="0001" id="p0001">Some text .</p>
<p num="0008" id="p0008">Some text</p>
</description>
</fulltext-document>

As I understand it, I have to write a class that extends Schema and implements HTMLModels.
So I created this class:
package tagsoupExt;

import org.ccil.cowan.tagsoup.HTMLModels;
import org.ccil.cowan.tagsoup.Schema;

public class MiniSchema extends Schema implements HTMLModels {
    public MiniSchema() {
        // Start of Schema calls
        setURI("");
        setPrefix("");

        /* elementType is from Schema: add or replace an element type for this schema.
           name     Name (QName) of the element
           model    Models of the element's content as a vector of bits
           memberOf Models the element is a member of as a vector of bits
           flags    Flags for the element */
        elementType("<pcdata>", M_EMPTY, M_PCDATA, 0);
        elementType("<root>", M_ROOT, M_EMPTY, 0);
        elementType("fulltext-document", M_HTML, M_ROOT, 0);
        elementType("bibliographic-data", M_PCDATA | M_INLINE | M_BLOCK, M_HTML | M_BODY, 0);
        elementType("publication-reference", M_PCDATA | M_INLINE | M_BLOCK, M_INLINE | M_NOLINK, 0);
        elementType("document-id", M_PCDATA | M_INLINE | M_BLOCK, M_INLINE | M_NOLINK, 0);
        elementType("country", M_PCDATA | M_INLINE | M_BLOCK, M_INLINE | M_NOLINK, 0);
        elementType("doc-number", M_PCDATA | M_INLINE | M_BLOCK, M_INLINE | M_NOLINK, 0);
        elementType("kind", M_PCDATA | M_INLINE | M_BLOCK, M_INLINE | M_NOLINK, 0);
        elementType("date", M_PCDATA | M_INLINE | M_BLOCK, M_INLINE | M_NOLINK, 0);
        elementType("p", M_PCDATA | M_INLINE | M_BLOCK, M_INLINE | M_NOLINK, 0);
        // Assumption: these two types must be declared as well, because the
        // parent() and attribute() calls below refer to them by name.
        elementType("application-reference", M_PCDATA | M_INLINE | M_BLOCK, M_INLINE | M_NOLINK, 0);
        elementType("description", M_PCDATA | M_INLINE | M_BLOCK, M_INLINE | M_NOLINK, 0);

        /* parent is from Schema: specify the natural parent of an element in this schema.
           name       Name of the child element
           parentName Name of the parent element */
        //parent("<pcdata>", "<root>"); //parent name was "spin"
        //parent("fulltext-document", "<root>"); //parent name was "spin"
        parent("<pcdata>", "fulltext-document");
        parent("bibliographic-data", "fulltext-document");
        parent("description", "fulltext-document");
        parent("publication-reference", "bibliographic-data");
        parent("document-id", "publication-reference");
        parent("country", "document-id");
        parent("doc-number", "document-id");
        parent("kind", "document-id");
        parent("date", "document-id");

        parent("application-reference", "bibliographic-data");
        // Note: this replaces the parent set for document-id a few lines above
        parent("document-id", "application-reference");

        parent("p", "description");

        attribute("application-reference", "doc-id", "NMTOKEN", null); /* string ok */

        attribute("description", "id", "NMTOKEN", null); /* string ok */
        attribute("description", "lang", "NMTOKEN", null); /* string ok */
        attribute("description", "section", "NMTOKEN", null); /* string ok */
        attribute("description", "source", "NMTOKEN", null); /* string ok */

        attribute("p", "num", "NMTOKEN", null); /* string ok */
        attribute("p", "id", "NMTOKEN", null); /* string ok */
    }
}

I also wrote this class to load the test file and parse it:
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

import org.ccil.cowan.tagsoup.Parser;
import org.ccil.cowan.tagsoup.XMLWriter;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import tagsoupExt.MiniSchema;

public class Main {
    public static void main(String[] args) {
        // try-with-resources closes both files even if parsing fails
        try (FileReader input = new FileReader("inputFile.txt");
             FileWriter output = new FileWriter("outputFile.txtx")) {
            // Initialize the parser with the custom schema
            XMLReader r = new Parser();
            MiniSchema theSchema = new MiniSchema();
            r.setProperty(Parser.schemaProperty, theSchema);

            InputSource s = new InputSource(input);
            s.setEncoding("UTF-8");

            // Serialize the repaired SAX events back to a file
            XMLWriter x = new XMLWriter(output);
            x.setOutputProperty(XMLWriter.ENCODING, "UTF-8");
            x.setPrefix(theSchema.getURI(), "");
            r.setContentHandler(x);

            r.parse(s);
        } catch (FileNotFoundException fnfe) {
            System.out.println("File not found exception : " + fnfe.getMessage());
        } catch (IOException ioExcept) {
            System.out.println("Error : " + ioExcept.getMessage());
        } catch (SAXException saxExcept) {
            // Also covers SAXNotRecognizedException and SAXNotSupportedException
            System.out.println("Error : " + saxExcept.getMessage());
        }
    }
}

When I give this class a well-formed file, the output is almost identical to the input, except that comments are discarded.
When I give it an ill-formed file, the missing closing tag is added, but not in the correct place.
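For context on the discarded comments: in SAX, comments are not elements. They are reported to a LexicalHandler registered under the property http://xml.org/sax/properties/lexical-handler, and a plain ContentHandler never sees them. The sketch below shows this with the JDK's built-in SAX parser, not with TagSoup; the class name CommentProbe and its helper are made up for illustration. Whether registering the XMLWriter under the same property (TagSoup exposes it as Parser.lexicalHandlerProperty) is enough to keep the comments in my case is an assumption I have not verified.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Illustrative only: uses the JDK SAX parser, not TagSoup.
final class CommentProbe {
    static final String LEXICAL_HANDLER = "http://xml.org/sax/properties/lexical-handler";

    /** Collects every comment reported through the LexicalHandler interface. */
    static final class Collector extends DefaultHandler2 {
        final List<String> comments = new ArrayList<>();

        @Override
        public void comment(char[] ch, int start, int length) {
            comments.add(new String(ch, start, length));
        }
    }

    /** Parses xml and returns the comments seen, with or without a lexical handler. */
    static List<String> commentsSeen(String xml, boolean registerLexicalHandler) {
        Collector collector = new Collector();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(collector);
            if (registerLexicalHandler) {
                // Without this property the parser has nowhere to send comments.
                reader.setProperty(LEXICAL_HANDLER, collector);
            }
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return collector.comments;
    }

    public static void main(String[] args) {
        String xml = "<doc><!-- note --><kind>some data</kind></doc>";
        System.out.println(commentsSeen(xml, false)); // prints []
        System.out.println(commentsSeen(xml, true));  // prints [ note ]
    }
}
```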

So I have the following questions:
* Since comments are standard HTML, do I have to declare them in my schema class?

* Is there a way to fix the misplaced closing tag?
Here is an example.
My ill-formed input file looks like this (the closing </kind> is missing):
<fulltext-document>
<bibliographic-data>
<publication-reference>
<document-id>
<country>IT</country>
<doc-number>0000000</doc-number> <kind>some data
<date>20000920</date>
</document-id>
</publication-reference>

</fulltext-document>

The output looks like this:
<fulltext-document>
<bibliographic-data>
<publication-reference>
<document-id>
<country>IT</country>
<doc-number>0000000</doc-number> <kind>some data
<date>20000920</date> </kind>
</document-id>
</publication-reference>

</fulltext-document>

* I built my Schema class by trial and error, and the elementType method is still very obscure to me. I could not find any documentation on how the bit fields (M_...) are handled by the TagSoup parser, even by reading the source code.
I would be very grateful if someone could help me with this; the cause of the misplaced closing tag may well lie here.
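One possible reading of the bit fields, stated as an assumption rather than documented behavior: a parent may contain a child when the parent's model shares at least one bit with the child's memberOf. If that is right, my schema allows date inside kind (both carry M_INLINE), which would explain why kind stays open until </document-id> arrives. The sketch below does not use TagSoup; the bit values and the M_DOCID_CHILD model are invented for illustration.

```java
// Illustrative only: mimics the assumed containment test with made-up bit values.
final class ContentModelSketch {
    // Hypothetical values; the real constants are defined in HTMLModels.
    static final int M_INLINE = 1 << 0;
    static final int M_BLOCK = 1 << 1;
    static final int M_PCDATA = 1 << 2;
    static final int M_NOLINK = 1 << 3;
    static final int M_DOCID_CHILD = 1 << 4; // invented model, not part of TagSoup

    /** Assumed rule: the parent's content model must overlap the child's memberOf. */
    static boolean canContain(int parentModel, int childMemberOf) {
        return (parentModel & childMemberOf) != 0;
    }

    public static void main(String[] args) {
        // Values as declared in MiniSchema: kind accepts inline content, date is inline.
        int kindModel = M_PCDATA | M_INLINE | M_BLOCK;
        int dateMemberOf = M_INLINE | M_NOLINK;
        System.out.println(canContain(kindModel, dateMemberOf)); // prints true

        // A stricter variant: kind accepts text only, date belongs to its own model.
        int strictKindModel = M_PCDATA;
        int strictDateMemberOf = M_DOCID_CHILD;
        System.out.println(canContain(strictKindModel, strictDateMemberOf)); // prints false
    }
}
```

Under this reading, giving kind a text-only model and putting date in a model that only document-id accepts might make the parser close kind before date starts, but I have not confirmed that against TagSoup itself.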