Re: Section Extractor

Ashok Hariharan

May 19, 2009, 12:21:15 PM

to akomant...@googlegroups.com

On Tue, May 19, 2009 at 5:28 PM, Luca Cervone <cervo...@gmail.com> wrote:
> I have integrated the Sections Extractor into the translator.
> So far the messages are not written in a very human-readable way,
> but you can try it to see all the information that we can currently
> display about an error.
> Let me know if that is enough, or what we should add.
> For example, the problem with large sections is that we cannot point to
> the exact paragraph in which the error occurs.

Hi Luca,

"the problem with large sections is that we cannot point to the
exact paragraph in which the error occurs"

The issue is that some sections can be as large as 5-6 pages, so just
indicating an error generically in a section does not really help the
user.

I took a look at the source for what you have done. For example, you have
used regular expression matching to determine the section name from
the error message:

<code>
....
// check what type of text the exception carries
if (exceptionMessage.matches("(.*)Attribute '(.*)' must appear on element '(.*)'."))
{
    // compile the regex
    Pattern p = Pattern.compile("(.*)Attribute '(.*)' must appear on element '(.*)'");
    // set the input
    Matcher m = p.matcher(exceptionMessage);
    // the attribute name
    String attribute = "";
    // the element name

....
</code>

This seems very fragile and liable to break if we switch JDK versions
or try to use it in different locales / regions.
Have you looked at the ValidationEventLocator and ValidationEventCollector
classes, which allow resolving a validation exception / SAXException to the
XML source? They return the DOM node that has the error, and the line and
column number of the error.
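For reference, a minimal sketch of the position-based approach suggested here. It uses the standard SAX error callback and SAXParseException rather than the JAXB ValidationEventLocator classes named above, and it never inspects the message text. The toy schema, the class name LocatedErrors and the sample input are assumptions made for illustration; they are not taken from the translator code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class LocatedErrors {
    // Hypothetical toy schema: <doc> must carry an "id" attribute.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='doc'><xs:complexType>"
      + "<xs:attribute name='id' type='xs:string' use='required'/>"
      + "</xs:complexType></xs:element></xs:schema>";

    /** Validates xml against XSD and returns "line:column" for each reported error. */
    public static List<String> validate(String xml) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(XSD)));
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // schema validation needs namespace awareness
        factory.setSchema(schema);
        final List<String> positions = new ArrayList<>();
        SAXParser parser = factory.newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                // The position comes from the exception itself, not from its message.
                positions.add(e.getLineNumber() + ":" + e.getColumnNumber());
            }
        });
        return positions;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(validate("<doc/>"));          // missing required attribute
        System.out.println(validate("<doc id='a'/>"));   // valid
    }
}
```

Because only getLineNumber() and getColumnNumber() are read, the result should not depend on the wording or language of the validator's messages.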

Ashok

Luca Cervone

May 19, 2009, 12:35:56 PM

to akomant...@googlegroups.com

Dear Ashok

> This seems very fragile and liable to break if we switch JDK versions
> or try to use it in different locales / regions.
> Have you looked at the ValidationEventLocator and ValidationEventCollector
> classes, which allow resolving a validation exception / SAXException to the
> XML source? They return the DOM node that has the error, and the line and
> column number of the error.

The regular expression is not fragile, because it is based on the "original language" of the validator.

That means it always takes the message emitted by the validator in English.

Moreover, the only thing the regular expression does is find the word "attribute" or "element" in order to understand what type of error we are handling.

It is true that the SAX parser supplies the two classes you mention, but there are two things to say:

1) The classes are included only in the commercial version of the parser.

2) What those classes do is the same as what I can do with the validator I'm using.

And, as you know, the method that retrieves the line and the column works on the XML version, not on the original ODF document. But we have to tell the user where the problem is in the ODF, not in the generated XML version :-(

And last but not least, the method getCurrentElementNode in the validator (both SAX and JAXP) points to the first parent of the element in which the error occurs.

So we have to find another method to point to the single paragraph.

I'll try to come up with an idea while reviewing the code, but in the meantime you can tell me your opinion.

Ciao

Luca

Luca Cervone

Web and XML solutions designer

e-mail: cervo...@gmail.com

luca.c...@unibo.it

lcer...@cs.unibo.it

mobile phone: 0039 348 26 27 545

home phone: 0039 051 199 82 854

skype: cervoneluca

Ashok Hariharan

May 20, 2009, 2:48:19 AM

to akomant...@googlegroups.com

On Tue, May 19, 2009 at 7:35 PM, Luca Cervone <cervo...@gmail.com> wrote:
> It is true that the SAX parser supplies the two classes you mention, but
> there are two things to say:
> 1) The classes are included only in the commercial version of the parser.
> 2) What those classes do is the same as what I can do with the validator
> I'm using.

Actually I wasn't suggesting using Saxon for this part...

You are validating the output document using an XML parser (Xerces)
and then trapping the exception. I am suggesting exactly the same,
except that when you trap the exception you use a source locator to
identify the XML node responsible for the problem. Once we identify
the problematic node it is easy to walk the parent tree and identify
the originating paragraph and section. We know the mapping between
text:section/@name and AN container elements and also between text:p
and the AN paragraph representation... so AN and ODF containers are
relatively straightforward to map to each other.
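A minimal sketch of the "walk the parent tree" step, assuming the problematic node is available as a DOM node. The class name AncestorLookup, the element names and the ids in the sample fragment are hypothetical and only stand in for the real AN containers.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class AncestorLookup {
    /** Returns the nearest element (the node itself or an ancestor) with the given local name, or null. */
    public static Element nearest(Node node, String localName) {
        for (Node n = node; n != null; n = n.getParentNode()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && localName.equals(n.getLocalName())) {
                return (Element) n;
            }
        }
        return null;
    }

    /** Parses an XML string into a namespace-aware DOM document. */
    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);
        return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical AN-like fragment; names and ids are illustrative only.
        Document doc = parse(
            "<debate><section id='sec1'><p id='p3'><b>text</b></p></section></debate>");
        Node bad = doc.getElementsByTagName("b").item(0); // pretend this node failed validation
        System.out.println(nearest(bad, "p").getAttribute("id"));       // prints "p3"
        System.out.println(nearest(bad, "section").getAttribute("id")); // prints "sec1"
    }
}
```

Once the enclosing paragraph and section are known on the AN side, the known mapping to text:p and text:section/@name would give the corresponding ODF containers.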

> And, as you know, the method that retrieves the line and the column works on
> the XML version, not on the original ODF document. But we have to tell the
> user where the problem is in the ODF, not in the generated XML version :-(
> And last but not least, the method getCurrentElementNode in the validator
> (both SAX and JAXP) points to the first parent of the element in which the
> error occurs.

No, no. I am not suggesting using getCurrentElementNode(), since that
doesn't provide the information we need.
Here is some sample code to explain what I am suggesting:

public static void main(String[] args) throws Exception {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setValidating(true);
    // set the schema somewhere
    SAXParser parser = factory.newSAXParser();
    parser.parse("sample_an.xml", new AnXmlLocator());
}
.....

public class AnXmlLocator extends DefaultHandler {
    Locator locator;

    // the parser hands us its locator before parsing starts
    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    // for testing we allow only 'clause'
    // ideally we should be trapping SAXException and doing the source
    // location stuff there
    @Override
    public void startElement(String uri, String localName,
            String qName, Attributes at) throws SAXException {

        if (qName.equals("clause")) {
            // ... do nothing
        } else {
            // for every other element raise an exception and identify
            // the source of the error
            // do mapping stuff between AN xml error and ODF error
            String location = "";
            if (locator != null) {
                location = locator.getSystemId(); // name of xml doc
                location += " line " + locator.getLineNumber();
                location += ", column " + locator.getColumnNumber();
                location += ": ";
            }

            throw new SAXException(location + "Illegal element");
        }
    }
}

Ashok

Luca Cervone

May 20, 2009, 5:48:59 AM

to akomant...@googlegroups.com

Dear Ashok,

I'm not sure I understood what you are suggesting.

Let me try it.

Ciao

Luca

Luca Cervone

May 21, 2009, 6:56:23 AM

to akomant...@googlegroups.com

Hi Ashok,

Maybe you can help me.

I have implemented the parser using the location handler and everything works fine: I am able to list every node as it is processed during parsing.

The problem is that the code now does not validate the document.

I have committed the current version of the code.

The relevant lines are these (in the file SchemaValidator):

SAXParserFactory factory = SAXParserFactory.newInstance();
SchemaFactory sfactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema schema = sfactory.newSchema(new File(aSchemaPath));
factory.setValidating(true);
factory.setSchema(schema);
// set the schema somewhere
SAXParser domParser = factory.newSAXParser();
// parse the document and validate it
domParser.parse(new InputSource(aDocument.toURI().toString()), LocationHandler.getInstance());

Have you ever had this problem before?

Thanks

Luca

Ashok Hariharan

May 21, 2009, 7:06:30 AM

to akomant...@googlegroups.com

On Thu, May 21, 2009 at 1:56 PM, Luca Cervone <cervo...@gmail.com> wrote:
> I have implemented the parser using the location handler and everything
> works fine: I am able to list every node as it is processed during
> parsing.
> The problem is that the code now does not validate the document.
> I have committed the current version of the code.
> The relevant lines are these (in the file SchemaValidator):
>    SAXParserFactory factory = SAXParserFactory.newInstance();
>    SchemaFactory sfactory =
> SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
>      Schema schema = sfactory.newSchema(new File(aSchemaPath));
>    factory.setValidating(true);
>    factory.setSchema(schema);

> //parse the document and validate it
> domParser.parse(new InputSource(aDocument.toURI().toString()),
> LocationHandler.getInstance());
> Have you ever had this problem before?

Are you getting a validation failure or an entirely different exception?

Also, did you set the SAX driver explicitly to use Apache Xerces?
I have had a lot of errors in the past when this is not set explicitly,
as the JAXP SAXParserFactory tends to pick up the JDK's own SAX parser
(which is also Xerces, but a different version).
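A small sketch for checking which implementation JAXP actually resolves, with the usual ways of pinning Apache Xerces shown as comments. The class name WhichParser is hypothetical, and the commented lines assume that xercesImpl.jar is on the classpath.

```java
import javax.xml.parsers.SAXParserFactory;

public class WhichParser {
    /** Class name of the SAXParserFactory implementation that JAXP resolves by default. */
    public static String defaultFactoryName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("default factory: " + defaultFactoryName());

        // To pin Apache Xerces (assumption: xercesImpl.jar is on the classpath),
        // set the JAXP system property before the first newInstance() call:
        // System.setProperty("javax.xml.parsers.SAXParserFactory",
        //         "org.apache.xerces.jaxp.SAXParserFactoryImpl");
        //
        // or name the class directly (available since Java 6):
        // SAXParserFactory.newInstance(
        //         "org.apache.xerces.jaxp.SAXParserFactoryImpl", null);
    }
}
```

Printing the resolved class name is a quick way to confirm whether the JDK's bundled parser or the standalone Xerces is in use.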

I'll also check out the code you have committed and test it at my end.

Ashok

Luca Cervone

May 21, 2009, 7:09:51 AM

to akomant...@googlegroups.com

No, no. I don't get any exceptions.

The file is parsed even though I'm sure that there are validation issues.

I'll try to set Apache Xerces.

Thanks.

Ciao

Luca Cervone

May 21, 2009, 7:28:47 AM

to akomant...@googlegroups.com

Hey Ashok,

I resolved the problem.

Of course it was a stupid thing :-D

I forgot to insert factory.setNamespaceAware(true); in my code.
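A minimal sketch of the factory configuration this fix implies. The helper name forSchema and the empty placeholder schema are assumptions for illustration. As far as the JAXP documentation describes it, setValidating(true) requests DTD validation, so this sketch leaves it at its default (false) and relies on setSchema for XML Schema validation.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

public class ValidatingFactory {
    /** Builds a namespace-aware factory whose parsers validate against the given schema. */
    public static SAXParserFactory forSchema(Schema schema) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // the line that was missing
        factory.setSchema(schema);       // XML Schema validation; setValidating stays false
        return factory;
    }

    /** Placeholder schema with no declarations, used only to exercise the helper. */
    public static Schema placeholderSchema() throws Exception {
        String xsd = "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'/>";
        return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(xsd)));
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = forSchema(placeholderSchema());
        System.out.println("namespace aware: " + factory.isNamespaceAware());
        System.out.println("DTD validating:  " + factory.isValidating());
    }
}
```

Note that validation errors are only visible if the handler passed to parse() overrides error(), since DefaultHandler ignores them by default.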

Ciao

Luca

Luca Cervone

May 21, 2009, 7:44:57 AM

to akomant...@googlegroups.com

Ok Ashok,

Take a look now at the code that I have committed.

Now we have the information about the exact point at which a problem occurs.

But our issue still remains: how can we pinpoint the location in the ORIGINAL document?

I'm trying various ideas, let me know if you have a good one.

Ciao

Luca

Luca Cervone

May 21, 2009, 8:47:10 AM

to akomant...@googlegroups.com

Hi Ashok,

I think I have a very good idea, and I'm implementing it.

I need one piece of information. Do you know if there is a way to also include, in the ExtractSection output, the line of the section in the ORIGINAL ODF document?

Thanks

Luca

Ashok Hariharan

May 21, 2009, 8:55:00 AM

to akomant...@googlegroups.com

On Thu, May 21, 2009 at 2:44 PM, Luca Cervone <cervo...@gmail.com> wrote:
> Ok Ashok,
> Take a look now at the code that I have committed.
> Now we have the information about the exact point at which a problem
> occurs.
> But our issue still remains: how can we pinpoint the location in the
> ORIGINAL document?
> I'm trying various ideas, let me know if you have a good one.
> Ciao
> Luca

Luca,

Can you send me the document that you are testing with (the one you
are using to test the error handler)?

If I have it, I can try to make a small demo of how it could work.

Ashok

Luca Cervone

May 21, 2009, 8:55:59 AM

to akomant...@googlegroups.com

Of course, it is the attached one.

debaterecord_ken_eng_2008_12_17_main.odt

Luca Cervone

May 21, 2009, 9:33:49 AM

to akomant...@googlegroups.com

Hi Ashok,

Look at the new version that I have just committed.

Look at how the errors are explained.

I think that we are very near to the final solution. We have to add the line of the section in the original document, and then I think the exception is perfect.

The only issue can come up when an error is raised in an element that has no id (like a paragraph, text:p), but in these cases I think it is sufficient to show the information relating to the parent of the element.
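One way the parent fallback could be sketched with SAX alone: keep a stack of the ids of the currently open elements, so that when an error is reported inside an element without an id, the innermost identified ancestor is still known. The class name IdStackHandler, the attribute name "id" and the sample markup are assumptions, not taken from the committed code.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class IdStackHandler extends DefaultHandler {
    // One entry per open element: its id, or "" when it has none.
    private final Deque<String> openIds = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String id = atts.getValue("id");
        openIds.push(id == null ? "" : id);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        openIds.pop();
    }

    /** Id of the innermost open element that has one, or null if none does. */
    public String nearestId() {
        for (String id : openIds) { // iterates from innermost to outermost
            if (!id.isEmpty()) {
                return id;
            }
        }
        return null;
    }

    /** Demo helper: the nearest id as seen when the element named target is opened. */
    public static String nearestIdAt(String xml, final String target) throws Exception {
        final String[] found = new String[1];
        IdStackHandler handler = new IdStackHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                super.startElement(uri, localName, qName, atts);
                if (qName.equals(target)) {
                    found[0] = nearestId();
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return found[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<section id='sec1'><p><span/></p></section>";
        System.out.println(nearestIdAt(xml, "p"));       // prints "sec1"
        System.out.println(nearestIdAt(xml, "section")); // prints "sec1"
    }
}
```

In a real handler, nearestId() would be called from the error callback instead of from startElement.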

Let me know as soon as possible, so we can hope to finish during today or tomorrow.

Ciao

Luca

Ashok Hariharan

May 21, 2009, 10:22:39 AM

to akomant...@googlegroups.com

On Thu, May 21, 2009 at 4:33 PM, Luca Cervone <cervo...@gmail.com> wrote:
> Hi Ashok,
> Look at the new version that I have just committed.
> Look at how the errors are explained.
> I think that we are very near to the final solution. We have to add the
> line of the section in the original document, and then I think the exception
> is perfect.

Hi Luca,

I looked at your code. Let me test it with the document you sent me.
The line of the section will not help, since OpenOffice renders the
document with 'virtual' line numbers that are not related to the line
numbers in the XML document.
I had another idea of how to do this using the locator, but let me
first test the output of your code and then get back to you.

Ashok

> The only issue can come up when an error is raised in an element that has
> no id (like a paragraph, text:p), but in these cases I think it is
> sufficient to show the information relating to the parent of the element.
> Let me know as soon as possible, so we can hope to finish during today or
> tomorrow.
> Ciao
> Luca
--
++++ Ashok Hariharan ++++
