Jumbo Converters


Oliver Stueker

Feb 11, 2015, 3:12:47 PM
to quixot...@googlegroups.com, Peter Murray-Rust
Hi Peter and the others from the Quixote team,

Last summer I started working on the Jumbo-Converters and have so far improved the parsing of Gaussian output
(mainly adding support for thermochemistry and the G3 and G4 composite methods, plus fixes for Gaussian 09).
Last month I started writing templates for GAMESS-US parsing with a student here.
(This is all going on in a forked repository at [1].)

Now I have reached a point where I don't know how to parse a table of bond-orders.

e.g. take this snippet from a GAMESS output file:

                   BOND                       BOND                       BOND
  ATOM PAIR DIST  ORDER      ATOM PAIR DIST  ORDER      ATOM PAIR DIST  ORDER
    1   2  1.542  0.923        1   6  1.084  0.929        1   7  1.084  0.929
    1   8  1.084  0.929        2   3  1.084  0.929        2   4  1.084  0.929
    2   5  1.084  0.929

Each line contains data for three bonds, except the last line, which may contain data for only one or two.

I know how to specify a <record> to read, say one integer followed by 1 to 5 floats:
<record id="test1" repeat="*">{I,x:foo}{1_5F,x:bar}</record>

But all my attempts to use regular expressions to say "find me one or more of '{I,x:a1}{I,x:a2}{F,x:dist}{F,x:bo}'"
have failed with RegExp errors:
<record id="test2" repeat="*">\s*({I,x:a1}{I,x:a2}{F,x:dist}{F,x:bo})+</record>
<record id="test3" repeat="*">\s*({I,x:a1}{I,x:a2}{F,x:dist}{F,x:bo}){1,3}</record>

I've also tried to use formatType="FORTRAN" records, but even those don't seem to support repetitions.

And if I use:
<record id="bo1" repeat="*" names="a1 a2 dist bo">{I}{I}{F}{F} {I}{I}{F}{F} {I}{I}{F}{F}</record>
<record id="bo2" repeat="*" names="a1 a2 dist bo">{I}{I}{F}{F} {I}{I}{F}{F}</record>
<record id="bo3" repeat="*" names="a1 a2 dist bo">{I}{I}{F}{F} </record>


then the second record seems to eat (but not match) the line with only one bond and leaves nothing for the third.
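
For comparison, outside the template language the same table is easy to read in plain Java. This is only a sketch, not JUMBO code: the names `BondOrderLineParser`, `Bond` and `parseLine` are made up for illustration. It sidesteps the repetition problem by matching a single "atom atom dist order" group and calling `Matcher.find()` repeatedly, so a line with one, two or three groups is handled the same way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BondOrderLineParser {
    /** One entry of the table: two atom indices, a distance and a bond order. */
    static final class Bond {
        final int atom1;
        final int atom2;
        final double distance;
        final double bondOrder;

        Bond(int atom1, int atom2, double distance, double bondOrder) {
            this.atom1 = atom1;
            this.atom2 = atom2;
            this.distance = distance;
            this.bondOrder = bondOrder;
        }
    }

    // A single group: two integers followed by two decimal numbers.
    private static final Pattern GROUP = Pattern.compile(
            "(\\d+)\\s+(\\d+)\\s+(-?\\d+\\.\\d+)\\s+(-?\\d+\\.\\d+)");

    /** Returns every group found on the line, in order of appearance. */
    static List<Bond> parseLine(String line) {
        List<Bond> bonds = new ArrayList<>();
        Matcher matcher = GROUP.matcher(line);
        // find() resumes after the previous match, so each call yields the next group.
        while (matcher.find()) {
            bonds.add(new Bond(
                    Integer.parseInt(matcher.group(1)),
                    Integer.parseInt(matcher.group(2)),
                    Double.parseDouble(matcher.group(3)),
                    Double.parseDouble(matcher.group(4))));
        }
        return bonds;
    }
}
```

A header line such as "ATOM PAIR DIST  ORDER" contains no digits and simply yields an empty list.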

I have already spent too much time trying things myself and grepping through existing templates and the Record- and TransformTests, and I have now run out of things to try.

So can anybody give me a hint?

Thanks,
Oliver


[1] https://bitbucket.org/oliver_stueker/jumbo-converters

-- 
Oliver Stueker, Dr. rer. nat.
Department of Chemistry
Memorial University of Newfoundland
Canada 

Peter Murray-Rust

Feb 12, 2015, 4:17:34 PM
to Quixote mail list, Jens Thomas, Mark Williamson
Oliver,
Great to see your involvement. I'll try to have a look today.

I've copied Jens, who did a lot of work and may be able to reply more rapidly, and MarkW.

P.




--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Oliver Stueker

Feb 18, 2015, 3:45:15 PM
to quixot...@googlegroups.com, Peter Murray-Rust, Jens Thomas, Mark Williamson, Sam Adams
Hi Everyone,


I found a solution for parsing the bond orders.
Basically I had to introduce another <templateList> with three <template>s, each matching a line with three, two or one column(s) and containing just a single record for that kind of line (commit [1]).
Luckily, only the last line can have fewer than three columns.


But now I have a new problem:

I want to create a new transform in org.xmlcml.cml.converters.text.TransformElement to extract the lower triangle from a matrix that would otherwise contain redundant information (e.g. inter-atomic distances or a force-constant matrix).

<transform process='createTriangularMatrix' xpath='.' from='cml:array' dictRef='x:y' />

should convert:
<list>
  <array dataType='xsd:double' size='5'>0.0 0.1 0.2 0.3 0.4</array>
  <array dataType='xsd:double' size='5'>1.0 1.1 1.2 1.3 1.4</array>
  <array dataType='xsd:double' size='5'>2.0 2.1 2.2 2.3 2.4</array>
  <array dataType='xsd:double' size='5'>3.0 3.1 3.2 3.3 3.4</array>
  <array dataType='xsd:double' size='5'>4.0 4.1 4.2 4.3 4.4</array>
</list>

to:
<list>
  <array dictRef='x:y' dataType='xsd:double' size='15'>0.0 1.0 1.1 2.0 2.1 2.2 3.0 3.1 3.2 3.3 4.0 4.1 4.2 4.3 4.4</array>
</list>
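
The intended packing can be sketched on plain Java arrays. This is a hypothetical helper (`LowerTriangle.extract`), independent of CMLArray and of the transform itself: row i of an n x n matrix contributes its first i + 1 values, giving n(n+1)/2 values in total.

```java
final class LowerTriangle {
    /**
     * Packs the lower triangle (including the diagonal) of a square matrix
     * row by row into a single array of length n * (n + 1) / 2.
     */
    static double[] extract(double[][] rows) {
        int n = rows.length;
        double[] packed = new double[n * (n + 1) / 2];
        int next = 0;
        for (int i = 0; i < n; i++) {
            // Row i must provide at least the columns 0..i.
            if (rows[i].length < i + 1) {
                throw new IllegalArgumentException(
                        "row " + i + " has only " + rows[i].length + " values");
            }
            for (int j = 0; j <= i; j++) {
                packed[next++] = rows[i][j];
            }
        }
        return packed;
    }
}
```

For the 5 x 5 example above this yields the 15 values shown in the expected output.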

So far I have followed the implementation of createVector3() but when I iterate over the array-elements inside the lists and want to cast them to CMLArray I get: 
ClassCastException: nu.xom.Element cannot be cast to org.xmlcml.cml.element.CMLArray

(Please see my code below.)

I would really appreciate it if someone could give me a hint on how to do that.

Thanks,
Oliver


private void createTriangularMatrix() {
    assertRequired(DICT_REF, dictRef);
    assertRequired(FROM, from);
    assertRequired(XPATH, xpath);

    List<Node> nodeList = getXpathQueryResults();
    for (Node node : nodeList) { // all cml:list nodes
    
        Element element = (Element)node;
        Nodes arrayNodes = TransformElement.queryUsingNamespaces(element, from);
        CMLArray fullArray = new CMLArray("xsd:double");
        Node node0 = null;

        for (int j = 0; j < arrayNodes.size(); j++) 
        {
            // this fails with: java.lang.ClassCastException: nu.xom.Element cannot be cast to org.xmlcml.cml.element.CMLArray
            CMLArray lineArray = (CMLArray) arrayNodes.get(j);

            if (lineArray.getArraySize() >= j){
                fullArray = fullArray.plus(lineArray.createSubArray(0, j ));
            } else {
                CMLUtil.debug(element, "PAR");
                throw new RuntimeException("Lower Triangle of matrix cannot be extracted.");
            }

            if (j == 0) {
                node0 = lineArray;
            } else {
                lineArray.detach();
            }

        } // for all cml:arrays in current cml:list node

        fullArray.setDictRef(dictRef);
        node0.getParent().replaceChild(node0, fullArray);

    }// for all cml:list nodes
}


Peter Murray-Rust

Feb 18, 2015, 11:13:05 PM
to Oliver Stueker, Quixote mail list, Jens Thomas, Mark Williamson, Sam Adams
Hi Oliver,
Great to see you are doing this and I'll be happy to help. Unfortunately I'm in NZ at present so 12 hours out of phase. Sometimes these things are solvable by a conversation.

I think you are largely there! The 3 templates is probably the way to go and what I would have suggested. I'm not sure why you get a ClassCastException as <array> (and all other elements) should create a CMLArray, not a XOM Element.

My suggestion would be to run some of the examples in jumbo-converters and make sure that they create CMLArrays.

It's ca 4 years since I wrote this so it will take time to remember. I *think* it should be possible to do this without writing Java code - jumbo-converters is written with that in mind. If you have to write Java yourself then something needs changing.


Who wrote the code below, you or me :-) ?

In general, if you are writing Java code the transformation will probably create Elements and Nodes. These can be converted into CMLArray, etc. through constructors. If this is my code then I'm guessing that it didn't get finished and tested. Is there a test?
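
The difference between casting and constructing can be illustrated with two stand-in classes. `PlainElement` and `TypedArray` below are hypothetical and only mimic the relationship between a generic element and a typed subclass; they are not the XOM or CMLXOM API. A cast only succeeds if the object was created as the subclass in the first place; otherwise a new typed object has to be built from the generic one.

```java
final class CastVersusCopy {
    /** Stand-in for a generic parsed element (hypothetical). */
    static class PlainElement {
        final String name;
        final String text;

        PlainElement(String name, String text) {
            this.name = name;
            this.text = text;
        }
    }

    /** Stand-in for a typed subclass that understands its content (hypothetical). */
    static class TypedArray extends PlainElement {
        final double[] values;

        // Copy constructor: builds a typed object from a generic one.
        TypedArray(PlainElement source) {
            super(source.name, source.text);
            String trimmed = source.text.trim();
            String[] tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
            values = new double[tokens.length];
            for (int i = 0; i < tokens.length; i++) {
                values[i] = Double.parseDouble(tokens[i]);
            }
        }
    }

    /** Casts when the runtime type already matches, otherwise constructs a copy. */
    static TypedArray asTypedArray(PlainElement element) {
        if (element instanceof TypedArray) {
            return (TypedArray) element;
        }
        return new TypedArray(element);
    }
}
```

Checking with instanceof before casting avoids the exception, and the copy constructor covers the case where the parser handed back the generic type.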

P.

Peter Murray-Rust

Feb 19, 2015, 12:13:56 AM
to Oliver Stueker, Quixote mail list, Jens Thomas, Mark Williamson, Sam Adams
I've been revisiting the code; my comments and suggestions are based on the assumption that (a) I wrote createVector3 and (b) you wrote createTriangularMatrix();

I suggest that you write a test, testCreateTriangularMatrix, similar to testCreateVector3(), which is:
    @Test
    public void testCreateVector3() {
        CMLList list = new CMLList();
        CMLScalar scalar = new CMLScalar(1.1);
        list.appendChild(scalar);
        scalar = new CMLScalar(2.2);
        list.appendChild(scalar);
        scalar = new CMLScalar(3.3);
        list.appendChild(scalar);
        String listXML =
            "<list xmlns='http://www.xml-cml.org/schema'>" +
            "  <scalar dataType='xsd:double'>1.1</scalar>" +
            "  <scalar dataType='xsd:double'>2.2</scalar>" +
            "  <scalar dataType='xsd:double'>3.3</scalar>" +
            "</list>";
        JumboTestUtils.assertEqualsIncludingFloat("test", listXML, list, true, 0.000001);
        runTest("createVector3",
            "<transform process='createVector3' xpath='.' from='cml:scalar' dictRef='x:y'/>",
            list,
            "<list xmlns='http://www.xml-cml.org/schema'><vector3 dictRef='x:y'>1.1 2.2 3.3</vector3></list>"
            );
    }
Then commit it and we will debug it until it works. EITHER you fork, create a pull request and allow me to work on the fork, OR you send me the additions to the code and I'll get it working on my machine and commit.

Put some debug statements into your code so we can see what's coming through. It's just possible that we shall need to create new CMLArrays rather than cast them. If you look at createMolecule or createTorsion you'll see that we are creating new CML elements. Debug with LOG.debug(element.toXML());

P.

What's your timezone? I find http://www.zoominfo.com/p/Oliver-Stueker/1861144380 - if so we have considerable overlap. But I am fairly busy tomorrow...

   

Oliver Stueker

Feb 19, 2015, 10:42:23 AM
to Peter Murray-Rust, Quixote mail list, Jens Thomas, Mark Williamson, Sam Adams
Hi Peter,

thanks for the help.
By the way, my timezone here in Newfoundland is GMT-03:30.

Yes, createVector3() is written by you and createTriangularMatrix() by us.

I've just isolated the commits creating the new transform and re-based them on the current state of the jumbo-converters at https://bitbucket.org/wwmm/jumbo-converters and created a pull-request.

Now that I'm reading testCreateVector3(), I think I see what is going wrong: my test is just too simple.
I had assumed that the listXML is parsed with the proper namespaces, but now I see that I have to build the XML manually using cmlxom types and compare that against the text representation (listXML).

I will check that out.


Actually I don't insist on writing Java code for that, but I simply can't imagine a way to achieve the result with existing XML transforms.

My main target is to store the lower triangle of a force constant matrix.

e.g.: 
(vastly simplified version with mock numbers and only N x N instead of 3N x 3N)

      1 C  2 H  3 H  4 H  5 H
1 C   0.0  
2 H   1.0  1.1
3 H   2.0  2.1  2.2
4 H   3.0  3.1  3.2  3.3
5 H   4.0  4.1  4.2  4.3  4.4

And this lower triangle is what Gaussian reports.

GAMESS-US however prints it split like this:

      1 C  2 H
1 C   0.0  1.0
2 H   1.0  1.1
3 H   2.0  2.1
4 H   3.0  3.1
5 H   4.0  4.1

      3 H  4 H
3 H   2.2  3.2
4 H   3.2  3.3
5 H   4.2  4.3

      5 H
5 H   4.4

which corresponds to this matrix, once it is re-assembled:

      1 C  2 H  3 H  4 H  5 H
1 C   0.0  1.0
2 H   1.0  1.1
3 H   2.0  2.1  2.2  3.2
4 H   3.0  3.1  3.2  3.3
5 H   4.0  4.1  4.2  4.3  4.4

And one can see the redundant 1.0 and 3.2 in this example.

I think it should be possible to reconstruct this last matrix from the segments above with the existing transforms (though I haven't tried it so far) by reading the numbers into arrays, giving each array an ID attribute denoting the atom (e.g. "1 C", "2 H", ...) and then merging the arrays that share the same ID attribute.
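
A rough sketch of that merge on plain Java arrays, assuming each printed block starts at the same row and column index (as in the example above) and using made-up names (`ColumnBlockMerger`, `Block`). Every value is written to its slot in the packed lower triangle, and values above the diagonal, such as the redundant 1.0 and 3.2, are skipped.

```java
import java.util.List;

final class ColumnBlockMerger {
    /** One printed block: its first row/column index (0-based) and its values. */
    static final class Block {
        final int first;
        final double[][] values;

        Block(int first, double[][] values) {
            this.first = first;
            this.values = values;
        }
    }

    /** Merges the blocks of an n x n matrix into a packed lower triangle. */
    static double[] mergeLowerTriangle(int n, List<Block> blocks) {
        double[] packed = new double[n * (n + 1) / 2];
        for (Block block : blocks) {
            for (int r = 0; r < block.values.length; r++) {
                int row = block.first + r;
                for (int c = 0; c < block.values[r].length; c++) {
                    int col = block.first + c;
                    // Keep only the lower triangle; entries with col > row are redundant.
                    if (col <= row) {
                        // Row i starts at offset i * (i + 1) / 2 in the packed array.
                        packed[row * (row + 1) / 2 + col] = block.values[r][c];
                    }
                }
            }
        }
        return packed;
    }
}
```

With the three blocks from the example this gives the same 15 values as the Gaussian-style lower triangle.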

So much for the background.

Thanks for the input!


Cheers,
Oliver

Peter Murray-Rust

Feb 19, 2015, 2:16:46 PM
to Quixote mail list, Jens Thomas, Mark Williamson, Sam Adams
This looks great Oliver!

I'm busy for the next 12 hours but see you have sent a pull request.

Hope to touch base later

Oliver Stueker

Feb 19, 2015, 2:26:59 PM
to quixot...@googlegroups.com
Hi Peter,

I've figured it out after reading (and this time understanding) how the test actually works.

For now I'm good (I think).
Time for me to figure out how to re-assemble an incomplete matrix that has been broken down into several column blocks.

Thanks,
Oliver

Peter Murray-Rust

Feb 20, 2015, 2:00:48 AM
to Quixote mail list, Jens Thomas, Mark Williamson, Sam Adams
Oliver,

Shall I give you admin rights to the main branch? Then you can manage the pull requests, etc.

Oliver Stueker

Feb 20, 2015, 7:58:54 AM
to Peter Murray-Rust, Quixote mail list, Jens Thomas, Mark Williamson, Sam Adams
Hi Peter,

Yes, that might be good at some point, though it's not urgent yet.

But once the gamess-us converter produces valid CML/CompChem it might be a good time to push the work we have done so far to the main jumbo-converter repo (wwmm) and maybe even increment the version by 0.1.

Apropos valid CML: The validator at http://validator.xml-cml.org/ seems to be down. (not urgent though).

Later I will also have some questions on how to properly expand and publish the package specific CML dictionaries e.g.: http://www.xml-cml.org/dictionary/gamessus

Cheers,
Oliver

Oliver Stueker

Mar 5, 2015, 10:29:06 AM
to Peter Murray-Rust, Quixote mail list, Jens Thomas, Mark Williamson, Sam Adams
Hi,

I have another question:

I want to make my jumbo-converter-compchem-gamessus produce CML/CompChem-compliant CML and therefore add proper IDs to e.g. the <module dictRef="cc:job">. I want to make these IDs unique by adding a running number (e.g. "job_1", "job_2", ...) but I can't figure out how.

I assume it should be possible to use <transform process="addAttribute" name="id"  ... /> and then use value="$number(someXPATH)" to  determine the position of the current node in the jobList or count the number of preceding job-modules, but I just can't get it to work.

Some of my attempts are:

  <transform process="addAttribute" name="id"
             xpath=".//cml:module[@dictRef='cc:job']"
             value="job_$number( position( self::node() ) )" />
resulting in a:
InvocationTargetException: java.lang.RuntimeException: bad query:  position( self::node(: XPath error: Expected: ) -> [Help 1]
...
Caused by: class nu.xom.jaxen.saxpath.XPathSyntaxException:  position( self::node(: 22: Expected: )



  <transform process="addAttribute" name="id" 
             xpath=".//cml:module[@dictRef='cc:job']"
             value="job_$number( self::position() )" />
raising:
InvocationTargetException: java.lang.RuntimeException: bad query:  self::position(: XPath error: Expected node-type -> [Help 1]
...
Caused by: class nu.xom.jaxen.saxpath.XPathSyntaxException:  self::position(: 16: Expected node-type



  <transform process="addAttribute" name="id" 
             xpath=".//cml:module[@dictRef='cc:job']"
             value="job_$number( count( ./preceding-sibling::*)+1 )" />
 
leading to:
InvocationTargetException: java.lang.RuntimeException: bad query:  count( ./preceding-sibling::*: XPath error: Expected: ) -> [Help 1] 
...
Caused by: class nu.xom.jaxen.saxpath.XPathSyntaxException:  count( ./preceding-sibling::*: 30: Expected: )


Does anyone have an idea how to evaluate the position or node count as a number?
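
One way to check the expression itself, independent of JUMBO, is to evaluate it with the XPath engine that ships with the JDK (javax.xml.xpath). The sketch below is an assumption about how such a check could look, not part of jumbo-converters; the class name `XPathPositionCheck` and the sample XML used with it are made up. It evaluates a slightly narrowed variant of the third attempt, count(preceding-sibling::*[@dictRef='cc:job']) + 1, with each job module as the context node.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathPositionCheck {
    /** Returns the 1-based position of each cc:job element among its cc:job siblings. */
    static int[] jobPositions(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Select every element carrying dictRef='cc:job', in document order.
            NodeList jobs = (NodeList) xpath.evaluate(
                    "//*[@dictRef='cc:job']", doc, XPathConstants.NODESET);
            int[] positions = new int[jobs.getLength()];
            for (int i = 0; i < jobs.getLength(); i++) {
                // Evaluate the counting expression relative to the current job node.
                Double n = (Double) xpath.evaluate(
                        "count(preceding-sibling::*[@dictRef='cc:job']) + 1",
                        jobs.item(i), XPathConstants.NUMBER);
                positions[i] = n.intValue();
            }
            return positions;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not evaluate XPath", e);
        }
    }
}
```

Note that position() takes no arguments in XPath 1.0, so position(self::node()) is rejected by any engine, whereas the count(...) form is valid on its own.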


Oliver


Peter Murray-Rust

Mar 5, 2015, 11:27:27 AM
to Oliver Stueker, Quixote mail list, Jens Thomas, Mark Williamson, Sam Adams
Personally I try to avoid this sort of computation in XSL if possible. I suspect that the expression in 'value' is missing the context. I'd probably add the IDs after the XML was formed. If you have to do it, I suggest you make a stylesheet independent of JUMBO and make sure it works (if it doesn't, ask on Stack Overflow), and then try to reinsert it in the JUMBO XML.
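
Adding the IDs after the XML has been formed could look roughly like this with the JDK's built-in DOM and XPath classes. This is a sketch under assumptions: the class name `JobIdAssigner` is invented, the input is assumed to be the finished CML as a string, and matching on local-name() is just one way to ignore the CML namespace.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class JobIdAssigner {
    /** Parses an XML string into a namespace-aware DOM document. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Sets id="job_1", "job_2", ... on every cc:job module; returns how many were found. */
    static int assignJobIds(Document doc) {
        try {
            NodeList jobs = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    "//*[local-name()='module'][@dictRef='cc:job']",
                    doc, XPathConstants.NODESET);
            for (int i = 0; i < jobs.getLength(); i++) {
                // Nodes come back in document order, so i + 1 is the running number.
                ((Element) jobs.item(i)).setAttribute("id", "job_" + (i + 1));
            }
            return jobs.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("could not assign job ids", e);
        }
    }
}
```

Because the numbering happens in ordinary Java, no position or count arithmetic is needed inside the transform's value attribute.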

Thanks for sticking with this. Very pleased to see it being used.
