CML Validation


Oliver Stueker

Mar 31, 2015, 4:20:31 PM
to Quixote mail list, cml-d...@lists.sourceforge.net, Peter Murray-Rust, Egon Willighagen
Hi everyone,
I've cross-posted to the cml-discuss and quixote lists as I think I need input from both groups.

I have just started trying to validate my CML/CompChem files using a standalone cml-validator.jar (from Bitbucket).

And I get errors like:
<?xml version="1.0" encoding="UTF-8"?>
    <final-report>
        <well-formed-test>
            <valid>xml is well formed</valid>
        </well-formed-test>
        <schema-validation-test>
            <error>cvc-pattern-valid: Value 'x:QCISD(T)' is not facet-valid with respect to pattern '[A-Za-z][A-Za-z0-9_]*:[A-Za-z][A-Za-z0-9_\.\-]*' for type 'namespaceRefType'.</error>
        </schema-validation-test>
    </final-report>
</report>

complaining that my dictRef="x:QCISD(T)" contains parentheses.
This is a pity because I would really like to use things like:

<scalar dataType="xsd:double" dictRef="cc:Energy(0K)" units="nonSi:hartree">-40.422101</scalar>
<scalar dataType="xsd:double" dictRef="cc:Energy(T)" units="nonSi:hartree">-40.419229</scalar>
<scalar dataType="xsd:double" units="nonSi:hartree" dictRef="g:energy(MP2/G3Bas1)">-40.3325515</scalar>
<scalar dataType="xsd:double" units="nonSi:hartree" dictRef="g:energy(QCISD(T)/G3Bas1)">-40.3559402</scalar>
<scalar dataType="xsd:double" units="nonSi:hartree" dictRef="g:energy(G3MP2)">-40.4221009</scalar>

As I think they are more instructive than:

<scalar dataType="xsd:double" dictRef="cc:Energy_0K" units="nonSi:hartree">-40.422101</scalar>
<scalar dataType="xsd:double" dictRef="cc:Energy_T" units="nonSi:hartree">-40.419229</scalar>
<scalar dataType="xsd:double" units="nonSi:hartree" dictRef="g:energy_MP2_G3Bas1">-40.3325515</scalar>
<scalar dataType="xsd:double" units="nonSi:hartree" dictRef="g:energy_QCISD_T_G3Bas1">-40.3559402</scalar>
<scalar dataType="xsd:double" units="nonSi:hartree" dictRef="g:energy_G3MP2_">-40.4221009</scalar>

In particular, I find dictRef="g:energy(QCISD(T)/G3Bas1)" much more readable than dictRef="g:energy_QCISD_T_G3Bas1".

Is there a good reason to restrict the allowed characters of namespaceRefType to '[A-Za-z0-9_\.\-]'?
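
For anyone who wants to reproduce the check without running the validator, here is a minimal sketch in Python. The pattern is copied from the error message above; the function name is only illustrative, and this is not how cml-validator.jar itself is implemented. XSD patterns are implicitly anchored at both ends, so the sketch uses a full match.

```python
import re

# Pattern copied verbatim from the validator's error message.
# XSD pattern facets match the whole value, hence fullmatch below.
NAMESPACE_REF = re.compile(r"[A-Za-z][A-Za-z0-9_]*:[A-Za-z][A-Za-z0-9_\.\-]*")


def is_valid_dict_ref(value: str) -> bool:
    """Return True if value satisfies the namespaceRefType pattern."""
    return NAMESPACE_REF.fullmatch(value) is not None


if __name__ == "__main__":
    for ref in (
        "x:QCISD(T)",
        "cc:Energy(0K)",
        "cc:Energy_0K",
        "g:energy_QCISD_T_G3Bas1",
    ):
        print(ref, is_valid_dict_ref(ref))
```

The parenthesised forms fail because "(" and ")" are outside the character class for the part after the colon, while the underscore forms pass.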

I'm open to comments and suggestions.


Cheers,
Oliver


--
Oliver Stueker, Dr. rer. nat.
Department of Chemistry, Memorial University

Peter Murray-Rust

Mar 31, 2015, 5:14:47 PM
to Quixote mail list, cml-d...@lists.sourceforge.net, Egon Willighagen
I am afraid it is the W3C specification for XML names, and we have adopted that.

If you use any character like / or (), it will break XML-compliant software wherever this value is treated as a name. As you have it, it's a data item, but some tools attempt to validate it.

It may be more human-readable, but it's machine-unreadable. See http://www.xml.com/pub/a/2001/07/25/namingparts.html or the XML spec.

And these names can expand into RDF URLs where again punctuation characters would break them.
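
To illustrate the URL point, here is a minimal sketch in Python. The namespace URI and the function names are hypothetical placeholders, not the real dictionary locations or any CML tool's API. It expands a prefix:local reference by plain concatenation and shows what percent-encoding would do to the punctuation.

```python
from urllib.parse import quote

# Hypothetical prefix-to-namespace map; the real dictionary URIs may differ.
NAMESPACES = {"g": "http://example.org/dictionary/gaussian/"}


def expand(dict_ref: str) -> str:
    """Expand 'prefix:local' into a full URI by plain concatenation."""
    prefix, local = dict_ref.split(":", 1)
    return NAMESPACES[prefix] + local


def escaped_local(dict_ref: str) -> str:
    """Percent-encode every character of the local part that is not
    in the URI 'unreserved' set (letters, digits, '-', '.', '_', '~')."""
    return quote(dict_ref.split(":", 1)[1], safe="")


if __name__ == "__main__":
    print(expand("g:energy_QCISD_T_G3Bas1"))
    # The '/' in the next reference silently adds a path segment.
    print(expand("g:energy(QCISD(T)/G3Bas1)"))
    print(escaped_local("g:energy(QCISD(T)/G3Bas1)"))
```

With the underscore form the expanded URI is unambiguous. With the parenthesised form the slash changes the path structure, and once escaped the name is no longer readable either.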

Great to have your interest
P.





--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Oliver Stueker

Apr 1, 2015, 10:36:21 AM
to Quixote mail list
Hi Peter,

Indeed that is a good reason to not use those characters.
I had some suspicion that this was enforced for a good reason, but I had no idea why.
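
Given that restriction, the readable labels from my first message can be mapped onto the allowed characters mechanically. A minimal sketch in Python, assuming we simply collapse every run of disallowed characters into one underscore (the function name is made up for illustration and is not part of JUMBO):

```python
import re

# Anything outside the character class permitted after the colon.
_DISALLOWED = re.compile(r"[^A-Za-z0-9_.\-]+")


def to_local_name(label: str) -> str:
    """Collapse runs of disallowed characters into '_' and trim the ends.

    Note: this does not guarantee that the result starts with a letter,
    which the pattern also requires.
    """
    return _DISALLOWED.sub("_", label).strip("_")


if __name__ == "__main__":
    for label in ("Energy(0K)", "energy(MP2/G3Bas1)", "energy(QCISD(T)/G3Bas1)"):
        print(label, "->", to_local_name(label))
```

The mapping loses information (both "(" and "/" become "_"), so it only goes one way.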

> Great to have your interest

Great to have the existing JUMBO-converters as a starting point.

Since January we have made great progress with the GAMESS-US converters. Once those produce valid CML/CompChem, I will create a pull request to add them upstream.

Best,
Oliver

Marcus D. Hanwell

Apr 1, 2015, 11:28:46 AM
to quixot...@googlegroups.com
Oliver,

Really looking forward to seeing the GAMESS-US progress. It will be interesting to see what the converters look like, and where we might take them in the future.

Marcus

Peter Murray-Rust

Apr 1, 2015, 12:56:47 PM
to Quixote mail list
That's great. We have a new framework (norma/ami); see http://github.com/ContentMine/ami-plugin.git. This should make processing of large amounts of material easier.

Also I would be interested in helping people build input tools from CML.

And finally is anyone interested in extracting comp chem from the literature?

Karol Langner

Apr 1, 2015, 2:08:53 PM
to quixot...@googlegroups.com
Hi Peter,

I have been following this discussion and I am quite interested in extracting comp chem data from the literature.

- Karol

Peter Murray-Rust

Apr 2, 2015, 4:26:48 AM
to Quixote mail list
Excellent. What sort of data?

Karol Langner

Apr 2, 2015, 9:27:53 AM
to quixot...@googlegroups.com
Hi Peter,

Mainly basic structural and energetic data at this point. The sort of thing that the JUMBO converters and cclib parse, I suppose.

- Karol

Peter Murray Rust

Apr 2, 2015, 12:12:46 PM
to quixot...@googlegroups.com, quixot...@googlegroups.com
JUMBO does log files; ami does text, e.g. the full-text PDF. What would you like from the full text?



Karol M. Langner

Apr 2, 2015, 12:41:05 PM
to quixot...@googlegroups.com
I understand that, and I think getting the same types of information from the two sources is interesting. To be specific, this article contains loads of data in the article itself and in the supporting information, but we don't have access to the raw log files:
http://onlinelibrary.wiley.com/doi/10.1002/qua.24903/suppinfo

So, I see the two approaches as complementary, which could be combined somewhere down the road.

- Karol
written by Karol M. Langner
Thu Apr 2 12:31:47 EDT 2015

Peter Murray Rust

Apr 2, 2015, 1:06:03 PM
to quixot...@googlegroups.com, quixot...@googlegroups.com
OK, send a short example or post links.


Karol M. Langner

Apr 2, 2015, 1:40:21 PM
to quixot...@googlegroups.com
Examples for what?
Thu Apr 2 13:37:35 EDT 2015

Peter Murray Rust

Apr 2, 2015, 3:25:08 PM
to quixot...@googlegroups.com
Papers from which you need stuff extracted


Karol M. Langner

Apr 2, 2015, 3:37:48 PM
to quixot...@googlegroups.com
I have no specific need at the moment. I am thinking more about the process in general.
Thu Apr 2 15:34:10 EDT 2015