Parsing XML files


Ben Ward

Jul 5, 2013, 9:49:26 PM
to juli...@googlegroups.com
Hi,

Does Julia, or a library for Julia, provide a convenient way to read data from an XML document? I know little about XML beyond it being a somewhat human-readable format that some of the programs I use require. There's a schema for phyloXML that I'm currently writing functions for, to read the data in and process it, but I'm bringing it in as strings using text IO and working around the PhyloXML format with matching and regex to process the input. I was wondering if there's something that would make this task easier, like a parser that extracts the elements and the data they contain, based on the schema.

Best,
Ben.

Isaiah Norton

Jul 5, 2013, 10:49:44 PM
to juli...@googlegroups.com
Probably the most complete parser/query tool is Amit's expat wrapper. However, I can't tell whether it will let you write documents, if that is important for your application.

I started a libxml2 wrapper and dom-1 implementation built on that. The wrapper is complete, but the dom part needs some more work to be usable (mostly writing C accessors for struct elements that do not have accessors in the library, and which we can't get to from Julia yet).

Amit Murthy

Jul 6, 2013, 6:34:58 AM
to juli...@googlegroups.com
LibExpat is just a parser at this stage. I cannot vouch for its performance yet, since the parser creates an "element tree" during the parse and search queries are run against that tree.

A stream parsing interface that will avoid the creation of the element tree is on my todo list.

Ben Ward

Jul 6, 2013, 10:00:16 AM
to juli...@googlegroups.com
Thanks, I'll give LibExpat a go. If I run into trouble, it may be easier to continue with my own parsing function written specifically for PhyloXML.

Best,
Ben.

Ben Ward

Jul 6, 2013, 11:07:43 AM
to juli...@googlegroups.com
Hi, so I'm giving LibExpat a go to see if I can make my PhyloXML parsing job easier. So far, I do the following with the file on my desktop containing the XML:

using LibExpat
filepath = "~/Desktop/phyxml"
instream = open(expanduser(filepath))
instring = readall(instream)
close(instream)
xmltree = xp_parse(instring)

This works and I get my xmltree variable which looks like:

julia> xmltree
    <phylogeny rooted="true">
        <name>Alcohol dehydrogenases</name>
        <description>contains examples of commonly used elements</description>
        <clade>
            <events>
                <speciations>1</speciations>
            </events>
            <clade>
                <taxonomy>
                    <id provider="ncbi">6645</id>
                    <scientific_name>Octopus vulgaris</scientific_name>
                </taxonomy>
                <sequence>
                    <accession source="UniProtKB">P81431</accession>
                    <name>Alcohol dehydrogenase class-3</name>
                </sequence>
            </clade>
            <clade>
                <confidence type="bootstrap">100</confidence>
                <events>
                    <speciations>1</speciations>
                </events>
                <clade>
                    <taxonomy>
                        <id provider="ncbi">1423</id>
                        <scientific_name>Bacillus subtilis</scientific_name>
                    </taxonomy>
                    <sequence>
                        <accession source="UniProtKB">P71017</accession>
                        <name>Alcohol dehydrogenase</name>
                    </sequence>
                </clade>
                <clade>
                    <taxonomy>
                        <id provider="ncbi">562</id>
                        <scientific_name>Escherichia coli</scientific_name>
                    </taxonomy>
                    <sequence>
                        <accession source="UniProtKB">Q46856</accession>
                        <name>Alcohol dehydrogenase</name>
                    </sequence>
                </clade>
            </clade>
        </clade>
    </phylogeny>
</phyloxml>

Say now I want to work through the clade elements (tree structure in PhyloXML is written recursively, so child <clade> elements are written within the <clade> element that is their parent). I want to get the ETree of the first <clade> element. I've tried the following:

xmltree[xpath"phyloxml/phylogeny/clade"]

I tried this because the first <clade> element comes under the <phyloxml> element and the <phylogeny> element, but I get an empty array returned. I have read the XPath page on Wikipedia to understand the syntax, and my string should find the clade elements under the parent phylogeny, which is itself under the parent phyloxml. I feel like I'm missing something, though, if I'm getting a zero-element array back.

Best,
Ben.

Amit Murthy

Jul 6, 2013, 12:11:44 PM
to juli...@googlegroups.com
xmltree[xpath"/phyloxml/phylogeny/clade"] should work.

Amit Murthy

Jul 6, 2013, 12:15:43 PM
to juli...@googlegroups.com
Or xmltree[xpath"phylogeny/clade"]
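One way to see why the two suggestions above match while the original query returned nothing: if the context node is the <phyloxml> root element itself, a relative path starting with phyloxml looks for a child named phyloxml, and there is none. The sketch below is a toy model in plain Julia, assuming a current Julia 1.x release. The Node type and the add!, descend and resolve functions are hypothetical names made up for illustration; this is not LibExpat's ETree or its XPath implementation.

```julia
# Toy element: a name, child elements, and a link to the parent.
# Hypothetical stand-in for illustration, not LibExpat's ETree.
mutable struct Node
    name::String
    children::Vector{Node}
    parent::Union{Node,Nothing}
end

Node(name::AbstractString) = Node(String(name), Node[], nothing)

# Attach `child` under `parent` and return the child.
function add!(parent::Node, child::Node)
    child.parent = parent
    push!(parent.children, child)
    return child
end

# Follow a list of child names downward from a set of starting nodes.
function descend(start::Vector{Node}, names)
    current = start
    for name in names
        next = Node[]
        for node in current, child in node.children
            child.name == name && push!(next, child)
        end
        current = next
    end
    return current
end

# Resolve a path made only of child steps against the context node `ctx`.
function resolve(ctx::Node, path::AbstractString)
    names = split(path, '/'; keepempty=false)
    if startswith(path, "/")
        # Absolute path: walk up to the top element, which the first step must name.
        top = ctx
        while top.parent !== nothing
            top = top.parent
        end
        (isempty(names) || top.name != names[1]) && return Node[]
        return descend(Node[top], names[2:end])
    end
    # Relative path: every step names a child of the current node.
    return descend(Node[ctx], names)
end

# Minimal tree shaped like the start of the document in this thread.
root = Node("phyloxml")
phylogeny = add!(root, Node("phylogeny"))
add!(phylogeny, Node("clade"))

relative_miss = resolve(root, "phyloxml/phylogeny/clade")   # no child called phyloxml
absolute_hit  = resolve(root, "/phyloxml/phylogeny/clade")  # starts above the root element
relative_hit  = resolve(root, "phylogeny/clade")            # starts at the root element
```

In this toy model the first query is empty and the other two each find the single clade, which mirrors the behaviour reported in the thread.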

Amit Murthy

Jul 6, 2013, 12:16:57 PM
to juli...@googlegroups.com
In case you do not want to read up on XPath, the "find" API has limited functionality but is good enough for simple use cases. It is documented in the README.

Ben Ward

Jul 6, 2013, 12:58:16 PM
to juli...@googlegroups.com
Hi, thanks, I've found both useful. If I have the following:

julia> speciations = find(allClades[i], "events/speciations/")
1-element ETree Array:
 <speciations>1</speciations>

How do I get the value 1 from between the two tags? I don't see the value in the elements array of the ETree (where I first guessed it might be) or in the attr variable.

Best,
Ben.

Jameson Nash

Jul 6, 2013, 3:03:37 PM
to juli...@googlegroups.com
It should be in the elements array of speciations. (The only item?)

From within xpath, you should be able to apply various functions from
the XPath1 spec:
xmltree[xpath"//clade"][xpath"number(events/speciations)"]

In this example, number will convert the string-value of the first
element in the list of nodes matching events/speciations to a number
(float), and will do this for all clade nodes, in document order.

Ben Ward

Jul 6, 2013, 4:51:02 PM
to juli...@googlegroups.com
If I try that I get an error:

julia> currentClade[xpath"number(events/speciations)"]
ERROR: no method colon(Float64,DataType)
 in xpath_expr at /Users/wardb/.julia/LibExpat/src/xpath.jl:782
 in getindex at /Users/wardb/.julia/LibExpat/src/xpath.jl:1240

However, the following works for me:

speciations = int32(currentClade[xpath"events/speciations"][1].elements[1])

Amit Murthy

Jul 7, 2013, 2:41:00 AM
to juli...@googlegroups.com
Or, using "find" from the root node:

find(xmltree, "/phyloxml/phylogeny/clade/clade[2]/events/speciations#string")

This will give you the text portion of the XML tag.

Jameson Nash

Jul 7, 2013, 3:10:52 AM
to juli...@googlegroups.com
@ben ward: I think I fixed that a few days ago, it should be fixed if
you do Pkg.update() (I mistyped one of my `::` type assertions as a
`:`, which then got interpreted as a strange attempt at making a
range). The alternatives given are probably just as good.
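
The slip described above can be reproduced in plain Julia without LibExpat. This is only a sketch, assuming a current Julia 1.x release, where the failure surfaces as a MethodError rather than the 2013 message "no method colon(Float64,DataType)". The variable names are made up for illustration.

```julia
x = 9.0

# `::` is a type assertion: it returns x unchanged if x is a Float64.
asserted = x::Float64

# `:` builds a range when both endpoints are numbers.
r = 1:3

# With a single colon, `x:Float64` asks for a range from a number to a type,
# which has no matching method, so the call throws.
range_failed = try
    x:Float64
    false
catch err
    err isa MethodError
end
```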

Ben Ward

Jul 8, 2013, 9:04:31 AM
to juli...@googlegroups.com
Hi,

I'll do an update once I've finished writing some other things, then see whether it works and change my code accordingly.

Best,
Ben.

Ben Ward

Jul 8, 2013, 4:30:04 PM
to juli...@googlegroups.com
Hi Guys,

So I made sure I was up to date and found that the number() function does indeed work. A quick question: is it possible to specify what type of number you expect? Say the documentation of the XML language I am writing the parser for says the value I'm getting is an xs:nonNegativeInteger [1].
Can I specify that I want it as an integer? At the moment, if I use number() in the XPath to retrieve the value (in this case the integer 9), I get the value 9.0 for my Julia variable. It's not a major thing because I can convert it to an int, checking first that the value is not NaN, which as far as I know is of type Float64, and you can't do int64(NaN). The nice thing is that if I read a file and the value does not exist in the XML file, NaN is filled in for whatever is missing, so it acts as a convenient NA or no-value indicator for my purposes.
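
The check-then-convert step described above might look like the following in current Julia. It is a minimal sketch, assuming Julia 1.x (where the conversion is spelled Int(x) rather than int64(x)). The function name to_nonnegative_int is hypothetical, and returning nothing for a missing value is just one possible convention.

```julia
# Convert the Float64 produced by an XPath number() call to an Int.
# NaN (value absent from the file) is mapped to `nothing`.
function to_nonnegative_int(x::Float64)
    isnan(x) && return nothing
    # Reject values that cannot be an xs:nonNegativeInteger.
    isinteger(x) || throw(ArgumentError("expected a whole number, got $x"))
    x < 0 && throw(ArgumentError("expected a non-negative number, got $x"))
    return Int(x)
end

present = to_nonnegative_int(9.0)   # value found in the file
absent  = to_nonnegative_int(NaN)   # value missing from the file
```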

Best,
Ben.

Jameson Nash

Jul 8, 2013, 11:46:18 PM
to juli...@googlegroups.com
I haven't implemented XML namespaces or XPath2 (and don't plan to),
since I don't need them and it's perfectly reasonable (and faster) to
do the conversions in Julia. The number function is implemented as
very nearly xpath"number(a/b/c)" =
parsefloat(xpath"string(a/b/c)"), which means you can just use the int
function on the string-value of the node (via the string() function,
or any of the other accessors).
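
For readers on a current Julia release, the same idea can be sketched with Base functions only. The parsefloat and int spellings above are from 2013; in Julia 1.x the equivalents are parse(Float64, s) and parse(Int, s). The variable text below is a placeholder for the string-value of a node, however it was obtained.

```julia
# Placeholder for the string-value of a node such as <speciations>9</speciations>.
text = "9"

as_float = parse(Float64, text)   # roughly what number() yields for this string
as_int   = parse(Int, text)       # skip the float and parse an integer directly

# tryparse returns `nothing` instead of throwing when the text is not an integer,
# which can serve as the "no value" marker for a missing element.
no_value = tryparse(Int, "")
```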