Importing XML into neo4j

kodo

Jun 14, 2012, 5:53:21 AM
to Neo4j
Hi!

I wonder if there are any documents describing how to go about
importing XML into neo4j? I have an XML structure where each "object"
is represented by an arbitrary set of "attributes" (really a combination
of XML attributes and sub-elements). As part of the object structure,
there are relationship elements containing "pointers" (ids) to other
objects in the same XML structure.

I will probably use an ETL tool to transform my generic XML into
another XML format that conforms better to my domain model...

I'd very much appreciate hearing from anybody who has been in the same
situation as myself.

Cheers

Nigel Small

Jun 14, 2012, 5:56:01 AM
to ne...@googlegroups.com
It's only experimental at this stage but you can throw a sample of data at my conversion app to see if it works for you...

http://geoff.nigelsmall.net/xml2graph/

Cheers
Nige


Michael Hunger

Sep 17, 2012, 7:09:10 PM
to ne...@googlegroups.com
Nikolai,

If you want to preserve order, you have to add additional links in your db that represent that order (or add numeric positions, which is less preferable).
If you want to keep namespace info, you can:

  1. add a property for the namespace and add elements to a namespace index
  2. relate elements to a namespace node
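
Both namespace options can be sketched in a few lines. This is only an illustration in Python, with the graph modelled as plain lists and dictionaries rather than a real Neo4j database; the names `split_tag`, `import_with_namespaces` and the `IN_NAMESPACE` relationship type are made up for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def split_tag(tag):
    """Split ElementTree's '{uri}local' tag form into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag


def import_with_namespaces(xml_text):
    nodes = []                    # element nodes, each a property dict
    ns_index = defaultdict(list)  # option 1: namespace URI -> element node ids
    ns_nodes = {}                 # option 2: one namespace node per URI
    rels = []                     # (element node id, type, namespace URI)
    for elem in ET.fromstring(xml_text).iter():
        uri, local = split_tag(elem.tag)
        node_id = len(nodes)
        # Option 1: keep the namespace as a property and index by it.
        nodes.append({"name": local, "namespace": uri})
        ns_index[uri].append(node_id)
        # Option 2: relate the element to a shared namespace node.
        ns_nodes.setdefault(uri, {"uri": uri})
        rels.append((node_id, "IN_NAMESPACE", uri))
    return nodes, ns_index, ns_nodes, rels
```

In a real database the two options would normally be alternatives; the sketch fills in both side by side only so they can be compared.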

What is your code so far? Perhaps you can share it and we can give you some feedback?

Michael

On 17.09.2012 at 11:40, Nikolai Varankine wrote:

Nigel, 

I tried your application but got the error "A parse error has occured - please check your XML". My XML is still being developed and has a very complicated schema; for example, multiple namespaces are widely used.

I simply want to import XML into a neo4j embedded database from within a Java app. I found no publicly available solution and ran into a couple of problems: the order of elements is lost in the db, and namespace-aware attributes require additional structure.

Could you please say a few words, or share some ideas or web links on this subject?

Thanks

Duane Nickull

Sep 17, 2012, 10:16:47 PM
to ne...@googlegroups.com
Nikolai:

Since XML is generally non-deterministic in terms of the order in which elements occur and is structured as a hierarchy, the correct assumption is that preserving the order is not feasible, nor should it be attempted.  This is written into the core XML specification AFAIK.

In your case however, a general parsing error has occurred.  What library are you using to parse the XML?  If using Xerces, then you will want to understand why the error is being thrown.  I can help if you send me a copy of your XML.  If you can write a simple test to parse it and it passes but does not in the context of your other code, tracking down the error should be easy.

After that, the process is simple:
  1. Parse the XML, catching any errors;
  2. Tokenize all the objects (the XML Infoset provides a list of these objects);
  3. Write the logic to insert the objects you want to preserve in Neo.
Since yours is a parse error, the stack trace from the error would be useful.  Also – have you validated that your XML is in fact well formed (or conforms to a schema/DTD)?  
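
The three steps above can be sketched as follows. This is a minimal illustration, not the Xerces or JAXB route mentioned in this message: it uses the parser from Python's standard library and keeps the "graph" in plain lists instead of Neo4j. The function names and the `_tag` / `_text` property names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def parse_or_report(xml_text):
    """Step 1: parse, turning a parse error into a readable message."""
    try:
        return ET.fromstring(xml_text), None
    except ET.ParseError as err:
        line, column = err.position
        return None, f"not well-formed at line {line}, column {column}: {err}"


def to_graph(root):
    """Steps 2 and 3: walk the elements and record nodes and relationships."""
    nodes, rels = [], []

    def visit(elem, parent_id):
        node_id = len(nodes)
        props = dict(elem.attrib)      # XML attributes become node properties
        props["_tag"] = elem.tag
        text = (elem.text or "").strip()
        if text:
            props["_text"] = text
        nodes.append(props)
        if parent_id is not None:
            rels.append((parent_id, "CHILD", node_id))
        for child in elem:
            visit(child, node_id)

    visit(root, None)
    return nodes, rels
```

Checking the error message and position from the first step before any database code is involved makes it easier to tell a malformed document from a bug in the import logic.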

If you are in Java, JAXB might be useful too.  It is great at unmarshalling XML into Java objects.  We are developing a mobile forms solution based on Neo4J and XML. 

Cheers!

Duane Nickull
***********************************
Technoracle Advanced Systems Inc.
Consulting and Contracting; Proven Results!
i.  Neo4J, PDF, Java, LiveCycle ES, Flex, AIR, CQ5 & Mobile
t.  @duanechaos
"Don't fear the Graph!  Embrace Neo4J"



RickBullotta

Sep 18, 2012, 7:28:09 AM
to ne...@googlegroups.com, du...@technoracle-systems.com
Duane, while I agree that XML is "in theory" not deterministic regarding the ordering of elements, "in practice" the ordering of child elements implies a specific ordering.  Otherwise, structures such as an array (typically rendered as child elements) could not be represented easily without additional hacks such as an index attribute (though these are necessary for data structures such as sparse arrays).  That said, this "hack" is basically required for representing the data in a graph, since unfortunately there is no concept of ordering of relationships (which kinda sucks).  Overall, we have found the lack of ordering, and of a clean/performant way to maintain indices, to be one of the major limitations of graph DBs.  

Lasse Westh-Nielsen

Sep 18, 2012, 8:01:44 AM
to ne...@googlegroups.com
I agree with Rick. In the specific case of XML, order matters for elements (not for attributes though).

So a good test would be: can I round-trip XML to Neo and back again, i.e. can I reconstruct the XML (if not syntactically then at least semantically)?
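
One way to phrase such a round-trip check is to compare canonical forms. The sketch below is in Python (3.8 or later, for `ET.canonicalize`); `to_graph` and `from_graph` are placeholders for whatever import and export code is being tested, not existing functions.

```python
import xml.etree.ElementTree as ET


def semantically_equal(xml_a, xml_b):
    """Compare two documents by their canonical (C14N 2.0) form.

    Canonicalisation sorts attributes and normalises quoting,
    but it keeps child elements in document order.
    """
    return ET.canonicalize(xml_a) == ET.canonicalize(xml_b)


def round_trips(xml_text, to_graph, from_graph):
    """True if exporting the imported graph yields equivalent XML."""
    return semantically_equal(xml_text, from_graph(to_graph(xml_text)))
```

With this comparison, `<a x="1" y="2"/>` and `<a y='2' x='1'/>` count as equal, while swapping two child elements makes the check fail, which matches the distinction between elements and attributes made above.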

Lasse




Duane Nickull

Sep 18, 2012, 8:01:24 PM
to ne...@googlegroups.com
Rick:

I know you are a huge supporter of Neo4J, and the only reason I am responding is that I wanted to make sure others who read this don't think this is a shortcoming of Neo4J.   My (emphasis here on "my") opinion is that the order is not part of the XML processing model. This has been echoed a lot by others (http://www.ibm.com/developerworks/xml/library/x-eleord/index.html).   In practice, this does of course happen a lot, because many SAX implementations read and pass along elements in a FIFO manner and DOMs commonly implement a way to preserve this.  In an attempt to be helpful: one of the problems I have encountered a lot with XML is that it is commonly not designed from a good model.  If the order is important, the model should capture this, and the XML expression of instances of that model can then implement the mechanism to preserve it.  If the order of an element is required (sounds like it is), an attribute that denotes the place in such a list would be better. 

I just wanted to state this so no one thought it was a shortcoming of Neo4J. One way to solve this would be to add extra attributes (order="001") to the XML, which could then be used to determine order and would allow the round trip to work seamlessly.   I am not sure which project you are working on, but would this be a viable solution?
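
A rough Python sketch of the order-attribute idea, using only the standard library. The attribute name `order` and the three-digit padding follow the example in this message; the helper names `add_order` and `restore_order` are invented for illustration.

```python
import xml.etree.ElementTree as ET


def add_order(root, attr="order"):
    """Stamp every child element with its 1-based position among its siblings."""
    for parent in list(root.iter()):
        for position, child in enumerate(parent, start=1):
            child.set(attr, f"{position:03d}")
    return root


def restore_order(root, attr="order"):
    """Re-sort the children of every element by the stored position."""
    for parent in list(root.iter()):
        parent[:] = sorted(parent, key=lambda child: int(child.get(attr, "0")))
    return root
```

The position then travels with each element as an ordinary property, so an export step can sort on it even if the store returns the children in arbitrary order.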

One other hack would be to intercept SAX events at a very low level in the parser and timestamp them, then pass that into Neo with the time included for ordering.
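
A small sketch of that hack with Python's standard `xml.sax` module. Note one deliberate substitution: it stamps each start-element event with an increasing counter rather than a wall-clock timestamp, since two events can share the same clock reading. The class and function names are hypothetical.

```python
import itertools
import xml.sax


class SequencingHandler(xml.sax.ContentHandler):
    """Record every start-element event together with a sequence number."""

    def __init__(self):
        super().__init__()
        self._counter = itertools.count(1)
        self.events = []

    def startElement(self, name, attrs):
        # The sequence number stands in for the timestamp and can be
        # stored as a property on the node created for this element.
        self.events.append((next(self._counter), name, dict(attrs.items())))


def sequence_elements(xml_text):
    handler = SequencingHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events
```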

Duane Nickull

Duane Nickull

Sep 18, 2012, 8:05:04 PM
to ne...@googlegroups.com
Lasse:

On top of my response to Rick, I would point out that XML has nothing to do with semantics.  Element names may imply semantic principles, and this is where some argue that XML therefore instills pragmatics (J. Sowa et al., Ontolog Forum).  XML is not considered semantic in nature by most ontologists or by the authors of the specifications.

There is however a lot of really great work going on with RDF graphs and XML expressions (http://www.w3.org/DesignIssues/RDF-XML.html) that might be interesting to this conversation.

Cheers (and beers one day at a global neo conference!).

Duane

Dmitriy Shabanov

Sep 19, 2012, 1:12:20 AM
to ne...@googlegroups.com
Hello,

I have a simple question: why do you want to use graph storage for XML? There are a lot of native XML stores that are optimized for that and show much better performance than graph storage can.
--
Dmitriy Shabanov

Rick Bullotta

Sep 19, 2012, 8:57:24 AM
to ne...@googlegroups.com
Actually, ordering can be implemented in a graph DB using tree techniques, without the need for decorating nodes or relationships with additional attributes.  In fact, using the attribute approach requires the equivalent of "a full table scan" to walk the ordered list or to get the first "n" or last "n".  Indices are another approach, but that would not fit well for ordering a localized set of nodes (it would require many indices).  In general, I think a sorted tree model is probably the only workable approach for maintaining locally ordered node sets.  It also has the advantage of not creating a node "hot spot" (e.g. a single node that has many relationships).

Craig Taverner

Sep 19, 2012, 9:23:21 AM
to ne...@googlegroups.com
What we do for sorted trees is have NEXT relationships between all children of a particular node. Then we only bother with the CHILD relationship to the first child. This removes the super-node issue Rick hinted at below. You can also write traversal descriptions that understand how to traverse trees made of only CHILD relationships or mixed CHILD-NEXT relationships, supporting both styles.

In some models we have FIRST and LAST relationships, depending on what kind of queries we need to perform on the tree.
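
A minimal sketch of the CHILD-to-first-child plus NEXT-between-siblings layout, written in Python with an in-memory dictionary standing in for the database. It illustrates the structure only and is not the traversal code described here; `import_ordered` and `children` are names chosen for the example.

```python
import xml.etree.ElementTree as ET


def import_ordered(root):
    """Store tags as nodes; link parent -> first child and sibling -> sibling."""
    nodes = []   # node id -> tag
    rels = {}    # (start node id, relationship type) -> end node id

    def visit(elem):
        node_id = len(nodes)
        nodes.append(elem.tag)
        previous = None
        for child in elem:
            child_id = visit(child)
            if previous is None:
                rels[(node_id, "CHILD")] = child_id   # only the first child
            else:
                rels[(previous, "NEXT")] = child_id   # chain the siblings
            previous = child_id
        return node_id

    visit(root)
    return nodes, rels


def children(node_id, rels):
    """Yield child ids in document order by following CHILD, then NEXT."""
    current = rels.get((node_id, "CHILD"))
    while current is not None:
        yield current
        current = rels.get((current, "NEXT"))
```

Each parent carries a single outgoing CHILD relationship however many children it has, which is how this layout avoids the super-node problem. A LAST pointer could be added in the same way for queries that start from the end of the list.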

Another model we use is when we have two or more tree structures relating to the same data nodes, and then the tree is composed of proxy nodes that reference original data. So one particular data point can be found by traversing any of a number of trees. The value of this is largely in that the trees then form a kind of domain specific index. So, for example, one tree might describe the physical composition of things, while another might be more of a logical composition.

Duane Nickull

Sep 19, 2012, 11:32:40 AM
to ne...@googlegroups.com
While it is not my ambition to do this, my answer is that both Peter and Lasse
have probably made the (correct) assumption that Neo stores data as
key-value pairs much more efficiently than many DOM-based native XML DBs.
My company, XML Global, used to have one of the first XML DBs back in
1998. To me a tree is a specialized type of graph.

I would presume (hopefully someone will correct me if I am wrong) that
querying XML is less efficient than traversing graph nodes, given the
requirement to hold the entire tree in memory. A tree is also a
specialized type of graph IMO, and storing elements as nodes seems to make
sense (caveat - I have not done it personally). Querying for a specific
node over multiple instances of a tree structure would probably not scale
as well either.

I would love to hear from Peter on this. It would make a good topic for a
tutorial.

Duane Nickull

***********************************
Technoracle Advanced Systems Inc.
Consulting and Contracting; Proven Results!
i. Neo4J, PDF, Java, LiveCycle ES, Flex, AIR, CQ5 & Mobile
b. http://technoracle.blogspot.com
t. @duanechaos
"Don't fear the Graph! Embrace Neo4J"






Duane Nickull

Sep 19, 2012, 11:34:45 AM
to ne...@googlegroups.com
Very cool. This makes a lot of sense.

Thank you.

D

Dmitriy Shabanov

Sep 19, 2012, 4:56:47 PM
to ne...@googlegroups.com
On Wed, Sep 19, 2012 at 8:32 PM, Duane Nickull <du...@technoracle-systems.com> wrote:
> While it is not my ambition to do this, my answer is that both Peter and Lasse
> have probably made the (correct) assumption that Neo stores data as
> key-value pairs much more efficiently than many DOM-based native XML DBs.
> My company, XML Global, used to have one of the first XML DBs back in
> 1998. To me a tree is a specialized type of graph.

A lot has changed since that time :-) There is a huge set of native XML DBs out there. Give them a try.

> I would presume (hopefully someone will correct me if I am wrong) that
> querying XML is less efficient than traversing graph nodes, given the
> requirement to hold the entire tree in memory.  A tree is also a
> specialized type of graph IMO, and storing elements as nodes seems to make
> sense (caveat - I have not done it personally).  Querying for a specific
> node over multiple instances of a tree structure would probably not scale
> as well either.
>
> I would love to hear from Peter on this.  It would make a good topic for a
> tutorial.

At the Animo project, we implemented "ordered trees" with a bdb-index, using https://github.com/animotron/core/blob/master/src/main/java/org/animotron/expression/StAXExpression.java to store and https://github.com/animotron/core/blob/master/src/main/java/org/animotron/graph/traverser/MLResultTraverser.java to stream back.

The same stuff exists for JSON: https://github.com/animotron/core/blob/master/src/main/java/org/animotron/expression/JSONExpression.java and MLResultTraverser.java (link above).

That is the fastest we were able to get, but the problem is that it is many times slower than eXist-db (tested on a 200 MB XML file), and querying also fails: it is very slow. Right now we are using it to solve quite complex problems with the help of animo technologies (a combination of small JSON or XML messages and traversals to detect conflicts and resolve dependencies, plus calculations with global caching). So we are using a graph because of the second part, not because of XML and querying it.

--
Dmitriy Shabanov

Nikolai Varankine

Sep 23, 2012, 8:20:03 AM
to ne...@googlegroups.com
Dmitriy,

XML is meant as a data exchange format for my application, namely for import/export to and from a local repository. Data processing is performed through a faster interface to this abstract storage. Neo4j sounds like a good implementation for it.

Thanks for the idea!

Nikolai Varankine

Sep 23, 2012, 8:34:52 AM
to ne...@googlegroups.com
Nigel,

Thank you for the ideas! Thanks to everybody who responded! I guess I am a step closer to understanding the problem and the possible solutions.

Nikolay

