Upload GraphMl to Neo4j

Simon Gibson

Mar 20, 2012, 2:09:04 AM

to ne...@googlegroups.com

Guys,

New to Neo4j but seriously looking into it after seeing a talk at the YOW! conference in Brisbane late last year. I really want this to be a solution for us.

I want to upload data to Neo4j using GraphML. At the bottom of this email is a sample of the XML data file. I upload it to Neo4j using Blueprints with code such as:

   try {
      Graph graph = new Neo4jGraph("/home/gib40f/development/aibl/vact/neo4j-db-write");
       InputStream in = new FileInputStream("/home/gib40f/development/aibl/vact/input/graphml-output.xml");
       GraphMLReader reader = new GraphMLReader(graph);
       reader.inputGraph(in);
       in.close();
    } catch (Exception ex) {
      throw new VactException("Error occurred loading/saving neo4j graph db.", ex);
    }

All seems to work a treat. I can then see the structure in Neoclipse. The problem comes about when I want to query using cypher.

I can do something like:

start n = node(3)
return n

and will get a result finding the Subject node. However if I do something like:

start n = node:Subject(name="SS_0002")
return n

I get no result. Is there some linkage missing in my GraphML file? Is there some way in Neo4j to check that it is correctly loaded and configured? It is really weird that I can see the nodes in Neoclipse but the only way I can query them is by using the node's id directly. I get the same query results if I use neo4j-shell.

Thanks Simon

GraphMl
<?xml version='1.0' encoding='UTF-8'?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
    <key id="name" for="node" attr.name="name" attr.type="string" />
    <key id="__type__" for="node" attr.name="__type__" attr.type="string" />
    <key id="value" for="node" attr.name="value" attr.type="string" />
    <graph id="G" edgedefault="directed">
        <node id="0" />
        <node id="1">
            <data key="__type__">Study</data>
            <data key="name">Default Study</data>
        </node>
        <node id="2">
            <data key="__type__">Subject</data>
            <data key="name">SS_0002</data>
        </node>
        <node id="3">
            <data key="__type__">Event</data>
            <data key="name">Collection 1</data>
        </node>
        <edge id="1" source="3" target="1" label="study" />
        <node id="4">
            <data key="__type__">Category</data>
            <data key="name">Demographics</data>
        </node>
        <node id="5">
            <data key="__type__">ItemData</data>
            <data key="name">Classification</data>
            <data key="value">Healthy</data>
        </node>
        <edge id="2" source="5" target="3" label="event" />
        <edge id="3" source="5" target="2" label="owner" />
        <edge id="4" source="5" target="4" label="category" />

    </graph>
</graphml>

Peter Neubauer

Mar 20, 2012, 2:40:43 AM

to ne...@googlegroups.com

Hi Simon,
What you are doing right now is an index lookup; see
http://docs.neo4j.org/chunked/snapshot/query-start.html#start-node-by-index-lookup
Importing a raw GraphML file does not put the node properties into any
indexes; it only loads the data into the graph.

What you want to do is to check for a property, I guess:
http://docs.neo4j.org/chunked/snapshot/query-where.html#where-filter-on-node-property
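
As a rough illustration of the difference (this is not Neo4j code): the sketch below is a toy in-memory model in plain Java. `TinyGraph` and its methods are hypothetical names made up for this example. It only shows that an import which stores properties on nodes, without adding anything to an index, leaves index lookups empty while a property filter still finds the node.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model (not the Neo4j API): nodes hold properties; indexes are separate structures. */
final class TinyGraph {
    // node id -> properties
    private final Map<Long, Map<String, String>> nodes = new LinkedHashMap<>();
    // index name -> ("key=value" -> node ids); only filled when a caller indexes explicitly
    private final Map<String, Map<String, List<Long>>> indexes = new HashMap<>();

    /** Stores the node and its properties; deliberately does NOT touch any index. */
    void importNode(long id, Map<String, String> properties) {
        nodes.put(id, new HashMap<>(properties));
    }

    /** Explicit indexing step that a plain import never performs. */
    void addToIndex(String indexName, String key, long id) {
        String value = nodes.get(id).get(key);
        indexes.computeIfAbsent(indexName, n -> new HashMap<>())
               .computeIfAbsent(key + "=" + value, k -> new ArrayList<>())
               .add(id);
    }

    /** Models an index lookup: it only sees what was explicitly indexed. */
    List<Long> indexLookup(String indexName, String key, String value) {
        return indexes.getOrDefault(indexName, Map.of())
                      .getOrDefault(key + "=" + value, List.of());
    }

    /** Models a property filter: walks every node and compares the property. */
    List<Long> propertyScan(String key, String value) {
        List<Long> hits = new ArrayList<>();
        for (Map.Entry<Long, Map<String, String>> e : nodes.entrySet()) {
            if (value.equals(e.getValue().get(key))) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TinyGraph g = new TinyGraph();
        g.importNode(3, Map.of("__type__", "Subject", "name", "SS_0002"));
        System.out.println("index lookup after import: " + g.indexLookup("Subject", "name", "SS_0002"));
        System.out.println("property scan:             " + g.propertyScan("name", "SS_0002"));
        g.addToIndex("Subject", "name", 3);
        System.out.println("index lookup after indexing: " + g.indexLookup("Subject", "name", "SS_0002"));
    }
}
```

The scan touches every node, which is why it works without any index but also why it gets slow on large graphs.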

HTH

Cheers,

/peter neubauer

G: neubauer.peter
S: peter.neubauer
P: +46 704 106975
L: http://www.linkedin.com/in/neubauer
T: @peterneubauer

Neo4j 1.6 released - dzone.com/6S4K
The Neo4j Heroku Challenge - http://neo4j-challenge.herokuapp.com/

Simon Gibson

Mar 20, 2012, 7:55:01 PM

to ne...@googlegroups.com

Peter,

Thanks for replying. I am really confused by this issue. Maybe some more background might be helpful.

I first tried creating the database using spring-data, persisting annotated POJOs. This worked a treat. The problem came when trying to import 1.5M nodes: it was just too slow, and after running overnight it was still not complete. So I then tried importing a GraphML representation. This seemed to work, but I cannot query anything except by using the actual node id.

The simple example GraphML from the original post imports, and I can see the nodes and relationships in Neoclipse; I just cannot query anything. Here is a trace using neo4j-shell:

neo4j-sh (0)$ start n=node(3) return n
+----------------------------------------------+
| n                                            |
+----------------------------------------------+
| Node[3]{name->"SS_0002",__type__->"Subject"} |
+----------------------------------------------+
1 rows, 1 ms

neo4j-sh (0)$ start n = node:Subject("name:*") where n.name="SS_0002" return n
+---+
| n |
+---+
+---+
0 rows, 1 ms

neo4j-sh (0)$ start n=node:__types__(className="Subject") return n
+---+
| n |
+---+
+---+
0 rows, 0 ms

Is it possibly an index problem? I tried adding the following config when creating the neo db:

      Map<String, String> config = new HashMap<String, String>();
      config.put(Config.NODE_KEYS_INDEXABLE, "name,__type__");
      config.put(Config.RELATIONSHIP_KEYS_INDEXABLE, "category,event,owner,study");
      config.put(Config.NODE_AUTO_INDEXING, "true");
      config.put(Config.RELATIONSHIP_AUTO_INDEXING, "true");
      config.put(Config.DUMP_CONFIGURATION, "true");
      Graph graph = new Neo4jGraph("/home/sgibson/development/aibl/vact/neo4j-db-write", config);

This doesn't seem to make any difference.

I am running short of ideas :)

Simon

Peter Neubauer

Mar 21, 2012, 2:30:33 AM

to ne...@googlegroups.com

Yes Simon,
This is an index problem. As stated, GraphML does not export or import indexes. There is https://github.com/neo4j/neo4j-geoff by Nigel Small to cope with this, or you could set up auto-indexing to index your properties; see http://docs.neo4j.org/chunked/snapshot/rest-api-auto-indexes.html

Does that make sense?

Simon Gibson

Mar 21, 2012, 4:01:02 AM

to ne...@googlegroups.com

Peter,

I did try the auto indexes and that didn't seem to work. I am using blueprints 1.2 so maybe I need to try the 1.2-SNAPSHOT. I will give that a try tomorrow. Not sure what I do with the neo4j-geoff.

Maybe I will just have to try another method of import.

Thanks for help, I appreciate it.

Simon

Nigel Small

Mar 21, 2012, 4:44:19 AM

to ne...@googlegroups.com

Hi Simon

There is a write-up of Geoff at http://geoff.nigelsmall.net/

Cheers

Nige

Michael Hunger

Mar 21, 2012, 12:17:27 PM

to ne...@googlegroups.com

How did you set up your auto-indexes? If you have uniform data it should work reasonably well.

I would love to look into your SDN import issue as well. 1.5M nodes are not too many, even with the overhead of creating additional objects on import. Can you share your SDN domain entity + import code (and the data generator)? Off-list too, if you'd like.

Thanks

Michael

Simon Gibson

Mar 21, 2012, 6:53:59 PM

to ne...@googlegroups.com

Michael,

There is a copy of the test GraphML in the earlier posts; feel free to try it. I have generated the full data set in GraphML but need to have the test case working first. Also, there is private data in the full GraphML, so I cannot share that. It imports into Neo4j OK, but I cannot query it apart from using the actual node ids, which makes things a bit tough. I tried to set up the auto-indexes with the following code:

      Map<String, String> config = new HashMap<String, String>();
      config.put(Config.NODE_KEYS_INDEXABLE, "name,__type__");
      config.put(Config.RELATIONSHIP_KEYS_INDEXABLE, "category,event,owner,study");
      config.put(Config.NODE_AUTO_INDEXING, "true");
      config.put(Config.RELATIONSHIP_AUTO_INDEXING, "true");
      config.put(Config.DUMP_CONFIGURATION, "true");
      Graph graph = new Neo4jGraph("/home/sgibson/development/aibl/vact/neo4j-db-write", config);

This doesn't appear to work. I was using blueprints 1.2 so I might try with the latest snapshot. The generated neo4j database is zipped up at:

http://dl.dropbox.com/u/27651004/neo4j-db.tar.gz

if you want to look at that.

Looks like I will have to try the bulk importer next.

Thanks

Simon

Michael Hunger

Mar 21, 2012, 7:53:09 PM

to ne...@googlegroups.com

Simon,

You can use, for instance, Luke to look at your index: http://code.google.com/p/luke/downloads/detail?name=lukeall-3.5.0.jar&can=2&q=

If you do that, you can see that everything is there.

However, the auto-index is named node_auto_index,

so with Cypher you would do:

start n = node:node_auto_index(name="SS_0002") return n

The __type__ property also went into that index.

Unfortunately, the Neo4j auto-index is a bit limited for what you want to do here.

It might be easier to set it up with the TinkerPop Blueprints auto-indexing: https://github.com/tinkerpop/blueprints/wiki/Graph-Indices

to generate all the data structures that SDN might need to work (or you can also configure the NoopTypeRepresentationStrategy if you don't want to store type information in the graph).

You have to create the following indexes:

Event, ItemData, Study, Subject, Category

probably each with the correct name properties indexed for the correct nodes.

Each node also needs to be indexed in the index named __types__, with the key className and the fully qualified class name as the value, e.g. com.example.domain.Event

(or, if you use @TypeAlias("Event") @NodeEntity class Event {....}, only "Event" is stored for the node in the index). So in the __types__ index the single key "className" has every node indexed under the value of its class name or alias.
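
A minimal sketch of the index entries described above, for a single node. `SdnIndexEntries` and `entriesFor` are hypothetical helper names, not part of SDN or Neo4j, and the per-type index on `name` is an assumption based on the "probably" above. The `classNameOrAlias` argument stands for either the fully qualified class name or the @TypeAlias value.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: lists the (index, key, value) entries one node would need. */
final class SdnIndexEntries {

    /** One index entry; a plain value holder for this sketch. */
    static final class Entry {
        final String index;
        final String key;
        final String value;

        Entry(String index, String key, String value) {
            this.index = index;
            this.key = key;
            this.value = value;
        }

        /** Renders the entry in the same shape as a Cypher index lookup. */
        @Override
        public String toString() {
            return index + "(" + key + "=\"" + value + "\")";
        }
    }

    /**
     * @param typeIndex        name of the per-type index, e.g. "Subject"
     * @param classNameOrAlias FQN such as com.example.domain.Subject, or the @TypeAlias value
     * @param name             value of the node's name property
     */
    static List<Entry> entriesFor(String typeIndex, String classNameOrAlias, String name) {
        List<Entry> entries = new ArrayList<>();
        // Assumed: the node's name goes into the index named after its type.
        entries.add(new Entry(typeIndex, "name", name));
        // The shared type index: single key "className", value = class name or alias.
        entries.add(new Entry("__types__", "className", classNameOrAlias));
        return entries;
    }

    public static void main(String[] args) {
        for (Entry e : entriesFor("Subject", "Subject", "SS_0002")) {
            System.out.println("node:" + e);
        }
    }
}
```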

HTH

Michael

Peter Neubauer

Mar 22, 2012, 11:51:26 AM

to ne...@googlegroups.com

Hi guys,
just updated the GraphML example to show that auto-indexing is
working, for your reference.

http://docs.neo4j.org/chunked/snapshot/gremlin-plugin.html#rest-api-load-a-sample-graph

Cheers,

/peter neubauer

Simon Gibson

Apr 4, 2012, 4:10:14 AM

to ne...@googlegroups.com

As an update for closure I have tried with the batchinserter and that seems to work really well. It inserts the 1.5M nodes in just under a minute. May even be able to tweak that some more but at 1 minute that is satisfactory. I created and added to the indexes as I went and searches are now working.
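
The batch-inserter code itself is not shown in the thread. Purely as a sketch of the "index as you go" idea, the following plain-Java example streams through GraphML with the JDK's StAX parser and fills a per-type name index while reading. `GraphMlIndexSketch` is a hypothetical name, the index is an ordinary in-memory map rather than a Neo4j/Lucene index, and it stores the GraphML node id where a real import would store the id returned when the node is created.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Sketch: build type -> (name -> node id) while streaming through GraphML. */
final class GraphMlIndexSketch {

    static Map<String, Map<String, String>> indexByTypeAndName(String graphMl) {
        Map<String, Map<String, String>> index = new HashMap<>();
        try {
            XMLStreamReader xml = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(graphMl));
            String nodeId = null; // id of the <node> currently open, if any
            String type = null;   // its __type__ value, once seen
            String name = null;   // its name value, once seen
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String tag = xml.getLocalName();
                    if (tag.equals("node")) {
                        nodeId = xml.getAttributeValue(null, "id");
                        type = null;
                        name = null;
                    } else if (tag.equals("data") && nodeId != null) {
                        String key = xml.getAttributeValue(null, "key");
                        String text = xml.getElementText(); // reads up to </data>
                        if ("__type__".equals(key)) {
                            type = text;
                        } else if ("name".equals(key)) {
                            name = text;
                        }
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && xml.getLocalName().equals("node")) {
                    // The node is complete: index it now, before moving on.
                    if (type != null && name != null) {
                        index.computeIfAbsent(type, t -> new HashMap<>()).put(name, nodeId);
                    }
                    nodeId = null;
                }
            }
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not parse GraphML", e);
        }
        return index;
    }

    public static void main(String[] args) {
        String sample = "<graphml><graph id=\"G\">"
                + "<node id=\"2\"><data key=\"__type__\">Subject</data>"
                + "<data key=\"name\">SS_0002</data></node>"
                + "</graph></graphml>";
        System.out.println(indexByTypeAndName(sample));
    }
}
```

Nodes without a `__type__` or `name` (such as the empty node 0 in the sample file) are simply skipped by this sketch.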

I suspect that I need to do some more homework and determine whether my graph is structured in the best way for our application. The searches are taking longer than I would have hoped. The structure we have is a bunch of data items (i.e. nearly all nodes) that have relationships to subjects, events and categories. I need to do searches retrieving nodes filtered on events, subjects and categories. I have been using Cypher and it seems to work but is slow. Maybe an indexing issue.

I will continue playing around with it but would welcome any tips and ideas.

Thanks for all your help so far.

Simon

Peter Neubauer

Apr 4, 2012, 4:15:13 AM

to ne...@googlegroups.com

Simon,
do you know if the index searches are slow or the traversals? Could
you isolate them by doing something like

start n = node:index_name(...) return count(n)

and then a traversal and compare the times? Also, make sure you have a
warm graph, warming up with something like

start n = node(*) return count(n)

or so ...
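
One way to make such a comparison less noisy is to run each query a few times untimed before measuring. The helper below is a generic sketch in plain Java; `WarmTiming` is a hypothetical name, and the `Supplier` stands in for whatever code executes the query (that part is not shown and depends on your setup).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Sketch: time a task after a few untimed warm-up runs. */
final class WarmTiming {

    /**
     * Runs the task warmUps times without timing it, then runs it `runs`
     * more times and returns the elapsed milliseconds of each timed run.
     */
    static <T> List<Long> timeMillis(Supplier<T> task, int warmUps, int runs) {
        for (int i = 0; i < warmUps; i++) {
            task.get(); // result ignored; this only warms caches
        }
        List<Long> elapsed = new ArrayList<>();
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.get();
            elapsed.add((System.nanoTime() - start) / 1_000_000);
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // Stand-in workload; replace the lambda with the index query or the traversal.
        Supplier<Long> work = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            return sum;
        };
        System.out.println("timed runs (ms): " + timeMillis(work, 2, 3));
    }
}
```

Timing the index-only query and the full query with the same helper gives two lists that can be compared run by run.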

Cheers,

/peter neubauer

G: neubauer.peter
S: peter.neubauer
P: +46 704 106975
L: http://www.linkedin.com/in/neubauer
T: @peterneubauer

Neo4j - Graphs rule.
Program or be programmed - Computer Literacy for kids.
http://foocafe.org/#CoderDojo

Simon Gibson

Apr 4, 2012, 5:25:54 AM

to ne...@googlegroups.com

Peter,

I have only been doing index searches using cypher. Will need to read up on
traversals and try that as a comparison. Seems to be too many options :)
Will give it a go when I get to the office tomorrow.

Simon

Peter Neubauer

Apr 4, 2012, 5:29:03 AM

to ne...@googlegroups.com

Well,
then it seems this might be an issue with Lucene performance and we
need to drill into that. You are doing a combined query over the whole
index of Lucene? What do the index queries look like?

Cheers,

/peter neubauer

Simon Gibson

Apr 4, 2012, 6:35:19 AM

to ne...@googlegroups.com

Here is a sample:

START i=node:ItemData("name:*")
MATCH (i)-[:event]->(event),(i)-[:owner]->(owner)
WHERE event.name="Collection 1"
RETURN i

It works but takes minutes to complete. I have set up indexes for ItemData, Event, Category and Subject.

Simon

Peter Neubauer

Apr 4, 2012, 6:37:23 AM

to ne...@googlegroups.com

Can you do

START i=node:ItemData("name:*")
RETURN count(i)

And see what that gives you? That would be almost a pure index lookup.

Cheers,

/peter neubauer

Michael Hunger

Apr 4, 2012, 7:04:40 AM

to ne...@googlegroups.com

Wouldn't it be more sensible to look the event up via the index instead (as there might be only one or a few events)?

START event=node:EventData(name ="Collection 1")
MATCH (i)-[:event]->(event),(i)-[:owner]->(owner)
RETURN i
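
The reasoning behind this suggestion can be sketched with a toy in-memory model (plain Java, not the Neo4j API; all names are hypothetical). Starting from every ItemData node examines each item once, whereas starting from the event only examines the items attached to that event. Both return the same ids.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy comparison of two start points for the same query. */
final class StartPointSketch {

    /** Stand-in for an ItemData node that points at its event by name. */
    static final class Item {
        final int id;
        final String event;

        Item(int id, String event) {
            this.id = id;
            this.event = event;
        }
    }

    private final List<Item> items = new ArrayList<>();
    // event name -> items with an "event" relationship to it (the incoming side)
    private final Map<String, List<Item>> itemsByEvent = new HashMap<>();

    /** How many items the most recent query examined. */
    long touched;

    void add(Item item) {
        items.add(item);
        itemsByEvent.computeIfAbsent(item.event, e -> new ArrayList<>()).add(item);
    }

    /** Start from all items and filter on the event name: every item is examined. */
    List<Integer> startFromAllItems(String eventName) {
        touched = 0;
        List<Integer> ids = new ArrayList<>();
        for (Item item : items) {
            touched++;
            if (item.event.equals(eventName)) {
                ids.add(item.id);
            }
        }
        return ids;
    }

    /** Start from the event: only items attached to that event are examined. */
    List<Integer> startFromEvent(String eventName) {
        touched = 0;
        List<Integer> ids = new ArrayList<>();
        for (Item item : itemsByEvent.getOrDefault(eventName, List.of())) {
            touched++;
            ids.add(item.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        StartPointSketch g = new StartPointSketch();
        for (int i = 0; i < 1000; i++) {
            g.add(new Item(i, "Collection " + (i % 4 + 1)));
        }
        g.startFromAllItems("Collection 1");
        System.out.println("examined from all items: " + g.touched);
        g.startFromEvent("Collection 1");
        System.out.println("examined from the event: " + g.touched);
    }
}
```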

Michael Hunger

Apr 4, 2012, 7:05:17 AM

to ne...@googlegroups.com

Simon,

what version are you running?

Thanks

Michael

Simon Gibson

Apr 4, 2012, 7:46:15 AM

to ne...@googlegroups.com

neo4j-sh (0)$ START i=node:ItemData("name:*")
> RETURN count(i)
+----------+
| count(i) |
+----------+
| 1475429  |
+----------+
1 rows, 10309 ms

What sort of time should I expect?

Simon

Peter Neubauer

Apr 4, 2012, 7:47:50 AM

to ne...@googlegroups.com

Is that first time or after a couple of runs?

Cheers,

/peter neubauer

Michael Hunger

Apr 4, 2012, 7:52:28 AM

to ne...@googlegroups.com

Simon,

Was this the first run? What does the second one look like? Does your Neo4j instance have enough heap for caches and such?

This pulls the 1.5M nodes out of Lucene and, via Neo4j, through 1.5M Cypher query rounds to the aggregation.

It would be interesting to see where the time is spent. Do you by chance have any means of profiling? Otherwise I can take that on too (over the weekend).

What does your graph look like number-wise? 1.5M items, how many owners/events?

Michael

Simon Gibson

Apr 4, 2012, 8:07:45 AM

to ne...@googlegroups.com

neo4j-sh (0)$ START i=node:ItemData("name:*") RETURN count(i)
+----------+
| count(i) |
+----------+
| 1475429  |
+----------+
1 rows, 10163 ms

neo4j-sh (0)$ START i=node:ItemData("name:*") RETURN count(i)
+----------+
| count(i) |
+----------+
| 1475429  |
+----------+
1 rows, 7183 ms

neo4j-sh (0)$ START i=node:ItemData("name:*") RETURN count(i)
+----------+
| count(i) |
+----------+
| 1475429  |
+----------+
1 rows, 6725 ms

It is improving. Should I expect faster than this?

It is neo4j-community-1.6.1

If I look up from event how do I get the ItemData objects?

At the moment the graph is structured so that ItemData has relationships to Event, Subject and Category, but I can change the structure to some other configuration. I am new to this and not sure if there is another structure that might work better. There are only 4 events, 1000 subjects and 10 or 20 categories.

Simon

Peter Neubauer

Apr 4, 2012, 8:14:41 AM

to ne...@googlegroups.com

Could you try 1.7.M02 and see if things get faster?

Cheers,

/peter neubauer

Michael Hunger

Apr 4, 2012, 8:26:30 AM

to ne...@googlegroups.com

The question is what your use case is; in the end, the type of query you want to run depends on that.

> START event=node:EventData(name ="Collection 1")

> MATCH (i)-[:event]->(event),(i)-[:owner]->(owner)

> RETURN count(i)

Michael

Simon Gibson

Apr 5, 2012, 2:11:50 AM

to ne...@googlegroups.com

Maybe some context and a bit of a recap might be in order. I am wanting to build a web app on top of a bunch of data that has been collected as part of a clinical trial. This data is in an XML format. I was planning to use the spring data neo4j stack (2.0.1.RELEASE which pulls in neo4j-kernel-1.6).

So, I have the batch inserter working a treat and can process the XML and insert the 1.5M nodes in about a minute which is great.

I can make the graph any schema that fits but for now it's the following:

ItemData ---owner------> Subject
    |-------event------> Event ---study---> Study
    |-------category---> Category

So ItemData is the root node in the schema and has pretty well all the nodes (1.5M). There are 1000 Subjects 10-20 categories and 4 events. So ItemData has a relationship to Subject, Event and Category.

I have been trying searches such as:

START i=node:ItemData("name:*") MATCH (i)-[:category]->(category),(i)-[:event]->(event),(i)-[:owner]->(owner) WHERE owner.name="0002" and event.name="Collection 1" RETURN i

Getting all ItemData for a subject and event. May want to do for a category as well and other similar combinations. This search takes minutes to run.

Below is a link to a tar-zipped copy of the Neo4j database in case that is any use. The data has been de-identified but structurally it's the same.

http://dl.dropbox.com/u/27651004/neo4j-batch.tar.gz

I am open to any advice on how I can speed up the searching. Maybe the indexing can be tweaked, but I think the indexes are being created correctly. I insert into them as I go with the batchinserter. There are indexes for ItemData, Event, Category and Subject. I have certainly exhausted my level of knowledge on graph databases and Neo4j. I would really like this to work, but maybe our usage is not the right fit and I will just have to go back to a relational database. The original reason for choosing a graph was that I was hoping to add more relationships between the ItemData nodes dynamically later on, and the graph notion seemed to fit conceptually with how I picture the data structure.

Thanks

Simon

Michael Hunger

Apr 5, 2012, 5:30:21 AM

to ne...@googlegroups.com

Simon,

I looked at your use-case, thanks a lot for sharing it.

I wrote a small implementation using the core API, which was faster: 1.6 seconds.

Then I indexed the owners too and used the looked-up owner node to identify the correct items; it went down to 15 ms.

For Cypher the difference between 1.6 and 1.7 is really big. For the query with 2 starting points it is several orders of magnitude, due to the faster pattern matcher in 1.7.

But even so, interestingly, Cypher queries that bound only one node (e.g. the owner), then matched the relationships and checked the second condition (event name) in the WHERE clause, were 10 times faster.

I think this data + use case gives us a lot to work on. But you get your fast results (15 ms with the Java API, or 120 ms with Cypher in 1.7.M2), so no need to switch :)
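
A rough back-of-the-envelope estimate of why binding the owner narrows things so much, using the counts reported earlier in the thread (1,475,429 items, 4 events, 1000 subjects). It assumes items are spread evenly across events and subjects, which is an assumption and not something the thread confirms; `CandidateEstimate` is a hypothetical name.

```java
/** Sketch: average number of items hanging off one start node, assuming an even spread. */
final class CandidateEstimate {

    static long itemsPerStartNode(long totalItems, long startNodes) {
        return Math.round((double) totalItems / startNodes);
    }

    public static void main(String[] args) {
        long items = 1_475_429L; // count(i) reported earlier in the thread
        System.out.println("all items:           " + items);
        System.out.println("per event (4):       " + itemsPerStartNode(items, 4));
        System.out.println("per subject (1000):  " + itemsPerStartNode(items, 1000));
    }
}
```

Under that assumption a single owner leads to roughly 1,475 candidate items, against roughly 368,857 for a single event, so checking the event name on the owner's items is a much smaller job than the reverse.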

I share the code.

QueryRunner.java

Simon Gibson

Apr 5, 2012, 9:48:32 AM

to ne...@googlegroups.com

Michael,

Thanks so much for taking the time to help me out. I will have a good look over the weekend and see how I go. This all sounds promising.

Simon

Michael Hunger

Apr 5, 2012, 10:01:37 AM

to ne...@googlegroups.com

You're welcome.

I would be very happy if you could publish your results/findings in a blog post or screencast so that others can benefit from it too.

Happy Easter,

Cheers

Michael

Simon Gibson

Apr 12, 2012, 12:13:56 AM

to ne...@googlegroups.com

I will be more than happy to publish results.

I have had a good look at the sample code (QueryRunner). From this it seems that natively searching is much quicker than using cypher. It now raises the question how I should best approach it for my web app. I was planning on using spring-data and using cypher queries but it would appear that I might be better to build a layer that natively does the searches and cast/build the result nodes into pojos myself. It is a simple graph so doing this might not be too hard to cover most cases. I guess it comes down to a best practice question. I will also need to be able to build dynamic queries as well and was planning on using DSL but this builds cypher queries which comes back to the same problem.

I have upgraded spring data to 2.1.0.M1. I realise this is using neo4j 1.6 and will use 1.7 soon, so that might increase speed in the future.