Re: Approach for using labels during batch import


Michael B.

unread,
Jun 4, 2013, 9:31:59 AM6/4/13
to ne...@googlegroups.com
Check out my blog entry on batch imports: http://michaelbloggs.blogspot.com/2013/05/importing-ttl-turtle-ontologies-in-neo4j.html

Labels are a bit complicated. You shouldn't commit to indices during batch imports (you can still add entries to them), because commits make everything incredibly slow. Michael Hunger suggested using MapDB as a temporary index, and that's what I'd do in your place. Either do it like I did: use a java.util.Map implementation (for small data sets a HashMap is more than enough) with the index as a fallback for nodes that are already in the DB but haven't been imported by your application, or use MapDB instead.
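To make the pattern concrete, here's a minimal sketch of the temporary-index idea in plain Java. The Neo4j BatchInserter is stood in for by a simple counter so the snippet runs on its own; with the real API you'd call inserter.createNode(props) and inserter.createRelationship(from, to, type, props) at the marked spots, and the fallback branch would query the persistent index instead.

```java
import java.util.HashMap;
import java.util.Map;

// Temporary-index pattern for batch imports: while inserting nodes, remember
// each external ID -> batch node ID in an in-memory map, then resolve
// relationship endpoints from the map instead of querying Lucene.
public class TempIndexSketch {
    public static void main(String[] args) {
        Map<String, Long> idMap = new HashMap<>();
        long nextNodeId = 0;

        // Pass 1: create nodes, remembering external ID -> internal ID.
        String[] externalIds = {"person:1", "person:2", "person:3"};
        for (String ext : externalIds) {
            long nodeId = nextNodeId++;   // real code: inserter.createNode(props)
            idMap.put(ext, nodeId);
        }

        // Pass 2: resolve relationship endpoints via the map.
        String[][] rels = {{"person:1", "person:2"}, {"person:2", "person:3"}};
        for (String[] rel : rels) {
            Long from = idMap.get(rel[0]);
            Long to = idMap.get(rel[1]);
            if (from == null || to == null) {
                // Fallback: the node existed in the DB before this import;
                // look it up in the persistent index instead.
                continue;
            }
            // real code: inserter.createRelationship(from, to, type, props)
            System.out.println(from + " -> " + to);
        }
    }
}
```

For small data sets the HashMap is enough; swapping it for a MapDB map keeps the same lookup structure while spilling to disk.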

Regards,
Michael

On Tuesday, 4 June 2013 11:47:25 UTC+2, Jennifer Smith wrote:
Hi there,

I have been looking at the docs for 2.0 particularly around support for labels during batch import.

I see there is support for adding labels to nodes during batch import, directly querying labels for nodes and so on. However, unless I am missing something I don't see that there is support for locating a node by label and ID. I have found I have needed to do this when I import a large dataset where the relationships come separately from the nodes (say a dump from a relational database) and I need to use an external ID to find the nodes for the relationship.

 I wondered what the intended approach for looking up a node by label and ID is during batch import. I can see the following choices:

- Use the standard EmbeddedGraphDatabase (making sure to have shut down the batch inserter of course) to look up the nodes for a bunch of relationship inserts before going into insert mode.
- Use the BatchInserterIndexProvider to somehow hack into the underlying index that I believe is created for labels
- Be patient and wait for support to appear in the batch API for querying nodes by label and ID :)

Thanks

Jen

Jennifer Smith

unread,
Jun 7, 2013, 1:41:34 AM6/7/13
to ne...@googlegroups.com
Hi Michael,

Yes, I was considering using MapDB. We actually do use the standard Lucene indexes during our existing 1.9.x batch insertion. We also do a pre-existing-data check when inserting nodes and entities that uses the index. So far it's been fast enough; by that I mean taking 2-3 hours for about 50 million nodes and 90 million relationships! But when we need more performance, I'm happy to explore MapDB as an option at import time. I would probably also be interested in using it as a permanent index, rather than just at import time.

Thanks

Jen

Michael B.

unread,
Jun 7, 2013, 4:10:29 AM6/7/13
to ne...@googlegroups.com
Michael Hunger has actually written a blog entry on this. Check out his
blog: http://jexp.de/blog/

Standard Lucene performs poorly in many cases. The only thing it's good
at is full-text search with n-grams. If you don't need that, any
key-value store performs better, e.g. MapDB or Voldemort.



Michael Hunger

unread,
Jun 7, 2013, 4:26:47 AM6/7/13
to ne...@googlegroups.com
Actually I want to update the CSV batch inserter to support index lookups and use real "csv"; that means I'll put MapDB in there, and we'll see how it goes.

You can also see if a standard HashMap is good enough for you, or a Trove primitive map. Otherwise there is still the trick with the array of unique values: sort it and use each value's array index as the node ID. Create nodes with inserter.createNode(index, props), and the ID lookup for relationships is then just Arrays.binarySearch(array, value).
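The sorted-array trick above can be sketched in a few lines of self-contained Java. The actual inserter.createNode(index, props) calls are only shown as a comment here, since they need a running BatchInserter; the ID-lookup logic itself is exactly what runs below.

```java
import java.util.Arrays;

// Sorted-array trick: collect the unique external values, sort them once,
// and use each value's position in the sorted array as the batch node ID.
// Endpoint lookup for relationships is then a binary search instead of a
// map or Lucene index query.
public class SortedArrayIdLookup {
    public static void main(String[] args) {
        String[] uniqueValues = {"carol", "alice", "bob"};
        Arrays.sort(uniqueValues);  // now ["alice", "bob", "carol"]

        // real code: for (int i = 0; i < uniqueValues.length; i++)
        //                inserter.createNode(i, propsFor(uniqueValues[i]));

        // Relationship import: map an external value back to its node ID.
        long bobId = Arrays.binarySearch(uniqueValues, "bob");
        System.out.println(bobId);
    }
}
```

Note that Arrays.binarySearch only works once the array is fully sorted, so this is a two-pass scheme: collect and sort all values first, then create nodes and relationships.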

I also have to update the batch importer to 2.0, but that's a bigger piece of work, as lots of the internals changed in between.

Michael




Michael B.

unread,
Jun 7, 2013, 4:35:26 AM6/7/13
to ne...@googlegroups.com, Michael Hunger
I checked that out in my batch importer (have a look at it on GitHub).
MapDB performs pretty well, but in the end, the index look-ups aren't
the big bottleneck. If you need to perform normal index operations at
any point (to make sure you're not importing duplicates), or iterate
over the relationships of nodes to create unique relationships,
everything becomes way slower.

As far as batch imports go, I think an in-memory MapDB is the best
option. You might want to include some kind of function to create an
in-memory index on specific labels/keys to allow fast access to
whatever's needed for batch loads.

Here's what I did for Batch loads:
https://github.com/mybyte/tools/blob/master/Turtle%20loader/src/de/miba/neo4j/loader/turtle/Neo4jMapDBBatchHandler.java
The import went fine, and was pretty fast I'd say. The bigger problem is
overall performance of all the node operations...

Qi Song

unread,
Oct 15, 2015, 6:09:37 AM10/15/15
to Neo4j, michael...@neotechnology.com
Hello Michael,
I'm trying to use your Turtle loader to import Yago (https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads/) into Neo4j, but I ran into some weird problems when importing. I can import YagoFacts.ttl and YagoTypes.ttl fine separately, but when I tried to import both of them I got the error below. I'm not sure what the reason is. Is there some limit in the TurtleLoader or BatchImporter?

Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:58)
Caused by: java.lang.RuntimeException: Panic called, so exiting
at org.neo4j.unsafe.impl.batchimport.staging.AbstractStep.assertHealthy(AbstractStep.java:200)
at org.neo4j.unsafe.impl.batchimport.staging.ProducerStep.process(ProducerStep.java:78)
at org.neo4j.unsafe.impl.batchimport.staging.ProducerStep$1.run(ProducerStep.java:54)
Caused by: java.lang.IllegalArgumentException
at sun.misc.Unsafe.allocateMemory(Native Method)
at org.neo4j.unsafe.impl.internal.dragons.UnsafeUtil.malloc(UnsafeUtil.java:324)
at org.neo4j.unsafe.impl.batchimport.cache.OffHeapNumberArray.<init>(OffHeapNumberArray.java:41)
at org.neo4j.unsafe.impl.batchimport.cache.OffHeapLongArray.<init>(OffHeapLongArray.java:34)
at org.neo4j.unsafe.impl.batchimport.cache.NumberArrayFactory$2.newLongArray(NumberArrayFactory.java:122)
at org.neo4j.unsafe.impl.batchimport.cache.NumberArrayFactory$Auto.newLongArray(NumberArrayFactory.java:154)
at org.neo4j.unsafe.impl.batchimport.RelationshipCountsProcessor.<init>(RelationshipCountsProcessor.java:60)
at org.neo4j.unsafe.impl.batchimport.ProcessRelationshipCountsDataStep.processor(ProcessRelationshipCountsDataStep.java:73)
at org.neo4j.unsafe.impl.batchimport.ProcessRelationshipCountsDataStep.process(ProcessRelationshipCountsDataStep.java:60)
at org.neo4j.unsafe.impl.batchimport.ProcessRelationshipCountsDataStep.process(ProcessRelationshipCountsDataStep.java:36)
at org.neo4j.unsafe.impl.batchimport.staging.ProcessorStep$4.run(ProcessorStep.java:120)
at org.neo4j.unsafe.impl.batchimport.staging.ProcessorStep$4.run(ProcessorStep.java:102)
at org.neo4j.unsafe.impl.batchimport.executor.DynamicTaskExecutor$Processor.run(DynamicTaskExecutor.java:237)

Bests~
Qi Song

Michael Bach

unread,
Oct 15, 2015, 5:07:08 PM10/15/15
to ne...@googlegroups.com
Hi!

My best guess would be that the algorithm Neo4j uses just can't cope with the vast number of labels this sort of use case produces. Anyhow, the code is very, very old...
The better approach would be to actually model RDF-like relationships with nodes and introduce only a few labels: class, individual, and maybe a couple of data types.

Sent from my iPad


Qi Song

unread,
Oct 15, 2015, 5:12:36 PM10/15/15
to Neo4j
Hi Michael,
Thanks for your reply :) I noticed that the code is old and uses some old APIs. However, labels are a bottleneck for loading RDF files, and in my work labels are very important. I'll try to find some way to handle labels more effectively.

Bests~
Qi Song

Michael Bach

unread,
Oct 16, 2015, 5:40:57 PM10/16/15
to ne...@googlegroups.com
I did a couple of experiments today. For what it's worth: labels are a means to index different document sets, since property indexes are built on a per-label basis. I wouldn't try to introduce a label for each class in Yago. As mentioned before, I'd rather try to model is-a relationships with nodes rather than labels.

Is there a particular reason why you're trying your luck with neo4j instead of virtuoso or jena?

Sent from my iPad

Michael Hunger

unread,
Oct 16, 2015, 6:26:53 PM10/16/15
to ne...@googlegroups.com
Labels are roles or tags on nodes.

Which can be used to represent types as well.

That you can attach metadata like indexes is just a benefit.

The is-a relationships might be fine on a theoretical model, but will not perform that well if you have many millions or billions of them and query across them.

How many types are there in yago?

Michael


Qi Song

unread,
Oct 17, 2015, 12:38:58 AM10/17/15
to ne...@googlegroups.com
Each instance in Yago has a type, and there are millions of instances.
Qi Song
Machine learning and Knowledge Discovery Group
EECS Washington State University

Qi Song

unread,
Oct 17, 2015, 12:41:00 AM10/17/15
to ne...@googlegroups.com
We are using Neo4j as a database and building some graph mining algorithms on top of it.
I think I can try using an is-a or has-type relationship to represent a label, rather than labels on a node.

Michael Hunger

unread,
Oct 17, 2015, 4:13:41 AM10/17/15
to ne...@googlegroups.com
How many different types?

Sent from my iPhone

Michael B.

unread,
Oct 17, 2015, 5:04:21 AM10/17/15
to ne...@googlegroups.com
Yago has roughly 350,000 different classes, 10 million entities and 120 million facts (which would be either relationships or properties).

As mentioned previously, I'd rather go with few labels and model entity types as their own nodes (which is the case in RDF). You could query for it with something like this:
match (x:Individual)-[t:is_a]->(c:Class {type:"wikicat_Songwriters_from_Louisiana"}) return x

Michael Hunger

unread,
Oct 17, 2015, 5:11:16 AM10/17/15
to ne...@googlegroups.com
This looks scarily like denormalization:

wikicat_Songwriters_from_Louisiana

Shouldn't that be three nodes linked to it rather than one type node?

Sent from my iPhone

Michael B.

unread,
Oct 17, 2015, 6:50:23 AM10/17/15
to ne...@googlegroups.com

Yago has a ridiculously deep taxonomy. Most ontologies have several thousand classes, though, due to the nature of the RDF stores out there. Traversal and property queries (in SPARQL) are complicated and very slow because lots of things are post-filtered (collect nodes first, filter by property later). Querying by class/type and by relationships, on the other hand, is strongly optimized and very fast. That's why most ontologies have lots of classes (and use multi-classing).

Aside from that: isn't denormalization the main point of NoSQL stores? Admittedly, stuff like this shouldn't exist in a proper triple store; I just found it in a Yago sample data set and found it funny...

Mahesh Lal

unread,
Oct 17, 2015, 8:50:41 AM10/17/15
to Neo4j
Hi,

Started following this thread a bit late, but the last bit about denormalization caught my interest. 

Denormalization might be the point in document stores and possibly key-value stores, but in stores that allow both storage and seamless retrieval of connected data (like graphs, and RDBMSs, though not so seamlessly there), what Michael Hunger suggested makes more sense.

The idea is to have a node, let's say (x:Songwriter)-[:LIVES_IN]->(:State {name:"Louisiana"}). The beauty of this is that the node can have multiple labels, like :Individual. This approach, however, might not be useful when you want to find all the roles (x:Individual) has. In case you foresee such queries, a better way to model it would be (x:Individual)-[:IS_A]->(:Role {name:"Songwriter"}) and (x:Individual)-[:LIVES_IN]->(c:City)-[:LOCATED_IN]->(:State {name:"Louisiana"}).

Though I have never worked with RDF stores, and have only a limited understanding of them from discussions with colleagues, I'm assuming it's the limitation of RDF stores to a single "type" that makes the ontology so deep. Also, in suggesting the above, I'm assuming there is some way, in your use case, to break the types into labels, nodes, and relationships.

Cheers!
Mahesh Lal

-- Thanks and Regards
   Mahesh Lal

Michael B.

unread,
Oct 17, 2015, 9:39:14 AM10/17/15
to ne...@googlegroups.com
Mahesh,

RDF is by no means limited to single-class inheritance. As an executive summary of RDF/OWL and so on:

Everything (even class definitions, property definitions, etc.) is an object. Objects have typed properties, which are either primitives (data properties) or pointers to other objects (object properties). This is pretty similar to properties and relationships in Neo4j.
There are a few predefined types in RDF and OWL. The most important ones are the rdf:type ("is a") and rdfs:subClassOf properties. These are the ones you use to build taxonomies (subClassOf for subclasses of classes) and assign types to objects.
The interesting thing is: there's no difference between a class definition, a property definition, and an actual object. The way they are distinguished is that classes have a subClassOf property, while every "individual" has at least one type (an rdf:type) property pointing to a class.

Since everything in RDF is pretty arbitrary, you can assign pretty much any property to any object. That also means an individual can have as many classes as you'd like.

Query performance with the awful SPARQL language has always been pretty bad. As with any other data storage, people started to model data around the individual storage systems (RDF/triple stores) instead of around what might be logical. This sometimes leads to very interesting side effects, like funny, unnormalized classes in addition to the logical ones. In most cases, these classes are automatically generated by reasoning (there are logical rule engines for RDF).

This is mainly a performance thing, because querying for a class is very fast, while querying for multiple conditions can sometimes blow up exponentially.

Long story short: take a look at Barack Obama http://dbpedia.org/page/Barack_Obama in dbpedia. Look at the rdf:type section. It has dozens of yago classes like Person, Politician, Writer, Statesman etc.

Mahesh Lal

unread,
Oct 18, 2015, 3:18:44 AM10/18/15
to Neo4j
Hi Michael,

Thanks for taking the time and effort to explain this in detail. I never had a chance to work with RDF, but it looks like a great weekend project to work on. Thanks once again :)

Cheers!
Mahesh

-- Thanks and Regards
   Mahesh Lal

