SDN + BatchInserter


Tero Paananen

8 Mar 2012, 13:35
to Neo4j
I'm doing a pretty massive data migration effort in the next week or
two.

The app is based on SDN (2.1.M1, 1.6.1 Neo4j).

The application has a RESTful API on top of SDN.

Normal use cases can pump a whole bunch of data through that API, but
the volume of data involved with the data migration just isn't going
to work.

BatchInserter is a pretty simple concept, but using it to populate a
Neo4j database that is accessed via SDN is a bit more interesting.

I have a few questions I want to make sure I understand before
proceeding with this.

1. My node and relationship entities sometimes use multiple indexes
(e.g. full text indexes, and exact indexes)
Some properties are not indexed. I do not use auto indexing.

Is it correct to assume I'll have to manage those indexes in my batch
inserter on my own, a little like this:

BatchInserter inserter = new BatchInserterImpl( "target/neo4jdb-batchinsert" );
BatchInserterIndexProvider indexProvider = new LuceneBatchInserterIndexProvider( inserter );
BatchInserterIndex actorsfull = indexProvider.nodeIndex( "actorsfull", MapUtil.stringMap( "type", "fulltext" ) );
BatchInserterIndex actorsexact = indexProvider.nodeIndex( "actorsexact", MapUtil.stringMap( "type", "exact" ) );

Map<String, Object> nonIndexedProperties = MapUtil.map( "woah", "yea!" );
Map<String, Object> fullProperties = MapUtil.map( "name", "Keanu Reeves" );
Map<String, Object> exactProperties = MapUtil.map( "gender", "male" );
Map<String, Object> allProperties = new HashMap<String, Object>();
allProperties.putAll( fullProperties );
allProperties.putAll( exactProperties );
allProperties.putAll( nonIndexedProperties );
long node = inserter.createNode( allProperties );
actorsexact.add( node, exactProperties );
actorsfull.add( node, fullProperties );

2. template.postEntityCreation() is deprecated. Is that still the
recommended way to add the __type__ properties to externally created
entities so SDN can use them?

3. What's the best way to make sure the __types__ and __rel_types__
indexes are populated correctly?

-TPP

Michael Hunger

8 Mar 2012, 14:59
to ne...@googlegroups.com
Tero,

Yes, you have to create the index entries manually.

#1 make sure to shut down all the batch indexes when you shut down the inserter
#2 template.postEntityCreation() is still the way to go, but it is not usable with the batch inserter
#3 you have to add the property __type__ with the FQN as value, and add an index entry for the node to the index __types__, key "className", value FQN (or __rel_types__ for relationships)
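In code, #3 (plus the shutdown order from #1) might look roughly like the following sketch against the 1.8-era batch-inserter API. The entity class com.example.Actor is a placeholder, not a class from this thread:

```java
// Hedged sketch: add SDN's __type__ property to a batch-inserted node and
// register it in the __types__ index under the "className" key.
// "com.example.Actor" is a placeholder FQN.
BatchInserter inserter = BatchInserters.inserter( "target/neo4jdb-batchinsert" );
BatchInserterIndexProvider indexProvider = new LuceneBatchInserterIndexProvider( inserter );
BatchInserterIndex typesIndex = indexProvider.nodeIndex( "__types__", MapUtil.stringMap( "type", "exact" ) );

String fqn = "com.example.Actor";
Map<String, Object> props = MapUtil.map( "name", "Keanu Reeves", "__type__", fqn );
long node = inserter.createNode( props );
typesIndex.add( node, MapUtil.map( "className", fqn ) );

// #1: shut down the index provider before the inserter itself
indexProvider.shutdown();
inserter.shutdown();
```

For relationship entities the same pattern would apply with a relationship index named __rel_types__.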

Please try it on a small dataset first.

How big is the dataset you want to import?

HTH

Michael

Tero Paananen

8 Mar 2012, 15:31
to ne...@googlegroups.com
> #1 make sure to shutdown all the batch-indexes when you shutdown the inserter
> #2 template.postEntityCreation()  is still the way to go, but not usable with the batch-inserter

Yea, I figured. I have the BatchInserter implementation complete, and it is
orders of magnitude faster, but to do the postEntityCreation() I'd have to
introduce the SDN (and Spring) stack on top of this thing. Or post-process all
nodes in an SDN-aware app.

> #3 you have to add the property __type__ with the FQN as value, and add the index-entry for the node to index __types__,  key "className", value FQN (or __rel_types__)

Ok. Let me give it a go. It shouldn't be too complicated really.

> Please try it on a small dataset first.
>
> How big is the dataset you want to import?

About 200M nodes, and up to 300M relationships.

-TPP

Tero Paananen

8 Mar 2012, 16:11
to ne...@googlegroups.com
>> #3 you have to add the property __type__ with the FQN as value, and add the index-entry for the node to index __types__,  key "className", value FQN (or __rel_types__)
>
> Ok. Let me give it a go. It shouldn't be too complicated really.

Looks like this is working just fine.

I ran a quick test, and compared the Lucene indexes created by the SDN
app with the indexes created by the BatchInserter, using Luke. They look
identical.

The BatchInserter is doing roughly 10,000 nodes per second.

-TPP

Michael Hunger

9 Mar 2012, 3:35
to ne...@googlegroups.com
Nice,

if you give it more memory, so that all of the node-store and some of the rel-store files fit into memory, it should be able to insert up to 1-3 million nodes per second (ok, w/o Lucene).
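The memory sizing is done via the store-file memory-mapping settings passed to the batch inserter. A hedged sketch with illustrative sizes (not recommendations from this thread):

```java
// Hedged sketch: memory-mapping configuration for the 1.x batch inserter so
// the node store (and part of the relationship store) fits in memory.
// The values below are illustrative only.
Map<String, String> config = new HashMap<String, String>();
config.put( "neostore.nodestore.db.mapped_memory", "2G" );
config.put( "neostore.relationshipstore.db.mapped_memory", "4G" );
config.put( "neostore.propertystore.db.mapped_memory", "2G" );
config.put( "neostore.propertystore.db.strings.mapped_memory", "1G" );
BatchInserter inserter = BatchInserters.inserter( "target/neo4jdb-batchinsert", config );
```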

I would love it if you could write a blog post about that.

Michael

Tero Paananen

22 Jul 2012, 7:38
to ne...@googlegroups.com
> I'm a bit late to this thread, but hope that you could clarify the
> following: (I'm also using SDN)
>
> I assume you mean that the following need to be added?
>
> node.setProperty("__type__", "com.x.x.Class");
> Index<Node> typeIndex = indexManager.forNodes("__types__");
> typeIndex.add(node, "className", "com.x.x.Class");
>
> If so, then the associated spring repos aren't finding the data.

I wrote a blog post about this, which might help you:

http://code.paananen.fi/2012/04/05/neo4j-batchinserter-and-spring-data-for-neo4j/

-TPP

imamc

23 Jul 2012, 11:48
to ne...@googlegroups.com
Thanks Tero.

It's probably better if I explain what I'm trying to achieve. I'm creating a component which is responsible for creating and removing test data for integration tests. The 'seed handler' is wrapped with Spring test execution listeners and a custom annotation. The objective is to make the seeding of data transparent. The seed handler that I've created works well with non-SDN-backed apps. When the app is SDN-backed, the correct entities load with the custom Spring Data repositories. However, when I attempt to get the entities back using particular properties or an index, the match fails.

I've created a project where the tests have been ignored due to this problem (please refer to com.shedhack.testing.neo4j.SpringDataExampleTest). Git:

https://github.com/imamchishty/neo4j-seed-spring-data

Thanks

Lasse Westh-Nielsen

25 Jul 2012, 9:33
to ne...@googlegroups.com
Imamc,

I downloaded and built your project. All green - all tests work, and there do not seem to be any ignored ones?

Lasse

imamc

25 Jul 2012, 18:20
to ne...@googlegroups.com, la...@neotechnology.com
Thanks for taking a look. I managed to get this working earlier today. All seems well. Am now using this for a pet project. SDN seems to be playing nicely. The two critical pieces were, as we expected, the __type__ property and the __types__ index.

The test component will suffice for now, and I'll be doing JIT for other features which may be required. The seeding (insertion/deletion) works quite well.

Michael Hunger

14 Dec 2012, 20:31
to ne...@googlegroups.com
Did you shut down the LuceneBatchInserterIndexProvider correctly?

On 14.12.2012 at 23:21, Sanjay Dalal wrote:

I followed the steps at http://code.paananen.fi/2012/04/05/neo4j-batchinserter-and-spring-data-for-neo4j/. After creating an index using BatchInserterIndex, the SDN repository fails to recognize that index. A query through a repository with @Query throws org.neo4j.cypher.MissingIndexException: Index <created index name, e.g. actors> does not exist. I am using 1.8.M07 and SDN 2.1.0.RC3. The same query works fine in the shell.

MyRepository extends GraphRepository<MyType>, RelationshipOperationsRepository<MyType>

What am I missing?

Sanjay

15 Dec 2012, 18:51
to ne...@googlegroups.com
Thanks for a quick reply. Yes, I do shut it down correctly. I resolved it. It was an environment issue. The repository was looking at a database in a different location (from the Spring context) than the database where the nodes were stored by the BatchInserter. Dang.

While chasing this, I did notice that if a fully qualified database location is given in the constructor of org.neo4j.kernel.impl.batchinsert.BatchInserterImpl, it fails with an NPE. When I had the Spring context for Neo4j on the classpath, I did not see this error. I am using Ubuntu and JDK 1.7.

Michael, is there any plan to support batch insertion in o.s.d.n?

Michael Hunger

15 Dec 2012, 20:06
to ne...@googlegroups.com
What is the fully qualified location you passed to the batch-inserter?

Michael


Sanjay

28 Dec 2012, 19:41
to ne...@googlegroups.com
Michael, sorry for the late response. Let's say I passed /home/sanjay/work/db/data/graph.db. My importer utility was running from a different location, such as /home/sanjay/work/importer/. I would get the NPE.

Also, what should be done for an index with a unique constraint, and for a spatial index, while importing in batch? Thanks.

Michael Hunger

28 Dec 2012, 20:19
to ne...@googlegroups.com
Both unique indexes and spatial import are not yet supported, but Peter and I have worked on speeding up the spatial import (and reducing its memory requirements).

Do you have the full stacktrace of the NPE ?

Thanks


Sanjay

28 Dec 2012, 20:35
to ne...@googlegroups.com
Here you go. I am using 1.8.M07. 

Michael, I need unique constraint enforcement while importing. Spatial could be 2nd priority, i.e. done post-import if possible. Let me know if you have any suggestion for a workaround for the unique constraint.

java.lang.NullPointerException
at org.neo4j.kernel.impl.util.FileUtils.fixSeparatorsInPath(FileUtils.java:272)
at org.neo4j.graphdb.factory.GraphDatabaseSetting$AbstractPathSetting.valueOf(GraphDatabaseSetting.java:464)
at org.neo4j.graphdb.factory.GraphDatabaseSetting$AbstractPathSetting.valueOf(GraphDatabaseSetting.java:417)
at org.neo4j.kernel.configuration.Config.get(Config.java:113)
at org.neo4j.graphdb.factory.GraphDatabaseSetting$AbstractPathSetting.valueOf(GraphDatabaseSetting.java:471)
at org.neo4j.graphdb.factory.GraphDatabaseSetting$AbstractPathSetting.valueOf(GraphDatabaseSetting.java:417)
at org.neo4j.kernel.configuration.Config.get(Config.java:113)
at org.neo4j.kernel.impl.nioneo.store.NeoStore.checkVersion(NeoStore.java:144)
at org.neo4j.kernel.impl.nioneo.store.CommonAbstractStore.<init>(CommonAbstractStore.java:115)
at org.neo4j.kernel.impl.nioneo.store.AbstractStore.<init>(AbstractStore.java:77)
at org.neo4j.kernel.impl.nioneo.store.NeoStore.<init>(NeoStore.java:82)
at org.neo4j.kernel.impl.nioneo.store.StoreFactory.attemptNewNeoStore(StoreFactory.java:88)
at org.neo4j.kernel.impl.nioneo.store.StoreFactory.newNeoStore(StoreFactory.java:77)
at org.neo4j.unsafe.batchinsert.BatchInserterImpl.<init>(BatchInserterImpl.java:117)
at org.neo4j.unsafe.batchinsert.BatchInserterImpl.<init>(BatchInserterImpl.java:91)
at org.neo4j.unsafe.batchinsert.BatchInserters.inserter(BatchInserters.java:45)
at org.neo4j.kernel.impl.batchinsert.BatchInserterImpl.<init>(BatchInserterImpl.java:39)
at com.wavelety.facebook.graph.NeoBatchInserter.init(NeoBatchInserter.java:130)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:80)
at org.testng.internal.Invoker.invokeConfigurationMethod(Invoker.java:525)
at org.testng.internal.Invoker.invokeConfigurations(Invoker.java:202)
at org.testng.internal.Invoker.invokeConfigurations(Invoker.java:130)
at org.testng.internal.TestMethodWorker.invokeBeforeClassMethods(TestMethodWorker.java:173)
at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:105)
at org.testng.TestRunner.runWorkers(TestRunner.java:1178)
at org.testng.TestRunner.privateRun(TestRunner.java:757)
at org.testng.TestRunner.run(TestRunner.java:608)
at org.testng.SuiteRunner.runTest(SuiteRunner.java:334)
at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:329)
at org.testng.SuiteRunner.privateRun(SuiteRunner.java:291)
at org.testng.SuiteRunner.run(SuiteRunner.java:240)
at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86)
at org.testng.TestNG.runSuitesSequentially(TestNG.java:1158)
at org.testng.TestNG.runSuitesLocally(TestNG.java:1083)
at org.testng.TestNG.run(TestNG.java:999)
at org.apache.maven.surefire.testng.TestNGExecutor.run(TestNGExecutor.java:70)
at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.execute(TestNGDirectoryTestSuite.java:102)
at org.apache.maven.surefire.testng.TestNGProvider.invoke(TestNGProvider.java:114)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.maven.surefire.booter.ProviderFactory$ClassLoaderProxy.invoke(ProviderFactory.java:103)
at $Proxy0.invoke(Unknown Source)
at org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:150)
at org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcess(SurefireStarter.java:74)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:69)

Michael Hunger

28 Dec 2012, 21:19
to ne...@googlegroups.com
Looks like the trailing slash is the issue.

Sorry, I was being confusing: you can do unique imports by using the LuceneBatchInserterIndexProvider, esp. with index.setCacheCapacity() for performance (see: http://docs.neo4j.org/chunked/milestone/batchinsert.html#indexing-batchinsert)

If that's too slow for you, use an in-memory hashmap (or a sorted array which provides the node-ids), or preprocess your input data.
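A hedged sketch of that index-lookup approach: check the batch index before creating a node, with the cache capacity set for performance. The index name "users" and key "email" are placeholders, not from this thread:

```java
// Hedged sketch: enforce uniqueness during batch insertion by looking the
// key up in a batch index before creating the node.
// Index name "users" and key "email" are illustrative placeholders.
BatchInserterIndex users = indexProvider.nodeIndex( "users", MapUtil.stringMap( "type", "exact" ) );
users.setCacheCapacity( "email", 1000000 ); // in-memory cache for fast lookups

String email = "keanu@example.com"; // placeholder value
Long existing = users.get( "email", email ).getSingle();
if ( existing == null )
{
    long node = inserter.createNode( MapUtil.map( "email", email ) );
    users.add( node, MapUtil.map( "email", email ) );
    users.flush(); // make the new entry visible to subsequent get() calls
}
```

Note that batch-index additions are not visible to lookups until flush() is called, so a periodic flush is needed when deduplicating within one run.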



Michael


Sanjay

1 Jan 2013, 11:22
to ne...@googlegroups.com
Do you mean I need to add a trailing slash? /home/sanjay/work/db/data/graph.db/ ? Please clarify.

I am expecting to import data incrementally. That is, I might already have some nodes persisted in the database from the first run. In a second, incremental run, I would import more data, and I would like to enforce unique constraints against the data in the cache as well as in the database (persisted by the earlier import). I am not sure the approach you describe would work in that case, but correct me if I am wrong.

Michael Hunger

1 Jan 2013, 16:48
to ne...@googlegroups.com
Leave off the trailing slash.

For the uniqueness of a batch-inserter import: if you use the index lookup (e.g. with Lucene), you would first check against the cache; if you already find the entry there, it is OK, otherwise the lookup falls through to the Lucene disk index. setCacheCapacity() should take care of that automatically, as it configures an in-memory cache.

Incremental updates for batch-insertion are possible but a bit tricky to do.

Perhaps it would be sensible to get back to square one and discuss what it is you actually want to do, i.e. your use case and your insertion-speed requirements, and where you fall short with SDN or the transactional core API?

Michael


Sanjay

3 Feb 2013, 18:17
to ne...@googlegroups.com
Posting to the group on suggestion by Mike Hunger...

The batch insert directly using the Neo4j APIs does not seem like a production-ready approach, since it requires bringing down a running Neo4j server and running the Neo4j batch inserter, which starts the database in embedded mode. This affects the availability of the application. We have a situation where part of our solution might be using the database (through SDN) for dashboard and reporting purposes, while at the same time scheduled jobs might try to bulk insert data (using the Neo4j batch insert APIs) and other processes might try to access data using SDN. Ideally, if SDN supported batch insert (over the RESTful APIs), we would not have this problem. Alternately, the Neo4j batch inserter could support a mode where it connects to a running Neo4j server using RESTful or other remoting APIs. I am looking for workarounds. Thanks in advance.

Michael Hunger

3 Feb 2013, 18:22
to ne...@googlegroups.com
The batch inserter is intended for initial inserts, so its constraints are OK.

Usually the REST API, using REST batch operations and parameterized Cypher updates, is fast enough for normal update volumes.

What is your bulk-data size?

Michael

--
 
 

Cheers

Michael

(neo4j.org) <-[:WORKS_ON]- (@mesirii) -[:TAKES_CARE_OF]-> (you) -[:WORKS_WITH]->(@Neo4j)




Bob Wilson

1 Mar 2013, 9:43
to ne...@googlegroups.com
Michael,

We are new to Neo4j and SDN. Is there a blog post anywhere, or other documentation, that elaborates on how to load data outside of SDN for an app that will be using SDN? For example, exactly what is in the __types__ index, and how do we create it outside of SDN? What else does SDN need that we would have to create externally? Other than response times, is there any downside to loading through SDN?

Also, we are using Neo4j Spatial - is there anything we need to be concerned about as far as Spatial working with SDN? Currently we are using 1.8.1.

We attended the Neo4j seminar in Cambridge yesterday and were told you were the guy to ask.

Thanks in advance.
Bob

Michael Hunger

1 Mar 2013, 10:05
to ne...@googlegroups.com
Yes, there are some.

Inserting spatial data with SDN might need a separate indexing step; otherwise just look in the manual, http://spring.neo4j.org/docs, for index type POINT.

Ping me if you have any questions.

Cheers

Michael

(neo4j.org) <-[:WORKS_ON]- (@mesirii) -[:TAKES_CARE_OF]-> (you) -[:WORKS_WITH]->(@Neo4j)




Bob Wilson

4 Mar 2013, 15:28
to ne...@googlegroups.com
Thanks Michael. We will look into getting that running.

Another question I had: we can't figure out why we can't do Cypher queries against numeric index properties. For example, on our NetworkNode index built through SDN, we have a property of type long named dbid. When trying a Cypher query like "start n=node:NetworkNode(dbid=123) return n;", or any other variation, it will not return any rows. Searching on a string property works fine. We are using 1.8.1, and even the instructor last week in Cambridge could not figure it out. Any ideas what the syntax would be in Cypher? We tried Lucene syntax, which also did not work.

Thanks, Bob

Michael Hunger

4 Mar 2013, 19:08
to ne...@googlegroups.com
That is a tricky issue. Although it is possible to index properties numerically, the Lucene query parser currently has to be configured for each numeric field individually. And since, for custom queries, we don't know which fields are numeric (the parsing happens far, far below SDN), we cannot reconfigure the parser. So querying numerically should work for repositories, and also for derived queries, but not for free-form Cypher queries right now.
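For background, numeric values in this generation's manual Lucene indexes are stored and queried via ValueContext/QueryContext. A hedged sketch (the surrounding setup is illustrative; only the index name and key mirror the discussion):

```java
// Hedged sketch: index a long numerically with ValueContext, then look it up
// with a numeric range query. graphDb is an embedded GraphDatabaseService;
// the transaction API is the 1.8-era beginTx()/success()/finish() style.
Index<Node> index = graphDb.index().forNodes( "NetworkNode" );
Transaction tx = graphDb.beginTx();
try
{
    Node node = graphDb.createNode();
    node.setProperty( "dbid", 123L );
    // without ValueContext.numeric() the value is indexed as a string
    index.add( node, "dbid", ValueContext.numeric( 123L ) );
    tx.success();
}
finally
{
    tx.finish();
}

// equality as a degenerate numeric range: 123 <= dbid <= 123
Node found = index.query( QueryContext.numericRange( "dbid", 123L, 123L ) ).getSingle();
```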

Bob Wilson

5 Mar 2013, 10:20
to ne...@googlegroups.com
Thanks Michael. As long as we can use numeric properties we'll work around the Cypher query limitation for now.

Another issue we are seeing: after loading some data using SDN, I'm trying to add a Spatial layer index. However, I keep getting the message "Unable-to-lock-store-problem" - only when trying to add the Spatial layer. I can add other nodes, and use the webadmin console, fine. I saw a similar thread (link below) that you were involved in, but did not see a resolution to this issue. Please advise if you have any ideas. Thank you.

http://forum.springsource.org/showthread.php?123899-Unable-to-lock-store-problem

Sanjay

19 Mar 2013, 1:56
to ne...@googlegroups.com
Data point: I faced this problem again today. When I removed write.lock from all Lucene indices, the NPE went away. My database was not cleanly shut down earlier.

Michael Hunger

19 Mar 2013, 2:53
to ne...@googlegroups.com
Can you share your db path?

Sent from mobile device

Sanjay

19 Mar 2013, 10:40
to ne...@googlegroups.com
Michael, it has no trailing slashes. I use the relative path ../../../db/bengi/graph/dataimport/graph.db

Michael Hunger

19 Mar 2013, 11:00
to ne...@googlegroups.com
Can you try to use a non-relative path to test?

B/c the exception happens in the file-path handling.

Cheers

Michael

(neo4j.org) <-[:WORKS_ON]- (@mesirii) -[:TAKES_CARE_OF]-> (you) -[:WORKS_WITH]->(@Neo4j)




Sanjay

19 Mar 2013, 11:16
to ne...@googlegroups.com
Already tried. Same result. You have the exception trace. A fix would be nice. Thanks.

Michael Hunger

19 Mar 2013, 11:38
to ne...@googlegroups.com
What version are you on now?

Sent from mobile device

Sanjay Dalal

19 Mar 2013, 12:47
to ne...@googlegroups.com
Neo4j 1.8.1

Java(TM) SE Runtime Environment (build 1.7.0_11-b21)
Java HotSpot(TM) Server VM (build 23.6-b04, mixed mode)

LSB Version: core-2.0-ia32:core-2.0-noarch:core-3.0-ia32:core-3.0-noarch:core-3.1-ia32:core-3.1-noarch:core-3.2-ia32:core-3.2-noarch:core-4.0-ia32:core-4.0-noarch
Distributor ID: Ubuntu
Description: Ubuntu 12.04.2 LTS
Release: 12.04
Codename: precise



Michael Hunger

19 Mar 2013, 14:05
to ne...@googlegroups.com
Will be part of 1.8.3

It should only happen if the store was not cleanly shut down, which renders it invalid for the batch inserter anyway.

Michael

Sanjay Dalal

19 Mar 2013, 14:08
to ne...@googlegroups.com
Thank you. 

Lasse Westh-Nielsen

20 Mar 2013, 5:10
to ne...@googlegroups.com
Sanjay,

Please be aware that we only officially support Oracle JDK 6 for the Neo4j 1.8 series.

 - Lasse