First, sorry for the long email...I have a lot of explaining to do so you have a good idea as to the project goals:
Your hunches are correct. Our data is currently housed in a SQL database. My goal with Neo4j is to create a READ database for read-only queries, pulling those away from our transactional database. Then, build an ad-hoc reporting engine (data mining) on top of that. For that reason, I want to try doing this by maintaining the pseudo-normalized data structure, and have the joins be the relationships.
ContactMembership is one of the more basic examples, and while I could use the ContactMembership matrix to apply labels to the Contact nodes, it can get very tricky very quickly. Right now, I'm trying to build an automated way of importing the data in bulk (and I think I'm at a good place with that so I can move to step 2 - update specific properties on nodes when data is updated in the transactional database).
Step 2 is where things can start to get tricky. If a contact's membership changes, I would need to remove the label from the contact (which I think, can be done without deleting the contact node altogether?). I think in the case of ContactMembership, it makes more sense to have that data be a label on the Contact entity. It get's a little tricky when contacts have multiple memberships (and those memberships have statuses and substatuses (like Active or inactive, etc). They are still part of that membership role, but may not be currently active in that role. Once I get the data updating using generic methods, I will start thinking more heavily about the model and what I want to store, where and how.
I'm going to build out graph-gist for you (had no idea this existed!).
As for creating the relationships when inserting the data in the first place, I would LOVE to do that, but there are a few caveats to that:
In the example of ContactMembership (which is just a join table), Not every contact is assigned a membership role, so I may miss some contact imports into the system, which could lead to problems later.
I would also have to create a query that passes in the contacts as a list param, the ContactMembership as another list param and any other related sub-tables like ContactMembershipType and ContactMembershipStatus. Then process all of these by looping through the ContactMembership table? I guess that would really only be 3 calls to the SQL database and then a lot of looping in the application. For performance, do you thin that would be faster than first doing a bulk insert of all related tables and sub-tables in the ContactMembership database? Right now, this is how I'm parsing things out (I'm going to draw this graph in ascii first, then figure out how to make a graph gist and send that later if that's alright):
ContactMembershipTable -[import]-> ContactTable -[import]-> ContactOccupationTable
- ContactId - ContactOccupationId - OccupationName
- ContactMembershipTypeId - ContactPrefixId - RenderingOrder
- ContactMembershipStatusId
- ContactMembershipSubStatusId
Above is a basic (first round) example of what my process does right now - described below
It takes the ContactMembershipTable -
1. Finds all the related sub-tables and their children
2. Imports all subtables
3. Relates the children to the parent tables
4. Imports the ContactMembership nodes (This is MUCH faster than passing in the ContactMembership table as a param and doing a foreach loop with multiple merges)
5. Iterates through each property that has a related node type and creates those relations to the primary table (in this case, Contact).
- So, skip ContactId (because we don't want to relate to ourselves...that's dumb)
- Relate Contact nodes by Id to the appropriate ContactMembershipType (and so on)
During the import, if it sees that the child table (POCO object) being imported has the property "Name", it assigns that as a property to the Relationship, so that I can query
(n)-[r:CONTACTMEMBERSHIPSTATUS]->()
where r.Name = 'Active'
Now, if I were to add the ContactMembershipType as a label to the contact, I could query (n:Player)-[r:CONTACTMEMBERSHIPSTATUS {Name:'Active'}]->() to return all active players. I have to be very careful doing this though, because there is a potential for label overlap. I need to be sure that there are no ContactMembershipTypes that have the same name as another primary node type. This is the primary reason why I haven't added the labeling logic based on xType table imports.
This entire process takes a little under 2 minutes (118 seconds) for ContactMembership and ContactAddress together, give or take, and is 100% automated now. All I do is pass in the matrix (i.e. ContactMembership or ContactAddress) and the PrimaryRelationNodeType (i.e Contact)
192320 Unique Nodes
1709926 Properties
7 RelationshipTypeId's
123909 Relationships
Right now, I'm building this as a C# console app, so that later I can pass in arguments to run targeted imports.
c:\ToNeo4j.exe -importMatrix ContactMembership -importChildren
or
c:\ToNeo4j.exe -relateOnly -source ContactPlayer -target Contact -sourceId ContactPlayerId -targetId ContactId
After all is said and done, I might build it as a windows form later so that I can have a GUI, listing all the POCO objects that can be imported and add flags to the import/relation queries. Then, after THAT is done, add a config option and release it as open source so others can migrate their own data to Neo4j (or make mine better). I'm primarily a web-front end developer so my c# code may not be as awesome as it could be.
If you can recommend a good method for importing the data and relating the nodes at the same time (using c# because I have never developed in JAVA), I'm all ears. I don't know what I don't know...but am more than happy to learn!
Here is the Console output:
(after reading through the output, I realized that I could also track relationships I've already created (not just POCO's inserted) so as to not do duplicate relations, shaving off more time)
This gets execute by running:
Program.cs
Migrator.processContactMembership();
Migrator.processContactAddresses();
Migrator.cs
public static class Migrator
{
private static INeo4jRepository neoRepo;
private static List<Type> alreadyInsertedTypes;
public static void initialize()
{
neoRepo = ObjectFactory.GetNamedInstance<INeo4jRepository>("NEO_RAIDER");
}
public static void processContactAddresses()
{
alreadyInsertedTypes = neoRepo.processRelationshipMatrix(new ContactAddress(), new Contact(), alreadyInsertedTypes);
}
public static void processContactMembership()
{
alreadyInsertedTypes = neoRepo.processRelationshipMatrix(new ContactMembership(), new Contact(), alreadyInsertedTypes);
}
public static void resetInsertTypesList()
{
alreadyInsertedTypes = null;
}
}
/* Console Output */
Process Starting
Starting to insert Contact items
[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++]
________________________________________________________________________________
- Total time for 42149 Contact items: 19303.7806ms
- Average time per batch (100): 45.3104265402844ms
Starting to insert ContactPrefixType items
[+]
________________________________________________________________________________
- Total time for 5 ContactPrefixType items: 29.0203ms
- Average time per batch (100): 29ms
Starting to relate Contact to ContactPrefixTypeId
________________________________________________________________________________
- Total time taken to relate Contact to ContactPrefixTypeId: 5163.7479ms
Starting to insert ContactOccupationCategory items
[+]
________________________________________________________________________________
- Total time for 29 ContactOccupationCategory items: 39.0269ms
- Average time per batch (100): 39ms
Starting to relate Contact to ContactOccupationCategoryId
________________________________________________________________________________
- Total time taken to relate Contact to ContactOccupationCategoryId: 6081.6246ms
Starting to insert ContactMembershipType items
[+]
________________________________________________________________________________
- Total time for 19 ContactMembershipType items: 30.9198ms
- Average time per batch (100): 30ms
Starting to insert ContactMembershipStatus items
[+]
________________________________________________________________________________
- Total time for 6 ContactMembershipStatus items: 21.6644ms
- Average time per batch (100): 21ms
//ContactMembershipSubStatusId relates to ContactMembershipStatus table so no need to import it twice
Already inserted data for ContactMembershipStatus.
Starting to insert ContactMembership items
[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++]
________________________________________________________________________________
- Total time for 50191 ContactMembership items: 13146.4121ms
- Average time per batch (100): 25.7111553784861ms
Processing (50191) relations for Contact --> ContactMembershipType
[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++]
________________________________________________________________________________
- Total time: 8171.7013
- Average time per batch (500): 80.8613891089109ms
Processing (50191) relations for Contact --> ContactMembershipStatus
[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++]
________________________________________________________________________________
- Total time: 7630.0961
- Average time per batch (500): 75.5192732673267ms
Processing (50191) relations for Contact --> ContactMembershipSubStatus
[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++]
________________________________________________________________________________
- Total time: 7440.6537
- Average time per batch (500): 73.6403237623762ms
//Starting ContactAddress Matrix Import
Already inserted data for Contact.
Already inserted data for ContactPrefixType.
Starting to relate Contact to ContactPrefixTypeId
________________________________________________________________________________
- Total time taken to relate Contact to ContactPrefixTypeId: 1314.7656ms
Already inserted data for ContactOccupationCategory.
Starting to relate Contact to ContactOccupationCategoryId
________________________________________________________________________________
- Total time taken to relate Contact to ContactOccupationCategoryId: 3706.0297ms
Starting to insert Address items
[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++]
________________________________________________________________________________
- Total time for 50243 Address items: 15247.0439ms
- Average time per batch (100): 29.8588469184891ms
Starting to insert Country items
[+]
________________________________________________________________________________
- Total time for 14 Country items: 20.4331ms
- Average time per batch (100): 20ms
Starting to relate Address to CountryId
________________________________________________________________________________
- Total time taken to relate Address to CountryId: 3989.2875ms
Starting to insert ContactAddress items
[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++]
________________________________________________________________________________
- Total time for 49664 ContactAddress items: 11474.8396ms
- Average time per batch (100): 22.6197183098592ms
Processing (49664) relations for Contact --> Address
[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++]
________________________________________________________________________________
- Total time: 7388.1813
- Average time per batch (500): 73.8421ms
Process Complete
________________________________________________________________________________
Total time 118642.5606ms