Re: [Neo4j] NEO4J: love the idea/concept, wondering if it is what I need...

Peter Neubauer

Sep 9, 2012, 7:12:35 AM

to ne...@googlegroups.com

Pat,
sounds like you are in the right place. I would love for you to put
some of this into a blog, maybe illustrate it with
http://console.neo4j.org/ and a setup at
http://console.neo4j.org/usage.html ? This looks like a very
interesting domain model!

Cheers,

/peter neubauer

G: neubauer.peter
S: peter.neubauer
P: +46 704 106975
L: http://www.linkedin.com/in/neubauer
T: @peterneubauer

Wanna learn something new? Come to @graphconnect.

On Sun, Sep 9, 2012 at 12:52 AM, Bogus Exception <pat.t...@gmail.com> wrote:
> Experts,
>
> The things I need to persist do not fit an RDBMS model well, and I am
> searching to find a model/system that will. The data has many hierarchies, I
> guess is the easiest way to sum it up. I prefer to think of it as aspects,
> but interpretations differ.
>
> Consider a container in a container ship. It:
>
> Is a container
> Is in a particular ship
> Was checked by a certain inspector/border person
> The ship has employees
> The container is on a route
> The route has a source & destination
> The captain of the ship has been in accidents, or other ships
> Has contents
> Those contents have come from multiple source locations
> ...
> ...
>
> In other words, I never will know what aspect/point of view/top or bottom of
> any hierarchy the sole container will be referenced from. Has that same
> container been on board more than one time with this particular captain? Has
> this container been on any ships that were taken over by pirates? Has either
> the sender/receiver ever been charged with trying to convey bad things?
>
> A room full of people, some who know each other, some who don't... Some are
> left-handed, some wear size 10 shoes... This kind of scenario, although
> illustrative, is actually far too simple for me, as we are always talking
> about a person object, and the relationships between them.
>
> I hope I'm explaining this correctly. So basically in our (seemingly) simple
> example, there are many attributes/aspects of investigating relationships.
> In fact, I'm looking for a solution that will make discovering relationships
> easier. By that, of course, I mean a persistence layer/engine that will
> easily allow me to programmatically discover relationships where none are
> known to exist.
>
> A graph seems to be the thing I would need. There are stubs possible, but in
> my need, there is always some kind of relationship with another something.
> So it isn't like the Matrix example where you want to know friends of
> friends... It's more like 'what commonalities, traits & patterns in data and
> relationships exist in the Matrix people?'...
>
> TIA!
>
> :)

Craig Taverner

Sep 9, 2012, 1:04:48 PM

to ne...@googlegroups.com

I think you are looking at exactly the right database for this type of domain. Not only do you have many different types of relationships (contains, employed by, checked, from, to, etc, etc.) but, more importantly, it looks like you might intend to add more over time as you enrich your model, and that is something that neo4j is especially convenient for. In your case the actual graph would look very, very much like your description below, which makes it very easy to maintain. Look Ma, no ORM.
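
As a rough illustration of how closely the graph can mirror the description, here is a minimal sketch in plain Python. It is not Neo4j's API; the `Graph` class, the node ids and the relationship names are all made up for the example.

```python
from collections import defaultdict


class Graph:
    """Tiny in-memory property graph: nodes carry properties, edges carry a type."""

    def __init__(self):
        self.nodes = {}                 # node id -> property dict
        self.out = defaultdict(list)    # node id -> [(relationship type, target id)]

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def relate(self, source, rel_type, target):
        self.out[source].append((rel_type, target))

    def neighbours(self, source, rel_type):
        return [t for r, t in self.out[source] if r == rel_type]


g = Graph()
g.add_node("container-42", kind="container")
g.add_node("ship-1", kind="ship")
g.add_node("inspector-7", kind="person")
g.add_node("route-9", kind="route", source="Rotterdam", destination="Singapore")

g.relate("container-42", "IS_ON", "ship-1")
g.relate("container-42", "CHECKED_BY", "inspector-7")
g.relate("ship-1", "SAILS", "route-9")

# A relationship type nobody planned for can be added later without a schema change.
g.add_node("crate-3", kind="contents")
g.relate("container-42", "CONTAINS", "crate-3")

print(g.neighbours("container-42", "IS_ON"))   # ['ship-1']
```

Each sentence of the container description becomes one `relate` call, which is the "no ORM" point in miniature.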

One subtlety I see in your domain model is the time dependence. You are describing a graph of a situation that is not static. Right now the ship might be on route x, and captained by mr t, but that will change. You could choose to simply modify the graph with time, removing and adding new relationships as needed, or you could also consider maintaining history, so you can traverse the graph backwards in time, or take snapshots. That can lead to some very interesting analyses indeed. In this case the fact that mr t has had accidents before might not be a property of mr t, or of the ships he captained, but instead is part of the graph structure itself. A traversal backwards of all ships mr t captained, looking for accidents, would be very fast, since a single captain only captains a limited number of ships in his lifetime.
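
One way to picture the "maintain history" option is to keep a validity interval on each relationship instead of deleting it. The sketch below is plain Python under that assumption; `Captaincy`, `since` and `until` are hypothetical names, not anything Neo4j prescribes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Captaincy:
    """A 'captained by' relationship that remembers when it was valid."""
    ship: str
    captain: str
    since: int                     # year the relationship started
    until: Optional[int] = None    # None means "still current"


history = [
    Captaincy("ship-1", "mr-t", since=2005, until=2009),
    Captaincy("ship-1", "ms-b", since=2009),
    Captaincy("ship-2", "mr-t", since=2009),
]


def captain_at(ship, year):
    """Return who captained `ship` in `year`, or None if nobody is recorded."""
    for rel in history:
        still_valid = rel.until is None or year < rel.until
        if rel.ship == ship and rel.since <= year and still_valid:
            return rel.captain
    return None


def ships_ever_captained(captain):
    """Every ship this captain has had, past or present."""
    return sorted({rel.ship for rel in history if rel.captain == captain})


print(captain_at("ship-1", 2007))      # mr-t
print(captain_at("ship-1", 2010))      # ms-b
print(ships_ever_captained("mr-t"))    # ['ship-1', 'ship-2']
```

Nothing is removed when the captain changes, so both "who is captain now" and "who was captain then" remain answerable.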

This gets us to one of the key advantages of the graph database. Searching for accidents related to a single captain is a small traversal over a small subgraph. In an RDBMS it would require a lookup over at least three entire tables (captains, ships and accidents, and also any join tables you might decide to have). That does not scale well.

The graph does not have this problem. :-)
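
To make the "small traversal" concrete, here is a sketch with adjacency lists in plain Python. The node names and relationship types are invented, and the model is deliberately simplistic (it does not check who was captain at the time of each accident).

```python
# Adjacency lists keyed by (node, relationship type). All names are made up.
edges = {
    ("mr-t", "CAPTAINED"): ["ship-1", "ship-2"],
    ("ms-b", "CAPTAINED"): ["ship-3"],
    ("ship-1", "INVOLVED_IN"): ["accident-a"],
    ("ship-2", "INVOLVED_IN"): ["accident-b", "accident-c"],
    ("ship-3", "INVOLVED_IN"): ["accident-d"],
}


def accidents_of(captain):
    """Follow CAPTAINED, then INVOLVED_IN, starting from one captain node."""
    visited = 0
    found = []
    for ship in edges.get((captain, "CAPTAINED"), []):
        visited += 1
        for accident in edges.get((ship, "INVOLVED_IN"), []):
            visited += 1
            found.append(accident)
    return found, visited


found, visited = accidents_of("mr-t")
print(found)     # ['accident-a', 'accident-b', 'accident-c']
print(visited)   # 5
```

The number of nodes touched depends on how much hangs off that one captain, not on how many captains, ships or accidents exist in total.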

Bogus Exception

Sep 9, 2012, 9:02:08 PM

to ne...@googlegroups.com

Peter,

I'd love to help! Not sure how, though... I don't mind setting up this scenario on my end, as you & Craig have said I'm in the right place.

We can take your idea offline if you'd prefer...

Thanks again!

pat
:)

Bogus Exception

Sep 9, 2012, 9:51:37 PM

to ne...@googlegroups.com

Craig,

Thank you very much for your long and thoughtful reply. I am so relieved that I may have actually found what I need! :) You see, I have been working for some years now on paper defining what I want to write as a system of programs, and I have been knocking my head trying to figure out how to do it in RDBMS... I have time series concerns, but they are probably not for this kind of engine/structure/framework.

[more inline below]

On Sunday, September 9, 2012 1:04:51 PM UTC-4, Craig Taverner wrote:

> I think you are looking at exactly the right database for this type of domain. Not only do you have many different types of relationships (contains, employed by, checked, from, to, etc, etc.) but, more importantly, it looks like you might intend to add more over time as you enrich your model, and that is something that neo4j is especially convenient for. In your case the actual graph would look very, very much like your description below, which makes it very easy to maintain. Look Ma, no ORM.

Well, this is the gist exactly. The container ship scenario is only representative, but it is one I like to use when explaining the concept as it's just simple enough that I usually don't get the glassy-eyed/deer-in-the-headlights stare we all know and hate. However, it isn't the most complicated, and after considering your excellent points below, I think you have uncovered another 'concern' with this graph approach, and that is the time domain.

> One subtlety I see in your domain model is the time dependence. You are describing a graph of a situation that is not static. Right now the ship might be on route x, and captained by mr t, but that will change. You could choose to simply modify the graph with time, removing and adding new relationships as needed, or you could also consider maintaining history, so you can traverse the graph backwards in time, or take snapshots. That can lead to some very interesting analyses indeed. In this case the fact that mr t has had accidents before might not be a property of mr t, or of the ships he captained, but instead is part of the graph structure itself. A traversal backwards of all ships mr t captained, looking for accidents, would be very fast, since a single captain only captains a limited number of ships in his lifetime.

Your example is excellent, but as we're just talking, let me try another scenario on you, one that is equally valid. Imagine I have something to track, say, terrorists or criminals, or foreigners, or malcontents here in the US (or any country, really). I only want to determine which bad guys I should be watching. I often use the example of 9/11, where the next day everyone knew everything about the terrorists (I'm exaggerating, but you get the idea).

The day before, all of this information was available. It was just lost in the noise. So you naturally ask yourself, "Why didn't we notice this behavior and at least keep a closer eye on them?", or whatever... And I think the answer is that there is so much noise, the signal/noise ratio makes all but the most dramatic events buried in mountains of data.

Enter the WizBangThingie! It will take all the mounds of data and spit out the top 10 people that are to be scrutinized! If you think about it, with data rates and storage going through the roof, we're tracking a LOT of things on a LOT of people, so this is as good an example as one in, say, a car manufacturer's system.

Now as I understand things (which is "loose" to say the least), a graph will contain the kind of relationship between nodes (things): X is in a Y; X is in love with Z; etc., which is terrific. I can imagine adding attributes to a Person() object relatively infrequently (location, born, height, ...). But their current situation is probably not going to generate any significant pointers/alerts. Rather, it is the historical relationships that start to add up to form a clearer picture of (in this fictional use case) intent. True, everyone on an airplane could have the intent to hijack it, but that isn't good enough.

I think one of the terrorist pilots had traveled to several states, got a ticket here, used a credit card there, went to flight school in Florida, boarded a flight with a 1-way ticket NOT to his original country of origin, etc., etc. BUT all of these things happened in the time domain, and simply representing his current location, or even buying a one-way ticket, isn't enough. And I hadn't even mentioned the relationship of the one terrorist's behavior to another being significant...(!)

So from this slightly more complex use case you can see that it seems not only the graph's ability to tie relationships at one point in time is valuable, it is also the chain of events that will create a pattern, or even a lack of pattern, that will be interesting. I have the logic to evaluate and in most cases search for such patterns (or lack of pattern among data that has a 'normal' pattern), and I can even predict what will happen next, or report abnormal correlations of like data, but I lack that kung fu that will be the foundation to even start to layer on top of it the time domain.

OK, enough theory. A search in the terrorist example might go something like...

1. Show the top X people non-US born that have traveled the most over the last Y months.
2. Show the top X people that have a criminal record in more than one state, and have traveled outside the US in the last Y months.
3. Of the people on this flight, show me (in order of suspicion/interest) the rank of most to least suspicious.
4. Taking the police radio for X-Town and converting to text, extract proper names/driver's license #s and produce a report of their suspicious activity for all time.
5. Tell me everything suspicious that person X has done in the last 3 years.
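
As a sketch of what the first query in the list above could look like once trips are stored as dated events linked to people, here is a plain-Python version. The data, the field names and the 30-day approximation of a month are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date, timedelta

people = {
    "p1": {"born_in": "US"},
    "p2": {"born_in": "FR"},
    "p3": {"born_in": "DE"},
}

# Each trip is an event with its own date, linked to the person who took it.
trips = [
    ("p1", date(2012, 8, 1)),
    ("p2", date(2012, 8, 3)),
    ("p2", date(2012, 8, 20)),
    ("p3", date(2012, 7, 15)),
    ("p3", date(2011, 1, 10)),   # too old to count below
]


def top_travellers(top_x, months_y, today):
    """Rank non-US-born people by number of trips in the last `months_y` months."""
    cutoff = today - timedelta(days=30 * months_y)   # rough month length
    counts = Counter(
        person for person, when in trips
        if when >= cutoff and people[person]["born_in"] != "US"
    )
    return counts.most_common(top_x)


print(top_travellers(2, 3, today=date(2012, 9, 9)))   # [('p2', 2), ('p3', 1)]
```

The point is only that the time filter applies to the events, while the "non-US born" filter applies to the person the events hang off.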

I can see that a graph is exceptional for relationships at a single point in time. And as I read about graph groups, I'm trying to figure out in my mind how to structure the data store. I will install and putz with neo4j, to be sure, but I also can see where an RDBMS might be useful, or at least a K/V NoSQL model for some aspects of data. Primarily I'm thinking of the time domain aspects of each object whose relationships the graph is representing.

To quote you again, Craig:

> That can lead to some very interesting analyses indeed. In this case the fact that mr t has had accidents before might not be a property of mr t, or of the ships he captained, but instead is part of the graph structure itself. A traversal backwards of all ships mr t captained, looking for accidents, would be very fast, since a single captain only captains a limited number of ships in his lifetime.

So originally, I envisioned a single massive table with everything being a simple "object", and each having a parent-child relationship to each other. What a mess! Each row would be an event of some sort, and that event would affect other objects. There just isn't a way to represent relationships in a "relational" database. I just can't have one row point to thousands of other rows... And searches/queries? Yikes!

Now above, Craig, you are saying something I have not gotten out of the docs. You say something very clever sounding in "In this case the fact that mr t has had accidents before might not be a property of mr t, or of the ships he captained, but instead is part of the graph structure itself." ... part of the graph structure itself. That sounds very intriguing. You see, there are some things that I can account/look for by virtue of an object having that attribute hard wired in. "# of accidents" = 3. That hardly tells me enough, but it does tell me that this capt has been in 3. I don't know if that was while he was captain, while he was crew (officer/conscripted), or even as a passenger. Each of those accidents is an object in and of itself, right? Each has a location, wordy human description/keywords, names, dates, other vessels, crew, charges, and a bunch of other stuff, but not so much that it couldn't be hard wired.
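
To contrast a hard-wired `# of accidents = 3` with accidents that are objects of their own, here is a small plain-Python sketch. The accident details and the `role` values are invented; the idea that the role sits on the relationship is an assumption about how one might model it.

```python
accidents = {
    "accident-a": {"year": 2003, "location": "Suez", "keywords": ["collision"]},
    "accident-b": {"year": 2008, "location": "Malacca", "keywords": ["grounding"]},
    "accident-c": {"year": 2011, "location": "Aden", "keywords": ["fire"]},
}

# (person, accident, role) triples: the role lives on the relationship itself.
involved_in = [
    ("mr-t", "accident-a", "passenger"),
    ("mr-t", "accident-b", "crew"),
    ("mr-t", "accident-c", "captain"),
]


def accidents_for(person, role=None):
    """Accident ids linked to `person`, optionally filtered by role."""
    return [
        accident for who, accident, r in involved_in
        if who == person and (role is None or r == role)
    ]


print(len(accidents_for("mr-t")))               # 3
print(accidents_for("mr-t", role="captain"))    # ['accident-c']
```

The count of 3 is still available, but it is derived from the structure, and the captain/crew/passenger question can be answered as well.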

But think of those keywords! Even the official text of the official's report is chock full of words, phrases and proper nouns that aren't even formal objects. If we're looking for relationships, then am I constrained by a graph/neo4j by only those relationships I come up with ahead of time, or have I found the Holy Grail in neo4j in that there is a programmatic way to take the grammatical components I would extract automatically, and define the kind of (ever growing) model that a graph excels and thrives in?

Sorry for the long post... But finally let's say that I captured from NLP identifiers for objects (people and places are frequent). Adding person names and location names as those kinds of objects isn't a far stretch as far as I can tell, provided there was already a person and location object pre-provided. But here is a question that I'm probably way ahead of where I should be asking, in that I wonder if I can write a series of programs to use a graph foundation to create the objects/nodes for me, and fill them in with data. Imagine giving a black box a series of inputs, and having it spit out a structure, relationships, and time domain history... Is this box made of unobtainium, or is this a worthy carrot with which to arm my stick?
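
A very small version of that black box might look like the sketch below: entities arrive as (label, name) pairs, nodes are created the first time they are seen, and each report is linked to what it mentions. This is plain Python with hypothetical names; the entity lists stand in for real NLP output, which is not shown.

```python
nodes = {}              # (label, name) -> property dict
relationships = set()   # (source key, relationship type, target key)


def get_or_create(label, name):
    """Return the node's key, creating the node the first time it is seen."""
    key = (label, name)
    nodes.setdefault(key, {"label": label, "name": name})
    return key


def ingest(report_id, entities):
    """Link a report node to every entity extracted from its text."""
    report = get_or_create("Report", report_id)
    for label, name in entities:
        relationships.add((report, "MENTIONS", get_or_create(label, name)))


def reports_mentioning(label, name):
    """Discover which reports share a mention of the same entity."""
    target = (label, name)
    return sorted(src[1] for src, rel, dst in relationships
                  if rel == "MENTIONS" and dst == target)


ingest("report-1", [("Person", "Arthur Dent"), ("Place", "Rotterdam")])
ingest("report-2", [("Person", "Arthur Dent"), ("Vessel", "Heart of Gold")])

print(len(nodes))                                   # 5
print(reports_mentioning("Person", "Arthur Dent"))  # ['report-1', 'report-2']
```

The "Vessel" label was never declared up front; it simply appears when the second report is ingested, which is the ever-growing model in miniature.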

Thanks to those who have endured this post to the end. I find this product and your past and current discussions extremely interesting and relevant to my work.

pat
:)

P.S. I apologize to those who got my point early on, and were frustrated to see me drone on, adding unnecessary detail! :)
P.P.S. For those drawn to short, succinct, can-be-taken-any-number-of-ways posts, you're best off deleting most everything I write! :)

Bogus Exception

Sep 11, 2012, 9:28:49 PM

to ne...@googlegroups.com

OK... Perhaps my last post was a lot to take in... :)

How about this:

How should time-series/time-domain data be integrated into a graph?

TIA!

pat
:)

Michael Hunger

Sep 12, 2012, 3:32:18 AM

to ne...@googlegroups.com

You can have a look at this: http://docs.neo4j.org/chunked/snapshot/cypher-cookbook-path-tree.html
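
For readers who cannot follow the link: assuming it describes a tree of year, month and day nodes with events hanging off the day level, the idea can be sketched in plain Python as nested dictionaries. The structure and names below are an illustration of that assumption, not a copy of the cookbook.

```python
from datetime import date

root = {}   # year -> month -> day -> list of event ids


def add_event(event_id, when):
    """Hang the event under its year/month/day branch, creating branches as needed."""
    months = root.setdefault(when.year, {})
    days = months.setdefault(when.month, {})
    days.setdefault(when.day, []).append(event_id)


def events_in_month(year, month):
    """Go root -> year -> month and collect that month's events in day order."""
    days = root.get(year, {}).get(month, {})
    return [event for day in sorted(days) for event in days[day]]


add_event("ticket-bought", date(2012, 9, 3))
add_event("flight-boarded", date(2012, 9, 9))
add_event("card-used", date(2012, 8, 30))

print(events_in_month(2012, 9))   # ['ticket-bought', 'flight-boarded']
```

A range query only touches the branches for the dates it asks about, so events from other years are never visited.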

Another option is to create a secondary index that links your nodes in the order of your time series:

Graphity is an example on how to take this to the extreme: http://www.rene-pickhardt.de/graphity-an-efficient-graph-model-for-retrieving-the-top-k-news-feeds-for-users-in-social-networks/
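
The linked-in-time-order idea can be sketched as a chain where each event points at the one before it. This is plain Python with made-up names (`Event`, `Timeline`), a simplification of the approach rather than Graphity itself, and it assumes events arrive in time order.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    description: str
    timestamp: int
    previous: Optional["Event"] = None   # plays the role of a PREVIOUS relationship


class Timeline:
    """Per-entity chain of events, newest first, so 'latest k' is a short walk."""

    def __init__(self):
        self.latest = None

    def append(self, description, timestamp):
        # Assumes events arrive in time order; out-of-order inserts are not handled.
        self.latest = Event(description, timestamp, previous=self.latest)

    def last(self, k):
        """Walk back from the newest event, collecting at most k descriptions."""
        out, node = [], self.latest
        while node is not None and len(out) < k:
            out.append(node.description)
            node = node.previous
        return out


timeline = Timeline()
timeline.append("enrolled in flight school", 1)
timeline.append("bought one-way ticket", 2)
timeline.append("boarded flight", 3)

print(timeline.last(2))   # ['boarded flight', 'bought one-way ticket']
```

Fetching the most recent k events costs k hops regardless of how long the full history is.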

Michael

Craig Taverner

Sep 14, 2012, 12:10:00 PM

to ne...@googlegroups.com

Hi Pat,

You're right, that was a long post. I got about 75% of the way through before I exceeded my time budget for reading and responding. That was a few days ago, so I think I'm ready to try and respond now :-)

The quick answer is: you are right, even about the holy grail, but there are provisos. While everything can be done, much like you describe, not everything is easy. For example, writing an NLP approach in the graph might be a very nice way of doing it, but that is also an area that is quite huge, and I for one would be reluctant to jump in feet first.

However, most of your other comments and suggestions about expanding the data into a graph are completely on track. The example of representing the 3 accidents as three nodes in the graph with their own meta-data and even sub-graphs, is definitely the right way to go.

Even your suggestion of writing a set of programs to incrementally enrich the model is completely sensible. It is very common in the neo4j world to incrementally improve the graph as you learn more about what you expect to get out of it. It can grow both from the addition of data as well as the addition of complexity.

One area I thought I saw a slip-up was in your description of the analyses where you were selecting sub-sets of results and filtering down to the ones you needed. This smacked of set theory and rdbms's, and I think we can do better. Build the graph so that the results are found through short graph traversals. As a contrived example, if you wanted to list people with particular combinations of starting letters for their first and last names, you could structure this in three ways, each more graphy than the previous:

1. List of nodes with firstname and lastname properties. You scan the list and test each node for a match to the letters you want.
2. Connect all firstnames starting with 'A' to a firstname-A node, those starting with 'B' to a firstname-B node, and so on. Do the same for all last names. You effectively have created a custom in-graph index designed specifically for your exact query. The graph could be traversed like root->firstnames->A->Arthur Dent, or root->lastnames->D->Arthur Dent. Now your query requires that you decide which to traverse to match on firstname and lastname. You traverse down the one route and back up the other to check for the other condition. If you separately traversed from root, and created two sets and took the intersection, you would be doing things very similar to an RDBMS. By traversing back up for the second condition, it is, IMHO, a little more graphy, and probably faster too (depending on which set is smaller).
3. The third option is to create index nodes for each combination of firstname/lastname letters. Then your query is the simplest traversal: root->A->AD->Arthur Dent. In the previous structure you had 2*26 index nodes. Now you have 26^2 index nodes, but a faster traversal. The index has changed from a 1 level tree to a 2 level tree.
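
The three options above can be compared side by side in a few lines of plain Python. Dictionaries stand in for the index nodes, the sample names are invented, and the membership test in option 2 only approximates "traversing back up" a real graph.

```python
from collections import defaultdict

people = ["Arthur Dent", "Ford Prefect", "Agrajag Doe", "Zaphod Beeblebrox"]


def initials(name):
    first, last = name.split()
    return first[0], last[0]


# Option 1: scan every node and test its properties.
def scan(first_letter, last_letter):
    return sorted(p for p in people if initials(p) == (first_letter, last_letter))


# Option 2: one index node per first-name letter and one per last-name letter.
by_first = defaultdict(set)
by_last = defaultdict(set)
for person in people:
    first, last = initials(person)
    by_first[first].add(person)
    by_last[last].add(person)


def down_and_back_up(first_letter, last_letter):
    # Go down the first-name branch, then check each hit against the other branch.
    return sorted(p for p in by_first[first_letter] if p in by_last[last_letter])


# Option 3: one index node per letter pair, e.g. root -> A -> AD -> Arthur Dent.
by_pair = defaultdict(set)
for person in people:
    by_pair[initials(person)].add(person)


def direct(first_letter, last_letter):
    return sorted(by_pair[(first_letter, last_letter)])


print(scan("A", "D"))               # ['Agrajag Doe', 'Arthur Dent']
print(down_and_back_up("A", "D"))   # ['Agrajag Doe', 'Arthur Dent']
print(direct("A", "D"))             # ['Agrajag Doe', 'Arthur Dent']
```

All three return the same answer; what changes is how many nodes have to be looked at to get there.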

In an RDBMS the index is also a tree, and one that self-balances as data is added, so it is generally faster for a wider range of cases. The reason the graph can be faster for you is that you build an index specific to your domain, not generalized to all possible queries. If you cannot design a specific index, then of course use Lucene. And in my contrived example above, adding the 'AD' to Lucene would actually be a good idea. But I used this as a way of explaining the principle. I hope you got the point :-)

The last thing I can comment on is the time domain. I see Michael already pointed you to some good sources there, and I agree with them, so I'm not sure a lot needs to be said. The only rule of thumb would be to make the graph match your understanding of the world. So in your case if a particular ship underwent a series of changes, these could be linked as a list of change events to the ship itself. I like to refer to the different models used by source code version control systems. One very simple model, as used by the old RCS system, is that you maintain a single node for the current state of the ship, and then a list of change nodes that explain only how to change back into the previous state. So perhaps you have ship(blue)->(painted from black to blue)->(cabin renovated)->(captain changed)->etc. And some of these change nodes might be linked to other objects, like the captain changed node should be linked to the previous and current captains.
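
The RCS-style model can be sketched as one dictionary for the current state plus a newest-first list of reverse changes. This is plain Python with made-up property values; in a graph, each entry of `reverse_changes` would be a change node chained to the next.

```python
ship = {"colour": "blue", "cabin": "renovated", "captain": "ms-b"}

# Newest change first. Each entry stores only how to get back to the state before it.
reverse_changes = [
    {"colour": "black"},      # undo "painted from black to blue"
    {"cabin": "original"},    # undo "cabin renovated"
    {"captain": "mr-t"},      # undo "captain changed"
]


def state_before(current, changes, steps):
    """Rebuild the state as it was `steps` changes ago without touching `current`."""
    state = dict(current)
    for change in changes[:steps]:
        state.update(change)
    return state


print(state_before(ship, reverse_changes, 1)["colour"])   # black
print(state_before(ship, reverse_changes, 3)["captain"])  # mr-t
print(ship["colour"])                                     # blue
```

Reading the current state is free, and older states cost one step per change walked back, which suits data where the present is queried far more often than the past.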

Well, I'm going on a bit now, so I'll stop. I was tempted to go through your mail and comment on each paragraph, but that would have resulted in a ten page response. My current response is certainly incomplete, but hopefully short enough to read ;-)

Tero Paananen

Sep 14, 2012, 12:47:59 PM

to ne...@googlegroups.com

Mr. Exception (sorry, couldn't resist),

I'm not sure whether you're doing this as more of a thought exercise
and/or personal project, but if you're not and are considering
creating something with a real world use, you should probably take a
look at the products from Palantir Technologies.

What you're describing is exactly what that company has been doing for
quite some time, and its products are very, very good at it from what
I've been told (I have experience in using their financial
product...timeseries data analysis/visualization, but not in their
government product...tracking stuff just like what you described).

-TPP

Wes Freeman

Sep 14, 2012, 1:26:18 PM

to ne...@googlegroups.com

Strangely enough, I don't think Palantir uses Neo4j (or a graph db) within their core technologies--please correct me if someone knows that's not true (they're not open source, so this is just hearsay). I think they really could take advantage of many aspects of a graph db for their products.

Ikanow's infinit-E is another "generic analytics engine" that does similar things to Palantir, and is open source, and does not use Neo4j (instead ElasticSearch/MongoDB/Hadoop). I plan to ask why the next time I see someone from Ikanow at the MongoDB meetup. It seems like 90% of the time they're doing some sort of analysis of a social graph, and just crunching through it with Hadoop or the ElasticSearch indexes, which point back to the raw data in MongoDB.

Wes

Tero Paananen

Sep 14, 2012, 1:30:17 PM

to ne...@googlegroups.com

On Fri, Sep 14, 2012 at 1:26 PM, Wes Freeman <freem...@gmail.com> wrote:
> Strangely enough, I don't think Palantir uses Neo4j (or a graph db) within
> their core technologies--please correct me if someone knows that's not true
> (they're not open source, so this is just hearsay). I think they really
> could take advantage of many aspects of a graph db for their products.

They don't.

They plug into any data store you have via an abstraction layer, so
you could use Neo4j for it as well, if you wanted to.

-TPP
