[Gremlin] Transactions and when not to use Indexes


Eddy Respondek

Aug 29, 2011, 5:15:54 AM
to gremli...@googlegroups.com
I've only found one mention of Neo4j transactions in Gremlin, here: http://lists.neo4j.org/pipermail/user/2010-January/002567.html. Can someone give an example of their usage?


I now understand indexes and how to use them, but I'm also wondering when you shouldn't use them. I'm thinking in terms of a website:

Say a user has a session, and in that session a small index (say 1,000 nodes) is created to speed up subsequent visits and destroyed at the end of the session.
or
What if instead the index was created for each individual user and kept permanently? Could you store a hundred thousand or even a million small indexes?

Are these valid use cases? Would the size of each index be too small to see any real performance benefit? Would the system be able to cope with keeping, creating, and destroying indexes that way?

Peter Neubauer

Aug 29, 2011, 5:53:24 AM
to gremli...@googlegroups.com
Eddy,
the choice between indexes and in-memory structures is a tradeoff along several dimensions. Indexes take time to create and consume IO, since they are persisted on disk, but they are normally only expensive the first time; after that, they are faster when dealing with bigger amounts of data. With millions of indexes, you are going to
1. create a LOT of files, which runs into IO problems;
2. duplicate the data, since it is stored in different shapes for the different indexes, increasing your storage, IO, and memory overhead;
3. have a harder time keeping all these indexes in sync with changed data (one update to a person might require 100 indexes to be updated), which slows down write performance.
In short, use indexes for predictable, bigger amounts of your data that you know you will need to query repeatedly in the same way (like structure at the beginning or end of traversals).

Caches or filters tend to be more appropriate when you don't need persistent structures and don't know much about how you are going to query the graph, since there is no penalty for different queries and no IO. However, they involve sequential scans of the data, which doesn't scale.

In short, use caches and filters/iterators for data that is highly unpredictable, not too big, and likely to change every time you access it (like structures in the middle of traversals).


HTH

Cheers,

/peter neubauer

GTalk:      neubauer.peter
Skype       peter.neubauer
Phone       +46 704 106975
LinkedIn   http://www.linkedin.com/in/neubauer
Twitter      http://twitter.com/peterneubauer

http://www.neo4j.org               - Your high performance graph database.
http://startupbootcamp.org/    - Öresund - Innovation happens HERE.
http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party.

James Thornton

Aug 29, 2011, 6:32:11 AM
to gremli...@googlegroups.com

On Monday, August 29, 2011 4:15:54 AM UTC-5, Eddy wrote:

Say a user has a session, and in that session a small index (say 1,000 nodes) is created to speed up subsequent visits and destroyed at the end of the session.
or
What if instead the index was created for each individual user and kept permanently? Could you store a hundred thousand or even a million small indexes?

Are these valid use cases? Would the size of each index be too small to see any real performance benefit? Would the system be able to cope with keeping, creating, and destroying indexes that way?

Hi Eddy -

Redis or memcached are ideal for this type of ephemeral data because they have features that will automatically expire data that hasn't been touched in a certain amount of time (see http://redis.io/topics/expire and http://code.google.com/p/memcached/wiki/FAQ#Item_Expiration).

For sessions, Redis may be better because you can persist it. This means session data isn't lost if the Redis instance is rebooted, so you don't get a run on the DB from a flood of requests having to renew all of the sessions.
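To make the expiry behavior concrete, here is a toy in-process sketch of the semantics Redis provides natively via EXPIRE (this is an illustration only; the class and names are hypothetical, and real code would just call the redis client with a TTL):

```python
import time

class TTLCache:
    """Toy in-process cache with Redis-style expiry (illustration only)."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # evict lazily on access
            return None
        return value

cache = TTLCache()
cache.set("session:42", {"user": "eddy"}, ttl_seconds=0.05)
assert cache.get("session:42") == {"user": "eddy"}  # still fresh
time.sleep(0.06)
assert cache.get("session:42") is None              # expired, as a session would
```

With real Redis you get this for free, plus the data survives in one place shared by all your app processes.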

You can use Redis or Memcached right now -- they're pretty easy to get up and running -- and we have been batting around ideas for incorporating multiget, which would allow you to request multiple elements at a time. This will be useful for cases where you run an expensive query and want to cache the IDs and lazy-load the elements as needed (see https://github.com/tinkerpop/rexster/issues/116).
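The cache-the-IDs-and-lazy-load idea can be sketched like this (everything here is hypothetical: `fetch_many` stands in for whatever multiget call ends up existing, and the batching is just one way to do it):

```python
# Sketch: cache only the IDs returned by an expensive query, then fetch
# the actual elements in small batches as they are needed.
def lazy_elements(ids, fetch_many, batch_size=2):
    """Yield elements for `ids`, requesting them `batch_size` at a time."""
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        for element in fetch_many(batch):  # one round trip per batch
            yield element

calls = []
def fetch_many(batch):
    """Stand-in for a real multiget; records each round trip it makes."""
    calls.append(list(batch))
    return [{"id": i} for i in batch]

cached_ids = [1, 2, 3, 4, 5]  # cheap to keep around; the elements are not
first_three = [e for _, e in zip(range(3), lazy_elements(cached_ids, fetch_many))]
assert [e["id"] for e in first_three] == [1, 2, 3]
assert calls == [[1, 2], [3, 4]]  # elements 5+ were never fetched
```

The point is that the expensive query runs once, and element fetches are deferred and batched instead of done per element.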

- James

Eddy Respondek

Aug 29, 2011, 10:50:17 AM
to gremli...@googlegroups.com
@Peter - That's pretty much what I thought but it's good to hear someone else confirm it :)

@James - Memcached is way down my list of things to do... but you're suggesting syncing parts of the graph DB with Redis? How would that work exactly? Either I update something in the graph and update Redis at the same time; or a new session is created, the first query goes to Gremlin, which populates Redis, you set that data to expire, and you use Redis for the remainder of the session. But for any updates to the graph, you would also have to make sure Redis gets updated too; otherwise you would have to wait for the session data to expire to see any changes. I only glanced at the docs, but does that make sense at all?

James Thornton

Aug 29, 2011, 12:03:35 PM
to gremli...@googlegroups.com
> You're suggesting syncing parts of the graphdb with Redis? How would that work exactly?
> I update something in the graph and at the same time Redis;

You're using Bulbs, right? 

The general idea would be to create a get() method like this:
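(A minimal sketch of the idea, with hypothetical names throughout: a plain dict plays the role of the Redis connection so the example is self-contained, and `FakeGraph` stands in for a Bulbs graph object.)

```python
import pickle

redis_client = {}  # stand-in for a real redis.Redis connection

class FakeGraph:
    """Stand-in for a Bulbs graph; real code would query the graph server."""
    def __init__(self):
        self.data = {1: {"name": "eddy"}}
        self.hits = 0  # counts how often the graph is actually queried
    def get_vertex(self, _id):
        self.hits += 1
        return self.data.get(_id)

graph = FakeGraph()

def get(_id):
    key = "vertex:%s" % _id
    cached = redis_client.get(key)
    if cached is not None:
        return pickle.loads(cached)                # cache hit: deserialize
    element = graph.get_vertex(_id)                # cache miss: query the graph
    if element is not None:
        redis_client[key] = pickle.dumps(element)  # populate the cache
    return element

def update(_id, element):
    graph.data[_id] = element
    redis_client.pop("vertex:%s" % _id, None)  # invalidate; next get() refetches

assert get(1) == {"name": "eddy"} and graph.hits == 1  # first call hits the graph
assert get(1) == {"name": "eddy"} and graph.hits == 1  # second call served from cache
update(1, {"name": "ed"})
assert get(1) == {"name": "ed"} and graph.hits == 2    # invalidation forces a refetch
```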


The get() method tries to get the element from Redis, and if it's not there, it queries the graph. 

It uses pickle (http://docs.python.org/library/pickle.html) to serialize Python objects to string representations that you can store in Redis -- a more general approach would be to serialize objects to JSON.

When you update an element in the graph, you remove it from Redis so subsequent gets will fetch the updated version.

Adding a caching layer to Bulbs is on my list of things to do -- go ahead and add a ticket to remind me if it's something you're going to need soon (https://github.com/espeed/bulbs).

- James
