Rexster Gremlin Business


stephen mallette

Jun 6, 2011, 6:40:31 PM
to Gremlin-users
There are currently three issues open with Rexster that relate to
language-specific adapters, binary protocols, and other very general
functionality around those concepts:

https://github.com/tinkerpop/rexster/issues/105
https://github.com/tinkerpop/rexster/issues/32
https://github.com/tinkerpop/rexster/issues/10

Though each of these issues is different, they share an overriding
goal: to provide language-specific client APIs that use Gremlin as
the core query language. It would make Rexster akin to MySQL, SQL
Server, etc., on the server, providing access to different databases
and allowing developers to work with concrete classes in their
language of choice without concerning themselves with REST and JSON
parsing.

Here are the basics of what Marko and I have discussed so far:

1. Use MsgPack to serialize data coming from Rexster. Seems like
there is good language support there.
2. Allow Gremlin state to be maintained by keeping the script engine
bindings available. The client will pass a reference to the bindings
with requests to have them loaded to the script engine.
3. There is some challenge in what Gremlin returns as output. MySQL
is closed over tables. Gremlin is closed over a universe of things.
4. There is some question as to where the service should live in
Rexster. The current GremlinExtension has its context specific to a
graph, vertex or edge, but that's not what is in mind for the client
APIs.
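Point 2 above might work something like the following sketch. This is
a minimal Python stand-in for the server side: the session-keyed
`sessions` map and the `evaluate` function are invented names for
illustration, not Rexster APIs, and a plain dict stands in for the
JSR-223 script engine bindings.

```python
# Sketch of point 2: per-session script engine bindings.
# A real implementation would live in Rexster and use ScriptEngine
# Bindings; here a plain dict stands in for them.

sessions = {}  # client-supplied session reference -> bindings dict

def evaluate(session_ref, script):
    """Run a script against the bindings tied to session_ref,
    creating the bindings on first use."""
    bindings = sessions.setdefault(session_ref, {})
    # exec mutates `bindings`, so variables survive across requests
    exec(script, {}, bindings)
    return bindings

# The first request defines state; a later request still sees it.
evaluate("abc123", "x = 1 + 1")
result = evaluate("abc123", "y = x * 10")
print(result["y"])  # 20
```

The point of the sketch is only that the client passes the same
session reference each time, and the server loads the matching
bindings into the script engine before evaluation.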

In this last item, GremlinExtension (and this is not to say that
GremlinExtension is the mechanism by which Rexster will expose this
functionality) is exposed as:

http://localhost/somegraphihave/tp/gremlin

That doesn't make sense in scenarios where one issues a Gremlin
statement like:

g = new TinkerGraph();

It occurred to us that statements like these are kind of like a DDL
for Rexster. Of course, Rexster is statically configured by
rexster.xml, so new graphs have no way of being exposed through
Rexster the way they would if someone issued a SQL Server DDL script
like:

CREATE DATABASE MyDb;
CREATE TABLE MyTable (MyTableID INT NOT NULL PRIMARY KEY, MyTableText
VARCHAR(32) NOT NULL);

Perhaps there needs to be a way to expose the newly created graph,
"g", in Rexster when that command is issued. Perhaps a
RexsterConfigurationContext could be made available in the script
engine bindings so that the following could be done:

g = new TinkerGraph();
rexsterConfigContext.addGraph(g, configurationOptions);

Or perhaps there needs to be a gremlin+rexster extension language
that handles DDL and Rexster configuration? Our collective minds
broke down at this point (or perhaps earlier), and I thought it would
be a good point to summarize, reflect, and take stock of what we had
discussed so far. Any thoughts or feedback on the approach, design,
functionality, etc. would be appreciated.

Best regards,

Stephen

Pierre De Wilde

Jun 7, 2011, 9:49:26 AM
to gremli...@googlegroups.com
Hey Stephen,

How is Gremlin state currently maintained in DogHouse?

Pierre


stephen mallette

Jun 7, 2011, 10:14:09 AM
to Gremlin-users
The Gremlin session is maintained with the session of the user's
browser.


Pierre De Wilde

Jun 7, 2011, 12:12:46 PM
to gremli...@googlegroups.com
The session ID is provided by the user's browser, but state is
maintained by the Rexster server (in-memory), so scalability issues
may arise at some point.

MessagePack may help, but it should be optional, keeping JSON as the
default. Rexster is primarily a REST server, so it should remain
stateless.

One rough idea for easily binding any client language to
Rexster/Gremlin is to exploit DogHouse's /exec door:

~$ curl -X POST -b "JSESSIONID=1234" -d "code=v=g.v(1)&g=tinkergraph" "http://localhost:8183/exec"
==>v[1]
~$ curl -X POST -b "JSESSIONID=1234" -d "code=v.name&g=tinkergraph" "http://localhost:8183/exec"
==>marko

The result is currently text-based (e.g. leading '==>' is included), but this may be improved.
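Until that improves, a thin client could strip the console prefix
itself. A minimal sketch (the response body below is canned rather
than fetched from /exec, and `parse_console_output` is an invented
helper name):

```python
# The DogHouse /exec door returns Gremlin-console-style text, e.g.
# "==>v[1]" per result. A client can peel off the "==>" prefix.
raw_response = "==>v[1]\n==>v[2]\n"

def parse_console_output(text):
    """Return the payload of each '==>' result line."""
    return [line[len("==>"):] for line in text.splitlines()
            if line.startswith("==>")]

print(parse_console_output(raw_response))  # ['v[1]', 'v[2]']
```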

Pierre



stephen mallette

Jun 7, 2011, 12:32:07 PM
to Gremlin-users
Pierre,

Regarding:

> MessagePack may help, but it should be optional, keeping JSON as the default.
> Rexster is primarily a REST server, so it should remain stateless.

I don't see us changing the Gremlin Extension in Rexster. It will
remain as-is with JSON and all. I think MsgPack (or other
serialization) could be implemented through a different extension (or
other new piece to the Rexster architecture) that features the things
I've itemized.

Using the Gremlin exposed in the Dog House is an interesting idea for
re-using what we have, but as you've found with the text return and
the included "==>", it is tweaked to work for the Gremlin Console in
Dog House.

Stephen


James Thornton

Jun 9, 2011, 5:58:05 PM
to gremli...@googlegroups.com
One approach to paging/caching query results would be to do it in a caching layer above Rexster. 

Memcached is commonly used for this; however, Memcached has a 1MB object limit (http://code.google.com/p/memcached/wiki/FAQ#Why_are_items_limited_to_1_megabyte_in_size?) so you couldn't store large result sets. 

One way to deal with this is to use a two-phase fetch -- the first request gets a list of element IDs and stores that in memcached, and then the client's paging system works off that list, fetching elements from Rexster in ~10-50 sized chunks as needed and caching each element.

When clients update an element in Rexster, they update the object in memcached first.

You can already tell Gremlin to return just the element IDs so the only additional piece needed would be a way to request a list of n elements from Rexster without making n requests. 
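That two-phase fetch could be sketched roughly as follows. A plain
dict stands in for memcached and canned data stands in for Rexster;
`run_query` and `get_page` are hypothetical names, and the real
multi-get would be one HTTP call to the proposed extension rather
than a dict lookup.

```python
# Two-phase fetch sketch: phase 1 caches the query's element IDs,
# phase 2 pages off that list, multi-fetching uncached elements.

cache = {}   # stands in for memcached
GRAPH = {i: {"id": i, "name": "element-%d" % i} for i in range(1000)}

def run_query(query_key):
    """Phase 1: store the full element-ID list under the query key."""
    ids = sorted(GRAPH)          # pretend this is the Gremlin result
    cache[query_key] = ids
    return ids

def get_page(query_key, offset, size=20):
    """Phase 2: take a slice of the cached ID list and fill in any
    cache misses with a single multi-get, not one call per element."""
    ids = cache[query_key][offset:offset + size]
    missing = [i for i in ids if ("elem", i) not in cache]
    for i in missing:            # in reality: one multi-get request
        cache[("elem", i)] = GRAPH[i]
    return [cache[("elem", i)] for i in ids]

run_query("recommended-news")
page = get_page("recommended-news", 0, 20)
print(len(page), page[0]["name"])  # 20 element-0
```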

Memcached can be clustered, has a binary protocol (http://code.google.com/p/memcached/wiki/MemcacheBinaryProtocol), is well understood, and most programming languages have client libraries for it.

Making Rexster memcached-friendly, instead of building in internal caching and a binary protocol, simplifies what Rexster needs to do. 

Other open-source caching options include Terracotta's Ehcache (http://ehcache.org/). This is what Hibernate uses. 

Ehcache works for a single JVM; for a distributed solution, you can
pair it with Terracotta (http://www.terracotta.org/) at $4K-$10K per
node, or the free and open-source version, Web Sessions
(http://www.terracotta.org/web-sessions/). JBoss's Infinispan
(http://www.jboss.org/infinispan), the successor to JBoss Cache, is
another option to consider.


Since it's harder to scale graph databases horizontally, does it make more sense to offload the caching responsibilities?


stephen mallette

Jun 10, 2011, 6:46:56 AM
to Gremlin-users
James, thanks for your input. I've used memcached with good success
on other projects in the past and would do so again. If I understand
what you are writing here:

> One way to deal with this is to use a two-phase fetch -- the first request
> gets a list of element IDs and stores that in memcached, and then the
> client's paging system works off that list, fetching elements from Rexster
> in ~10-50 sized chunks as needed and caching each element.

In the past (under the old Traversal model) cache items were keyed on
the request URI. The downside was that if the URI changed but
returned data that had previously been retrieved (and cached),
Rexster still had to go back to the source to get it. It sounds like
you're suggesting the cache be keyed on element identifiers, that the
first phase use the Gremlin Extension, and that the second phase use
some new service that provides multi-identifier fetching and caching.
I would imagine the greatest benefit in terms of caching would come
from the second phase of that operation.

I'm thinking MultiFetch in conjunction with the previously requested
Batching API

https://github.com/tinkerpop/rexster/issues/91

would make a nice set of Rexster-Kibbles. Building this kind of
functionality would help flesh out how caching should be exposed to
extension developers. Please let me know if I've strayed from your
line of thinking here. Thanks.


James Thornton

Jun 10, 2011, 10:43:08 AM
to gremli...@googlegroups.com

Here's how the memcached process flow I was describing might go:

01. User loads a Web page on a Client system that displays recommended news items
02. Client makes a request to Rexster for query results through the Gremlin Extension
03. Rexster returns to the Client a list of 1000 element IDs (but not the actual elements)
04. Client stores the list of element IDs from the query result in memcached
05. Client requests the first 20 elements in the list through the proposed MultiGet Extension on Rexster
06. Rexster returns to the Client the list of 20 elements
07. Client stores each of the 20 elements in memcached
08. Client returns a page that displays the first 10 query results to User (page uses data from memcached)
09. User clicks "next" to view the next 10 results
10. Client returns a page that displays the next 10 query results to User (page uses data from memcached)
11. Client requests the next 20 elements in the list through the proposed MultiGet Extension on Rexster
12. Rexster returns to the Client the list of 20 elements
13. Client stores each of the 20 elements in memcached
...and so on...

Under this model, Rexster is not involved in caching/paging directly, but makes it easy to add a memcached layer by providing MultiGet in the API. 

The potentially expensive recommendation query is run only once, and clients use the list of element IDs in memcached for paging. Memcached machines are easy to cluster and scale out horizontally, and this type of architecture is used by sites like Facebook (http://www.facebook.com/note.php?note_id=23844338919).

Thoughts?




stephen mallette

Jun 10, 2011, 1:12:57 PM
to Gremlin-users
I understand now. This sounds in line with leaving it to developers
to decide on their own caching approach. All you need for it to be
efficient is a MultiGet Extension for graph elements.

James Thornton

Jun 10, 2011, 1:41:32 PM
to gremli...@googlegroups.com
If you want to make the memcached layer transparent to the users, I think you can do that with Infinispan...

"Starting with version 4.1, Infinispan distribution contains a server module that implements the memcached text protocol. This allows memcached clients to talk to one or several Infinispan backed memcached servers. These servers could either be working standalone just like memcached does where each server acts independently and does not communicate with the rest, or they could be clustered where servers replicate or distribute their contents to other Infinispan backed memcached servers, thus providing clients with failover capabilities."

James Thornton

Jun 10, 2011, 3:09:57 PM
to gremli...@googlegroups.com
Regarding an internal second-level cache in Rexster, can an optimal internal-caching strategy be generalized for all supported databases? 

For example, the Neo4j docs on caching say "always assume the graph is in memory" and "second-level caching should be avoided to greatest extend possible since it will force you to take care of invalidation which sometimes can be hard" (http://wiki.neo4j.org/content/Guidelines_for_Building_a_Neo_App#Assume_everything.27s_automatically_persistent_.28Neo4j_will_make_it_so.29 and http://wiki.neo4j.org/content/Neo4j_Performance_Guide#Second_level_caching). But this may not be the case for other databases. 


Would a second-level cache in Rexster be fighting Neo4j's cache for resources?

What are the strategies for scaling Rexster?

How easy would it be to disable a second-level cache in Rexster?


stephen mallette

Jun 10, 2011, 4:19:19 PM
to Gremlin-users
I think a real strength of Rexster is its ability to be configured
and extended. I don't imagine that Rexster would ever be designed to
enforce one caching strategy or another. With that overriding design
goal in mind, a minimal implementation of an internalized caching
system would have to at least include the ability to turn it on and
off.

Regarding Rexster and scaling, I don't really know the full answer.
Rexster is built on Grizzly, and I think you can find a lot of
information out there on that topic. I actually read something the
other day that explained how to run a cluster of embedded Grizzly web
servers with mod_jk/apache.


James Thornton

Jun 10, 2011, 10:11:31 PM
to gremli...@googlegroups.com
Well, one way to implement a simple query cache would be to expose a /gremlin_cache URL, or pass something like a cache=3600 param to /gremlin to tell Rexster to cache the query results for 3600 seconds.

And then Rexster returns the results (or a subset of the results) with a hash key that identifies the query/result set. Subsequent requests to Rexster include the hash key as a param to indicate they want to pull from the cached set.
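A minimal sketch of that idea, with a dict standing in for Rexster's
server-side store; the names `cache_results` and `fetch_page` are
invented for illustration, and the hash-plus-TTL scheme is just the
scheme described above, not an existing Rexster API.

```python
# Sketch of the cache=3600 idea: hash the query, store the result
# set under that hash for `ttl` seconds, and return the hash key so
# subsequent requests can page the cached set.
import hashlib
import time

_store = {}  # hash key -> (expires_at, results)

def cache_results(script, results, ttl=3600):
    """Store results under a key derived from the query text."""
    key = hashlib.sha1(script.encode("utf-8")).hexdigest()
    _store[key] = (time.time() + ttl, results)
    return key

def fetch_page(key, offset, size):
    """Page from a cached result set, evicting it if the TTL passed."""
    expires_at, results = _store[key]
    if time.time() > expires_at:
        del _store[key]
        raise KeyError("cache entry expired: %s" % key)
    return results[offset:offset + size]

key = cache_results("g.v(1).out('knows')", ["vadas", "josh"], ttl=3600)
print(fetch_page(key, 0, 1))  # ['vadas']
```

One design consequence: because the key is derived from the query
text, two clients issuing the same Gremlin script share one cached
result set for free.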





James Thornton

Jun 28, 2011, 6:47:39 PM
to gremli...@googlegroups.com
Redis is one option for caching that we haven't mentioned yet. Salvatore just posted this article, "How to take advantage of Redis by just adding it to your stack" (http://antirez.com/post/take-advantage-of-redis-adding-it-to-your-stack.html).

For session stores, Redis is a good option because the data can be persisted so that you don't log your users out or flood the DB with requests if one of the cache nodes is rebooted -- these are some of the reasons the memcached FAQ recommends against using it for sessions even though "everyone does it" (http://code.google.com/p/memcached/wiki/NewProgrammingFAQ#Why_is_memcached_not_recommended_for_sessions?_Everyone_does_it!).

Another option is membase (http://www.couchbase.org/membase) -- an elastic, persistent store that uses the memcached protocol.





