Problem with batch loading into titan through Rexster's gremlin extension using PHP

517 views
Skip to first unread message

Filip Melík

unread,
Mar 5, 2014, 7:45:28 AM3/5/14
to aureliu...@googlegroups.com
Hello gentlemen,
I try to load a bunch of data (millions of vertices and some edges between them) from mysql into titan with PHP script.
To get an idea, my use case is storing customer's orders. ID of the order is one vertex [ORDER] and then I create [PRODUCT] vertices which customer bought in this order and connect them with the [ORDER] vertex. And I have approx ~4 million of orders, so the graph will result in millions of vertices.

Since I have to use PHP, which is not a java-enabled language and there is no PHP library for comunicating with Titan as far as I know, I cannot use BatchGraph.

What are the possibilities of loading this amount of data from mysql into titan, but in chunks?

I tried the following:

1] I wrote a PHP class for communicating with Rexster REST API with help of these methods https://github.com/tinkerpop/rexster/wiki/Basic-REST-API, but there is no way of querying or creating multiple vertices/edges at once, the API allows only to request/ create/ update only one edge/vertex at once. So with helps of these API methods I came with this program (pseudocode):

check if vertex with this orderID already exist (to avoid duplicates) -> if no, create it and get ID of this newly created order vertex
for each product associated with the order {
    check if the product vertex already exist -> if yes , get its ID, if no create it and get its ID
    connect order vertex and product vertex with edge
}

for example for order with 5 products transalates to following REST calls:
[ORDER VERTEX]
check if exist - GET API_URL/graphs/xsgraph/vertices?key=orderId&value=4598995
create -  POST API_URL/graphs/xsgraph/vertices/
set properties - POST API_URL/graphs/xsgraph/vertices?key=orderId&value=4598995&key=time&value=123456789

[PRODUCT VERTICES] - this is repeated 5 times
check if exist - GET API_URL/graphs/xsgraph/vertices?key=productId&value=9999
create -  POST API_URL/graphs/xsgraph/vertices/
set properties - POST API_URL/graphs/xsgraph/vertices?key=productId&value=4598995

it is 18 REST calls in total for one order in worst case which is incredibly slow

2] i modified my code to use Rexster gremlin extension for creating and assingning the properties to newly created  vertex with this query g.addVertex(null,["orderId":123456,"time":987987]);
however, since this change i got OutOfMemory Exception so I tried to change in the java heap size (-Xmx switch) from 512M to 8000M but it still gives me the same exception.


a] Is there some sort of memory leak or something? or the gremlin extension cannot handle this amount of requests?
b] Is there some way to add multiple vertices with properties at once so I would not make this amount of REST requests at once? and the samo for edges?

here is one of the stack traces:

Caused by: java.lang.OutOfMemoryError: PermGen space
142125977 [Grizzly(1)] ERROR com.tinkerpop.rexster.GraphResource  - It would be smart to trap this this exception within the extension and supply a good response to the user:PermGen space
java.lang.OutOfMemoryError: PermGen space
142133826 [Grizzly(2)] ERROR com.tinkerpop.rexster.GraphResource  - Dynamic invocation of the [tp:gremlin+*] extension failed.
java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at com.tinkerpop.rexster.AbstractSubResource.invokeExtension(AbstractSubResource.java:322)
        at com.tinkerpop.rexster.AbstractSubResource.invokeExtension(AbstractSubResource.java:229)
        at com.tinkerpop.rexster.GraphResource.executeGraphExtension(GraphResource.java:281)
        at com.tinkerpop.rexster.GraphResource.getGraphExtension(GraphResource.java:224)
        at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
        at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
        at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
        at com.codahale.metrics.jersey.InstrumentedResourceMethodDispatchProvider$TimedRequestDispatcher.dispatch(InstrumentedResourceMethodDispatchProvider.java:30)
        at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
        at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
        at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
        at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
        at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
        at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1511)
        at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1442)
        at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1391)
        at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1381)
        at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
        at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:538)
        at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:716)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:770)
        at org.glassfish.grizzly.servlet.FilterChainImpl.doFilter(FilterChainImpl.java:147)
        at org.glassfish.grizzly.servlet.FilterChainImpl.invokeFilterChain(FilterChainImpl.java:106)
        at org.glassfish.grizzly.servlet.ServletHandler.doServletService(ServletHandler.java:252)
        at org.glassfish.grizzly.servlet.ServletHandler.service(ServletHandler.java:188)
        at org.glassfish.grizzly.http.server.HttpHandler.doHandle(HttpHandler.java:164)
        at org.glassfish.grizzly.http.server.HttpHandlerChain.service(HttpHandlerChain.java:196)
        at org.glassfish.grizzly.http.server.HttpHandler.doHandle(HttpHandler.java:164)
        at org.glassfish.grizzly.http.server.HttpServerFilter.handleRead(HttpServerFilter.java:175)
        at org.glassfish.grizzly.filterchain.ExecutorResolver$9.execute(ExecutorResolver.java:119)
        at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeFilter(DefaultFilterChain.java:265)
        at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeChainPart(DefaultFilterChain.java:200)
        at org.glassfish.grizzly.filterchain.DefaultFilterChain.execute(DefaultFilterChain.java:134)
        at org.glassfish.grizzly.filterchain.DefaultFilterChain.process(DefaultFilterChain.java:112)
        at org.glassfish.grizzly.ProcessorExecutor.execute(ProcessorExecutor.java:78)
        at org.glassfish.grizzly.nio.transport.TCPNIOTransport.fireIOEvent(TCPNIOTransport.java:815)
        at org.glassfish.grizzly.strategies.AbstractIOStrategy.fireIOEvent(AbstractIOStrategy.java:112)
        at org.glassfish.grizzly.strategies.LeaderFollowerNIOStrategy.executeIoEvent(LeaderFollowerNIOStrategy.java:102)
        at org.glassfish.grizzly.strategies.AbstractIOStrategy.executeIoEvent(AbstractIOStrategy.java:88)
        at org.glassfish.grizzly.nio.SelectorRunner.iterateKeyEvents(SelectorRunner.java:398)
        at org.glassfish.grizzly.nio.SelectorRunner.iterateKeys(SelectorRunner.java:368)
        at org.glassfish.grizzly.nio.SelectorRunner.doSelect(SelectorRunner.java:334)
        at org.glassfish.grizzly.nio.SelectorRunner.run(SelectorRunner.java:264)
        at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:567)
        at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:547)
        at java.lang.Thread.run(Thread.java:744)


thanks for your time!

Stephen Mallette

unread,
Mar 5, 2014, 8:05:37 AM3/5/14
to aureliu...@googlegroups.com
Loading that data is best done through BatchGraph (especially for a graph of the size you mentioned).  Write a script in Groovy to do the work and execute it with the Gremlin REPL.  Save the REST/Rexster stuff for your front-end PHP work.  It is super simple to work with a SQL database and a graph in the REPL...You can get an idea how to do it here:


As far as the OutOfMemoryException goes and the Gremlin Extension, I'm partially baffled.  Marko and I have tried to fix that permgen issue more times than I can count.  We even have tests that try to "blow the heap"...and the bottom line is that it looks like you can still somehow do it.  You can do two things to fix it:

1. The recommended fix would be to use parameterized requests.  (https://github.com/tinkerpop/rexster/wiki/Gremlin-Extension#wiki-script-engine-bindings)
2. If for some reason you can't use parameterized requests, set to the <script-engine><reset-threshold> to something like 1000 (or whatever your available memory will allow). 

This information is mostly to help you for your future development with Rexster as you build your application.  For your one time batch loading effort, please go with a simple Groovy script.  You will save a lot of time.

HTH, 

Stephen




--
You received this message because you are subscribed to the Google Groups "Aurelius" group.
To unsubscribe from this group and stop receiving emails from it, send an email to aureliusgraph...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Filip Melík

unread,
Mar 6, 2014, 9:32:50 AM3/6/14
to aureliu...@googlegroups.com
Fantastic Stephen, thanks for help!

I used the approach with loading data with mysql JDBC driver and it works. However, another problem has emerged, probably with indexes. Problem is that when doing batch import via the groovy script, it seems like the elements are not being indexed via external elastic search index. The code I used for making the index in gremlin CLI:

// create index on vertex property key questionnaire against native index and also ES index
g.makeKey("orderId").dataType(Integer.class).indexed(Vertex.class).unique().indexed("search",Vertex.class).make();
g.makeKey("time").dataType(Long.class).indexed(Vertex.class).indexed("search", Vertex.class).make();
g.commit();

Then I made the import of 300k of vertices like this:

import groovy.sql.Sql;
sql = Sql.newInstance("jdbc:mysql://server", "user","pass");
g = TitanFactory.open('/opt/titan-server-0.4.1/conf/titan-cassandra-es.properties');
x=[] as Set;
sql.eachRow("SELECT aaaa FROM bbb"){x.add(it.questionnaireId)};
x.each{g.addVertex(null,[type:"orderId",orderId:it,time:666666])};
g.commit()

the vertices are created, i can see them through Rexster doghouse, but when i try to query one of them from gremlin or rexster gremlin plugin using index query with
g.V("orderId",7747142)

nothing is found and I also see that elasticsearch index has still the same size on hard drive, so it is not indexing them.
I tried restarting the Titan server but no luck.

Is anything am I doing wrong?

Thanks


Dne středa, 5. března 2014 14:05:37 UTC+1 stephen mallette napsal(a):

Daniel Kuppitz

unread,
Mar 6, 2014, 10:21:10 AM3/6/14
to aureliu...@googlegroups.com
Hi Filip,

can you verify that there's ANY vertex? Does a g.V()[0..<10].map() show 10 vertices with the expected properties?
To me it looks like you're just using a wrong graph configuration in your Gremlin REPL. That doesn't explain why the Gremlin plugin is also not working, but.. well ... no other idea so far.

Cheers,
Daniel

Filip Melík

unread,
Mar 7, 2014, 2:01:29 AM3/7/14
to aureliu...@googlegroups.com
Hello Daniel and thanks for response,
when supplying the gremlin query you posted, the vertices with expected properties are there. So I suppose I am using the correct graph config in REPL. However I have one question regarding indices:
Have you noticed that I created the index against both the Titan native indexing engine but also against the Elastic search engine (named "search")?

g.makeKey("orderId").dataType(Integer.class).indexed(Vertex.class).unique().indexed("search",Vertex.class).make();
How then Titan knows, which one to use and why? Is not the problem that it possibly ignores the external indexing via ES as long as it is registered against Titan native index?

thanks for any ideas

Dne čtvrtek, 6. března 2014 16:21:10 UTC+1 Daniel Kuppitz napsal(a):

Daniel Kuppitz

unread,
Mar 7, 2014, 3:18:11 AM3/7/14
to aureliu...@googlegroups.com
No, an index in Titan and ES works fine, Titans query optimizer will figure out when to use which one. Anyway, g.V("orderId", 7747142) should definitely use the Titan index.
I tried to reproduce the problem with the following code:

conf = new BaseConfiguration() {{
  setProperty("storage.backend", "cassandra");
  setProperty("storage.hostname", "localhost");
  setProperty("storage.keyspace", "filip");
  setProperty("storage.index.search.backend", "elasticsearch");
  setProperty("storage.index.search.directory", "/tmp/filip/es");
  setProperty("storage.index.search.client-only", false);
  setProperty("storage.index.search.local-mode", true);
}}

g = TitanFactory.open(conf)
g.makeKey("orderId").dataType(Integer.class).indexed(Vertex.class).unique().indexed("search",Vertex.class).make()
g.makeKey("time").dataType(Long.class).indexed(Vertex.class).indexed("search", Vertex.class).make()
g.makeKey("type").dataType(String.class).make()
g.commit()

(1..300000).each({ orderId ->
  g.addVertex(["type": "orderId", "orderId": orderId, "time": 666666]);
  if (0 == orderId%10000) {
    g.commit(); println orderId
  }
}); g.commit()

...but everything works just as expected:

gremlin> v = g.V("orderId", (int)(Math.random() * 300000)).next()
==> v[863952]
gremlin> v.map()
==> time=666666
==> type=orderId
==> orderId=215988

I really have no idea why it isn't working for you.

Cheers,
Daniel



For more options, visit https://groups.google.com/d/optout.

Daniel Kuppitz

unread,
Mar 7, 2014, 3:30:12 AM3/7/14
to aureliu...@googlegroups.com
One more thing. Just to make sure that you didn't mix up orderIds with vertex ids or something, please do a full graph scan to verify that the orderId actually exists:

g.V().filter({ it.getProperty("orderId") == 7747142 })

Cheers,
Daniel

Filip Melík

unread,
Mar 7, 2014, 4:18:55 AM3/7/14
to aureliu...@googlegroups.com
I run the code you provided, just changed the ES index direcotory to setProperty("storage.index.search.directory", "../db/es"); since I am using the Titan+Rexster+ES bundle.
the nodes were successfully created with all the properties and I can even find them with this query: g.V("orderId", (int)(Math.random() * 300000)).next().map

So it seems that the standard Titan index works fine, but it is not indexed by Elastic search at all. I downloaded the http://www.elastichq.org/ ES cluster manager, and when connectiong to the ES, I can see index called "titan" in it, but with no data at all (0 documents), so it seems that the properties we registered to be indexed (i.e. orderId) is not indexed at all, nor any other property.

Maybe I am just interpreting something wrongly, but I cannot get it to work. I have checked the manual Using-Elastic-Search but without any luck what might be wrong.
Any idea what I should check next?

thanks for your patience.


Dne pátek, 7. března 2014 9:30:12 UTC+1 Daniel Kuppitz napsal(a):

Daniel Kuppitz

unread,
Mar 7, 2014, 4:58:09 AM3/7/14
to aureliu...@googlegroups.com
So your assumption that ES is not used comes from the fact that you can't see anything in Elastic HQ? You haven't tried any ES query? If so, please try:

g.indexQuery("search", "+v.orderId:123456 +v.time:[600000 TO *]").vertices().collect({ it.getElement() })

This query works for me (and is definitely using the ES index).

Cheers,
Daniel

Filip Melík

unread,
Mar 7, 2014, 7:50:21 AM3/7/14
to aureliu...@googlegroups.com
So your assumption that ES is not used comes from the fact that you can't see anything in Elastic HQ
yes, because when I tried to add the vertices through the Rexster REST API and then looked into elastic HQ, the data there was visible.

It seems like I have found the problem and partial solution:

When i tried the g.indexQuery("search", "+v.orderId:123456 +v.time:[600000 TO *]").vertices().collect({ it.getElement() }) from the gremlin CLI (gremlin.sh), it has found the node successfully. But when I tried the same query from the Rexster's Doghouse gremlin console it found nothing. So I tried next thing and queried the gremlin CLI as such: g.V("orderId",123456) and it found the node. When I tried it from the Rexster gremlin console it found nothing. But when i clicked on the "Browse Vertices" in Rexster Doghouse, the all nodes were visible there with the expected properties {type,orderId,time}. So in the Rexster Gremlin console I tried g.v().count() which gave me 300,000, which are all nodes I imported with the script you provided above.

All this made me think that the Rexster somehow does not see the indices I have created in Gremlin CLI (gremlin.sh). So I cleared the graph to have fresh start, restarted titan server, and tried to run the commands for creating indices and importing data from the Rexster Doghouse's gremlin console. Everything worked - i can query the graph through Rexster REST Gremlin extension, and everything works. I even see that the data are indexed in Elastic HQ now.
BUT:  oppositely, now i cannot query the graph through the gremlin CLI, just from Rexster's gremlin console. Do not know why is that... It seems like each console is using slightly different settings because of this behaviour.

I want to query graph only through Rexster REST API, so it solved my problem, but i consider it only partially solved, because I do not know why it is behaving like that, any ideas?

Anyway, million thanks to Stephen and Daniel for making me figure out how to solve it.



Dne pátek, 7. března 2014 10:58:09 UTC+1 Daniel Kuppitz napsal(a):

Stephen Mallette

unread,
Mar 7, 2014, 8:05:27 AM3/7/14
to aureliu...@googlegroups.com
Filip, are you using embedded ES?  If so that would explain your problem.  If the ES is embedded in Rexster you won't see it from the Gremlin REPL and if embedded from Gremlin REPL it won't be available to Rexster.  You would need to use an external ES to allow for that.

Filip Melík

unread,
Mar 7, 2014, 8:10:19 AM3/7/14
to aureliu...@googlegroups.com
Filip, are you using embedded ES?  If so that would explain your problem. 
For tests I am using the Titan server bundle downloaded from titan website (Titan+Rexster+Cassanda+ES) so I think I am using the embedded ES and then, this explains everything.

Thanks for clarification!

 

vaclav77

unread,
Mar 24, 2014, 3:54:59 PM3/24/14
to aureliu...@googlegroups.com
Hey Filip,

Have you tried rexpro-php? https://github.com/PommeVerte/rexpro-php
I installed the whole Titan stack two weeks ago and have been using this connector. It's pretty much like Gremlin shell but I haven't tested transactions yet. It supports sessions so you can execute a call, return to your PHP code to follow your application logic and then go back to execute a closing call to Titan through Gremlin/Groovy - I think it's very close to what you would do directly under Java based app.

Vacek
Reply all
Reply to author
Forward
0 new messages