gremlin-python tutorial


Wolfgang Fahl

Sep 21, 2019, 5:29:57 AM9/21/19
to Gremlin-users
At http://wiki.bitplan.com/index.php/Gremlin_python I have started a mini-tutorial for gremlin-python.


New issues are welcome. In particular I'd like to collect more positive results for http://wiki.bitplan.com/index.php/Gremlin_python#Connecting_to_Gremlin_enabled_graph_databases, so you are invited to share your experience.



my latest comment there is:

IMHO there is a need to get the following issues solved by a companion API to the GLVs, which would then be available not only to gremlin-python but also to the other language variants:

  • selecting a provider
  • selecting a graph e.g. by name/alias
  • listing available graphs
  • authenticating
  • getting meta-information
  • checking the availability of a server
  • ...

In traditional database APIs like JDBC these needs are all handled by some means or another. For GLVs this is IMHO much harder than necessary. E.g. if you want to use data from two or three graph databases at the same time it gets very awkward. I have many use cases where I need ad-hoc in-memory graph databases and only the computational results should be stored in another graph database that is backed by some provider. Should there be a ticket for each of the improvement wishes, or are any of these things already addressed somewhere? Should the discussion first happen in a forum before such a ticket is refined? I don't know what the proper procedure is to get to an improved version of TinkerPop ...
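To make the wish list concrete, here is a minimal sketch of what such a companion API might look like. Everything in it is hypothetical — the class name, method names and defaults are made up for illustration and are not part of any existing TinkerPop API:

```python
import socket


class GraphServerInfo:
    """Hypothetical companion API covering the wishes above:
    selecting a provider, building the connection URL, and
    checking the availability of a server."""

    def __init__(self, host="localhost", port=8182, provider="tinkergraph"):
        self.host = host
        self.port = port
        self.provider = provider  # e.g. "tinkergraph", "orientdb", "neo4j"

    def url(self):
        # the websocket endpoint a GLV driver would connect to
        return "ws://%s:%d/gremlin" % (self.host, self.port)

    def is_available(self, timeout=0.5):
        # cheap TCP-level availability check - no Gremlin handshake involved
        try:
            with socket.create_connection((self.host, self.port), timeout=timeout):
                return True
        except OSError:
            return False
```

Listing graphs, aliases and meta-information would need server-side support and is exactly the part that currently differs per provider.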






Wolfgang Fahl

Sep 21, 2019, 5:40:07 AM9/21/19
to Gremlin-users
As a first step I created https://github.com/WolfgangFahl/gremlin-python-tutorial/blob/master/tutorial/remote.py which is intended to allow configuring the server to be used via parameters or by reading the settings from a YAML file.
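The idea behind remote.py can be sketched as follows; the exact interface of remote.py may differ, and the configuration keys used here are assumptions chosen for illustration:

```python
def server_url(config):
    """Build the websocket URL a GLV driver needs from a config dict,
    e.g. one parsed from a server.yaml file (key names are illustrative)."""
    protocol = config.get("protocol", "ws")
    host = config.get("host", "localhost")
    port = config.get("port", 8182)
    path = config.get("path", "/gremlin")
    return "%s://%s:%s%s" % (protocol, host, port, path)


# example: an OrientDB-style configuration as it might appear in server.yaml
orientdb_config = {"host": "localhost", "port": 8182, "path": "/gremlin"}
print(server_url(orientdb_config))  # ws://localhost:8182/gremlin
```

Switching between OrientDB.yaml, Neo4j.yaml and TinkerGraph.yaml then only changes the dictionary, not the code.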

As an example you can start an OrientDB in docker using
scripts/runOrientDB
3.0.23-tp3: Pulling from library/orientdb
Digest: sha256:97770fb0d21f83f68e1613f5b8e05669a373f9db6cc947c2bb73dee2e0a49312
Status: Image is up to date for orientdb:3.0.23-tp3
docker.io/library/orientdb:3.0.23-tp3
ca2ed42e690725b6595b4ea86702235c2b2b2185a1bf2d1e0dc6da4642623529

And then do a
   ln -f OrientDB.yaml server.yaml

Now all tests will be running against OrientDB. Here is an example of a failing test:
python3 test_004_io.py
Traceback (most recent call last):
  File "test_004_io.py", line 20, in <module>
    test_loadGraph()
  File "test_004_io.py", line 13, in test_loadGraph
    g.V().drop().iterate()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/gremlin_python/process/traversal.py", line 65, in iterate
    try: self.nextTraverser()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/gremlin_python/process/traversal.py", line 70, in nextTraverser
    self.traversal_strategies.apply_strategies(self)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/gremlin_python/process/traversal.py", line 512, in apply_strategies
    traversal_strategy.apply(traversal)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/gremlin_python/driver/remote_connection.py", line 148, in apply
    remote_traversal = self.remote_connection.submit(traversal.bytecode)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/gremlin_python/driver/driver_remote_connection.py", line 54, in submit
    results = result_set.all().result()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/gremlin_python/driver/resultset.py", line 90, in cb
    f.result()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 428, in result
    return self.__get_result()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/gremlin_python/driver/connection.py", line 80, in _receive
    status_code = self._protocol.data_received(data, self._results)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/gremlin_python/driver/protocol.py", line 97, in data_received
    return self.data_received(data, results_dict)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/gremlin_python/driver/protocol.py", line 110, in data_received
    raise GremlinServerError(message["status"])
gremlin_python.driver.protocol.GremlinServerError: 599: null:none([])

Running the tests against the default Gremlin Server works better:

docker stop ca2ed42e690725b6595b4ea86702235c2b2b2185a1bf2d1e0dc6da4642623529
# start the default server
./run -s
# set the configuration to the default server
ln -f TinkerGraph.yaml server.yaml
# run all pytests of the tutorial
./run -t
============================= test session starts ==============================
platform darwin -- Python 3.7.4, pytest-5.1.2, py-1.8.0, pluggy-0.12.0
rootdir: /Users/wf/source/python/gremlin-python-tutorial
collecting 0 items                                                             g.V().count=6
g.E().count=6
[v[1], v[2], v[3], v[4], v[5], v[6]]
[v[1]]
['marko']
[e[7][1-knows->2], e[8][1-knows->4]]
['vadas', 'josh']
['vadas', 'josh']
6 results
{'host': '/127.0.0.1:51992'}
air-routes-small.xml has 47 vertices
collected 11 items

test_000.py .
test_001.py g.V().count=6
.g.E().count=6
.
test_002_tutorial.py [v[1], v[2], v[3], v[4], v[5], v[6]]
.[v[1]]
.['marko']
.[e[7][1-knows->2], e[8][1-knows->4]]
.['vadas', 'josh']
.['vadas', 'josh']
.
test_003_connection.py 6 results
{'host': '/127.0.0.1:52018'}
.
test_004_io.py air-routes-small.xml has 47 vertices
.

============================== 11 passed in 5.46s ==============================


This is also what the Travis configuration of the tutorial project checks.

Josh Perryman

Sep 21, 2019, 4:44:31 PM9/21/19
to Gremlin-users
Thank you so much for posting these resources.  These are going to be a great and valuable contribution to the TinkerPop ecosystem.  I've been monitoring the discussions in the TinkerPop Jira, on StackOverflow and here.  I appreciate that you've been able to get to a place where you have collected all of your learnings and are willing to share them back with the community.  

The TinkerPop ecosystem is a bit different than many other OSS solutions out there. While DataStax employs several of the committers, and they have a DSE Graph product which uses TinkerPop, the community is much more federated than what you find with other OSS solutions. There's a very diverse set of implementations, some of which are pure graph databases (dare I say "natural" ones?), others of which are technically other types of data engines but with a TinkerPop-compatible interface added.

Given the looseness of that confederation, the variety of implementations, and the fact of day jobs, there's a lot of work that has yet to be done.  In the present time, we find this most acutely in the GLVs especially C#, JavaScript and Python; and also in the general areas of documentation & tutorials.  I'm sorry that you've had to wrestle through this deficit, but some of us are (slowly) also trying to fill in the gaps.  

Dave Bechberger and I are at work on a book "Graph Databases in Action" and there's an MEAP available, but all of the code examples are in Java.  However, we focus on methodology even more than we do on code so it should be useful to all who are working with graph databases in general, and Apache TinkerPop in particular.  

Also, I am scheduled to offer a 3-hour workshop "Build a graph database application in the language of your choice" in January at the Global Graph Summit in Austin, TX. That workshop will be designed for attendees to use any of the following languages: Java, C#, JavaScript or Python. We will start with an empty code repository and a locally running TinkerPop Gremlin Server, and proceed to build a simple command-line application. In 3 hours we'll cover the basics of setting up an environment, connecting to the server, running operations against the server, and then handling the results. It won't be very deep since we have a lot of material to cover and a short amount of time, but it will be a nice way to quickly orient a developer new to this ecosystem, and they will leave with working code as a reference point for their future projects.

I know that's about 5 months later than you need it, but for our future peers who are facing the same challenges in 2020 or later, know that there are resources in development, or even more readily available.

Finally, you bring up an interesting set of feature requests on which I'd like to opine, namely:
  • selecting a provider
  • selecting a graph e.g. by name/alias
  • listing available graphs
  • authenticating
  • getting meta-information
  • checking the availability of a server
First, there's a reasonable attitude within the realm of highly connected data that the value of the data is in being able to reason over those connections.  As such, separating data into distinct data stores can add a lot in the way of costs & overhead.  For this line of thinking, it is far preferable to have all of the data in a single data store with all of the relevant connections materialized as edges.  I have found that when building graphs we should be strongly biased toward having all of the data in a single data store. 

That being said, there are certainly valid use cases for storing graph data in separate data stores, and being able to switch between those. (Or even, dare I say, "join" them together in some fashion.)  But given the previously mentioned federated aspect of the implementations, the details and operation around that should be left to the different providers.  I for one think that they have done that quite well.  I wouldn't be able to automate my integration tests at my day job if not for the ability to define a separate graph data store, define schema, add data, and then clean it all up. 

Given that, I doubt that we will ever see most of this functionality defined in TinkerPop, or if it is, it will only be suitable for the most broad and generic of use cases. Building these types of capabilities into Gremlin Server and/or Gremlin Console, which are really just very simple reference implementations, isn't likely in the interest of the volunteer developers who maintain TinkerPop. (Note, I believe that there is already some simple sort of authentication functionality in place, but it is rarely used with Gremlin Server as most providers will have their own implementation-specific approaches.)

I am personally interested in some use cases around working with multiple graph data stores, and plan to start working on those use cases in the coming months. To that end, I'm looking forward to the forthcoming tooling from Shinavier & Wisnesky related to their recently published paper on Algebraic Property Graphs.

We can't forget that this is still a pretty cutting-edge area of technology (pun not intended) and there is a lot of work to be done to get the ecosystem updated. Most of us have been spoiled for years (really decades) with the vast and deep ecosystems around RDBMSs. That technology has matured to the point where many applications just use a modern ORM with a code-first attitude and don't even give thought to the particulars of their persistence layer. For any solution which can take that approach, that is absolutely the right call.

But with TinkerPop in particular, and graph databases in general, we still have a ways to go to catch up. I hope that with tools such as GLVs, and the benefit of our decades of experience with traditional data solutions, we will surpass those legacy technologies. In a decade or so we may recall these days of limited functionality, few tutorials, and little tooling to speak of, and marvel at how far we have come.

Very glad to have you joining us for this journey,

Josh

Ryan Wisnesky

Sep 21, 2019, 11:46:01 PM9/21/19
to gremli...@googlegroups.com
Along those lines, if anyone is interested in collaborating on a pilot project to apply (formal) APG in practice, do please let me know. The APG paper contains algorithms for such things as joining and merging and migrating APGs from schema to schema and it would be great to have more and more realistic examples of (formal) APGs and operations thereon, even independently of Tinkerpop tooling efforts, simply to better understand the theory of APGs.

JB Data31

Sep 23, 2019, 1:47:54 AM9/23/19
to gremli...@googlegroups.com
Reading the topic of this post, I can share a post from last year called How to pythonize TinkerGraph and the Gremlin language.
It's a kind of dissident way to use Gremlin, but relevant to this post.

@JBΔ



Wolfgang Fahl

Sep 23, 2019, 3:35:46 AM9/23/19
to Gremlin-users



For Neo4j I had some success:

git clone https://github.com/WolfgangFahl/gremlin-python-tutorial
cd gremlin-python-tutorial
./run -i
scripts/runNeo4j -rc
./run -n
ln -f Neo4j.yaml server.yaml
./run -t


But I can't see the modifications via http://localhost:7474/

MATCH (n) RETURN n

doesn't show anything.


Stephen Mallette

Sep 23, 2019, 8:38:51 AM9/23/19
to gremli...@googlegroups.com
Thanks for the reply Josh - I'd agree with your points here, especially those about the suggested features. I will add to your point about "authentication": we do have basic authentication in place for all languages. Java is the only one that has full Kerberos support at this time. It would be nice if Kerberos were implemented across the board for all language variants. That's definitely a feature I'd like to see.
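For reference, basic authentication from gremlin-python looks roughly like the sketch below. The URL and credentials are placeholders, and the imports are deferred into the function so the sketch stands alone without a running server:

```python
def authenticated_traversal(url, username, password):
    """Return a traversal source 'g' bound to an authenticated remote
    connection; DriverRemoteConnection accepts username/password kwargs."""
    # deferred imports so this sketch can be read without a server running
    from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
    from gremlin_python.process.anonymous_traversal import traversal

    conn = DriverRemoteConnection(url, "g", username=username, password=password)
    return traversal().withRemote(conn)


# usage (assumes a Gremlin Server with authentication enabled):
# g = authenticated_traversal("wss://localhost:8182/gremlin", "user", "secret")
# print(g.V().count().next())
```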


Stephen Mallette

Sep 23, 2019, 9:32:16 AM9/23/19
to gremli...@googlegroups.com
But I can't see the modifications via the http://localhost:7474/

If you want Gremlin Server and Neo4j Server both operating on the same graph you can't configure Gremlin Server to use Neo4j embedded. You have to either:

1. Configure Gremlin Server's Neo4j configuration as HA - http://tinkerpop.apache.org/docs/current/reference/#_high_availability_configuration
2. Use neo4j-gremlin-bolt in conjunction with Neo4j Server -  https://github.com/SteelBridgeLabs/neo4j-gremlin-bolt
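For option 1, the Gremlin Server Neo4j properties would look along these lines; this is a sketch after the linked HA documentation, so please check the reference docs for the exact keys of your TinkerPop and Neo4j versions:

```properties
# conf/neo4j-ha.properties - illustrative sketch, keys may vary by version
gremlin.neo4j.directory=/tmp/neo4j
gremlin.neo4j.conf.dbms.mode=HA
gremlin.neo4j.conf.ha.server_id=1
gremlin.neo4j.conf.ha.initial_hosts=localhost:5001
```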

As another gotcha you might not expect, please pay attention to the Neo4j versions you're using. Some will work fine together while others may not. Based on TinkerPop's most recent official release at 3.4.3:


you would want to use: 


We won't see an official upgrade of that version until TinkerPop 3.5.0, currently at (0.9-3.4.0) and thus:


Hope that helps.


Wolfgang Fahl

Sep 23, 2019, 11:51:07 AM9/23/19
to Gremlin-users


On Saturday, September 21, 2019 at 10:44:31 PM UTC+2, Josh Perryman wrote:

First, there's a reasonable attitude within the realm of highly connected data that the value of the data is in being able to reason over those connections.  As such, separating data into distinct data stores can add a lot in the way of costs & overhead.  For this line of thinking, it is far preferable to have all of the data in a single data store with all of the relevant connections materialized as edges.  I have found that when building graphs we should be strongly biased toward having all of the data in a single data store. 

Thank you for your detailed comments which I appreciate.

The single data store ideal is evil from my point of view. A project where you can see what happens if you go that route is Wikidata.
Wikidata's infrastructure is based on Blazegraph, and it was one of my first motivations to come up with the https://github.com/BITPlan/com.bitplan.simplegraph project.

The use case was that I wanted to check the content of Wikidata regarding the royal family tree against the GEDCOM dataset.
See http://royal-family.bitplan.com/index.php/GEDCOM_import for the import of GEDCOM data into SiDIF, which is supported by http://wiki.bitplan.com/index.php/SiDIF and https://github.com/BITPlan/org.sidif.triplestore. SiDIF files can easily be converted to a graph structure and the simplegraph project supports this. So getting the GEDCOM data into a graph via the SiDIF-supporting triplestore module was easy.

The Wikidata side of things is much harder. I have been trying to use Gremlin/TinkerPop together with Blazegraph, and the performance is abysmal and the compatibility a nightmare. The people working on Blazegraph have moved on to Amazon Neptune. Simplegraph therefore uses the simplegraph-wikidata module as a workaround, which is not the kind of integration I originally had in mind.

I had hoped I could get a copy of Wikidata, access it via SPARQL or TinkerPop/Gremlin, and connect excerpts of the data like the royal family tree with other data. The Semantic MediaWiki side of things can be seen at http://royal-family.bitplan.com/index.php/Main_Page. So a simple three-part use case of integrating Semantic MediaWiki, GEDCOM data and Wikidata is already very troublesome under the single data store ideal.

At BITPlan we found that we are using 20+ different APIs for our daily work, including CRM systems, mail, web, office tools and the like. When the simplegraph project was started it was pretty clear to me that it would never be feasible to move all the data from all these systems into a single data store. What would be feasible is to extract the relevant parts and have the IDs and links ready to be used and at some point "clicked by an end user". A good example are our RESTful systems, which we could happily integrate this way. A system like Microsoft Word is much harder to handle because it's uni-directional: you can link from a Word document to other data items, but it is hard to link to a paragraph in a Word document and expect that you can show the results easily in any application. Simplegraph eases that pain only a bit.

I still feel that the potential of "ad-hoc" usage of graph technology is underestimated. I think graph technology is great for solving everyday problems where you need to integrate data from many different sources. Any use case with 3+ sources makes it interesting from my point of view. The setup times to get the sources integrated need to come down to make such an ad-hoc approach feasible. This is a lot about usability and explanation by example. That is why I am so keen to get the tutorials up and running to show off how easy things are. At this point, the reality I encountered during the last few months is not delivering on this motivation.



Josh Perryman

Sep 23, 2019, 11:28:03 PM9/23/19
to Gremlin-users
I think you misunderstand the single-graph preference. It isn't a recommendation for a single data store. It is a preference that all "graphy" access patterns, that is, questions that require reasoning over the connections, be in one graph. If the primary value in the graph is reasoning over connections, then it is simplest and cheapest to put all of the connections in one graph.

Note that I'm emphasizing connections, not data. There's a lot of data involved in applications which has nothing to do with the connections between things. 

From a general data architecture point of view, I'm strongly biased toward relational databases as a starting point, largely for reasons of staffing and tooling. But if a problem area is interesting enough to look at other engines, then I will usually switch to multi-model.  In these cases, if we already have a relational database or a document store, they are excellent engines for entity persistence, but generally poor at complex connection analysis. 

Your point about links in Word documents is emblematic of the challenges most engines have with reasoning over connections. In general, they can only see the connections going in one direction (links in Wikipedia, foreign keys in RDBMSs), and it becomes prohibitively expensive to look at the connections in reverse. Graphs, however, by materializing connections in both directions, allow much greater flexibility and are more useful for those connection-focused questions.
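A toy illustration of that point, in plain Python with no graph engine involved: with only forward links, the reverse question requires scanning every source, whereas materializing both directions makes it a direct lookup.

```python
# forward-only links, as with wiki links or foreign keys
links = {"marko": ["vadas", "josh"], "josh": ["ripple"]}

# "what does marko link to?" is a direct lookup
print(links["marko"])  # ['vadas', 'josh']

# "who links to josh?" requires scanning every source - O(edges)
who_links_to_josh = [src for src, targets in links.items() if "josh" in targets]
print(who_links_to_josh)  # ['marko']

# a graph engine also materializes the reverse direction,
# so both questions become direct lookups
reverse = {}
for src, targets in links.items():
    for target in targets:
        reverse.setdefault(target, []).append(src)
print(reverse["josh"])  # ['marko']
```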

I hope that clears up that my preference for "all connections in one graph" is not the same as "all data in one engine".   

Cheers,

-Josh