GSoC 2014 - Idea proposal - Connecting Cytoscape and Neo4j


Georg Summer

Mar 12, 2014, 11:43:38 AM
to cytoscap...@googlegroups.com

Hi all,


My name is Georg Summer and I am a PhD student at Maastricht University and TNO (tno.nl). My research focus is the application of network biology concepts and methods in cardiology and nutritional health. Additionally, I want to further these applications of network biology with solid technologies like Cytoscape and Neo4j.


Without further ado, please have a look at my proposal:


Project Idea Introduction:


Neo4j provides a potential computational backend for the more performance-intensive tasks required of Cytoscape. The idea is to allow the user to connect and upload a local Cytoscape graph to a Neo4j instance (preferably on a high-performance machine) and then invoke operations on this instance via Cytoscape. The necessary calculations are executed on the Neo4j instance and the results are then streamed back to the client. This approach turns Cytoscape into a thin client and treats a Neo4j server as a mainframe (Figure 1).


Figure 1:

schematic.png


I developed a prototype providing the up- and download functionality as well as the invocation of Neo4j extensions. The source code and a brief demo are available at:


Out of the box, Neo4j servers do not provide layout algorithms or graph statistics for large graphs, but Neo4j has an extension facility that allows for the implementation of such algorithms.

Alternatively algorithms could also be submitted directly to Neo4j as it is an open source project.

This proposal aims to recreate the Network Analysis Toolbox of Cytoscape (NetworkAnalyzer App) as a Neo4j extension as well as 1-2 layout algorithms. This small collection of extensions will serve as a showcase of how computationally expensive algorithms (especially on large networks) can be deployed and used.

Additionally the implementation also serves as a study to investigate how feasible it is to port existing algorithms bound to Cytoscape to Neo4j extensions.


As a real world showcase a network integrating the following data sources is planned:

STRING

WikiPathways

miRNA - gene (importing CyTargetLinker RegINS)

drug - gene (importing CyTargetLinker RegINS)


These data sources can either be imported in Neo4j directly via various scripts (Network Builder  https://github.com/thomaskelder/network-builder) or can first be combined in Cytoscape and then synchronized to a Neo4j instance (already supported in the prototype)



Project Deliverables:


1. Neo4j extension implementation Network Analysis Toolbox


The Neo4j version of the toolbox will do the same calculations as the Cytoscape app and return to the client the results in the same way as the Cytoscape app stores the network statistics.

If possible this will then be automatically fed into the app and displayed as normal, otherwise user interaction or an adaptation of the existing app will be necessary.


2. Layout Algorithms as Neo4j extensions


Implement at least 1 layout algorithm as a Neo4j extension. The choice of algorithm is to be decided. After execution, the extension streams back the updated x/y coordinates for the nodes, which are then applied to the current network view.
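
To make the last step of deliverable 2 concrete, here is a minimal sketch of what "applying the streamed coordinates" could look like on the client. It is not taken from the prototype: NodePosition and LayoutApplier are hypothetical names, and a plain Map stands in for the Cytoscape network view.

```java
import java.util.List;
import java.util.Map;

// Hypothetical value type: one node position as streamed back by a layout extension.
final class NodePosition {
    final long nodeId;
    final double x;
    final double y;

    NodePosition(long nodeId, double x, double y) {
        this.nodeId = nodeId;
        this.x = x;
        this.y = y;
    }
}

// Hypothetical helper; a plain Map stands in for the Cytoscape network view.
final class LayoutApplier {
    // Overwrites the stored position of every node already present in the view.
    // Positions for unknown node ids are skipped. Returns the number of updates.
    static int apply(Map<Long, double[]> view, List<NodePosition> positions) {
        int updated = 0;
        for (NodePosition p : positions) {
            if (view.containsKey(p.nodeId)) {
                view.put(p.nodeId, new double[] {p.x, p.y});
                updated++;
            }
        }
        return updated;
    }
}
```

Skipping unknown ids is a design choice for the sketch: the server-side graph and the local view may have drifted apart between upload and layout.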


3. Updated CyNetLibSync to support and integrate the new extensions

The existing prototype provides a facility to invoke Neo4j extensions. To properly support the planned ones and integrate them into Cytoscape, the app has to be updated with the necessary event handling.

Project Plan:


Date        Task

2014-04-21  Project Start
            - Check if Cytoscape App to Neo4j Extension wrapper is feasible
            - Identify NetworkAnalyzer algorithms needed
            - Identify layout algorithm(s) to be implemented
            - Final decision which datasets to use in the showcase network

2014-05-19  Start Coding
            - Implementation of NetworkAnalyzer algorithms as Neo4j extensions
            - Implementation of layout algorithms as Neo4j extensions

2014-06-27  Mid-term evaluation deadline
            - Improvement of data transfer from Cytoscape to Neo4j
            - Showcase network assembly
            - UI improvement for Neo4j extension invocation
            - Source documentation and tutorial write up
            - Draft of application note for the App and Extensions

2014-08-22  Firm ‘pencils down’

Hurdles / Complications:


Algorithm Performance

Neo4j has a set of algorithms implemented but most of them are not considered production quality (http://components.neo4j.org/neo4j-graph-algo/snapshot/).

While these could be used for the implementation of the Neo4j extensions, the algorithms might not be efficient enough to serve the goal of analyzing large networks. Implementing and testing other algorithms might be required, and would be more time-consuming than the straightforward use of predefined ones. This could significantly increase the time required for the algorithm implementation.



A bit about my experiences in Bioinformatics


I did my BSc and MSc at the Upper Austrian University of Applied Sciences in Bioinformatics. I did research internships at the Visual Genomics Centre of the University of Calgary and at emergentec biodevelopment in Vienna.

My Master's thesis was on network inference from time-series data, written at the Perkins Lab of the Ottawa Health Research Institute [accompanying publication: PMID:21143801].

After that I worked as a bioinformatician for the Heymans group at Maastricht University, which eventually transitioned into my PhD. During that time (and still) I have done transcriptomics analysis and visualization, data integration tasks, and clinical data collection and curation. Graphs, networks and network biology are topics that have followed me throughout my studies and work experience.


My daily work mostly requires R as a programming and analysis language. C++ and Java are the two languages I use for my software development needs. I am experienced in Matlab and PHP / HTML / JavaScript but do not use these regularly.

I mostly use open source software and, except for some bug reports, have not contributed to any larger open source development. Changing that is also a major reason why I wanted to get on board with the GSoC this year.




Georg
Neo4j_NetworkAnalyzer_GSoC2014.pdf

Barry Demchak

Mar 12, 2014, 3:00:44 PM
to cytoscap...@googlegroups.com, Alexander Pico

[Alex … I recommend we somehow aggressively support this if possible.]

 

Hi, Georg –

 

This is a very welcome proposal, and I hope serious consideration is given it.

 

A few comments …

 

Your proposal focuses on network computational models, as it properly should. In terms of the relationship of your server (… does it have a name? …) to Cytoscape, we are thinking of Cytoscape as a client (and possibly a service) within a larger infrastructure called the Cytoscape Cyberinfrastructure (Cytoscape-CI). The CI is a service-oriented architecture (SOA, http://en.wikipedia.org/wiki/Service-oriented_architecture) that welcomes loosely bound, interface-oriented services … and your server very definitely qualifies. (Other servers are in process now, too, such as NDEx -- http://www.ndexbio.org/).

 

In this vein, you would be creating a service with a particular interface, capable of particular transformations (e.g., transforming a network into a layout). So, the input could be a Cytoscape network and parameter set, and the output would be selectable according to the many functions your service would be capable of.

 

If your server wants to be a container for computation algorithms (… also services), all the better, so long as the container interface is well defined and the algorithm service interfaces are well defined, too. Given the interfaces, the actual mechanics (including Neo4J) can scale however they want and execute however they want. (In architecture terms, the interface would stay the same, and the Service Level Agreement – SLA, http://en.wikipedia.org/wiki/Service-level_agreement – would change.)

 

From a biological perspective, your idea is solid gold, and it plays into an initiative we’re calling “discoverability”. From a biologist’s perspective, anything that slows the cognitive/creative process reduces “discoverability” … so “discoverability” amounts to how little a system gets in the way of a biologist’s path to great discoveries. Getting calculation times down by 100x is a good goal, especially since our desktop processors aren’t improving very quickly.

 

A major issue in the CI is moving data quickly between endpoints, including both network transfer time and transcoding – transcoding is the larger penalty. Your deliverable #3 focuses on a particular implementation within CyNetLibSync … I suspect this is mainly because Cytoscape hasn’t fully developed the architectural mechanisms to efficiently tie its data model to a replica maintained within external servers such as yours. This will be a strategic development for the Cytoscape-CI in the next few months, and so being so specific in your deliverable #3 could be counterproductive.

 

In terms of your project plan, Project Start needs to include the specific service interface presented to Cytoscape. There could be a couple of layers of interface … one that’s a Cytoscape App (which would include a user interface). A deeper layer would be the server interface (including how networks and results are exchanged, and how algorithms are selected), usable by Cytoscape and clients other than Cytoscape. Finally, a deeper layer would be more internal to your server, which would be a calculation-level (extension-level?) interface that also specifies inputs and outputs. This really should be specified very early.

 

For this, I recommend face-to-face time, and could be a primary justification for a trip to see us.

 

Does this move the ball forward for you??

 

As you can tell, I’d love to support this … let’s make it happen.


Georg Summer

Mar 13, 2014, 10:23:22 AM
to cytoscap...@googlegroups.com, Alexander Pico
Thanks for the encouragement.

I just want to be careful not to increase the scope of the proposal too much :)

For me the goal of this proposal is also to create an actual prototype for this client-server communication / interaction.
I have to say that during the initial development of the CyNetLibSync app I ran into quite a few complications when dealing with Neo4j, and found room for improvement.

More responses inline.

On Wednesday, 12 March 2014 20:00:44 UTC+1, Barry Demchak wrote:

[Alex … I recommend we somehow aggressively support this if possible.]

 

Hi, Georg –

 

This is a very welcome proposal, and I hope serious consideration is given it.

 

A few comments …

 

Your proposal focuses on network computational models, as it properly should. In terms of the relationship of your server (… does it have a name? …) to Cytoscape, we are thinking of Cytoscape as a client (and possibly a service) within a larger infrastructure called the Cytoscape Cyberinfrastructure (Cytoscape-CI). The CI is a service-oriented architecture (SOA, http://en.wikipedia.org/wiki/Service-oriented_architecture) that welcomes loosely bound, interface-oriented services … and your server very definitely qualifies. (Other servers are in process now, too, such as NDEx -- http://www.ndexbio.org/).


The server is a bare-bones version of the Neo4j Community Edition, with no
adaptation at all to the server. The extensions are the same as apps/plugins and are
just loaded with the Neo4j server (adding or removing an extension requires a restart).
An extension is a Java class that extends Neo4j's ServerPlugin class. The services this
extension offers are then defined as method calls.

@Description( "Find the shortest path between two nodes." )
@PluginTarget( Node.class )
public Iterable<Path> shortestPath(
        @Source Node source,
        @Description( "The node to find the shortest path to." )
            @Parameter( name = "target" ) Node target,
        @Description( "The relationship types to follow when searching for the shortest path(s). " +
                "Order is insignificant, if omitted all types are followed." )
            @Parameter( name = "types", optional = true ) String[] types,
        @Description( "The maximum path length to search for, default value (if omitted) is 4." )
            @Parameter( name = "depth", optional = true ) Integer depth )
{... implementation ... }
These extensions are then exposed from the server via REST endpoints.
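
As an illustration of how such a method maps onto a REST endpoint, the sketch below assembles the URL for a node-targeted plugin method. The path layout /db/data/ext/{plugin}/node/{id}/{method} is my assumption about how the server names these endpoints; the authoritative URLs are the ones the server itself advertises. ExtensionUrl is a hypothetical helper, not part of Neo4j or the app.

```java
// Hypothetical helper that assembles the REST URL of a node-targeted plugin method.
// ASSUMPTION: the server exposes such methods under
//   /db/data/ext/{plugin}/node/{nodeId}/{method}
// In practice the URL advertised by the server should be used instead.
final class ExtensionUrl {
    static String forNode(String serverUrl, String plugin, long nodeId, String method) {
        // Drop one trailing slash so the result never contains "//" after the host.
        String base = serverUrl.endsWith("/")
                ? serverUrl.substring(0, serverUrl.length() - 1)
                : serverUrl;
        return base + "/db/data/ext/" + plugin + "/node/" + nodeId + "/" + method;
    }
}
```

A call would then be a POST to that URL with the method's parameters (here target, types and depth) in the request body.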
 

 

In this vein, you would be creating a service with a particular interface, capable of particular transformations (e.g., transforming a network into a layout). So, the input could be a Cytoscape network and parameter set, and the output would be selectable according to the many functions your service would be capable of. 


Conceptually yes, but each extension would represent its own service. Each extension
would then contain one or more functions that this service offers. In turn this
allows (given proper tools) individual services to be linked to individual
instances/setups of Neo4j servers. This allows the users/admins to recombine
services, data and infrastructure freely.

The CyNetLibSync app currently works as follows:
1. The app up/downloads a given Cytoscape network to/from a Neo4j server, independent
of the services provided. (For example, a service could be made that uses a more
efficient transport format than JSON-wrapped Cypher [= Neo4j's SQL] queries.) This
functionality will eventually become a service that the app exposes to other Cytoscape apps.
2. Upon connecting to a server, the app queries the server for its exposed extensions
via the Neo4j REST interface (url:7474/db/data/ext/). It then checks the available
extensions against a list of supported ones. Those are then exposed to the user.
3. If the user executes a supported extension, the implementation of this extension
takes charge and collects all necessary parameters (via the UI or directly from
Cytoscape), creates the call (URL + payload) to the server and then hands this
call off to the CyNetLibSync app again for execution. The app passes the response
back to the extension implementation, which does what it wants with it.
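
Step 2 above boils down to intersecting what the server advertises with what the app knows how to drive. A minimal sketch, assuming the extension names have already been parsed out of the server's response; ExtensionFilter is a hypothetical name and not the class used in the prototype.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper mirroring step 2: keep only those server extensions
// for which the app has a client-side implementation.
final class ExtensionFilter {
    // Returns the advertised names that are also known to the app,
    // preserving the order in which the server listed them.
    static List<String> supported(List<String> advertisedByServer, Set<String> knownToApp) {
        List<String> result = new ArrayList<>();
        for (String name : advertisedByServer) {
            if (knownToApp.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }
}
```

Only the names in the returned list would be offered to the user; anything else the server exposes is silently ignored.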

If your server wants to be a container for computation algorithms (… also services), all the better, so long as the container interface is well defined and the algorithm service interfaces are well defined, too. Given the interfaces, the actual mechanics (including Neo4J) can scale however they want and execute however they want. (In architecture terms, the interface would stay the same, and the Service Level Agreement – SLA, http://en.wikipedia.org/wiki/Service-level_agreement – would change.)


As of now there is no generalization in place for such service calls within the
prototype app. In practice, though, there is the informal one that my app currently
uses.
Here is the rundown:
Neo4jExtension:
  Wraps the information the Neo4j server provides about the extension. All the necessary
  execution information is captured here. The class implements an Extension (which is
  currently just a placeholder that should be defined more clearly).
 
  As described above in 2., such extensions are filtered for being supported.
  Supported in this case means that the name is linked to a class implementing the
  ExtensionExecutor.
ExtensionExecutor:
  When an extension is invoked, an object of the linked type is instantiated and then
  executed. (Why is there no execute() method wrapping all the steps? As of now the
  responsibility for execution is still with the CyNetLibSync app.) The ExtensionExecutor
  gathers the necessary parameters (e.g. pops up a GUI to fill in some numbers). After
  that it prepares the calls (URL + payload), which are still executed through the
  CyNetLibSync app. The results are then processed as well.
 
  To reach the flexibility and independence you described, the following steps would
  have to be taken:
  a. move more responsibility into the Extension interface and use it as a
  ServiceDescription
  b. create more types of XYZExtensions supporting different service endpoints
  and/or
  c. put a discoverability layer adhering to a certain set of operations in front
  of all services. For instance:
  A server has to provide information about its services at:
  serverurl.org:port/providedservices
  This endpoint would return a list of (in my case Neo4jExtensions)
  ServiceDescriptions.
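
One possible shape for step a. is sketched below: the three steps the executor already performs become hooks, and an execute() method chains them so the app only supplies the transport. None of this is the prototype's actual code; the fields of ServiceDescription, the Transport interface and the method names are all hypothetical.

```java
import java.util.Map;

// Hypothetical description of one remote service, e.g. one Neo4j extension method.
final class ServiceDescription {
    final String name;
    final String endpoint;

    ServiceDescription(String name, String endpoint) {
        this.name = name;
        this.endpoint = endpoint;
    }
}

// Hypothetical transport: sends a payload to a URL and returns the response body.
interface Transport {
    String post(String url, String payload);
}

// Hypothetical executor contract. Subclasses supply the three steps;
// execute() chains them so callers no longer orchestrate the call themselves.
abstract class ExtensionExecutor {
    protected abstract Map<String, Object> collectParameters();

    protected abstract String buildPayload(Map<String, Object> parameters);

    protected abstract String processResponse(String responseBody);

    final String execute(ServiceDescription service, Transport transport) {
        Map<String, Object> parameters = collectParameters();
        String payload = buildPayload(parameters);
        String response = transport.post(service.endpoint, payload);
        return processResponse(response);
    }
}
```

Keeping the transport as a parameter means the same executor could be driven by the CyNetLibSync app today and by some other client later.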
 
 
 

From a biological perspective, your idea is solid gold, and it plays into an initiative we’re calling “discoverability”. From a biologist’s perspective, anything that slows the cognitive/creative process reduces “discoverability” … so “discoverability” amounts to how little a system gets in the way of a biologist’s path to great discoveries. Getting calculation times down by 100x is a good goal, especially since our desktop processors aren’t improving very quickly.

 

A major issue in the CI is moving data quickly between endpoints, including both network transfer time and transcoding – transcoding is the larger penalty. Your deliverable #3 focuses on a particular implementation within CyNetLibSync … I suspect this is mainly because Cytoscape hasn’t fully developed the architectural mechanisms to efficiently tie its data model to a replica maintained within external servers such as yours. This will be a strategic development for the Cytoscape-CI in the next few months, and so being so specific in your deliverable #3 could be counterproductive.

Yes, the upload and download of a network are a bottleneck. Deliverable #3
was actually aimed at the above-mentioned ExtensionExecutors. You are right, though,
that Cytoscape does not currently provide this functionality, which is also why I
created the prototype in the first place: to have such functionality.
I am more interested in the execution of extensions on the server, but waiting times
annoy me. Neo4j itself provides multiple ways to load data, yet all those that use
the JSON-based REST calls over the network are not very efficient.
A workaround would be to implement your own (binary?) protocol via unmanaged
extensions (something the Neo4j team suggests themselves).
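
To give an idea of what a binary transport could look like, here is a small sketch of a fixed-width edge-list encoding. The format (an int count followed by source/target pairs as 8-byte longs) is made up for illustration and EdgeListCodec is a hypothetical name. Whether it actually beats JSON would have to be measured; the expected gain is mostly in avoiding text parsing rather than in raw size.

```java
import java.nio.ByteBuffer;

// Hypothetical fixed-width encoding of an edge list:
// a 4-byte edge count, then source and target id of each edge as 8-byte longs.
final class EdgeListCodec {
    static byte[] encode(long[][] edges) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 16 * edges.length);
        buf.putInt(edges.length);
        for (long[] edge : edges) {
            buf.putLong(edge[0]).putLong(edge[1]);
        }
        return buf.array();
    }

    static long[][] decode(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        int count = buf.getInt();
        long[][] edges = new long[count][2];
        for (int i = 0; i < count; i++) {
            edges[i][0] = buf.getLong();
            edges[i][1] = buf.getLong();
        }
        return edges;
    }
}
```

A real protocol would also need node and edge attributes, which is where most of the design effort would go.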
 

 

In terms of your project plan, Project Start needs to include the specific service interface presented to Cytoscape. There could be a couple of layers of interface … one that’s a Cytoscape App (which would include a user interface). A deeper layer would be the server interface (including how networks and results are exchanged, and how algorithms are selected), usable by Cytoscape and clients other than Cytoscape. Finally, a deeper layer would be more internal to your server, which would be a calculation-level (extension-level?) interface that also specifies inputs and outputs. This really should be specified very early.

Some of those interfaces are defined as described above, but not generalized well.
As for exposing the app's functionality as services in Cytoscape: I will add it
to the final proposal.
 
 

For this, I recommend face-to-face time, and could be a primary justification for a trip to see us.

  Discussing and designing grand master plans for world dom... software projects?
Count me in.

Keiichiro Ono

Mar 13, 2014, 3:34:48 PM
to cytoscap...@googlegroups.com, Alexander Pico
Hi Georg.

Thanks for the great proposal.  One quick question:

Are there any specific reasons to use Neo4j dependent API?  Have you thought about using Tinkerpop Blueprints?


Thanks,
Kei


--
Keiichiro Ono    http://keiono.github.io/

Georg Summer

Mar 13, 2014, 4:26:55 PM
to cytoscap...@googlegroups.com, Alexander Pico

1. Mostly because I already had code snippets and experience with the Neo4j API.
2. I wanted to play around with Cypher (the Neo4j SQL-like query language).
3. I might be wrong but I do not think execution of extensions is supported via Blueprints.
4. Having checked up on it now, I think that to access a Neo4j instance remotely, that instance would have to be wrapped in Rexster, or the Gremlin extension would have to be used.

That said the up/download component could very well be replaced with a Blueprints stack (if remote is not an issue).
The extension invocation - what I am mostly interested in - I fear not.

From their user group it seems they (Neo4j devs) offloaded all Blueprints related work to the community and do not maintain it anymore themselves. They are heavily pushing their Cypher in favor of any other graph exploration methods.

I have not compared performance so I can not say anything about this.


Georg

Keiichiro Ono

Mar 13, 2014, 4:43:01 PM
to cytoscap...@googlegroups.com, Alexander Pico
Hi Georg.

I have one more question.
What is the purpose of implementing the analysis module yourself?

The built-in module is good, but there are lots of advanced network analysis tools:


and have you already thought about deploying these as external services using R+Shiny or Python+NumPy/SciPy+Pandas+Flask, and then accessing those from Neo4j?
If your application can access those as services, users can access more advanced analysis functions (of course, there is always network overhead...)

What do you think?

Thanks,
Kei

Keiichiro Ono

Mar 13, 2014, 5:02:38 PM
to cytoscap...@googlegroups.com
Thanks for your quick reply!

 
1. Mostly because I already had code snippets and experience with the Neo4j API.
2. I wanted to play around with Cypher (the Neo4j SQL-like query language).
3. I might be wrong but I do not think execution of extensions is supported via Blueprints.

Yes, inserting an extra layer between Cytoscape and Neo4j is an addition to, not a replacement of, the Neo4j API.  If you want to use Cypher, yes, I think using their native API is the only option.

 
4. Checking up on it now I think to access a Neo4j instance remotely that instance would have to be wrapped in Rexster or the Gremlin extension has to be used.

That said the up/download component could very well be replaced with a Blueprints stack (if remote is not an issue).
The extension invocation - what I am mostly interested in - I fear not.

I used Neo4j via Rexster when I implemented this application:


and it is relatively easy to replace the TinkerPop layer with the Neo4j native API (and vice versa).  So I'm not too worried about it.  I was just curious why you chose native instead of Blueprints.


From their user group it seems they (Neo4j devs) offloaded all Blueprints related work to the community and do not maintain it anymore themselves. They are heavily pushing their Cypher in favor of any other graph exploration methods.

I have not compared performance so I can not say anything about this.

I see.  I'm not sure about the performance difference, but I don't think it is a serious concern at this point.

Thanks,
Kei

Georg Summer

Mar 13, 2014, 6:26:05 PM
to cytoscap...@googlegroups.com, Alexander Pico
I would ultimately like to be able to have a large collection of different networks saved in different Neo4j instances. Cytoscape would then become a tool for me to connect to such an instance and then load (parts of) the graph (defined by e.g. a Cypher query) to work with it. In an optimal scenario the Neo4j server would reside on a significantly more powerful machine than my Cytoscape client and perform the calculations faster.
This use case would essentially transform the Neo4j server into the computational backend of Cytoscape.

The reason why I want to focus on the analysis module is to showcase this exact idea: that the Neo4j server could become a seamless drop-in replacement for the local computation.

Re-implementing the algorithms is the last resort for me in this case. What I will try to do in order:
1. Attempt to write a wrapper around the existing algorithm/apps, investigating how well existing apps could be transformed into Neo4j extensions.
2. Use the existing code and manually replace CyNode with org.neo4j.graphdb.Node. (also removing dependencies on Cytoscape Services if non-essential for the algorithm)
3. Try and find algorithms implemented for Neo4j
4. re-implement.

The same goes for the layout algorithm.
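
One way to keep options 1 and 2 cheap is to write each algorithm against a tiny graph interface instead of against CyNode or org.neo4j.graphdb.Node directly. The sketch below shows the idea with an average-degree statistic; GraphAccess, DegreeStats and MapGraph are hypothetical names, and the in-memory MapGraph only stands in for adapters that would wrap a Cytoscape network or a Neo4j database.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;

// Hypothetical minimal view of a graph. One adapter would wrap a Cytoscape
// network, another a Neo4j database; the algorithm code stays the same.
interface GraphAccess<N> {
    Collection<N> nodes();

    Collection<N> neighbors(N node);
}

// Example statistic written only against the interface.
final class DegreeStats {
    // Mean number of neighbors per node; 0.0 for an empty graph.
    static <N> double averageDegree(GraphAccess<N> graph) {
        Collection<N> nodes = graph.nodes();
        if (nodes.isEmpty()) {
            return 0.0;
        }
        long total = 0;
        for (N node : nodes) {
            total += graph.neighbors(node).size();
        }
        return (double) total / nodes.size();
    }
}

// In-memory stand-in backed by an adjacency map, used here instead of a real backend.
final class MapGraph implements GraphAccess<String> {
    private final Map<String, List<String>> adjacency;

    MapGraph(Map<String, List<String>> adjacency) {
        this.adjacency = adjacency;
    }

    public Collection<String> nodes() {
        return adjacency.keySet();
    }

    public Collection<String> neighbors(String node) {
        return adjacency.get(node);
    }
}
```

The extra indirection may cost some performance, which is exactly the case where the hand-tailored option 4 would come back into play.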

If and only if performance is an issue and a hand-tailored implementation close to the Neo4j metal would be the only way to go... well then it would be the only way to go.

Why not use existing tools like igraph? I actually might, but wrapped in a Neo4j extension. I want to avoid the transfer here. This is a bit of a throwback to the "old" SQL idea of stored procedures: heavy lifting close to the database, while the client deals only with the user input.
Instead of:

Cytoscape -> Neo4j -> igraph -> Neo4j -> Cytoscape

I want to end up with

Cytoscape -> Neo4j (ext:igraph:algorithmXY) -> Cytoscape

Hope this clarifies the vision a bit.

Georg

Georg Summer

Mar 13, 2014, 6:49:10 PM
to cytoscap...@googlegroups.com
Currently extension invocation and the up/download are within one app, but eventually they could be separated.

And dropping in one tech for the other is not that problematic. The transaction_api branch in the GitHub repository is actually an example of that. While it might not use a different technology, it uses the transaction endpoint of Neo4j instead of the Cypher endpoint.
The annoying part in both cases is putting the nodes and edges into the JSON that the REST API requires.
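
For readers who have not seen the two endpoints, the sketch below builds the request bodies by hand. It is not the app's code: CypherPayload is a hypothetical helper, the CREATE statement is just an example, and the body shapes are what I understand the Neo4j 2.0 REST API to expect (query/params for /db/data/cypher, a statements array for /db/data/transaction/commit).

```java
import java.util.List;

// Hypothetical helper that hand-builds JSON request bodies for node creation.
final class CypherPayload {
    // Example statement in Cypher 2.0 syntax; {name} is a query parameter placeholder.
    static final String CREATE_NODE = "CREATE (n {name: {name}}) RETURN id(n)";

    // Minimal JSON string quoting: escapes backslash and double quote only.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Body for the Cypher endpoint: one query per request.
    static String cypherBody(String name) {
        return "{\"query\":" + quote(CREATE_NODE)
                + ",\"params\":{\"name\":" + quote(name) + "}}";
    }

    // Body for the transactional endpoint: many statements in one request.
    static String transactionBody(List<String> names) {
        StringBuilder sb = new StringBuilder("{\"statements\":[");
        for (int i = 0; i < names.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append("{\"statement\":").append(quote(CREATE_NODE))
              .append(",\"parameters\":{\"name\":").append(quote(names.get(i))).append("}}");
        }
        return sb.append("]}").toString();
    }
}
```

Batching many statements into one transactional request is the main practical difference: it saves one HTTP round trip per node, although the JSON assembly itself stays just as tedious. A JSON library would normally replace the hand-written quoting.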

...