Hi all,
My name is Georg Summer and I am a PhD student at Maastricht University and TNO (tno.nl). My research focuses on applying network biology concepts and methods in cardiology and nutritional health. Additionally, I want to further these applications of network biology with solid technologies like Cytoscape and Neo4j.
Without further ado, please have a look at my proposal:
Project Idea Introduction:
Neo4j provides a potential computational backend for the more performance-intensive tasks required of Cytoscape. The idea is to allow the user to connect and upload a local Cytoscape graph to a Neo4j instance (preferably on a high-performance machine) and then invoke operations on this instance via Cytoscape. The necessary calculations are executed on the Neo4j instance and the results are then streamed back to the client. This approach turns Cytoscape into a thin client and treats a Neo4j server as a mainframe (Figure 1).
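The upload / compute / stream-back round trip could look roughly like the sketch below. It is a minimal in-process illustration in Python, not the prototype's code: the function names (`upload_batches`, `ingest`, `stream_degrees`), the JSON batch format and the degree calculation are all assumptions made for illustration, and a real setup would send the batches to a Neo4j instance over the network.

```python
import json
from typing import Dict, Iterable, Iterator, List, Tuple

Edge = Tuple[str, str]


def upload_batches(edges: List[Edge], batch_size: int) -> Iterator[str]:
    """Client side (hypothetical): encode the local graph as JSON batches."""
    for start in range(0, len(edges), batch_size):
        chunk = edges[start:start + batch_size]
        yield json.dumps({"edges": chunk})


def ingest(batches: Iterable[str]) -> List[Edge]:
    """Server side (hypothetical): decode the batches into the server's copy."""
    stored: List[Edge] = []
    for batch in batches:
        stored.extend((source, target) for source, target in json.loads(batch)["edges"])
    return stored


def stream_degrees(edges: List[Edge]) -> Iterator[str]:
    """Server side (hypothetical): compute node degrees, stream one JSON line per node."""
    counts: Dict[str, int] = {}
    for source, target in edges:
        counts[source] = counts.get(source, 0) + 1
        counts[target] = counts.get(target, 0) + 1
    for node in sorted(counts):
        yield json.dumps({"node": node, "degree": counts[node]})


# Thin-client round trip: upload, compute remotely, read the streamed results.
local_edges = [("a", "b"), ("b", "c"), ("c", "d")]
server_copy = ingest(upload_batches(local_edges, batch_size=2))
results = [json.loads(line) for line in stream_degrees(server_copy)]
```

Using generators on both sides means neither the client nor the server has to hold the complete encoded payload in memory, which matters mostly for large networks.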
Figure 1:
I developed a prototype providing the up- and download functionality as well as the invocation of Neo4j extensions. The source code and a brief demo are available at:
Out of the box, Neo4j servers do not provide layout algorithms or graph statistics for large graphs, but Neo4j has an extension facility that allows for the implementation of such algorithms.
Alternatively algorithms could also be submitted directly to Neo4j as it is an open source project.
This proposal aims to recreate the Network Analysis Toolbox of Cytoscape (NetworkAnalyzer App) as a Neo4j extension as well as 1-2 layout algorithms. This small collection of extensions will serve as a showcase of how computationally expensive algorithms (especially on large networks) can be deployed and used.
Additionally the implementation also serves as a study to investigate how feasible it is to port existing algorithms bound to Cytoscape to Neo4j extensions.
As a real world showcase a network integrating the following data sources is planned:
STRING
WikiPathways
miRNA - gene (importing CyTargetLinker RegINS)
drug - gene (importing CyTargetLinker RegINS)
These data sources can either be imported into Neo4j directly via various scripts (Network Builder https://github.com/thomaskelder/network-builder) or first be combined in Cytoscape and then synchronized to a Neo4j instance (already supported in the prototype).
Project Deliverables:
1. Neo4j extension implementation of the Network Analysis Toolbox
The Neo4j version of the toolbox will do the same calculations as the Cytoscape app and return to the client the results in the same way as the Cytoscape app stores the network statistics.
If possible this will then be automatically fed into the app and displayed as normal, otherwise user interaction or an adaptation of the existing app will be necessary.
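As an illustration of the kind of statistics involved, here is a minimal Python sketch that computes a few simple measures (node and edge counts, density, average number of neighbors, clustering coefficient) on an undirected graph. The function names and the adjacency-dict representation are assumptions for this example; the actual NetworkAnalyzer app reports many more parameters, and a Neo4j extension would compute them server-side.

```python
from itertools import combinations
from typing import Dict, Set

Adjacency = Dict[str, Set[str]]


def network_statistics(adjacency: Adjacency) -> Dict[str, float]:
    """Whole-network measures for an undirected graph without self-loops."""
    nodes = len(adjacency)
    # Every undirected edge appears in two neighbor sets.
    edges = sum(len(neighbors) for neighbors in adjacency.values()) // 2
    density = 2 * edges / (nodes * (nodes - 1)) if nodes > 1 else 0.0
    avg_neighbors = 2 * edges / nodes if nodes else 0.0
    return {
        "nodes": nodes,
        "edges": edges,
        "density": density,
        "avg_neighbors": avg_neighbors,
    }


def clustering_coefficient(adjacency: Adjacency, node: str) -> float:
    """Fraction of a node's neighbor pairs that are themselves connected."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
    return 2 * links / (k * (k - 1))


# A triangle a-b-c with one extra node d attached to c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
stats = network_statistics(graph)
per_node = {node: clustering_coefficient(graph, node) for node in graph}
```

Returning the per-node values as a plain mapping keeps them close to how node attributes are stored in a table, which is one possible way to hand them back to the client.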
2. Layout Algorithms as Neo4j extensions
Implement at least 1 layout algorithm as a Neo4j extension. The choice of algorithm is to be decided. After the extension has been executed, it streams back the updated x/y coordinates for the nodes, which are then applied to the current network view.
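To make the "stream back x/y coordinates, then apply them to the view" step concrete, here is a small Python sketch. A circular layout is used purely because it is short and deterministic; the proposal leaves the choice of algorithm open. `circular_layout`, `apply_positions` and the dict-based view are hypothetical names for this illustration and not part of the Cytoscape or Neo4j APIs.

```python
import math
from typing import Dict, Iterable, Iterator, Tuple

Position = Tuple[float, float]


def circular_layout(nodes: Iterable[str], radius: float = 100.0) -> Iterator[Tuple[str, float, float]]:
    """Server side (hypothetical): yield (node, x, y) with nodes evenly spaced on a circle."""
    ordered = sorted(nodes)
    for index, node in enumerate(ordered):
        angle = 2 * math.pi * index / len(ordered)
        yield node, radius * math.cos(angle), radius * math.sin(angle)


def apply_positions(view: Dict[str, Position], updates: Iterable[Tuple[str, float, float]]) -> Dict[str, Position]:
    """Client side (hypothetical): write streamed coordinates into the network view."""
    for node, x, y in updates:
        view[node] = (x, y)
    return view


view = apply_positions({}, circular_layout(["a", "b", "c", "d"]))
```

Because the layout is a generator, the client can start updating node positions before the server has finished computing all of them.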
3. Updated CyNetLibSync to support and integrate the new extensions
The existing prototype provides a facility to invoke Neo4j extensions. To properly support the planned ones and integrate them into Cytoscape, the app has to be updated with the necessary event handling.
Project Plan:
Date | Task |
2014-04-21 | Project Start |
 | Check if Cytoscape App to Neo4j Extension wrapper is feasible |
 | Identify NetworkAnalyzer algorithms needed |
 | Identify layout algorithm(s) to be implemented |
 | Final decision which datasets to use in the showcase network |
2014-05-19 | Start Coding |
 | Implementation of NetworkAnalyzer algorithms as Neo4j extensions |
 | Implementation of layout algorithms as Neo4j extensions |
2014-06-27 | Mid-term evaluation deadline |
 | Improvement of data transfer from Cytoscape to Neo4j |
 | Showcase network assembly |
 | UI improvement for Neo4j extension invocation |
 | Source documentation and tutorial write up |
 | Draft of application note for the App and Extensions |
2014-08-22 | Firm ‘pencils down’ |
Hurdles / Complications:
Algorithm Performance
Neo4j has a set of algorithms implemented but most of them are not considered production quality (http://components.neo4j.org/neo4j-graph-algo/snapshot/).
While these could be used for the implementation of the Neo4j extensions, the algorithms might not be efficient enough to serve the goal of analyzing large networks. Implementing and testing other algorithms might be required, which is more time-consuming than the straightforward use of predefined ones. This could increase the time required for the algorithm implementation significantly.
A bit about my experiences in Bioinformatics
I did my BSc and MSc at the Upper Austrian University of Applied Sciences in Bioinformatics. I did research internships at the Visual Genomics Centre of the University of Calgary and at emergentec biodevelopment in Vienna.
My Master's thesis was about network inference from time-series data, written at the Perkins Lab of the Ottawa Health Research Institute [accompanying publication: PMID:21143801].
After that I worked as a bioinformatician for the Heymans group at the Maastricht University which eventually transitioned into my PhD. During that time (and still) I did transcriptomics analysis and visualization, data integration tasks and clinical data collection and curation. Graphs, networks and network biology are topics that followed me throughout my studies and work experience.
My daily work requires mostly R as a programming and analysis language. C++ and Java are the two languages I use for my software development needs. I am experienced in Matlab and PHP / HTML / JavaScript but do not use these regularly.
I mostly use open source software and, except for some bug reports, have not contributed to any larger open source development. Changing that is also a major reason why I wanted to get on board with the GSoC this year.
[Alex … I recommend we somehow aggressively support this if possible.]
Hi, Georg –
This is a very welcome proposal, and I hope serious consideration is given to it.
A few comments …
Your proposal focuses on network computational models, as it properly should. In terms of the relationship of your server (… does it have a name? …) to Cytoscape, we are thinking of Cytoscape as a client (and possibly a service) within a larger infrastructure called the Cytoscape Cyberinfrastructure (Cytoscape-CI). The CI is a service-oriented architecture (SOA, http://en.wikipedia.org/wiki/Service-oriented_architecture) that welcomes loosely bound, interface-oriented services … and your server very definitely qualifies. (Other servers are in process now, too, such as NDEx -- http://www.ndexbio.org/).
In this vein, you would be creating a service with a particular interface, capable of particular transformations (e.g., transforming a network into a layout). So, the input could be a Cytoscape network and parameter set, and the output would be selectable according to the many functions your service would be capable of.
If your server wants to be a container for computation algorithms (… also services), all the better, so long as the container interface is well defined and the algorithm service interfaces are well defined, too. Given the interfaces, the actual mechanics (including Neo4j) can scale however they want and execute however they want. (In architecture terms, the interface would stay the same, and the Service Level Agreement – SLA, http://en.wikipedia.org/wiki/Service-level_agreement – would change.)
From a biological perspective, your idea is solid gold, and it plays into an initiative we’re calling “discoverability”. From a biologist’s perspective, anything that slows the cognitive/creative process reduces “discoverability” … so “discoverability” amounts to how little a system gets in the way of a biologist’s path to great discoveries. Getting calculation times down by 100x is a good goal, especially since our desktop processors aren’t improving very quickly.
A major issue in the CI is moving data quickly between endpoints, including both network transfer time and transcoding – transcoding is the larger penalty. Your deliverable #3 focuses on a particular implementation within CyNetLibSync … I suspect this is mainly because Cytoscape hasn’t fully developed the architectural mechanisms to efficiently tie its data model to a replica maintained within external servers such as yours. This will be a strategic development for the Cytoscape-CI in the next few months, and so being so specific in your deliverable #3 could be counterproductive.
In terms of your project plan, Project Start needs to include the specific service interface presented to Cytoscape. There could be a couple of layers of interface … one that’s a Cytoscape App (which would include a user interface). A deeper layer would be the server interface (including how networks and results are exchanged, and how algorithms are selected), usable by Cytoscape and clients other than Cytoscape. Finally, a deeper layer would be more internal to your server, which would be a calculation-level (extension-level?) interface that also specifies inputs and outputs. This really should be specified very early.
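One way to picture these three layers is the Python sketch below. It is only an illustration of the layering: every class and method name (`Algorithm`, `ComputeService`, `CytoscapeAppFacade`, `execute`, and so on) is hypothetical, and nothing here is an existing Cytoscape, Cytoscape-CI or Neo4j interface.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional, Tuple

Edge = Tuple[str, str]


class Algorithm(ABC):
    """Innermost layer: a single calculation with declared inputs and outputs."""

    name: str

    @abstractmethod
    def run(self, edges: List[Edge], parameters: Dict[str, Any]) -> Dict[str, Any]:
        ...


class EdgeCount(Algorithm):
    """Trivial example algorithm, used only to exercise the interfaces."""

    name = "edge_count"

    def run(self, edges: List[Edge], parameters: Dict[str, Any]) -> Dict[str, Any]:
        return {"edges": len(edges)}


class ComputeService:
    """Middle layer: server interface usable by Cytoscape and by other clients."""

    def __init__(self, algorithms: List[Algorithm]) -> None:
        self._algorithms = {algorithm.name: algorithm for algorithm in algorithms}
        self._networks: Dict[str, List[Edge]] = {}

    def list_algorithms(self) -> List[str]:
        return sorted(self._algorithms)

    def put_network(self, network_id: str, edges: List[Edge]) -> None:
        self._networks[network_id] = list(edges)

    def execute(self, network_id: str, algorithm: str,
                parameters: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
        return self._algorithms[algorithm].run(self._networks[network_id], parameters or {})


class CytoscapeAppFacade:
    """Outermost layer: what a Cytoscape app (with its user interface) would call."""

    def __init__(self, service: ComputeService) -> None:
        self._service = service

    def analyze_current_network(self, edges: List[Edge], algorithm: str) -> Dict[str, Any]:
        self._service.put_network("current", edges)
        return self._service.execute("current", algorithm)


service = ComputeService([EdgeCount()])
app = CytoscapeAppFacade(service)
result = app.analyze_current_network([("a", "b"), ("b", "c")], "edge_count")
```

With the layers separated like this, the machinery behind `ComputeService` could change without touching the app layer, as long as the method signatures stay the same.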
For this, I recommend face-to-face time, and it could be a primary justification for a trip to see us.
Does this move the ball forward for you??
As you can tell, I’d love to support this … let’s make it happen.
1. Mostly because I already had code snippets and experience with the Neo4j API.
2. I wanted to play around with Cypher (the Neo4j SQL-like query language).
3. I might be wrong, but I do not think execution of extensions is supported via Blueprints.
4. Checking up on it now, I think that to access a Neo4j instance remotely, that instance would have to be wrapped in Rexster, or the Gremlin extension would have to be used.
That said, the up/download component could very well be replaced with a Blueprints stack (if remote is not an issue). The extension invocation, which is what I am mostly interested in, I fear could not.
From their user group it seems they (the Neo4j devs) offloaded all Blueprints-related work to the community and no longer maintain it themselves. They are heavily pushing Cypher in favor of any other graph exploration method. I have not compared performance, so I cannot say anything about this.
...