Inquiring: Cytoscape app plugin for bio4j (GSoC 2014 project) From Yigang Zhou

Yigang Zhou

Feb 27, 2014, 9:32:00 AM

to bio4j...@googlegroups.com, eparej...@ohnosequences.com, ppa...@ohnosequences.com

Hi,

I'm Yigang Zhou, a Chinese PhD student who has successfully completed 3 GSoC projects [2] [3] [4], in 2009, 2010 and 2012. In GSoC 2012 in particular, I developed a Cytoscape plugin, "Semscape" [4], that visualizes bio-ontologies as graphs through SPARQL queries. The plugin works well with any biology DB that provides a SPARQL endpoint, such as:
- PC : Pathway Commons
- BioCyc : Collection of Pathway/Genome Databases
- MeSH : Medical Subject Headings
- Reactome : A knowledgebase of biological pathways and processes
- HGNC : Human Gene Nomenclature Database
- others like "cpath", "kegg", "chembl", etc
Gene Ontology was out of scope at that time, because its SPARQL endpoint was highly experimental and not fully supported in 2012 (and still is not).
The above plugin is a Cytoscape 3.0 bundle app, so I know OSGi and have hands-on experience developing bundles.

For GSoC 2014, I'd like to contribute to bio4j on the "Cytoscape app plugin for bio4j" project [1]. I think my background matches the project requirements very well, as described above. Could you please tell me the detailed scope of the project? Which features/functions should the plugin deliver in the end? Thanks a lot!

Best,
Yigang Zhou

[1] https://github.com/bio4j/gsoc14/wiki/Cytoscape-app-plugin-for-bio4j
[2] http://www.ncsa.illinois.edu/News/09/0827Studentspursue.html
[3] https://wiki.duraspace.org/display/GSOC/GSOC10+-+Storage+Service+Implementations+Based+on+Semantic+Content+Repository
[4] https://code.google.com/p/vsdlc3/wiki/UserGuide

Eduardo Pareja Tobes

Feb 27, 2014, 10:41:15 AM

to bio4j...@googlegroups.com, eparej...@ohnosequences.com, ppa...@ohnosequences.com

Hi!

About the idea, what we want is basically to make Bio4j usable as a data source in Cytoscape. This could include something like what you did but using Gremlin (or even Cypher) as the query language, letting you visualize the results of a query and work with it. What will certainly be different is the way of accessing a Bio4j endpoint: the idea is to create the necessary AWS resources for that (EC2 instances basically; deploying a Bio4j instance takes under 5min) and release them once the user has finished with it. This would require of course thinking about the best way of communicating this to the user, as the cost, type of resources etc will depend on the Bio4j module/s needed, desired performance, etc. I will update the idea page so that it reflects all this.
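
As a rough illustration of the "create the resources, then release them once the user has finished" lifecycle described above, here is a minimal Java sketch. Everything in it is hypothetical: `Bio4jEndpoint`, `RecordingProvisioner` and the module name are made up for the example, and the provisioner only records the steps instead of calling AWS. The point is only that try-with-resources gives the release step for free, even when a query fails.

```java
import java.util.ArrayList;
import java.util.List;

final class EndpointLifecycleDemo {
    /** Runs one user session and returns the steps that were performed. */
    static List<String> runSession(String module) {
        RecordingProvisioner provisioner = new RecordingProvisioner();
        // try-with-resources guarantees the release step, even if querying throws.
        try (Bio4jEndpoint endpoint = provisioner.start(module)) {
            provisioner.log.add("query " + endpoint.address());
        }
        return provisioner.log;
    }

    public static void main(String[] args) {
        System.out.println(runSession("some-module"));
    }
}

/** Hypothetical handle to a temporary Bio4j endpoint (assumption, not a Bio4j type). */
interface Bio4jEndpoint extends AutoCloseable {
    String address();

    @Override
    void close(); // releases the underlying resources; no checked exception
}

/** Stand-in provisioner: records what a real one would do instead of calling AWS. */
final class RecordingProvisioner {
    final List<String> log = new ArrayList<>();

    Bio4jEndpoint start(String module) {
        log.add("start " + module);
        return new Bio4jEndpoint() {
            @Override
            public String address() {
                return "endpoint-for-" + module;
            }

            @Override
            public void close() {
                log.add("release " + module);
            }
        };
    }
}
```

A real plugin would also have to tell the user about cost and instance type before the start step, which this sketch leaves out.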

Anyway, as you certainly have more knowledge about Cytoscape than any of us, feel free to offer any sort of feedback/ideas/whatever

HTH

--

best,

Eduardo Pareja-Tobes

Eduardo Pareja Tobes

Feb 27, 2014, 10:52:09 AM

to bio4j...@googlegroups.com, eparej...@ohnosequences.com, ppa...@ohnosequences.com

I just updated the idea -> https://github.com/bio4j/gsoc14/wiki/Cytoscape-app-plugin-for-bio4j

Yigang Zhou

Mar 1, 2014, 3:10:50 AM

to bio4j...@googlegroups.com, eparej...@ohnosequences.com, ppa...@ohnosequences.com

Hi,

Thanks for your explanations! I went through the bio4j docs [1] and got a general idea of how the bio4j DB is created and loaded. Here are some more questions:

1) I can't figure out how a Java client (i.e. the Cytoscape plugin) should interact with a bio4j DB. Are there any client-side code samples or unit tests, especially for the Gremlin/Blueprints layer [3]? The example source code in [2] is not available.

2) I don't have a local cluster or an AWS account right now. Is there an existing AWS deployment of a bio4j DB that I could test against? I'd like to test the code from 1).

3) Is the Cytoscape plugin responsible for "create the necessary AWS resources", loading the data into the bio4j DB and "release them once the user has finished with it"? Or does the plugin just interact with a ready-to-use bio4j DB, without being concerned with how the bio4j DB is created and what it contains?

4) Should/could the Cytoscape plugin be coded purely against the Blueprints layer [3], without worrying about the underlying implementations?

Best,

Yigang Zhou

[1] https://github.com/bio4j/bio4j/

[2] https://github.com/bio4j/bio4j/blob/master/docs/examples.md

[3] https://github.com/bio4j/blueprints


Eduardo Pareja-Tobes

Mar 1, 2014, 8:56:22 AM

to bio4j...@googlegroups.com, Pablo Pareja Tobes

Hi,

About the examples: we are in the middle of a pretty big refactoring, including how we make releases, code structure, repositories, etc. It’s been a bit inconvenient for new people (sorry about that), but we think it’s better to change things now, before any actual GSoC work takes place, than later. Anyway, you can keep track of how things are going here:

https://github.com/bio4j/bio4j/issues/15

We expect that all this will be fixed during next week. So I think it’d be better to wait until then if you want to test things etc. You can of course see what’s going on and help with tasks that don’t require a lot of knowledge (like refactorings and the like).

Re 3) data import only happens once, and client tools don’t need to care about that. Simplifying a bit, the plugin should just create an EC2 instance through Bio4j-specific (already developed) libs, which would do all the work of retrieving already imported data (DB binaries) from S3, configuring the DB etc.

About 4) it will depend on the scope and features; the current situation is that in the neo4j version we had the need to significantly alter the model due to serious performance problems with a more natural approach, something which we don’t have the need to do in the (still in-progress) Titan implementation. If the set of queries/data access patterns is going to be determined beforehand in some way, then I’d rather use the abstract model directly which would yield good performance in all cases; if not, the default Blueprints impl could be OK. Another option would be to code everything in Blueprints but for the client to access the raw data model of each particular implementation; this would have other downsides in terms of queries working for one technology but not for the other etc.
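
To make the "use the abstract model directly" option a bit more concrete, here is a minimal Java sketch under stated assumptions. `ProteinModel` and both backends are invented for illustration and are not the actual Bio4j model or its Neo4j/Titan implementations; the identifiers `org-a`, `P1` and `P2` are placeholders. It only shows the idea that client code written against one abstract interface keeps working when the underlying storage layout differs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class AbstractModelDemo {
    /** Client code: depends only on the abstract interface, never on a backend. */
    static String describe(ProteinModel model, String organism) {
        return organism + " -> " + model.proteinsOfOrganism(organism);
    }

    public static void main(String[] args) {
        ProteinModel a = new IndexedBackend().put("org-a", "P1", "P2");
        ProteinModel b = new PairBackend().put("org-a", "P1").put("org-a", "P2");
        System.out.println(describe(a, "org-a"));
        System.out.println(describe(b, "org-a"));
    }
}

/** Hypothetical abstract-model view (assumption; the real model defines its own types). */
interface ProteinModel {
    List<String> proteinsOfOrganism(String organism);
}

/** Backend A: keeps a direct organism -> proteins index. */
final class IndexedBackend implements ProteinModel {
    private final Map<String, List<String>> byOrganism = new HashMap<>();

    IndexedBackend put(String organism, String... proteins) {
        byOrganism.put(organism, Arrays.asList(proteins));
        return this;
    }

    @Override
    public List<String> proteinsOfOrganism(String organism) {
        return byOrganism.getOrDefault(organism, Collections.emptyList());
    }
}

/** Backend B: stores flat (organism, protein) pairs and scans them on demand. */
final class PairBackend implements ProteinModel {
    private final List<String[]> pairs = new ArrayList<>();

    PairBackend put(String organism, String protein) {
        pairs.add(new String[] {organism, protein});
        return this;
    }

    @Override
    public List<String> proteinsOfOrganism(String organism) {
        List<String> found = new ArrayList<>();
        for (String[] pair : pairs) {
            if (pair[0].equals(organism)) {
                found.add(pair[1]);
            }
        }
        return found;
    }
}
```

The trade-off mentioned above still applies: a fixed interface like this only covers access patterns that were decided beforehand.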

best

Eduardo Pareja-Tobes

Math & CS freak

oh no sequences!


Yigang Zhou

Mar 8, 2014, 4:32:44 AM

to bio4j...@googlegroups.com, Pablo Pareja Tobes

Hi,

I'd be grateful if you could help me with the following questions inline.
By the way, please keep me posted on your refactoring work. I still
can't find the new documentation and examples.

On Sat, Mar 1, 2014 at 9:56 PM, Eduardo Pareja-Tobes
<eparej...@ohnosequences.com> wrote:
>
> Hi,
>
> About the examples: we are in the middle of a pretty big refactoring including how we make releases, code structure, repositories, etc. It's been a bit inconvenient for new people (sorry about it) but we think that it's better to change things now before any actual GSoC work will take place than later. Anyway, you can keep track of how things are going here
>
> https://github.com/bio4j/bio4j/issues/15
>
> We expect that all this will be fixed during next week. So, I think it'd be better to wait before all that if you want to test things etc. You can of course see what's going on and help with tasks that don't require a lot of knowledge (like refactorings and the like)
>
> Re 3) data import only happens once, and client tools don't need to care about that. Simplifying a bit, the plugin should just create an EC2 instance through Bio4j-specific (already developed) libs who would do all the work of retrieving already imported data (DB binaries) from S3, configuring the DB etc.

Is the Bio4j-specific instance supposed to be created using the
template [1], following the instructions [2]? The instructions [2] show
how to do it in the AWS console, but how would that be done from Java
code in the Cytoscape plugin? Would creating the EC2 instance take
several minutes, so that the user of the plugin has to wait that
long before querying? And would the user be charged every time they use
a bio4j EC2 instance? I'm afraid that long waits and usage fees
would hamper wide adoption of the plugin.

[1] https://s3-eu-west-1.amazonaws.com/bio4j-public/Bio4jBasicInstanceTemplate.txt
[2] http://blog.bio4j.com/2011/12/bio4j-aws-cloudformation-your-own-fresh-baked-db-in-less-than-a-minute/

>
> About 4) it will depend on the scope and features; the current situation is that in the neo4j version we had the need to significantly alter the model due to serious performance problems with a more natural approach, something which we don't have the need to do in the (still in-progress) Titan implementation. If the set of queries/data access patterns is going to be determined beforehand in some way, then I'd rather use the abstract model directly which would yield good performance in all cases; if not, the default Blueprints impl could be OK. Another option would be to code everything in Blueprints but for the client to access the raw data model of each particular implementation; this would have other downsides in terms of queries working for one technology but not for the other etc.
>

I'd prefer to code against the abstract model of bio4j for this plugin,
because there may be other Cytoscape plugins for Blueprints and graph
databases (see Cytoscape GSoC 2014 idea 23: Graph Database Support for
Cytoscape 3 by TinkerPop Software Stack [3]). The abstract model is
specific to bio4j, which I think is what we should focus on in this bio4j
GSoC project. If users want to query and visualize bio4j data
through Blueprints/neo4j directly, they can choose other TinkerPop-based
Cytoscape plugins instead. Any comments?

[3] http://nrnb.org/gsoc/

Best,
Yigang Zhou

Eduardo Pareja Tobes

Mar 9, 2014, 8:10:06 AM

to bio4j...@googlegroups.com, Pablo Pareja Tobes

Hi, answering inline

On Saturday, March 8, 2014 10:32:44 AM UTC+1, Yigang Zhou wrote:

> Hi,
>
> I'd be grateful if you could help me with the following questions inline.
> By the way, please keep me posted on your refactoring work. I still
> can't find the new documentation and examples.

Still no docs, but https://github.com/bio4j/neo4jdb shouldn't change much from how it looks right now.

> On Sat, Mar 1, 2014 at 9:56 PM, Eduardo Pareja-Tobes
> <eparej...@ohnosequences.com> wrote:
> >
> > Hi,
> >
> > About the examples: we are in the middle of a pretty big refactoring including how we make releases, code structure, repositories, etc. It's been a bit inconvenient for new people (sorry about it) but we think that it's better to change things now before any actual GSoC work will take place than later. Anyway, you can keep track of how things are going here
> >
> > https://github.com/bio4j/bio4j/issues/15
> >
> > We expect that all this will be fixed during next week. So, I think it'd be better to wait before all that if you want to test things etc. You can of course see what's going on and help with tasks that don't require a lot of knowledge (like refactorings and the like)
> >
> > Re 3) data import only happens once, and client tools don't need to care about that. Simplifying a bit, the plugin should just create an EC2 instance through Bio4j-specific (already developed) libs who would do all the work of retrieving already imported data (DB binaries) from S3, configuring the DB etc.

> Is the Bio4j-specific instance supposed to be created using the
> template [1], following the instructions [2]? The instructions [2] show
> how to do it in the AWS console, but how would that be done from Java
> code in the Cytoscape plugin? Would creating the EC2 instance take
> several minutes, so that the user of the plugin has to wait that
> long before querying? And would the user be charged every time they use
> a bio4j EC2 instance? I'm afraid that long waits and usage fees
> would hamper wide adoption of the plugin.

The instance would be created using the same mechanism as in bio4j/modules. This is Scala-based; if we're going to do this in Java, we would need to wrap it for the specific data sources (modules) that we want to use.

Creating the instance and getting all data will take under 5min. And yes, every time the user needs to use the plugin, they will need to pay AWS (not us, of course), something in the cents range.
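
Since a wait of a few minutes came up as a concern, here is a minimal Java sketch of one way a client could keep its UI responsive while an endpoint is being prepared. It is an assumption about a possible plugin design, not something taken from the Bio4j libraries: `provisionAsync` is a made-up helper, and the 200 ms sleep merely stands in for the real launch time. It uses only the standard `CompletableFuture` API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class BackgroundProvisioning {
    /** Starts the slow provisioning step on a worker thread and returns immediately. */
    static CompletableFuture<String> provisionAsync(Supplier<String> slowStep, ExecutorService pool) {
        return CompletableFuture.supplyAsync(slowStep, pool);
    }

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<String> endpoint = provisionAsync(() -> {
                sleepQuietly(200); // placeholder for the minutes a real launch would take
                return "endpoint-ready";
            }, pool);
            System.out.println("The UI thread is free while the endpoint starts.");
            // Wait with a deadline instead of blocking forever.
            System.out.println(endpoint.get(5, TimeUnit.SECONDS));
        } finally {
            pool.shutdown();
        }
    }
}
```

This does not shorten the wait or remove the AWS charge; it only avoids freezing the client while the instance comes up.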

This is of course not ideal, and we could offer an alternative "you're on your own" version, where the user can specify a local folder where the database is, or something like that. But in the long run, this is not feasible or convenient for the user, since it involves:

- maintaining versions

- executing the imports of all data sources periodically

- etc etc

> [1] https://s3-eu-west-1.amazonaws.com/bio4j-public/Bio4jBasicInstanceTemplate.txt
> [2] http://blog.bio4j.com/2011/12/bio4j-aws-cloudformation-your-own-fresh-baked-db-in-less-than-a-minute/

> >
> > About 4) it will depend on the scope and features; the current situation is that in the neo4j version we had the need to significantly alter the model due to serious performance problems with a more natural approach, something which we don't have the need to do in the (still in-progress) Titan implementation. If the set of queries/data access patterns is going to be determined beforehand in some way, then I'd rather use the abstract model directly which would yield good performance in all cases; if not, the default Blueprints impl could be OK. Another option would be to code everything in Blueprints but for the client to access the raw data model of each particular implementation; this would have other downsides in terms of queries working for one technology but not for the other etc.
> >

> I'd prefer to code against the abstract model of bio4j for this plugin,
> because there may be other Cytoscape plugins for Blueprints and graph
> databases (see Cytoscape GSoC 2014 idea 23: Graph Database Support for
> Cytoscape 3 by TinkerPop Software Stack [3]). The abstract model is
> specific to bio4j, which I think is what we should focus on in this bio4j
> GSoC project. If users want to query and visualize bio4j data
> through Blueprints/neo4j directly, they can choose other TinkerPop-based
> Cytoscape plugins instead. Any comments?

Yes, in principle I prefer to base this on the abstract model. It will, however, make things a bit trickier in terms of how the user specifies queries.

best

Yigang Zhou

Mar 16, 2014, 11:43:03 PM

to bio4j...@googlegroups.com, Pablo Pareja Tobes

Dear Pablo,

I agree with you that "creating instance in AWS" is not ideal, but
it's something we can do in this project for the convenience of the
user. I'll take this approach in the project proposal.

We do have the problem of "how the user will specify queries" if the
plugin is based on the abstract model. I've studied the visualization
tools for Neo4j [1]. Basically, we have 2 options:

1) Querying through a query language
The user sends Cypher/Gremlin query strings to the Cytoscape
plugin, and the plugin renders the results as graphs. That's how the
Neo4j Server Web Interface [1] visualizes Neo4j data. I also
made a plugin [2] that uses SPARQL queries against biology DBs.
Do we have such a query language in the abstract model of bio4j? If
not, can we add an interface to the abstract model so that the plugin can
call a method that queries it with the low-level Cypher/Gremlin
language? Something like NodeRetriever.queryWithCypher/Gremlin(),
which returns 2 lists, of BasicNodes and BasicRelationships, as a graph?
Or can the query result consist of the concrete nodes (e.g. Protein) and
relationships of the abstract model? It would be up to the user to choose
which low-level query language to use, or the plugin could choose it for
them based on the configured backend implementation of
bio4j. For example, if the backend is Neo4j, the user can make Cypher
queries in the plugin.

2) Searching by keyword and then exploring
Another approach is similar to Linkurious [3], in which no graph query
language is required. First, the user types a keyword in the
search bar, which brings up all the related data in one step. Then the
plugin renders it and responds to user
interactions like clicking, touching and moving nodes; this is the
"exploring" part. I've done similar work before. In the
attachment, there are some screenshots of a knowledge browsing system I
developed, using TouchGraph [4] (spring layout) to visualize
RDF/OWL resources and their properties in the domain of Chinese Modern
History.
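
Option 1 could be sketched roughly as follows in Java. All names here (`GraphQueryService`, `GraphResult`, `QueryLanguage`, `CypherOnlyStub`) are hypothetical and do not exist in bio4j; plain strings stand in for node and relationship types, and the stub returns a canned answer instead of executing the query text. The sketch shows a result made of two lists, and a plugin picking the query language from what the backend reports as supported.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class QueryOptionDemo {
    /** Picks the first language the backend supports, so the user need not choose. */
    static QueryLanguage pickLanguage(GraphQueryService service) {
        for (QueryLanguage language : QueryLanguage.values()) {
            if (service.supports(language)) {
                return language;
            }
        }
        throw new IllegalStateException("backend supports no known query language");
    }

    public static void main(String[] args) {
        GraphQueryService service = new CypherOnlyStub();
        QueryLanguage language = pickLanguage(service);
        GraphResult result = service.query(language, "user-supplied query text");
        System.out.println(language + ": " + result.nodes + " " + result.relationships);
    }
}

/** Query languages a backend might accept (illustrative). */
enum QueryLanguage { CYPHER, GREMLIN }

/** A query result as a small graph: one list of nodes, one of relationships. */
final class GraphResult {
    final List<String> nodes;
    final List<String> relationships;

    GraphResult(List<String> nodes, List<String> relationships) {
        this.nodes = Collections.unmodifiableList(new ArrayList<>(nodes));
        this.relationships = Collections.unmodifiableList(new ArrayList<>(relationships));
    }
}

/** Hypothetical query entry point (assumption, not an existing bio4j interface). */
interface GraphQueryService {
    boolean supports(QueryLanguage language);

    GraphResult query(QueryLanguage language, String queryText);
}

/** Stub that accepts only Cypher, the way a Neo4j-backed endpoint might. */
final class CypherOnlyStub implements GraphQueryService {
    @Override
    public boolean supports(QueryLanguage language) {
        return language == QueryLanguage.CYPHER;
    }

    @Override
    public GraphResult query(QueryLanguage language, String queryText) {
        if (!supports(language)) {
            throw new UnsupportedOperationException(language + " is not supported by this backend");
        }
        // Canned answer; a real backend would execute queryText.
        return new GraphResult(
                Arrays.asList("node-1", "node-2"),
                Arrays.asList("node-1 -[REL]-> node-2"));
    }
}
```

Returning typed nodes such as Protein instead of plain strings would fit the same shape, with the two lists parameterized over the model's own types.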

Could you please tell me which one is more appropriate for bio4j? Are
there any visualization use cases for bio4j from the users' point of
view? Any comments?

Best,
Yigang Zhou

[1] http://www.neo4j.org/develop/visualize
[2] https://code.google.com/p/vsdlc3/wiki/UserGuide
[3] http://linkurio.us/
[4] http://www.touchgraph.com/navigator


Figure 5.jpg

Figure 7.jpg

Pablo Pareja Tobes

Mar 19, 2014, 1:17:12 PM

to Yigang Zhou, bio4j...@googlegroups.com

Hi Yigang,

(Now it's Pablo answering ;) )

Regarding both options, let me point out that the size of the query results is something that should be carefully tackled here. Queries performed by users could often end up returning hundreds of thousands of elements if they are not filtered further. Thus, the approach taken in Neo4j Server/data-browser won't be adequate; in our case, more processing should be done on the server side in order to provide users with results that are manageable.
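
One simple form of such server-side processing is capping the result and telling the client that more exists. The Java sketch below is only an illustration under assumptions: `ResultCapper` and `CappedResult` are invented names, and the limit of 1000 and the 250,000-element source are arbitrary. Because it reads from an iterator, it never materializes the full result.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.stream.IntStream;

final class ResultCapper {
    /** Reads at most `limit` items, then checks whether the source has more. */
    static <T> CappedResult<T> cap(Iterator<T> source, int limit) {
        if (limit < 0) {
            throw new IllegalArgumentException("limit must be >= 0");
        }
        List<T> items = new ArrayList<>();
        while (items.size() < limit && source.hasNext()) {
            items.add(source.next());
        }
        return new CappedResult<>(items, source.hasNext());
    }

    public static void main(String[] args) {
        // A lazily produced stand-in for a very large query result.
        Iterator<Integer> huge = IntStream.range(0, 250_000).iterator();
        CappedResult<Integer> page = cap(huge, 1000);
        System.out.println(page.items.size() + " items, truncated=" + page.truncated);
    }
}

/** One manageable slice of results plus a flag saying whether more were left out. */
final class CappedResult<T> {
    final List<T> items;
    final boolean truncated;

    CappedResult(List<T> items, boolean truncated) {
        this.items = Collections.unmodifiableList(items);
        this.truncated = truncated;
    }
}
```

With the `truncated` flag, a client can ask the user to refine the query instead of trying to render everything.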

Again, when thinking of using a specific query language at the abstract model layer, I would say that this would only be feasible in an ideal case. In reality, however, we would run into various problems, for example when trying to manage indices at the Blueprints level since, as far as I know, neither Neo4j nor Titan is 100% compatible in all cases, or they simply perform much worse when accessed that way.

In any case we look forward to your proposal to further discuss all these details ;)

Cheers,

Pablo

--
Pablo Pareja Tobes

LinkedIn http://www.linkedin.com/in/pabloparejatobes

Twitter http://www.twitter.com/pablopareja

http://about.me/pablopareja

http://www.ohnosequences.com

Yigang Zhou

Mar 21, 2014, 9:50:49 AM

to Pablo Pareja Tobes, eparej...@ohnosequences.com, bio4j...@googlegroups.com

Hi,

There's still much to be discussed in further detail. Since the deadline
is approaching, I just submitted the proposal here:
http://www.google-melange.com/gsoc/proposal/public/google/gsoc2014/egangzhou/5757334940811264
Any feedback is welcome, thanks!

cheers,
Yigang Zhou
