Can bio4j be used to aggregate journal articles from various sources on the web?

Mark Farrell

Jul 10, 2014, 5:43:37 AM

to bio4j...@googlegroups.com

Hi,

Ideally I would like to construct a software service that can aggregate biology journal articles from different publication sources around the web:

the service would stream in journal articles as they are published and store them in a ".txt" format. It's for use with a project I'm currently working on: https://github.com/markfarrell/text-network-compiler.

Have a look in https://github.com/bio4j/bio4j/tree/master/src/main/java/com/bio4j/model/util:

There seem to be "ArticleRetriever" and "BookRetriever" interfaces in Bio4j's abstract model; I just don't know offhand how I might use these features of Bio4j to build the software service, or whether they are even implemented in any of the backends.

Any advice?

Andrei Kucharavy

Jul 10, 2014, 4:55:40 PM

to bio4j...@googlegroups.com

Hello Mark,

For my master's thesis (Bourne lab, UCSD), this was one of the projects we considered implementing on top of a neo4j database adapted for biology (back then it was slightly different from bio4j).

Unfortunately we ran into several problems that led us to focus on different projects:

Andrei Kucharavy

Jul 10, 2014, 5:32:13 PM

to bio4j...@googlegroups.com

Sorry for the previous reply, which was sent prematurely.

The biggest one was the availability of parseable journal articles. The best you can do is parse the journals that are open and provide an HTML representation. HTML matters because text extracted from PDFs often has its words permuted, and there is no reliable way of re-ordering the text automatically. The second problem has more to do with copyright and the publishers' intellectual property. For us this meant that we couldn't index most of the journals from the Nature and Cell publishing groups, which was out of the question for us.
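A minimal sketch of the HTML route described above, using only Python's standard library: it collects the visible text of a page and skips script and style content, so the result could be written to a ".txt" file. The class name, the helper function and the sample page are all hypothetical, and real journal pages will need per-publisher handling (navigation menus, reference lists, and so on).

```python
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collect visible text from an HTML page, skipping script/style content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a <script> or <style> element
        self._chunks = []      # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if self._skip_depth == 0 and stripped:
            self._chunks.append(stripped)

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML document, one fragment per line."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical sample page, standing in for an open-access article.
SAMPLE_HTML = """
<html><head><title>Sample article</title><style>p { color: red; }</style></head>
<body><h1>Kinase signalling in yeast</h1>
<p>Protein A phosphorylates protein B.</p>
<script>var tracking = 1;</script>
</body></html>
"""

if __name__ == "__main__":
    print(html_to_text(SAMPLE_HTML))
```

Writing the returned string to disk with `open(path, "w", encoding="utf-8")` would give the ".txt" output asked about in the original question.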

I won't elaborate much on the other problems, for this is not the point of your question.

What we found useful before abandoning this project was projecting the terms encountered in the articles, and the relations between them, onto the GO term and BioPAX level 3 ontologies, in order to extract the meaningful elements of an article and to link them more easily to other database-like data.

A year ago when we were working on it, bio4j didn't yet support any of those (has it changed?). Thus we had to build our own neo4j-based biological database. You might want to implement those node types and associated functions in bio4j before you start using it as a backbone for your article aggregation service.

Another problem you might want to consider is that the underlying technology, neo4j, has limits on the number of nodes and relations, which might make it poorly suited to data-intensive applications. You can find more about it here: http://www.slideshare.net/andreikucharavy/graph-databases-in-biology-case-of . As such, I've chosen to use a language-specific Gremlin compiler in my projects (bulbs in the case of Python) to make the transition from neo4j to Titan or another graph database easier and less painful.

You might consider doing the same thing. One of the long-term advantages is that neo4j implements write locks, whereas in Titan the IO is fully parallel and can support thousands of simultaneous writes. With text processing, and even if you use only open-access articles, you will most likely need several parsing servers just to keep up with everything being published, so parallel IO might not be a luxury.
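The idea of keeping the backend swappable can be illustrated with a small sketch. This is not the bulbs or Gremlin API: `GraphStore`, `InMemoryGraphStore` and `store_article` are hypothetical names, and the in-memory class only stands in for a real neo4j or Titan adapter. The point is that the aggregation code talks to a narrow interface, so changing the database means writing one new adapter rather than touching the parsing code.

```python
from typing import Iterable, Protocol


class GraphStore(Protocol):
    """Hypothetical backend-neutral interface; each database gets its own adapter."""

    def add_node(self, label: str, **props) -> int: ...

    def add_edge(self, source: int, relation: str, target: int) -> None: ...


class InMemoryGraphStore:
    """Stand-in backend for illustration; a real adapter would call the database."""

    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, label, **props):
        node_id = len(self.nodes)
        self.nodes[node_id] = {"label": label, **props}
        return node_id

    def add_edge(self, source, relation, target):
        self.edges.append((source, relation, target))


def store_article(store: GraphStore, title: str, terms: Iterable[str]) -> int:
    """Store an article node and link it to one node per extracted term."""
    article = store.add_node("Article", title=title)
    for term in terms:
        term_node = store.add_node("Term", name=term)
        store.add_edge(article, "MENTIONS", term_node)
    return article


if __name__ == "__main__":
    store = InMemoryGraphStore()
    store_article(store, "Kinase signalling in yeast", ["kinase", "yeast"])
    print(store.nodes)
    print(store.edges)
```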

Good luck with your project!

Andrei

Pablo Pareja Tobes

Jul 15, 2014, 5:39:46 AM

to bio4j...@googlegroups.com

Hi Mark,

First of all, let me clarify that Bio4j only includes citations (journal articles, books and so on) that are included in UniProtKB.

All of this is already implemented for the Neo4j backend in version 0.9 of Bio4j, and it should be available for the Titan version pretty soon.

Also take into account that, again, the information we store for citations is exclusively what is included in the UniProt XML files. Have a look, for instance, at this entry file: http://www.uniprot.org/uniprot/P50566.xml (check the <citation/> tag to see the information provided). In any case, both DOI and PubMed IDs are stored whenever they're present for the citation.
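As a rough illustration of what the <citation/> elements carry, the sketch below reads them from a UniProt-style XML document with Python's standard library. The sample XML is made up and heavily trimmed; the namespace URI and the use of <dbReference type="PubMed"> and <dbReference type="DOI"> children are assumptions about the UniProt XML layout that should be checked against a real entry file such as the one linked above. `extract_citations` is a hypothetical helper, not part of Bio4j.

```python
import xml.etree.ElementTree as ET

# Assumed UniProt XML namespace; verify against a real entry file.
NS = {"up": "http://uniprot.org/uniprot"}

# Made-up, trimmed-down sample shaped like a UniProt entry.
SAMPLE_XML = """
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>X00000</accession>
    <reference key="1">
      <citation type="journal article" date="1996" name="Example J.">
        <title>A made-up title.</title>
        <dbReference type="PubMed" id="0000001"/>
        <dbReference type="DOI" id="10.0000/example.1"/>
      </citation>
    </reference>
    <reference key="2">
      <citation type="submission" date="2001">
        <title>Another made-up title.</title>
      </citation>
    </reference>
  </entry>
</uniprot>
"""


def extract_citations(xml_text):
    """Return one dict per <citation>, with PubMed/DOI ids when present."""
    root = ET.fromstring(xml_text.strip())
    citations = []
    for citation in root.iterfind(".//up:citation", NS):
        # Map cross-reference type (e.g. "PubMed", "DOI") to its id.
        ids = {
            ref.get("type"): ref.get("id")
            for ref in citation.findall("up:dbReference", NS)
        }
        citations.append({
            "type": citation.get("type"),
            "title": citation.findtext("up:title", default=None, namespaces=NS),
            "pubmed": ids.get("PubMed"),
            "doi": ids.get("DOI"),
        })
    return citations


if __name__ == "__main__":
    for entry in extract_citations(SAMPLE_XML):
        print(entry)
```

The DOI or PubMed ID recovered this way is only an identifier; fetching the article text itself is a separate step and is subject to the access restrictions discussed earlier in the thread.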

I'm not sure if this answers your question; if it doesn't, please don't hesitate to write back! ;)

Cheers,

Pablo

--
Pablo Pareja Tobes

LinkedIn http://www.linkedin.com/in/pabloparejatobes

Twitter http://www.twitter.com/pablopareja

http://about.me/pablopareja

http://www.ohnosequences.com
