Non-AWS install v0.12


Stephen Coleman

Sep 28, 2015, 2:34:03 PM
to bio4j-user
Hi,
I'm at a university with underutilized compute resources, and so I'm reluctant to pay for AWS in order to try out bio4j.  I'm having trouble figuring out how to download the software.

From here (https://github.com/bio4j/bio4j/blob/master/docs/bio4j-on-your-own.md) I am sent to the bio4j-Titan repo, even though I am not interested in the Titan release. But looking for clues, I go here (https://github.com/bio4j/bio4j-titan/blob/master/docs/ImportingTitanBio4j.md) and am sent to step 6. I scroll down to find that the steps skip from 4 to 7.  Step 7 instructs me to download a bio4j-titan .jar file from S3 (again, not interested in titan).  

So if I follow these instructions from step 7 onwards on my cluster, should I be good to go, assuming I have Java 8? Does it matter that I am not, at least initially, interested in Titan?

Thanks,
-S

Pablo Pareja Tobes

Sep 29, 2015, 9:36:53 AM
to bio4j...@googlegroups.com
Hi Stephen,

Thanks for your interest in Bio4j ;)

Let me try to clarify things a bit.
Even though it's the easiest way to go, you're not required to use AWS for Bio4j.
The second link you pointed to is the page you should look at in order to import the database into your own cluster. (By the way, I just fixed the step numbers that were wrong...)

Regarding Titan, that's the database technology Bio4j currently uses, so (unless you are committed to implementing your own back-end connector) I'm afraid you will be using it if you import the Bio4j DB.

In any case, you don't have to learn anything about Titan or deal directly with its API, since we provide a Java layer that simplifies things a lot and lets you interact with the database in a much more user-friendly manner.

Please don't hesitate to ask me if there's anything you don't understand about what I said.

Cheers,

Pablo Pareja


--
You received this message because you are subscribed to the Google Groups "bio4j-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bio4j-user+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Stephen Coleman

Sep 29, 2015, 4:29:19 PM
to bio4j-user
Thanks for clearing up my confusion. Given the name, I thought that Neo4j was the default back-end and that Titan was optimized for AWS.  

A note for anyone else new who stumbles upon this: following the directions on the page you linked, it isn't initially clear that I need the bio4j-titan repo locally to link in all of the .properties files for executionsBio4jTitan.xml.

I am importing UniProt now; I'm sure I'll be back with questions once I've played with this a bit.

Thanks,
-Stephen

Pablo Pareja Tobes

Oct 1, 2015, 5:12:24 AM
to bio4j...@googlegroups.com
Hi Stephen,

The default back-end was indeed Neo4j at the beginning (that was the reason for having this name). 
However, after a while we had to move to Titan, since the data was getting too big for Neo4j to handle properly (dealing with supernodes was a real burden with their technology and, in general, they simply can't cope with Big Data the way Titan does...)

You're right about the properties files. I didn't realize that the documentation had not been updated after we made this change. Thanks for pointing it out; I'm going to change the docs right now.

Please let us know about your progress with the importing process.

Cheers,

Pablo

PS: In case you haven't found it and want to have a look, here is the link to the Bio4j paper on bioRxiv: http://biorxiv.org/content/early/2015/03/20/016758

Stephen Coleman

Oct 15, 2015, 1:56:01 PM
to bio4j-user
Hi,
I've built a couple of bio4j modules locally, and I'm trying to get an example (GetGOAnnotation.java) to run.  Is there a reason for Era7 library dependencies in all of the examples? I seem to be able to compile and execute code without it. I'm not yet sure that code is doing what I expect, but it runs.

-Stephen

Pablo Pareja Tobes

Oct 19, 2015, 5:50:50 AM
to bio4j...@googlegroups.com
Hi Stephen,

Just to confirm, you are talking about this program, right?
What exactly do you mean by Era7 library dependencies?
Are you referring to the following import statement?

import com.era7.bioinfo.bioinfoutil.Executable;

Could you please share your code so that I can further help you out with it?

Cheers,

Pablo


Stephen Coleman

Oct 19, 2015, 12:21:54 PM
to bio4j-user
Yes, I am talking about that line. I commented it out and removed the references to it, then compiled and ran the program, but my list of input UniProt accession numbers returns an empty set. I'm trying to figure out why.

I have left the rest of the code in the GetGOAnnotation.java alone, except for the following:

//import com.era7.bioinfo.bioinfoutil.Executable;
public class GetGOAnnotation {

    // @Override
    public void execute(ArrayList<String> array) {
        // Copy the argument list into an array and delegate to main()
        String[] args = new String[array.size()];
        for (int i = 0; i < array.size(); i++) {
            args[i] = array.get(i);
        }
        main(args);
    }

    // ... rest of GetGOAnnotation.java unchanged ...
}



Stephen Coleman

Oct 19, 2015, 4:40:17 PM
to bio4j-user
I have determined that I am likely getting an empty set because I have not yet imported some of the modules. I find the contents of executionsBio4jTitan.xml confusing: based on the naming scheme, I can't tell whether some of these entries are redundant.

How are these different?
ImportUniProtTitan (btw your example file calls this ImportUniprotTitan, which gives an error, and there also does not appear to be a .properties file for this)
ImportUniProtVerticesUsingFolderTitan
ImportUniProtEdgesUsingFolderTitan

Thank you,
-Stephen

Pablo Pareja Tobes

Oct 20, 2015, 7:37:44 AM
to bio4j...@googlegroups.com
Hi Stephen,

The line of code you're referring to imports the Executable class, which is simply used to execute several Java programs sequentially with the help of an executions.xml file. There's no requirement to use it at all.
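
As a rough illustration of that idea (not the actual Era7 code), a driver can run a list of programs one after another through a single execute(ArrayList<String>) method, which is the signature shown in the snippet earlier in this thread. Everything else here is an assumption made for the sketch: the class name SequentialRunnerSketch, the Step holder, the runAll method, and the nested Executable interface, which is only a stand-in for com.era7.bioinfo.bioinfoutil.Executable. The real class reads its program list from an executions.xml file; that parsing is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SequentialRunnerSketch {

    /** Hypothetical stand-in for com.era7.bioinfo.bioinfoutil.Executable. */
    public interface Executable {
        void execute(ArrayList<String> args);
    }

    /** One entry of the execution list: a program plus its arguments. */
    public static final class Step {
        final Executable program;
        final List<String> args;

        public Step(Executable program, String... args) {
            this.program = program;
            this.args = Arrays.asList(args);
        }
    }

    /** Runs every step in list order and returns how many steps ran. */
    public static int runAll(List<Step> steps) {
        int count = 0;
        for (Step step : steps) {
            // Each program gets its own mutable copy of its arguments.
            step.program.execute(new ArrayList<>(step.args));
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        // Two placeholder programs that only record that they were called.
        List<Step> steps = Arrays.asList(
            new Step(a -> log.add("first " + a), "in.xml"),
            new Step(a -> log.add("second " + a), "db-folder"));
        runAll(steps);
        System.out.println(log);
    }
}
```

A program that is only ever started through its own main method never goes through such a driver, which is consistent with the example compiling and running once the import is commented out.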

Yes, you're right, the current documentation for that is confusing. Let me tell you why (I will update it once I finish writing this message).
At the beginning only one program was needed to import the UniProt module: ImportUniProtTitan.
However, as TrEMBL kept growing exponentially, that import strategy started to throw exceptions because Titan could not cope with adding so much data at once while also making many index queries during the process. That's why, in order for it to work with the latest TrEMBL versions, we had to split the process in two:
  1. First, importing all the vertices that should be created as part of the UniProt module with ImportUniProtVerticesUsingFolderTitan
  2. Second, importing all the relationships among the nodes already stored in the database with ImportUniProtEdgesUsingFolderTitan
Even with this new solution we still had some issues with the latest version of TrEMBL, so we had to split the huge TrEMBL XML file into several smaller ones (I don't remember exactly how many, but somewhere in the hundreds or thousands). The program used for this is SplitUniProtXMLFile, which is why it is the first entry in the executionsBio4jTitan.xml file.
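
The ordering described above (split the XML file, then import vertices, then import edges) can be sketched as a small check over a list of program names, for instance names copied by hand from executionsBio4jTitan.xml. This is only a sketch under assumptions: UniProtImportOrderSketch and isValidOrder are hypothetical names and are not part of Bio4j, and only the three program names come from this thread. The comparison is deliberately case-sensitive, since a spelling such as ImportUniprotTitan instead of ImportUniProtTitan was already reported above to cause an error.

```java
import java.util.Arrays;
import java.util.List;

public class UniProtImportOrderSketch {

    /** Program names as given in this thread, in the order they must run. */
    static final List<String> REQUIRED_ORDER = Arrays.asList(
        "SplitUniProtXMLFile",
        "ImportUniProtVerticesUsingFolderTitan",
        "ImportUniProtEdgesUsingFolderTitan");

    /**
     * Returns true if all three programs are present and appear in the
     * required relative order. Other programs may sit in between.
     * Names are compared case-sensitively.
     */
    public static boolean isValidOrder(List<String> programs) {
        int previous = -1;
        for (String name : REQUIRED_ORDER) {
            int index = programs.indexOf(name);
            if (index <= previous) {
                return false; // missing (index is -1) or out of order
            }
            previous = index;
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> planned = Arrays.asList(
            "SplitUniProtXMLFile",
            "ImportUniProtVerticesUsingFolderTitan",
            "ImportUniProtEdgesUsingFolderTitan");
        System.out.println("order ok: " + isValidOrder(planned));
    }
}
```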

To get GO annotation information, you would still need to import two more modules:
  1. GeneOntology
  2. UniprotGO

I know this might not seem very intuitive, and that's why we always recommend using Bio4j releases that are already imported in AWS.

Cheers,

Pablo



Stephen Coleman

Oct 21, 2015, 2:19:47 PM
to bio4j-user
Thank you. I understand that you are optimizing for AWS, but my lab can't justify the cost when we have a share of a large cluster that is sitting relatively idle. 

Pablo Pareja Tobes

Oct 22, 2015, 7:53:58 AM
to bio4j...@googlegroups.com
OK, let me know then if you run into any other issues.

Cheers,

Pablo