I observed that the batch mode takes a very long time to process as the number of sequences becomes large. I traced this to the following code in MELTING 5.2:
configuration/OptionManagement.java, line 605:
BasePair.initialiseNucleicAcidList();
Here is the function in sequences/BasePair.java:
public static void initialiseNucleicAcidList(){
    existingNucleicAcids.add("A");
    existingNucleicAcids.add("T");
    existingNucleicAcids.add("U");
    existingNucleicAcids.add("G");
    existingNucleicAcids.add("C");
    existingNucleicAcids.add("I");
    existingNucleicAcids.add("-");
    existingNucleicAcids.add("A*");
    existingNucleicAcids.add("AL");
    existingNucleicAcids.add("TL");
    existingNucleicAcids.add("GL");
    existingNucleicAcids.add("CL");
    existingNucleicAcids.add("UL");
    existingNucleicAcids.add("X_C");
    existingNucleicAcids.add("X_T");
}
The issue seems to be that the call in OptionManagement adds the nucleic acids to a static list that belongs to the BasePair class, not to an instance. The environment is reconstructed for every sequence, so this method is called once per sequence and the list grows by 15 entries with every initialization. Since every call to parse the nucleic acids from a sequence requires traversing the entire list, the total run time becomes O(n^2) in the number of sequences. I fixed this locally by removing the code in the initialization function and initializing the list statically:
...
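// Needs java.util.ArrayList and java.util.Arrays.
private static ArrayList<String> existingNucleicAcids = new ArrayList<String>(
    Arrays.asList(
        "A",
        "T",
        "U",
        "G",
        "C",
        "I",
        "-",
        "A*",
        "AL",
        "TL",
        "GL",
        "CL",
        "UL",
        "X_C",
        "X_T"
    )
);

public static void initialiseNucleicAcidList(){
    // Intentionally empty: the list is now populated once, when the class is loaded.
}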
This changed the run time for 50,000 sequences from > 2 hrs to ~15 sec.
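For anyone who wants to see the effect in isolation, here is a minimal sketch of the growth problem. It is not MELTING code: the class NucleicAcidListDemo and all of its method names are made up for illustration, and it only models the list handling, not the parsing. It assumes the original list is never cleared between sequences, which is what I observed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for sequences/BasePair.java; only the list handling is modelled.
class NucleicAcidListDemo {
    private static final List<String> SYMBOLS = Arrays.asList(
            "A", "T", "U", "G", "C", "I", "-", "A*",
            "AL", "TL", "GL", "CL", "UL", "X_C", "X_T");

    // Mimics the original behaviour: a class-level list that every call appends to.
    private static final ArrayList<String> growingList = new ArrayList<String>();

    static void initialiseGrowing() {
        growingList.addAll(SYMBOLS);
    }

    // Mimics the local fix: the list is filled once and the initializer does nothing.
    private static final ArrayList<String> fixedList = new ArrayList<String>(SYMBOLS);

    static void initialiseFixed() {
        // Intentionally empty.
    }

    // Size of the growing list after the given number of initializations.
    static int growingSizeAfter(int calls) {
        growingList.clear(); // reset so the demo can be rerun
        for (int i = 0; i < calls; i++) {
            initialiseGrowing();
        }
        return growingList.size();
    }

    // Size of the fixed list after the given number of initializations.
    static int fixedSizeAfter(int calls) {
        for (int i = 0; i < calls; i++) {
            initialiseFixed();
        }
        return fixedList.size();
    }

    public static void main(String[] args) {
        System.out.println(growingSizeAfter(1000)); // 15000
        System.out.println(fixedSizeAfter(1000));   // 15
    }
}
```

With the growing version, any scan that walks the whole list has to visit 15 entries per sequence processed so far, which is where the quadratic total comes from. An idempotent initializer (for example, one that returns early if the list is already populated) would presumably work too. I went with the static initializer because it was the smallest change.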
Best,
Duncan