Abacus gets stuck when retrieving protein lengths

Ivo Chamrád

Sep 7, 2011, 5:13:45 AM
to Abacus Support
Hi Damian,

we encountered the following problem: when we tried to launch Abacus with
our files, it got stuck during the first phase, while retrieving protein
lengths.

We are a little suspicious that the reason is the database we use: it is
the complete NCBInr protein database, which is about 8 GB.

(To test it, we ran Abacus with our data and another, significantly
smaller database. It worked: it processed the files we uploaded and then,
as expected, got stuck. So we think our data should not be the problem.)

Is the size of the database the limiting factor, or do you think the
problem is connected to something else? We are surprised by this because
we used this version of NCBInr for all the processing steps in TPP and it
worked properly.

Ivo

GATTACA

Sep 7, 2011, 8:24:45 AM
to Abacus Support
Well, that is a large database, but it shouldn't cause a crash unless
you are using a 32-bit operating system.
TPP is written completely in C++, so memory management and speed are
its focus rather than operating-system independence.

To see whether a crash is the problem, do the following:

Instead of double-clicking on the abacus.jar file, open a command-line
window and change to the directory that contains the abacus.jar file.
Then type: java -d64 -jar abacus.jar

This should launch the user interface. Now run Abacus as you normally
would. If there is a crash, the Java exception errors will be printed
to the command-line window.
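If it's easier, you can redirect everything to a text file first and
paste from there; with a standard command prompt:
java -d64 -jar abacus.jar > abacus_log.txt 2>&1
(abacus_log.txt is just an example file name.)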

Please paste those errors into the discussion forum thread so I can
see where the problem is.

If you don't get an error message at all, let me know that too.

Damian

Ivo Chamrád

Sep 8, 2011, 6:10:25 PM
to Abacus Support
Hi Damian,

Concerning our operating system: it is a 64-bit version (and we also
have 64-bit Java installed).

When opened from the command line, Abacus seemed to work properly, but
after a while we got the following error message:

Exception in thread "Thread-4" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Unknown Source)
        at java.lang.AbstractStringBuilder.expandCapacity(Unknown Source)
        at java.lang.AbstractStringBuilder.append(Unknown Source)
        at java.lang.StringBuilder.append(Unknown Source)
        at abacus.globals.parseFasta(globals.java:206)
        at abacus.abacusUI.workThread.run(abacusUI.java:2417)

Any idea what is causing our "crash" problem?

(If needed, I can send you a screenshot of the original command-line
window.)

Ivo

GATTACA

Sep 8, 2011, 7:01:52 PM
to Abacus Support
Yes, you are running out of memory when reading the fasta file.
Try running Abacus from the command line with this option:
java -d64 -jar abacus.jar

This will force Java to run in 64-bit mode.
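You can also raise the maximum Java heap size explicitly with the
standard -Xmx option, for example:
java -d64 -Xmx6g -jar abacus.jar
(choose a value that fits within your machine's physical RAM).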

Also, how much physical memory do you have? You will need at least 16 GB
of RAM for such a large fasta file.
Alternatively, you could use the keep DB option to try to reduce your
memory footprint.

Damian

Ivo Chamrád

Sep 10, 2011, 5:45:18 PM
to Abacus Support
I thought it was a memory problem.

Unfortunately, the only PC that has an original EN 64-bit Windows
installed has only 8 GB of RAM. Moreover, we cannot use the 64-bit
version of Java because it interferes with other software that we need
to keep running (this PC is our Proteinscape server and it handles our
data processing).

We also tried the keep DB option, but the result was the same: Abacus
crashed.

Therefore, we tried a slightly different approach: using the combined
protein XML file, our colleague extracted all identified sequences
from the database we use for identifications (the NCBInr mentioned
earlier) and saved them as fasta. Then we used this reduced database
for Abacus, and the analysis ran through. We were also able to get the
results for Qspec this way; the only limitation concerned the
calculation of NSAF, which Abacus could not do. It reported the
following error message:

2011-09-09T10:12:15.593+0100 SEVERE null
java.sql.SQLException: user lacks privilege or object not found: NAN
        at org.hsqldb.jdbc.Util.sqlException(Unknown Source)
        at org.hsqldb.jdbc.JDBCStatement.fetchResult(Unknown Source)
        at org.hsqldb.jdbc.JDBCStatement.executeUpdate(Unknown Source)
        at abacus.hyperSQLObject.getNSAF_values_prot(hyperSQLObject.java:2738)
        at abacus.abacusUI.workThread.run(abacusUI.java:2626)
Caused by: org.hsqldb.HsqlException: user lacks privilege or object not found: NAN
        at org.hsqldb.error.Error.error(Unknown Source)
        at org.hsqldb.error.Error.error(Unknown Source)
        at org.hsqldb.ExpressionColumn.checkColumnsResolved(Unknown Source)
        at org.hsqldb.ParserDML.resolveUpdateExpressions(Unknown Source)
        at org.hsqldb.ParserDML.compileUpdateStatement(Unknown Source)
        at org.hsqldb.ParserCommand.compilePart(Unknown Source)
        at org.hsqldb.ParserCommand.compileStatements(Unknown Source)
        at org.hsqldb.Session.executeDirectStatement(Unknown Source)
        at org.hsqldb.Session.execute(Unknown Source)
        ... 4 more

So, could you tell us what the problem is this time? Do you think the
approach we used (generating an artificial database from the combined
protein XML) is OK? If not, what is the reason for using the whole
database originally used for protein identifications?

And what databases and database sizes are best for Abacus? I am asking
because we usually use quite large databases for protein
identifications (Swiss-Prot, NCBInr or Human IPI), and I have never
seen a fasta database smaller than 1 GB. So, if you could give us some
recommendations, that would be really helpful for us.

Ivo

GATTACA

Sep 10, 2011, 6:06:48 PM
to Abacus Support
Well, you are at least making progress.

This error is probably occurring because of an error in the protXML
file names.
Can you show me the names of the protXML files? They should not
contain spaces.
Ideally, the protXML files should follow the default TPP syntax of
interact-TAG.prot.xml. Deviations from this cause problems. I have
tried to make Abacus tolerant of different naming conventions, but it's
hard to predict how people will name their files.

The reason for requiring the whole database is simplicity. Abacus uses
the FASTA file to compute protein lengths. Most users of Abacus are not
programmers, and for them it is far easier to provide the entire search
FASTA file than to generate a file specifically for Abacus.
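Conceptually, that step boils down to something like the following (a
minimal sketch of the idea, not the actual Abacus code; it assumes the
identifier is the header text up to the first whitespace):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    public class FastaLengths {
        public static void main(String[] args) throws IOException {
            Map<String, Integer> lengths = new HashMap<>();
            try (BufferedReader br = new BufferedReader(new FileReader(args[0]))) {
                String line, id = null;
                int len = 0;
                while ((line = br.readLine()) != null) {
                    if (line.startsWith(">")) {
                        if (id != null) lengths.put(id, len); // store the previous record
                        id = line.substring(1).split("\\s+", 2)[0]; // ID = text up to first whitespace
                        len = 0;
                    } else {
                        len += line.trim().length(); // accumulate residues of the current protein
                    }
                }
                if (id != null) lengths.put(id, len); // don't forget the last record
            }
            System.out.println(lengths.size() + " protein lengths read");
        }
    }

For an 8 GB file, even this single pass has to hold millions of
identifiers in memory, which is why the Java heap size matters so much
here.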

Very few users use the entire NCBI nr database to perform searches and
so the database file size hasn't been an issue before.

Damian

GATTACA

Sep 10, 2011, 7:23:45 PM
to Abacus Support
Out of curiosity, what search tool do you use? Sequest? Mascot?
X!Tandem?

I have actually never heard of anyone using the entire NCBI nr
database for performing a protein search. It must take a very long
time to complete just one search. Is there a particular reason you use
NCBI nr?

In our lab we routinely use Refseq Human (
ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/human.protein.faa.gz
) or Swissprot/Uniprot (
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/proteomes/HUMAN.fasta.gz
).

Most users search their data using a fasta file specific to the
organism for which they have mass spectrometry data.
The two links I gave above are for the human proteome. Both of those
files are under 100 MB in size. The entire Swissprot fasta file is
about 80 MB.

You may want to consider using a smaller database in general. Apart
from the obvious reduction in search time, your results may be more
meaningful depending upon the search tool you are using. For X!Tandem,
I know that as the database size grows, the number of peptide matches
reported as significant increases, but the false discovery rate
increases at a faster rate.

Damian

Ivo Chamrád

Sep 16, 2011, 6:18:12 AM
to Abacus Support
Finally, we were able to solve the problem with the NSAF computation.
The cause was the format of the headers of several proteins in the
database we used. After correcting them manually, the NSAF computation
was OK.

To answer your question: for the database search, we use Mascot
running on a local server (version 2.2).

The reason for selecting NCBInr is our model organism: unfortunately,
we work with wheat (T. aestivum), which has not been sequenced yet
(actually it has, but its genome has not been released). So, for this
species, you cannot simply download your desired FASTA database. We
tried various databases for green plants (particular parts of RefSeq
or Swissprot), but they did not work for us.

But, following your advice, we kept searching, and last week we found
the TIGR database for wheat (a tentative consensus sequence database).
As the sequences were nucleotides, we translated all 6 ORFs of every
sequence and then filtered out entries shorter than 100 AA (roughly as
sketched below). The resulting database is about 240 MB and is
compatible with TPP.
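For completeness, the translation and filtering step was roughly the
following (a simplified sketch, not our colleague's actual script; it
assumes uppercase A/C/G/T input and keeps any stop-free stretch of at
least 100 AA):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class SixFrameFilter {

        // Standard codon table; codons enumerated in TCAG order.
        private static final Map<String, Character> CODONS = new HashMap<>();
        static {
            String b = "TCAG";
            String aa = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";
            int i = 0;
            for (char x : b.toCharArray())
                for (char y : b.toCharArray())
                    for (char z : b.toCharArray())
                        CODONS.put("" + x + y + z, aa.charAt(i++));
        }

        private static String reverseComplement(String dna) {
            StringBuilder sb = new StringBuilder(dna.length());
            for (int i = dna.length() - 1; i >= 0; i--) {
                switch (dna.charAt(i)) {
                    case 'A': sb.append('T'); break;
                    case 'T': sb.append('A'); break;
                    case 'C': sb.append('G'); break;
                    case 'G': sb.append('C'); break;
                    default:  sb.append('N'); // anything else becomes N
                }
            }
            return sb.toString();
        }

        private static String translate(String dna, int frame) {
            StringBuilder sb = new StringBuilder();
            for (int i = frame; i + 3 <= dna.length(); i += 3) {
                Character aa = CODONS.get(dna.substring(i, i + 3));
                sb.append(aa == null ? 'X' : aa); // X for ambiguous codons
            }
            return sb.toString();
        }

        // Translate all 6 frames and keep stop-free stretches of >= minLen residues.
        public static List<String> longOrfs(String dna, int minLen) {
            List<String> kept = new ArrayList<>();
            String rc = reverseComplement(dna);
            for (int frame = 0; frame < 3; frame++)
                for (String strand : new String[] { dna, rc })
                    for (String piece : translate(strand, frame).split("\\*"))
                        if (piece.length() >= minLen) kept.add(piece);
            return kept;
        }
    }

Each stretch returned by longOrfs(sequence, 100) then became its own
FASTA entry.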

Unfortunately, it is not compatible with Abacus. When we try to use
it, the following error message occurs:

Sep 16, 2011 11:42:43 AM abacus.abacusUI.workThread run
SEVERE: null
java.sql.SQLException: statement is not in batch mode
at org.hsqldb.jdbc.Util.sqlException(Unknown Source)
at org.hsqldb.jdbc.Util.sqlException(Unknown Source)
at org.hsqldb.jdbc.Util.sqlExceptionSQL(Unknown Source)
at org.hsqldb.jdbc.JDBCPreparedStatement.executeBatch(Unknown
Source)
at abacus.hyperSQLObject.makePepUsageTable(hyperSQLObject.java:
1934)
at abacus.abacusUI.workThread.run(abacusUI.java:2613)
Caused by: org.hsqldb.HsqlException: statement is not in batch mode
at org.hsqldb.error.Error.error(Unknown Source)
at org.hsqldb.error.Error.error(Unknown Source)
... 4 more

I think there is something wrong with the format of the sequences. In
particular, I am suspicious about the headers (based on our earlier
experience).

Am I right, or do we have another problem?

Thanks for all your help.

Ivo

Ivo Chamrád

Sep 16, 2011, 6:56:32 AM
to Abacus Support
Or, maybe, could you specify what the headers should look like to be
recognized by Abacus without problems?

E.g., we found that they cannot look like this:

>TA_4859_58_9

That is insufficient; we had to add a protein description:

>TA_4859_58_9 CBP1 homologue 1

Then everything was OK. Could you explain this behaviour? Because, in
general (according to the rules of the FASTA format), the first
version should be fine too.

Ivo

GATTACA

Sep 16, 2011, 11:02:12 AM
to Abacus Support
This is strange.
Abacus does have code to support cases where a protein description is
not present in the fasta file.
However, your error message is related to the protein identifier.
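For reference, the convention I would expect any FASTA parser to
follow is that the identifier is the header text up to the first
whitespace, i.e. something like this (a hypothetical helper, not
Abacus code):

    // ">TA_4859_58_9 CBP1 homologue 1"  ->  "TA_4859_58_9"
    static String fastaId(String headerLine) {
        return headerLine.substring(1).split("\\s+", 2)[0];
    }

Under that convention, a description-less header such as >TA_4859_58_9
should, in principle, be fine.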

Can you send me the *.prot.xml and *.pep.xml files you are trying to
process through Abacus so I can take a look at them?

The fasta file would be nice too, if you can send it (I believe Gmail
supports large attachments, up to 2 GB).

Damian

GATTACA

Sep 16, 2011, 11:53:48 PM
to Abacus Support
Okay, after I got your files I figured out the problem and fixed it.
If you download the newest release of Abacus from SourceForge, it
should work with your data.

The problem had nothing to do with the fasta file header lines. The
HSQLDB backend of Abacus is case-sensitive. As a result,
interact-max00454_pd3_pseudo_3864_score5.pep.xml could not be paired
with interact-MAX00454_PD3_pseudo_3864_score5.prot.xml, because "max"
is not considered the same as "MAX" in the file names.

I have updated the code to handle this situation.
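In other words, the fix amounts to normalizing the case of the file
tags before pairing them, something along these lines (a sketch of the
idea, not the exact code):

    // derive a case-insensitive pairing key from each file name
    String pepTag  = "interact-max00454_pd3_pseudo_3864_score5.pep.xml"
            .replace(".pep.xml", "").toUpperCase(java.util.Locale.ROOT);
    String protTag = "interact-MAX00454_PD3_pseudo_3864_score5.prot.xml"
            .replace(".prot.xml", "").toUpperCase(java.util.Locale.ROOT);
    boolean paired = pepTag.equals(protTag); // now true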

Please try it out and let me know if it works for you now.

Damian

Ivo Chamrád

Oct 27, 2011, 9:29:11 AM
to Abacus Support
Hi guys,

I have just realized that we never posted a final comment about our
problem.

So, as Damian wrote in his previous post, our problem was caused by an
error in the pairing of the peptide and protein lists due to
differences in their names (a name in capital letters was considered
different from the same name in lower-case letters). These differences
came from TPP, which generates names in such forms.

After the code update, Abacus works smoothly. Thanks a lot for your
help, Damian!

Ivo