maximum fasta file size


AZIZ ALNAKLI

Oct 21, 2020, 12:14:57 AM
to spctools...@googlegroups.com
Hi,
What is the maximum FASTA file size that TPP can handle?
Does anyone know?

Thank you all. 

--

Aziz Alnakli

Master of Research Candidate (Bioinformatics Research Group)

Faculty of Science and Engineering |   Level 1 Room 120, F7B Building (4 Wally's Walk)
Macquarie University, NSW 2109, Australia

M: + 61 424375954 | aziz.a...@students.mq.edu.au

Macquarie University

Eric Deutsch

Oct 21, 2020, 12:46:24 PM
to spctools...@googlegroups.com, Eric Deutsch

Hi Aziz, there is no specific maximum size. In general, the larger your database, the greater the memory requirements and CPU time needed by the software tools, and the lower your sensitivity, although this depends on the degree of redundancy in your database. Hundreds of MB and hundreds of thousands of protein sequences are routine.

 


AZIZ ALNAKLI

Oct 21, 2020, 11:44:53 PM
to spctools...@googlegroups.com
Thanks Eric for your reply. 

The database I have is a single file that is over 10 GB in size. It has been almost 24 hours since I started the job, and it is still running.
Can you suggest how long a job dealing with such a huge file should take?

Thanks. Regards.





Eric Deutsch

Oct 22, 2020, 1:46:03 AM
to spctools...@googlegroups.com, Eric Deutsch

Hi Aziz, I can’t give an estimate because it depends on so many things. How many spectra? Fully tryptic or semi-tryptic? How many PTMs? Are you using Comet? How many cores/threads does your machine have, and how many did you let Comet use? Etc. If you use Comet, you can usually look at the output to see its progress. You should see something like:

 

Search start:  09/29/2020, 10:35:28 AM

   - Load spectra: 15009

     - Search progress:  29%

     - Post analysis:  done

   - Load spectra: 15009

     - Search progress:  54%

Etc.

 

That should give you a sense of how quickly progress is being made.

 

Also, check the RAM usage on your machine to make sure the computer isn’t swapping due to low memory conditions.

 

Eric
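
As a rough illustration of the progress check described above, the following sketch pulls the most recent "Search progress" percentage out of captured Comet console text. It assumes the output has been saved or piped somewhere as plain text in the format quoted in the message; the function name `latest_progress` is hypothetical and not part of Comet or TPP.

```python
import re

# Matches lines such as "     - Search progress:  29%" in the console output quoted above.
PROGRESS_RE = re.compile(r"Search progress:\s*(\d+)%")


def latest_progress(console_text):
    """Return the last reported search percentage, or None if none was found."""
    matches = PROGRESS_RE.findall(console_text)
    return int(matches[-1]) if matches else None


# Sample text copied from the output shown in the message above.
sample = """Search start:  09/29/2020, 10:35:28 AM
   - Load spectra: 15009
     - Search progress:  29%
     - Post analysis:  done
   - Load spectra: 15009
     - Search progress:  54%
"""

print(latest_progress(sample))
```

Note that the percentage restarts for each batch of loaded spectra, so it shows progress within the current batch rather than over the whole file.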

AZIZ ALNAKLI

Oct 22, 2020, 8:48:29 PM
to spctools...@googlegroups.com

Hi Eric

I appreciate your help Eric so much, and Shoba (my supervisor) actually knows you and says hi.


I am doing the search with the X!Tandem pipeline instead of Comet, and maybe that is why I am unable to view the progress of the X!Tandem search.

I would appreciate it if you could comment on the screenshot of the job I am currently running. Does it look healthy?

image.png

 

Also, the screenshot below shows the specs of the desktop I am using to perform the analysis.

image.png

 

Here is another question: can I run multiple jobs at the same time? Would that affect how quickly the results come back?

 

Thanks for your support.





Eric Deutsch

Oct 23, 2020, 5:04:40 PM
to spctools...@googlegroups.com, Eric Deutsch

I don’t recall the X!Tandem progress output well enough to say what your screenshot shows. Maybe someone else knows.

 

Regarding the CPU usage, it looks like you have plenty of memory, so that’s good. But it does look like X!Tandem is not using all of the cores on the machine. You could set the number of threads as high as 8 on that machine.

The parameter is spectrum, threads:

https://www.thegpm.org/TANDEM/api/st.html

 

Of course, if you use all 8 threads, it might make the computer difficult to use interactively while the search runs.

 

You could also run another search in parallel, yes. But keep an eye on the memory, because if you run too many at once, you will run out of memory.

 

Regards,

Eric
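
For illustration only, here is a minimal sketch of setting the "spectrum, threads" parameter in an X!Tandem input file with Python's standard library. It assumes the usual X!Tandem layout of `<note type="input" label="...">` elements under a `<bioml>` root; the helper name `set_tandem_threads` and the tiny example document are hypothetical.

```python
import xml.etree.ElementTree as ET


def set_tandem_threads(xml_text, threads):
    """Set (or add) the 'spectrum, threads' note and return the updated XML text."""
    root = ET.fromstring(xml_text)
    label = "spectrum, threads"
    for note in root.iter("note"):
        if note.get("type") == "input" and note.get("label") == label:
            note.text = str(threads)
            break
    else:
        # The parameter was not present, so append it to the root element.
        note = ET.SubElement(root, "note", {"type": "input", "label": label})
        note.text = str(threads)
    return ET.tostring(root, encoding="unicode")


# Hypothetical minimal input file with a single-threaded setting.
example = """<bioml>
  <note type="input" label="spectrum, threads">1</note>
</bioml>"""

updated = set_tandem_threads(example, 8)
print(updated)
```

The value 8 matches the thread count suggested above for that particular desktop; on another machine it should reflect the cores you are willing to dedicate to the search.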

 

 


AZIZ ALNAKLI

Nov 11, 2020, 12:37:13 AM
to spctools...@googlegroups.com
Hi Eric,
Thanks for your support.

I ended up killing the job and redoing it with Comet, and I got some results.

However, I am still not sure whether the steps I took were correct.

My question is: do search engines such as Comet create the decoy database by default, or do I have to create it myself?

Thanks.




Eric Deutsch

Nov 11, 2020, 2:10:05 AM
to spctools...@googlegroups.com, Eric Deutsch

Hi, Comet does have a parameter for generating an internal decoy database:

http://comet-ms.sourceforge.net/parameters/parameters_201901/decoy_search.php

 

but I confess I’ve never used it, so I have no experience with that. Maybe someone else does.

I always create my own appended decoy database.

 

But if your database is enormous, there is good reason to avoid that.
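
One common way to build a combined target-decoy database by hand is to write each protein followed by a reversed copy of its sequence. The sketch below shows that idea only; it is not the procedure used in this thread. The `DECOY_` prefix and the function names are assumptions, and the prefix must match whatever your downstream tools are configured to expect.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def with_reversed_decoys(lines, prefix="DECOY_"):
    """Yield FASTA lines for every target, each followed by its reversed decoy."""
    for header, seq in read_fasta(lines):
        yield f">{header}"
        yield seq
        yield f">{prefix}{header}"  # prefix is an assumption, not a TPP requirement
        yield seq[::-1]


# Hypothetical two-line protein entry used only to demonstrate the output.
targets = [">sp|P1|TEST_HUMAN example", "MKTAYIAK", "QR"]
out = list(with_reversed_decoys(targets))
print("\n".join(out))
```

Because the generator streams one entry at a time, it can be fed an open file handle without loading the whole database into memory. The output is roughly twice the size of the input, which matters for a very large database.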

AZIZ ALNAKLI

Feb 24, 2021, 1:15:45 AM
to spctools...@googlegroups.com, Eric Deutsch
Hi Eric,
I hope this email finds you well.

Can I process multiple mzML files together? Would that affect the specificity of the results?
I have 24 mzML files (~5 GB each), each of which I need to run for 3 hours, 10 times, so this will take me around 270 hrs to analyse. I therefore thought of running them as a batch, and I wanted to get some suggestions before proceeding.


Thanks 
Aziz




Eric Deutsch

Feb 25, 2021, 12:59:32 AM
to AZIZ ALNAKLI, spctools...@googlegroups.com, Eric Deutsch

Hi Aziz, thanks for your questions. A few comments inline:

> Can I process multiple mzML files together?

After you process all the mzML files separately, you would use the TPP tools (PeptideProphet) to merge the results into a single result set, yes, if that’s the question.

> Would that affect the specificity of the results?

Generally, merging the results of multiple runs will lead to better models and better specificity of results.

> I have 24 mzml files (~5 GB each)

I wonder whether these files are centroided or in profile mode. 5 GB each seems like profile mode. You may benefit from centroiding your data first, but it depends on what instrument you’re using and many other factors.

> which I need to run each for 3 hours 10 times, so this will take me around 270 hrs to analyse. Therefore, I thought of batch running them and I wanted to get some suggestions before proceeding.

I suggest optimizing your search on one file first before processing them all, lest you expend 270 hours of compute time only to find a problem. Make sure you’re using all available cores on your machine. Maybe you can spread the task among multiple machines, either local computers or cloud computing servers.

 

I hope you find these answers helpful.

 

Regards,

Eric
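
The "one file first, then the rest, possibly on several machines" suggestion can be sketched as a small planning script. Everything named here is hypothetical: the `comet` executable name, the `comet.params` file, and the `run_NN.mzML` file names. The `-P<params>` form reflects my understanding of Comet's command line and should be checked against your installed version. The script only builds command lists; it does not run anything.

```python
def build_comet_commands(mzml_files, params="comet.params", exe="comet"):
    """Build one Comet command (as an argument list) per input file."""
    return [[exe, f"-P{params}", str(path)] for path in mzml_files]


def split_across_machines(commands, n_machines):
    """Distribute commands round-robin over n_machines work lists."""
    return [commands[i::n_machines] for i in range(n_machines)]


# 24 hypothetical input files, mirroring the file count mentioned in the thread.
files = [f"run_{i:02d}.mzML" for i in range(1, 25)]
commands = build_comet_commands(files)

# Search a single pilot file first to validate the parameters ...
pilot, rest = commands[0], commands[1:]
# ... then spread the remaining searches over, say, three machines.
shards = split_across_machines(rest, 3)

print("pilot:", " ".join(pilot))
for i, shard in enumerate(shards, start=1):
    print(f"machine {i}: {len(shard)} searches")
```

Each argument list could be handed to `subprocess.run` on the machine it is assigned to once the pilot search looks sensible.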

AZIZ ALNAKLI

Mar 25, 2021, 5:35:36 PM
to Eric Deutsch, spctools...@googlegroups.com
Thanks Eric for answering the above questions. 
I am sorry I did not get back to you earlier. 

I am now about to add a positive control to the experiment. To do so, I am planning to reanalyse a TMT raw file through TPP (Comet search). Would you recommend any modifications to the Comet parameter file, or are there other considerations to take into account? Or will Comet apply the necessary modifications automatically, so that I can simply treat it as an ordinary MS file?

I appreciate your support.   


Jimmy Eng

Mar 25, 2021, 5:44:03 PM
to spctools...@googlegroups.com
You'll need to specify the TMT modifications for the search.  Assuming you're using the 6-plex TMT, you would minimally set the following parameters:

   add_Nterm_peptide = 229.162932
   add_K_lysine = 229.162932

If you're using the 16-plex TMTpro, the mass addition would be 304.207146.
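
As a sketch of how those two settings might be applied to an existing parameter file, the snippet below rewrites the values of `add_Nterm_peptide` and `add_K_lysine` in the text of a Comet params file while leaving trailing comments intact. The masses are the ones quoted above; the helper names and the two-line example are hypothetical, and a real params file contains many more entries.

```python
import re

# Mass additions quoted in the message above.
TMT_MASS = {"tmt6plex": 229.162932, "tmtpro16plex": 304.207146}


def set_param(params_text, name, value):
    """Replace the value of 'name = value' in the params text, keeping any comment."""
    pattern = re.compile(rf"^(\s*{re.escape(name)}\s*=\s*)\S+", re.MULTILINE)
    new_text, count = pattern.subn(lambda m: f"{m.group(1)}{value:.6f}", params_text)
    if count == 0:
        raise KeyError(f"parameter {name!r} not found")
    return new_text


def apply_tmt(params_text, reagent="tmt6plex"):
    """Set the peptide N-terminus and lysine additions for the chosen reagent."""
    mass = TMT_MASS[reagent]
    for name in ("add_Nterm_peptide", "add_K_lysine"):
        params_text = set_param(params_text, name, mass)
    return params_text


# Hypothetical excerpt of a params file with both additions switched off.
example = """add_Nterm_peptide = 0.0
add_K_lysine = 0.0000                  # added to K
"""

print(apply_tmt(example))
```

These are only the minimal static modifications named in the reply; other search settings still need to be reviewed for the experiment at hand.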

AZIZ ALNAKLI

Mar 25, 2021, 6:17:52 PM
to spctools...@googlegroups.com, jke...@gmail.com, Eric Deutsch
Thanks Jimmy for your reply.

I am dealing with 11-plex TMT data. Any idea which specific parameters should be modified?

Thanks.

Jimmy Eng

Mar 25, 2021, 6:59:03 PM
to AZIZ ALNAKLI, spctools...@googlegroups.com
Yes, the two parameters I specified in my post.