Including viruses in the RefSeq database


rachael...@gmail.com

Jul 16, 2018, 5:58:57 AM
to SAMSA bioinformatics group
Hi Sam,

Could you provide details on how the RefSeq_bac.fa file, downloadable by SAMSA2, was created?

I would like to create an updated database in the same way, using the most recent RefSeq release and also including viral genomes. The database should contain genomes at both the complete and draft stages, not just the complete ones (my species of interest have assemblies at the 'scaffold' stage).

If I download all of the non-redundant protein files in:

ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria

ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral


and concatenate them together, would this give me a database built equivalently to the one provided with SAMSA2?


Regards,
Rachael

Sam Westreich

Jul 16, 2018, 6:15:12 PM
to rachael...@gmail.com, SAMSA bioinformatics group
Hi Rachael,

Yes, if you download the protein files (.faa.gz) from the links you shared and concatenate them, you can use the result as the reference for SAMSA2. You can build the DIAMOND-formatted version with the command:

diamond makedb --in merged.faa --db merged.dmnd

The database provided with SAMSA2 is bacterial only, but this approach should handily add viruses as well.
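
For readers following along, the download-and-concatenate step described above could be sketched roughly as follows. This is only an illustration, not the procedure used to build the database shipped with SAMSA2. It assumes the .faa.gz files have already been downloaded into local directories; the function name merge_faa and the directory names refseq_bacteria and refseq_viral are hypothetical. The DIAMOND call is shown only as a comment and uses the documented --db option (short form -d).

```shell
# Sketch: merge already-downloaded RefSeq protein FASTA files into one file.
# merge_faa and the directory names in the example are hypothetical.

merge_faa() {
    # Usage: merge_faa OUTPUT.faa DIR [DIR ...]
    out="$1"
    shift
    : > "$out"                          # start from an empty output file
    for dir in "$@"; do
        for f in "$dir"/*.faa.gz; do
            [ -e "$f" ] || continue     # skip directories with no matching files
            gzip -dc "$f" >> "$out"     # decompress and append to the merged file
        done
    done
}

# Example (not run automatically):
#   merge_faa merged.faa refseq_bacteria refseq_viral
#   grep -c '^>' merged.faa                          # number of protein records
#   diamond makedb --in merged.faa --db merged.dmnd  # build the DIAMOND database
```

Counting the '>' header lines after merging gives a quick sanity check that every input file actually contributed records before the DIAMOND database is built.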

Best,
Sam

--
You received this message because you are subscribed to the Google Groups "SAMSA bioinformatics group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to samsa-bioinformatics-group+unsub...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/samsa-bioinformatics-group/21efae37-7848-4822-8837-4854c8a4142d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Sam Westreich
Microbiome Scientist, DNAnexus, 

Rachael Lappan

Jul 16, 2018, 11:00:13 PM
to Sam Westreich, SAMSA bioinformatics group

Thanks Sam,

Should it be the x.protein.faa.gz files or the nonredundant_protein.x.protein.faa.gz files? It looks like the non-redundant ones are provided only for bacterial, not viral genomes - so perhaps I should use the former?

Cheers,
Rachael

Sam Westreich

Jul 17, 2018, 1:45:32 PM
to Rachael Lappan, SAMSA bioinformatics group
Hi Rachael,

I do see non-redundant proteins here (ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/), but you could also use the regular x.protein.faa.gz files. Since there are far fewer viral protein sequences than bacterial ones, this shouldn't cause too much database "bloat".

Best,
Sam


Rachael Lappan

Jul 17, 2018, 10:52:30 PM
to samsa-bioinfo...@googlegroups.com

Thanks Sam,

At the moment (release 89, July 13th 2018) those non-redundant viral files are very small and only contain a handful of proteins - an issue I have alerted the RefSeq staff to. I'll use the regular protein files for viruses and non-redundant for bacteria. Thank you!
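
One way to spot unexpectedly small files like these is to count the FASTA records in each downloaded file before merging. The following is a minimal sketch under the assumption that the files are gzip-compressed protein FASTA; the function name count_proteins and the example path are hypothetical.

```shell
# Sketch: report how many protein records each .faa.gz file contains.
# count_proteins is a hypothetical helper name.

count_proteins() {
    # Usage: count_proteins FILE.faa.gz [FILE.faa.gz ...]
    # Prints "<count><TAB><file>" for each input file.
    for f in "$@"; do
        # grep -c exits non-zero when the count is 0, so "|| true" keeps
        # the function usable under "set -e".
        n=$(gzip -dc "$f" | grep -c '^>' || true)
        printf '%s\t%s\n' "$n" "$f"
    done
}

# Example (not run automatically), smallest files first:
#   count_proteins refseq_viral/*.faa.gz | sort -n | head
```

Files that report only a handful of records stand out at the top of the sorted list and can be checked by hand before they go into the merged reference.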

Cheers,
Rachael
