Replacing RefSeq with NCBI NR protein database

Graham Colby

Apr 9, 2018, 1:28:08 PM

to SAMSA bioinformatics group

Hi,

When working with environmental metacommunities, RefSeq often fails to capture the diversity of sequences that may be present in samples, as it includes only organisms with complete sequences. Is there a quick way to switch the annotation from the RefSeq database to the NR protein database (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz), which includes both RefSeq and GenBank?

Is this something you'd consider implementing?

Otherwise, I am interested in modifying the scripts so that they use the NR database. Do I only need to modify the master script?

Thanks,

Graham

Sam Westreich

Apr 9, 2018, 1:42:45 PM

to Graham Colby, SAMSA bioinformatics group

Hi Graham,

If you're working with the SAMSA pipeline (https://github.com/transcript/samsa), all of the databases that are used have been externally selected by MG-RAST. I'm not an MG-RAST developer, so SAMSA is limited to the databases currently available. MG-RAST does offer both RefSeq and GenBank as separate databases, and this pipeline works with both options - but it doesn't have the NCBI NR database, I believe.

If you're interested in using SAMSA2, which is standalone and can incorporate custom databases (but requires a server instance to run effectively), it should be able to use the NR database. There may need to be some code modification if the NR database uses a different sequence labeling approach than RefSeq.

Best,

Sam Westreich

--
You received this message because you are subscribed to the Google Groups "SAMSA bioinformatics group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to samsa-bioinformatics-group+unsub...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/samsa-bioinformatics-group/d97248ab-6824-486d-b3ea-05693c51641d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

Sam Westreich

DEB Biotechnology Fellow

Integrative Genetics and Genomics Graduate Group, University of California, Davis

College of Biological Sciences, University of Minnesota

Are you doing what you want to be doing?

Graham Colby

Apr 9, 2018, 2:24:27 PM

to SAMSA bioinformatics group

Thanks for the quick response.

I am interested in using SAMSA2 on a server. Can you expand on what you mean by sequence labeling approach? Is this just in regard to accession number format? I know RefSeq and GenBank will have different accessions.

Graham

Sam Westreich

Apr 9, 2018, 2:37:20 PM

to Graham Colby, SAMSA bioinformatics group

Ah, okay, so this will be feasible.

For RefSeq accessions, the header lines of the FAA files downloaded from NCBI have the following appearance:

>NCBI_ID Description of the gene [Organism genus species strain]

Parsing happens on lines 127-183 in this script: https://github.com/transcript/samsa2/blob/master/python_scripts/DIAMOND_analysis_counter.py

I'm writing back to you before waiting for the NR database to download so you can get a faster reply, but you'll need to modify this block of code if the NR accession headers are different. The script currently uses the brackets to identify the organism of origin's name, and pulls everything between the ID and the first bracket to get the description of function.

As long as these 3 bits of information are pulled out in some way and supplied to the script, the rest of the pipeline should run as usual.
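
To make those three pieces concrete, here is a minimal Python sketch of this kind of header parsing. It is not the code from DIAMOND_analysis_counter.py: the function name parse_refseq_header is made up for illustration, and it assumes a single organism in one pair of brackets at the end of the line. The example header is the RefSeq entry quoted later in this thread.

```python
def parse_refseq_header(line):
    """Split a RefSeq-style FASTA header into (accession, description, organism).

    Hypothetical helper for illustration; assumes the layout
    '>ID description [Organism name]'.
    """
    header = line.strip().lstrip(">")
    # The accession is everything up to the first space.
    accession, _, rest = header.partition(" ")
    bracket = rest.find("[")
    if bracket == -1:
        # No bracketed organism: return the whole remainder as the description.
        return accession, rest.strip(), None
    # The description sits between the ID and the first bracket.
    description = rest[:bracket].strip()
    organism = rest[bracket + 1:].rstrip("]").strip()
    return accession, description, organism


example = ">WP_003777422.1 replicative DNA helicase [Alloiococcus otitis]"
print(parse_refseq_header(example))
```

If NR headers turn out to follow a different layout, only a function like this one would need to change, as long as it still returns the same three values.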

Let me know if this makes sense, and if I can be of any other help!

Sam

Message has been deleted

Sam Westreich

Jul 16, 2018, 6:10:27 PM

to rachael...@gmail.com, SAMSA bioinformatics group

Hi Rachael,

Question 1: I downloaded the RefSeq NR protein database and did a search for one of your organisms, Turicella otitidis, and didn't see any instances where there were multi-organism labels. Can you link me to where you found the NR database with multi-organism labels in it? Is this DNA or protein?

Question 2: The SEED Subsystems database is a few years old (and not likely to be updated, sadly - I need to switch at some point over to GO terms), but it's structured around functions rather than organisms. One interesting approach might be to search against the RefSeq NR database for organisms, filter down to all hits that matched to Turicella, and then run that subset of samples against Subsystems to see the percentage of matching reads, along with their labeled functions.

Question 3: Yes, I suspect that organisms that are more completely sequenced and analyzed are likely to be over-represented. SAMSA2 is set up to report "best 1" for each sequence fed into the search step, but that could be configured to include all matches above a certain threshold, if desired.

Another option could be to grab the (draft) genomes of your organisms of interest and build those into a second reference database. You could annotate against both total nonredundant and this custom, smaller reference, and then compare to see how many reads overlap (and are thus mistakenly identified as a more well-annotated organism).

Question 4: Yes, in the Step 5 results, it's the percentage of all mapped reads, followed by the number of mapped reads, in columns 1 and 2.
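
As a quick sanity check on that column order, the sketch below reads the two sample rows quoted further down in this thread. The helper name parse_summary_line is hypothetical, and splitting on any whitespace is an assumption about the file layout. If column 1 really is count / total * 100, every row should imply roughly the same total number of mapped reads.

```python
def parse_summary_line(line):
    """Return (percent_of_mapped_reads, mapped_read_count, name) for one row.

    Hypothetical helper; assumes whitespace-separated columns with the
    organism name last (the name itself may contain spaces).
    """
    percent, count, name = line.strip().split(None, 2)
    return float(percent), int(count), name


rows = [
    "13.468642143 16663 Streptococcus pneumoniae",
    "9.17658850441 11353 Staphylococcus aureus",
]

for row in rows:
    percent, count, name = parse_summary_line(row)
    # Back out the total that this row's percentage implies.
    implied_total = count / (percent / 100.0)
    print(name, count, round(implied_total))
```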

Let me know on question 1 where you're getting these different headers for RefSeq's NR, and I can see if there's an easy programmatic solution.

Sam

On Fri, Jul 13, 2018 at 12:00 AM, <rachael...@gmail.com> wrote:

Hi Sam,

I am also considering giving the NCBI NR protein database a shot. My samples are human middle ear fluids (from middle ear infections) that contain only a few species, but two of these - Alloiococcus otitis and Turicella otitidis - have genome assemblies only in the 'scaffold', not 'complete' stage in RefSeq. My thoughts were that the NCBI NR database might contain more information on these species of interest. I have a few questions regarding this:

1. Did you determine how to modify the script/headers to make SAMSA2 compatible with the NCBI NR database? I've had a look at the file (nr.gz) and it seems that there are multiple annotations per header. Some of these are very long; apparently, where sequences are identical, the headers are collapsed. From the README:

"To be merged two sequences must have identical lengths and every residue at every position must be the same. The FASTA deflines for the different entries that belong to one record are separated by control-A characters invisible to most programs. In the example below both entries Q57293.1 and AAB05030.1 have the same sequence, in every respect:

>Q57293.1 RecName: Full=Fe(3+) ions import ATP-binding protein FbpC ^AAAB05030.1 afuC
[Actinobacillus pleuropneumoniae] ^AAAB17216.1 afuC [Actinobacillus pleuropneumoniae]
MNNDFLVLKNITKSFGKATVIDNLDLVIKRGTMVTLLGPSGCGKTTVLRLVAGLENPTSGQIFIDGEDVTKSSIQNRDIC
IVFQSYALFPHMSIGDNVGYGLRMQGVSNEERKQRVKEALELVDLAGFADRFVDQISGGQQQRVALARALVLKPKVLILD
EPLSNLDANLRRSMREKIRELQQRLGITSLYVTHDQTEAFAVSDEVIVMNKGTIMQKARQKIFIYDRILYSLRNFMGEST
ICDGNLNQGTVSIGDYRFPLHNAADFSVADGACLVGVRPEAIRLTATGETSQRCQIKSAVYMGNHWEIVANWNGKDVLIN
ANPDQFDPDATKAFIHFTEQGIFLLNKE
"

How could this be handled? It doesn't seem feasible because when I search for Turicella (zcat nr.gz | grep 'Turicella' | head | less) I can see that some proteins' first annotation is [Bacteria] followed by several species of different genera. While it seems that the NR database contains a lot more information about Turicella (it appears in 8405 entries in the RefSeq database and 15566 in NR) I wonder if this is because it appears in the headers for more conserved proteins (along with several other species), compared to RefSeq where each entry has a single taxonomic annotation. Perhaps RefSeq is okay for my purposes? At least they are in the database! My current thoughts are to keep things simple and stick with RefSeq but make sure I have the latest release.

2. Is the SEED subsystems database still relevant for organisms like this - or was it built from the proteins in 'complete' genomes? Am I likely to miss a large chunk of functions if they are from less well-characterised genomes?

3. Similarly, could the results end up biased by the coverage of the organism in the databases? For example, Streptococcus pneumoniae is also in my samples and there are many more annotations to this than to my two species of interest because it's a well-characterised organism. Could the results show a larger proportion of transcription attributed to S. pneumoniae just because it's covered more thoroughly in the database? This isn't an issue specific to SAMSA2 but just wondering if I need to take this into account when interpreting my results.

4. Can you confirm what the columns are in the Step 5 RefSeq output results: is this the percentage of all mapped reads, then the number of mapped reads?

13.468642143 16663 Streptococcus pneumoniae
9.17658850441 11353 Staphylococcus aureus
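
On question 1, one possible way to handle the merged NR headers is to split each header on the Control-A character (byte 0x01) and parse every defline separately. The Python sketch below is only an illustration under that assumption: split_nr_header is a hypothetical name, it is not part of SAMSA2, and it treats the last bracketed group of each defline as the organism. How to pick a single organism for a read that hits a merged record is a separate decision that this sketch does not make.

```python
CTRL_A = "\x01"  # separator between merged deflines, per the README quoted above


def split_nr_header(line):
    """Split one merged NR header into (accession, description, organism) tuples."""
    header = line.rstrip("\n").lstrip(">")
    entries = []
    for defline in header.split(CTRL_A):
        accession, _, rest = defline.strip().partition(" ")
        description, organism = rest, None
        # Treat a trailing [...] group as the organism, when there is one.
        if rest.endswith("]") and "[" in rest:
            start = rest.rfind("[")
            organism = rest[start + 1:-1]
            description = rest[:start].strip()
        entries.append((accession, description, organism))
    return entries


# The README example, rebuilt with real Control-A characters.
example = (
    ">Q57293.1 RecName: Full=Fe(3+) ions import ATP-binding protein FbpC "
    + CTRL_A + "AAB05030.1 afuC [Actinobacillus pleuropneumoniae] "
    + CTRL_A + "AAB17216.1 afuC [Actinobacillus pleuropneumoniae]"
)
for entry in split_nr_header(example):
    print(entry)
```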

Thank you and look forward to hearing your thoughts.

Cheers,
Rachael

--

Sam Westreich

Microbiome Scientist, DNAnexus,

http://www.mosaicbiome.com

rachael...@gmail.com

Jul 23, 2018, 4:15:20 AM

to SAMSA bioinformatics group

Hi Sam, apologies for the late response.

Question 1: I'm referring to the NCBI NR protein database, linked to in Graham's original post: ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz and described in ftp://ftp.ncbi.nih.gov/blast/db/README

nr.gz* | non-redundant protein sequence database with entries
from GenPept, Swissprot, PIR, PDF, PDB, and RefSeq

I think I just searched for 'Turicella' (there's only one species), but the full scientific name was in the headers. I am happy to stick with the newest RefSeq as a higher-quality curated database compared to the NCBI NR one. My initial impression was that RefSeq only contains complete genomes, but it looks like it has what I need.

Question 2: Is this similar to the 'specific organism' results described in your pipeline?

Question 3: That sounds like a good idea, but could I also get lots of hits to those genomes from conserved proteins - which could belong to anything but map to those bugs because they're all that's in the custom database?

Question 4: Thanks!

Cheers,
Rachael

Sam Westreich

Jul 23, 2018, 6:18:24 PM

to Rachael Lappan, SAMSA bioinformatics group

Hi Rachael,

Ah, I see. I've typically used the RefSeq database, rather than the BLAST non-redundant one.

Q2: The "specific organism" feature in SAMSA2 is designed to filter functional results for a specific organism or group. Since each read receives an annotation to both an organism and a function, it's possible to filter the functional results based on organism annotation. Similarly, since the read ID is preserved, this allows for cross-referencing between different sets of results annotated against different databases.

I can expand on how to do this, if you need.
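
As a rough illustration of the cross-referencing idea, the Python sketch below joins two sets of tabular results on the read ID in the first column. It is not a SAMSA2 script: the function names are hypothetical, the read IDs and RefSeq accessions are taken from output posted later in this thread, and the function-database subject IDs (func_hit_A, func_hit_B) are invented placeholders.

```python
def first_hits(m8_lines):
    """Map each read ID (column 1) to the first subject ID (column 2) seen for it."""
    hits = {}
    for line in m8_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            hits.setdefault(fields[0], fields[1])
    return hits


def cross_reference(organism_lines, function_lines):
    """Pair organism and function hits for reads present in both result sets."""
    org = first_hits(organism_lines)
    func = first_hits(function_lines)
    return {read: (org[read], func[read]) for read in org.keys() & func.keys()}


organism_results = [
    "700666F:268:CATL9ANXX:2:2202:19558:2890\tWP_003777422.1",
    "700666F:268:CATL9ANXX:2:2202:17547:4779\tWP_004600256.1",
]
function_results = [  # subject IDs here are invented placeholders
    "700666F:268:CATL9ANXX:2:2202:19558:2890\tfunc_hit_A",
    "700666F:268:CATL9ANXX:2:2202:5956:7022\tfunc_hit_B",
]
print(cross_reference(organism_results, function_results))
```

Only the read that appears in both inputs is returned, paired with its organism-side and function-side subject IDs.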

Q3: yes, you'd get some false matches for highly conserved proteins. You could mess with the DIAMOND settings in the master_script if you wanted a higher cutoff, or choose a subset of genes you're specifically interested in finding. I'm not sure there's a perfect solution without some tinkering.
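
One alternative to changing the DIAMOND settings is to filter the existing tabular output after the fact. The Python sketch below does that, which is a different technique from adjusting the search itself. It assumes the standard 12-column tabular layout (percent identity in column 3, bit score in column 12); the function name filter_hits and the cutoff values of 90 and 50 are arbitrary choices for illustration, not recommendations. The rows are the five hits posted later in this thread.

```python
def filter_hits(m8_lines, min_identity=90.0, min_bitscore=50.0):
    """Keep tabular hits that meet both cutoffs (cutoff values are arbitrary examples)."""
    kept = []
    for line in m8_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or truncated rows
        identity = float(fields[2])   # column 3: percent identity
        bitscore = float(fields[11])  # column 12: bit score
        if identity >= min_identity and bitscore >= min_bitscore:
            kept.append(line)
    return kept


hits = [
    "700666F:268:CATL9ANXX:2:2202:19558:2890\tWP_003777422.1\t100.0\t32\t0\t0\t98\t3\t287\t318\t8.9e-13\t67.8",
    "700666F:268:CATL9ANXX:2:2202:17547:4779\tWP_004600256.1\t45.5\t33\t18\t0\t100\t2\t187\t219\t5.4e-05\t42.0",
    "700666F:268:CATL9ANXX:2:2202:5956:7022\tWP_004601419.1\t71.9\t32\t9\t0\t99\t4\t101\t132\t6.1e-09\t55.1",
    "700666F:268:CATL9ANXX:2:2202:11226:8200\tWP_003778163.1\t95.0\t20\t1\t0\t60\t1\t71\t90\t2.0e-04\t40.0",
    "700666F:268:CATL9ANXX:2:2202:11536:11478\tWP_003776212.1\t100.0\t32\t0\t0\t99\t4\t365\t396\t1.2e-12\t67.4",
]
for line in filter_hits(hits):
    print(line.split("\t")[1])
```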

Best,

Sam

rachael...@gmail.com

Jul 30, 2018, 1:00:52 AM

to SAMSA bioinformatics group

Thanks Sam.

For the specific organism results, I have gathered from your documentation that I should run this script (e.g.):

DIAMOND_specific_organism_retriever.py -I $sample -SO Alloiococcus -D $Refseqdb

then the analysis counter on the output for the functional results from these organisms:

DIAMOND_analysis_counter.py -I $output -D $Refseqdb -F

Is this correct?

Cheers,
Rachael

Sam Westreich

Jul 30, 2018, 4:54:33 PM

to Rachael Lappan, SAMSA bioinformatics group

Hi Rachael,

Yes, that's right. DIAMOND_specific_organism_retriever.py just makes a subset of the original list of results, passing on to the outfile only the entries that contain the search term.

The output is just a smaller subset version of the input, so you can use the rest of the pipeline as usual.
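
In outline, the subsetting step amounts to something like the Python sketch below. This is a simplified illustration rather than the actual script: the function names are hypothetical, it works on in-memory lists instead of files, and the second database header (WP_000000001.1) is an invented placeholder.

```python
def load_headers(fasta_lines):
    """Map each accession to its full header line from a FASTA database."""
    headers = {}
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            headers[header.split(" ", 1)[0]] = header
    return headers


def retrieve_organism(m8_lines, headers, term):
    """Keep only the hits whose database header contains the search term."""
    kept = []
    for line in m8_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 1 and term in headers.get(fields[1], ""):
            kept.append(line)
    return kept


database = [
    ">WP_003777422.1 replicative DNA helicase [Alloiococcus otitis]",
    "MSEQUENCE",  # sequence lines are ignored
    ">WP_000000001.1 hypothetical protein [Turicella otitidis]",  # invented entry
]
results = [
    "read_1\tWP_003777422.1\t100.0",
    "read_2\tWP_000000001.1\t98.0",
]
print(retrieve_organism(results, load_headers(database), "Alloiococcus"))
```

Because the kept lines are copied through unchanged, the subset has the same format as the input and can be passed to the analysis counter like any other results file.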

Best,

Sam

rachael...@gmail.com

Jul 30, 2018, 10:37:27 PM

to SAMSA bioinformatics group

Hi Sam,

At the moment this isn't working for me - it returns 0 results for Alloiococcus and Turicella, even with the results from the small database containing only proteins from those genomes (where there should be nothing but hits to those genera).

I use:

for sample in /data/samsa2/Ao_To/step_4_output/*Ao_To_annotated
do
    python ~/bin/samsa2-2.0.0/python_scripts/DIAMOND_specific_organism_retriever.py -I "$sample" -SO Alloiococcus -D /data/genomes/samsa2_databases/Ao_To_database/Ao_To_refseq.fa
    python ~/bin/samsa2-2.0.0/python_scripts/DIAMOND_specific_organism_retriever.py -I "$sample" -SO Turicella -D /data/genomes/samsa2_databases/Ao_To_database/Ao_To_refseq.fa
done

The diamond output files I am using look correct:

head -n 5 step_4_output/TSO_002_FR_RefSeq_Ao_To_annotated

700666F:268:CATL9ANXX:2:2202:19558:2890 WP_003777422.1  100.0   32      0       0       98      3       287     318     8.9e-13 67.8
700666F:268:CATL9ANXX:2:2202:17547:4779 WP_004600256.1  45.5    33      18      0       100     2       187     219     5.4e-05 42.0
700666F:268:CATL9ANXX:2:2202:5956:7022  WP_004601419.1  71.9    32      9       0       99      4       101     132     6.1e-09 55.1
700666F:268:CATL9ANXX:2:2202:11226:8200 WP_003778163.1  95.0    20      1       0       60      1       71      90      2.0e-04 40.0
700666F:268:CATL9ANXX:2:2202:11536:11478        WP_003776212.1  100.0   32      0       0       99      4       365     396     1.2e-12 67.4

And the first ID can be found in the RefSeq database as Alloiococcus:

grep 'WP_003777422.1' /data/genomes/samsa2_databases/Ao_To_database/Ao_To_refseq.fa
>WP_003777422.1 replicative DNA helicase [Alloiococcus otitis]

Yet the script returns 0 entries for both genera.

I don't speak Python very well so I'm not sure what's going wrong - does the script work in your hands? I can email you these files if necessary.

Cheers,
Rachael

Sam Westreich

Aug 1, 2018, 1:51:00 PM

to Rachael Lappan, SAMSA bioinformatics group

Hey Rachael,

Fixed! Newer version is on Github, and attached here.

It looks like the issue is that, in an older version of RefSeq, there was a double-space ("\s\s") used to separate components, and that was being used as the parser in DIAMOND_specific_organism_retriever.py. That's no longer the case, so I've switched over to single-space parsing. It's really just important for getting the ID, since the organism name is parsed out from brackets ([ ]).
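
For anyone curious why this produced zero matches, here is a small Python reconstruction of the failure mode. It is an illustration of the behaviour described above, not the script's actual code: splitting a current-style header on a double space never splits at all, so the "ID" ends up being the whole header and no longer equals the accession reported in the DIAMOND results.

```python
header = "WP_003777422.1 replicative DNA helicase [Alloiococcus otitis]"
diamond_subject_id = "WP_003777422.1"  # column 2 of the tabular results

# Old behaviour: split on a double space, which current headers do not contain.
old_id = header.split("  ")[0]
# New behaviour: split on the first single space.
new_id = header.split(" ", 1)[0]

print(old_id == diamond_subject_id)  # False: the lookup finds nothing
print(new_id == diamond_subject_id)  # True
```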

It should work now, so let me know if there are issues!

Best,

Sam

DIAMOND_specific_organism_retriever.py

rachael...@gmail.com

Aug 2, 2018, 5:28:36 AM

to SAMSA bioinformatics group

Thanks Sam,

That script is working now.

I have noticed that the analysis counter script produces some strange output. This happens with these specific-organism outputs, and I also saw it when the script ran as part of the master script on the full sample results:

Now reading through the m8 results infile.

Analysis of TSO_002_FR_RefSeq_new_annot_Alloiococcus.tsv complete.
Number of total lines: 1115
Number of unique sequences: 1115
Time elapsed: 0.005615 seconds.

Starting database analysis now.
1M lines processed so far in 16.161206 seconds.
2M lines processed so far in 31.637007 seconds.
# Don't need to see all these lines

>NP_569145.1 unnamed protein product [Miscanthus streak virus - [91]]

Miscanthus streak virus -
>NP_569146.1 unnamed protein product, partial [Miscanthus streak virus - [91]]

Miscanthus streak virus -
>NP_569147.1 unnamed protein product [Miscanthus streak virus - [91]]

Miscanthus streak virus -
>NP_569143.1 unnamed protein product [Miscanthus streak virus - [91]]

Miscanthus streak virus -
>NP_569144.1 unnamed protein product [Miscanthus streak virus - [91]]

Miscanthus streak virus -

Success!
Time elapsed: 1673.980078 seconds.
Number of lines: 90579976
Number of errors: 66639358

Dictionary database assembled.
Time elapsed: 1673.980078 seconds.
Number of errors: 66639358

Top ten function matches:
117     hypothetical protein
95      YSIRK-type signal peptide-containing protein
15      ABC transporter ATP-binding protein
14      DNA starvation/stationary phase protection protein
14      formate C-acetyltransferase
13      SDR family NAD(P)-dependent oxidoreductase
12      dihydrolipoyl dehydrogenase
12      ISLre2-like element ISAot1 family transposase
11      3-hydroxyacyl-CoA dehydrogenase
11      manganese transporter

Annotations saved to file: 'TSO_002_FR_RefSeq_new_annot_Alloiococcus_function.tsv'.
Number of errors: 0

I have two concerns about this output:

1. It throws out some headers from the database - presumably they're a problem because they have some square brackets within the square brackets. These IDs aren't in my input file and I'm not sure why it prints them.

2. It says it has found a large number of errors, but then later prints '0' errors. Should I be concerned about this?

I just want to check whether either of these affect my results or if it is just erroneous printing from the analysis counter script.

Thank you very much for your prompt help with all of my questions.

Cheers,
Rachael

Sam Westreich

Aug 8, 2018, 6:40:31 PM

to Rachael Lappan, SAMSA bioinformatics group

Hi Rachael,

For point 1: Yes, it's the multiple brackets. When the script encounters multiple brackets, it prints the line and then its guess at the organism name. I can probably mute this, as it's mainly for debugging but doesn't add much to the script usage for anyone besides myself.

For point 2: These errors are counted when an organism name doesn't split as expected into two names (genus and species). The organism name is still added to the reference database, so this shouldn't pose any downstream problems. When filtering results down to a specific organism, for example, the entire database line is searched, so parsing problems shouldn't reduce the results.
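
For readers who want to handle the nested-bracket headers themselves, the Python sketch below shows one possible approach: walk backwards from the end of the header and match brackets by depth, so that the outermost trailing group is taken as the organism. Both function names are hypothetical, this is not how DIAMOND_analysis_counter.py does it, and the genus/species check is only one interpretation of the "error" count described above.

```python
def outer_bracket_organism(header):
    """Return the text of the outermost trailing [...] group, or None."""
    if not header.endswith("]"):
        return None
    depth = 0
    # Scan right to left, tracking how many brackets are still open.
    for i in range(len(header) - 1, -1, -1):
        if header[i] == "]":
            depth += 1
        elif header[i] == "[":
            depth -= 1
            if depth == 0:
                return header[i + 1:-1]
    return None  # unbalanced brackets


def is_genus_species(name):
    """True when the name splits into exactly two words."""
    return len(name.split()) == 2


for header in (
    "WP_003777422.1 replicative DNA helicase [Alloiococcus otitis]",
    "NP_569145.1 unnamed protein product [Miscanthus streak virus - [91]]",
):
    organism = outer_bracket_organism(header)
    print(organism, is_genus_species(organism))
```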

Best,

Sam
