--
You received this message because you are subscribed to the Google Groups "SAMSA bioinformatics group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to samsa-bioinformatics-group+unsub...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/samsa-bioinformatics-group/d97248ab-6824-486d-b3ea-05693c51641d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Hi Graham,

If you're working with the SAMSA pipeline (https://github.com/transcript/samsa), all of the databases that are used have been externally selected by MG-RAST. I'm not an MG-RAST developer, so SAMSA is limited to the databases currently available. MG-RAST does offer both RefSeq and GenBank as separate databases, and this pipeline works with both options - but it doesn't have the NCBI NR database, I believe.

If you're interested in using SAMSA2, which is standalone and can incorporate custom databases (but requires a server instance to run effectively), it should be able to use the NR database. There may need to be some code modification if the NR database uses a different sequence labeling approach than RefSeq.

Best,
Sam Westreich
On Mon, Apr 9, 2018 at 10:28 AM, Graham Colby
Hi,

When working with environmental metacommunities, RefSeq often fails to capture the diversity of sequences that may be present in samples, as it includes only organisms with complete sequences. Is there a quick way to switch the annotation from the RefSeq database to the NR protein database (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz), which includes both RefSeq and GenBank?

Is this something you'd consider implementing? Otherwise, I am interested in modifying the scripts so that I can use the NR database. Do I only need to modify the master script?

Thanks,
Graham
Thanks for the quick response.

I am interested in using SAMSA2 on a server. Can you expand on what you mean by "sequence labeling approach" - is this just about the accession number format? I know RefSeq and GenBank use different accessions.
Graham
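On the accession-format question, here is a rough sketch of how the two styles could be told apart. It only relies on the fact that RefSeq accessions carry a two-letter prefix plus an underscore (WP_, NP_, YP_ and so on); everything else is lumped together as "other". The function name `accession_source` is made up for illustration and is not part of SAMSA2.

```python
import re

# RefSeq accessions: two uppercase letters, an underscore, digits, optional version.
REFSEQ_RE = re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$")


def accession_source(accession):
    """Return 'refseq' for RefSeq-style accessions, otherwise 'other'.

    Hypothetical helper for illustration; not a SAMSA2 function.
    """
    return "refseq" if REFSEQ_RE.match(accession) else "other"


# Accessions quoted elsewhere in this thread.
for acc in ["WP_003777422.1", "NP_569145.1", "AAB05030.1", "Q57293.1"]:
    print(acc, accession_source(acc))
```

A check like this could be run over a sample of database headers before deciding whether any script changes are needed.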
--
Sam Westreich
DEB Biotechnology Fellow
Integrative Genetics and Genomics Graduate Group, University of California, Davis
College of Biological Sciences, University of Minnesota
Are you doing what you want to be doing?
Hi Sam,
I am also considering giving the NCBI NR protein database a shot. My samples are human middle ear fluids (from middle ear infections) that contain only a few species, but two of these - Alloiococcus otitis and Turicella otitidis - have genome assemblies only in the 'scaffold', not 'complete' stage in RefSeq. My thoughts were that the NCBI NR database might contain more information on these species of interest. I have a few questions regarding this:
1. Did you determine how to modify the script/headers to make SAMSA2 compatible with the NCBI NR database? I've had a look at the file (nr.gz) and it seems that there are multiple annotations per header. Some of these are very long; apparently, where sequences are identical, the headers are collapsed. From the README:
"To be merged two sequences must have identical lengths and every residue at every position must be the same. The FASTA deflines for the different entries that belong to one record are separated by control-A characters invisible to most programs. In the example below both entries Q57293.1 and AAB05030.1 have the same sequence, in every respect:
>Q57293.1 RecName: Full=Fe(3+) ions import ATP-binding protein FbpC ^AAAB05030.1 afuC
[Actinobacillus pleuropneumoniae] ^AAAB17216.1 afuC [Actinobacillus pleuropneumoniae]
MNNDFLVLKNITKSFGKATVIDNLDLVIKRGTMVTLLGPSGCGKTTVLRLVAGLENPTSGQIFIDGEDVTKSSIQNRDIC
IVFQSYALFPHMSIGDNVGYGLRMQGVSNEERKQRVKEALELVDLAGFADRFVDQISGGQQQRVALARALVLKPKVLILD
EPLSNLDANLRRSMREKIRELQQRLGITSLYVTHDQTEAFAVSDEVIVMNKGTIMQKARQKIFIYDRILYSLRNFMGEST
ICDGNLNQGTVSIGDYRFPLHNAADFSVADGACLVGVRPEAIRLTATGETSQRCQIKSAVYMGNHWEIVANWNGKDVLIN
ANPDQFDPDATKAFIHFTEQGIFLLNKE
"
How could this be handled? It doesn't seem feasible because when I search for Turicella (zcat nr.gz | grep 'Turicella' | head | less) I can see that some proteins' first annotation is [Bacteria] followed by several species of different genera. While it seems that the NR database contains a lot more information about Turicella (it appears in 8405 entries in the RefSeq database and 15566 in NR) I wonder if this is because it appears in the headers for more conserved proteins (along with several other species), compared to RefSeq where each entry has a single taxonomic annotation. Perhaps RefSeq is okay for my purposes? At least they are in the database! My current thoughts are to keep things simple and stick with RefSeq but make sure I have the latest release.
2. Is the SEED subsystems database still relevant for organisms like this - or was it built from the proteins in 'complete' genomes? Am I likely to miss a large chunk of functions if they are from less well-characterised genomes?
3. Similarly, could the results end up biased by the coverage of the organism in the databases? For example, Streptococcus pneumoniae is also in my samples and there are many more annotations to this than to my two species of interest because it's a well-characterised organism. Could the results show a larger proportion of transcription attributed to S. pneumoniae just because it's covered more thoroughly in the database? This isn't an issue specific to SAMSA2 but just wondering if I need to take this into account when interpreting my results.
4. Can you confirm what the columns are in the Step 5 RefSeq output results: is this the percentage of all mapped reads, then the number of mapped reads?
13.468642143 16663 Streptococcus pneumoniae
9.17658850441 11353 Staphylococcus aureus
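For question 1, a minimal sketch of how the collapsed NR deflines could be pulled apart, assuming the control-A separator described in the README excerpt. It splits one header into its individual entries and takes the trailing [bracketed] text as the organism. The function name `split_nr_defline` is hypothetical and this is not how SAMSA2 currently reads headers, as far as this thread shows.

```python
import re

CTRL_A = "\x01"  # control-A, shown as ^A in the README example


def split_nr_defline(defline):
    """Split a merged NR defline into (accession, description, organism) tuples.

    Hypothetical helper; handles only a simple trailing [organism] tag.
    """
    entries = []
    for part in defline.lstrip(">").split(CTRL_A):
        part = part.strip()
        if not part:
            continue
        accession, _, rest = part.partition(" ")
        # Organism, when present, is the last [bracketed] text (no nesting).
        match = re.search(r"\[([^\[\]]+)\]\s*$", rest)
        organism = match.group(1) if match else None
        description = rest[:match.start()].strip() if match else rest.strip()
        entries.append((accession, description, organism))
    return entries


# The README example, with ^A written as an explicit control character.
header = (
    ">Q57293.1 RecName: Full=Fe(3+) ions import ATP-binding protein FbpC "
    "\x01AAB05030.1 afuC [Actinobacillus pleuropneumoniae] "
    "\x01AAB17216.1 afuC [Actinobacillus pleuropneumoniae]"
)

for accession, description, organism in split_nr_defline(header):
    print(accession, "|", description, "|", organism)
```

With the entries separated like this, a header shared by several genera could be counted once per organism, or skipped when the organisms disagree; which of those is appropriate depends on the analysis.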
Thank you and look forward to hearing your thoughts.
Cheers,
Rachael
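For question 4, one quick sanity check is to see whether the two numeric columns agree with each other. The sketch below assumes column 1 is a percentage and column 2 is a read count, then works out the total each row would imply. If the rows agree, the reading is at least self-consistent, though it doesn't say what the total represents (all mapped reads or something else). The function name `implied_totals` is made up for illustration.

```python
# The two rows quoted in the message above.
rows = """13.468642143 16663 Streptococcus pneumoniae
9.17658850441 11353 Staphylococcus aureus"""


def implied_totals(text):
    """For each row, return (name, count * 100 / percent).

    Assumes: column 1 = percentage, column 2 = count, remainder = organism name.
    """
    results = []
    for line in text.splitlines():
        percent, count, name = line.split(None, 2)
        results.append((name, int(count) * 100 / float(percent)))
    return results


for name, total in implied_totals(rows):
    print(f"{name}: implied total = {total:.1f}")
```

Both rows point to the same total of roughly 123,717, which fits the "percentage, then count" interpretation.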
DIAMOND_specific_organism_retriever.py -I $sample -SO Alloiococcus -D $Refseqdb
DIAMOND_analysis_counter.py -I $output -D $Refseqdb -F
Hi Sam,
At the moment this isn't working for me - it returns 0 results for Alloiococcus and Turicella, even with the results from the small database containing only proteins from those genomes (where there should be nothing but hits to those genera).
I use:
for sample in /data/samsa2/Ao_To/step_4_output/*Ao_To_annotated
do
    python ~/bin/samsa2-2.0.0/python_scripts/DIAMOND_specific_organism_retriever.py -I "$sample" -SO Alloiococcus -D /data/genomes/samsa2_databases/Ao_To_database/Ao_To_refseq.fa
    python ~/bin/samsa2-2.0.0/python_scripts/DIAMOND_specific_organism_retriever.py -I "$sample" -SO Turicella -D /data/genomes/samsa2_databases/Ao_To_database/Ao_To_refseq.fa
done
head -n 5 step_4_output/TSO_002_FR_RefSeq_Ao_To_annotated
700666F:268:CATL9ANXX:2:2202:19558:2890 WP_003777422.1 100.0 32 0 0 98 3 287 318 8.9e-13 67.8
700666F:268:CATL9ANXX:2:2202:17547:4779 WP_004600256.1 45.5 33 18 0 100 2 187 219 5.4e-05 42.0
700666F:268:CATL9ANXX:2:2202:5956:7022 WP_004601419.1 71.9 32 9 0 99 4 101 132 6.1e-09 55.1
700666F:268:CATL9ANXX:2:2202:11226:8200 WP_003778163.1 95.0 20 1 0 60 1 71 90 2.0e-04 40.0
700666F:268:CATL9ANXX:2:2202:11536:11478 WP_003776212.1 100.0 32 0 0 99 4 365 396 1.2e-12 67.4
grep 'WP_003777422.1' /data/genomes/samsa2_databases/Ao_To_database/Ao_To_refseq.fa
>WP_003777422.1 replicative DNA helicase [Alloiococcus otitis]
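As an independent cross-check on the zero-hit result, the lookup can be reproduced by hand. This is a minimal sketch, not the SAMSA2 retriever itself: it assumes the annotated file is in DIAMOND's default 12-column tabular (m8) layout, maps each subject accession to the organism in its FASTA header, and keeps the hits for one genus. The function names are hypothetical.

```python
import re

# Standard BLAST/DIAMOND tabular columns.
M8_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
              "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def organisms_from_fasta(lines):
    """Map accession -> organism from the trailing [bracketed] header text."""
    lookup = {}
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        parts = line[1:].split(None, 1)
        if not parts:
            continue
        match = re.search(r"\[([^\[\]]+)\]\s*$", line)
        lookup[parts[0]] = match.group(1) if match else None
    return lookup


def hits_for_genus(m8_lines, lookup, genus):
    """Yield (hit, organism) for hits whose subject organism starts with genus."""
    for line in m8_lines:
        fields = line.split()
        if len(fields) != len(M8_COLUMNS):
            continue  # skip malformed rows
        hit = dict(zip(M8_COLUMNS, fields))
        organism = lookup.get(hit["sseqid"]) or ""
        if organism.startswith(genus):
            yield hit, organism


# One header and one hit copied from the message above.
fasta = [">WP_003777422.1 replicative DNA helicase [Alloiococcus otitis]"]
m8 = ["700666F:268:CATL9ANXX:2:2202:19558:2890 WP_003777422.1 100.0 32 0 0 98 3 287 318 8.9e-13 67.8"]

for hit, organism in hits_for_genus(m8, organisms_from_fasta(fasta), "Alloiococcus"):
    print(hit["qseqid"], hit["sseqid"], organism)
```

If a check like this finds hits on the real files while the pipeline script reports none, the problem is more likely in how the script reads the database or its arguments than in the DIAMOND results.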
Now reading through the m8 results infile.
Analysis of TSO_002_FR_RefSeq_new_annot_Alloiococcus.tsv complete.
Number of total lines: 1115
Number of unique sequences: 1115
Time elapsed: 0.005615 seconds.
Starting database analysis now.
1M lines processed so far in 16.161206 seconds.
2M lines processed so far in 31.637007 seconds.
# Don't need to see all these lines
>NP_569145.1 unnamed protein product [Miscanthus streak virus - [91]]
Miscanthus streak virus -
>NP_569146.1 unnamed protein product, partial [Miscanthus streak virus - [91]]
Miscanthus streak virus -
>NP_569147.1 unnamed protein product [Miscanthus streak virus - [91]]
Miscanthus streak virus -
>NP_569143.1 unnamed protein product [Miscanthus streak virus - [91]]
Miscanthus streak virus -
>NP_569144.1 unnamed protein product [Miscanthus streak virus - [91]]
Miscanthus streak virus -
Success!
Time elapsed: 1673.980078 seconds.
Number of lines: 90579976
Number of errors: 66639358
Dictionary database assembled.
Time elapsed: 1673.980078 seconds.
Number of errors: 66639358
Top ten function matches:
117 hypothetical protein
95 YSIRK-type signal peptide-containing protein
15 ABC transporter ATP-binding protein
14 DNA starvation/stationary phase protection protein
14 formate C-acetyltransferase
13 SDR family NAD(P)-dependent oxidoreductase
12 dihydrolipoyl dehydrogenase
12 ISLre2-like element ISAot1 family transposase
11 3-hydroxyacyl-CoA dehydrogenase
11 manganese transporter
Annotations saved to file: 'TSO_002_FR_RefSeq_new_annot_Alloiococcus_function.tsv'.
Number of errors: 0
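The headers printed in the log above have nested brackets ("[Miscanthus streak virus - [91]]") and are followed by "Miscanthus streak virus -", which suggests the organism name is being cut at the inner bracket. How the script actually parses headers isn't shown in this thread, so the following is only a sketch of one way to take the whole outer bracket group by tracking nesting depth. The function name `trailing_bracket_group` is hypothetical.

```python
def trailing_bracket_group(header):
    """Return the text inside the last top-level [...] group, or None.

    Hypothetical helper; walks backwards so nested brackets stay intact.
    """
    header = header.rstrip()
    if not header.endswith("]"):
        return None
    depth = 0
    for index in range(len(header) - 1, -1, -1):
        char = header[index]
        if char == "]":
            depth += 1
        elif char == "[":
            depth -= 1
            if depth == 0:
                # Found the "[" that matches the final "]".
                return header[index + 1:-1]
    return None  # unbalanced brackets


header = ">NP_569145.1 unnamed protein product [Miscanthus streak virus - [91]]"
print(trailing_bracket_group(header))
```

For ordinary headers with a single bracket pair this returns the same organism string as a simple split would.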