Re: Best practices for preparing a dataset for oligotyping?

1,445 views
Skip to first unread message

Swathi Turlapati

unread,
Feb 21, 2014, 3:57:13 PM2/21/14
to oligo...@googlegroups.com
Hi Murat,

After alignment with PyNast and after removing uninformative gaps with o-trim.py I checked the length of my sequences and they are all  between 1..437 to 1..446.
I have tried running the o-pad-with-gaps script in the scripts folder to remove length variation and I wanted to set upto 1..427.
How do I change the script in order to get  sequences of length 427 ?

Thank you,
Swathi Turlapati

A. Murat Eren

unread,
Feb 21, 2014, 4:15:27 PM2/21/14
to oligo...@googlegroups.com
Hi Swathi,

Once you remove common gaps, your sequences must have equal number of characters. So you don't need to add gaps (therefore o-pad-with-gaps would not be useful).

Say, your reads look like this:

>073-PS_1
TA-AT--AC---GG-AG-GGT-GCA-A-G-CG-TTAA-T-CGG-AA-TT-A--C-T--GGGC-GTA----AA-GCGC-AC--G-CA-G-G----
>073-PS_2
TA-AT--AC---GG-AG-GGT-GCA-A-G-CG-TTAA-T-CGG-AA-TT-A--C-T--GGGC-GTA----AA-GCGC-AC--G-C---------
>073-PS_3
TA-AT--AC---GG-AG-GGT-GCA-A-G-CG-TTAA-T-CGG-AA-TT-A--C-T--GGGC-GTA----AA-GCGC-AC--G-CA-G-G-C--
>073-PS_4
TA-AT--AC---GG-AG-GGT-GCA-A-G-CG-TTAA-T-CGG-AA-TT-A--C-T--GGGC-GTA----AA-GCGC-AC--G-CA--------
>073-PS_5
TA-AT--AC---GG-AG-GGT-GCA-A-G-CG-TTAA-T-CGG-AA-TT-A--C-T--GGGC-GTA----AA-GCGC-AC--------------
>073-PS_6
TA-AT--AC---GG-AG-GGT-GCA-A-G-CG-TTAA-G-CGG-AA-TT-A--C-T--GGGC-GTA----AA-GCGC-AC--G-CA-G-G----
>073-PS_7
TA-AT--AC---GG-AG-GGT-GCA-A-G-CG-TTAA-G-CGG-AA-TT-A--C-T--GGGC-GTA----AA-GCGC-AC--G-CA-G-G-C--
>073-PS_8
TA-AT--AC---GG-AG-GGT-GCA-A-G-CG-TTAA-G-CGG-AA-TT-A--C-T--GGGC-GTA----AA-GCGC-AC--G-CA-G-G-C-G
>073-PS_9
TA-AT--AC---GG-AG-GGT-GCA-A-G-CG-TTAA-G-CGG-AA-TT-A--C-T--GGGC-GTA----AA-GCGC-AC--------------
>073-PS_11
TA-AT--AC---GG-AG-GGT-GCA-A-G-CG-TTAA-G-CGG-AA-TT-A--C-T--GGGC-GTA----AA-GCGC-AC--G-CA-G------

In this case the entropy analysis would yield something like this:


As you can see only the position at 39 is an actual variation, the ones at the end are artificial ones due to the length variation.

So in this case you would like to trim your reads further, say 80th position. Which is similar to what want to do I assume.

To do that you can use "o-trim" script (instead of the one that trims uninformative columns) like this:

o-trim YOUR_FASTA 0 80

Which would create a file called YOUR_FASTA-TRIMMED with this content:

>073-PS_1
TA-AT--AC---GG-AG-GGT-GCA-A-G-CG-TTAA-T-CGG-AA-TT-A--C-T--GGGC-GTA----AA-GCGC-AC
>073-PS_2
TA-AT--AC---GG-AG-GGT-GCA-A-G-CG-TTAA-T-CGG-AA-TT-A--C-T--GGGC-GTA----AA-GCGC-AC
>073-PS_3
TA-AT--AC---GG-AG-GGT-GCA-A-G-CG-TTAA-T-CGG-AA-TT-A--C-T--GGGC-GTA----AA-GCGC-AC
>073-PS_4
TA-AT--AC---GG-AG-GGT-GCA-A-G-CG-TTAA-T-CGG-AA-TT-A--C-T--GGGC-GTA----AA-GCGC-AC
>073-PS_5
TA-AT--AC---GG-AG-GGT-GCA-A-G-CG-TTAA-T-CGG-AA-TT-A--C-T--GGGC-GTA----AA-GCGC-AC
>073-PS_6
TA-AT--AC---GG-AG-GGT-GCA-A-G-CG-TTAA-G-CGG-AA-TT-A--C-T--GGGC-GTA----AA-GCGC-AC
>073-PS_7
TA-AT--AC---GG-AG-GGT-GCA-A-G-CG-TTAA-G-CGG-AA-TT-A--C-T--GGGC-GTA----AA-GCGC-AC
>073-PS_8
TA-AT--AC---GG-AG-GGT-GCA-A-G-CG-TTAA-G-CGG-AA-TT-A--C-T--GGGC-GTA----AA-GCGC-AC
>073-PS_9
TA-AT--AC---GG-AG-GGT-GCA-A-G-CG-TTAA-G-CGG-AA-TT-A--C-T--GGGC-GTA----AA-GCGC-AC
>073-PS_11
TA-AT--AC---GG-AG-GGT-GCA-A-G-CG-TTAA-G-CGG-AA-TT-A--C-T--GGGC-GTA----AA-GCGC-AC


I hope this helps.

Swathi Turlapati

unread,
Feb 21, 2014, 4:45:55 PM2/21/14
to oligo...@googlegroups.com
This worked well. I got the sequences of same length.
Thank you,
Swathi

Bo Adu-Oppong

unread,
Feb 21, 2014, 11:57:19 PM2/21/14
to oligo...@googlegroups.com
Hello Meren,

I have a huge dataset and I was wondering if there was a script that will allow me to look at my seq assignments and then search through my reads to create a new file with the genus of interest. Thank you.

A. Murat Eren

unread,
Feb 22, 2014, 4:55:31 AM2/22/14
to oligo...@googlegroups.com
Hi Bo,

If you send me an example (maybe a small subset of your FASTA file and your sequence assignments) I can write a script for you.


Best,

Bo Adu-Oppong

unread,
Feb 22, 2014, 8:06:05 AM2/22/14
to oligo...@googlegroups.com
Thank you very much!!! I have attached the assignments that I was able to get for all of my sequences (assignments.txt). And then the other file are my sequences (seqs_sorted.fasta).
assignments.txt
seqs_sorted.fasta

A. Murat Eren

unread,
Feb 22, 2014, 1:12:37 PM2/22/14
to oligo...@googlegroups.com
Hi Bo,

Thanks for the files, I took a quick look. I am traveling at the moment, but I will try post something as soon as possible. Sorry in advance for the delay.


Best,

A. Murat Eren

unread,
Feb 22, 2014, 2:36:43 PM2/22/14
to
Hi Bo,

I am looking at your files. These seem to have a frequency information.

Now, I have a critical question: Are these reads uniqued per sample (and the frequency indicates how many times that individual read was observed in the sample), or are they representative sequences of OTU's (and the frequency indicates the number of reads that OTU had)?

Because if its the second one, we have a problem :)

Thanks,


Update:

Here is the script:


When I run it on your files like this:

python forBo.py seqs_sorted.fasta assignments.txt Gammaproteobacteria

It generated a file with only reads with "Gammaproteobacteria" somewhere in their annotation.

You also need to "ununique" these reads. The oligotyping pipeline does not work with uniqued files. If you want to store them un-uniqued, you can use this script instead:


However, as I said before, if these are OTU representatives, this is not going to work. You need to annotate each truly unique read in your dataset separately. I hope I am being paranoid, and you already know this :)

I hope this helps.


Best,

Bo Adu-Oppong

unread,
Feb 22, 2014, 3:04:12 PM2/22/14
to oligo...@googlegroups.com
Thank you so much Murat!! But yes the dataset I gave you are the actual sequences after they have been debarcoded and then it is compilation of our reverse and forward reads. Each sequence should be a unique sequence and we just have the frequency of which that read was seen in the sample. The file is everything before picking representative OTUs. I will try this out on my complete dataset. You are such a lifesaver.

Bo Adu-Oppong

unread,
Feb 27, 2014, 12:46:25 PM2/27/14
to oligo...@googlegroups.com
Hello Murat,

I have another question. When I assign my individual sequences using RDP I am not able to call with 80% confidence the genus for the sequences. What would you suggest doing if we are studying understudied bacterial communities? Thank you.

Bo

A. Murat Eren

unread,
Feb 27, 2014, 1:10:06 PM2/27/14
to oligo...@googlegroups.com
Hi,

I think in that case focusing on reads from a higher taxonomic order may be useful (for instance goring one level up to family, instead of focusing on noisy genus level groupings individually).

Although it requires much more attention, it is possible to oligotype even an entire phylum.


Best,

Swathi Turlapati

unread,
Mar 6, 2014, 11:08:36 AM3/6/14
to oligo...@googlegroups.com
Hi Meren,

When I tried running the regular o-trim script I got this error message. In this case I do not intend to set up any boundaries for my aligned sequences but I guess it's not recognizing the original script because previously I ran o-trim script with set boundaries. What should I do?

Traceback (most recent call last):

File "o-trim", line 21, in <module>

trim_from = int(sys.argv[2])

IndexError: list index out of range

Swathi


A. Murat Eren

unread,
Mar 6, 2014, 11:18:41 AM3/6/14
to oligo...@googlegroups.com
Hi Swathi,

Sorry for the poor coding and user interface with that script. I will fix it at some point, but this is how it expected to be called:

o-trim FASTA_FILE 0 200


The one above would trim the reads in FASTA_FILE starting from the 0th position up to to 200th. Or you can call it like this:

o-trim FASTA_FILE 20



With this, it would trim the first 20 nts.

I hope this helps. If you can tell me what is your purpose, I may offer more direct solutions.


Best,

Swathi Turlapati

unread,
Mar 6, 2014, 11:38:43 AM3/6/14
to oligo...@googlegroups.com
Thank You Murat for your prompt response I was just  trying to run the regular script to remove uninformative gaps in the alignment and now the o-trim-uninformative-columns-from-alignment seems to work well. Thank you for the help.
Swathi

Joanito Liberti

unread,
Aug 15, 2014, 6:23:08 PM8/15/14
to oligo...@googlegroups.com
Dear Murat,

I've followed all your suggestions to prepare the fasta file for oligotyping but I have a doubt. After I run o-trim-uninformative-columns-from-alignment I see that my sequences still do not end in the same position, so when I run the entropy analysis I have many entropy peaks at the end of the alignment. You say that all the sequences should have the same length, so what do you suggest me to do? Should I keep them like this and move on with the analysis or should I do something to trim them all to the same length? 
It's a 454 generated dataset and I've now pulled out the genus of interest from the dataset. I see that most of the sequences have between 406 and 430 bp. But then there are just a few that are 300 something bp. Should I maybe trim everything down to 406 and remove sequences shorter than this? If so, how can I do this without going through every sequence, since I have 60,000 sequences in this dataset?

Sorry for many questions!

Thanks a lot for your help!

Best,
Joanito

A. Murat Eren

unread,
Aug 15, 2014, 7:34:11 PM8/15/14
to oligo...@googlegroups.com
Dear Joanito,

Can you please start a new discussion topic for this post with a new title? Maybe called "Alignment issues"?

I want to write an extensive response, but I don't want it to be buried down under a long discussion.


Thanks,

Joanito Liberti

unread,
Aug 15, 2014, 8:12:40 PM8/15/14
to oligo...@googlegroups.com
Dear Murat,

thanks a lot! I'll copy the message in a new post!

Joanito

A. Murat Eren

unread,
Mar 15, 2016, 4:27:34 PM3/15/16
to oligo...@googlegroups.com
I decided to start a discussion topic, since the title is quite a common question.

This is my recipe to analyze a dataset with the oligotyping pipeline.

  • Get all reads for all samples in a project into one FASTA file, and format the deflines to meet the standard explained here.
  • Use RDP or GAST to assign taxonomy to each read individually. If you are using QIIME and exporting your FASTA files for a given taxon via QIIME, be very careful: QIIME assigns taxonomy to representative sequences of OTUs, but you definitely don't want that, because generally OTUs are phylogenetically mixed units. Anyone who wants do a robust oligotyping analysis should assign taxonomy to their reads individually before selecting a group of reads.  
  • Filter reads that are identified as the genus of interest and put them into one file.
  • If you are working with 454 or Use PyNAST to align these reads against GreenGenes template. If you don't want to spend too much time, extract templates from the GreenGenes alignment for a given genus or family using o-create-GG-alignment-template-from-taxon. If you are working with Illumina HiSeq or MiSeq reads, skip the alignment step, but run o-pad-with-gaps on your FASTA file to accommodate for any length variation.
  • Remove common gaps from the resulting alignment file using o-trim-uninformative-columns-from-alignment. Don't forget to check the number of reads in the failed alignments file to make sure there aren't too many reads failed the alignment step.
  • Run entropy-analysis and start oligotyping.
  • Repeat the last three steps for each genera you wish to oligotype from the dataset.


Meren.


Note: Please start new threads for your questions with descriptive titles. Thanks!

eline Suzanne klaassens

unread,
Nov 9, 2015, 5:15:19 PM11/9/15
to Oligotyping and MED
Hi, The link to the explained standards doesn't work (first bullet point). Can you give me the correct link?


On Saturday, 22 February 2014 20:59:35 UTC+11, A. Murat Eren wrote:
I decided to start a discussion topic, since the title is quite a common question.

This is my recipe to analyze a dataset with the oligotyping pipeline.
  • Get all reads for all samples in a project into one FASTA file, and format the deflines to meet the standard explained here.
  • Use RDP or GAST to assign taxonomy to each read individually. If you are using QIIME and exporting your FASTA files for a given taxon via QIIME, be very careful: QIIME assigns taxonomy to representative sequences of OTUs, but you definitely don't want that, because generally OTUs are phylogenetically mixed units. Anyone who wants do a robust oligotyping analysis should assign taxonomy to their reads individually before selecting a group of reads.  
  • Filter reads that are identified as the genus of interest and put them into one file.
  • If you are working with 454 or Use PyNAST to align these reads against GreenGenes template. If you don't want to spend too much time, extract templates from the GreenGenes alignment for a given genus or family using this script (it is under the Scripts directory in your copy of the codebase). If you are working with Illumina HiSeq or MiSeq reads, skip the alignment step, but run this script on your FASTA file to accommodate for any length variation.
  • Remove common gaps from the resulting alignment file using this script (it is also under the Scripts directory in your copy of the codebase). Don't forget to check the number of reads in the failed alignments file to make sure there aren't too many reads failed the alignment step.
  • Run the entropy-analysis and start oligotyping.
  • Repeat the last three steps for each genera you wish to oligotype from the dataset.

A. Murat Eren

unread,
Nov 9, 2015, 5:17:40 PM11/9/15
to Oligotyping and MED
Hi,

This is the correct link:


I will also fix it in the post.


Best,

--

A. Murat Eren (meren)
http://merenlab.org :: gpg

--
You received this message because you are subscribed to the Google Groups "Oligotyping and MED" group.
To unsubscribe from this group and stop receiving emails from it, send an email to oligotyping...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/oligotyping/2f5ba5bf-1abc-4d20-8269-abf49abda947%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

eline Suzanne klaassens

unread,
Nov 9, 2015, 6:11:57 PM11/9/15
to Oligotyping and MED
Thanks! Actually all the links don't work. Thanks for fixing
Reply all
Reply to author
Forward
0 new messages