PacBio sequences - Barcode trimming and degenerate primers


AnnaC

Mar 3, 2016, 1:44:13 PM
to Qiime 1 Forum

Hi everybody,


I am working with full-length 16S reads obtained through PacBio sequencing. The output consists of long sequences (~1,500 bp) that can be in either orientation and that contain a unique combination of two barcodes (one at each end of the sequence).


I would like to know how I could extract the barcodes from both ends (I suppose using extract_barcodes.py) and how I should write the mapping file containing this information.

I am not sure which options of extract_barcodes.py I should use. I was thinking of something like this:


extract_barcodes.py -f fastq/Chin.fastq -c barcode_paired_stitched --bc1_len 16 --bc2_len 16 --attempt_read_reorientation -m maps/map-chin.txt


However, I would like some advice on how to format the mapping file so that it incorporates the information for both barcodes.


Regarding the primers: I am using some degenerate primers. Is there any option to deal with those?


Many thanks in advance!!!


Anna

TonyWalters

Mar 3, 2016, 2:03:11 PM
to Qiime 1 Forum
Hello Anna,

It looks like you are on the right track for your command. For the mapping file, you want to follow the standard metadata mapping file format (http://qiime.org/documentation/file_formats.html#metadata-mapping-files), and add a column, after the LinkerPrimerSequence but before the Description column, that has ReversePrimer as its header. For both the LinkerPrimerSequence and ReversePrimer sequence data fields, you can put in degenerate primers. Make sure that the primers are listed in 5'->3' orientation. If you have a pool of different primers, you can comma separate them (no spaces).
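
As a rough illustration of the column layout described above, here is a minimal Python sketch that builds a tab-separated mapping file with a ReversePrimer column between LinkerPrimerSequence and Description. The sample IDs, barcodes, primer sequences, and the choice to store the two 16 bp barcodes concatenated in BarcodeSequence are all assumptions made for the example; check them against your own data and the file format documentation.

```python
# Column order: ReversePrimer goes after LinkerPrimerSequence
# and before Description.
HEADER = ["#SampleID", "BarcodeSequence", "LinkerPrimerSequence",
          "ReversePrimer", "Description"]

# Hypothetical samples. IDs, barcodes, and primers are placeholders.
# Assumption: the two 16 bp barcodes are stored concatenated.
SAMPLES = [
    ("Sample1", "ACGTACGTACGTACGT" + "TGCATGCATGCATGCA",
     "AGRGTTYGATYMTGGCTCAG", "RGYTACCTTGTTACGACTT", "example_sample_1"),
    # A pool of primers is comma separated, with no spaces.
    ("Sample2", "GGTTCCAAGGTTCCAA" + "CATGCATGCATGCATG",
     "AGRGTTYGATYMTGGCTCAG,AGAGTTTGATCCTGGCTCAG",
     "RGYTACCTTGTTACGACTT", "example_sample_2"),
]


def build_mapping(samples):
    """Return the mapping file contents as tab-separated text."""
    lines = ["\t".join(HEADER)]
    lines.extend("\t".join(row) for row in samples)
    return "\n".join(lines) + "\n"


mapping_text = build_mapping(SAMPLES)
print(mapping_text)
```

It is worth running validate_mapping_file.py on the resulting file before using it.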

Just to confirm how the reads should end up, a given read should be one of these two options:
16BPbarcode1-ForwardPrimer-Amplicon-ReversePrimer-16BPbarcode2
OR
16BPbarcode2-ReversePrimer-Amplicon-ForwardPrimer-16BPbarcode1

If that is correct, you should be able to create the mapping file and run the extract_barcodes.py command as you listed above. You can run into issues if there are variable bases before the barcodes: each read should start and end with the barcodes, rather than having them some number of bases in.
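
To make the expected read layout concrete, here is a small Python sketch that slices a fixed-length barcode off each end of a read. It is only an illustration of the idea, not how extract_barcodes.py is implemented; the function name and the toy read are made up for the example.

```python
def split_stitched_read(seq, bc1_len=16, bc2_len=16):
    """Split a read into (barcode1, insert, barcode2).

    Assumes the read starts with barcode 1 and ends with barcode 2,
    with no extra bases outside the barcodes.
    """
    if len(seq) < bc1_len + bc2_len:
        raise ValueError("read is shorter than the two barcodes combined")
    end = len(seq) - bc2_len
    return seq[:bc1_len], seq[bc1_len:end], seq[end:]


# Toy read: 16 bp barcode, short insert, 16 bp barcode.
toy_read = "ACGTACGTACGTACGT" + "TTTTGGGGCCCC" + "TGCATGCATGCATGCA"
bc1, insert, bc2 = split_stitched_read(toy_read)
print(bc1, insert, bc2)
```

If a read has one or two variable bases before the first barcode, the slices shift and neither barcode will match the mapping file, which is the issue mentioned above.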

-Tony

AnnaC

Mar 14, 2016, 2:41:14 PM
to Qiime 1 Forum
Many thanks Tony!

I just found out that the barcodes are already trimmed. However, in some sequences there are one or two extra bases before the primer, and my reads are in both orientations...
How would you suggest dealing with that?

Thanks again!!!!

Anna


TonyWalters

Mar 14, 2016, 2:45:41 PM
to Qiime 1 Forum
Hello Anna,

Do you have a separate fastq (or fasta?) file for each of your samples?

For removing the primers + random bases before/after them, there isn't a script in QIIME, but there are some custom scripts here:
1. For a fastq file https://gist.github.com/walterst/2c592044b3b9e44a4290
2. For a fasta file https://gist.github.com/walterst/ab88ae59a8900a2fa2da

AnnaC

Mar 14, 2016, 6:50:20 PM
to Qiime 1 Forum
I have separate fastq files, but only a single fasta file... I will try these custom scripts, many thanks!!!

Just one more thing, how could the reverse sequences be reoriented?

TonyWalters

Mar 14, 2016, 7:02:09 PM
to Qiime 1 Forum
The script looks for the forward and reverse primers in different orientations to determine the direction of the read. That's why short primer sequences are problematic: they can produce a false positive hit, so a read is treated as forward oriented instead of being detected as reverse oriented and having its output reverse complemented. If the script finds the forward primer in the orientation given in the mapping file, or the reverse complement of the reverse primer, it treats the read as forward. If it finds the reverse complement of the forward primer, or the reverse primer in the orientation given in the mapping file, it treats the read as reverse oriented and reverse complements the output.
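
The logic described above can be sketched roughly as follows in Python. This is not the code from the gist scripts, just a minimal illustration under some assumptions: primers are matched exactly (after expanding IUPAC degenerate codes into regular expressions), the whole read is searched, and the function names are made up for the example.

```python
import re

# IUPAC degenerate codes expanded to regex character classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
         "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
         "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement a sequence, including degenerate codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Compile a 5'->3' primer (possibly degenerate) into a regex."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def orient_read(read, fwd_primer, rev_primer):
    """Return (orientation, read in forward orientation)."""
    read = read.upper()
    forward_hits = (primer_regex(fwd_primer),
                    primer_regex(reverse_complement(rev_primer)))
    reverse_hits = (primer_regex(rev_primer),
                    primer_regex(reverse_complement(fwd_primer)))
    if any(p.search(read) for p in forward_hits):
        return "forward", read
    if any(p.search(read) for p in reverse_hits):
        return "reverse", reverse_complement(read)
    return "unknown", read
```

Because the forward checks run first, a very short primer that matches by chance anywhere in a reverse-oriented read would label it as forward, which is the false positive problem mentioned above.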

So, to confirm, you have one fastq file per sample? What is in the fasta file? Can the sequencing center clarify if the data were already demultiplexed and PCR construct sequences removed from the fasta file? 

AnnaC

Mar 15, 2016, 3:22:29 PM
to Qiime 1 Forum
Many many thanks again Tony! This script will be super useful and I have very long primers, so I hope I will not have problems.

I have one fastq file per sample, in which sequences do not contain the barcodes. They have just told me that the fasta file contains all the raw sequences, without demultiplexing (also without the quality values...). So, I think I will be working with the already demultiplexed fastq files.

AnnaC

Mar 15, 2016, 7:26:34 PM
to Qiime 1 Forum
Hi again Tony!
 
I am using the script you pointed me to (the fastq one), and in some cases it works great!! But in other sequences I find that the primers are still there... Most of these cases are reads in reverse orientation, i.e. reads that begin with the reverse primer. Do I have to add an option to tell the script that it must look for the primers in both orientations? Am I doing something wrong?
 
Does the script account for degenerate primers? I have used a universal primer (which is not degenerate) plus 16S primers, which have some degenerate bases. With the custom script I am attempting to extract the universal ones. Then, with split_libraries.py and the reads in the right orientation, I will extract the degenerate ones. Do you think this is the best approach?
 
Thank you!!

TonyWalters

Mar 15, 2016, 7:33:55 PM
to Qiime 1 Forum
Yes, the scripts account for degenerate primers. Are your primers in 5'->3' orientation? Maybe you can post some reads (each orientation, reads that work, reads that don't), and the exact sequences you are using for the primers in the mapping file.

AnnaC

Mar 15, 2016, 8:09:45 PM
to Qiime 1 Forum
Yes, my primers are in 5'->3' orientation!
I am attaching the sequences of six reads before and after trimming (.txt). I am also attaching the map file I used, the log from your script, and the raw fastq file for these six sequences.
 
Many thanks!!
map-axilla.txt
log_axilla
raw-seqs.txt
trimmed-seqs.txt
raw-seqs.fastq

TonyWalters

Mar 15, 2016, 8:23:17 PM
to Qiime 1 Forum
Okay, it looks like there is variability in how much of the sequencing adapters is present. I don't know of a perfect solution to orient every read, so you may lose some no matter what, but using slightly shorter primers (just their 3' ends) might orient more reads.

e.g. in read 2 of your run:
@m160213_082955_42145_c100933682550000001823210305251612_s1_p0/245/ccs 1 26
GATCACTTGTGCAAGCATCACATCGTAGTACCTTGTTACGACTTCACTCCAGTC
Reverse Primer (caps to keep the text lined up):
TGGATCACTTGTGCAAGCATCACATCGTAG

There is an extra TG at the start of the primer; if this is removed, the sequences line up:
GATCACTTGTGCAAGCATCACATCGTAGTACCTTGTTACGACTTCACTCCAGTC
GATCACTTGTGCAAGCATCACATCGTA

I would take about half of the forward primer (tagctgactcaggtcac) and reverse primer (agcatcacatcgtag) and see if that works better for orienting the read, considering the variability at the start that might be interfering with the scripts finding exact text matches.
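
As a quick check of this idea, here is a small Python sketch using read 2 and the reverse primer shown above. It only does exact text matching with str.find, which is an assumption about how to test the match rather than a copy of what the gist scripts do, and the helper name is made up.

```python
def three_prime_end(primer, length):
    """Return the last `length` bases (the 3' end) of a 5'->3' primer."""
    return primer[-length:].upper()


# Read 2 and the full reverse primer, copied from the example above.
read = "GATCACTTGTGCAAGCATCACATCGTAGTACCTTGTTACGACTTCACTCCAGTC"
full_reverse_primer = "TGGATCACTTGTGCAAGCATCACATCGTAG"

# Keep only the 3' half of the primer (15 of 30 bases).
short_reverse_primer = three_prime_end(full_reverse_primer, 15)

# str.find returns -1 when there is no exact match.
full_hit = read.find(full_reverse_primer)
short_hit = read.find(short_reverse_primer)
print(short_reverse_primer, full_hit, short_hit)
```

The full primer is not found because the read lacks the leading TG, while the shortened primer matches. The trade-off is that shorter primers are more likely to match by chance elsewhere in a read.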


AnnaC

Mar 16, 2016, 12:38:43 PM
to Qiime 1 Forum
Hi Tony, many thanks again for this solution!!! 
So, I will try to shorten my primers.

