what to do when barcode is NOT right at the beginning of a read?

dbaqua

unread,

Jul 16, 2015, 5:11:56 AM7/16/15

to qiime...@googlegroups.com

Hi all,
when we get raw fastq files from illumina Miseq PE sequencing, they are structured as follows: first come four random nucleotides (NNNN) then comes the barcode and the sequence (this is amplicon sequencing and the barcodes are added to the 5' of the primer).
However, in those read files, it is not always the case that single reads are nicely structured exactly in the pattern „NNNN-barcode-sequence“; instead, sometimes only 3 or 2 Ns or maybe 5 random Ns come, then barcode, then sequence.

Therefore we cannot simply use extract_barcodes.py and cut off 12 nucleotides in the beginning (4Ns plus 8 barcode) and assume that the last 8 are the exact barcode, because many wrong barcodes would result when doing that (e.g. NBBBBBBB instead of BBBBBBBB) which then later, during demultiplexing, cannot be assigned to a sample because the wrong barcode does not match anything in the mapping file. Has anyone come across a similar problem and is there an easy solution? I know that e.g Flexbar uses the exact barcode sequence as input and finds it even if it is on variable positions within the sequence. Anything in qiime that can handle this?

I would greatly appreciate any advice!

D

Tony Walters

unread,

Jul 16, 2015, 7:59:25 AM7/16/15

to qiime...@googlegroups.com

Hello,

You are correct that extract_barcodes.py won't help in this situation, due to the variable length of the random nucleotide sequence at the beginning. There isn't anything in QIIME to handle this scenario, so it would require custom scripting (or outside software if the output is amenable to usage with QIIME).

I do have a script written here: https://gist.github.com/walterst/2c592044b3b9e44a4290
that searches for primers in a fastq file. I think it could be modified, in particular, lines 90-93, to try and get the index of the 8 bases (all barcodes are 8 bases right?) before the beginning of the forward primer, rather than the index of the end of the forward primer which it's doing now.

--

---
You received this message because you are subscribed to the Google Groups "Qiime Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to qiime-forum...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

dbaqua

unread,

Jul 17, 2015, 9:09:33 AM7/17/15

to qiime...@googlegroups.com

Hi Tony, thanks for your reply! Maybe I'm overlooking something, but what about the mapping file, does it need to be structured exactly as "usual" i.e. with headers sample id, barcode sequence, linkerprimersequence etc? And then your script searches for what is written in the column linkerprimersequence and cuts away everything that comes before (for forward primer?) If that is what is happening, I think it would be enough for my purposes so i wouldnt have to change anything in the code... but maybe I'm not understanding the problem.
I tried running your script on a test file but could not get it to work. The error message was "Import error: no module named skbio.parse.sequences"
I would be thankful for your answers to these issues...
D

Tony Walters

unread,

Jul 17, 2015, 9:25:20 AM7/17/15

to qiime...@googlegroups.com

Hmm, you must be using an older (1.8.0?) version of QIIME. The skbio software can be installed from here to get the libraries needed: https://github.com/biocore/scikit-bio/tree/master/skbio

The mapping file is indeed the same as the normal QIIME requirements.

You would need to slightly change the code though-right now, it looks for the forward primer and cut's it out at its 3' position (and also finds the reverse primer, but since the barcode is before the forward primer, that part isn't relevant, except that you get rid of the reverse primer, which you probably want to do in any case).

E.g., if the XXXX sequence below is the forward primer:

NNNNNNbarcodeXXXXXAAAATTTCCCCGG

it finds the 3' position, and cuts it out, so you'll get

AAAATTTCCCCGG

So you'd need to modify how it's slicing out the primer, since you want to get the 8 bases before it. Line 92 in the code looks like this:

start_slice = int(curr_primer.search(seq).span()[1])
You could change it to something like this:

start_slice = int(curr_primer.search(seq).span()[0]) - 8 The change to [0] returns the position at the left of the primer, rather than the right, and -8 changes the index 8 bases to get the barcode before it. Because of the possibility that you could get a negative value here, you probably would want to add a couple of lines immediately after the above one like this (the spacing in the second line is important):

if start_slice < 0:

start_slice = 0

Hopefully this will lead to sequences that consistently start with the 8 base pair barcode, so from:

NNNNNNbarcodeXXXXXAAAATTTCCCCGG

you'll get:

barcodeXXXXXAAAATTTCCCCGG

in the output, so you can run extract_barcodes.py on it to get the first 8 bases out.

dbaqua

unread,

Jul 20, 2015, 4:04:14 AM7/20/15

to qiime...@googlegroups.com

Thanks a lot for your help Tony!
Indeed, I was working on an older qiime version... now updated.
Will try this right away!
Deniz

dbaqua

unread,

Jul 20, 2015, 12:17:19 PM7/20/15

to qiime...@googlegroups.com

Hi Tony,
it worked with the edited version! Thanks again for the help! One thing I was wondering about: As far as I can see the script needs 100% matches to the primers, right? But I guess the number of mismatches should be small...
Have a good day.
D

On Friday, July 17, 2015 at 3:25:20 PM UTC+2, TonyWalters wrote:

Tony Walters

unread,

Jul 20, 2015, 12:22:49 PM7/20/15

to qiime...@googlegroups.com

Hello again,

Glad it's working. It indeed does check for exact matches (although it does support degenerate DNA codes). One could alter the code to do local alignments (to handle indels) and mismatch counts, but that runs much slower, but as you say, there shouldn't be that many reads with errors.

Reply all

Reply to author

Forward