Filtering raw Illumina data

Mads B

unread,

May 24, 2016, 4:19:00 AM5/24/16

to qiime...@googlegroups.com

Hi there,

I am working with paired-end Illumina data in FASTQ format, that I want to prepare for analysis with Qiime. My data consists of amplicons from both archaea and bacteria. The same barcodes are used twice, for both the archaea and bacteria amplicons, so I want to filter them prior to the actual analysis. The reads look like this:

NNNN-Barcode-Primer-Target

I have tried to filter by searching for the primer sequences using a shell script:

cat CTTGTA_R1.fastq | paste - - - - | awk '(($3 ~ "TAGATCGCGTAATGGTCTGGCTTAGACG"))' | tr "\t" "\n" > bac_R1.fastq

where CTTGTA_R1.fastq is my raw data and the sequence used with awk is the primer.

If i do this with all primers for both the foward and reverse reads - for both archaea and bacteria - I end up with files of very different sizes (number of reads).

Can anyone suggest a smarter way for me to do it?

Message has been deleted

Greg Caporaso

unread,

May 26, 2016, 10:44:21 AM5/26/16

to Qiime 1 Forum

Hello,

What are the N's in your reads? Is that a fixed or variable number of bases?

Greg

TonyWalters

unread,

May 26, 2016, 10:45:39 AM5/26/16

to Qiime 1 Forum

Hello,

You might try modifying this custom script: https://gist.github.com/walterst/2c592044b3b9e44a4290 to only write out reads where the target primers are found-right now, if it doesn't find the primer(s) it writes out the read without cutting off the reads at the primer site.

Since you're only searching for the forward primer, and you don't want to strip off parts of the reads (although maybe you also want to modify this to slice out the adjacent barcode region, depending upon what sort of randomness you have in the nucleotide sequences before the barcodes) as the code currently does, so maybe a modification like this towards the end of the code, where you change this part:

f_count = 0

r_count = 0

no_seq_left = 0

for label,seq,qual in parse_fastq(seqs):

start_slice = 0

end_slice = -1

for curr_primer in forward_primers:

if curr_primer.search(seq):

start_slice = int(curr_primer.search(seq).span()[1])

f_count += 1

for curr_primer in reverse_primers:

if curr_primer.search(seq):

end_slice = int(curr_primer.search(seq).span()[0])

r_count += 1

curr_seq = seq[start_slice:end_slice]

curr_qual = qual[start_slice:end_slice]

if len(curr_seq) < 1:

no_seq_left += 1

continue

formatted_fastq_line = format_fastq_record(label, curr_seq, curr_qual)

out_seqs.write("%s" % (formatted_fastq_line))

log_out.write("Forward primer hits: %d\n" % f_count)

log_out.write("Reverse primer hits: %d\n" % r_count)

log_out.write("No seq left after truncation: %d" % no_seq_left)

to something like this:

no_primer_hit = 0

for label,seq,qual in parse_fastq(seqs):

found_primer = False

for curr_primer in forward_primers:

if curr_primer.search(seq):

found_primer = True

if not found_primer:

no_primer_hit += 1

continue

curr_seq = seq[start_slice:end_slice]

curr_qual = qual[start_slice:end_slice]

formatted_fastq_line = format_fastq_record(label, seq, qual)

out_seqs.write("%s" % (formatted_fastq_line))

log_out.write("No forward primer found: %d" % no_primer_hit)

Note that I haven't tested this, so you might have to fix the code, and do some slicing [] work if you want to try and pull out the barcode adjacent to the primer hit. The original script has the lines where the index of the primer is found in the reads.

Reply all

Reply to author

Forward