Filtering raw Illumina data

Skip to first unread message

Mads B

May 24, 2016, 4:19:00 AM5/24/16
Hi there,

I am working with paired-end Illumina data in FASTQ format, that I want to prepare for analysis with Qiime. My data consists of amplicons from both archaea and bacteria. The same barcodes are used twice, for both the archaea and bacteria amplicons, so I want to filter them prior to the actual analysis. The reads look like this:

I have tried to filter by searching for the primer sequences using a shell script:

cat CTTGTA_R1.fastq | paste - - - - | awk '(($3 ~ "TAGATCGCGTAATGGTCTGGCTTAGACG"))' | tr "\t" "\n" > bac_R1.fastq

where CTTGTA_R1.fastq is my raw data and the sequence used with awk is the primer.

If i do this with all primers for both the foward and reverse reads - for both archaea and bacteria -  I end up with files of very different sizes (number of reads).

Can anyone suggest a smarter way for me to do it?

Message has been deleted

Greg Caporaso

May 26, 2016, 10:44:21 AM5/26/16
to Qiime 1 Forum
What are the N's in your reads? Is that a fixed or variable number of bases? 



May 26, 2016, 10:45:39 AM5/26/16
to Qiime 1 Forum

You might try modifying this custom script: to only write out reads where the target primers are found-right now, if it doesn't find the primer(s) it writes out the read without cutting off the reads at the primer site.

Since you're only searching for the forward primer, and you don't want to strip off parts of the reads (although maybe you also want to modify this to slice out the adjacent barcode region, depending upon what sort of randomness you have in the nucleotide sequences before the barcodes) as the code currently does, so maybe a modification like this towards the end of the code, where you change this part:

f_count = 0
r_count = 0
no_seq_left = 0

for label,seq,qual in parse_fastq(seqs):
    start_slice = 0
    end_slice = -1
    for curr_primer in forward_primers:
            start_slice = int([1])
            f_count += 1
    for curr_primer in reverse_primers:
            end_slice = int([0])
            r_count += 1
    curr_seq = seq[start_slice:end_slice]
    curr_qual = qual[start_slice:end_slice]
    if len(curr_seq) < 1:
        no_seq_left += 1
    formatted_fastq_line = format_fastq_record(label, curr_seq, curr_qual)
    out_seqs.write("%s" % (formatted_fastq_line))
log_out.write("Forward primer hits: %d\n" % f_count)
log_out.write("Reverse primer hits: %d\n" % r_count)
log_out.write("No seq left after truncation: %d" % no_seq_left)

to something like this:

no_primer_hit = 0

for label,seq,qual in parse_fastq(seqs):
    found_primer = False
    for curr_primer in forward_primers:
            found_primer = True

    if not found_primer:
        no_primer_hit += 1
    curr_seq = seq[start_slice:end_slice]
    curr_qual = qual[start_slice:end_slice]
    formatted_fastq_line = format_fastq_record(label, seq, qual)
    out_seqs.write("%s" % (formatted_fastq_line))

log_out.write("No forward primer found: %d" % no_primer_hit)

Note that I haven't tested this, so you might have to fix the code, and do some slicing [] work if you want to try and pull out the barcode adjacent to the primer hit. The original script has the lines where the index of the primer is found in the reads.
Reply all
Reply to author
0 new messages