split_libraries_fastq.py on assembled paired-end illumina reads

530 views
Skip to first unread message

Mike

unread,
Feb 27, 2013, 3:42:36 PM2/27/13
to qiime...@googlegroups.com
Our collaborators have sent us a fastq file that contains the de-mulitiplexed, trimmed, cleaned assembled reads & reads_1 that didn't assemble. Currently, the fastq files for the barcodes are not attainable, so I made use of Tony's 'parse_bc_reads_labels.py' script  with some modification (my header format was slightly different):

@MISEQ03:74:000000000-A2TYR:1:1101:14024:2827#TCCACAGGAGT/1
TACAGAGGGTGCGAGCGTTAATCGGATTTACTGGGCGTAAAGCGTGCGTAGGCGGCTTTTTAAGTCGGATGTGAAATCCCTGAGCTTAACTTAGGAATTGCATTCGATACTGGGAAGCTAGAGTATGGGAGAGGATGGTAGAATTCCAGGTGTAGCGGTGAAATGC
+
<???<BB?B5?<BBBBCCEEFFHFH?EHFFGHFHHHHHHHHHHHCCEC>EEEGHHACCFHFDHHCCDECDCDBBFFFBFFFFBDEDDDEBDEEEEEEEEE;BEE?ECBEEEC?C;;BEEEEEEEEEEAEEEEEEEEAECCEEEECEEEEE:AEEEE??E8A:CEAE


I successfully ran the following command :

split_libraries_fastq.py -i /illumina/JGI-PMI/iTAGs_2013Jan7_run1/standard.fastq.txt,/illumina/JGI-PMI/iTAGs_2013Jan23_run2/standard.fastq.txt -o /illumina/JGI-PMI/DRAFT_01/sl_out -b /illumina/JGI-PMI/iTAGs_2013Jan7_run1/standard.barcode.fastq.txt,/illumina/JGI-PMI/iTAGs_2013Jan23_run2/standard.barcode.fastq.txt -m /illumina/JGI-PMI/iTAGs_2013Jan7_run1/Mapping_run1.txt,/illumina/JGI-PMI/iTAGs_2013Jan23_run2/Mapping_run2.txt --barcode_type 11

But got troubling results:

Length  Count
80.0    10
90.0    53
100.0   38
110.0   10
120.0   10
130.0   0
140.0   1
150.0   0
--

Length  Count
126.0   26
136.0   77
146.0   427
156.0   15
166.0   1
176.0   0
186.0   0
196.0   0
206.0   1
216.0   2
226.0   1
236.0   3
246.0   0
--

and...

Input file paths
Mapping filepath: /illumina/JGI-PMI/iTAGs_2013Jan7_run1/Mapping_run1.txt (md5: 4e1049e0e796e99ec117cc8aed38d43a)
Sequence read filepath: /illumina/JGI-PMI/iTAGs_2013Jan7_run1/standard.fastq.txt (md5: 898d19dfe5eb7ad20eb3f3ae2b9b8587)
Barcode read filepath: /illumina/JGI-PMI/iTAGs_2013Jan7_run1/standard.barcode.fastq.txt (md5: 85fdec50cac6ca0c7d4b1b3b51
fbcb24)

Quality filter results
Total number of input sequences: 5484634
Barcode not in mapping file: 0
Read too short after quality truncation: 5484512
Count of N characters exceeds limit: 0
Illumina quality digit = 0: 0
Barcode errors exceed max: 0

Result summary (after quality filtering)
Median sequence length: 99.00
RHIZ.O.2.BESC.29        5
RHIZ.O.3.BESC.103       4
RHIZ.O.1.SQMC.25.5      4
.
.
Total number seqs written       122
---
.
.
.
Input file paths
Mapping filepath: /illumina/JGI-PMI/iTAGs_2013Jan23_run2/Mapping_run2.txt (md5: f0962541227653ba9200175d220f4540)
Sequence read filepath: /illumina/JGI-PMI/iTAGs_2013Jan23_run2/standard.fastq.txt (md5: 49f5833abe767369f51fb8bf2c50f271
)
Barcode read filepath: /illumina/JGI-PMI/iTAGs_2013Jan23_run2/standard.barcode.fastq.txt (md5: 5762abd8797820e39153f9199
d286a16)

Quality filter results
Total number of input sequences: 5903511
Barcode not in mapping file: 0
Read too short after quality truncation: 5902958
Count of N characters exceeds limit: 0
Illumina quality digit = 0: 0
Barcode errors exceed max: 0

Result summary (after quality filtering)
Median sequence length: 147.00
ENDO.C.2.HOMC.21.3      21
ENDO.C.2.BESC.29        21
ENDO.C.1.BESC.890       14
.
.
.

Total number seqs written       553
---

Not sure why all the data is being culled. Does this have to do with the potentially bad quality of the overlapping sections of the paired end reads? Any suggestions on what quality filtering settings I should change? Something I am missing? Another issue perhaps?

-Thanks!
-M


Tony Walters

unread,
Feb 27, 2013, 3:47:08 PM2/27/13
to qiime...@googlegroups.com
Hello Mike,

I'd suggest raising -r and/or lowering the -p values from the default settings, to account for lower quality scores in the center of the reads.

-Tony

--
 
---
You received this message because you are subscribed to the Google Groups "Qiime Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to qiime-forum...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

fdbree

unread,
Feb 28, 2013, 3:46:11 AM2/28/13
to qiime...@googlegroups.com
Hi Mike,

I am not quite sure what kind of data you have, but maybe I can help by telling what we have and do.

We have amplicon-based sequencing data, which were paired-end assembled to increase the length of the amplicon-based sequence reads.
We tried assembling with pandaseq, which can definitely do the assembly per pair, but the quality scoring is then arbitrarily adjusted by pandaseq into basically 'rubbish-information' for the middle part. We also used/use 'cope', which is slightly better.
However, to make a long story short, in the end, we abolished the first step of 'split_libraries' from qiime by simply reformatting the assembled-reads to qiime-format-fasta ourselves with a very simple script. This gives us the seqs.fna (from the tutorials in qiime, if you're used to this). In fact, this is all you need to do, if you have already done your QC and filtration -before- qiime.
This works great. Added advantage is, that you have more control over the quality control and trimming/clipping/filtration with the tools you're used to. 

Freddy

Mike

unread,
Feb 28, 2013, 8:49:55 AM2/28/13
to qiime...@googlegroups.com
Thanks for the information Tony and Freddy!

Luckily the sequencing facility has provided us there estimate of usable reads per sample after QC etc.. So, I've been attempting to adjust the parameters to get as close to the number of reads we know should be there. It should not be to much trouble for me to write a script that outputs a qiime compatible seqs.fna as Freddy suggests. I'll give it a go!

-Thanks!
-Mike
Reply all
Reply to author
Forward
0 new messages