split_libraries_fastq.py on assembled paired-end illumina reads

Mike

unread,

Feb 27, 2013, 3:42:36 PM2/27/13

to qiime...@googlegroups.com

Our collaborators have sent us a fastq file that contains the de-mulitiplexed, trimmed, cleaned assembled reads & reads_1 that didn't assemble. Currently, the fastq files for the barcodes are not attainable, so I made use of Tony's 'parse_bc_reads_labels.py' script with some modification (my header format was slightly different):

@MISEQ03:74:000000000-A2TYR:1:1101:14024:2827#TCCACAGGAGT/1

TACAGAGGGTGCGAGCGTTAATCGGATTTACTGGGCGTAAAGCGTGCGTAGGCGGCTTTTTAAGTCGGATGTGAAATCCCTGAGCTTAACTTAGGAATTGCATTCGATACTGGGAAGCTAGAGTATGGGAGAGGATGGTAGAATTCCAGGTGTAGCGGTGAAATGC

+

<???<BB?B5?<BBBBCCEEFFHFH?EHFFGHFHHHHHHHHHHHCCEC>EEEGHHACCFHFDHHCCDECDCDBBFFFBFFFFBDEDDDEBDEEEEEEEEE;BEE?ECBEEEC?C;;BEEEEEEEEEEAEEEEEEEEAECCEEEECEEEEE:AEEEE??E8A:CEAE

I successfully ran the following command :

split_libraries_fastq.py -i /illumina/JGI-PMI/iTAGs_2013Jan7_run1/standard.fastq.txt,/illumina/JGI-PMI/iTAGs_2013Jan23_run2/standard.fastq.txt -o /illumina/JGI-PMI/DRAFT_01/sl_out -b /illumina/JGI-PMI/iTAGs_2013Jan7_run1/standard.barcode.fastq.txt,/illumina/JGI-PMI/iTAGs_2013Jan23_run2/standard.barcode.fastq.txt -m /illumina/JGI-PMI/iTAGs_2013Jan7_run1/Mapping_run1.txt,/illumina/JGI-PMI/iTAGs_2013Jan23_run2/Mapping_run2.txt --barcode_type 11

But got troubling results:

Length Count

80.0 10

90.0 53

100.0 38

110.0 10

120.0 10

130.0 0

140.0 1

150.0 0

--

Length Count

126.0 26

136.0 77

146.0 427

156.0 15

166.0 1

176.0 0

186.0 0

196.0 0

206.0 1

216.0 2

226.0 1

236.0 3

246.0 0

--

and...

Input file paths

Mapping filepath: /illumina/JGI-PMI/iTAGs_2013Jan7_run1/Mapping_run1.txt (md5: 4e1049e0e796e99ec117cc8aed38d43a)

Sequence read filepath: /illumina/JGI-PMI/iTAGs_2013Jan7_run1/standard.fastq.txt (md5: 898d19dfe5eb7ad20eb3f3ae2b9b8587)

Barcode read filepath: /illumina/JGI-PMI/iTAGs_2013Jan7_run1/standard.barcode.fastq.txt (md5: 85fdec50cac6ca0c7d4b1b3b51

fbcb24)

Quality filter results

Total number of input sequences: 5484634

Barcode not in mapping file: 0

Read too short after quality truncation: 5484512

Count of N characters exceeds limit: 0

Illumina quality digit = 0: 0

Barcode errors exceed max: 0

Result summary (after quality filtering)

Median sequence length: 99.00

RHIZ.O.2.BESC.29 5

RHIZ.O.3.BESC.103 4

RHIZ.O.1.SQMC.25.5 4

.

Total number seqs written 122

---

.

Input file paths

Mapping filepath: /illumina/JGI-PMI/iTAGs_2013Jan23_run2/Mapping_run2.txt (md5: f0962541227653ba9200175d220f4540)

Sequence read filepath: /illumina/JGI-PMI/iTAGs_2013Jan23_run2/standard.fastq.txt (md5: 49f5833abe767369f51fb8bf2c50f271

)

Barcode read filepath: /illumina/JGI-PMI/iTAGs_2013Jan23_run2/standard.barcode.fastq.txt (md5: 5762abd8797820e39153f9199

d286a16)

Quality filter results

Total number of input sequences: 5903511

Barcode not in mapping file: 0

Read too short after quality truncation: 5902958

Count of N characters exceeds limit: 0

Illumina quality digit = 0: 0

Barcode errors exceed max: 0

Result summary (after quality filtering)

Median sequence length: 147.00

ENDO.C.2.HOMC.21.3 21

ENDO.C.2.BESC.29 21

ENDO.C.1.BESC.890 14

.

Total number seqs written 553

---

Not sure why all the data is being culled. Does this have to do with the potentially bad quality of the overlapping sections of the paired end reads? Any suggestions on what quality filtering settings I should change? Something I am missing? Another issue perhaps?

-Thanks!

-M

Tony Walters

unread,

Feb 27, 2013, 3:47:08 PM2/27/13

to qiime...@googlegroups.com

Hello Mike,

I'd suggest raising -r and/or lowering the -p values from the default settings, to account for lower quality scores in the center of the reads.

-Tony

--

---
You received this message because you are subscribed to the Google Groups "Qiime Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to qiime-forum...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

fdbree

unread,

Feb 28, 2013, 3:46:11 AM2/28/13

to qiime...@googlegroups.com

Hi Mike,

I am not quite sure what kind of data you have, but maybe I can help by telling what we have and do.

We have amplicon-based sequencing data, which were paired-end assembled to increase the length of the amplicon-based sequence reads.

We tried assembling with pandaseq, which can definitely do the assembly per pair, but the quality scoring is then arbitrarily adjusted by pandaseq into basically 'rubbish-information' for the middle part. We also used/use 'cope', which is slightly better.

However, to make a long story short, in the end, we abolished the first step of 'split_libraries' from qiime by simply reformatting the assembled-reads to qiime-format-fasta ourselves with a very simple script. This gives us the seqs.fna (from the tutorials in qiime, if you're used to this). In fact, this is all you need to do, if you have already done your QC and filtration -before- qiime.

This works great. Added advantage is, that you have more control over the quality control and trimming/clipping/filtration with the tools you're used to.

Freddy

Mike

unread,

Feb 28, 2013, 8:49:55 AM2/28/13

to qiime...@googlegroups.com

Thanks for the information Tony and Freddy!

Luckily the sequencing facility has provided us there estimate of usable reads per sample after QC etc.. So, I've been attempting to adjust the parameters to get as close to the number of reads we know should be there. It should not be to much trouble for me to write a script that outputs a qiime compatible seqs.fna as Freddy suggests. I'll give it a go!