Upstream processing error: overlapping seq ids

Roody_UF

Aug 30, 2016, 4:21:24 PM
to qiime...@googlegroups.com
Hello all, 

I have some data from a collaborator that I have been tackling: 
1. It is a combination of three 454 runs, with 10 barcodes across 30 samples.
2. I have .fastq files with forward and reverse reads.
-------
System information
==================
         Platform:      linux2
   Python version:      2.7.6 (default, Jun 22 2015, 17:58:13)  [GCC 4.8.2]
Python executable:      /usr/bin/python

QIIME default reference information
===================================
For details on what files are used as QIIME's default references, see here:

Dependency versions
===================
          QIIME library version:        1.9.1
           QIIME script version:        1.9.1
qiime-default-reference version:        0.1.3
                  NumPy version:        1.8.2
                  SciPy version:        0.13.3
                 pandas version:        0.17.1
             matplotlib version:        1.3.1
            biom-format version:        2.1.4
                   h5py version:        2.5.0 (HDF5 version: 1.8.11)
                   qcli version:        0.1.0
                   pyqi version:        0.3.2
             scikit-bio version:        0.2.3
                 PyNAST version:        1.2.2
                Emperor version:        0.9.51
                burrito version:        0.9.1
       burrito-fillings version:        0.1.1
              sortmerna version:        SortMeRNA version 2.0, 29/11/2014
              sumaclust version:        SUMACLUST Version 1.0.00
                  swarm version:        Swarm 1.2.19 [May 25 2016 14:36:46]
                          gdata:        Installed.
-----
So I went through and extracted barcodes for each read, then ran split_libraries_fastq.py.
I then concatenated (cat) the 3 resulting files from the 3 runs.

To be more specific, I extracted barcodes for all samples using the command:
extract_barcodes.py -f A1_R1.fastq -o barcodes_A1 -c barcode_single_end --bc1_len 8

Then I split libraries: 
split_libraries_fastq.py -m map_run1.txt -i A1_R1.fastq -o split_lib_A1 --barcode_read_fps barcodes.fastq --barcode_type 8

I then cat all 30 of the resulting seqs.fna files together.

Now that I am trying to run pick_otus.py, I am getting an error that leads me to believe I made some mistakes in the upstream processing. I also read another thread that seems similar enough to my situation.
----
bfillings.uclust.UclustParseError: A seq id was provided as a seed, but that seq id already represents a cluster. Are there overlapping seq ids in your reference and input files or repeated seq ids in either? Offending seq id is QiimeExactMatch.SS4_11
---
So my question is: does split_libraries_fastq.py have a -n option like split_libraries.py? I know I can convert the fastq files to .qual and .fasta files. Do I have to do that? Is my error in the one-by-one processing and concatenation? How can I do this more efficiently and avoid these errors?
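
One way to confirm that the concatenated file really contains repeated seq ids (which is what the uclust error suggests) is to count the ids in the FASTA headers. The following is only a minimal sketch in plain Python, independent of QIIME; the function name and the toy records are made up for illustration, and it assumes the id is the first whitespace-delimited token after `>`.

```python
from collections import Counter


def find_duplicate_seq_ids(fasta_lines):
    """Return a sorted list of sequence ids that appear more than once.

    Assumption: the id is the first whitespace-delimited token after '>'.
    """
    counts = Counter(
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    )
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)


# Toy stand-in for two runs that were concatenated after each one
# started numbering its reads at 0 (hypothetical records).
example = [
    ">SS4_0 read_a\n", "ACGT\n",
    ">SS4_1 read_b\n", "ACGT\n",
    ">SS4_0 read_c\n", "TTGA\n",
]
duplicates = find_duplicate_seq_ids(example)
print(duplicates)  # ['SS4_0']

# With a real file, pass the open file handle instead of a list, e.g.
# (the file name here is hypothetical):
#     with open("combined_seqs.fna") as handle:
#         print(find_duplicate_seq_ids(handle))
```

If this prints a non-empty list for the combined file, the ids collide between runs rather than within pick_otus.py itself.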

Thanks for the help! 
-Roo

TonyWalters

Aug 30, 2016, 4:47:22 PM
to Qiime 1 Forum
Hello Roody,

I think this option is the one you're looking for:
--start_seq_id

I would stick to using the fastq files in any case; the quality filtering is a bit different for the Illumina (split_libraries_fastq.py) and 454-style (split_libraries.py) data.
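
To make the --start_seq_id suggestion concrete: each run numbers its reads from 0 by default, so the same SampleID_N label can turn up in more than one run. A rough sketch of how non-overlapping start values could be worked out from per-run read counts is below. It is plain Python, not part of QIIME, and the helper names and toy records are hypothetical.

```python
def count_fasta_records(fasta_lines):
    """Count records by counting header lines that start with '>'."""
    return sum(1 for line in fasta_lines if line.startswith(">"))


def start_ids_for_runs(record_counts):
    """Return one starting id per run so that the id ranges never overlap.

    record_counts holds the number of reads in each run, in the order the
    runs will be concatenated.
    """
    starts = []
    next_id = 0
    for count in record_counts:
        starts.append(next_id)
        next_id += count
    return starts


# Toy stand-ins for the seqs.fna of three runs (hypothetical contents).
run1 = [">SS4_0\n", "ACGT\n", ">SS4_1\n", "ACGT\n", ">SS7_2\n", "ACGT\n"]
run2 = [">SS4_0\n", "TTGA\n", ">SS9_1\n", "TTGA\n"]
run3 = [">SS4_0\n", "GGCA\n"]

counts = [count_fasta_records(run) for run in (run1, run2, run3)]
starts = start_ids_for_runs(counts)
print(starts)  # [0, 3, 5]
```

Each value would then be passed as --start_seq_id for the matching run. Any set of non-overlapping offsets should work just as well, for example a fixed gap per run that is larger than the biggest run.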

-Tony

Roody_UF

Aug 31, 2016, 10:27:11 AM
to Qiime 1 Forum
1. I concatenated all the forward reads from one run (so no repeated barcodes).

2. I extracted barcodes from the forward file.

3. I then ran the join_paired_ends.py script:
join_paired_ends.py -f A1_R1.fastq -r A1_R2.fastq -b A1_barcode/barcodes.fastq -o A1_joined

4. Then tried to run split_libraries:
split_libraries_fastq.py -m map1.txt -i A1_joined/fastqjoin.join -o A1_split_lib --barcode_read_fps A1_barcode/barcodes.fastq --barcode_type 8

Traceback (most recent call last):
  File "/usr/local/bin/split_libraries_fastq.py", line 365, in <module>
    main()
  File "/usr/local/bin/split_libraries_fastq.py", line 344, in main
    for fasta_header, sequence, quality, seq_id in seq_generator:
  File "/usr/lib/python2.7/dist-packages/qiime/split_libraries_fastq.py", line 322, in process_fastq_single_end_read_file
    raise FastqParseError("Headers of barcode and read do not match. Can't continue. "
qiime.split_libraries_fastq.FastqParseError: Headers of barcode and read do not match. Can't continue. Confirm that the barcode fastq and read fastq that you are passing match one another.


Does this happen because I concatenated the files initially? Should I be processing them one at a time, or passing them comma-separated? Should I forget about joining the paired ends?

thanks!

TonyWalters

Aug 31, 2016, 11:52:44 AM
to Qiime 1 Forum
Hello Roody,

My guess is that the --barcode_read_fps parameter in your step 4 needs to point to the barcodes fastq in the output folder from step 3 (A1_joined/), which should be filtered so that its labels match the joined data.
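
The reason this matters: read pairs that fail to join are left out of the joined file, so the original barcodes file has extra records and the two files fall out of step. A quick way to see where they diverge is to compare the header ids record by record. The sketch below is plain Python with hypothetical helper names and toy records; it assumes strict 4-line FASTQ records and compares only the part of the header before the first whitespace, which may not be exactly what QIIME compares.

```python
def fastq_ids(lines):
    """Yield the id of each record (text after '@' up to the first whitespace).

    Assumption: every record occupies exactly 4 lines.
    """
    for i, line in enumerate(lines):
        if i % 4 == 0:
            if not line.startswith("@"):
                raise ValueError("line %d is not a FASTQ header: %r" % (i + 1, line))
            fields = line[1:].split()
            yield fields[0] if fields else ""


def first_header_mismatch(read_lines, barcode_lines):
    """Return (record_index, read_id, barcode_id) at the first disagreement.

    Returns None when every record matches. A file that runs out of records
    early is reported with None as its id.
    """
    read_ids = fastq_ids(read_lines)
    barcode_ids = fastq_ids(barcode_lines)
    index = 0
    while True:
        read_id = next(read_ids, None)
        barcode_id = next(barcode_ids, None)
        if read_id is None and barcode_id is None:
            return None
        if read_id != barcode_id:
            return (index, read_id, barcode_id)
        index += 1


# Toy data: read r2 failed to join, so it is absent from the joined reads.
joined = ["@r1 1:N:0\n", "ACGT\n", "+\n", "IIII\n",
          "@r3 1:N:0\n", "ACGT\n", "+\n", "IIII\n"]
barcodes_all = ["@r1 1:N:0\n", "AAAAAAAA\n", "+\n", "IIIIIIII\n",
                "@r2 1:N:0\n", "CCCCCCCC\n", "+\n", "IIIIIIII\n",
                "@r3 1:N:0\n", "GGGGGGGG\n", "+\n", "IIIIIIII\n"]
barcodes_filtered = barcodes_all[:4] + barcodes_all[8:]

print(first_header_mismatch(joined, barcodes_all))       # (1, 'r3', 'r2')
print(first_header_mismatch(joined, barcodes_filtered))  # None
```

With real files you would pass two open file handles. If I remember correctly, the filtered barcodes file that join_paired_ends.py writes is named fastqjoin.join_barcodes.fastq, but please check the listing of your A1_joined/ folder rather than taking my word for it.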


Roody_UF

Sep 6, 2016, 4:07:02 PM
to Qiime 1 Forum
Excellent! Thank you so much for your help! I just concatenated the 3 runs with their unique IDs and ran pick_otus.py with no errors.