Importing UPARSE into the QIIME pipeline


Carlos Ruiz

Nov 24, 2013, 12:14:22 AM
to qiime...@googlegroups.com

Hello Qiime developers,

 

I am trying to analyse some Illumina reads I got for my thesis project. So far I've been working with QIIME with no problems, but I would also like to try using UPARSE for the OTU picking. At this point I have an OTU table created with the UPARSE pipeline, but it has no taxonomy assignment yet. My question is: I know that you are planning to integrate UPARSE into QIIME, but that will not happen anytime soon, so is there any way I can manually "import" this UPARSE OTU table into QIIME for the taxonomy assignment and the alpha diversity analysis? Thank you very much for your help.

 

Best regards,

 

Carlos Ruiz 

arp

Nov 25, 2013, 4:21:58 PM
to qiime...@googlegroups.com
Hi Carlos,

I haven't done this before, but the last step in the UPARSE pipeline is converting the .uc file to a tab-separated OTU table, which you can then (theoretically) convert to a BIOM file using biom convert.  I hope this helps!
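For reference, the conversion itself should then be a one-liner along these lines (an untested sketch; the file names are placeholders):

biom convert -i uparse_otu_table.txt -o uparse_otu_table.biom --table-type="otu table"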

Best,
Adam



Blair

Nov 25, 2013, 8:50:51 PM
to qiime...@googlegroups.com
Hello,

I'm not part of the QIIME team, and am certainly not a computer scientist!  However, I've also been playing around with UPARSE, but using 454 data, and came across the same problem as Carlos.  I've come up with a very rudimentary work-around that I'm sure could be made far more elegant.  Indeed, Carlos may already have done just what I've done... so sorry if I'm teaching my grandmother to suck eggs.

I followed the UPARSE command line examples (using my data) and got to the point of generating the OTU table, without taxonomy assignment.  I then found the otu.fa file (generated in the third-to-last step of the UPARSE command line examples, 'Label OTU sequences').  This should be a fasta file with rep sequences from the UPARSE pipeline.  I then went back to QIIME and ran this 'rep set' through the 'align_seqs.py', 'assign_taxonomy.py', 'filter_alignment.py' and 'make_phylogeny.py' scripts.  The taxa_assignment.txt file (generated by the 'assign_taxonomy.py' script) can be opened in Excel and sorted according to OTU numbers.  This can then be aligned with the UPARSE OTU table and the taxonomy appended to the table.  It needs a bit of adjustment to convert this new table to a format that can be converted to biom and used in QIIME downstream analyses (alpha, beta diversity, etc.), but it's possible.
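Roughly, the QIIME commands I ran on that rep set looked like the following (a sketch from memory; the file names are placeholders and I mostly used the default parameters):

align_seqs.py -i otus.fa -o pynast_aligned
assign_taxonomy.py -i otus.fa -o assigned_taxonomy
filter_alignment.py -i pynast_aligned/otus_aligned.fasta -o filtered_alignment
make_phylogeny.py -i filtered_alignment/otus_aligned_pfiltered.fasta -o rep_set.tre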

If someone has a better way to do this it'd be great to hear it (for example, I couldn't work out how to use the UPARSE otu mapping data to generate the OTU table directly in QIIME).

Cheers,

Blair

arp

Nov 25, 2013, 8:58:55 PM
to qiime...@googlegroups.com
Blair, I'm curious whether you used the uc2otutab.py script that's on the page I linked.  I haven't tried it myself, but I'd like to hear about your experience with it.

Or are you saying that even after using that script, the OTU table that is output cannot be converted to biom format?

Adam



Blair

Nov 25, 2013, 9:35:20 PM
to qiime...@googlegroups.com
Hi Adam,

Yes, I did use the uc2otutab.py script.  Basically, I followed along with the UPARSE command line examples using my own data and ended up with an OTU table (in txt format).  This looks very much like one of the old QIIME tables, but without the "# QIIME OTU table" line at the top, without "#OTU ID" in the second line (it's "OTUId" instead), and without a Consensus Lineage column.

I've been trying out a range of OTU picking strategies on an old data set and really wanted to find out how many OTUs I'd get using the UPARSE pipeline, compared with the QIIME 1.3 de novo picking and the QIIME 1.7.0 open and closed reference methods.  The short answer to that question was 7500 OTUs using QIIME 1.3 de novo, around 4000 OTUs using 1.7.0 open reference, 900 OTUs using 1.7.0 closed reference, and 150 OTUs using UPARSE.  I wanted the taxonomy so that I could check the UPARSE OTUs against the QIIME versions (with taxonomy included).  Thus I'd gotten around to attaching a QIIME-generated taxonomy to the UPARSE OTU table, but had not then attempted to convert this table to biom format, so I've not tested downstream QIIME scripts on the UPARSE OTU table.

I've just now modified the UPARSE OTU table to 'look' like a QIIME txt table (including the "# QIIME OTU table" top line, changing 'OTUId' to '#OTU ID', and pasting in the taxonomy information).  This does convert to biom format (convert_biom.py -i otu_table_qiime.txt -o otu_table_qiime.biom --biom_table_type="otu table").  Also, 'print_biom_table_summary.py' gives an output, but I've not tried using this table for anything downstream.  Hopefully it'll work fine, but I just don't know yet.
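For anyone wanting to script those header tweaks, something like the following should do it (a sketch; the file names are placeholders):

(echo "# QIIME OTU table"; sed 's/^OTUId/#OTU ID/' uparse_otu_table.txt) > otu_table_qiime.txt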

Cheers,

Blair

Mike

Nov 26, 2013, 9:31:40 AM
to qiime...@googlegroups.com
There are obviously different ways to get the data into QIIME. I opted for the procedure below, mainly because several of the UPARSE scripts did not work on my data. Below is a set of commands I have used to process my data via UPARSE and get the data into QIIME. I generally followed the UPARSE pipeline (http://drive5.com/usearch/manual/uparse_cmds.html) with modification: it is a combination of QIIME and UPARSE scripts used to analyze paired-end data:

# join paired ends
usearch7 -fastq_mergepairs R1.fastq -reverse R2.fastq -fastq_truncqual 3 -fastqout merged.fastq -fastaout merged.fasta

# remove unused barcodes, here is a link to the script I posted a while back:
remove_unused_barcodes.py barcodes.fastq merged.fastq merged.barcodes.fastq

# Use QIIME to demultiplex the data, with -q 0. Store output as fastq format (we will quality filter with usearch7)
split_libraries_fastq.py -v -q 0 --store_demultiplexed_fastq -m miseq2_mapping.txt --barcode_type golay_12 -b merged.barcodes.fastq --rev_comp_mapping_barcodes -i merged.fastq -o sl_out

# get quality stats
usearch7 -fastq_stats seqs.fastq -log seqs.stats.log

# remove low quality reads (trimming not required for paired-end data)
usearch7 -fastq_filter seqs.fastq -fastaout seqs.filtered.fasta -fastq_maxee 0.5 -threads 24

# dereplicate seqs
usearch7 -derep_fulllength seqs.filtered.fasta  -output seqs.filtered.derep.fasta -sizeout -threads 24

# filter singletons
usearch7 -sortbysize seqs.filtered.derep.fasta -minsize 2 -output seqs.filtered.derep.mc2.fasta

# cluster OTUs (de novo chimera checking cannot be disabled in usearch7)
usearch7 -cluster_otus seqs.filtered.derep.mc2.fasta -otus seqs.filtered.derep.mc2.repset.fasta

# reference chimera check
usearch7 -uchime_ref seqs.filtered.derep.mc2.repset.fasta -db gold.fa -strand plus -nonchimeras seqs.filtered.derep.mc2.repset.nochimeras.fasta -threads 24

# label OTUs using UPARSE python script
python fasta_number.py seqs.filtered.derep.mc2.repset.nochimeras.fasta OTU_ > seqs.filtered.derep.mc2.repset.nochimeras.OTUs.fasta

# map the _original_ quality filtered reads back to OTUs
usearch7 -usearch_global seqs.filtered.fasta -db seqs.filtered.derep.mc2.repset.nochimeras.OTUs.fasta -strand plus -id 0.97 -uc otu.map.uc -threads 24

# make OTU table. I modified the function 'GetSampleId' in the script 'uc2otutab.py' and renamed the script 'uc2otutab_mod.py'.
# The modified function is:
# def GetSampleId(Label):
#     SampleID = Label.split()[0].split('_')[0]
#     return SampleID

# I did this because my demultiplexed headers in the otu_map.uc looked like this:
  ENDO.O.2.KLNG.20.1_19 MISEQ03:119:000000000-A3N4Y:1:2101:28299:16762 1:N:0:GGTATGACTCA orig_bc=GGTATGACTCA new_bc=GGTATGACTCA bc_diffs=0

# and all I need is the SampleID: "ENDO.O.2.KLNG.20.1", so I split on '_' 
python uc2otutab_mod.py otu.map.uc > seqs.filtered.derep.mc2.repset.nochimeras.OTU-table.txt

# convert to biom
biom convert --table-type="otu table" -i seqs.filtered.derep.mc2.repset.nochimeras.OTU-table.txt -o seqs.filtered.derep.mc2.repset.nochimeras.OTU-table.biom

# assign taxonomy 
parallel_assign_taxonomy_rdp.py -v --rdp_max_memory 4000 -O 24 -t /gg_13_5/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt -r /gg_13_5/gg_13_8_otus/rep_set/97_otus.fasta -i sl_out_miseq_run_02/seqs.filtered.derep.mc2.repset.nochimeras.OTUs.fasta -o sl_out_miseq_run_02/assigned_taxonomy

# add taxonomy to BIOM table
biom add-metadata --sc-separated taxonomy --observation-header OTUID,taxonomy --observation-metadata-fp assigned_taxonomy/seqs.filtered.derep.mc2.repset.nochimeras.OTUs_tax_assignments.txt -i seqs.filtered.derep.mc2.repset.nochimeras.OTU-table.biom -o seqs.filtered.derep.mc2.repset.nochimeras.tax.OTU-table.biom

# Then off to QIIME-ing.  :-)

I hope the above helps some of you get started. I'd like to see how others have integrated their UPARSE pipeline into QIIME.

-Cheers! :-)
-Mike

Colleen

Dec 10, 2013, 2:04:59 PM
to qiime...@googlegroups.com
Hi Mike,
I am also trying to use UPARSE OTU picking and then use that data for downstream analyses in QIIME.  I was struggling to get my MiSeq data into the correct format for USEARCH, so I plan to use the split_libraries_fastq.py script in QIIME, like you apparently did.  I have a quick question about the command you show below: what is the -v flag for in your split_libraries_fastq.py command line?  I don't see it as one of the options to pass to split_libraries_fastq.py...

Thanks!

colleen

Colleen

Dec 10, 2013, 5:12:05 PM
to qiime...@googlegroups.com
One additional question - for Mike, or anyone really.  If we are trying to just get the sequence headers or labels in the correct format to use UPARSE for OTU picking with the split_libraries_fastq.py step (so no quality filtering in QIIME), is it also necessary to adjust the -r (--max_bad_run_length) and -p (--min_per_read_length_fraction) parameters to a minimum so that no quality filtering is done in QIIME?  Or, since we are using a q of 0, are the QIIME defaults for the -r and -p parameters irrelevant?

Thanks!

colleen

Mike

Dec 10, 2013, 6:30:54 PM
to qiime...@googlegroups.com
Hi Colleen,

The -v option just sets the verbose flag for the script. It's a habit of mine, especially since some of the scripts do not print anything anyway. :-)

Since I used joined paired-ends, the quality scores are quite good throughout the read, so I just left --max_bad_run_length and --min_per_read_length_fraction as is. Though if we wanted to be strict, I'd suspect we'd just set the following (see the example command after the list):

--max_bad_run_length : 250 (for MiSeq; alternatively, set it to the average size of your fragments, or close to it; just make it large)
--min_per_read_length_fraction : set it very low (0.01) or to 0.0
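For instance, a fully permissive demultiplexing call might look like this (a sketch only; the file names, mapping file, and barcode type are placeholders carried over from my earlier post, so adjust them to your data):

split_libraries_fastq.py -i merged.fastq -b merged.barcodes.fastq -m miseq2_mapping.txt --barcode_type golay_12 -q 0 -r 250 -p 0.01 --store_demultiplexed_fastq --rev_comp_mapping_barcodes -o sl_out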


On another note, be _very_ wary of using the gold.fa file as your reference database when running UPARSE with default settings. I recently ran into a case where the uchime_ref command removed some of the most abundant reads of a sample. That is, it was a read we expected to be there, and it was supposed to make up the majority of the sample. uchime_ref removed this OTU (and other OTUs that we know should have been in the sample) and skewed our results heavily. This issue went away either when I opted to use the 13_8 Greengenes database, or when I used the gold.fa database with the -minh flag of uchime_ref set to 1.5-2.5 (the default is 0.28). In a nutshell, the gold.fa database is not as expansive as Greengenes and may discard many valid reads; however, Greengenes may itself contain chimeras. Either way, I'd suggest printing out the chimeras to file when using uchime_ref and playing around with the -minh setting; an example follows. I also recommend BLAST-checking the chimeras no matter which reference database you use, as all these methods have a small false-positive rate of detecting chimeras.
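For example, something along these lines (a sketch; the -minh value and file names are illustrative):

usearch7 -uchime_ref seqs.filtered.derep.mc2.repset.fasta -db gold.fa -strand plus -minh 1.5 -chimeras chimeras.fasta -nonchimeras nonchimeras.fasta

You can then BLAST the sequences written to chimeras.fasta to see exactly what was flagged.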

UPARSE is good, but more attention needs to be paid to the user options. I also wish there were an option to disable the de novo chimera checking. I am still validating UPARSE for my analysis and comparing it to the QIIME pipeline. So, if others have experience with this, please let us know. :-)

-Mike

Mike

Dec 18, 2013, 4:49:46 PM
to qiime...@googlegroups.com
I have corresponded with Robert Edgar over e-mail on how to effectively disable the de novo chimera checking during '-cluster_otus' within the UPARSE pipeline (usearch v7). You can use the option '-uparse_break -999', which effectively sets up the case in which "... chimeric models would never be optimal".

For example:
usearch7 -cluster_otus filt_derep_mc2.fasta -uparse_break -999 -otus filt_derep_mc2_rep_set.fasta

I was just curious as to how the data would look with and without de novo chimera checking, as I used to do with version 6.1 of usearch, and the above is a nice way to do something similar for the time being. Anyway, I just wanted to pass this along in case it is useful. I am still learning much about the new UPARSE pipeline myself, so if anyone has other tricks they'd like to share, please let us know.

-Mike

Yoshiki Vázquez Baeza

Dec 19, 2013, 1:28:52 AM
to qiime...@googlegroups.com
Thanks for sharing with the forum Mike.

Yoshiki.



Serena Thomson

Feb 20, 2014, 9:36:19 AM
to qiime...@googlegroups.com
Hi Mike

I am failing to see how your modification to the SampleID part of uc2otutab.py gets around the 'barcodelabel= not found' error it spits out. I noticed you don't use the fastq_strip_barcode_relabel2.py script either, so I'm not sure how you managed this.

Serena

Mike R

Feb 20, 2014, 2:28:57 PM
to qiime...@googlegroups.com
Hi Serena,

Did you follow my pipeline from the very beginning and only use the commands I posted? If so, then the "barcodelabel= not found" error should not arise using the pipeline exactly as posted. In fact, the commands I posted precisely circumvent this issue: I am not using `fastq_strip_barcode_relabel2.py` (my data were not in a compatible format for that script); instead, I demultiplex my data with `split_libraries_fastq.py`. This is why I modified `uc2otutab.py` the way I did. That modification will only work if you demultiplex via `split_libraries_fastq.py`, hence the reason for my posting an example of the OTU header I had to parse. If you notice, the OTU header is in the format you'd generally expect as output from `split_libraries_fastq.py` (e.g. "SampleID_SeqCount<whitespace>OtherInfo").

So, unless there are some peculiarities with your format, or I am missing something, the pipeline I posted should work. My post was just an initial "quick-and-dirty" stab in which I was able to get my data from UPARSE to QIIME with minimal effort. I would love to hear how others coerced their data from UPARSE into QIIME, especially considering the myriad formats of our respective data. :-)

Does this help?
-Mike

Mike R

Feb 20, 2014, 2:36:47 PM
to qiime...@googlegroups.com
Serena,
I forgot to mention: you can also simplify things by making use of join_paired_ends.py (with the -b flag set) instead of the `usearch7 -fastq_mergepairs` and `remove_unused_barcodes.py` steps (the first two steps I initially posted). So, just do `join_paired_ends.py` followed by `split_libraries_fastq.py`, then move on to the rest of the pipeline; an example follows.
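For example (a sketch using QIIME 1.8 syntax; the file names are placeholders):

join_paired_ends.py -f R1.fastq -r R2.fastq -b barcodes.fastq -o joined_out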

-Mike

Serena Thomson

Feb 23, 2014, 10:00:25 AM
to qiime...@googlegroups.com
Hi Mike,

Really appreciate you taking the time to respond to me personally, thank you. 
I can't follow your pipeline exactly, because I don't have paired-end data; I am analysing single reads. Therefore I can't run split_libraries_fastq.py, but I will try with the standard split_libraries.py command and see if I get this error.

Ideally I just need to modify the uc2otutab script so that it doesn't look for this barcodelabel=

Thanks again

Serena



Mike R

Feb 24, 2014, 9:10:34 AM
to qiime...@googlegroups.com
Hi Serena,

Not a problem. The split_libraries_fastq.py script can work on either single or paired-end reads. Just skip the "usearch7 -fastq_mergepairs" / "join_paired_ends.py" step and start directly with split_libraries_fastq.py. Also, the "barcodelabel=" is added to your data by the UPARSE Python scripts "fastq_strip_barcode_relabel.py" and/or "fastq_strip_barcode_relabel2.py", so simply do not run these scripts on your raw data. Just take the raw data you received from the sequencing facility and send it directly to split_libraries_fastq.py, then through the rest of the pipeline.
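For single-end reads, the call would just be something like this (a sketch; the file names and barcode type are placeholders):

split_libraries_fastq.py -i reads.fastq -b barcodes.fastq -m mapping.txt --barcode_type golay_12 -q 0 --store_demultiplexed_fastq -o sl_out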

-Mike

Carly

Mar 17, 2014, 10:39:22 PM
to qiime...@googlegroups.com
Hi Mike,

Thank you for the pipeline. This has been very helpful. I am at the 'uc2otutab.py' command and have run into an issue: I do not know how to modify a script. You state you renamed the script 'uc2otutab_mod.py' so that the error

### uc2otutab.py otu.map.uc

### **ERROR** barcodelabel= not found in read label '38B4_2'

does not happen. But how do I do that -- modify the script, that is? At this point in my career I can only write commands, not modify them!

Any help would be greatly appreciated. I am actually merging four 454 Junior runs together and using QIIME to demultiplex at first is very helpful. 

Thanks!

Carly

Mike R

Mar 17, 2014, 11:15:33 PM
to qiime...@googlegroups.com
I mentioned the details of the `uc2otutab.py` script changes along with the original pipeline post earlier in this thread, here. Just open `uc2otutab.py` in any raw text editor and replace

these lines:

def GetSampleId(Label):
    Fields = Label.split(";")
    for Field in Fields:
        if Field.startswith("barcodelabel="):
            return Field[13:]
    die.Die("barcodelabel= not found in read label '%s'" % Label)

with these:

def GetSampleId(Label): 
    SampleID = Label.split()[0].split('_')[0] 
    return SampleID 
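For a label like "ENDO.O.2.KLNG.20.1_19 MISEQ03:119:...", Label.split()[0] gives "ENDO.O.2.KLNG.20.1_19", and .split('_')[0] then gives the SampleID "ENDO.O.2.KLNG.20.1".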

I've attached the  `uc2otutab_mod.py` file to this post. 

-Mike
uc2otutab_mod.py

Carly

Mar 18, 2014, 12:21:28 AM
to qiime...@googlegroups.com
Thanks Mike! I would not have been able to figure out UPARSE without your code, so many thanks. I will let you know what I find in my comparison of methods: UPARSE versus the QIIME pipeline I was using (denoising, uclust, ChimeraSlayer).

Cheers,

Carly

aswa...@scigenom.com

Mar 27, 2014, 8:23:31 AM
to qiime...@googlegroups.com
Hi Mike,
I am following your pipeline starting from the step "dereplicate seqs"; for joining paired ends and quality filtering I am using my own programs.
My doubt is regarding the step "# map the _original_ quality filtered reads back to OTUs".

I mapped my initial quality-filtered sequences back to my representative sequences. My initial file contains 945,822 reads, but when I mapped it back, only 460,101 sequences were mapped in total. I used the usearch command with the options -usearch_global and -id 0.97.

Is it expected that we lose around 485,721 sequences??
Regards,
Aswathi

Mike R

Mar 27, 2014, 7:48:21 PM
to qiime...@googlegroups.com
I'd be happy to help, but it's hard to say what the issue is without knowing what you are doing with your own programs. If you are using UPARSE, then you should use the entirety of the pipeline. Actually, other than the `split_libraries_fastq.py` bit and the modified `uc2otutab.py` script, the pipeline I posted is basically copied directly from the original UPARSE page that I linked to in my original post. Can you list the steps you used?

-Mike

Aswathi

Mar 28, 2014, 12:28:04 AM
to qiime...@googlegroups.com
Hi Mike,

Thanks for the reply.
I am doing a V3 analysis sequenced by Illumina MiSeq, paired-end data.
Quality filtering (removing reads with low average quality) and consensus making from R1 and R2 were done by a Perl script.
Now I have all candidate V3 sequences in fasta format in hand for the downstream analysis.
Then I follow these steps:
1. usearch7 -derep_fulllength sample.fa -output sample_derep.fa -sizeout
2. usearch -sortbysize sample_derep.fa -minsize 2 -output sample_NoSingleton.fa
3. usearch -uchime_denovo sample_NoSingleton.fa --chimeras chimera.fa --nonchimeras nonchimera.fa
My metagenomic set contains multiple samples. Up to chimera detection I run each sample individually; then I combine all non-chimeric sequences into one file, AllSamples.fa.
4. Run pick OTUs, pick_rep_set, and assign taxonomy from QIIME
5. usearch -usearch_global InitialAll.fa -db AllSamples.rep_set.fa -strand plus -id 0.97 -uc AllSamples_otu.map.uc -e Allchimerea.fa

(here InitialAll.fa is my combined sequences, even before dereplication)

6. python uc2otutab_mod.py AllSamples_otu.map.uc > AllSamples_otu.map.txt
7. sed 's/Consensus Lineage/ConsensusLineage/' < AllSamples_otu.map.txt | sed 's/ConsensusLineage/taxonomy/' > AllSamples_otu.map.taxonomy.txt
8. biom convert -i AllSamples_otu.map.taxonomy.txt -o AllSamples_otu.map.taxonomy.biom --table-type="otu table" --process-obs-metadata taxonomy

This biom file was used for alpha diversity and beta diversity.

9. alpha_rarefaction.py and jackknifed_beta_diversity.py

When I made the biom summary table, only half of the sequences from the initial file are mapped to each sample??

I would be really glad if you could go through these steps and let me know any flaws in the pipeline.

Note: I am using usearch 7 and qiime 1.5.0

Regards,
Aswathi

Mike R

Mar 29, 2014, 10:10:53 AM
to qiime...@googlegroups.com
Hi Aswathi, I will answer inline...


Quality filtering (removing reads with low average quality) and consensus making from R1 and R2 were done by a Perl script.
Now I have all candidate V3 sequences in fasta format in hand for the downstream analysis.

For merging your pairs, is there a reason why you've upgraded to QIIME 1.8 so you can use `join_paired_ends.py`, or make use of `usearch7 -fastq_mergepairs`, to merge your reads as outlined in previous posts in this thread?

For quality filtering, why not simply use `split_libraries_fastq.py` with the `-q 19` flag, or `usearch7 -fastq_filter`? NOTE: Do not do both; use one or the other.

 
Then I follow these steps:
1. usearch7 -derep_fulllength sample.fa -output sample_derep.fa -sizeout
2. usearch -sortbysize sample_derep.fa -minsize 2 -output sample_NoSingleton.fa
3. usearch -uchime_denovo sample_NoSingleton.fa --chimeras chimera.fa --nonchimeras nonchimera.fa


I noticed you are using a mix of usearch7 and usearch (v6?) in the above steps. usearch7 no longer has the command `-uchime_denovo`; this is because de novo chimera checking is always on by default. See earlier posts in this thread on how to disable de novo chimera checking in usearch7, and please see the documentation here (http://drive5.com/usearch/manual/uparseotu_algo.html).

 
My metagenomic set contains multiple samples. Up to chimera detection I run each sample individually; then I combine all non-chimeric sequences into one file, AllSamples.fa.

You have microbiome (amplicon) data, not metagenome data.   :-)
 
4. Run pick OTUs, pick_rep_set, and assign taxonomy from QIIME

I am confused here. Are you picking OTUs through usearch7, usearch6, or QIIME? If you are going to pick OTUs through QIIME and want to use the de novo and reference-based chimera checking, you can set `usearch61` as your OTU picking method for pick_open_reference_otus.py, etc. Or, if you want to keep things separate, you can also make use of identify_chimeric_seqs.py with `-m usearch61`, then pick OTUs with uclust (the default); a sketch follows.
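For the separate route, a rough sketch (the file names are placeholders; -r points at your reference set):

identify_chimeric_seqs.py -m usearch61 -i seqs.fna -r 97_otus.fasta -o chimera_check/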

 
5. usearch -usearch_global InitialAll.fa -db AllSamples.rep_set.fa -strand plus -id 0.97 -uc AllSamples_otu.map.uc -e Allchimerea.fa

 
Again, why go back to the older version of usearch?

 
(here InitialAll.fa is my combined sequences even before dereplication.)

6. python uc2otutab_mod.py AllSamples_otu.map.uc > AllSamples_otu.map.txt
7. sed 's/Consensus Lineage/ConsensusLineage/' < AllSamples_otu.map.txt | sed 's/ConsensusLineage/taxonomy/' > AllSamples_otu.map.taxonomy.txt


No need to do step 7 if you upgrade to QIIME 1.8. See my initial post on how to add taxonomy to the table via `biom add-metadata ...`. 

 
8. biom convert -i AllSamples_otu.map.taxonomy.txt -o AllSamples_otu.map.taxonomy.biom --table-type="otu table" --process-obs-metadata taxonomy

This biom file was used for alpha diversity and beta diversity.

9. alpha_rarefaction.py and jackknifed_beta_diversity.py

When I made the biom summary table, only half of the sequences from the initial file are mapped to each sample??


I think the issue has more to do with using two different versions of usearch. Also, I am not sure what your other scripts are actually doing to the data. Have you looked at the reduction of sequence count at each step of your process?  In general, the easiest way to track down any issues is to follow the pipeline as originally posted and then compare any differences to your pipeline.

 

I would be really glad if you could go through these steps and let me know any flaws in the pipeline.

Note: I am using usearch 7 and qiime 1.5.0



Seems like you are using two versions of usearch. If you can, you should upgrade to QIIME 1.8.

Hope this helps!
-Mike
 

Mike R

Mar 29, 2014, 10:13:26 AM
to qiime...@googlegroups.com
First comment should have been:

"For merging you pairs, is there a reason why you've not upgraded to QIIME 1.8 ..."  :-)

-Mike

Aswathi

Mar 30, 2014, 2:51:24 AM
to qiime...@googlegroups.com
Hi Mike,

Thanks a lot for the reply.

1. First of all, I am going to update my QIIME to ver 1.8 at the earliest.  :-)

2. For steps 1, 2, 3 & 5 I am using usearch v7.0.1090_i86linux32. (The `-uchime_denovo` step worked with this version... ???? Confused... Is that possible?)

3. I would like to use QIIME for picking OTUs and the rest of the analysis (pick rep set, assign taxonomy, etc.). If I use identify_chimeric_seqs.py with `-m usearch61`, will there be any usearch version problem, since I am using ver 7 for the other steps? (Definitely, I will check this out.)
4. Step 5 I am also running using usearch v7.

(I am running steps 3 & 5 through version 7, but as you said, if those commands are no longer available, does it fall back to the older version's utility to run these steps? Because I am getting results without any error being shown.)
I am totally new in this area, so thanks for your patience :-)

-Aswathi.


Mike R

Mar 30, 2014, 11:59:52 AM
to qiime...@googlegroups.com
See the command for -uchime_denovo.

Before continuing further, you should look through the usearch documentation here and contact the developer for any questions regarding usearch and its various versions.

-Mike 

Kyle Bittinger

Apr 7, 2014, 8:40:54 AM
to qiime...@googlegroups.com
Ida,

This is Kyle, the new QIIME forum master for the next two weeks.

Without implying that we provide support for the UPARSE Python scripts, it sounds like you might be using the unmodified version of uc2otutab.py instead of the modified version suggested by Mike R.

Best,
Kyle


On Mon, Apr 7, 2014 at 6:43 AM, idamoverleir <idamov...@gmail.com> wrote:
Hi,
I'm using Mike's approach to implement UPARSE into QIIME; it's brilliant.
However, when I try to use the uc2otutab.py script this is what happens:
$ python ~/qiime_tutorial/uc2otutab.py otu.map.uc > seqs.filtered.derep.mc2.repset.nochimeras.OTU-table.txt
--->
**ERROR** barcodelabel= not found in read label 'ST.1_3 M01337:22:000000000-A5PCV:1:1101:24481:4785 1:N:0:1 orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0'

My barcodes were removed by default by the MiSeq and most likely the barcode labels too.

How can I work around this?

Thanks

Ida


idamoverleir

Apr 7, 2014, 1:12:52 PM
to qiime...@googlegroups.com
Yeah, I looked at my data file and saw that the '_' is pretty early in the ID sequence, and was worried the modified script would cut off important ID information too. I don't think it did, though; I tried with the modified script and it seems fine.

Now I'm having some trouble with parallel_assign_taxonomy_rdp.py.
I downloaded all of the files on this page: http://greengenes.secondgenome.com/downloads/database/13_5
but I can't figure out from parallel_assign_taxonomy_rdp.py -h which files should be -t (id_to_taxonomy_fp) and -r (reference_seqs_fp).
Can you please point me in the right direction?

Mike's command:
# assign taxonomy 
parallel_assign_taxonomy_rdp.py -v --rdp_max_memory 4000 -O 24 -t /gg_13_5/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt -r /gg_13_5/gg_13_8_otus/rep_set/97_otus.fasta -i sl_out_miseq_run_02/seqs.filtered.derep.mc2.repset.nochimeras.OTUs.fasta -o sl_out_miseq_run_02/assigned_taxonomy

Mine was the same, as I had no original ideas here :-)

Thanks for any help!

Kyle Bittinger

Apr 7, 2014, 1:19:32 PM
to qiime...@googlegroups.com

idamoverleir

Apr 7, 2014, 1:41:12 PM
to qiime...@googlegroups.com
Yes, but what file is the rep_set file?

Thanks

Kyle Bittinger

Apr 7, 2014, 1:44:15 PM
to qiime...@googlegroups.com
The rep_set files are in the rep_set subfolder in GreenGenes 13_8.



idamoverleir

Apr 8, 2014, 5:30:10 AM
to qiime...@googlegroups.com
Thanks, Kyle, I figured it out.
A new issue is that parallel_assign_taxonomy_rdp.py is taking a very, very long time. I left the command running overnight and nothing had happened this morning.
I use MacQIIME v1.8.0 and this was my command:
$ parallel_assign_taxonomy_rdp.py -v --rdp_max_memory 4000 -O 24 -t /Users/service/qiime_tutorial/97_otu_taxonomy.txt -r /Users/service/qiime_tutorial/97_otus.fasta -i seqs.filtered.derep.mc2.repset.nochimeras.OTUs.fasta -o assigned_taxonomy
All I got was a folder named assigned_taxonomy with an empty file in it.

What can be the problem?

Very grateful for replies,

thanks

Kyle Bittinger

Apr 8, 2014, 9:06:01 AM
to qiime...@googlegroups.com
I would try the non-parallel version on a small set of input sequences (maybe like 200 seqs) and verify that the command can run all the way through.  Then try the parallel version on the same input file, but use only 2 cores, and verify that it works.

I would also increase the rdp_max_memory to 8000.  What is the available memory on your machine?

--Kyle



idamoverleir

Apr 8, 2014, 11:08:21 AM
to qiime...@googlegroups.com
I have 4 GB system memory, of which 159 MB is available.

I tried with this command:
$ assign_taxonomy.py -i seqs.OTUs.subset.fasta -t /Users/service/qiime_tutorial/97_otu_taxonomy.txt -r /Users/service/qiime_tutorial/97_otus.fasta -o assigned_taxonomy

seqs.OTUs.subset.fasta contains about 300 seqs.

It went through but no OTUs were assigned in the output file.

Ida

idamoverleir

Apr 8, 2014, 11:26:50 AM
to qiime...@googlegroups.com
Now I've tried with the parallel command:

$ parallel_assign_taxonomy_rdp.py -v --rdp_max_memory 8000 -O 24 -i seqs.OTUs.subset.fasta -t same as before -r same as before -o assigned_taxonomy

Same subset file. It has been running for over 15 minutes now.

Thanks

Ida

Kyle Bittinger

Apr 8, 2014, 11:49:46 AM
to qiime...@googlegroups.com
Let's get the non-parallel version working first.  Try a larger input file until you start to see some reads assigned, and then you can validate the assignments by BLASTing at NCBI.



idamoverleir

Apr 8, 2014, 12:06:57 PM
to qiime...@googlegroups.com
I've tried with 1000 seqs but all I get is this:
OTU_248 Unassigned 1.00 1
OTU_249 Unassigned 1.00 1
OTU_246 Unassigned 1.00 1
OTU_247 Unassigned 1.00 1
OTU_244 Unassigned 1.00 1
OTU_245 Unassigned 1.00 1
OTU_242 Unassigned 1.00 1
OTU_243 Unassigned 1.00 1
OTU_240 Unassigned 1.00 1
OTU_3997 Unassigned 1.00 1
OTU_1972 Unassigned 1.00 1
OTU_1973 Unassigned 1.00 1
OTU_1970 Unassigned 1.00 1
OTU_1971 Unassigned 1.00 1
OTU_1976 Unassigned 1.00 1
OTU_1977 Unassigned 1.00 1

Command: $ assign_taxonomy.py -v --rdp_max_memory 8000 -t /Users/service/qiime_tutorial/97_otu_taxonomy.txt -r /Users/service/qiime_tutorial/97_otus.fasta -i seqs.OTUs.subset.fasta -o assigned_taxonomy

Ida

Kyle Bittinger

Apr 8, 2014, 12:17:33 PM
to qiime...@googlegroups.com
If you BLAST these sequences against /Users/service/qiime_tutorial/97_otus.fasta, do you get many hits?

Also, build a small FASTA file with about 10 well-classified reads from the GreenGenes reference set, then try to assign these.
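If the reference fasta has one sequence per line, a quick sketch for grabbing the first 10 records would be:

head -n 20 97_otus.fasta > test_seqs.fasta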

--Kyle



idamoverleir

Apr 8, 2014, 12:18:27 PM
to qiime...@googlegroups.com
Hi again. I tried with my original file (approx 4000 seqs) and got 3 hits using the parallel command.
Those were all for the same fungal group: Ascomycota. When I BLASTed one of the 3, I got lots of uncultured bacterium clones, not fungi.

Same command as last time.

What can this mean?

Kyle Bittinger

Apr 8, 2014, 12:23:31 PM
to qiime...@googlegroups.com
What amplicon did you sequence?  Are you targeting bacteria or fungi?



idamoverleir

Apr 8, 2014, 12:26:30 PM
to qiime...@googlegroups.com
The 16S rRNA amplicon, targeting bacteria.
I tried to assign taxonomy for 7 seqs from the greengenes file gg_13_5.fasta and 
they were all unassigned.

Kyle Bittinger

Apr 8, 2014, 12:30:25 PM
to qiime...@googlegroups.com
Can you send the command and input file you used to test the assignment of GreenGenes sequences?

What 16S region did you sequence?

Best,
Kyle



idamoverleir

Apr 8, 2014, 12:33:22 PM
to qiime...@googlegroups.com
$ assign_taxonomy.py -v --rdp_max_memory 8000 -t /Users/service/qiime_tutorial/97_otu_taxonomy.txt -r /Users/service/qiime_tutorial/97_otus.fasta -i gg_13_5.seqs.fasta -o assigned_taxonomy

GG file is attached.

The V3V4 region.

Thanks!

Ida
gg_13_5.seqs.fasta

Kyle Bittinger

Apr 8, 2014, 12:46:02 PM
to qiime...@googlegroups.com
I am running this on my end using QIIME 1.8 and GreenGenes 13_8 to give you an example of the output you should see.



Kyle Bittinger

Apr 8, 2014, 12:52:26 PM
to qiime...@googlegroups.com
I think we need to focus on fixing your QIIME installation.  I'm going to ask you to use the non-parallel version only until we fix the problem.  You should be getting something like this as an answer.

1111876 k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rickettsiales;f__Anaplasmataceae;g__Anaplasma;s__       1.000
1111883 k__Bacteria;p__Gemmatimonadetes;c__Gemm-1;o__;f__;g__;s__       1.000
1111882 k__Bacteria;p__Bacteroidetes;c__Flavobacteriia;o__Flavobacteriales;f__Flavobacteriaceae;g__Flavobacterium;s__   1.000
1111886 k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Sphingomonadales;f__Sphingomonadaceae;g__Kaistobacter;s__       1.000
1111885 k__Bacteria;p__Proteobacteria;c__Deltaproteobacteria;o__Syntrophobacterales;f__Syntrophobacteraceae;g__;s__     0.940
1111879 k__Bacteria;p__Chloroflexi;c__Chloroflexi;o__[Roseiflexales];f__[Roseiflexaceae];g__Roseiflexus;s__     1.000

It is possible that you are running out of memory at the training stage, but the problem is not being reported.  Would you try again after closing all other applications EXCEPT for your system monitor?  While the RDP Classifier is running, look at your memory consumption and let me know the baseline value and the maximum value.

--Kyle

idamoverleir

Apr 8, 2014, 12:58:01 PM
to qiime...@googlegroups.com
Thanks! I'll get back to you in half an hour

idamoverleir

Apr 8, 2014, 1:20:56 PM
to qiime...@googlegroups.com
Hi, 
when I do that, "Active memory" jumps from 219 MB to 900-something MB.

Best

Kyle Bittinger

Apr 8, 2014, 1:24:47 PM
to qiime...@googlegroups.com
Can you confirm that, while running assign_taxonomy.py, it does not go above 900 MB, and never gets close to your full 4 GB of memory?

--Kyle



idamoverleir

Apr 8, 2014, 1:29:17 PM
to qiime...@googlegroups.com
It varies a bit each time I run the command, but for the last five runs it hasn't gone above 888 MB.
I have 1.78 GB free, and it never drops to 0 during the run.

Best
Ida

Kyle Bittinger

Apr 8, 2014, 1:51:14 PM
to qiime...@googlegroups.com
Very suspicious.  Can you send me the output of these 3 commands?

wc -l /Users/service/qiime_tutorial/97_otu_taxonomy.txt
wc -l /Users/service/qiime_tutorial/97_otus.fasta
print_qiime_config.py

--Kyle



idamoverleir

Apr 8, 2014, 1:55:30 PM
to qiime...@googlegroups.com
wc -l /Users/service/qiime_tutorial/97_otu_taxonomy.txt
   55404 /Users/service/qiime_tutorial/97_otu_taxonomy.txt
MacQIIME dhcp984-ans:qiime_tutorial $ wc -l /Users/service/qiime_tutorial/97_otus.fasta
  110808 /Users/service/qiime_tutorial/97_otus.fasta
MacQIIME dhcp984-ans:qiime_tutorial $ print_qiime_config.py

System information
==================
         Platform: darwin
   Python version: 2.7.3 (default, Dec 19 2012, 09:12:08)  [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
Python executable: /macqiime/bin/python

Dependency versions
===================
                     PyCogent version: 1.5.3
                        NumPy version: 1.7.1
                   matplotlib version: 1.1.0
                  biom-format version: 1.3.1
                         qcli version: 0.1.0
                QIIME library version: 1.8.0
                 QIIME script version: 1.8.0
        PyNAST version (if installed): 1.2.2
                      Emperor version: 0.9.3
RDP Classifier version (if installed): rdp_classifier-2.2.jar
          Java version (if installed): 1.6.0_45

QIIME config values
===================
                     blastmat_dir: None
                         sc_queue: all.q
      topiaryexplorer_project_dir: None
     pynast_template_alignment_fp: /macqiime/greengenes/core_set_aligned.fasta.imputed
                  cluster_jobs_fp: /macqiime/QIIME/bin/start_parallel_jobs.py
pynast_template_alignment_blastdb: None
assign_taxonomy_reference_seqs_fp: /macqiime/greengenes/gg_13_8_otus/rep_set/97_otus.fasta
                     torque_queue: friendlyq
   template_alignment_lanemask_fp: /macqiime/greengenes/lanemask_in_1s_and_0s
                    jobs_to_start: 1
                cloud_environment: False
                qiime_scripts_dir: /macqiime/QIIME/bin/
            denoiser_min_per_core: 50
                      working_dir: /tmp/
                    python_exe_fp: /macqiime/bin/python
                         temp_dir: /tmp/
                      blastall_fp: blastall
                 seconds_to_sleep: 60
assign_taxonomy_id_to_taxonomy_fp: /macqiime/greengenes/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt

That was it.

This is highly appreciated!

Ida

Kyle Bittinger

Apr 8, 2014, 2:06:00 PM
to qiime...@googlegroups.com
Can you try your test again, but use the following files for -r and -t respectively?  I will be perplexed if this does not result in some assignments.

/macqiime/greengenes/gg_13_8_otus/rep_set/97_otus.fasta
/macqiime/greengenes/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt




idamoverleir

Apr 8, 2014, 2:22:24 PM
to qiime...@googlegroups.com
Definitely worked, also with the original file.
I'm trying the same for the parallel_assign command right now;
still not working very fast :-)

Thanks

Ida

idamoverleir

Apr 8, 2014, 2:46:39 PM
to qiime...@googlegroups.com
The parallel_assign command has been running for half an hour now.
I think it's the same situation as last night.

Can you see what the issue is here too?

Thanks, 
Ida

Kyle Bittinger

Apr 8, 2014, 3:00:27 PM
to qiime...@googlegroups.com
If there is an issue with the parallel command, but the non-parallel command is working, we will have to track down the parallel issue separately.

Try taking the test command that works with the non-parallel version, and running with the parallel version, but only use 2 cores.

--Kyle



idamoverleir

Apr 8, 2014, 3:07:42 PM
to qiime...@googlegroups.com
I'm not sure how to choose "2 cores".

Ida

Kyle Bittinger

Apr 8, 2014, 3:50:47 PM
to qiime...@googlegroups.com
Use option "-O 2" instead of "-O 24"



BlakeS

Apr 9, 2014, 11:24:10 AM
to qiime...@googlegroups.com
Strange error with the "uc2otutab_mod.py" script as downloaded. Thoughts?

import.im6: unable to open X server `' @ error/import.c/ImportImageCommand/368.

import.im6: unable to open X server `' @ error/import.c/ImportImageCommand/368.

import.im6: unable to open X server `' @ error/import.c/ImportImageCommand/368.

import.im6: unable to open X server `' @ error/import.c/ImportImageCommand/368.

/home/lab/qiime_software/qiime-1.8.0-release/bin/uc2otutab_mod.py: line 6: FileName: command not found

/home/lab/qiime_software/qiime-1.8.0-release/bin/uc2otutab_mod.py: line 17: syntax error near unexpected token `('

/home/lab/qiime_software/qiime-1.8.0-release/bin/uc2otutab_mod.py: line 17: `def GetSampleId(Label): ' 

Kyle Bittinger

Apr 9, 2014, 11:32:34 AM
to qiime...@googlegroups.com
Sounds like the script is being run by bash, not as a Python script.  Try running it as

python /full/path/to/uc2otutab_mod.py



BlakeS

Apr 9, 2014, 2:07:21 PM
to qiime...@googlegroups.com
Yep, noticed my mistake after posting. Thought I'd make my life easier by scripting out all the commands. 

Thank you!

Matt Wade

Apr 14, 2014, 11:25:56 AM
to qiime...@googlegroups.com

Hi,

I'm using Mike's pipeline to process some samples.

The samples come from 3 plates (454) with multiplexing. Repeated barcodes are apparent, so after completing the strip-barcodes script I concatenate the reads.fa files into a single fasta file and continue.

Everything goes well and no errors.

However, when using the original uc2otutab.py script, the taxonomy does not seem to get appended to the biom file. I suspected this had something to do with the naming of the OTUIDs. 

I then tried with the modified script (uc2otutab_mod.py) and the whole process took much much longer and generated a massive OTU table with many repeating columns (ext230; barcodelabel=B7, for example repeated). 

Because the barcode labels differ even for duplicate barcodes (A7, B7), I assume that is not the problem, but why is the taxonomy not being appended to the OTU table in biom form?

Regards,
Matt

Mike R

Apr 14, 2014, 11:59:32 AM
to qiime...@googlegroups.com
Hi Matt,

Can you post the header information as it appears in the `uc` file? Also, just to confirm: `uc2otutab_mod.py` does not append the taxonomy; it just makes the OTU table in tab-delimited format. You need to run the following to append the taxonomy to your OTU table:

biom convert ...
parallel_assign_taxonomy_rdp.py ...
biom add-metadata ...

How did you demultiplex your data? If you followed the pipeline exactly, then you should _not_ have `barcodelabel=` in your headers (see the example header in my original post). It would help if you could note which steps deviate from the originally posted pipeline. :-)

-Mike

Matt Wade

Apr 14, 2014, 12:28:18 PM
to qiime...@googlegroups.com
Hi Mike,

Yes, I am aware of the functionality of uc2otutab_mod. Using that script basically results in much slower progress, plus a massive OTU table.

I have gone back to the original file and here are some example lines:

H 161 250 100.0 + 0 0 250M ex91;barcodelabel=B7; OTU_162
H 9087 250 100.0 + 0 0 250M ex40;barcodelabel=B18; OTU_9088
H 799 250 100.0 + 0 0 250M ex106;barcodelabel=B19; OTU_800
H 7 250 100.0 + 0 0 250M ex79;barcodelabel=B4; OTU_8
H 31 250 100.0 + 0 0 250M ex104;barcodelabel=B17; OTU_32
H 3023 250 100.0 + 0 0 250M ex153;barcodelabel=B3; OTU_3024
H 7596 250 97.2 + 0 0 250M ex138;barcodelabel=B22; OTU_7597
H 0 250 99.6 + 0 0 250M ex139;barcodelabel=B23; OTU_1
H 2742 250 97.6 + 0 0 250M ex130;barcodelabel=B7; OTU_2743
H 0 250 100.0 + 0 0 250M ex142;barcodelabel=B23; OTU_1


So barcode labels are still present.

I've attached my shell script so you can see the deviations from your pipeline.

Best regards and thanks

matt
uparse_pipe_full_new.sh

Mike R

Apr 14, 2014, 1:33:28 PM
to qiime...@googlegroups.com
Thanks Matt! Hopefully, I can help narrow this issue down.

As you can see in the pipeline posted here, I did not use `convert_fastaqual_fastq.py` or `fastq_strip_barcode_relabel2.py`; I used `split_libraries_fastq.py` (with modified parameter settings) to do the demultiplexing for my Illumina data.

Okay, so your uc file lists this information at the end:
`ex142;barcodelabel=B23; OTU_1`

When I use `split_libraries_fastq.py` to do the demultiplexing instead of the `fastq_strip_barcode_relabel2.py` script, that same information at the end of my uc file looks like this:
`ENDO.O.2.KLNG.20.1_19 MISEQ03:119:000000000-A3N4Y:1:2101:28299:16762 1:N:0:GGTATGACTCA orig_bc=GGTATGACTCA new_bc=GGTATGACTCA bc_diffs=0`

That is what `uc2otutab_mod.py` is expecting, as detailed in the initial post (re-linked above). Assuming nothing else is wrong, I do not know why you'd have issues going about it the way you did. I can only assume something went wonky with the `fastq_strip_barcode_relabel2.py` script? Can you post a snippet of the first few reads resulting from that command? Have you tried the other `fastq_strip_barcode_relabel.py` script?

Also, do the OTU labels in the `tax_assignments.txt` file match those in your OTU tables? I can only assume that the taxonomy is not being appended because the OTU labels do not match between the two files, for whatever reason. If they do not match, then you may have to modify the `uc2otutab.py` script to suit your needs, or try the posted pipeline as-is. If you end up with another method to import UPARSE data into QIIME, by all means post it here. Like I said before, the pipeline that I posted worked best for my use case, so your mileage may vary. :-)

-Mike

Mike R

Apr 14, 2014, 3:47:46 PM
to qiime...@googlegroups.com
Hi Matt, I forgot to mention this updated modification of the pipeline, which I also previously posted in this thread, here.

-Mike

Mike R

Apr 14, 2014, 4:06:48 PM
to qiime...@googlegroups.com
for Illumina data, that is...

Mike R

Apr 14, 2014, 4:22:06 PM
to qiime...@googlegroups.com
Matt,

Sorry, I keep forgetting that you are using 454 data. I wonder if you could just use `split_libraries.py` with the `--record_qual_scores` option in place of the `fastq_strip_barcode_relabel2.py` script, then use convert_fastaqual_fastq.py on the resulting seqs.fna and seqs.qual output. Then you can continue from your "Quality filter, truncate length and convert to fasta" step, and your downstream data should work with the `uc2otutab_mod.py` script, since the output formats of `split_libraries.py` and `split_libraries_fastq.py` should be identical.
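Something like this, perhaps (a sketch; the file names are placeholders, and double-check the exact name of the .qual file that split_libraries.py writes):

split_libraries.py -m mapping.txt -f reads.fna -q reads.qual --record_qual_scores -o sl_out
convert_fastaqual_fastq.py -f sl_out/seqs.fna -q sl_out/seqs_filtered.qual -o fastq_out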

-Mike

Matt Wade

Apr 15, 2014, 8:16:46 AM
to qiime...@googlegroups.com
Hi Mike,

I have successfully modified my script to use split_libraries.py and the modified Python script.

However, assigned taxonomy is still not getting appended.

It is indeed related to the mismatch in IDs between the two files.

A sample line from the taxonomy is:

A4_39776;size=28; k__Bacteria;p__Proteobacteria;c__Deltaproteobacteria;o__Desulfobacterales;f__Desulfobacteraceae 0.960

Whereas my OTU_IDs take the form OTU_1.

I will look at why the OTUID is being incorrectly stripped and used in the assign taxonomy step.

Regards
matt

Matt Wade

Apr 15, 2014, 11:14:33 AM
to qiime...@googlegroups.com
Mike,
Scratch this, I was using the wrong otu file...

It seems to work now.

Thanks again,
Matt

Mike R

Apr 15, 2014, 12:19:35 PM
to qiime...@googlegroups.com
Not a problem Matt. Glad you got it working! :-)

-Mike

DoH

Jul 10, 2014, 2:58:41 PM
to qiime...@googlegroups.com
Hi,

My data from Ion Torrent is in SFF format, which I converted to fastq (not yet demultiplexed). I need to use UPARSE. Here I see that it's demultiplexed fastq that is being used. For my data, I can only run "split_libraries.py" instead of "split_libraries_fastq.py". After demultiplexing the dataset, I got .qual and .fna files as outputs, which I converted to fastq in order to use UPARSE, starting at the following step: "usearch7 -fastq_filter seqs.fastq -fastaout seqs.filtered.fasta -fastq_maxee 0.5 -threads 24". I believe this fastq is now demultiplexed. I continued running the other steps and getting results, but upon reaching the generation of the OTU table (txt), nothing was written to the file. My .uc file has output in it but lacks any headers. If one wants to use UPARSE (then QIIME), where does he/she start from, given that the only things in hand are a fastq file (before demultiplexing, with single reads, so nothing like R1 and R2) and a mapping file? Kindly assist.

Harris



Mike Robeson

unread,
Jul 11, 2014, 9:58:16 AM7/11/14
to qiime...@googlegroups.com
Hi Harris,

Well, here is a pipeline I've used for 454-UPARSE. This should get most people started. Keep in mind much of this is just a quick-and-dirty example, so be sure you are using the UPARSE pipeline as intended. You can read more about it here. It is up to you to decide whether you want to denoise; in the example below I decided to denoise.

Also, this pipeline is not final and _only_ serves as a guideline to the issues you should think through when processing 454 data with UPARSE. Before you use this pipeline, try the default 454-ish pipeline directly from the UPARSE web site. Maybe you can use a hybrid of the pipeline posted here and the one previously linked to.

I am sure others have even better ways of doing the same thing. If so, post them here. I am sure there is lots to add, recommend, and correct here. Also, send any UPARSE-specific questions to the developer.

-Cheers!
-Mike 


@@@@@@@@
@ START pipeline:
@@@@@@@@

### determine quality scores:
# I generated fasta and qual files from sff to make plots
quality_scores_plot.py -q run1.qual -v        # This suggested a truncation point of 318. 
quality_scores_plot.py -q run2.qual -v        # This suggested a truncation point of 333.

### AmpliconNoise commands. Truncate at 318 bp, as recommended by the commands above.
### Add additional split_libraries.py options to your qiime parameters file as needed
### (that is, --min-seq-length or --max-seq-length), as length trimming is important
### for the UPARSE pipeline.
ampliconnoise.py -m Mapping.txt -i run1.sff.txt -o run1/ampnoise_318_1.fna -p qiime_parameters.txt --suppress_perseus --truncate_len=318 --platform=titanium --seqnoise_resolution=25 -n 48 -v 
ampliconnoise.py -m Mapping2.txt -i run2.sff.txt -o run2/ampnoise_318_2.fna -p qiime_parameters.txt --suppress_perseus --truncate_len=318 --platform=titanium --seqnoise_resolution=25 -n 48 -v

### cat all ampliconnoise output
cat run1/ampnoise_318_1.fna run2/ampnoise_318_2.fna > all_318.fna

### sanity check - remove reverse primer  
truncate_reverse_primer.py -z truncate_only -m Mapping.txt -f all_318.fna

### UPARSE pipeline as recommended :
## No need to truncate, as I already did this via ampliconnoise.py based on the quality score plots. I also made use of other split_libraries.py / ampliconnoise.py parameters in 
## my qiime_parameters.txt file. 
## Regardless, UPARSE does not provide a way to trim fasta files as far as I can tell; alternatively, you can use other fasta-processing tools after ampliconnoise.py.
##
## Just some thoughts:
## One possible option is to avoid denoising and use QIIME to trim the fasta and qual files to a specified base with "truncate_fasta_qual_files.py", then merge the fasta and qual 
## files with "convert_fastaqual_fastq.py".
## Then quality filter via UPARSE as per: http://drive5.com/usearch/manual/uparse_cmds.html
## and continue with the following steps (a rough sketch of these commands appears just below this comment block).
## Again, just make sure you trim your data appropriately for UPARSE!
###
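## An untested sketch of that alternative path (the filenames, truncation point, and maxEE threshold are placeholders; check the exact output names QIIME writes on your system):
# truncate_fasta_qual_files.py -f seqs.fna -q seqs.qual -b 318 -o trimmed/
# convert_fastaqual_fastq.py -f trimmed/seqs_filtered.fasta -q trimmed/seqs_filtered.qual -o fastq/
# usearch7 -fastq_filter fastq/seqs_filtered.fastq -fastq_maxee 1.0 -fastaout seqs_maxee1.fasta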

## dereplicate the reverse-primer-truncated data
usearch7 -derep_fulllength all_318_truncated.fna -output all_318_truncated_derep.fasta -sizeout 

### filter singletons
usearch7 -sortbysize all_318_truncated_derep.fasta -output all_318_truncated_derep_mc2.fasta -minsize 2

### cluster OTUs
usearch7 -cluster_otus all_318_truncated_derep_mc2.fasta -otus all_318_truncated_derep_mc2_rep_set.fasta

### reference chimera checking
### Set -minh based on your data; occasionally the default -minh setting can be too aggressive.
### In this example I used 1.5. Again, know your data!
### I do not like the gold database as it routinely discards highly abundant valid data (for me anyway).
### GreenGenes is becoming a 'cleaner' database over time anyway.
usearch7 -uchime_ref  all_318_truncated_derep_mc2_rep_set.fasta -minh 1.5 -db gg_13_8_otus/rep_set/97_otus.fasta -strand plus -nonchimeras all_318_truncated_derep_mc2_rep_set_nonchimeric.fasta -threads 24
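### To inspect what gets flagged, you can also write the chimeras out for BLAST checking,
### e.g. by adding: -chimeras chimeras.fasta -uchimeout results.uchime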

### label OTUs
python /data1/qiime_software/usearch7.0.1001_i86linux64/drive5_py/fasta_number.py all_318_truncated_derep_mc2_rep_set_nonchimeric.fasta OTU_ > all_318_truncated_derep_mc2_rep_set_nonchimeric_OTUs.fasta
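### (fasta_number.py renumbers the sequences sequentially, so the resulting headers look like '>OTU_1', '>OTU_2', ...)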

### Map original reads back to OTUs
usearch7 -usearch_global all_318_truncated.fna -db all_318_truncated_derep_mc2_rep_set_nonchimeric_OTUs.fasta -strand plus -id 0.97 -uc otu_map.uc -threads 24

### make otu table using modified uparse python script
python uc2otutab_mod.py otu_map.uc > all_318_truncated_derep_mc2_rep_set_nonchimeric_OTU_Table.txt

### convert to biom format
biom convert --table-type="otu table" -i all_318_truncated_derep_mc2_rep_set_nonchimeric_OTU_Table.txt -o all_318_truncated_derep_mc2_rep_set_nonchimeric_OTU_Table.biom
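### note: the exact flags depend on your biom-format version; newer releases spell the
### table type differently (e.g. --table-type "OTU table") and add flags such as --to-json,
### so check `biom convert --help` for your install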

### assign taxonomy and add it to the biom table
parallel_assign_taxonomy_uclust.py -v -O 48 -t gg_13_8_otus/taxonomy/97_otu_taxonomy.txt -r gg_13_8_otus/rep_set/97_otus.fasta -i all_318_truncated_derep_mc2_rep_set_nonchimeric_OTUs.fasta  -o uclust_assigned_taxonomy
biom add-metadata --sc-separated taxonomy --observation-header OTUID,taxonomy --observation-metadata-fp uclust_assigned_taxonomy/all_318_truncated_derep_mc2_rep_set_nonchimeric_OTUs_tax_assignments.txt -i all_318_truncated_derep_mc2_rep_set_nonchimeric_OTU_Table.biom -o all_318_truncated_derep_mc2_rep_set_nonchimeric_wtax_OTU_Table.biom

@@@@@@@@
@ END pipeline:
@@@@@@@@

DoH

unread,
Jul 23, 2014, 12:38:06 PM7/23/14
to qiime...@googlegroups.com
Hi,

I have checked for chimeras using both the gold.fa db and the Greengenes (gg) db (13_8 release) and observed differences. Gold detected 6 chimeras whereas gg detected 10. Of the 6 detected by Gold, 5 were also detected by gg. I BLAST-checked each chimera, looking at the e-values, sequence coverages and identities. Most chimeras flagged by gg were identified as uncultured bacterium clones (with variable sequence identities ranging 94-99%). From these results, is there any chance that I could discard novel bacteria if I opt to use gg (assuming the sequences are not actually chimeric)? I assume that a novel bacterium will have less than 100% sequence identity (maybe 97%) when BLASTed.

Kindly enlighten...
Harris.

On Wednesday, 11 December 2013 01:30:54 UTC+2, Mike Robeson wrote:
Hi Colleen,

The -v option is just to set the verbose option for the script. Just a habit of mine, especially since some of the scripts do not print anything anyway. :-)

Since I used joined paired-ends, the quality scores are quite good throughout the read, so I just left --max_bad_run_length and --min_per_read_length_fraction as is. Though if we wanted to be strict, I'd suspect we'd just set the following (an example command appears after the list):

--max_bad_run_length : 250 (for MiSeq; alternatively, set it to the average size of your fragments, or close to it; just make it large)
--min_per_read_length_fraction : set very low (0.01) or to 0.0
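For example, a minimally-filtering demultiplexing command might look something like this (file names and the barcode type are placeholders for your own data):

split_libraries_fastq.py -i reads.fastq -b barcodes.fastq -m Mapping.txt -o sl_out/ -q 0 -r 250 -p 0.01 --barcode_type golay_12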


On another note, be _very_ wary of using the gold.fa file as your reference database when running UPARSE with default settings. I've recently run into a case where the uchime_ref command removed some of the most abundant reads of a sample. That is, it was a read we expected to be there and was supposed to make up the majority of the sample. uchime_ref removed this OTU (and other OTUs that we know should have been in the sample) and skewed our results heavily. This issue went away either when I opted to use the 13_8 Greengenes database or when I used the gold.fa database with the -minh flag of uchime_ref set to 1.5-2.5 (the default is 0.28). In a nutshell, the gold.fa database is not as expansive as Greengenes and may cause many valid reads to be discarded; however, Greengenes may itself contain chimeras. Either way, I'd suggest printing out the chimeras to file when using uchime_ref and playing around with the -minh setting. I also recommend BLAST-checking the chimeras no matter which reference database you use, as all these methods have a small false-positive rate when detecting chimeras.

UPARSE is good, but more attention needs to be paid to the user options. I also wish there were an option to disable the de novo chimera checking. I am still validating UPARSE for my analyses and comparing it to the QIIME pipeline. So, if others have experience with this, please let us know. :-)

-Mike

On Tuesday, December 10, 2013 5:12:05 PM UTC-5, Colleen wrote:
One additional question - for Mike, or anyone really.  If we are trying to just get the sequence headers or labels into the correct format to use UPARSE for OTU picking with the split_libraries_fastq.py step (so no quality filtering in QIIME), is it also necessary to adjust the -r (--max_bad_run_length) and -p (--min_per_read_length_fraction) parameters to a minimum so that no quality filtering is done in QIIME?  Or, since we are using a q of 0, are the QIIME defaults for the -r and -p parameters irrelevant?

Thanks!

colleen

On Tuesday, December 10, 2013 11:04:59 AM UTC-8, Colleen wrote:
Hi Mike,
I am also trying to use UPARSE OTU picking and then use that data for downstream analyses in QIIME.  I was struggling with getting my MiSeq data into the correct format to use in USEARCH, so I plan to use the split_libraries_fastq.py script in QIIME, like you apparently did.  I have a quick question about the command you show below.  What is the -v flag for in your split_libraries_fastq.py command line below?  I don't see it as one of the options to pass to split_libraries_fastq.py...

Thanks!

colleen

zhenjiang zech xu

unread,
Jul 23, 2014, 1:04:14 PM7/23/14
to qiime-forum
Hi Harris,

I am not sure I understand your question. The criteria for chimera identification and novel OTU clustering are different. To identify a chimera, many algorithms match segments of a read against the database to see whether two segments each match well to different 16S sequences. So you can have a read from a novel OTU that has lower similarity to any known 16S but will not be predicted to be a chimera.
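Just to illustrate the idea, here is a toy Python sketch (purely illustrative; this is not how UCHIME actually scores chimeras):

# Toy reference-based chimera check: flag a read as a candidate when the best
# database hit for its left half differs from the best hit for its right half.
def best_hit(fragment, db, k=8):
    # db is a dict of {seq_id: sequence}; score each entry by shared k-mers
    kmers = {fragment[i:i+k] for i in range(len(fragment) - k + 1)}
    def score(seq):
        return sum(seq[i:i+k] in kmers for i in range(len(seq) - k + 1))
    return max(db, key=lambda sid: score(db[sid]))

def is_chimera_candidate(read, db):
    mid = len(read) // 2
    return best_hit(read[:mid], db) != best_hit(read[mid:], db)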

Zech



DoH

unread,
Jul 23, 2014, 1:15:52 PM7/23/14
to qiime...@googlegroups.com
Hi Zech,

Thanks for your response. Between Gold.fa database and Greengenes (13_8), which is the best to use when it comes to chimera checking? Due to the differences in the performance of the two databases, I do not want to accidentally/blindly discard nonchimeras. Gold detected fewer (6) chimeras than gg (10). Only 1 chimera detected by Gold was not detected by gg.

Thanks,
Harris.

zhenjiang zech xu

unread,
Jul 23, 2014, 1:46:52 PM7/23/14
to qiime-forum
As far as I know, there are no clear winners among the databases or methods for identifying chimeras. gg is a larger database than gold.fa; that is probably why more chimeras were identified with gg. If you wanna be aggressive about removing chimeras, then I would say use the result from gg. In many cases, I don't think your choice will have a big impact.

DoH

unread,
Jul 24, 2014, 3:05:59 AM7/24/14
to qiime...@googlegroups.com
Thanks for the information.

DoH

unread,
Jul 25, 2014, 10:50:52 AM7/25/14
to qiime...@googlegroups.com

Hi QIIMErs,

 

I am using both UPARSE and QIIME for my data analyses. I need to remove sequences that have failed alignment (I have BLASTed them and realized that they are only human sequences). I feel like UPARSE does not have that option (of excluding user-specified sequences). So after creating an OTU text table with UPARSE (uc2otutab.py), I used QIIME’s “make_otu_table.py” to make an OTU biom table with the sequences that failed alignment removed. Here I turned on the “-e” (--exclude_otus_fp) option. I got a biom file, which I compared with the one generated by running “biom convert”, “assign_taxonomy.py” and “biom add-metadata”. QIIME’s biom file looks pretty ok, but it has additional information in it that appears to affect the “biom summarize-table” command. The output of “biom summarize-table” appears wrong, so the sub-sampling depth cannot be determined from it.

 

Kindly assist me with how to exclude sequences failing alignment (so that the biom table is “clean”) for downstream processes. My current input is out.table.txt from UPARSE, but it appears slightly incompatible with QIIME’s “make_otu_table.py”.

 

Thanks,

Harris.

zhenjiang zech xu

unread,
Jul 25, 2014, 7:49:43 PM7/25/14
to qiime-forum

DoH

unread,
Jul 26, 2014, 7:54:55 AM7/26/14
to qiime...@googlegroups.com
Hi Zech,

Great suggestion. I made a .txt file containing only the IDs of the sequences that had failed to align. Then I used the script that you suggested, and it WORKED!!! THANKS.
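For anyone else who hits this: one way to do it in QIIME is filter_otus_from_otu_table.py with an exclusion list. A minimal sketch, assuming the failed IDs are in failed_align_ids.txt (one OTU ID per line) and that this matches the script suggested above:

filter_otus_from_otu_table.py -i otu_table.biom -o otu_table_filtered.biom -e failed_align_ids.txt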

Harris.

zhenjiang zech xu

unread,
Jul 26, 2014, 3:35:10 PM7/26/14
to qiime-forum
glad to hear it worked.

-Zech

Soledad Benitez

unread,
Jun 3, 2015, 1:14:20 PM6/3/15
to qiime...@googlegroups.com
Mike,

Thanks for posting this modified script. I have been doing a similar processing pipeline for both 454 and Illumina data, using a text editor's grep function and/or sed to modify the labels and add the barcodelabel= field. This just made my life easier! 

Soledad

On Monday, March 17, 2014 at 10:15:33 PM UTC-5, Mike Robeson wrote:
I mentioned the details of the `uc2otutab.py` script changes along with the original pipeline post earlier in this thread, here. Just open `uc2otutab.py` in any raw text editor and replace

these lines:

def GetSampleId(Label):
    Fields = Label.split(";")
    for Field in Fields:
        if Field.startswith("barcodelabel="):
            return Field[13:]
    die.Die("barcodelabel= not found in read label '%s'" % Label)

with these:

def GetSampleId(Label): 
    SampleID = Label.split()[0].split('_')[0] 
    return SampleID 
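With read labels in the split_libraries_fastq.py style, this simply takes everything before the first underscore, e.g.:

>>> GetSampleId('Sample1_12345 orig_bc=ACGTACGT new_bc=ACGTACGT bc_diffs=0')
'Sample1'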

I've attached the  `uc2otutab_mod.py` file to this post. 

-Mike

On Monday, March 17, 2014 10:39:22 PM UTC-4, Carly wrote:
Hi Mike,

Thank you for the pipeline. This has been very helpful. I am at the 'uc2otutab.py' command, and have run into an issue. It is that I do not know how to modify a command. You state you renamed the script 'uc2otutab_mod.py' so that the issue of 

### uc2otutab.py otu.map.uc

###**ERROR** barcodelabel= not found in read label '38B4_2'

does not happen. But how do I do that -- modify the command that is? At this point in my career I can only write commands, not modify them! 

Any help would be greatly appreciated. I am actually merging four 454 Junior runs together and using QIIME to demultiplex at first is very helpful. 

Thanks!

Carly





On Monday, February 24, 2014 9:10:34 AM UTC-5, Mike R wrote:
Hi Serena,

Not a problem. The split_libraries_fastq.py script can work on either single or paired-end reads. Just skip the "usearch7 -fastq_mergepairs" / "join_paired_ends.py" step and start directly with split_libraries_fastq.py. Also, the "barcodelabel=" is added to your data by the uparse python scripts  "fastq_strip_barcode_relabel.py" and/or "fastq_strip_barcode_relabel2.py". So, simply do not run these scripts on your raw data. Just take the raw data that you received from the sequencing facility and send it directly to split_libraries_fastq.py, then through the rest of the pipeline. 

-Mike

On Sunday, February 23, 2014 10:00:25 AM UTC-5, Serena Thomson wrote:
Hi Mike,

Really appreciate you taking the time to respond to me personally, thank you. 
I can't follow your pipeline exactly, because I don't have paired-end data. I am analysing single reads. Therefore I can't run split_libraries_fastq.py, but I will try the standard split_libraries.py command and see if I get this error.

Ideally I just need to modify the uc2otutab script so that it doesn't look for the barcodelabel= field.

Thanks again

Serena


On Thu, Feb 20, 2014 at 7:36 PM, Mike R <soilbd...@gmail.com> wrote:
Serena,
I forgot to mention: you can also simplify things by making use of join_paired_ends.py (with the -b flag set) instead of the `usearch7 -fastq_mergepairs` and `remove_unused_barcodes.py` steps (the first two steps) I initially posted. So, just run `join_paired_ends.py` followed by `split_libraries_fastq.py`, then move on to the rest of the pipeline (a sketch follows).
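A quick sketch of that simplified front end (file names are placeholders, and the joined-file names assume the default fastq-join method):

join_paired_ends.py -f R1.fastq -r R2.fastq -b I1.fastq -o joined/
split_libraries_fastq.py -i joined/fastqjoin.join.fastq -b joined/fastqjoin.join_barcodes.fastq -m Mapping.txt -o sl_out/ -q 0 --barcode_type golay_12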

-Mike

On Thursday, February 20, 2014 12:28:57 PM UTC-7, Mike R wrote:
Hi Serena,

Did you follow my pipeline from the very beginning and only use the commands I posted? If so, then the "barcodelabel= not found" error should not arise when using the pipeline exactly as posted. In fact, the commands I posted precisely circumvent this issue, as I am not using `fastq_strip_barcode_relabel2.py` (my data were not in a compatible format for that script). Instead, I circumvent `fastq_strip_barcode_relabel2.py` by using `split_libraries_fastq.py` to demultiplex my data. This is why I modified `uc2otutab.py` the way I did. That modification will only work if you demultiplex via `split_libraries_fastq.py`, hence the reason for my posting an example of the OTU header I had to parse. If you notice, the OTU header is in the format you'd generally expect as output from `split_libraries_fastq.py` (e.g. "SampleID_SeqCount<whitespace>OtherInfo").

So, unless there are some peculiarities with your format, or I am missing something, the pipeline I posted should work. My post was just an initial "quick-and-dirty" stab in which I was able to get my data from UPARSE to QIIME with minimal effort. I would love to hear how others coerced their data from UPARSE into QIIME, especially considering the myriad formats of our respective data. :-)

Does this help?
-Mike

On Thursday, February 20, 2014 7:36:19 AM UTC-7, Serena Thomson wrote:
Hi Mike

I am failing to see how your modification to the Sample_ID part of uc2otutab.py gets around the issue of the error it spits out when barcodelabel= is not found. I noticed you don't use the fastq_strip_barcode_relabel2.py script either, so I'm not sure how you managed this. 

Serena


baoanxh2006

unread,
Dec 22, 2015, 4:39:32 AM12/22/15
to Qiime 1 Forum
Dear Mike and Qiime Users,

I am sorry for suddenly jumping into this thread!

I do not have a computational background, but I want to improve the analysis of my paired-end Illumina amplicon data using QIIME and UPARSE. I use the QIIME 1.9.1 VirtualBox image and have also installed usearch and usearch61.
 
I used usearch61 and followed your process smoothly until the step of calling the python script "python fasta_number.py seqs.filtered.derep.mc2.repset.nochimeras.fasta OTU_ > seqs.filtered.derep.mc2.repset.nochimeras.OTUs.fasta", which fails with the error "python: can't open file 'fasta_number.py': [Errno 2] No such file or directory".
Is it possible for me to implement your process in the QIIME virtual box? If yes, how can I call the python script there?

I have not done UPARSE before, so please advise me: 1) how do I prepare otu.map.uc? 2) can I modify the function 'GetSampleID' in the script 'uc2otutab.py' and rename the script 'uc2otutab_mod.py' in the QIIME virtual box, and if so, how?

For downstream diversity analysis, could you please advise me on how to produce a phylogenetic tree?

I am looking forward to your advice!

Merry Christmas & wish you a very happy new year!

Best regards,
An

Colin Brislawn

unread,
Dec 22, 2015, 1:02:42 PM12/22/15
to Qiime 1 Forum, Yoshiki Vázquez Baeza
Hi An,

Can you post a new thread with your questions? Mike's answer is great, but it is two years old! A lot has changed since then. 

Yoshiki, can we lock this thread?

Thanks!
Colin