Similarity Matrix from QIIME


Kang Jin Kim

Oct 31, 2016, 10:50:30 AM
to Qiime 1 Forum
I built an OTU table from a FASTA file using QIIME with the SILVA release 123 database and uclust at 97% identity.

I used closed-reference OTU picking, so I'm sure that QIIME calculated similarities among the OTUs, which come from the reference sequence file: rep_set/rep_set_16S_only/97/97_otus_16S.fasta

What I would like to know is how to get that similarity matrix directly.
If that is not possible, is there another way to obtain it, for example by calculating the similarity matrix from phylogenetic tree distances?

Kangjin Kim

Seoul National University, Korea

Jamie Morton

Oct 31, 2016, 12:02:45 PM
to Qiime 1 Forum
Hi Kangjin,

QIIME doesn't support pairwise similarity matrices for sequences, but this can be done using scikit-bio.

Let me know how that works out.


Colin Brislawn

Oct 31, 2016, 12:42:21 PM
to Qiime 1 Forum
Hello Kangjin,

You could also use the program vsearch to do this.
Using conda: conda install vsearch -c biocore

The following command should write the list of pairwise hits to the .uc file, which you could then parse into a table.
vsearch --allpairs_global 97_otus_16S.fasta --id 0.1 --alnout 97_otus_16S_allpairs.aln --uc 97_otus_16S_allpairs.uc


Kang Jin Kim

Oct 31, 2016, 1:31:39 PM
Colin Brislawn,

Thank you so much for the fast reply. But it looks like the vsearch command gives uclust-based representative sequences of the clusters in 97_otus_16S_allpairs.aln.
That is not what I wanted.

I already have representative sequences. They are the reference sequences from the SILVA database, "rep_set/rep_set_16S_only/97/97_otus_16S.fasta", because I picked OTUs with the closed-reference method.
What I want is to calculate a similarity matrix (not an identity matrix, as in uclust!) from these reference sequences.

Kang Jin Kim

Oct 31, 2016, 1:37:41 PM
Jamie Morton, thank you for your considerate reply.

But its output is a distance matrix, and Hamming distance is related to identity, not similarity.
I'm talking about pairwise sequence similarity, as computed by the Gotoh or Smith-Waterman algorithms, for example.
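
For readers unfamiliar with the distinction drawn here, the following is a minimal sketch of a Smith-Waterman local alignment score in plain Python. It is not taken from QIIME, scikit-bio, or vsearch. The function name and the scoring values (match, mismatch, gap) are assumptions chosen for illustration, and it uses a linear gap penalty rather than the affine penalty of the Gotoh algorithm.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of a and b.

    Scoring values are illustrative assumptions; the gap penalty is
    linear (each gap column costs `gap`), not affine as in Gotoh.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP table
    best = 0
    for ch_a in a:
        curr = [0]  # first column is always 0 for local alignment
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

The returned value is a raw score that depends on the chosen scoring scheme, unlike a percent identity, which only counts matching columns.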

Colin Brislawn

Oct 31, 2016, 3:31:38 PM
to Qiime 1 Forum
Hello Kang,

This vsearch function, --allpairs_global, aligns every sequence with every other sequence and then writes the results in uc format. For example, I have my OTUs labeled as OTU_<number>, and I get this result from --allpairs_global:

H       3       253     89.3    +       0       0       253M    OTU_1   OTU_4
H       7       253     81.8    +       0       0       253M    OTU_1   OTU_8
H       1       253     75.5    +       0       0       253M    OTU_1   OTU_2
H       5       253     74.7    +       0       0       253M    OTU_1   OTU_6
H       9       253     73.9    +       0       0       253M    OTU_1   OTU_10
H       6       253     72.7    +       0       0       253M    OTU_1   OTU_7
H       4       253     71.1    +       0       0       253M    OTU_1   OTU_5
H       8       253     70.8    +       0       0       253M    OTU_1   OTU_9
H       2       253     68.0    +       0       0       253M    OTU_1   OTU_3

This would be the first row of the similarity table for OTU_1.
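
As a rough sketch of how such records could be turned into a table, the Python below collects the "H" lines into a symmetric lookup and then a square matrix. The function names are hypothetical, the column positions (identity in field 4, query and target labels in fields 9 and 10) are read off the example rows above, and it assumes labels contain no whitespace when the file is not tab-separated.

```python
from collections import defaultdict


def parse_uc_hits(lines):
    """Collect percent identities from 'H' records of a .uc file."""
    sim = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\n")
        # .uc files are normally tab-separated; fall back to whitespace.
        fields = line.split("\t") if "\t" in line else line.split()
        if len(fields) < 10 or fields[0] != "H":
            continue  # skip non-hit records and malformed lines
        identity = float(fields[3])
        query, target = fields[8], fields[9]
        # Store both directions so the table is symmetric.
        sim[query][target] = identity
        sim[target][query] = identity
    return sim


def to_matrix(sim):
    """Return sorted labels and a square matrix (None where no hit exists)."""
    labels = sorted(sim)
    matrix = [
        [100.0 if a == b else sim[a].get(b) for b in labels]
        for a in labels
    ]
    return labels, matrix
```

For example, `labels, matrix = to_matrix(parse_uc_hits(open("97_otus_16S_allpairs.uc")))` would read the file produced by the command earlier in the thread. Pairs below the --id threshold are not reported by vsearch and therefore appear as None.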

VSEARCH makes several notable improvements over USEARCH, one of which is exact alignments (same as Needleman-Wunsch). It also supports the --iddef flag, so you can choose which identity definition it reports from the following list.
--iddef 0|1|2|3|4
Change the pairwise identity definition used in --id. Values accepted are:
  0. CD-HIT definition: (matching columns) / (shortest sequence length).
  1. edit distance: (matching columns) / (alignment length).
  2. edit distance excluding terminal gaps (same as --id).
  3. Marine Biological Lab definition counting each extended gap (internal or terminal) as a single difference: 1.0 - [(mismatches + gaps)/(longest sequence length)]
  4. BLAST definition, equivalent to --iddef 2 in a context of global pairwise

Do any of those definitions sound like the metric you are looking for?
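
To make the first three definitions concrete, here is a small Python sketch that evaluates them on an already aligned pair of sequences (gaps written as "-"). It is an illustration of the formulas as quoted above, not vsearch's own code. The function name is hypothetical, and treating "terminal gaps" as the leading and trailing columns in which either sequence has a gap is my reading of definition 2.

```python
def identity_definitions(aln_a, aln_b):
    """Return identities for --iddef 0, 1 and 2 from two aligned strings."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    columns = list(zip(aln_a, aln_b))
    # A matching column has the same residue in both rows (never a gap).
    matches = sum(1 for x, y in columns if x == y and x != "-")
    len_a = sum(1 for x in aln_a if x != "-")
    len_b = sum(1 for y in aln_b if y != "-")
    # Trim leading and trailing columns that contain a gap (assumed
    # meaning of "terminal gaps").
    start = 0
    while start < len(columns) and "-" in columns[start]:
        start += 1
    end = len(columns)
    while end > start and "-" in columns[end - 1]:
        end -= 1
    return {
        0: matches / min(len_a, len_b),  # CD-HIT: shortest sequence length
        1: matches / len(columns),       # full alignment length
        2: matches / (end - start),      # alignment length minus terminal gaps
    }
```

For the pair "--ACGTAC" and "TTACGAAC", five columns match, so definition 1 gives 5/8 while definitions 0 and 2 both give 5/6.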

