Similarity Matrix from QIIME

Kang Jin Kim

Oct 31, 2016, 10:50:30 AM

to Qiime 1 Forum

I generated an OTU table from a FASTA file using QIIME with the SILVA 123 release database and uclust at 97% identity.

I used closed-reference OTU picking, so I'm sure QIIME calculated the similarity among the OTUs, which come from the reference sequence file: rep_set/rep_set_16S_only/97/97_otus_16S.fasta

What I wonder is how I can get that similarity matrix directly.

If that is impossible, is there another way to do it, for example by calculating the similarity matrix from phylogenetic tree distances?

Kangjin Kim

Seoul National University, Korea

Jamie Morton

Oct 31, 2016, 12:02:45 PM

to Qiime 1 Forum

Hi Kangjin,

QIIME doesn't support pairwise similarity matrices for sequences, but this can be done using scikit-bio.

I'd check out skbio.sequence.distance and DistanceMatrix.from_iterable.
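As a rough illustration of what that route produces, here is a stdlib-only sketch that computes the proportion of mismatched positions (a normalised Hamming distance) for every pair of sequences. It does not call scikit-bio itself; the function names and toy sequences are made up for the example, and it assumes the sequences are already aligned to equal length.

```python
from itertools import combinations


def hamming_fraction(a, b):
    """Proportion of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length (aligned) sequences")
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs):
    """Build a symmetric {id: {id: distance}} table from an {id: sequence} dict."""
    ids = list(seqs)
    matrix = {i: {j: 0.0 for j in ids} for i in ids}
    for i, j in combinations(ids, 2):
        d = hamming_fraction(seqs[i], seqs[j])
        matrix[i][j] = matrix[j][i] = d
    return matrix


# Toy, pre-aligned sequences (hypothetical, not taken from the SILVA reference set).
toy = {"seq1": "ACGTACGT", "seq2": "ACGTACGA", "seq3": "TCGTACGA"}
dm = distance_matrix(toy)
```

Here dm["seq1"]["seq3"] is the fraction of differing columns between seq1 and seq3. Unaligned reference sequences of different lengths would need to be aligned first.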

Let me know how that works out.

Best,

Jamie

Colin Brislawn

Oct 31, 2016, 12:42:21 PM

to Qiime 1 Forum

Hello Kangjin,

You could also use the program vsearch to do this.

GitHub: https://github.com/torognes/vsearch/releases

Using conda: conda install vsearch -c biocore

The following command should give you a list of pairwise identities (in the .uc file), which you could then parse into a table.

vsearch --allpairs_global 97_otus_16S.fasta --id 0.1 --alnout 97_otus_16S_allpairs.aln --uc 97_otus_16S_allpairs.uc

Colin

Kang Jin Kim

Oct 31, 2016, 1:31:39 PM

to qiime...@googlegroups.com

Colin Brislawn,

Thank you so much for the fast reply. But, judging from 97_otus_16S_allpairs.aln, it looks like the vsearch command gives uclust-based representative sequences of clusters.

This is not what I wanted.

I already have representative sequences. They are the reference sequences of the SILVA database, "rep_set/rep_set_16S_only/97/97_otus_16S.fasta", because I picked OTUs with the closed-reference method.

What I want to do is calculate a similarity matrix (not identity, as in uclust!) from these reference sequences.

Kang Jin Kim

Oct 31, 2016, 1:37:41 PM

to qiime...@googlegroups.com

Jamie Morton, thank you for your considerate reply.

But the output of that is a distance matrix, and Hamming distance is related to identity, not similarity.

I'm talking about pairwise sequence similarity, as computed by the Gotoh or Smith-Waterman algorithms or similar.
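To make the distinction concrete, here is a minimal sketch of a Smith-Waterman local alignment score. The match, mismatch and gap values are arbitrary placeholders, and the sketch uses a simple linear gap penalty rather than the affine gaps of Gotoh's algorithm; it returns a raw score, not a normalised similarity.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for x in a:
        curr = [0]  # first column is always 0 for local alignment
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # A cell never drops below 0, which is what makes the alignment local.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Hypothetical toy sequences sharing the local motif "ACG".
score = smith_waterman_score("TTACGTTT", "GGACGGG")
```

Unlike percent identity, the result depends on the scoring scheme, so two pairs with the same identity can receive different similarity scores.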

Colin Brislawn

Oct 31, 2016, 3:31:38 PM

to Qiime 1 Forum

Hello Kang,

This vsearch function, --allpairs_global, aligns every sequence with every other sequence and then writes the results in uc format. For example, my OTUs are labeled OTU_<number>, and I get this result from --allpairs_global:

H 3 253 89.3 + 0 0 253M OTU_1 OTU_4
H 7 253 81.8 + 0 0 253M OTU_1 OTU_8
H 1 253 75.5 + 0 0 253M OTU_1 OTU_2
H 5 253 74.7 + 0 0 253M OTU_1 OTU_6
H 9 253 73.9 + 0 0 253M OTU_1 OTU_10
H 6 253 72.7 + 0 0 253M OTU_1 OTU_7
H 4 253 71.1 + 0 0 253M OTU_1 OTU_5
H 8 253 70.8 + 0 0 253M OTU_1 OTU_9
H 2 253 68.0 + 0 0 253M OTU_1 OTU_3

This would be the first row of the similarity table for OTU_1.
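Parsing records like these into a table could look roughly like the sketch below. It assumes the column layout seen in the example (record type first, percent identity fourth, and the two labels in the last two of ten columns) and labels without embedded whitespace; uc_to_matrix is a made-up helper name, not part of vsearch or QIIME.

```python
def uc_to_matrix(lines):
    """Collect 'H' (hit) records into a symmetric {label: {label: percent identity}} table."""
    matrix = {}
    for line in lines:
        # Real .uc files are tab-separated; fall back to any whitespace for pasted text.
        fields = line.rstrip("\n").split("\t") if "\t" in line else line.split()
        if len(fields) < 10 or fields[0] != "H":
            continue  # skip blank lines and non-hit records
        identity, query, target = float(fields[3]), fields[8], fields[9]
        matrix.setdefault(query, {})[target] = identity
        matrix.setdefault(target, {})[query] = identity
    for label in matrix:
        matrix[label][label] = 100.0  # assumption: a sequence is 100% identical to itself
    return matrix


# First three records from the example output above.
sample = [
    "H 3 253 89.3 + 0 0 253M OTU_1 OTU_4",
    "H 7 253 81.8 + 0 0 253M OTU_1 OTU_8",
    "H 1 253 75.5 + 0 0 253M OTU_1 OTU_2",
]
table = uc_to_matrix(sample)
```

For a real run, the lines would come from the .uc file instead, for example by passing an open file handle for 97_otus_16S_allpairs.uc. Pairs missing from the file simply stay absent from the table.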

VSEARCH makes several notable improvements over USEARCH, one of which is optimal global alignment by full dynamic programming (Needleman-Wunsch) rather than heuristic alignment. It also supports the --iddef flag, so you can choose which identity definition it reports in this list.

--iddef 0|1|2|3|4

Change the pairwise identity definition used in --id. Values accepted are:

0. CD-HIT definition: (matching columns) / (shortest sequence length).
1. edit distance: (matching columns) / (alignment length).
2. edit distance excluding terminal gaps (same as --id).
3. Marine Biological Lab definition counting each extended gap (internal or terminal) as a single difference: 1.0 - [(mismatches + gaps)/(longest sequence length)]
4. BLAST definition, equivalent to --iddef 2 in a context of global pairwise alignment.
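To see how these definitions differ on the same alignment, here is a small sketch that evaluates definitions 0 to 3 on two gapped alignment rows. It is one reading of the help text quoted above, not vsearch's own code: "terminal gaps" is taken to mean leading and trailing columns that contain a gap, and definition 4 is left out because the help text describes it as equivalent to another one. The function name and the toy alignment are hypothetical, and both rows are assumed to contain at least one residue.

```python
def identity_definitions(row_a, row_b, gap="-"):
    """Return {0: ..., 1: ..., 2: ..., 3: ...} for one gapped pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("both alignment rows must have the same number of columns")
    cols = list(zip(row_a, row_b))
    matches = sum(1 for x, y in cols if x == y and x != gap)
    mismatches = sum(1 for x, y in cols if x != y and gap not in (x, y))
    len_a = len(row_a) - row_a.count(gap)  # ungapped sequence lengths
    len_b = len(row_b) - row_b.count(gap)

    # Terminal gaps: leading and trailing columns in which either row has a gap.
    start, end = 0, len(cols)
    while start < end and gap in cols[start]:
        start += 1
    while end > start and gap in cols[end - 1]:
        end -= 1

    # Each run of consecutive gap characters in a row counts as one difference.
    gap_runs = 0
    for row in (row_a, row_b):
        previous_was_gap = False
        for ch in row:
            if ch == gap and not previous_was_gap:
                gap_runs += 1
            previous_was_gap = ch == gap

    return {
        0: matches / min(len_a, len_b),
        1: matches / len(cols),
        2: matches / (end - start),
        3: 1.0 - (mismatches + gap_runs) / max(len_a, len_b),
    }


# Toy alignment: 7 matches, 1 mismatch, 1 internal gap, 2 trailing gap columns.
ids = identity_definitions("ACGTAC-GT--", "ACGAACTGTCC")
```

On this toy alignment the four values differ only in their denominators and in how gaps are counted, which is why the same pair of sequences can be reported with noticeably different identities.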

Do any of those definitions sound like the metric you are looking for?

Colin
