Source Tracker analysis

Fabrice Armougom

Jul 25, 2013, 5:02:33 AM
to qiime...@googlegroups.com

Hi,

I'm trying to use SourceTracker to assess whether my sample is contaminated (16S rRNA pyrosequencing, 454 FLX Titanium).
From the QIIME database I retrieved the mapping files and otu_table.biom files for the body-site samples (study ID 449, Costello) and the soil samples (study ID 103, Lauber), which will serve as the sources.

So first I merged the OTU tables from these two sets, and also the mapping files.

merge_mapping_files.py -m study_103_mapping_file.txt,study_449_mapping_file.txt -o merged_mapping.txt -n 'Data not collected'	
merge_otu_tables.py -i study_103_closed_reference_otu_table.biom,study_449_closed_reference_otu_table.biom -o merged_otu_table.biom

I converted the BIOM table to text by running:

convert_biom.py -i merged_otu_table.biom -o merged_otu_table.txt -b


As a first positive test, in the Env field of the merged mapping file I turned one soil sample into a sink and defined all the remaining samples as sources.

Then I ran this command:
R --slave --vanilla --args -i merged_otu_table.txt -m merged_mapping.txt -o sourcetracker_out1 < $SOURCETRACKER_PATH/sourcetracker_for_qiime.r

That worked: SourceTracker reported the sample as 82% "soil", with a small unknown fraction.

THE PROBLEM:

Now I want to apply it to my own data (16S pyrosequencing), so I used QIIME.

I followed the closed-reference protocol after quality trimming with:

split_libraries.py -m mapping_output/map_corrected.txt -f 16s.fna -q 16s.qual -o split_library_output -b 0 -M 1 -H 8 -a 0 -l 70 -s 30


 pick_closed_reference_otus.py -i seqs.fna -r $HOME/qiime_software/gg_otus-12_10-release/rep_set/97_otus.fasta -t $HOME/qiime_software/gg_otus-12_10-release/taxonomy/97_otu_taxonomy.txt -o OTUREF/

This gave 1,400 OTUs and only 5,000 sequence failures.

After this step I had an otu_table.biom, which I merged with the table of source samples (body sites and soil).
I added a line to the previous merged mapping file with my SampleID and put sink in the Env field.

Then I ran SourceTracker as before.
No errors appeared, but the final result was 100% unknown.
I tried the same protocol on another data set that I know is feces; again, SourceTracker reported 100% unknown.

Any idea what is wrong? Could it be a difference in the reference database between the QIIME samples downloaded from the database and my own samples?
It is very difficult to find a complete procedure for using SourceTracker to compare personal data with other data sets.
I hope you can help me.
Best,
Fabrice

Jose Carlos Clemente

Jul 25, 2013, 8:48:47 AM
to qiime...@googlegroups.com
Hi Fabrice,

we are looking into this and we'll get back to you as soon as possible.

Thanks,
Jose

John Chase

Jul 25, 2013, 5:10:21 PM
to qiime...@googlegroups.com
Hi Fabrice, 

"I added a line to the previous merged mapping file with my SampleID and put sink in the Env field."
In the mapping file, are you looking at the source of just one sample? If so, that should be fine; however, my understanding of SourceTracker is that the results will not be as reliable as if you had multiple samples defined as sink. Also, 'sink' should go under the 'SourceSink' column and not the 'Env' column. The 'Env' column contains a description of where the sample was taken from: for instance, if the sample came from a bed in a NICU, the Env field would be something like 'NICU Bed' and the 'SourceSink' column would contain 'sink'.
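
A minimal sketch of how the column layout described above could be checked before running SourceTracker. It assumes a tab-delimited mapping file whose header line starts with `#SampleID`; the function name `summarize_source_sink` and the sample rows are hypothetical and only illustrate the idea.

```python
import csv
import io
from collections import Counter


def summarize_source_sink(mapping_text):
    """Count samples per (SourceSink, Env) pair in a tab-delimited mapping file."""
    reader = csv.reader(io.StringIO(mapping_text), delimiter="\t")
    header = next(reader)
    header[0] = header[0].lstrip("#")  # "#SampleID" -> "SampleID"
    for column in ("SourceSink", "Env"):
        if column not in header:
            raise ValueError(f"mapping file has no '{column}' column")
    role_ix = header.index("SourceSink")
    env_ix = header.index("Env")
    counts = Counter()
    for row in reader:
        # Skip blank lines, comment lines, and rows that are too short.
        if not row or row[0].startswith("#") or len(row) <= max(role_ix, env_ix):
            continue
        counts[(row[role_ix].strip().lower(), row[env_ix].strip())] += 1
    return counts


# Hypothetical mapping file: two sources and one sink.
example = (
    "#SampleID\tEnv\tSourceSink\tDescription\n"
    "soil1\tsoil\tsource\tLauber soil\n"
    "skin1\tskin\tsource\tCostello skin\n"
    "mine1\tNICU Bed\tsink\tmy sample\n"
)

for (role, env), n in sorted(summarize_source_sink(example).items()):
    print(role, env, n)
```

If 'sink' shows up only in the Env values and never as a SourceSink role, the labels are in the wrong column.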

If changing the fields does not improve your results, or they still appear to be incorrect, would it be possible to send a link to the mapping file and OTU table you are using in your final data analysis?

Finally are the files you are using for the source data from the SourceTracker tutorial? 

Hopefully I have understood your question fully and this helps!

John


Dan Knights

Jul 25, 2013, 5:17:04 PM
to qiime...@googlegroups.com
Likely this is a problem with the reference sets. Fabrice, if you follow John's advice and get 100% unknown, I suggest making a heatmap of all samples organized by "Env" or by studyID. This will probably show why there is no overlap.

John, let me know if you have questions while handling this.

Dan

Fabrice Armougom

Jul 26, 2013, 6:21:58 AM
to qiime...@googlegroups.com

Hi John,

Sorry, no, that part is OK: I put sink in the correct column (SourceSink), on the line for my own sample.
Yes, I have just one sample whose source environment I want to confirm (expected: feces).
I did not use the source OTU table from the SourceTracker tutorial but the data from the QIIME database (www.microbio.me/qiime/), which lets you retrieve the mapping file, OTU table, or sequence FASTA for Costello (body-site samples) and Lauber 2009 (soil).
Good idea, I'm going to try with the source OTU_table.biom from the SourceTracker tutorial.
If the result does not improve, I will give you a link to my mapping file and OTU table.

Thanks for your help and your quick replies, I really appreciate it.
Fabrice

Fabrice Armougom

Jul 26, 2013, 6:25:14 AM
to qiime...@googlegroups.com

Hi Dan,

Do you mean the reference sets used for OTU picking?
OK, I will also try the heatmap to check for non-overlapping taxonomy.
Thanks for helping.
Fabrice

Fabrice Armougom

Jul 26, 2013, 10:52:43 AM
to qiime...@googlegroups.com
Hi again,

In the end, when I tested SourceTracker with the source data from the SourceTracker tutorial merged with my data, it still gave me 100% unknown, although I know the sample is gut!
The link below contains the mapping file and OTU table for this example, and also the mapping file and OTU table built with source data from the QIIME database.
https://www.dropbox.com/s/wd1nxhk2k8a1ek1/Qiime.zip

I hope you can help me find my mistake(s)!

Fabrice


John Chase

Jul 27, 2013, 4:54:02 PM
to qiime...@googlegroups.com
Hi, 
When I originally read the question I assumed there was a problem either in SourceTracker or with the formatting of the files being passed into SourceTracker. Looking at the BIOM table, I'm inclined to agree with Dan that the problem is the reference sets/taxonomy assignments, or something else upstream of SourceTracker. There is almost no overlap in the OTU vectors between the source and sink samples in the OTU table.

To be perfectly honest, I am not sure why the results would be so different even if different versions of the Greengenes database were used. Dan, do you have any input on this?

John

Dan Knights

Jul 28, 2013, 9:32:24 AM
to qiime...@googlegroups.com
If they are indeed different versions of Greengenes, then they have different OTU IDs and hence no overlap.
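
A minimal sketch of how that overlap could be measured directly. It assumes a tab-delimited OTU table with a `#OTU ID` header row, such as the text export made in the first post; the function names, sample IDs, and OTU IDs are hypothetical.

```python
def observed_otus(table_text):
    """Map each sample ID to the set of OTU IDs with a nonzero count."""
    samples, present = [], {}
    for line in table_text.splitlines():
        fields = line.split("\t")
        if line.startswith("#OTU ID"):
            samples = fields[1:]
            present = {sample: set() for sample in samples}
        elif line and not line.startswith("#"):
            otu_id = fields[0]
            for sample, value in zip(samples, fields[1:]):
                try:
                    count = float(value)
                except ValueError:
                    continue  # non-numeric cell, e.g. a taxonomy column
                if count > 0:
                    present[sample].add(otu_id)
    return present


def shared_fraction(present, sink, sources):
    """Fraction of the sink's OTU IDs that also occur in any source sample."""
    sink_otus = present[sink]
    source_otus = set().union(*(present[s] for s in sources))
    return len(sink_otus & source_otus) / len(sink_otus) if sink_otus else 0.0


# Hypothetical table: the sink shares one of its three OTUs with the sources.
example = (
    "# Constructed from biom file\n"
    "#OTU ID\tsoil1\tgut1\tmine1\n"
    "4479944\t12\t0\t0\n"
    "1000986\t0\t30\t5\n"
    "new.1\t0\t0\t25\n"
    "new.2\t0\t0\t9\n"
)

present = observed_otus(example)
print(shared_fraction(present, "mine1", ["soil1", "gut1"]))
```

A fraction near zero means the sink has almost no OTU IDs in common with the sources, which would be consistent with a 100% unknown result.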

Dan

Daniel McDonald

Jul 28, 2013, 9:41:28 AM
to qiime...@googlegroups.com
Dan, the IDs are stable but the clusters are not yet


Fabrice Armougom

Jul 29, 2013, 5:07:07 AM
to qiime...@googlegroups.com

In the PLoS ONE 2012 paper "Insights from characterizing extinct human gut", the authors used SourceTracker, and you are a co-author of that paper, aren't you?
In that article, did you build the different source OTU tables with the QIIME pipeline, or did you use the source OTU tables provided in the QIIME database?
So if I understand correctly, the next step for me is not to use the source tables from the QIIME database or the SourceTracker tutorial, but rather to build all of them with QIIME commands and the same Greengenes reference database?
Fa

Dan Knights

Jul 29, 2013, 11:24:43 AM
to qiime...@googlegroups.com
Hi Fabrice,

To use OTUs, you can either (a) pick de novo OTUs for all data sets jointly, or (b) pick reference-based OTUs for your data against the same Greengenes version used in the QIIME database and merge the resulting OTU tables.

In the extinct human gut study we used genus-level taxonomy to combine several types of sequencing data. This is also an option for you, but I would recommend using OTUs if possible because you get better discrimination between sources.
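
A minimal sketch of what combining data at genus level can look like, assuming each OTU carries a Greengenes-style lineage string with a `g__` rank. The function names, OTU IDs, and counts are hypothetical, and this is not the procedure used in the study.

```python
from collections import defaultdict


def genus_of(lineage):
    """Return the genus label (e.g. 'g__Propionibacterium') from a lineage string."""
    for rank in lineage.split(";"):
        rank = rank.strip()
        if rank.startswith("g__") and len(rank) > 3:
            return rank
    return "g__unassigned"


def collapse_to_genus(otu_counts, otu_lineage):
    """Sum per-sample counts over all OTUs that share a genus label."""
    genus_counts = defaultdict(lambda: defaultdict(int))
    for otu_id, per_sample in otu_counts.items():
        genus = genus_of(otu_lineage.get(otu_id, ""))
        for sample, count in per_sample.items():
            genus_counts[genus][sample] += count
    return {genus: dict(counts) for genus, counts in genus_counts.items()}


# Hypothetical data: two studies hit the same genus through different OTU IDs.
otu_counts = {
    "4365109": {"skin1": 40, "surface1": 0},
    "New.0.ReferenceOTU7": {"skin1": 0, "surface1": 22},
    "359954": {"skin1": 3, "surface1": 1},
}
otu_lineage = {
    "4365109": "k__Bacteria; p__Actinobacteria; g__Propionibacterium; s__acnes",
    "New.0.ReferenceOTU7": "k__Bacteria; p__Actinobacteria; g__Propionibacterium; s__",
    "359954": "k__Bacteria; p__Firmicutes; g__Staphylococcus; s__",
}

collapsed = collapse_to_genus(otu_counts, otu_lineage)
for genus in sorted(collapsed):
    print(genus, collapsed[genus])
```

After collapsing, the two samples share a genus row even though they share no OTU ID. The cost is the one noted above: distinct OTUs within a genus can no longer help discriminate between sources.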

Dan

Fabrice Armougom

Jul 30, 2013, 4:31:27 AM
to qiime...@googlegroups.com
Hi Dan,

Yesterday I started reference-based OTU picking for several gut microbiota data sets from the lab, using the latest Greengenes release. Then I merged all the OTU tables and turned one sample into a sink in the mapping file.
SourceTracker found about 40% gut, so it works, and it really was a problem with the Greengenes version. Now I have to enlarge the source database to improve the predictions. Many thanks for your help.
Fabrice 

Dan Knights

Jul 30, 2013, 1:31:49 PM
to qiime...@googlegroups.com
Hi Fabrice,

Sounds good. Keep in mind that stool microbiomes from different populations (e.g. Malawi vs. USA) may be quite different and may belong in separate "source" environments.

Dan

Fabrice Armougom

Aug 1, 2013, 8:08:48 AM
to qiime...@googlegroups.com

yes, of course, thanks. 
Fa

Juan Pedro Maestre Wic

Nov 18, 2014, 4:56:02 PM
to qiime...@googlegroups.com
Hi Dan,

I am running into a similar problem. I am combining my data set with a subset from Costello, and I did open-reference OTU picking on everything together. I set Costello as source and my samples as sink. I would expect my surfaces to have a significant percentage of skin; however, I just get a high "unknown". I made the heatmap as you suggested to Fabrice and saw some interesting things: for example, the OTUs assigned to Propionibacterium in my data set are different from the Propionibacterium OTUs from Costello, that is, the OTU IDs are different. Should I run SourceTracker at a coarser taxonomic level (such as class) instead of at the 97% OTU level? I am guessing that those g_Propionibacterium OTUs are probably different species, which would explain why they fall in different OTUs. I have seen that skin-related taxa are abundant in my samples.

Thanks for the help,

JP.

Dan Knights

Nov 25, 2014, 5:26:02 PM
to qiime...@googlegroups.com
Hi JP,

The best outcome would be if you could find a source data set that was generated using the same sequencing technology, variable region, and informatics, but if that is not possible then you might try genus level taxa instead of OTUs.

Dan