Hi Colin,
What I found was that if I demultiplexed my data into separate fastqs prior to pushing into qiime (using fastq-multx), I could still map data to phix and produce a very nice-looking reference-guided assembly (the reads are indeed phix). I am less enthusiastic about exploring phix now than I was two weeks ago. The reason is that I don't think it is actually doing much, if anything, to most people's data. That said, I am finding interesting trends in phix-contaminated data.
First, how does the phix get into the demultiplexed data? This happens when sequencing clusters (flowcell polonies) are in close proximity to one another and the optical registration of each cluster becomes confounded. Illumina instruments sequence the first read, then the first index, then the second index (if present), then everything turns over, and then the second read is produced. If two clusters overlap sufficiently, the poor signal-to-noise ratio should cause the mixed cluster to be tossed out (not passing filter, or not PF, in Illumina jargon). However, there may be a proximity that still passes filter and that can cause a read from one sample to be attributed to another (see Kircher et al. 2012
http://www.ncbi.nlm.nih.gov/pubmed/22021376). The phix library used as the Illumina control is non-indexed, meaning it "goes dark" during the indexing reads. So there will be no conflicting signal during indexing, though you might think it should not PF during cluster registration. At any rate, what must (probably) be happening is that you get a splendid read1 from phix, and then a nearby cluster (very close, in fact) from your data set, perhaps one that didn't PF in the first place, generates signal during the index read, and thus that phix read is attributed to that sample during demultiplexing. This means that cluster density and phix concentration will both contribute to the percentage of phix infiltration for a given data set.
So what can phix do to your amplicon data? As you mention, OTU inflation might be a concern, as well as describing spurious uncharacterized diversity. Most phix is probably removed during standard filtering steps (singleton/doubleton, 0.005% abundance), but not necessarily. Phix (PhiX174, Fred Sanger's first genome) is a bacteriophage with a puny genome (5kb). For the data set I was playing with, I had about 75k phix reads from over 3M reads. This is around 2%, and sufficient to cover phix 100 fold. If we still all used blast, this might be a non-issue, but with cdhit/uclust-type algorithms with short words and such, many random phix sequences could conceivably be assigned to OTUs during de novo OTU picking steps. When I filter my data at 0.005%, it is using a threshold around 30 counts. If I am covering phix 100 times, it seems I might leave some parts of it in my data to be attributed to the "unknown" category. Since these will all be random reads, they should serve to subtly homogenize your data set. So if you expect a small effect size, this could cause real problems with data interpretation.
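To make the arithmetic above concrete, here is a minimal Python sketch of the two numbers involved: the phix fraction and the count cutoff implied by a 0.005% abundance filter. The function names are made up for illustration. The inputs are the round figures from this post: exactly 3M total reads gives 2.5%, and "over 3M" pulls that toward the ~2% quoted. A cutoff near 30 counts would correspond to roughly 600,000 sequences at the filtering stage; that figure is inferred from the threshold, not something I measured.

```python
def phix_fraction(phix_reads, total_reads):
    """Fraction of all reads that are phix."""
    return phix_reads / total_reads


def min_count_threshold(total_sequences, min_fraction=0.00005):
    """Count cutoff implied by a fractional abundance filter.

    0.005% abundance is min_fraction = 0.00005.
    """
    return total_sequences * min_fraction


def sequences_for_threshold(threshold_count, min_fraction=0.00005):
    """Back out how many sequences would give a particular count cutoff."""
    return threshold_count / min_fraction


if __name__ == "__main__":
    # Round numbers taken from the post.
    print(f"phix fraction: {phix_fraction(75_000, 3_000_000):.1%}")
    # Inferred, not measured: table size that yields a cutoff of ~30 counts.
    print(f"sequences for a 30-count cutoff: {sequences_for_threshold(30):.0f}")
    print(f"cutoff at 600k sequences: {min_count_threshold(600_000):.0f}")
```

The point is just that any phix-derived OTU that accumulates more counts than the cutoff will survive the abundance filter.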
I have also been finding differences with respect to read joining. Playing with fastq-join at different allowable mismatches, I find that very stringent joining (5% allowable mismatch or less) results in very high proportions of phix contamination in the joined data (around 10% of the data when joined at 1%). Less stringent joining (30%) yields far more reads with about the same result, and phix contamination down around 1%.
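If you want to repeat the joining comparison, something like the following Python sketch would build the fastq-join command lines for a few stringencies. It only constructs and prints the commands; it does not run them. The file names are hypothetical, and the flags are as I understand them from the ea-utils usage text (-p for maximum percent difference, -o for an output template in which fastq-join substitutes the % character), so check the help output of your installed version before relying on them.

```python
import shlex


def fastq_join_command(read1, read2, max_diff_percent, out_template):
    """Build (but do not run) a fastq-join command line.

    -p : maximum percent difference allowed in the overlap
    -o : output template; fastq-join replaces '%' with join/un1/un2
    """
    return [
        "fastq-join",
        "-p", str(max_diff_percent),
        read1,
        read2,
        "-o", out_template,
    ]


if __name__ == "__main__":
    # Hypothetical file names; stringencies are the ones discussed above.
    for pct in (1, 5, 30):
        cmd = fastq_join_command(
            "sample_R1.fastq", "sample_R2.fastq", pct, f"joined_p{pct}.%.fastq"
        )
        print(shlex.join(cmd))
```

Mapping each joined output against phix then gives the contamination percentage at that stringency.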
I got a response back from my Illumina FAS today after a couple of weeks of waiting. He confirmed that my suspicions are correct about how the phix gets in the data and the factors that affect it. He suggested using an indexed control library to eliminate this sort of contamination.
Because of the mechanism for phix infiltration (mixed clustering where one cluster has an index and the other does not), I do not think the phix rate is indicative of sample-sample bleed (though I do believe this happens, and more so with single than dual indexed data). The sample-sample bleed rate is probably at least an order of magnitude lower.
So what can you do? There is a utility from the Sanger Institute called smalt that is perfect for removing phix. I wrote a script that uses it to do just that for my data (
https://github.com/alk224/akutils). Somewhere in my notes for it there is a link to the GenBank sequence for the phix used in Illumina runs. And most of all, grab your towel and keep calm; the phix probably isn't doing anything to most of us.
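For anyone who wants the gist of the removal step without reading the script: once a mapper has told you which read IDs hit phix, dropping them is just filtering the fastq by ID. The Python sketch below illustrates that idea only; it is not how the akutils script is implemented. The function names are hypothetical, and it assumes well-formed four-line fastq records whose read ID is the first whitespace-delimited token of the header.

```python
import io


def read_fastq(handle):
    """Yield (header, seq, plus, qual) from a 4-line-per-record fastq."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def read_id(header):
    """Strip the leading '@' and anything after the first whitespace."""
    return header[1:].split()[0]


def remove_phix(in_handle, out_handle, phix_ids):
    """Write every record whose ID is not in phix_ids; return (kept, removed)."""
    kept = removed = 0
    for header, seq, plus, qual in read_fastq(in_handle):
        if read_id(header) in phix_ids:
            removed += 1
            continue
        out_handle.write(f"{header}\n{seq}\n{plus}\n{qual}\n")
        kept += 1
    return kept, removed


if __name__ == "__main__":
    # Toy input: pretend the mapper reported read2 as a phix hit.
    reads = io.StringIO(
        "@read1 1:N:0\nACGT\n+\nIIII\n"
        "@read2 1:N:0\nGGCC\n+\nIIII\n"
    )
    cleaned = io.StringIO()
    print(remove_phix(reads, cleaned, {"read2"}))
    print(cleaned.getvalue(), end="")
```

For paired data you would apply the same ID set to both read files (and the index file, if you keep one) so that everything stays in sync.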
My workflow: fastq-multx to demultiplex, remove phix with smalt, join with fastq-join, second demultiplex with split_libraries_fastq, pick_open_reference_otus (uclust max accepts 1000 rejects 2000), pick_rep_set, assign tax (RDP), align seqs (pynast or mafft), build tree (though I generally use gg tree), make otu table, post-processing steps.
Whew!!