Removal of PhiX reads from MiSeq data


Brodie

Apr 29, 2014, 7:38:33 PM4/29/14
to qiime...@googlegroups.com
I have multiplexed 96 16S samples on a MiSeq using the 12-base Golay barcodes recommended by the Earth Microbiome Project. I received the data in multiplexed format, so I have three files: Read 1, Read 2, and the barcode reads. In reads 1 and 2, the primers have already been removed by the sequencing company. My data inevitably contains PhiX reads, and because these are random fragments of the PhiX genome, many of them happen to match one of my barcodes in their first 12 bases... which means they will be assigned to my samples by split_libraries.py. In fact, I know this is the case because our in-house PhiX filter identified the problem.
 
Is there a command in QIIME to remove PhiX reads from MiSeq data? As I said, we have a script here that identifies PhiX reads, but it seems far easier to use a QIIME command than to put each split-library file through this filter individually... and surely others have had this problem...(?)
 
Thanks in advance for any comments/answers you can give me.
 
 
 

Andrew Krohn

Dec 23, 2014, 3:26:51 PM12/23/14
to qiime...@googlegroups.com
Hi Brodie,

I've been investigating this as well and not finding much in the way of answers.  It seems to me this may even create a nice opportunity for those of us with single-indexed data.  Let me explain:

Kircher et al. (NAR, 2012) showed that Illumina sequencers can misidentify single-indexed reads to a surprising degree.  This misidentification can be corrected by the use of dual indexes, and Kozich et al. (2013) published a protocol updating the Caporaso (2012) protocol.  Fadrosh et al. (2014) published yet another protocol that includes "heterogeneity spacers", which I like in principle, but I already have primers (Caporaso-style single-indexed) and data generated with them that I must deal with.  So what opportunity is there if I have data produced by the oldest available MiSeq protocol?  I think if we can quantify the amount of phix as a percentage of the demultiplexed reads, we can subtract this percentage from each read count in an otu table and correct for sample-to-sample contamination, which is arguably a far worse problem than a little phix contributing a bunch of low-level "unknown" taxa that you might filter out later anyway.  How to do this then...

Demultiplex your data first, since your run undoubtedly had some percentage of phix spiked in; what you are concerned about is the phix that was wrongly assigned to your sample data.  Make sure you pass --store_demultiplexed_fastq, or use some other tool that lets you keep a fastq.  If you are going to join your reads (say, with fastq-join), do that prior to demultiplexing.

Download smalt and get that running (https://www.sanger.ac.uk/resources/software/smalt/).  You will need it to map phix reads to a reference.  Also you need the correct reference (http://www.ncbi.nlm.nih.gov/nucleotide/NC_001422).

Follow the instructions in smalt to build an index with the highest sensitivity (pass -k 11 -s 1).

Then map reads to reference with smalt: smalt map -o outfile.sam /path/to/referenceindex/without/extensions infile.fq

This will produce an alignment in SAM format (outfile.sam).  SAM files are plain text with one tab-delimited alignment record per line, so we can do a quick regex exercise to find the reads that don't map to phix (the data you want):

egrep "\w+:\w+:\w+-\w+:\w+:\w+:\w+:\w+\s4" outfile.sam > unmapped.sam

That regex should match any MiSeq read header followed by SAM flag 4 (read unmapped), and the command will yield a separate (still large) SAM file (unmapped.sam) containing only the data you want.
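As an alternative to matching on the read-name pattern, you could select on the SAM flag column directly, which is less fragile if Illumina changes its header format. A minimal sketch, assuming single-end alignments (an unmapped record then carries flag value exactly 4); the function name here is made up for illustration:

```shell
# Sketch: keep only unmapped SAM records by testing the flag column (field 2)
# instead of regex-matching the read name. Skips header lines (starting with @).
# Assumes single-end data, where an unmapped read has flag exactly 4.
keep_unmapped() {
    awk -F'\t' '!/^@/ && $2 == 4' "$1"
}
```

Usage would be `keep_unmapped outfile.sam > unmapped.sam`. Note that for paired-end alignments the flag combines several bits, so an exact test against 4 would miss some unmapped records.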

Now estimate the level of contamination using the command wc and a calculator:

wc -l outfile.sam (returns the number of reads in outfile, plus any header lines beginning with @)
wc -l unmapped.sam (returns the number of reads you wish to retain)

Then divide the second result by the first.  Eventually you might want to scale each otu count in your table by this percentage relative to the total number of read counts for each sample (a bit of Excel unpleasantness, I'm afraid).
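The division and the per-OTU scaling can also be scripted rather than done in Excel. A sketch under assumptions: the OTU table is plain tab-separated text (OTU ID in column 1, counts in the remaining columns) rather than QIIME's native format, and both function names are made up:

```shell
# Sketch: fraction of reads retained = non-header lines in unmapped.sam
# divided by non-header lines in outfile.sam.
phix_retained_fraction() {
    kept=$(grep -vc '^@' "$1")    # e.g. unmapped.sam
    total=$(grep -vc '^@' "$2")   # e.g. outfile.sam
    awk -v k="$kept" -v t="$total" 'BEGIN { printf "%.6f", k / t }'
}

# Sketch: multiply every count in a tab-separated OTU table by the retained
# fraction, truncating to whole counts. Assumes no header row.
scale_otu_table() {
    awk -v f="$2" 'BEGIN { FS = OFS = "\t" }
        { for (i = 2; i <= NF; i++) $i = int($i * f); print }' "$1"
}
```

Something like `scale_otu_table otu_table.txt "$(phix_retained_fraction unmapped.sam outfile.sam)" > otu_table.scaled.txt` would then apply one run-wide correction; as noted above, a per-sample fraction would be better still.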

Now use QIIME's filter_fasta.py command to remove all of the phix from the fastq you initially passed into smalt.  Note that your fastq must end in .fastq or the script won't properly recognize the format.  Finally, pass the filtered fastq into the convert_fastaqual_fastq.py script to produce a fasta for downstream processing.

Sorry it's not easier than this.  I agree, there should be a simple tool...

Andrew Krohn

Dec 23, 2014, 3:53:15 PM12/23/14
to qiime...@googlegroups.com
One other note: I find it easier to use fastq-multx (part of ea-utils) for the initial demultiplexing.  Pass -x to prevent index-sequence removal, and the mapping file is much simpler (see the ea-utils site).  Use -B to supply the barcode file.  It will write a demultiplexed fastq per sample for each read file you pass in (I start with raw read 1, read 2, and index 1 here) and, crucially, keeps them all in phase so you can join ends, run split libraries, whatever, later without a software complaint.  From your output directory, remove each output that is "unassigned" and then concatenate like files together like this:

cat index1.* > index.all.fq

And so on.  This removes all the unassigned data prior to everything else and is crucial to having any hope of accurately estimating phix contamination (and, by extension, sample-to-sample contamination resulting from mixed clusters during sequencing).  If you were to pass in raw data with, say, 30% phix, you would dramatically overestimate the percentage of mixed clusters for your run.  I strongly suspect this cluster mixing occurs at a different rate for every run, contingent on the total cluster density, the indexing strategy, and the total percentage of phix present.
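The remove-then-concatenate step can be sketched as a small loop. Assumptions here: fastq-multx labels its unassigned output "unmatched" by default (adjust the pattern if your version names it differently), and the function name is invented:

```shell
# Sketch: pool per-sample fastq-multx outputs for one read, skipping the
# unassigned ("unmatched") files. $1 = filename prefix (e.g. index1),
# $2 = pooled output file.
pool_assigned() {
    : > "$2"                                  # start with an empty output
    for f in "$1".*; do
        case "$f" in *unmatched*|*unassigned*) continue ;; esac
        cat "$f" >> "$2"
    done
}
```

For example, `pool_assigned index1 index.all.fq` reproduces the `cat index1.* > index.all.fq` step above while guarding against an unassigned file sneaking into the pool.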

Andrew Krohn

Dec 23, 2014, 7:29:04 PM12/23/14
to qiime...@googlegroups.com
I just ran this approach on a subset of mock community data with 7 taxa for ITS2 sequences.  Let me stress that the major trends do not change!!  However, the stats tighten up, and this is apparent in the beta diversity plots.  I processed the mock data three ways: "normally" via qiime, filtering at 0.005% according to Bokulich et al. (2013); with the phix stripped out first, then processed the same way; and once more with the phix-stripped data but with every OTU count additionally filtered on a per-sample basis according to phix presence in the whole dataset (it would be better if this were also calculated per sample).

Normal processing: 12 taxa, pseudo-F on Bray-Curtis DM = 32.7775
Phix removed: 10 taxa, pseudo-F = 29.6114
Phix removed and OTU table filtered proportionally: 8 taxa (so close!), pseudo-F = 39.3108

I think just stripping out the phix isn't enough, as I am feeling fairly vindicated in my assertion that the phix contamination level is diagnostic of the mixed-cluster rate for your particular sequencing run.  In this data (all ascomycetes), phix manifested as low-level "visitor" OTUs from Basidiomycota and the dreaded "Other" category.  However, I did also remove too much data from one of the communities, and taxa I know are present were no longer detected (more argument for assessing on a per-sample basis).  Since I was working on a small subset of data in the interest of time (even sampling at 50 for core_diversity), perhaps this was due to low sampling effort.  I'll be playing with this in the near future on my own data, where I know the effects are likely subtle and I can't afford to get hosed on the technicality of mixed clusters obscuring real trends.


Colin Brislawn

Jan 6, 2015, 7:01:43 PM1/6/15
to qiime...@googlegroups.com
Hello Andrew,

I'm very interested in your continued work on this project. I agree that phix levels have the potential to be diagnostic of (or a control for) MiSeq data quality, but can also lead to contamination and inflated OTU counts.

For 16S work, the present understanding, although it may be wrong, is that phix reads will simply be left behind during split_libraries.py. You mentioned that phix can be similar enough to the Golay barcodes that it will be split out into samples, which is mitigated by dual barcodes. (Did I get that right?) Is this true for all barcodes, or for some more than others? Could some of this be due to the primers, such that different regions of 16S or ITS are impacted differently?

You used the term 'mixed clustering.' What do you mean by that?

Finally, the biological relevance and number of clusters strongly depend on the OTU picking algorithm used. Which clustering algorithm did you use for this analysis?


Thanks for taking the time with all my questions,
I look forward to learning more,
Colin

Andrew Krohn

Jan 6, 2015, 7:39:41 PM1/6/15
to qiime...@googlegroups.com
Hi Colin,

What I found was that if I demultiplexed my data into separate fastqs prior to pushing into qiime (using fastq-multx), I could still map reads to phix and produce a very nice-looking reference-guided assembly (the reads are indeed phix).  I am less enthusiastic about exploring phix now than I was two weeks ago.  The reason is that I don't think it is actually doing much, if anything, to most people's data.  That said, I am finding interesting trends in phix-contaminated data.

First, how does the phix get into the demultiplexed data?  This happens when sequencing clusters (flowcell polonies) are in close proximity to one another and the optical registration of each cluster becomes confounded.  Illumina instruments sequence the first read, then the first index, then the second index (if present), then everything turns over and the second read is produced.  If two clusters overlap sufficiently, the signal-to-noise ratio should cause the mixed cluster to be tossed out (not passing filter, in Illumina jargon).  However, there may be a proximity that still passes filter and can cause a read from one sample to be attributed to another (see Kircher et al. 2012, http://www.ncbi.nlm.nih.gov/pubmed/22021376).  The phix library used as the Illumina control is non-indexed, meaning it "goes dark" during the indexing reads.  So there will be no conflicting signal during indexing, though you might think it should not pass filter during cluster registration.  At any rate, what is (probably) happening is that you get a splendid read 1 from phix, and then a nearby cluster (very close, in fact) from your data set, perhaps one that didn't pass filter in the first place, generates signal during the index read, and thus that phix read is attributed to that sample during demultiplexing.  This means that cluster density and phix concentration will both contribute to the percentage of phix infiltration for a given data set.

So what can phix do to your amplicon data?  As you mention, OTU inflation is a concern, as is describing spurious uncharacterized diversity.  Most phix is probably removed during standard filtering steps (singleton/doubleton, 0.005% abundance), but not necessarily.  Phix (PhiX174, Fred Sanger's first genome) is a bacteriophage with a puny genome (about 5.4 kb).  For the data set I was playing with, I had about 75k phix reads out of over 3M reads.  This is around 2%, and sufficient to cover phix 100-fold.  If we all still used blast, this might be a non-issue, but with cdhit/uclust-type algorithms using short words and such, many random phix sequences could conceivably be assigned to OTUs during de novo OTU picking steps.  When I filter my data at 0.005%, that works out to a threshold of around 30 counts.  If I am covering phix 100 times over, it seems I might leave some parts of it in my data to be attributed to the "unknown" category.  Since these will all be effectively random reads, they should serve to subtly homogenize your data set, so if you expect a small effect size it could cause real problems with data interpretation.
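The relationship between the 0.005% abundance filter and a concrete count cutoff depends only on the library size, so it is easy to sanity-check. A sketch with a hypothetical library size and an invented function name:

```shell
# Sketch: count cutoff implied by a 0.005% abundance filter (Bokulich et al.
# 2013) for a library of $1 reads. 0.005% = 0.00005 as a fraction.
abundance_cutoff() {
    awk -v n="$1" 'BEGIN { printf "%.0f\n", n * 0.00005 }'
}
```

For example, `abundance_cutoff 600000` gives a cutoff of 30 counts, consistent with the threshold mentioned above; a full 3M-read run would imply a cutoff around 150.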

I have also been finding differences with respect to read joining.  Playing with fastq-join at different allowable mismatch levels, I find very stringent joining (5% allowable mismatch or less) results in very high proportions of phix contamination in the joined data (around 10% of the data when joined at 1%).  Less stringent joining (30%) yields far more reads with about the same downstream result, and phix contamination down around 1%.

I got a response back from my Illumina FAS today after a couple of weeks of waiting.  He confirmed that my suspicions are correct about how the phix gets in the data and the factors that affect it.  He suggested using an indexed control library to eliminate this sort of contamination.

Because of the mechanism for phix infiltration (mixed clustering where one cluster has an index and the other does not), I do not think the phix rate is indicative of sample-sample bleed (though I do believe this happens, and more so with single than dual indexed data).  The sample-sample bleed is probably an order of magnitude lower or better.

So what can you do?  There is a utility from the Sanger institute called smalt that is perfect for removing phix.  I wrote a script using this to do just that from my data (https://github.com/alk224/akutils).  Somewhere in my notes for it there is a link to the genbank sequence for the phix used for illumina runs.  And most of all, grab your towel and keep calm, the phix probably isn't doing anything to most of us.

My workflow: fastq-multx to demultiplex, remove phix with smalt, join with fastq-join, second demultiplex with split_libraries_fastq, pick_open_reference_otus (uclust max accepts 1000 rejects 2000), pick_rep_set, assign tax (RDP), align seqs (pynast or mafft), build tree (though I generally use gg tree), make otu table, post-processing steps.

Whew!!

Andrew Krohn

Jan 6, 2015, 7:45:52 PM1/6/15
to qiime...@googlegroups.com
I forgot to mention: what phix is probably doing to all of us is wasting a bit of computational time during de novo steps.  I have no idea how to quantify this, and it is probably hardware-specific anyway.  Otherwise, its effects are probably minimal.