MetaPhlAn 1.7.0 supports fastq files

Nicola Segata

unread,

Jul 27, 2012, 11:58:58 AM7/27/12

to metaphl...@googlegroups.com

Dear MetaPhlAn users,

we have just implemented one of the features that several of you requested: the support for fastq files (rather than fasta files only) as input format for the metagenomes.

In the new version (1.7) this is possible when using BowTie2 for the mapping (Blast does not support fastq). No changes are needed in the command line options as MetaPhlAn will automatically detect the fastq format and will apply BowTie2 accordingly.

As usual the new version is available at:

http://huttenhower.sph.harvard.edu/metaphlan

with the code hosted at:

https://bitbucket.org/nsegata/metaphlan/

I would highly recommend to use fastq files as input becase (a) BowTie2 will map the reads with higher sensitivity and accuracy, (b) no fastq->fasta conversion and read quality control are needed. This means that you can now apply MetaPhlAn basically on the output of the sequencing machine directly (even non-microbial reads are not a problem because they will not be mapped to microbial clade and they will not affect at all the estimated relative microbial abundances.).

Let us know if you have any comment or question.

many thanks

Nicola

Nicola Segata

unread,

Feb 4, 2013, 8:26:48 AM2/4/13

to metaphl...@googlegroups.com

Hi Maria,

that's a good question!

Very generally speaking, sequencing artifacts and low-quality reads should not considerably impact the MetaPhlAn profiles as those reads will not map against any marker (at least above a certain noise level that we remove). Exact duplicates are not removed by the BowTie2 application and do influence to a certain degree the profiled relative abundances (especially if the distribution of duplicates is uneven within the metagenome) but not the presence of the microbes. Are the differences you noticed related to the abundances or also to the presences/absences?

One other aspect to consider is that if your QC pipeline removes many reads, some microbes may go below the level of detection in the QC'd metagenome and abundant microbes will have more noisy abundance estimates as a consequence of the lower coverage.

Based on the testing we performed in the last few months, we are now recommending to use the "sensitive" or "very-sensitive" parameters for BowTie2 mapping instead of the "-local" variants. The default is still "very sensitive local" for backward compatibility but we'll switch to "very-sensitive" in the next release. With "sensitive" I expect you'll notice smaller differences between the QC'd and the non QC'd metagenomes. Also, looking at the "-t clade_profiles" output can help you understand which reads are causing the differences (I can also have a look at that if you'd like).

I hope this helps in tuning the pipeline for your analysis!

thanks

Nicola

On Thursday, January 31, 2013 2:47:46 PM UTC+1, bluep...@gmail.com wrote:

Dear Nicola,
Hi, we have a question regarding the usage of fastq files with the adjusted sensitivity setting to use [sensitive-local].
May I know how the quality filtering was done during the bowtie2 search and if the process remove duplicate reads from the sequencer [e.g. in the case of 454 sequences]?
We did a quick run on our illumina reads, where one input is the untreated file [then we use the sensitive-local option], and the other is pre-treated for quality filtering [using prinseq to trimming (remove tails) + remove low quality sequences (mean q_score = 20) + remove exact duplicates], then we use this to run metaphlan with default setting [very sensitive local].
The results were quite different, and wondering whether there is a way to calculate the possible rate of false positives? [personally I would very much like to use the files directly without quality filtering to avoid the trouble!]. But since we are also interested in the microbes that are of low abundance so sensitivity would benefit too.
Thank you very much for your kind attention!
Maria

bluep...@gmail.com

unread,

Feb 22, 2013, 7:22:25 AM2/22/13

to metaphl...@googlegroups.com

Dear Nicola,

Thank you very much for your reply! We have adjusted the sensitivity to "sensitive" or "very-sensitive" [without the -local variant]. And we have tested the search with both QC and non-QC samples and yes, the results are quite similar.

However, when we compare the output from blast (e-value cut off e-10, ws 12), it seems that blast is picking up much more hits (like 10x more hits). Is it a general observation between samples searched using bowtie vs blast [where bowtie has lower hits], esp. for non-human samples? Do you suggest us to use blast instead of bowtie when searching against samples that might be of lower similarity to the database?

Thanks for your help again!

Maria

bluep...@gmail.com

unread,

Feb 22, 2013, 7:41:25 AM2/22/13

to metaphl...@googlegroups.com

On another quick note, FYI the relative abundance profile output from same sample searched using bowtie "very-sensitive-local" to blastn is more similar, but we can't figure out why...Does that mean the blastn search is not specific? What does the "local" variant mean for bowtie? Thank you so much! -maria-

Nicola Segata

unread,

Feb 22, 2013, 9:33:58 AM2/22/13

to metaphl...@googlegroups.com

Hi Maria,

thanks for the feedback.

The local strategy in BowTie2 (and in any aligner in general) will find a hit of a read against a target sequence even if only a fraction of the read matches the target. A non-local (or global) strategy requires a good sequence homology along the full length of the read. A 28nt-long match for a 30nt-long portion of a 101nt-long read is a hit for a local strategy but not for a global one. A 97% identity over the full length of 100nt would be a hit for both the global and the local strategy.

In MetaPhlAn, a local strategy can produce false positives because of conserved very short DNA sequences that can sometimes be present even in marker genes.

Blastn is, by definition, a local aligner. The "local" behavior is frequently not evident for short reads because the default word size (ws) is quite large (around 30) avoiding local hits for non-identical matches of even longer length (even 50nt unless the mismatch is peripheral). Given that you greatly decreased the ws to 12 to improve sensitivity (I agree that the default value makes Blastn not accurate enough), you also made Blastn to behave in a much more local way than the default.

A check you could do would be to post-process the blastn hits (e-value cut off e-10, ws 12) and consider only those with a matching length longer than, say, 75nt (the matching length is in one of the columns of the standard Blastn tabular output format). In this way you are making Blastn global (assuming Blastn is reporting long matches even if their e-value is lower than shorter subsequences) and the MetaPhlAn behavior should be much more similar to BowTie2 (non-local) "sensible".

In summary, I think that the real differences in mapping are not depending too much on the algorithm (BowTie2 vs Blastn) but more on the local vs global mapping policy. I would recommend a global strategy to avoid false positives, whereas a local strategy can sometimes be useful for metagenomes without closely related sequences genomes (like in some non-human microbiomes).

An aspect that I would like to underline not only in this context, is that for MetaPhlAn, differently form other approaches, the number of mapped reads is _not_ a proxy for the quality of the profiling.

Sorry for the technicalities here, but I hope this clarifies a bit what you are seeing in the results...

many thanks

Nicola

bluep...@gmail.com

unread,

Feb 25, 2013, 1:55:37 AM2/25/13

to metaphl...@googlegroups.com

Thank you very much for your kind response! And sorry to have asked so many questions! The information you've provide is very useful. As we are trying to work on comparing the profiles of non-human microbiome metagenomes of different read length (e.g. 454 [relatively long, >400bp] and MiSeq [short, 250bp] data), and deciding whether we can use one "universal" mapping parameter to map the reads, and be able to compare the results.

We will adjust the blastn hit results as suggested, the blast was done for 454 reads. Is 75nt matching length a good cut off for 454 reads too, in the case when we want to compare against other libraries of different read length? Or would you advise us to adjust the match length to the average length of the illumina reads we are going to compare with [which will be searched using bowtie with global alignment)?

Or do you think we should break the 454 sequences into lengths similar to the illumina reads and do bowtie using the same parameters altogether (global alignment)?

Thank you very much once again!

Maria

Nicola Segata

unread,

Feb 26, 2013, 2:56:46 AM2/26/13

to metaphl...@googlegroups.com

Hi Maria,

given that in both cases your reads are relatively quite long (>200), I would use blastn even at 100nt length cutoff. And this should give you consistent results for both technologies. This would implement a sort of "semi-local" version. There is not a BowTie2 preset option for semi-local (i.e. a local mapping with a minimum matching length), but I'm going to include such option in future MetaPhlAnl releases.

I hope this helps,

many thanks

Nicola

bluep...@gmail.com

unread,

Feb 26, 2013, 4:05:31 AM2/26/13

to metaphl...@googlegroups.com

Dear Nicola

Thank you so much for your prompt reply! Much appreciated :)
Have a nice day.

Maria

Reply all

Reply to author

Forward