have Uniref50 and Uniref90 in the same output _genefamilies.tsv

274 views
Skip to first unread message

tiphaine...@gmail.com

unread,
Feb 9, 2017, 7:44:57 AM2/9/17
to HUMAnN Users
Hi,

My colleague run

humann2 --input $fastafile --output $workdir --metaphlan /home/test/humann2/metaphlan2/biobakery-metaphlan2-1f76aaeb2f66 --bowtie2 /home/test/metagenome/bowtie2/bowtie2-2.2.9 --diamond /home/test/.local/bin --threads 16

without specific --search-mode

in the folder "humann2/DB/uniref", we have
uniref50_annotated.dmnd
uniref90_annotated.dmnd

but you put in our document:

"--search-mode {uniref50,uniref90}
search for uniref50 or uniref90 gene families
[DEFAULT: based on translated database selected]"
How can we select the translated database ?

Moreover, in the output file type _genefamilies.tsv

we have some lines like that:

UniRef50_A0A015T442 6015.917931
UniRef50_A0A015T442|unclassified 6015.917931
UniRef90_A0A015T442 6015.917931
UniRef90_A0A015T442|unclassified 6015.917931

where we have some repetitions and some values with UniRef90 and UniRef50.
or somethimes we have only data from UniRef90
UniRef50_R6N0W6|unclassified 7.902537394
UniRef90_J9CWL1 3515.841598
UniRef90_J9CWL1|unclassified 3515.841598
UniRef50_R6HM03 3515.658306
UniRef50_R6HM03|g__Bacteroides.s__Bacteroides_dorei 3077.908288

Does it mean that humann2 run it on the two databases or if some data come from only UniRef50?

Should we need to rerun humann2 and how to specify clearly the database that we want to use?

Regards,
Tiphaine

Eric Franzosa

unread,
Feb 9, 2017, 7:27:37 PM2/9/17
to humann...@googlegroups.com
Hi Tiphaine,

Sorry for the trouble! If pointed at a folder with more than one database, HUMAnN2 will perform separate searches against each one and merge the results (this allows a user to break up a large database, e.g. one that is too big to fit in memory, and search it serially). That is what you are seeing here.

If you store the UniRef50 and UniRef90 databases in separate folders you will not have this issue. To tell HUMAnN2 which database to use, you can point you individual runs at a folder with the "--protein-database" flag (OR) you can configure a default translated search database with the humann2_config utility:


For example:

humann2_config --update database_folders protein $DIR

Would update the default protein database to $DIR.

====

The reason why some protein IDs show up as both UniRef90 entries and UniRef50 entries is that UniRef50 is a clustering of UniRef90. Hence, some UniRef90 centroids become UniRef50 centroids.

Thanks,
Eric


tiphaine...@gmail.com

unread,
Feb 10, 2017, 8:42:59 AM2/10/17
to HUMAnN Users, tiphaine...@gmail.com
oh ok, thanks,

Do you know if it is possible to split the humann2's files into two files (Uniref50 and Uniref90) or should we need to rerun from fasta files on only one database or can we do from data from metaphlan?

Sorry to ask you but it was run on 300 samples, It took about 3days per sample with a node of 64Go and 16CPU.

Regards,

Tiphaine

Eric Franzosa

unread,
Feb 13, 2017, 2:52:04 PM2/13/17
to humann...@googlegroups.com, tiphaine...@gmail.com
Hi Tiphaine,

You could split the UniRef classes after-the-fact with grep, e.g.:

grep -v "UniRef50" mixed_unirefs.tsv > only_uniref90.tsv

You could then run the resulting TSV files back through HUMAnN2 to recompute pathways (this step is fast, ~1 minute per sample).

However, because reads were initially mapped to both UniRef50s and UniRef90s, their weights will be distributed over both classes of database sequences, and so the result of the "grep" command probably won't be identical to running on (e.g.) UniRef90 in isolation. If you have the computing bandwidth, it's probably safer to just re-run using a single UniRef database.

Thanks,
Eric


Reply all
Reply to author
Forward
0 new messages