Collapsing gene families/humann_regroup_table

1,328 views
Skip to first unread message

Wiley Barton

unread,
Sep 7, 2015, 7:38:08 AM9/7/15
to HUMAnN Users
Dear HUMAnNs,

A few of my colleagues and I have been working through HUMAnN2 and have reached a point where we are beginning to dig through the heaps of data from the various outputs. We were very intrigued by the regroup function as it would grant a resolution of the picture that is both easier to assess and run through R. Unfortunately we've hit a wall with running regroup with the GO option and would appreciate some insight with getting it sorted out. Furthermore, applying any of the uniref50_XX flags yields the "Problem opening uniref50_XX" message as seen below.

Thanks again and all the best,

Wiley



Treating NFATGF.tsv as stratified output, e.g. ['UniRef50_A0A008B8R8: Secretory antigen SsaA', 'unclassified']
Problem opening uniref50_go
Traceback (most recent call last):
  File "/opt/python2.7/bin/humann2_regroup_table", line 9, in <module>
    load_entry_point('humann2==0.1.10', 'console_scripts', 'humann2_regroup_table')()
  File "/opt/python2.7/lib/python2.7/site-packages/humann2-0.1.10-py2.7.egg/humann2/tools/regroup_table.py", line 94, in main
    groups = util.load_polymap( args.groups )
  File "/opt/python2.7/lib/python2.7/site-packages/humann2-0.1.10-py2.7.egg/humann2/tools/util.py", line 162, in load_polymap
    with try_zip_open( path ) as fh:
AttributeError: __exit__

Eric Franzosa

unread,
Sep 8, 2015, 11:33:24 AM9/8/15
to humann...@googlegroups.com
Hi Wiley,

Are you using v0.3+? The convenience flags for the mapping script were a relatively recent addition (previously you had to provide the grouping files yourself).

Thanks,
Eric


Wiley Barton

unread,
Sep 8, 2015, 3:42:56 PM9/8/15
to HUMAnN Users
Dear Eric,

Ah ha! We're currently running v.1.10. I had a sense it would be something like that. Thanks very much for your help Eric.

All the best,
Wiley

Xabier Vázquez Campos

unread,
Sep 15, 2015, 12:38:36 AM9/15/15
to HUMAnN Users
Similar problem here, but with HUMAnN2 v0.3.1

(mypythonenv)[z3382651@katana 028-LFB]$ regroup_table.py -i /srv/scratch/z3382651/meta_analysis/028-LFB/028-LFB_all.trimmed_genefamilies.tsv -g uniref50_ko -o /srv/scratch/z3382651/meta_analysis/
Loading table from: /srv/scratch/z3382651/meta_analysis/028-LFB/028-LFB_all.trimmed_genefamilies.tsv
This is a large file, one moment please...
Treating /srv/scratch/z3382651/meta_analysis/028-LFB/028-LFB_all.trimmed_genefamilies.tsv as stratified output, e.g. ['UniRef50_G9ZMQ8', 'unclassified']
Loading mapping file from: /srv/scratch/z3382651/humann2_v0.3.1/humann2/tools/../data/misc/map_ko_uniref50.txt.gz
Original Feature Count: 2597156; Grouped 1+ times: 142880 (%5.5); Grouped 2+ times: 2597 (%0.1)
Problem opening /srv/scratch/z3382651/meta_analysis/

Traceback (most recent call last):
  File "/srv/scratch/z3382651/humann2_v0.3.1/humann2/tools/regroup_table.py", line 162, in <module>
    main()
  File "/srv/scratch/z3382651/humann2_v0.3.1/humann2/tools/regroup_table.py", line 159, in main
    table.write( args.output )
  File "/srv/scratch/z3382651/humann2_v0.3.1/humann2/tools/util.py", line 142, in write
    writer = csv.writer( fh, dialect='excel-tab' )
TypeError: argument 1 must have a "write" method

Eric Franzosa

unread,
Sep 15, 2015, 10:10:03 AM9/15/15
to humann...@googlegroups.com
Hi Xabier,

It looks like you supplied a folder here:

-o /srv/scratch/z3382651/meta_analysis/

However, this should be the name of a new file you'd like to write to.

Thanks,
Eric


O. AlZahal

unread,
Oct 20, 2015, 4:20:12 PM10/20/15
to HUMAnN Users
Hello, i have a related question. sorry for resurrecting this thread. I ran HUMAnN2 but without the "regroup_table uniref50_ko". Do i need to re-run HUMAnN2 all over again (`15 hrs/sample)? How can i bypass all the steps? 

many thanks

OA

Eric Franzosa

unread,
Oct 20, 2015, 4:35:48 PM10/20/15
to humann...@googlegroups.com
Hello,

Definitely no need to rerun HUMAnN2 for this. The regroup_table script takes HUMAnN2 genefamilies output as its input to generate additional output files. There's a minimal usage case spelled out here in the manual.

Thanks,
Eric


Ousama AlZahal

unread,
Oct 21, 2015, 11:33:40 AM10/21/15
to humann...@googlegroups.com
Many many thanks Eric, I believe that the regrouped file maintain that same unit (RPK). Following the regrouping, how important is it to renormalize? in the annual (Collapsing gene families section), it is stated that: "After regrouping features, users may wish to renormalize the resulting table using the humann2_renorm_table script. This allows users to investigate the relative abundance of their groups of interest, discounting the abundance of features that did not map to a group.” 
Also, it is stated int the Gene families file section of the manual that: "RPK values can be further sum-normalized to adjust for differences in sequencing depth across samples.” Does this mean to calculate a percentage i.e. RPK / sum of RPK * 100. 
if so, the third step would be to normalize the distribution and get the numbers ready for ANOVA? 
What would be an acceptable cut-off point above which, an abundance can be included in the analysis and below which can be merged with “others”?
what do you think the best way to visualize the normalized the gene family data i.e. 0-100% ? a heatmap?

sorry for all theses questions 

many thanks 

OA

Eric Franzosa

unread,
Oct 23, 2015, 11:33:55 AM10/23/15
to humann...@googlegroups.com
It's almost always going to be a good idea to normalize the RPK outputs using the renorm_table script (to adjust for differences in sequencing depth between samples). In some cases it's useful to have the RPK numbers, which is why they are left unnormalized by default. For example, if you want to talk about the coverage of a genome in your sample, you need unnormalized RPK values.

If you sum-normalize RPKs to "copies per million (CPM)" units and then regroup, it's not guaranteed that the new groups will sum up to 100% (1 million CPM). For example, imagine genes A, B, C and D each have abundance 25% (250K CPM), genes A and B belong to the ribosome, and C and D have unknown function. If we regroup by function, we'll assign 50% (500K CPM to the ribosome), and 50% unknown. In another sample with only A and B, we'd assign 100% to the ribosome. The fact that 50% of the first sample had unknown function vs. 0% of the second sample may or may not be interesting to you. If it isn't, then you might renormalize the first sample to add up to 1 million again (so both samples would now be 100% ribosome). This is conceptually similar to forcing a taxonomic profile to sum to 100%, even though one can't necessarily identify all species in the community.

Thanks,
Eric


Ousama AlZahal

unread,
Oct 23, 2015, 1:04:34 PM10/23/15
to humann...@googlegroups.com
Thank you Eric for the detailed answer. I like using the taxonomic profile as an example as i have experience only with 16S and i am new to meta’omics. Regarding the normalization, my approach in my 16s has been always to keep all the unclassifed reads. On average, in most of my samples, 30% of the reads were unclassified. The reason i keep them is to be able to compare studies or batches on a fare ground. I don’t believe that i should compare my study (with 30% unclassified) with someone else’s who had 60% unclassifiable reads. 

In this study compare the shift in rumen's function between two groups of animals (4 received yeast vs. 4 who received placebo) using metatranscriptomics. 
I tried to use my “approach”  by calculating the relative abundance (i.e. 100%) as this approach will be simple and meaningful in comparing the two groups (ANOVA). 
Figuring out the following questions will help decide what to do:

-is renorm_table necessary only if you regroup, or does the script also can be used to convert to CPM or relab.
-the way i understand it that renorm-table does two things 1) omit the ungrouped genes to any function and 2) convert the unit from RPK to CPM (RPK per million?) or relativ abundance (total of 1).
-Can i normalize without ommitting any genes (consider using a ungroups group..). I am curious to know who many did not group to a function.
- what is the sequence of events in this case, regroup > rename > renormalize? 
- When i collapsed my uniref50 (1,1 million gene families) to uniref_ko (19k), the screen said only 85% were renamed. I checked the file and there were some of the “NO NAME” lines. would this be a large loss of information.
- even when collapsing to KO, it is still very large number of gene families. my plan is to use a cut off point i.e. >XX% in all samples and then present only the statistically significant (ANOVA) between the groups of animals. what would be an acceptable cut-off point?

many many thanks for all the help

OA

Eric Franzosa

unread,
Oct 26, 2015, 12:36:57 PM10/26/15
to humann...@googlegroups.com
* It's a good idea to run renorm on all tables before doing statistical analysis. Renorm will by default convert to CPM units, but you can also request relative abundance.

* renorm only does the conversion to CPM/relab. Regroup_table will sum rows of a table (whether they are normalized or not) if they below to the same "group" (typically UniRef50 -> broader gene family or functional module). It is recommended to renorm before regrouping.

* regroup doesn't explicitly output the "ungrouped" features. It gives a summary on the command line of how many were grouped/ungrouped. If you provide normalized input to regroup, then you can infer the abundance of the ungrouped features. For example, if your original features were normalized to CPM units, then the community totals would add up to a million. If, after regrouping, the totals only add up to 700K, this suggests that 300K worth of original features did not belong to a group. (Caveat: This becomes somewhat more complicated if features belong to more than one group.)

* If, after regrouping, you want all samples to sum to the same value again, you would run renorm_table again. In the above example, this is equivalent to saying that you aren't interested in the 300K CPM of features that don't belong to a group (which may differ from sample to sample).

* Not being able to rename all KOs is an unfortunate side effect of KEGG's commercialization. We are able to systematically map UniRef50s to KOs based on free data in UniProt, but this service doesn't include the KO names. We extract KO names from the last free version of KEGG, which does not include names for some of the newer KOs (though you could still look them up individually on the KEGG website if they prove interesting in your analyses).

* To improve statistical power, we usually limit the number of features we test. One easy way to do this is with a variance filter: e.g. selecting the top 10% most variable features across your samples. This tends to isolate features that are fairly abundant and fairly prevalent, while discounting super-conserved features (e.g. ribosomal proteins) that are unlikely to differ in abundance across samples.

Thanks,
Eric


Ousama AlZahal

unread,
Oct 26, 2015, 3:48:07 PM10/26/15
to humann...@googlegroups.com
Many thanks Eric for the detailed answer. It is very clear now. I normalized (relab) my samples but since the file were too big, i could not sum up the numbers (if anybody has a linux code to do that, it is appreciated), but i expect it would be 2, which is twice as much due to the total line. I regrouped the samples and it sounds i got a low grouping rate:
"Original Feature Count: 1191104; Grouped 1+ times: 81031 (%6.8); Grouped 2+ times: 1566 (%0.1)” 
The regrouped file was much smaller so i was able to open it with Excel. The lines summed up to 39% i.e. the grouped is about 20% only.
Is my interpretation correct? is this low number acceptable? Shall i skip the regrouping and just focus on filtering the variances? 
My samples come from the rumen of cows. Each samples contained ~ 80 million quality checked sequences. Only 3% were classifiable and only ~%50 trasnlated to protein (350 million hits). I find my samples are close to someone else’s in the group, but their samples were environmental samples. 
I have not started looking into the pathway files and i am not sure if i should skip the gene families file. 

many thanks Eric

OA 
Reply all
Reply to author
Forward
0 new messages