How to associate reads with the MetaCyc pathways of the genes they are mapped to by humann ?

assa...@gmail.com

unread,

Oct 3, 2015, 4:04:18 PM10/3/15

to HUMAnN Users

Hi all,

After running Huammn2, I need to retrieve all Huammn2 mapped-reads that belong to specific MetaCyc pathways, for further experimental and in-silico analyses.

Humann2 read-level data is found in the temporary output directory, as two files:

assembled_diamond_aligned.tsv (diamond mapping)
assembled_bowtie2_aligned.tsv (metaphlan2 mapping)

Attached below the content of the above tsv files, and in these files we can see that read names are associated with UniRef50 annotations, and other information.

I would like to ask:

1) How is it possible to associate each read name to the MetaCyc pathways of its gene ?

2) In the above tsv files, many ref50 annotations have "UPI" prefix (e.g., UniRef50_UPI000377882A), while in many cases this prefix is missing (e.g., UniRef50_Q31PL8).
Only the second type of annotation is found in the online UniRef50 database.
What is the difference between these two? is it possible to convert the first type to the second type ?

thanks,
Assaf

diamond output:

SBS123:274:HCJKLADXX:1:1101:1202:2184   UniRef50_UPI000377882A|519      49.2    63      27      2       183     10      97      159     9.2e-09 63.5
SBS123:274:HCJKLADXX:1:1101:1202:2184   UniRef50_UPI000465167D|1155     41.4    58      34      0       183     10      92      149     8.6e-07 57.0
SBS123:274:HCJKLADXX:1:1101:1202:2184   UniRef50_UPI0003B64A9E|1149     34.5    55      36      0       180     16      100     154     5.8e-03 44.3

bowtie (metaphlan2) output:

SBS123:274:HCJKLADXX:1:1101:1393:2135   gi|422303309|ref|NZ_HE973236.1|:c40675-39428|1126|g__Microcystis.s__Microcystis_aeruginosa|UniRef90_I4HGB4|UniRef50_Q31PL8|1248 97.0802919708   274.00
SBS123:274:HCJKLADXX:1:1101:1641:2075   gi|166362741|ref|NC_010296.1|:c3748736-3748173|1126|g__Microcystis.s__Microcystis_aeruginosa|UniRef90_B0JR36|UniRef50_K9YV63|564        330.88235294168.0                                                     0
SBS123:274:HCJKLADXX:1:1101:4818:2169   gi|425445115|ref|NZ_HE972958.1|:c50681-50229|1126|g__Microcystis.s__Microcystis_aeruginosa|UniRef90_I4INT3|UniRef50_I4GHF4|453  96.8468468468   222.00

Eric Franzosa

unread,

Oct 5, 2015, 11:17:53 AM10/5/15

to humann...@googlegroups.com

Hi Assaf,

We associate UniRef50s to pathways by a two-step process: 1) mapping genes to reactions, and then 2) mapping reactions to pathways. #1 is handled by the file metacyc_reactions_level4ec_only.uniref.gz and #2 is handled by the file metacyc_pathways (both under humann2/data/pathways). If you need to map reads to pathways, you would start from the read->uniref50 mappings from the raw alignment output (mentioned in your message), then map uniref50->rxn, and finally rxn->pathway.

In reference to your 2nd question, UniRef50 is a clustering of a few different protein databases. The largest of these is UniProt, whose proteins have swissprot-style IDs (e.g. Q31PL8). UniRef also includes UniParc, whose proteins have IDs like UPI000377882A. I don't believe we can map from UniParc-style IDs to UniProt-style-IDs. Indeed, if a UniParc protein occurs as a centroid in UniRef, I believe that means it is not represented in UniProt.

Thanks,

Eric

assa...@gmail.com

unread,

Oct 9, 2015, 2:02:34 PM10/9/15

to HUMAnN Users

Thanks Eric,

I guess that 'PWY-2681' and 'RXN-4304' like names represent pathway IDs, and reaction IDs, respectively. I hope I'm correct.
Yet, I cannot see a description of pathways there.
The files metacyc_pathways_to_organisms possibly has a pathway name, e.g. '4-HYDROXYMANDELATE-DEGRADATION-PWY'. I hope I'm correct.
So, since metacyc_pathways_to_organisms and metacyc_pathways have the same number of lines, is it that each record in the fist file corresponds to record in the second file in the line?

Assaf

assa...@gmail.com

unread,

Oct 9, 2015, 2:42:21 PM10/9/15

to HUMAnN Users

OK I think I understand - the file with the mapping from pathway ID to pathway name is map_metacyc-pwy_name.txt in the misc directoy

Assaf

Eric Franzosa

unread,

Oct 9, 2015, 3:51:19 PM10/9/15

to humann...@googlegroups.com

Hi Assaf,

Sounds like you worked this out, but to confirm: the items in the pathway definitions are MetaCyc's internal identifiers for pathways and reactions, and the human-understandable names are in the pathway names file.

Thanks,

Eric

Josh Espinoza

unread,

Aug 9, 2018, 2:37:18 PM8/9/18

to HUMAnN Users

Extend this conversation a bit to include some of the edge cases regarding the series of events associating reads to the MetaCyc/taxonomy attributes in the resulting abundance profile.

I would like to include the HUMAnN2 analysis in a publication we plan on submitting to a high impact journal but I need to explain exactly what is happening to the collaborators; getting the reads organized is a critical component.

According to the conversation above this is what appears to be going on, PLEASE CORRECT WHERE NECESSARY:

1. Reads would be mapped to genes in the following files in the [name]_temp/ directory:

1a. [name]_diamond_aligned.tsv

NS500647:186:HV3F5BGX2:1:11101:17415:10831:N:0:CTCAGA gi|400294433|ref|NZ_ALJK01000240.1|:c8692-7928|1655|g__Actinomyces.s__Actinomyces_naeslundii|UniRef90_J2ZLR9|UniRef50_R5IQW2|765 99.18032786885246 122.0 0 
NS500647:186:HV3F5BGX2:1:11101:20301:11561:N:0:CTCAGA gi|288801553|ref|NZ_GG740010.1|:c46051-44267|28132|g__Prevotella.s__Prevotella_melaninogenica|UniRef90_D9RTD5|UniRef50_R5FHH6|1785 93.27731092436974 119.0 0 
NS500647:186:HV3F5BGX2:1:11101:21205:85401:N:0:CTCAGA gi|512460964|ref|NZ_KE150253.1|:c257004-255034|45242|g__Capnocytophaga.s__Capnocytophaga_granulosa|UniRef90_J5Y4E1|UniRef50_F8EB68|1971 95.86206896551724 145.0

1b. [name]_bowtie2_aligned.tsv

NS500647:186:HV3F5BGX2:1:11101:17082:23151:N:0:CTCAGA|146 UniRef90_D1BPC2|753 75.0 40 10 0 6 125 212 251 1.5e-11 70.9 
NS500647:186:HV3F5BGX2:1:11101:17082:23151:N:0:CTCAGA|146 UniRef90_K0Y074|828 73.2 41 11 0 3 125 236 276 2.5e-11 70.1 
NS500647:186:HV3F5BGX2:1:11101:17082:23151:N:0:CTCAGA|146 UniRef90_E4L934|762 75.0 40 10 0 3 122 214 253 2.5e-11 70.1

2. These genes/proteins/protein-cluster from [name]_bowtie2_aligned.tsv without taxonomy information and from [name]_diamond_aligned.tsv with taxonomy information would be associated with the reactions from the following file:

2a. humann2/data/pathways/metacyc_reactions_level4ec_only.uniref.bz2

GLUCOSE-1-PHOSPHATE-PHOSPHODISMUTASE-RXN 2.7.1.41 UniRef50_G6EMD2 UniRef50_Q48UI5 UniRef50_T0UKK6 UniRef90_F4BRY1 UniRef90_G4L3Q0 UniRef90_G6EMD2 UniRef90_K0J4V1 UniRef90_R9TSN7 UniRef90_T0T682 UniRef90_T0UKK6

3. The reactions from this would be associated with the pathway identifiers from this file:

humann2/data/pathways/metacyc_pathways

PWY-2681 RXN-4303 RXN-4304 RXN-4310 RXN-4305 RXN-4306 RXN-4312 RXN-4308 RXN-4314 RXN-4307 RXN-4313 RXN-4317
PWY1G-126 1.8.1.15-RXN RXN1G-6
METHGLYUT-PWY 1.1.1.283-RXN LACTALDDEHYDROG-RXN L-LACTDEHYDROGFMN-RXN RXN0-4281 RXN-8632 GLYOXIII-RXN GLYOXI-RXN GLYOXII-RXN DLACTDEHYDROGFAD-RXN

Is the above pipeline correct or am I missing details?

My specific questions about edge cases:

(i) `PWY-5030: L-histidine degradation III|g__Streptococcus.s__Streptococcus_sanguinis`

Would this HUMAnN2 attribute from the abundance profile contain all of the genes from all of the reactions in the [name]_diamond_aligned.tsv since there is taxonomy associated with the identifier?

(ii) `PWY-2942: L-lysine biosynthesis III`

Would this one be from all of the genes in [name]_bowtie2_aligned.tsv since there is no taxonomy information in the identifier and this does not contain taxonomy information because it was identifier to an orthologous group?

(iii) `UNINTEGRATED|g__Streptococcus.s__Streptococcus_sanguinis`

I'm not sure where the read -> organism mapping file is located.

(iv) Many of the UniRef(5/9)0_XYZ identifiers are not in in the humann2/data/pathways/metacyc_reactions_level4ec_only.uniref.bz2 file. How could these be handled? Is there another file where I should be looking for this information?

Thank you,

Josh

Eric Franzosa

unread,

Aug 13, 2018, 1:21:05 PM8/13/18

to humann...@googlegroups.com

Hi Josh,

In answer to your specific questions:

i) Information from the diamond output would not be used to quantify PWY-X|Species-Y: all species-stratified results are based on the read-level mapping performed with bowtie2 against the species' pangenome. The diamond mapping factors into the abundance of the unclassified stratum (e.g. PXY-X unclassified) and the pathway's community total (i.e. the initial line for the pathway without a "|").

ii) This is a pathway total, so it would be based on any reads that mapped to sequences (by bowtie2 or diamond) associated with the pathway's reactions.

iii) UNINTEGRATED|Species-Y is based on all of the UniRef abundances from the genefamilies output that were not used in pathway quantification (i.e. which were not regrouped into a reaction that then contributed to a reported pathway).

iv) Indeed, many UniRef families are _not_ associated with known metabolic reactions. These UniRefs are included in the UNINTEGRATED abundance feature described above.