Database and output doubts

127 views
Skip to first unread message

IÑIGO OYARZUN LAFUENTE

unread,
Sep 14, 2020, 6:02:31 AM9/14/20
to NGLess
Hello NGLess developing group!

I am starting a project on metatranscriptomics for the first time. First I will use different pipelines (ng-meta-profiler is one of them) with different combinations of reference databases (IGC and Uniref90). This way I will carry on the project with the best combination of them.

So, my first doubt is about how to implement the Uniref90 reference database. I've been looking through the documentation and I am not sure if I have to create a brand new module for it or simply add the lines below to the human_gut_profiler.ngl script:
         references:
             -
                name: 'Uniref'
                fasta-file: '../humann3/uniref/uniref90.fa'
          ... ... :..
          igc_mapped = map(non_human_reads, reference='Uniref', mode_all=True)

My second doubt is about the results.On specI.raw.counts.txt, motus.counts.txt and specI.scaled.counts.txt files the result are organized on cluster instead than in microorganisms species and I have tried to look for those clusters on motus database but without success, so i don't really know where those clusters come from, how to interpret them or what are they refering to.
Similar problem with the eggNOG.traditional.counts.txt file, I don't really know where to look for the codes (for instance,  004S6@aciNOG) for retrieve some information from the results.

Thank you, have a good day!
Iñigo.
         



Luis Pedro Coelho

unread,
Sep 29, 2020, 5:39:17 AM9/29/20
to IÑIGO OYARZUN LAFUENTE, NGLess List
Hi Iñigo,

(Sorry for the delay, your message go lost in the shuffle a bit).


So, my first doubt is about how to implement the Uniref90 reference database. I've been looking through the documentation and I am not sure if I have to create a brand new module for it or simply add the lines below to the human_gut_profiler.ngl script:
         references:
             -
                name: 'Uniref'
                fasta-file: '../humann3/uniref/uniref90.fa'
          ... ... :..
          igc_mapped = map(non_human_reads, reference='Uniref', mode_all=True)

No, this is mixing up two different ways of doing things.

If you want to just use it in the script, you can pass the fasta file directly

uniref_mapped = map(non_human_reads, fafile='../humann3/uniref/uniref90.fa', mode_all=True)

Alternatively, you can build a "module" that encodes these paths using the yaml format.

My second doubt is about the results.On specI.raw.counts.txt, motus.counts.txt and specI.scaled.counts.txt files the result are organized on cluster instead than in microorganisms species and I have tried to look for those clusters on motus database but without success, so i don't really know where those clusters come from, how to interpret them or what are they refering to.


I don't know if you have followed some of the most recent discussions, but we are now advising people to use a more recent version of motus: https://github.com/ngless-toolkit/ngless-contrib/tree/master/motus.ngm


Information as to what the clusters are can be found in the original motus tool website:



Similar problem with the eggNOG.traditional.counts.txt file, I don't really know where to look for the codes (for instance,  004S6@aciNOG) for retrieve some information from the results.

These refer to eggnog OGs: http://eggnog5.embl.de/#/app/results

Best
Luis

IÑIGO OYARZUN LAFUENTE

unread,
Oct 4, 2020, 4:47:40 PM10/4/20
to NGLess
Hello Luis,

Do not worry about the delay and thank you for answering :)

No, this is mixing up two different ways of doing things.

If you want to just use it in the script, you can pass the fasta file directly

uniref_mapped = map(non_human_reads, fafile='../humann3/uniref/uniref90.fa', mode_all=True)

Alternatively, you can build a "module" that encodes these paths using the yaml format.
Yes, reading again the documentation and looking through how other modules are built I figured out the way to do a module and also used the "fafile=" argument in different ngl scripts. Anyway, the scripts failed and then I realized that NGLess reference databases are in DNA format while Uniref90 is in protein format so it can not be used.

I don't know if you have followed some of the most recent discussions, but we are now advising people to use a more recent version of motus: https://github.com/ngless-toolkit/ngless-contrib/tree/master/motus.ngm
I already updated to mOTUs 2.5 and it is much more intuitive, thank you!

These refer to eggnog OGs: http://eggnog5.embl.de/#/app/results
I found it in other chats of the forum, so sorry for asking it before looking through the other chats. Anyhow I can't find some of the codes in the link, I guess because its eggnog5 while NGLess is built with eggnog4.5 (if i remember well). So, do you know where can I find the Eggnog4.5 website, annotations, database or whatever?

Also, I would like to comment a doubt I have about the preprocessing step. I am working with paired end metatrancriptomics data that we processed with kneaddata software in order to remove human RNA and ribosomal RNA. So, I can skip the preprocessing step in the ngl scripts, and then I have been trying different combinations of functions and steps on the scripts. I have figured out that, when I skip the preprocessing step,  during the mapping step the total reads number is similar to the sum of the two fastq.gz files sequences. Meanwhile, if I keep the preprocessing step the number of reads is bit lower than the number of sequences in one of the files. In both cases, the load_mocat_sample recognised the two fastq.gz files as paired end files. So it seems like the preprocessing step put together the paired reads, Do you know what is it due to or what is the fundament of this? Please if you have any doubt about the question or need the log files, don't hesitate in asking for them.

Thank you very much, best regards.
Iñigo.
 

Luis Pedro Coelho

unread,
Oct 7, 2020, 6:36:24 AM10/7/20
to NGLess List


These refer to eggnog OGs: http://eggnog5.embl.de/#/app/results
I found it in other chats of the forum, so sorry for asking it before looking through the other chats. Anyhow I can't find some of the codes in the link, I guess because its eggnog5 while NGLess is built with eggnog4.5 (if i remember well). So, do you know where can I find the Eggnog4.5 website, annotations, database or whatever?


Also, I would like to comment a doubt I have about the preprocessing step. I am working with paired end metatrancriptomics data that we processed with kneaddata software in order to remove human RNA and ribosomal RNA. So, I can skip the preprocessing step in the ngl scripts, and then I have been trying different combinations of functions and steps on the scripts. I have figured out that, when I skip the preprocessing step,  during the mapping step the total reads number is similar to the sum of the two fastq.gz files sequences. Meanwhile, if I keep the preprocessing step the number of reads is bit lower than the number of sequences in one of the files.
In both cases, the load_mocat_sample recognised the two fastq.gz files as paired end files. So it seems like the preprocessing step put together the paired reads, Do you know what is it due to or what is the fundament of this?

This seems strange, then. My first guess was that the you had changed the file names so that load_mocat_sample() was no longer recognizing them, but you say it's not that.

Please if you have any doubt about the question or need the log files, don't hesitate in asking for them.

Please do send the logs, yes (especially if you ran it with `--trace`). Feel free to use my personal email instead of the listserv if you are more comfortable with that.

Best
Luis

Thank you very much, best regards.
Iñigo.
 


--
You received this message because you are subscribed to the Google Groups "NGLess" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ngless+un...@googlegroups.com.

Xianghui Liu

unread,
Dec 14, 2020, 5:28:39 AM12/14/20
to NGLess
Dear Luis:
I also have the same concern about the database creation using my own data, for example, the genes from my denovo assembly of metagenomics dataset.
Is there any requirement for those gene nucleotide fasta file?  e.g., naming, special characters?
And I am suppose to use bwa to create index and database from that fasta file or it will be automatically created when it find out that there are no bwa index for that?
I asked this because I noticed that there are bwa files for the fasta file, e.g, under om-rgc.ngm folder, there are those files.
OM-RGC.fna-bwa-0.7.17.gz.amb  OM-RGC.fna-bwa-0.7.17.gz.ann  OM-RGC.fna-bwa-0.7.17.gz.bwt  OM-RGC.fna-bwa-0.7.17.gz.pac  OM-RGC.fna-bwa-0.7.17.gz.sa  OM-RGC.fna.gz  OM-RGC.functional.map.gz
Moreover, for the current reference database, you provide annotation file OM-RGC.functional.map.gz.
May I know how you link the annotation file to the fasta file.,,, is it by module.yaml file?

version: '1.0.0'
name: 'om-rgc'
references:
    -
        name: 'om-rgc'
        fasta-file: 'data/OM-RGC.fna.gz'
        map-file: 'data/OM-RGC.functional.map.gz'
citation: >

I also noticed that you have the Makefile there. 
Please kindly suggest me how shall I link them in command line or how to create module like what you did.
Do you have any documents for that?
Regards,
Xianghui

Luis Pedro Coelho

unread,
Dec 14, 2020, 6:06:58 AM12/14/20
to NGLess List

I also have the same concern about the database creation using my own data, for example, the genes from my denovo assembly of metagenomics dataset.
Is there any requirement for those gene nucleotide fasta file?  e.g., naming, special characters?

No. Anything that would work with bwa works here.

And I am suppose to use bwa to create index and database from that fasta file or it will be automatically created when it find out that there are no bwa index for that?

Automatically created on the first use.

Moreover, for the current reference database, you provide annotation file OM-RGC.functional.map.gz.
May I know how you link the annotation file to the fasta file.,,, is it by module.yaml file?

Yes, although you can also use it separately.

In the end, the only big advantage of using the module.yaml file is that it makes sure that you are using the right annotation file with the right FNA file. But you can also just use them outside the module system by passing them to the map and count functions, respectively.



I also noticed that you have the Makefile there. 

The Makefiles just help with downloading and preprocessing the files, but once we the module.yaml file (+ the FASTA + the annotation files, with these being optional), then they are not used anymore.

Please kindly suggest me how shall I link them in command line or how to create module like what you did.
Do you have any documents for that?

Not directly, but this seems to be a common enough workflow, that we should have an actual tutorial with details.

(It's the end of the day in my timezone, but maybe tomorrow, I can write a little doc)

Best
Luis

Regards,
Xianghui

在2020年9月29日星期二 UTC+8 下午5:39:17<lu...@luispedro.org> 写道:

Hi Iñigo,

(Sorry for the delay, your message go lost in the shuffle a bit).


So, my first doubt is about how to implement the Uniref90 reference database. I've been looking through the documentation and I am not sure if I have to create a brand new module for it or simply add the lines below to the human_gut_profiler.ngl script:
         references:
             -
                name: 'Uniref'
                fasta-file: '../humann3/uniref/uniref90.fa'
          ... ... :..
          igc_mapped = map(non_human_reads, reference='Uniref', mode_all=True)

No, this is mixing up two different ways of doing things.

If you want to just use it in the script, you can pass the fasta file directly

uniref_mapped = map(non_human_reads, fafile='../humann3/uniref/uniref90.fa', mode_all=True)

Alternatively, you can build a "module" that encodes these paths using the yaml format.

My second doubt is about the results.On specI.raw.counts.txt, motus.counts.txt and specI.scaled.counts.txt files the result are organized on cluster instead than in microorganisms species and I have tried to look for those clusters on motus database but without success, so i don't really know where those clusters come from, how to interpret them or what are they refering to.


I don't know if you have followed some of the most recent discussions, but we are now advising people to use a more recent version of motus: https://github.com/ngless-toolkit/ngless-contrib/tree/master/motus.ngm


Information as to what the clusters are can be found in the original motus tool website:



Similar problem with the eggNOG.traditional.counts.txt file, I don't really know where to look for the codes (for instance,  004S6@aciNOG) for retrieve some information from the results.

These refer to eggnog OGs: http://eggnog5.embl.de/#/app/results

Best
Luis


--
You received this message because you are subscribed to the Google Groups "NGLess" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ngless+un...@googlegroups.com.

Xianghui Liu

unread,
Dec 16, 2020, 2:22:53 AM12/16/20
to NGLess
Thank you.  Luis.

fasta file can be passed at parameter fafile.... but how about the annotation? 
Moreover, count function has parameter features but not for the database to search. Map seems to have a parameter to specify reference.

mapped = map(input, reference='om-rgc', mode_all=True)
count(mapped,
                features=['KEGG_ko', 'eggNOG_OG'],
                normalization={scaled})

Accordingly ngless-count.py , ngless-map.py use commands to do that.

ngless-count.py [-h] -i INPUT -o OUTPUT [-f FEATURES] [-m {dist1,all1,1overN,unique_only}] [--auto-install] [--debug]

ngless-map.py [-h] -i INPUT [-i2 INPUT_REVERSE] [-s INPUT_SINGLES] -o OUTPUT [--auto-install] [--debug] (-r {sacCer3,susScr11,ce10,dm3,gg4,canFam2,rn4,bosTau4,mm10,hg19} | -f FASTA)

My question is how to specify the reference file. -r use the reference database name , so we still need to create the database first?

Moreover, if I only have the database fasta file and do not have an annotation, will the the map function won't check the annotation file and count function only try to look for annotation file if the features are specified.

Please kindly suggest,

Regards 

Xianghui


Xianghui Liu

unread,
Dec 16, 2020, 6:46:15 AM12/16/20
to NGLess
Dear Luis:
I managed to create one using a test fasta file and eggnog annotation file.

However, it complained the columns of annotation file does not match.
Here is the error message.
(base) [xianghui@ln-0001 loreal]$ ngless loreal-demo.ngl
NGLess v1.1.0 (C) NGLess authors

When publishing results from this script, please cite the following references:

- Coelho, L.P., Alves, R., Monteiro, P., Huerta-Cepas, J., Freitas, A.T., and Bork, P.,
NG-meta-profiler: fast processing of metagenomes using NGLess, a domain-specific language. in
Microbiome 7:84 (2019). DOI: http://doi.org/10.1186/s40168-019-0684-8

-

- Li, H., 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv
preprint arXiv:1303.3997.

- MOCAT2: a metagenomic assembly, annotation and profiling framework. Kultima JR, Coelho LP, Forslund
K, Huerta-Cepas J, Li S, Driessen M, et al. (2016) Bioinformatics (2016)
doi:10.1093/bioinformatics/btw183

- MOCAT: A Metagenomics Assembly and Gene Prediction Toolkit. Kultima JR, Sunagawa S, Li J, Chen W, Chen
H, Mende DR, et al. (2012) PLoS ONE 7(10): e47656. doi:10.1371/journal.pone.0047656


[Wed 16-12-2020 18:30]: Script OK. Starting interpretation...
[Wed 16-12-2020 18:30] Line 7: lock1: Obtained lock file: 'ngless-locks/f5eb6f97/SAMEA2621033.sampled.lock'
[Wed 16-12-2020 18:30] Line 7: Writing stats to 'ngless-stats/f5eb6f97/SAMEA2621033.sampled'
[Wed 16-12-2020 18:30] Line 8: load_mocat_sample found single-end sample 'SAMEA2621033.sampled/ERR594391_1.fastq.gz.short.fq.gz'
[Wed 16-12-2020 18:30] Line 8: load_mocat_sample found single-end sample 'SAMEA2621033.sampled/ERR594391_2.fastq.gz.short.fq.gz'
[Wed 16-12-2020 18:30] Line 14: Start BWA index creation for ./Modules/loreal.ngm/1.0/data/loreal.fa
[Wed 16-12-2020 18:31] Line 14: Success
[Wed 16-12-2020 18:31] Line 14: Starting mapping to ./Modules/loreal.ngm/1.0/data/loreal.fa

[Wed 16-12-2020 18:31] Line 14: Success
[Wed 16-12-2020 18:31] Line 14: Mapped readset stats (./Modules/loreal.ngm/1.0/data/loreal.fa):
[Wed 16-12-2020 18:31] Line 14: Total reads: 368045
[Wed 16-12-2020 18:31] Line 14: Total reads aligned: 230 [0.06%]
[Wed 16-12-2020 18:31] Line 14: Total reads Unique map: 80 [0.02%]
[Wed 16-12-2020 18:31] Line 14: Total reads Non-Unique map: 150 [0.04%]
[Wed 16-12-2020 18:31] Line 16: Annotate with reference: "loreal"
[Wed 16-12-2020 18:31] Line 16: Loading map file ./Modules/loreal.ngm/1.0/data/loreal.functional.map
Exiting after fatal error while loading and running script
Data Error (the input data did not conform to NGLess' expectations)
Loading functional map file './Modules/loreal.ngm/1.0/data/loreal.functional.map' [line 60697]: wrong number of columns



 The following is the format of my annotation file. I did the annotation using # emapper version: emapper-2.0.1-14-gbf04860 emapper DB: 2.0  with # command: ./emapper.py  --translate --override -i predsResultProteins.uniq.fa.cdhit --output predsResultProteins.uniq.fa.cdhit_maNOG -m diamond and later I change the predsResultProteins.uniq.fa.cdhit  to loreal.fa and the predsResultProteins.uniq.fa.cdhit_maNOG to loreal.functional.map.
Please kindly suggest    I found out the one you provided for igc IGC_catalog-v1.0.0.emapper.annotations-v2.tsv has the following header which is different from mine.  I don't understand that why they are different.  
#query_name     seed_eggNOG_ortholog    seed_ortholog_evalue    seed_ortholog_score     predicted_gene_name     GO_terms        KEGG_KOs        BiGG_reactions  Annotation_tax_scope    OGs     bestOG|evalue|score     COG cat eggNOG annot



#query_name     seed_eggNOG_ortholog    seed_ortholog_evalue    seed_ortholog_score     best_tax_level  Preferred_name  GOs     EC      KEGG_ko KEGG_Pathway    KEGG_Module     KEGG_Reaction   KEGG_rclass     BRITE   KEGG_TC CAZy    BiGG_Reaction   taxonomic scope eggNOG OGs      best eggNOG OG  COG Functional cat.     eggNOG free text desc.
A0A2R8PGX8|f_P023_S10_k141_135708|-|138|1.569e-31|2|300|592|592[592]:548[548]:45[45]|467[473]:300[300]:168[162] 9593.ENSGGOP00000025807 1.3e-28 131.7   Hominidae       IGHV1-3                                                                                         Mammalia        2CD1S@1,2SQQV@2759,35SE1@314146,3AKYV@33154,3BST0@33208,3D990@33213,3JKSN@40674,48FYJ@7711,49D47@7742,4MQFD@9443,4N8U7@9604     NA|NA|NA        S       Immunoglobulin V-Type
A4QMX0|f_P043_S36_k141_599865|-|86|7.064e-16|1|285|530|530[530]:285[285]:246[246]       9598.ENSPTRP00000060741 4.5e-12 77.0    Hominidae                                                                                                       Mammalia        2CZE2@1,2S9XJ@2759,35RYM@314146,3AJ63@33154,3BX24@33208,3DD5X@33213,3JF0N@40674,48KPS@7711,49HE8@7742,4MNYG@9443,4N91J@9604     NA|NA|NA
H2PRJ6|f_P043_S36_k141_599865|-|99|8.623e-20|2|1738|1995|1995[1995]:1846[1846]:150[150]|1827[1827]:1738[1738]:90[90]    9544.ENSMMUP00000020050 2e-20   104.8   Metazoa                                                                                                 Metazoa 2EWTI@1,2SYJN@2759,39988@33154,3C3WA@33208      NA|NA|NA
A0A1A9ASW2|f_P043_S36_k141_351291|+|477|1.401e-133|4|438|1942|438[438]:671[671]:234[234]|850[850]:1092[1092]:243[243]|1316[1316]:1735[1735]:420[420]|1790[1802]:1942[1942]:153[141]     9606.ENSP00000451270    2.8e-79 302.4   Hominidae               GO:0000976,GO:0000977,GO:0000978,GO:0000981,GO:0000982,GO:0000987,GO:0001012,GO:0001067,GO:0001077,GO:0001228,GO:0003674,GO:0003676,GO:0003677,GO:0003690,GO:0003700,GO:0005488,GO:0005575,GO:0005622,GO:0005623,GO:0005634,GO:0005635,GO:0005654,GO:0005730,GO:0005737,GO:0005794,GO:0005829,GO:0006139,GO:0006351,GO:0006355,GO:0006357,GO:0006366,GO:0006725,GO:0006807,GO:0006915,GO:0008150,GO:0008152,GO:0008219,GO:0008285,GO:0009058,GO:0009059,GO:0009889,GO:0009891,GO:0009893,GO:0009987,GO:0010467,GO:0010468,GO:0010556,GO:0010557,GO:0010564,GO:0010604,GO:0010628,GO:0010948,GO:0012501,GO:0012505,GO:0016020,GO:0016070,GO:0018130,GO:0019219,GO:0019222,GO:0019438,GO:0031090,GO:0031323,GO:0031325,GO:0031326,GO:0031328,GO:0031965,GO:0031967,GO:0031974,GO:0031975,GO:0031981,GO:0032774,GO:0034641,GO:0034645,GO:0034654,GO:0042127,GO:0043170,GO:0043226,GO:0043227,GO:0043228,GO:0043229,GO:0043231,GO:0043232,GO:0043233,GO:0043565,GO:0044212,GO:0044237,GO:0044238,GO:0044249,GO:0044260,GO:0044271,GO:0044422,GO:0044424,GO:0044428,GO:0044444,GO:0044446,GO:0044464,GO:0045786,GO:0045893,GO:0045935,GO:0045944,GO:0046483,GO:0048518,GO:0048519,GO:0048522,GO:0048523,GO:0050789,GO:0050794,GO:0051171,GO:0051173,GO:0051252,GO:0051254,GO:0051726,GO:0060255,GO:0065007,GO:0070013,GO:0070316,GO:0070317,GO:0071704,GO:0080090,GO:0090304,GO:0097159,GO:0097659,GO:0140110,GO:1901360,GO:1901362,GO:1901363,GO:1901576,GO:1902680,GO:1903506,GO:1903508,GO:1990837,GO:2000112,GO:2001141                                                                                  Mammalia        35QRQ@314146,3AKJQ@33154,3C0G1@33208,3D7UI@33213,3JH5Z@40674,48F2Z@7711,49BWF@7742,4MJTC@9443,4N505@9604,KOG0849@1,KOG0849@2759 NA|NA|NA        K       PRD class homeoboxes and pseudogenes
A0A0E1YH99|f_P064_S7_k141_30663|+|632|3.067e-180|1|297|1337|297[297]:1337[1337]:1041[1041]      765123.HMPREF9621_00755 1.6e-156        558.9   Propionibacteriales     fhuG7                   ko:K02015       ko02010,map02010        M00240                  ko00000,ko00001,ko00002,ko02000 3.A.1.14                        Bacteria        2IHH9@201174,4DWTB@85009,COG0609@1,COG0609@2    NA|NA|NA        P       FecCD transport family
U7JP37|f_P006_S1_k141_14012|+|697|8.313e-200|1|0|1130|0[0]:1130[1130]:1131[1131]        1292373.H640_01012      2.4e-153        548.5   Propionibacteriales     hemA    GO:0005575,GO:0005623,GO:0009288,GO:0042597,GO:0042995,GO:0043226,GO:0043228,GO:0044464,GO:0055040      1.2.1.70        ko:K02407,ko:K02492     ko00860,ko01100,ko01110,ko01120,ko02040,map00860,map01100,map01110,map01120,map02040    M00121  R04109  RC00055,RC00149 ko00000,ko00001,ko00002,ko01000,ko02035                 iSB619.SA_RS08420       Bacteria        2GJRA@201174,4DPDT@85009,COG0373@1,COG0373@2    NA|NA|NA        H       Catalyzes the NADPH-dependent reduction of glutamyl- tRNA(Glu) to glutamate 1-semialdehyde (GSA)
F7CHG7|f_P006_S1_k141_107219|-|128|1.606e-28|2|2967|3242|3242[3242]:3048[3048]:195[195]|3032[3032]:2967[2967]:66[66]    9544.ENSMMUP00000006312 4.9e-25 120.2   Metazoa                                                                                                 Metazoa 2EWTI@1,2SYJN@2759,39988@33154,3C3WA@33208      NA|NA|NA
A0A2I3LWX5|f_P006_S1_k141_107219|+|106|6.737e-22|2|2950|3171|2950[2950]:3021[3021]:72[72]|3043[3043]:3171[3171]:129[129]        61853.ENSNLEP00000009137        1.4e-19 101.7   Primates        SIRT7   GO:0000122,GO:0000228,GO:0000278,GO:0000280,GO:0000785,GO:0000790,GO:0003674,GO:0003682,GO:0003824,GO:0004407,GO:0005488,GO:0005575,GO:0005622,GO:0005623,GO:0005634,GO:0005694,GO:0005730,GO:0005731,GO:0005737,GO:0006139,GO:0006325,GO:0006351,GO:0006355,GO:0006357,GO:0006464,GO:0006476,GO:0006725,GO:0006807,GO:0006996,GO:0007049,GO:0007072,GO:0008150,GO:0008152,GO:0009058,GO:0009059,GO:0009303,GO:0009889,GO:0009890,GO:0009891,GO:0009892,GO:0009893,GO:0009987,GO:0010458,GO:0010467,GO:0010468,GO:0010556,GO:0010557,GO:0010558,GO:0010604,GO:0010605,GO:0010628,GO:0010629,GO:0016043,GO:0016070,GO:0016072,GO:0016569,GO:0016570,GO:0016575,GO:0016787,GO:0016810,GO:0016811,GO:0017136,GO:0018130,GO:0019213,GO:0019219,GO:0019222,GO:0019438,GO:0019538,GO:0022402,GO:0030874,GO:0031323,GO:0031324,GO:0031325,GO:0031326,GO:0031327,GO:0031328,GO:0031974,GO:0031981,GO:0032774,GO:0032991,GO:0033558,GO:0034641,GO:0034645,GO:0034654,GO:0034660,GO:0034739,GO:0034979,GO:0035601,GO:0036211,GO:0043170,GO:0043226,GO:0043227,GO:0043228,GO:0043229,GO:0043231,GO:0043232,GO:0043233,GO:0043412,GO:0044237,GO:0044238,GO:0044249,GO:0044260,GO:0044267,GO:0044271,GO:0044422,GO:0044424,GO:0044427,GO:0044428,GO:0044446,GO:0044452,GO:0044454,GO:0044464,GO:0044770,GO:0044772,GO:0045892,GO:0045893,GO:0045934,GO:0045935,GO:0046483,GO:0048285,GO:0048518,GO:0048519,GO:0048522,GO:0048523,GO:0050789,GO:0050794,GO:0051171,GO:0051172,GO:0051173,GO:0051252,GO:0051253,GO:0051254,GO:0051276,GO:0060255,GO:0065007,GO:0070013,GO:0070932,GO:0070933,GO:0071704,GO:0071840,GO:0080090,GO:0090304,GO:0097372,GO:0097659,GO:0098732,GO:0098781,GO:0140014,GO:0140096,GO:1901360,GO:1901362,GO:1901564,GO:1901576,GO:1902679,GO:1902680,GO:1903047,GO:1903506,GO:1903507,GO:1903508,GO:2000112,GO:2000113,GO:2001141              ko:K11417                                       ko00000,ko01000,ko03036                         Mammalia        35KV9@314146,39C9C@33154,3BEI0@33208,3CRZW@33213,3J3E0@40674,487S1@7711,490HQ@7742,4MGQE@9443,COG0846@1,KOG1905@2759    NA|NA|NA        K       Sirtuin 7
G7N4V4|f_P006_S1_k141_107219|-|92|1.104e-17|2|2959|3242|3242[3242]:3066[3066]:177[177]|3009[3009]:2959[2959]:51[51]     9544.ENSMMUP00000038925 4.9e-21 106.7   Vertebrata                                                                                                      Metazoa 2F9WG@1,2TB2G@2759,398VA@33154,3CDE1@33208,3DUR8@33213,48M5R@7711,49I18@7742    NA|NA|NA        S       Pfam:GVQW

Xianghui Liu

unread,
Dec 16, 2020, 8:48:06 PM12/16/20
to NGLess
Sorry. it is because the end of annotation file shall be deleted.
# 60695 queries scanned
# Total time (seconds): 269019.287695
# Rate: 0.23 q/s

Xianghui Liu

unread,
Dec 16, 2020, 9:20:45 PM12/16/20
to NGLess
I did not use parallel. The collect() call at line 16 could not be executed as there are partial results missing.
What does that mean?
That gene does not have annotation?   line 16 of which file?
Xianghui

(base) [xianghui@ln-0001 loreal]$ ngless loreal-demo.ngl
NGLess v1.1.0 (C) NGLess authors

When publishing results from this script, please cite the following references:

- Coelho, L.P., Alves, R., Monteiro, P., Huerta-Cepas, J., Freitas, A.T., and Bork, P.,
NG-meta-profiler: fast processing of metagenomes using NGLess, a domain-specific language. in
Microbiome 7:84 (2019). DOI: http://doi.org/10.1186/s40168-019-0684-8

-

- Li, H., 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv
preprint arXiv:1303.3997.

- MOCAT2: a metagenomic assembly, annotation and profiling framework. Kultima JR, Coelho LP, Forslund
K, Huerta-Cepas J, Li S, Driessen M, et al. (2016) Bioinformatics (2016)
doi:10.1093/bioinformatics/btw183

- MOCAT: A Metagenomics Assembly and Gene Prediction Toolkit. Kultima JR, Sunagawa S, Li J, Chen W, Chen
H, Mende DR, et al. (2012) PLoS ONE 7(10): e47656. doi:10.1371/journal.pone.0047656


[Thu 17-12-2020 10:00]: Script OK. Starting interpretation...
[Thu 17-12-2020 10:00] Line 7: lock1: Obtained lock file: 'ngless-locks/f5eb6f97/SAMEA2621033.sampled.lock'
[Thu 17-12-2020 10:00] Line 7: Writing stats to 'ngless-stats/f5eb6f97/SAMEA2621033.sampled'
[Thu 17-12-2020 10:00] Line 8: load_mocat_sample found single-end sample 'SAMEA2621033.sampled/ERR594391_1.fastq.gz.short.fq.gz'
[Thu 17-12-2020 10:00] Line 8: load_mocat_sample found single-end sample 'SAMEA2621033.sampled/ERR594391_2.fastq.gz.short.fq.gz'
[Thu 17-12-2020 10:00] Line 14: Starting mapping to ./Modules/loreal.ngm/1.0/data/loreal.fa
[Thu 17-12-2020 10:01] Line 14: Success
[Thu 17-12-2020 10:01] Line 14: Mapped readset stats (./Modules/loreal.ngm/1.0/data/loreal.fa):
[Thu 17-12-2020 10:01] Line 14: Total reads: 368045
[Thu 17-12-2020 10:01] Line 14: Total reads aligned: 230 [0.06%]
[Thu 17-12-2020 10:01] Line 14: Total reads Unique map: 80 [0.02%]
[Thu 17-12-2020 10:01] Line 14: Total reads Non-Unique map: 150 [0.04%]
[Thu 17-12-2020 10:01] Line 16: Annotate with reference: "loreal"
[Thu 17-12-2020 10:01] Line 16: Loading map file ./Modules/loreal.ngm/1.0/data/loreal.functional.map
[Thu 17-12-2020 10:01] Line 16: Mapped readset stats (select.lno15):
[Thu 17-12-2020 10:01] Line 16: Total reads: 79
[Thu 17-12-2020 10:01] Line 16: Total reads aligned: 79 [100.00%]
[Thu 17-12-2020 10:01] Line 16: Total reads Unique map: 79 [100.00%]
[Thu 17-12-2020 10:01] Line 16: Total reads Non-Unique map: 0 [0.00%]
[Thu 17-12-2020 10:01]: Interpretation finished.
[Thu 17-12-2020 10:01]: The collect() call at line 16 could not be executed as there are partial results missing.
When you use the parallel module and the collect() function,
you typically need to run ngless *multiple times* (once per sample)!


Luis Pedro Coelho

unread,
Dec 17, 2020, 4:20:13 AM12/17/20
to NGLess List

I did not use parallel. The collect() call at line 16 could not be executed as there are partial results missing.
What does that mean?
That gene does not have annotation?   line 16 of which file?

This refers to line 16 of the `loreal-demo.ngl` file, which should invoke the `collect()` function.

Note that this means that everything is working fine, you just need to run `ngless loreal-demo.ngl` multiple times.

HTH,
Luis

Xianghui Liu

unread,
Dec 17, 2020, 6:48:23 AM12/17/20
to NGLess
Thank you. I ran multiple times. Seems problem got solved. 
I have one more question, if I do not have an annotation file, how shall I count the genes that have reads mapped? 

Luis Pedro Coelho

unread,
Dec 17, 2020, 6:54:00 AM12/17/20
to NGLess List
Use `features=['seqname']`.

We may introduce a better API in the near future (ie, a better parameter name), but this is what works for now (and we will keep it working for a few years at least, even if something better is introduced)

HTH,
Luis

Luis Pedro Coelho | Fudan University | http://luispedro.org

Xianghui Liu

unread,
Dec 18, 2020, 12:18:57 AM12/18/20
to NGLess
Thank you. That's great. I asked this question because I want to check against multiple databases and some of the database does not have annotation file.  I hope to filter out those genes with reads mapped and do the annotation using eggnog-mapper then use these annotation to do the count again...
Is there any good way for me to do this? 
or Can I do the summary of count of KEGG kos by myself? Did you use any normalisation or scaling at this step?
 
Xianghui

Luis Pedro Coelho

unread,
Dec 21, 2020, 4:32:50 AM12/21/20
to NGLess List
You can feed the output of eggnog-mapper as an annotation file to NGLess (maybe you need to filter out comments, ie, lines beginning with #, but other than that, it should be fine).

HTH
Luis
--
Luis Pedro Coelho | Fudan University | http://luispedro.org


Xianghui Liu

unread,
Dec 28, 2020, 10:09:20 PM12/28/20
to NGLess
Thank you Luis.
I still need to confirm with you several things before starting a large job.
1. If I run the job without annotation file and summarise the result based on sequence, later I filter only those sequences with reads mapped, and do annotation with eggnog with those genes,  ( this can save time to annotate the huge database with eggnog),   can I combine the  'summarise the result based on sequence'  with 'annotation with eggnog with those genes' to generate the to new summary table based on KOs.
2. I am running the job in a HPC and using PBS qsub system, I have multiple samples, I managed to create a sample file like tara.demo.sampled, however, I do not know how to work with your 'parallel'. Could you kindly give an example sh file with ngl file.
3. It seems that your ngless each time process one sample in the igc.demo.short list, right?  Since there are 3 samples in the list, I shall run ngless 3 times. So this is my qsub sh file.  If  the result file, e.g. igc.profiles.txt, is already existed, will it try to override? or the program will try to check ?

#!/bin/bash
### Specify the job Name
#PBS -N sra
### Specify the project code
#PBS -P all
### Specify the queue
#PBS -q std
### The directive below merges standard output & error
#PBS -j oe
### Specify the requested cpu resource
#PBS -l select=1:ncpus=48
### Specify the standard output file / folder
#PBS -o sra.logs
### Change to current working directory
cd $PBS_O_WORKDIR
pwd
which ngless
ngless gut-demo.ngl
ngless gut-demo.ngl
ngless gut-demo.ngl

Luis Pedro Coelho

unread,
Dec 29, 2020, 1:40:46 AM12/29/20
to NGLess List

1. If I run the job without annotation file and summarise the result based on sequence, later I filter only those sequences with reads mapped, and do annotation with eggnog with those genes,  ( this can save time to annotate the huge database with eggnog),   can I combine the  'summarise the result based on sequence'  with 'annotation with eggnog with those genes' to generate the to new summary table based on KOs.

I am not 100% sure of what you mean. If you save the mapped read files (BAMs), then it will be very fast to later use the count() in different ways. 

2. I am running the job in a HPC and using PBS qsub system, I have multiple samples, I managed to create a sample file like tara.demo.sampled, however, I do not know how to work with your 'parallel'. Could you kindly give an example sh file with ngl file.
3. It seems that your ngless each time process one sample in the igc.demo.short list, right?  Since there are 3 samples in the list, I shall run ngless 3 times. So this is my qsub sh file.  If  the result file, e.g. igc.profiles.txt, is already existed, will it try to override? or the program will try to check ?

If the output file already exists, it will be over-written (a warning message is printed, but this is less relevant in batch mode).

#!/bin/bash
### Specify the job Name
#PBS -N sra
### Specify the project code
#PBS -P all
### Specify the queue
#PBS -q std
### The directive below merges standard output & error
#PBS -j oe
### Specify the requested cpu resource
#PBS -l select=1:ncpus=48
### Specify the standard output file / folder
#PBS -o sra.logs
### Change to current working directory
cd $PBS_O_WORKDIR
pwd
which ngless
ngless gut-demo.ngl
ngless gut-demo.ngl
ngless gut-demo.ngl

This will work, but is sub-optimal. The best is to have each job run ngless a single time and then submit multiple jobs (or a job array, but I am not sure how to do that with PBS). You can also use the "-j auto" option to automatically use the right number of CPUs.

HTH,
Luis

Reply all
Reply to author
Forward
0 new messages