NGLess -> featureCounts?

Josh Sekela

unread,

Jul 15, 2022, 8:31:00 PM7/15/22

to NGLess

Hello,

I'm attempting to transition our lab's human gut shotgun metagenomics pipeline from MOCAT to NGLess. Currently, one thing we do is concatenate the .faa files for all samples resulting from the last of our 3 steps (-rtf, -a, -gp) and mine for enzymes of interest with blastp. We then use featureCounts to process a gff from Prodigal along with the sam resulting from post-MOCAT bowtie2 processing of paired reads, yielding a table of all genes with sample-specific barcodes. This featureCounts.csv is then subsetted by the headers in our resultant enzymes.faa - all to yield the counts of all genes which code for proteins of interest, for each sample. How should I approach this?

What I have tried/would like to do for each sample:

- use NGLess to write an orf.fna and out.sam

- Run the .fna through Prodigal to generate .gff and .faa

- Run the .gff and .sam through featureCounts

- mine the concatenated .faa files

- subset each sample's .csv to those headers contained in .faa

- concatenate featureCounts.csv

I appeared to successfully create an .fna and .sam with an .ngl script. Code is attached. NGLess ran fine but featureCounts threw an error with respect to the .sam's headers. Possibly a featureCounts formatting requirement? I will compare the format of the .sam files from our MOCAT pipeline to the ngless .sam to see if I can spot any differences. Any thoughts on what may be causing such a problem? The only mention of featureCounts and NGLess I've been able to find out there is this post by Luis:

https://metarabbit.wordpress.com/2018/05/04/ngless-timing-benchmarks/

I mainly want to run with featureCounts in order to compare results to the current pipeline. From there, if you know of a better way for me to get this result within NGLess or with a different tool, please let me know!

Thanks in advance for any help.

Josh

gut-demo_mod.ngl

Luis Pedro Coelho

unread,

Jul 18, 2022, 12:44:43 PM7/18/22

to Josh Sekela, NGLess List

Dear Josh,

I have no idea why this would be a problem, but you can replace this step

- Run the .gff and .sam through featureCounts

with an NGLess version

ngless "1.4"

input = samfile("my-mapped.sam") # BAM also works

counts = count(input,

gff_file="my-gff.gff",

features=["gene"]) # adapt as needed

write(counts, ofile="featuresCount.csv")

(Typed from memory, but see https://github.com/ngless-toolkit/ngless/blob/main/tests/count-gff/count-gtf.ngl)

Does this make sense?

HTH

Luis

Josh Sekela

unread,

Jul 19, 2022, 3:11:00 PM7/19/22

to NGLess

Luis, thanks so much for getting back to me! This is very helpful. I implemented your suggestion, although after generating my .gff through prodigal, an error results when running the script you provided:

[Mon 18-07-2022 23:59]: Script OK. Starting interpretation...
Exiting after fatal error while loading and running script
Data Error (the input data did not conform to NGLess' expectations)
Could not parse GFF line: "k141_708\tProdigal_v2.6.3\tCDS\t1451\t1576\t-0.7\t-\t0\tID=702_4;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.444;conf=55.61;score=0.98;cscore=-3.53;sscore=4.51;rscore=-0.37;uscore=1.76;tscore=1.96;"

Notably, the script proceeds to completion if I pass a truncated version of this .gff (only including the first 90 lines). The output does not exactly seem "correct", but the script does run. Any idea what may be causing this formatting issue?

I couldn't upload the full .gff to this post due to the filesize, so here is a publicly accessible link to the 2 files in my Google Drive.

https://drive.google.com/drive/folders/1niRzvSeHqZNjbn0ZYxsGjpwxNIjw1EYV?usp=sharing

Let me know if there is anything else I can do to help you help me. Thank you again!

Josh

test_short.gff

Luis Pedro Coelho

unread,

Jul 19, 2022, 7:27:40 PM7/19/22

to Josh Sekela, NGLess List

Hi again Josh,

I think we can make NGLess compute the same as featureCounts, but the defaults are not the same (the count function is very flexible http://ngless.embl.de/Functions.html#count)

Data Error (the input data did not conform to NGLess' expectations)
Could not parse GFF line: "k141_708\tProdigal_v2.6.3\tCDS\t1451\t1576\t-0.7\t-\t0\tID=702_4;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.444;conf=55.61;score=0.98;cscore=-3.53;sscore=4.51;rscore=-0.37;uscore=1.76;tscore=1.96;"

It seems that NGLess was (1) not very helpful with the error messages and (2) assumed that the score was never negative.

I fixed both these issues on git and will release a new version (1.4.2) with this fix by tomorrow

Relevant commit is:

https://github.com/ngless-toolkit/ngless/commit/38ad9c52664499c7c54307d603d183de4a4e5df5

Thanks for the report

Luis

Luis Pedro Coelho

unread,

Jul 19, 2022, 8:27:20 PM7/19/22

to NGLess List

If you use docker, you can run version 1.4.2 already:

docker run nglesstoolkit/ngless:1.4.2 ngless --version

Conda &c will be up soon.

HTH,

Luis

Luis Pedro Coelho | Fudan University | https://luispedro.org

https://orcid.org/0000-0002-9280-7885

--
You received this message because you are subscribed to the Google Groups "NGLess" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ngless+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ngless/3369c358-b875-4ff9-923f-853f8c1d8278%40www.fastmail.com.

Josh Sekela

unread,

Jul 21, 2022, 4:00:37 PM7/21/22

to NGLess

Much appreciated, I just updated our Conda install.

Reply all

Reply to author

Forward