counting in ngless dist1/all1


ullo...@googlemail.com

Apr 18, 2021, 6:40:32 AM
to NGLess

Dear user-group,
I was wondering whether the statement in the docs:
"Generally, for obtaining gene abundances, distribution of multiple mappers is the best (using multiple={dist1}), while for functional annotations, you want to count them all (using multiple={all1}). This implies that the functional annotations will sum to a higher value than the number of reads. This may seem strange at first, but it is the intended behaviour."
implies that mapping your samples to the same reference should, in theory, result in more hits with all1 than with dist1. If so, I observed different behavior when mapping samples to the iMGMC mouse gene catalog:
```
imgmc_counts_new = count(imgmc_mapped,
                    features=['seqname'],
                    normalization={raw},
                    multiple={dist1})
collect(imgmc_counts_new,
        current=current,
        allneeded=samples,
        ofile=RESULTS</>'imgmc_geneabundance.dist1.raw.txt')
imgmc_counts_new = count(imgmc_mapped,
                    features=['seqname'],
                    normalization={raw},
                    multiple={all1})
collect(imgmc_counts_new,
        current=current,
        allneeded=samples,
        ofile=RESULTS</>'imgmc_geneabundance.all1.raw.txt')
```
```
777014121 Apr 17 23:16 preproc/imgmc_geneabundance.all1.raw.txt
1425878298 Apr 17 23:04 preproc/imgmc_geneabundance.dist1.raw.txt
```
Any idea/comment?
Best,
Ulrike

Luis Pedro Coelho

Apr 18, 2021, 11:22:13 PM
to Ulrike Löber, NGLess List
Yes, that is correct. You get more apparent hits with all1 than with dist1:

If your "sample" is a single read that maps to genes A & B (annotated to functions FA and FB), then the all1 is A=1/B=1 (or FA=1/FB=1), whilst the dist1 is A=.5/B=.5 (or FA=.5/FB=.5).

After much discussion (frankly, probably too much, as in practice all these measures are so heavily correlated across samples that it is unlikely to matter much), we concluded that if you are discussing gene counts, then A=.5/B=.5 is more meaningful, but for functional analyses, you probably want to use FA=1/FB=1.
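
The one-read example above can be sketched in a few lines of Python. This is only an illustration, not NGLess code: the function name `count_multiple` and the equal split among hits are assumptions taken from the A=.5/B=.5 example, and NGLess's actual dist1 computation may weight the split differently.

```python
from collections import defaultdict


def count_multiple(read_hits, mode):
    """Count features for reads that may hit several genes.

    read_hits: list of lists, one inner list of gene names per read.
    mode: "all1" gives every hit a full count of 1;
          "dist1" here splits one count equally among the hits
          (an assumption for illustration only).
    """
    counts = defaultdict(float)
    for genes in read_hits:
        if not genes:
            continue  # unmapped read contributes nothing
        weight = 1.0 if mode == "all1" else 1.0 / len(genes)
        for gene in genes:
            counts[gene] += weight
    return dict(counts)


# A "sample" made of a single read that maps to genes A and B.
reads = [["A", "B"]]
all1 = count_multiple(reads, "all1")
dist1 = count_multiple(reads, "dist1")

print("all1 ", all1, "total:", sum(all1.values()))
print("dist1", dist1, "total:", sum(dist1.values()))
```

With all1 the total exceeds the number of reads, while with dist1 the total stays equal to the number of reads.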

Best,
Luis

Luis Pedro Coelho | Fudan University | http://luispedro.org

Luis Pedro Coelho

Apr 19, 2021, 1:17:25 AM
to Ulrike Löber, NGLess List
No, that seems strange, indeed.


> 777014121 Apr 17 23:16 preproc/imgmc_geneabundance.all1.raw.txt
> 1425878298 Apr 17 23:04 preproc/imgmc_geneabundance.dist1.raw.txt

Sorry, I'm not 100% sure what those numbers are. They seem too large to be the number of lines or the total sum of the columns.

Best
Luis

Ulrike Löber

Apr 19, 2021, 1:29:15 AM
to Luis Pedro Coelho, NGLess List
No, it's the file size. But I would have expected to get the same number of lines, given that it is the same samples on the same reference, with only dist1 changed to all1.
Best,
Ulrike



Luis Pedro Coelho

Apr 19, 2021, 1:33:07 AM
to Ulrike Löber, NGLess List
Oh, if it's the file size, then it's a bit harder to evaluate. I expect that all1 will have the same number of lines (or more), but dist1 may result in larger file sizes because it takes more characters to write fractional numbers.
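
To make the point about file size concrete, here is a small Python sketch. It is not NGLess output: the helper `render_table`, the three gene names, and the use of Python's default number formatting are all assumptions for illustration. It renders the same three rows once with whole-number counts and once with fractional counts, then compares row count and byte size.

```python
def render_table(counts):
    """Render {gene: count} as tab-separated text, one gene per line."""
    lines = [f"{gene}\t{value}" for gene, value in sorted(counts.items())]
    return "\n".join(lines) + "\n"


genes = ["A", "B", "C"]  # hypothetical genes hit by one multi-mapping read
all1_counts = {gene: 1 for gene in genes}       # whole numbers
dist1_counts = {gene: 1 / 3 for gene in genes}  # fractional numbers

all1_text = render_table(all1_counts)
dist1_text = render_table(dist1_counts)

# Same number of rows in both tables ...
all1_rows = all1_text.count("\n")
dist1_rows = dist1_text.count("\n")

# ... but the fractional values need more characters per row.
all1_bytes = len(all1_text.encode("utf-8"))
dist1_bytes = len(dist1_text.encode("utf-8"))

print("rows: ", all1_rows, dist1_rows)
print("bytes:", all1_bytes, dist1_bytes)
```

For this reason, comparing the number of lines or the column sums of the two output files is more informative than comparing their sizes.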

Best
Luis



Ulrike Löber

Apr 22, 2021, 12:00:31 AM
to Luis Pedro Coelho, NGLess List
But in my case, it's the other way around. The dist1 file is much bigger than all1. Any explanation for that?
Best,