OSFS file format (SweeD)

Halie Rando

Oct 2, 2018, 5:45:15 PM

to OmegaPlus

Hello!

I was wondering if you could explain the format of the osfs output file. I generated it using:

./sweed/SweeD-P -name genomewide -input ./snps.vcf -threads 30 -osfs outputSFS.txt

The file outputSFS.txt looks like this (for each chromosome):

//2
0 7.198549e-02
1 6.833460e-02
2 5.946253e-02
3 8.946419e-02
4 1.325089e-01
5 2.123626e-01
6 7.750325e-02
7 4.845672e-02
8 5.116400e-02
9 1.317928e-01
10 5.696492e-02

I want to test whether the neutral SFS are the same across all autosomes, but I can't find what these data actually represent. Do 0-10 represent positions? What do the numbers after the tab represent?

Thank you very much!

All the best,

Halie Rando

Pavlos Pavlidis

Oct 8, 2018, 4:34:57 AM

to omeg...@googlegroups.com, halie...@gmail.com

Hi Halie,

The first column represents the class of the SFS. For example, 1 is singletons, 2 is doubletons, and so on.

Class 0 contains the sites that received no mutation.
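
For anyone who wants to inspect such a file programmatically, here is a minimal sketch of a parser. It is not part of SweeD; the function name `parse_osfs` is hypothetical, and it assumes only what the sample in this thread shows: each block starts with a `//` line (treated here as an opaque label, since its meaning is not stated in the thread) followed by whitespace-separated `class value` lines. The values in the posted sample sum to 1, which is consistent with the second column holding the proportion of sites in each class, but that reading is an inference from the numbers, not something confirmed above.

```python
def parse_osfs(text):
    """Parse osfs-style text into {block_label: {sfs_class: value}}.

    Assumes blocks begin with a '//' line and contain 'class value' rows,
    as in the sample posted in this thread.
    """
    blocks = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between rows
        if line.startswith("//"):
            current = line[2:].strip()  # opaque label; meaning not confirmed
            blocks[current] = {}
            continue
        if current is None:
            raise ValueError("data line found before any '//' header")
        sfs_class, value = line.split()
        blocks[current][int(sfs_class)] = float(value)
    return blocks


# The block posted in the original question.
sample = """//2
0 7.198549e-02
1 6.833460e-02
2 5.946253e-02
3 8.946419e-02
4 1.325089e-01
5 2.123626e-01
6 7.750325e-02
7 4.845672e-02
8 5.116400e-02
9 1.317928e-01
10 5.696492e-02
"""

sfs = parse_osfs(sample)
total = sum(sfs["2"].values())
print(f"classes: {sorted(sfs['2'])}")
print(f"sum of second column: {total:.6f}")
```

Applying the same function to the block for each chromosome would give one class-to-value mapping per block, which is a convenient starting point for comparing them.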

I hope this helps

all the best

pavlos

Halie Rando

Oct 15, 2018, 5:02:07 PM

to pavl...@gmail.com, omeg...@googlegroups.com

Dear Dr. Pavlidis,

Thank you very much for your response. This information is very helpful. I do have a follow-up question: what modifications should the user make for pooled data?

In my case, I'm working with low-coverage data pooled across ten individuals per population, and in some cases pooled across multiple populations. I formatted my input according to the SweepFinder format, with n = total number of reads at the site in the population and x = number of reads with the derived allele. As a result, when I compute the osfs, the integer values range from 0 to 750 (because of my filtering, 750 is the upper bound on the number of reads). But most sites will not have 750 reads -- on average, we expect about 75x.
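
To make the concern concrete, here is a small illustrative sketch. It is not SweeD's code, and it assumes, purely for illustration, that a site's class is its derived read count x regardless of n; whether SweeD tallies classes this way is not confirmed in this thread. The function name `tally_classes` and the example sites are hypothetical.

```python
from collections import Counter


def tally_classes(sites):
    """Tally (x, n) sites by derived count x and return {x: proportion of sites}.

    ASSUMPTION: the class is taken to be x alone, ignoring n. This is only
    an illustration of the question, not a description of SweeD.
    """
    sites = list(sites)
    for x, n in sites:
        if not 0 <= x <= n:
            raise ValueError(f"invalid site: x={x}, n={n}")
    counts = Counter(x for x, _ in sites)
    total = sum(counts.values())
    return {x: counts[x] / total for x in sorted(counts)}


# Hypothetical sites: (derived reads x, total reads n).
sites = [(0, 70), (3, 75), (3, 80), (12, 750)]
props = tally_classes(sites)
print(props)
```

Under this assumption the two sites with x = 3 fall into the same class even though their derived-allele frequencies differ (3/75 versus 3/80), and a single deep site with n = 750 stretches the range of classes. That is the kind of effect the variable per-site coverage could have.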

Do you recommend this input format for pooled data? Do you think the variability in overall coverage per site will impact SweeD's computation? Do I need to be implementing some kind of scaling?

Thank you very much for all of your help!

All the best,

Halie
