OSFS file format (SweeD)


Halie Rando

Oct 2, 2018, 5:45:15 PM
to OmegaPlus
Hello!
I was wondering if you could explain the format of the osfs output file. I generated it using:
./sweed/SweeD-P -name genomewide -input ./snps.vcf -threads 30 -osfs outputSFS.txt

The file outputSFS.txt looks like this (for each chromosome):
//2
0       7.198549e-02
1       6.833460e-02
2       5.946253e-02
3       8.946419e-02
4       1.325089e-01
5       2.123626e-01
6       7.750325e-02
7       4.845672e-02
8       5.116400e-02
9       1.317928e-01
10      5.696492e-02

I want to test whether the neutral SFS is the same across all autosomes, but I can't find what these values actually represent. Do 0-10 represent positions? What do the numbers after the tab represent?

Thank you very much!
All the best,
Halie Rando

Pavlos Pavlidis

Oct 8, 2018, 4:34:57 AM
to omeg...@googlegroups.com, halie...@gmail.com
Hi Halie,
The first column represents the class of the SFS. For example, 1: singletons, 2: doubletons, etc.
Class 0 contains the sites that received no mutation.

I hope this helps
all the best
pavlos
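
[Added note] To make the class/value layout concrete, here is a minimal Python sketch that reads the excerpt quoted above. It rests on assumptions the thread does not confirm: that a line starting with `//` opens a new block (what the label after it means is not stated), and that each following line is `class <whitespace> value`. The function name `parse_osfs` is hypothetical and not part of SweeD. The values in the quoted block happen to sum to 1, which suggests the second column is the proportion of sites in each class, but that is an inference from the numbers rather than something stated in the reply.

```python
def parse_osfs(text):
    """Parse -osfs style text into {block_label: {sfs_class: value}}.

    Assumption: a line starting with '//' opens a new block, and every
    other non-empty line holds an integer class and a float value.
    """
    blocks = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("//"):
            current = line[2:].strip()
            blocks[current] = {}
            continue
        if current is None:
            raise ValueError("data line before any '//' header: %r" % line)
        sfs_class, value = line.split()
        blocks[current][int(sfs_class)] = float(value)
    return blocks


# The block quoted in the first message.
example = """//2
0       7.198549e-02
1       6.833460e-02
2       5.946253e-02
3       8.946419e-02
4       1.325089e-01
5       2.123626e-01
6       7.750325e-02
7       4.845672e-02
8       5.116400e-02
9       1.317928e-01
10      5.696492e-02
"""

blocks = parse_osfs(example)
sfs = blocks["2"]
total = sum(sfs.values())
print("classes:", sorted(sfs))
print("sum of second column:", round(total, 6))
```

With one dictionary per block, comparing the spectra of different chromosomes comes down to comparing these per-class values.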




Halie Rando

Oct 15, 2018, 5:02:07 PM
to pavl...@gmail.com, omeg...@googlegroups.com
Dear Dr. Pavlidis,
Thank you very much for your response. This information is very helpful. I do have a follow-up question: what modifications should the user make for pooled data?

In my case, I'm working with low-coverage data pooled across ten individuals per population, and in some cases pooled across multiple populations. I formatted my input according to the SweepFinder format, with n = total number of reads at the site in the population and x = number of reads with the derived allele. As a result, when I compute the osfs, the integer values range from 0 to 750 (because of my filtering, 750 is the upper bound on the number of reads). But most sites will not have 750 reads; on average, we expect about 75x.

Do you recommend this input format for pooled data? Do you think the variability in overall coverage per site will impact SweeD's computation? Do I need to implement some kind of scaling?

Thank you very much for all of your help!
All the best,
Halie
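
[Added note] The scaling question can be illustrated with a small Python sketch. When n varies per site, the raw count x is not comparable across sites: sites with (x, n) of (5, 50), (10, 100) and (75, 750) all have a derived frequency of 0.1 but land in classes 5, 10 and 75. One generic way to put sites on a common scale is a hypergeometric down-projection to a fixed size m, shown below. This is an assumption offered for illustration only. Nothing in this thread says SweeD does this or that it is appropriate for pooled reads, which are not independently sampled chromosomes. The function names and the `sites` values are hypothetical.

```python
from math import comb


def project_site(x, n, m):
    """Probabilities of seeing k = 0..m derived observations when m of the
    n observations (x of them derived) are drawn without replacement."""
    denom = comb(n, m)
    return [comb(x, k) * comb(n - x, m - k) / denom for k in range(m + 1)]


def projected_sfs(sites, m):
    """Sum per-site projections into an SFS with classes 0..m.

    sites: iterable of (x, n) pairs. Sites with n < m cannot be projected
    down to m and are skipped. Returns (sfs, number_of_sites_used).
    """
    sfs = [0.0] * (m + 1)
    used = 0
    for x, n in sites:
        if n < m:
            continue
        for k, p in enumerate(project_site(x, n, m)):
            sfs[k] += p
        used += 1
    return sfs, used


# Hypothetical (x, n) records: derived-read count and total read count.
sites = [(5, 50), (10, 100), (75, 750), (0, 80), (3, 20)]
sfs, used = projected_sfs(sites, m=40)
print("sites used:", used)
print("expected count in class 0:", round(sfs[0], 4))
```

Each site that is kept contributes a total weight of 1 spread across classes 0..m, so the projected spectrum sums to the number of sites used. The choice of m trades resolution against the number of sites that have to be dropped.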