Populations interface to Eigensoft smartPCA

2,294 views
Skip to first unread message

GM

unread,
Sep 22, 2015, 3:18:48 PM9/22/15
to Stacks
Hi,

I'd like to use the polymorphism data from my populations to perform principal component analysis as per the Eigensoft smartPCA module. The software requires files of a certain format, and so, I was trying to find out the right kind of output files of the Stacks populations module to start with. smartPCA requires the following files as input (http://genepath.med.harvard.edu/~reich/InputFileFormats.htm):

1) genotype file (.geno) in EIGENSTRAT format

The genotype file contains 1 line per SNP. Each line contains 1 character per individual:

  0 means zero copies of reference allele.

  1 means one copy of reference allele.

  2 means two copies of reference allele.

  9 means missing data.


I'm thinking of using the STRUCTURE format file produced by Populations to make this file. Is there an easy way to do this? Also, just wanted to confirm that 0/1/2/3/4 stand for noChange/a/c/g/t in the STRUCTURE file. I also found that despite using the --write_single_snps option, I still get multiple 

2) SNP file in EIGENSTRAT format

SNP-ID -- ChromosomeNumber -- GeneticPosition -- Physical position -- (OPTIONAL) Reference base -- (OPTIONAL) Polymorphic base

I can think of retrieving most of this information (except genetic position, which will be set at 0.0) from the sumstats.tsv file. Only those SNP-IDs that are in the structure file would be pulled out. 

3) IND file

SampleID -- Gender -- Label
 
which would be easy to generate since Gender is UNKNOWN and Label is Ignore in my situation 

Also, is there an easier way to generate these input files for PCA from Stacks output?

Thanks much!

GM

unread,
Sep 22, 2015, 3:23:29 PM9/22/15
to Stacks

I also found that despite using the --write_single_snps option, I still get multiple snps from the same locus in the structure output

eg: 24_10   24_20   24_21   24_40   24_61   24_65 180_24  180_29  180_49  180_77  180_87

my command line is this:-e hindIII --write_single_snp -t 30 --merge_sites -r 0.7 -p 7 -m 5 -a 5 -f bonferroni_gen --p_value_cutoff 0.1 --lnl_lim -10 --fasta_strict --structure --phylip --verbose

Not sure if I'm doing something wrong.

Thanks!

Ryan Waples

unread,
Sep 22, 2015, 4:12:43 PM9/22/15
to stacks...@googlegroups.com
GM - 

EIGENSTRAT also takes .map / .ped files as input which I think you can get with the '--plink' flag to the 'populations' program.

I've found that EIGENSTRAT  does not like the chr column in the map file to be zero - so change to a positive int.

Also make sure you look at the numchrom flag in the parameter file for EIGENSTRAT - it may expect human chromosomes

Hope this helps,
-Ryan 


--
Stacks website: http://catchenlab.life.illinois.edu/stacks/
---
You received this message because you are subscribed to the Google Groups "Stacks" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stacks-users...@googlegroups.com.
Visit this group at http://groups.google.com/group/stacks-users.
For more options, visit https://groups.google.com/d/optout.

Julian Catchen

unread,
Sep 22, 2015, 4:30:33 PM9/22/15
to stacks...@googlegroups.com, gaura...@gmail.com
What version of Stacks are you running? Make sure you are using the latest.

I just double-checked the --write_single_snp flag and it is working
correctly. When you enable that flag, the software prunes the other
sites out of memory, so it should not be possible to export the other
sites in any file output as they are no longer in the data structures to
be output.

julian

GM wrote:
>
> I also found that despite using the --write_single_snps option, I still
> get multiple snps from the same locus in the structure output
>
> eg: 24_10 24_20 24_21 24_40 24_61 24_65 180_24 180_29 180_49 180_77 180_87
>
> my command line is this:-e hindIII --write_single_snp -t 30
> --merge_sites -r 0.7 -p 7 -m 5 -a 5 -f bonferroni_gen --p_value_cutoff
> 0.1 --lnl_lim -10 --fasta_strict --structure --phylip --verbose
>
> Not sure if I'm doing something wrong.
>
> Thanks!
>
>
> On Tuesday, September 22, 2015 at 3:18:48 PM UTC-4, GM wrote:
>
> Hi,
>
> I'd like to use the polymorphism data from my populations to perform
> principal component analysis as per the Eigensoft smartPCA module.
> The software requires files of a certain format, and so, I was
> trying to find out the right kind of output files of the Stacks
> populations module to start with. smartPCA requires the following
> files as input
> (http://genepath.med.harvard.edu/~reich/InputFileFormats.htm
>
> 1) genotype file (.geno) in EIGENSTRAT format
>
> The genotype file contains 1 line per SNP. Each line contains 1
> character per individual:
>
> 0 means zero copies of reference allele.
>
> 1 means one copy of reference allele.
>
> 2 means two copies of reference allele.
>
> 9 means missing data.
>
>
> I'm thinking of using the STRUCTURE format file produced by
> Populations to make this file. Is there an easy way to do this?
> Also,*/just wanted to confirm that 0/1/2/3/4 stand for
> noChange/a/c/g/t/* in the STRUCTURE file. I also found that despite

GM

unread,
Sep 22, 2015, 5:15:37 PM9/22/15
to Stacks, gaura...@gmail.com, jcat...@illinois.edu
Hi Julian,

I was using Stacks 1.29. I'll check it again with 1.35.

Thanks
GM

GM

unread,
Sep 23, 2015, 5:27:33 PM9/23/15
to Stacks
Hi Ryan,

I'm using a non-model species with 11695 contigs in the map file. The convertf program will give me either of the two errors:
fatalx:
bad chrom: chr10000
Aborted (core dumped)

OR

warning (mapfile): bad chrom: 10000     142472_70       1       12264
Segmentation fault (core dumped)

I have tried changing the original names (which were named like this: SH_contig_10000) to chr10000 or simply 10000, but that doesn't help. I have also tried, as you suggested, changing the 0 in column 3 to a 1 with no success. I'm using EIGENSOFT 6.0.1 on map/ped files created by populations from Stacks v1.35

Parameter file:
parameter file: par.PED.EIGENSTRAT
genotypename: batch_1.plink.ped
snpname: batch_1.plink_v2.map.mod.map
indivname: batch_1.plink.ped
outputformat: EIGENSTRAT
genotypeoutname: batch_1.geno
snpoutname: batch_1.snp
indivoutname: batch1.ind
familynames: NO
numchrom: 11695

The map file is like this:

10000   142472_70       1       12264
10000   2_38    1       29277
10004   31_28   1       4332
10004   33_14   1       8451
10004   142505_50       1       10324
10004   21_35   1       17297
10004   26_48   1       26935
10005   38_68   1       29013
10005   41_27   1       57494
10006   45_22   1       18081
10009   49_37   1       9146

Ryan Waples

unread,
Sep 23, 2015, 5:48:10 PM9/23/15
to stacks...@googlegroups.com
I have got Smartpca within EIGENSOFT (6.0.1) to work without converting with convertf - it will take map/ped directly.  I have madified the output map/ped that stacks outputs.

EIGENSOFT and PLINK don't with thousands of chromosomes/contigs well - so I would suggest removing that info from the map file - replace the first column with all '1' for example.  I do have some chromosome info so I have chromosomes 1-37 for assigned loci and I used for '40' for unassigned loci.  I dont think smartpca likes a zero in the frist column of the map file.

example map file: 

ped file - I have the phenotype (col 6) set to missing (-9) and smartpca complains about it - but it works.  

example ped file:

example parfile:

-Ryan

GM

unread,
Oct 4, 2015, 4:58:29 PM10/4/15
to Stacks
Hi Ryan,

Thanks for that reply. smartPCA works with the map/ped files, and as you said, I had to modify the map file.

Thanks
GM
Reply all
Reply to author
Forward
0 new messages