Re: preparing input data for Structure using Plink

Julie Hebert

Jan 15, 2013, 10:28:57 AM
to structure...@googlegroups.com
What are your parameters when you run structure? Are you sure you have the correct settings for which labels you're including?

On Monday, January 14, 2013 12:56:01 AM UTC-5, Aydar wrote:
Hi. Sorry for a noob question, but I am stuck and looking for help.

I am trying to create a Structure input file. My data are in PLINK ped/map format. For testing purposes I've subsetted a short list of SNPs, and then created a Structure input file using PLINK's "--recode-structure" command.
Plink:
"$ plink --file data --recode-structure --out structure_input"

However, when I try to analyze the data with Structure I get an error message:
"WARNING! Probable error in the input file.
Individual 538, locus 1: encountered the following data "GRC110543" when expecting an integer"
I suspect this is because PLINK generates a table with the wrong formatting, but I am not sure.

Structure input table generated by Plink looks like this:

"rs3094315 rs12184325 rs3131969 rs12562034 rs2518996 rs12132517 rs11240777 rs11579015 rs12134754 rs11260595 rs6671356 rs1320571
-1 1539 77 14266 24086 6267 158 238000 88 2051 928 80405
GRC10041151 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
GRC10041187 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 1 2
GRC10041198 3 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2
GRC10041203 4 2 2 2 2 2 2 1 1 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2
GRC10041306 5 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2
GRC10041153 6 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2
GRC10041158 7 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
..."
I'm not sure the web interface will keep the formatting, but the first row is SNP tags and the second row is numbers. SNP data start from the 3rd row, and the sample tag is the first column.

Has anyone encountered a similar problem?
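
A quick way to confirm the layout described above is to count the columns programmatically. The following is a minimal Python sketch, assuming the file has one row of marker names, one row of numbers, and then one row per sample with two label columns (the sample tag and an integer) ahead of the genotypes. The function name `summarize_layout` and the three-marker sample are made up for illustration.

```python
def summarize_layout(lines):
    """Return (n_markers, n_individuals, distinct genotype-column counts)."""
    rows = [ln.split() for ln in lines if ln.strip()]
    markers = rows[0]        # row 1: SNP tags
    # rows[1] is the second row of numbers, one per marker; it is skipped here
    individuals = rows[2:]   # remaining rows: sample tag, integer, genotypes
    # Assumption: two label columns precede the genotype columns.
    widths = sorted({len(r) - 2 for r in individuals})
    return len(markers), len(individuals), widths


sample = """\
rs3094315 rs12184325 rs3131969
-1 1539 77
GRC10041151 1 2 2 2 2 2 2
GRC10041187 2 2 2 2 2 2 1
""".splitlines()

print(summarize_layout(sample))
```

If the genotype width comes out as exactly twice the marker count, each sample occupies a single row with two alleles per marker, and the Structure settings would have to match that.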

Aydar

Jan 15, 2013, 7:33:54 PM
to structure...@googlegroups.com
I partially solved the problem by using PGDSpider2 version 2.0.3.0. It converted a small dataset from .ped/.map to Structure input without problems, and Structure ran on the PGDSpider2 output without errors, so I think the problem is with PLINK rather than with my parameters. However, PGDSpider2 still chokes on my main dataset ("-not enought memory"). Any advice on that part? I tried to increase the memory by raising the Xmx parameter up to 10 GB, but it didn't help.

The problems with the PLINK output are the following:
1. SNP tags start in the first column, which causes an error because the Structure parser expects an integer instead of a sample tag in the first column
2. A strange second row with unintelligible numbers
3. Wrong recoding of nucleotides (1/2??) instead of 1234
4. Missing values are assigned 0s instead of -9s

So I am not sure whether I am using the "--recode-structure" command incorrectly and it requires some additional parameters, or whether it is some sort of bug. No documentation on this command seems to be available, though.
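
Point 4 in the list above can be handled without a text editor. Here is a minimal Python sketch, assuming each sample row starts with two label columns (sample tag and an integer) that must not be touched; the function name `recode_missing` and the example row are hypothetical.

```python
def recode_missing(line, old="0", new="-9", n_label_cols=2):
    """Replace the missing-data code in genotype columns only."""
    fields = line.split()
    labels = fields[:n_label_cols]   # sample tag and integer column, left as-is
    genos = fields[n_label_cols:]    # allele codes
    genos = [new if g == old else g for g in genos]
    return " ".join(labels + genos)


print(recode_missing("GRC10041151 1 2 2 0 0 1 2"))
# GRC10041151 1 2 2 -9 -9 1 2
```

Because the function works one line at a time, it can be applied while streaming a large file. It may also be possible to avoid rewriting the file at all by setting Structure's MISSING parameter to 0 in mainparams, but that is worth checking against the Structure documentation.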

Aydar

Jan 15, 2013, 9:47:30 PM
to structure...@googlegroups.com
By the way, does Structure support binary encoding for SNPs, or does it have to be 1-4 encoding?

Vinod Kumar

Jan 16, 2013, 12:15:24 AM
to structure...@googlegroups.com
Open the PLINK-converted file in txt format and press the BACKSPACE key; it might solve your problem. I am also unsure about the binary format. It might hide genetic information carried by the SNPs, so try the 1,2,3,4 format.
Thanks,


--
You received this message because you are subscribed to the Google Groups "structure-software" group.
To view this discussion on the web visit https://groups.google.com/d/msg/structure-software/-/Yii1uGiuTQoJ.

To post to this group, send email to structure...@googlegroups.com.
To unsubscribe from this group, send email to structure-softw...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/structure-software?hl=en.

Aydar

Jan 16, 2013, 4:36:10 AM
to structure...@googlegroups.com
My dataset is too large to edit in a text editor. Any tips on how to reduce it?

Vinod Kumar

Jan 16, 2013, 7:26:00 AM
to structure...@googlegroups.com
Hi Aydar,
I've not used such a huge dataset, but we have run two datasets, the complete dataset (6000 SNPs) and a smaller dataset (1000 SNPs), with SNPs removed at equal distances throughout the genome, and the results were exactly the same in both runs. If it is a self-pollinated plant then you can reduce the number to some extent, but in cross-pollinated plant species it is hard to reduce a high number of SNPs because LD decays faster in such species. Some other methods have also been discussed in this group on this topic; you can follow them.
Thanks,
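
The thinning described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it keeps every `step`-th marker by column index (not by physical distance), and the function name `thin_fields` and the example rows are hypothetical.

```python
def thin_fields(fields, step, n_label_cols=0, per_marker=1):
    """Keep every `step`-th marker from one whitespace-split row."""
    labels = fields[:n_label_cols]   # columns that are not marker data
    data = fields[n_label_cols:]
    kept = []
    block = per_marker * step        # columns spanned by `step` markers
    for start in range(0, len(data), block):
        kept.extend(data[start:start + per_marker])
    return labels + kept


marker_row = "rs1 rs2 rs3 rs4 rs5 rs6".split()
sample_row = "GRC1 1 1 2 2 2 1 1 2 2 1 2 2 2".split()

print(thin_fields(marker_row, step=3))
print(thin_fields(sample_row, step=3, n_label_cols=2, per_marker=2))
```

The same `step` has to be used for the marker-name row and for every sample row so the columns stay aligned. If the file carries a row of inter-marker distances, those values would no longer be correct after thinning and would need to be recomputed or dropped.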

Vikram Chhatre

Jan 17, 2013, 12:42:23 AM
to structure-software
A powerful text editor such as 'vim' can easily handle a data set that large.  Give it a shot.

Have you solved your plink conversion problem?

V

Aydar

Jan 18, 2013, 8:23:28 AM
to structure...@googlegroups.com
Hi! Not yet. Now I am trying to convert the data from PLINK format to Structure using PGDSpider. It takes quite a long time though. Would you mind a few questions about vim? How many resources will it require?

Vikram Chhatre

Jan 18, 2013, 10:28:30 AM
to structure...@googlegroups.com
If the PLINK conversion isn't working, use your raw SNP data. Vim doesn't need many resources.

---------------------------------------
Vikram Chhatre
Graduate Program in Genetics
Texas A&M University

This message was sent from a cellular device. It may contain typos and other errors.

IDBarrero

Jan 18, 2013, 11:03:10 AM
to structure...@googlegroups.com
Try PowerMarker.

There is an option under the Data tab: click batch export, then select the Pritchard format.

It does the conversion for you and it is pretty simple.

Hope that helps.

Aydar

Jan 21, 2013, 8:44:34 PM
to structure...@googlegroups.com
Unfortunately I don't have the raw data. The data were provided in PLINK binary format. Are there other software solutions available to handle this task? The PGDSpider and PLINK options are not really helping.


Vikram Chhatre

Jan 21, 2013, 8:54:56 PM
to structure...@googlegroups.com
Not familiar with Plink format. Can you post a short example?

V

Aydar

Jan 21, 2013, 8:55:56 PM
to structure...@googlegroups.com
Is it for mac?

Vikram Chhatre

Jan 21, 2013, 8:57:00 PM
to structure...@googlegroups.com
Shouldn't matter if it's plain text.


Aydar

Jan 21, 2013, 9:00:53 PM
to structure...@googlegroups.com
This is a good example of the PLINK ped/map format: http://www.gwaspi.org/?page_id=145  The ped file contains the actual SNP information, whereas the map file contains the physical position and SNP tag.

Vikram Chhatre

Jan 21, 2013, 11:25:57 PM
to structure-software
So basically, the data you have are alphabetical genotypes (A, T, G, C).  All you need to do is convert them to numbers (1, 2, 3, 4).  Use a text editor to do this.  Make sure your letter-to-number replacement rules don't alter the individual IDs.

You could import the file into Excel and chop off the information not needed for Structure.  Then save it as a tab-delimited file and do the replacements in a text editor.  I wouldn't recommend doing the replacements in Excel.
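
For a file too large for an editor or Excel, the same letter-to-number replacement can be scripted. Below is a minimal Python sketch, assuming a standard PLINK .ped line (family ID, individual ID, paternal ID, maternal ID, sex, phenotype, then two alleles per SNP). The particular mapping A=1, C=2, G=3, T=4 and the choice of -9 for PLINK's missing allele "0" are assumptions, as is the function name `ped_line_to_structure`.

```python
# Assumed coding; any consistent integer coding should work equally well.
ALLELE_CODES = {"A": "1", "C": "2", "G": "3", "T": "4", "0": "-9"}


def ped_line_to_structure(line):
    """Convert one .ped line to 'ID allele allele ...' with integer alleles."""
    fields = line.split()
    sample_id = fields[1]    # individual ID; never passed through the mapping
    alleles = fields[6:]     # skip the six leading .ped columns
    # An unexpected allele symbol raises KeyError instead of passing silently.
    coded = [ALLELE_CODES[a] for a in alleles]
    return " ".join([sample_id] + coded)


print(ped_line_to_structure("FAM1 GRC10041151 0 0 1 -9 A A G T 0 0"))
# GRC10041151 1 1 3 4 -9 -9
```

Because only the columns after the sixth are recoded, letters inside the sample IDs are never altered, and processing one line at a time keeps memory use small even for a multi-gigabyte file.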

V



Aydar

Jan 22, 2013, 1:21:49 AM
to structure...@googlegroups.com
Excel is not really an option. The dataset is too big for it: 4 GB in text format.

Vikram Chhatre

Jan 22, 2013, 8:44:10 AM
to structure...@googlegroups.com
What are the dimensions of the matrix?

Aydar

Jan 24, 2013, 7:56:12 AM
to structure...@googlegroups.com
I have now solved the problem with tabs and spaces in the PGDSpider output file (I wrote a small Python script); however, Structure still does not work. I will provide more details on this later. Meanwhile, can you advise whether it is possible to load a file into Structure in which SNPs are coded as 1s and 2s (1 for absence)? Or should they be coded 1-4?

thanks!
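
The script itself was not posted, so the following is only a guess at what such a cleanup might look like: a minimal Python sketch that collapses any mix of tabs and repeated spaces into single spaces, one line at a time. The function name `normalize_whitespace` is hypothetical.

```python
def normalize_whitespace(line):
    """Collapse runs of tabs/spaces into single spaces and trim the ends."""
    return " ".join(line.split())


print(normalize_whitespace("GRC1\t1  2 \t2\n"))
# GRC1 1 2 2
```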

Dan Weeks

Jan 1, 2014, 8:29:09 PM
to structure...@googlegroups.com
The free Mega2 program will convert from PLINK format to Structure format.

Mayukh Mondal

May 7, 2014, 7:18:32 AM
to structure...@googlegroups.com
You can use the PLINK Structure output directly by changing some parameters. In the mainparams file for Structure, put #define ONEROWPERIND 1 and #define MAPDISTANCES 1. It worked for me.
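
To see how these two settings might sit alongside the rest of mainparams, here is a hedged Python sketch that reads a PLINK-generated file and prints candidate `#define` lines. Only ONEROWPERIND and MAPDISTANCES come from the post above. The values for MARKERNAMES, LABEL, POPDATA and MISSING are assumptions inferred from the file layout shown earlier in the thread (a marker-name row, a sample tag, an integer second column, 0 for missing), and the function name is made up.

```python
def mainparams_for_plink_structure(lines):
    """Suggest mainparams lines for a PLINK --recode-structure style file."""
    rows = [ln.split() for ln in lines if ln.strip()]
    settings = {
        "NUMINDS": len(rows) - 2,   # all rows except the two header rows
        "NUMLOCI": len(rows[0]),    # one name per marker in the first row
        "ONEROWPERIND": 1,          # from the post above
        "MAPDISTANCES": 1,          # from the post above
        "MARKERNAMES": 1,           # assumption: first row holds SNP tags
        "LABEL": 1,                 # assumption: first column is a sample tag
        "POPDATA": 1,               # assumption: second column is a population integer
        "MISSING": 0,               # assumption: missing alleles are written as 0
    }
    return [f"#define {name} {value}" for name, value in settings.items()]


example = [
    "rs3094315 rs12184325 rs3131969",
    "-1 1539 77",
    "GRC10041151 1 2 2 2 2 2 2",
    "GRC10041187 2 2 2 2 2 2 1",
]
print("\n".join(mainparams_for_plink_structure(example)))
```

The assumed values should be checked against the Structure documentation and the actual file before a run.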

Ramesh krishnan Ramasamy

May 7, 2014, 9:46:58 AM
to structure...@googlegroups.com
I don't have any experience with PLINK. But why not check individual 538, locus 1? If you have very large data, try R's is.numeric(), or subset the corresponding value and check it manually.
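
The same check can be sketched in Python rather than R. This is a minimal illustration, assuming two header rows (marker names and a row of numbers) and one leading sample-tag column; the function name `find_non_integers` and the example data are hypothetical.

```python
def find_non_integers(lines, n_header_rows=2, n_label_cols=1):
    """Return (row, column, token) for every data token that is not an integer."""
    problems = []
    for row_no, line in enumerate(lines, start=1):
        if row_no <= n_header_rows:
            continue                 # skip marker names and the numbers row
        for col_no, token in enumerate(line.split(), start=1):
            if col_no <= n_label_cols:
                continue             # skip the sample tag
            try:
                int(token)
            except ValueError:
                problems.append((row_no, col_no, token))
    return problems


example = [
    "rs1 rs2",
    "-1 1539",
    "GRC1 1 2 2 1 2",
    "GRC2 2 2 N 1 2",
]
print(find_non_integers(example))
# [(4, 4, 'N')]
```

Row and column numbers are 1-based and refer to the file, not to Structure's own individual and locus counters, so the two will only line up if Structure is reading the file with the layout the script assumes.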


--
Ramesh krishnan R
Senior Research Fellow
Molecular Biology Lab-1
Central Sericultural Research & Training Institute
Ministry of  Textiles, Govt. of India
Srirampura, Manadavadi road
Mysore - 570 008

Mobile: +91 8951696997, +91 7502533038
Email: rame...@gmail.com

Sara Niedbalski

Oct 20, 2015, 12:18:09 PM
to structure-software
Hi, I'm having a similar problem trying to convert VCF format to Structure format using PLINK 1.9. My dataset is also far too large to do anything manually, and PGDSpider is not an appropriate program for the size. When the same file is converted to Structure format using PLINK and PGDSpider, the outputs are not consistent, as mentioned by the original author of this post. I am not running the output through Structure, but rather through a self-written program that uses the same Structure input files. I cannot make good sense of the PLINK Structure output. Is there an easy way to convert extremely large VCF files to the standard Structure format? FYI, I'm using 1000 Genomes phase 3 data.

Thanks!

Vikram Chhatre

Oct 20, 2015, 12:21:05 PM
to structure-software
PGDSpider should work just fine for these purposes.  You do need to use the command-line version though.  I routinely convert VCF files with over 100K variants to many different formats, including regular Structure and fastStructure, without problems.  This is best done on a server with large blocks of memory.

V

Sara Niedbalski

Oct 20, 2015, 12:45:31 PM
to structure-software
I'm not sure what you mean by the command line version. Could you elaborate? I currently start the program using the following command line code:

java -Xmx10g -Xms512M -jar PGDSpider2.jar pdgspider2.exe

Vikram Chhatre

Oct 20, 2015, 1:54:58 PM
to structure-software
You will need access to a Unix server with substantial memory to process large data sets such as this one.  At times, I have had to set -Xmx25g for my data.

The command-line version is the one that runs in a terminal shell, unlike the GUI.
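
To make the distinction concrete, here is a hedged Python sketch that only builds the argument list for a command-line conversion; it does not run anything. The jar name `PGDSpider2-cli.jar` and the `-inputfile`, `-inputformat`, `-outputfile`, `-outputformat` and `-spid` options are recalled from the PGDSpider manual and should be treated as assumptions to verify against the installed version. The file names are hypothetical.

```python
def pgdspider_cli_command(infile, outfile, spid, max_heap="25g"):
    """Build an argument list for a VCF-to-Structure conversion (not executed)."""
    return [
        "java", f"-Xmx{max_heap}", "-Xms512m",
        "-jar", "PGDSpider2-cli.jar",      # assumed name of the command-line jar
        "-inputfile", infile,
        "-inputformat", "VCF",
        "-outputfile", outfile,
        "-outputformat", "STRUCTURE",
        "-spid", spid,                     # conversion settings file
    ]


cmd = pgdspider_cli_command("input.vcf", "output.str", "vcf_to_structure.spid")
print(" ".join(cmd))
```

Once the options are confirmed, a list like this can be passed to `subprocess.run(cmd, check=True)` on the server.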

