SNP data set


Mitra

May 2, 2012, 3:07:10 PM
to structure-software
Hi,

I am using this software for the first time. I have 13 individuals
genotyped at 12 loci, and I need to analyse the SNPs in them. I am not
sure how to make the input file, so any help would be much
appreciated. I read through the manual, but I am confused about
whether we can enter data for just one SNP. I have more than one.

Vikram Chhatre

May 2, 2012, 3:12:42 PM
to structure...@googlegroups.com
Hi Mitra,

The Structure manual has details of the data format, and there is an
example dataset provided with the software. I am not sure what distinction
you're making between 12 loci and SNPs. It sounds like you have
genotypes of 13 individuals at 12 SNP loci. If not, can you tell us
more about your data set?

HTH
V

Mitra

May 5, 2012, 1:08:25 PM
to structure-software
Thanks for the help. Yes, exactly. I have more or less created the
data set for my SNP data, but I am not able to open any of the sample
data sets I downloaded along with the software.
I got stuck at another place, though. I know it might be very simple,
but my question is: suppose that for one individual, one allele has WT
at one point and the other has AT. How do I code for that? Also,
correct me if I am wrong: if both alleles have WT and WT, do I code it
as 13 & 33?

Another question: suppose that in one genotype, nucleotide variation
is seen at more than one position. How do I account for that?

Mitra

Vikram Chhatre

May 5, 2012, 1:56:21 PM
to structure...@googlegroups.com
Mitra -

It might be best if you posted the first few lines of your
alphabetical genotypes and the Structure data file corresponding to
those individuals here. That way, we might be able to help you more.

Moreover, can you explain what 'WT' genotype is in your data?

I am providing a very simple SNP example below:

4 individuals,
3 loci,
first locus AT polymorphism,
the next two loci are AG and AC respectively.
First two individuals are with population 1,
next two with population 2.
Data for every individual is on ONE LINE, i.e. ONEROWPERIND=1.
SNP alleles A, T, G, and C are coded respectively as 1, 2, 3 and 4.


Raw genotypes
-------------------------
IND1 1 A A A G A A
IND2 1 A T G G A C
IND3 2 T T A A C C
IND4 2 A T A G A C
-------------------------

Structure formatted data file
----------------------------------
IND1 1 1 1 1 3 1 1
IND2 1 1 2 3 3 1 4
IND3 2 2 2 1 1 4 4
IND4 2 1 2 1 3 1 4
----------------------------------
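For anyone who wants to automate the recoding shown above, here is a minimal Python sketch. It assumes whitespace-separated input with the individual label first, the population second, and then two allele columns per locus, exactly as in the example. The function name `to_structure_line` is just an illustrative choice and is not part of STRUCTURE.

```python
# Allele coding taken from the example above: A, T, G, C -> 1, 2, 3, 4.
ALLELE_CODES = {"A": 1, "T": 2, "G": 3, "C": 4}


def to_structure_line(raw_line):
    """Recode one raw genotype line, e.g. 'IND1 1 A A A G A A'."""
    label, pop, *alleles = raw_line.split()
    # A KeyError here means the line contains something other than A/T/G/C.
    coded = [str(ALLELE_CODES[allele.upper()]) for allele in alleles]
    return " ".join([label, pop] + coded)


raw_genotypes = [
    "IND1 1 A A A G A A",
    "IND2 1 A T G G A C",
    "IND3 2 T T A A C C",
    "IND4 2 A T A G A C",
]

structure_lines = [to_structure_line(line) for line in raw_genotypes]
for line in structure_lines:
    print(line)
```

The sketch deliberately fails on any character outside A, T, G and C, so uncoded symbols are caught rather than silently written to the data file.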

I hope that clears it up.

Vikram

Mitra

May 5, 2012, 4:50:20 PM
to structure-software
INDI 1: ACT GAA GAC ATA GCA
AYT GAA GAC ATA GCA

INDI 2: ACT GAA GAC ATA GTA
ACT GAA GAC ATA GTA

That is exactly how I coded them, sir.

I am actually looking at sequences from CodonCode Aligner. I am
copying a short fragment of the sequence from 2 genotypes here. They
show polymorphism as CY and as CT.
Y here is an ambiguity code, which means that the base could be a C or
a T. What I can't figure out is how to code sequences that show such
ambiguity codes, and what to do if at a single locus there are more
than 2 places where polymorphism is seen (like in the above case). How
would you code for that?

Really appreciate you helping me.

Thanks
Mitra

Vikram Chhatre

May 5, 2012, 5:02:57 PM
to structure...@googlegroups.com
Hi Mitra -

There is no allowance in STRUCTURE (afaik) for ambiguous information.
If you have time and resources, you could make two separate datasets
and see how different they are. For a large number of SNPs, this
is impractical since the number of different datasets will depend upon
the number of ambiguous sites you have. But considering the small
size of your data set, this should not be prohibitive.
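To illustrate the idea of building one dataset per possible reading of the ambiguous sites, here is a rough Python sketch. It only expands the standard two-base IUPAC ambiguity codes within a single sequence fragment; the function name `resolutions` is hypothetical, and this is a sketch under those assumptions rather than anything STRUCTURE itself provides.

```python
from itertools import product

# Standard IUPAC two-base ambiguity codes.
IUPAC_TWO_BASE = {
    "R": "AG",
    "Y": "CT",
    "S": "CG",
    "W": "AT",
    "K": "GT",
    "M": "AC",
}


def resolutions(seq):
    """Return every unambiguous sequence consistent with seq."""
    # Unambiguous bases map to themselves; ambiguous ones to both options.
    options = [IUPAC_TWO_BASE.get(base, base) for base in seq.upper()]
    return ["".join(choice) for choice in product(*options)]


variants = resolutions("AYT")
for variant in variants:
    print(variant)
```

Each two-base ambiguous site doubles the number of variants, so n such sites give 2**n possible datasets. That is why this approach is only practical for a small data set.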

If the sequence you posted is contiguous, the two polymorphisms are
tightly linked (physically), and thus you may not want to use both of
them as it violates the no-linkage assumption. STRUCTURE can handle
weak linkage between markers, especially when data from various
locations in the genome is available to compensate for such linkage.
This is not the case with your dataset.

Do you have 13 different sequences for these individuals, or are all
polymorphisms present in the same sequence?

HTH
V

Mitra

May 5, 2012, 5:23:55 PM
to structure-software
I have 12 loci and 13 individuals (genotypes), so that means 12
sequences for each individual.

Maybe I'll try making 2 datasets and see how it works out. And thanks
for telling me about the linkage issue.