Exporting Genotypes

1,056 views
Skip to first unread message

Amir

unread,
Dec 19, 2012, 4:16:24 PM12/19/12
to stacks...@googlegroups.com
Hi All,

My experiment is about individual samples of same population and I ran 'denovo_map' with '-s' to analyze them. I want to export existent genotypes to feed in linkage mapping programs for the first time and I don't have any experience in those programs too. When I run 'genotypes' function for any map type rather than 'GEN', and any output file type, i.e. rqtl, joinmap, ... I get nothing. The output looks like as follows:

Found 2 input file(s).
  Parsing ./stacks/batch_1.catalog.tags.tsv
  Parsing ./stacks/batch_1.catalog.snps.tsv
  Parsing ./stacks/batch_1.catalog.alleles.tsv
Identified parent IDs: 1 2 
  Parsing ./stacks/111_CCGGCC.matches.tsv
  Parsing ./stacks/112_CCGGCC.matches.tsv
Populating observed haplotypes for 2 samples, 33252 loci.
Writing 0 loci to R/QTL file, './stacks/batch_1.genotypes_3.loc'
Writing SQL markers file to './stacks/batch_1.markers.tsv'
Writing SQL genotypes file to './stacks/batch_1.genotypes_3.txt'
Writing 0 loci to observed haplotype file, './stacks/batch_1.haplotypes_3.tsv'

The only time I have something rather than zero there is when I use 'GEN' and put the output file type after -o as blank. Do you have any idea what is wrong here. I am not sure if I use this function in an incorrect way or it relates to this fact that I have individuals and not Parent and Progenies. Does Stacks generate outputs for MapMaker software too?
Thanks in advance for helping me out.

Bests,
Amir

Julian Catchen

unread,
Dec 21, 2012, 11:14:04 AM12/21/12
to stacks...@googlegroups.com
Hi Amir,

Stacks is not designed to generate mapping genotypes from population data and
linkage mapping programs are not able to generate linkage patterns from
population data either. Linkage mappers rely on identifying genotypes in the
progeny that originated from a specific parent in the cross. This is obviously
not possible in a population.

Stacks is designed to work this way:

denovo_map.pl -p/-r --> genotypes program
denovo_map.pl -s --> populations program

Best,

julian


On 12/19/12 4:16 PM, Amir wrote:
> My experiment is about individual samples of same population and I ran
> 'denovo_map' with '-s' to analyze them. I want to export existent genotypes to
> feed in linkage mapping programs for the first time and I don't have any
> experience in those programs too. When I run 'genotypes' function for any map
> type rather than 'GEN', and any output file type, i.e. rqtl, joinmap, ... I get
> nothing. The output looks like as follows:
>
> Found 2 input file(s).
> Parsing ./stacks/batch_1.catalog.tags.tsv
> Parsing ./stacks/batch_1.catalog.snps.tsv
> Parsing ./stacks/batch_1.catalog.alleles.tsv
> Identified parent IDs: 1 2
> Parsing ./stacks/111_CCGGCC.matches.tsv
> Parsing ./stacks/112_CCGGCC.matches.tsv
> Populating observed haplotypes for 2 samples, 33252 loci.
> */Writing 0 loci to R/QTL file, './stacks/batch_1.genotypes_3.loc'/*
> Writing SQL markers file to './stacks/batch_1.markers.tsv'
> Writing SQL genotypes file to './stacks/batch_1.genotypes_3.txt'
> */Writing 0 loci to observed haplotype file, './stacks/batch_1.haplotypes_3.tsv'/*

Amir Nikooienejad

unread,
Dec 21, 2012, 11:57:54 AM12/21/12
to stacks...@googlegroups.com
Hi Julian,

Thanks for your reply. Actually this was what I surmised. Ultimately we will run the experiment for parent and progenies but I wanted to know even in this case, if it is possible to generate genotype outputs in the format of MapMaker ( I know it is a bit old program) in addition to R/qtl and Joinmap. If it is not possible and I need to write a perl script for that, which output file of the stacks can be used to get those information. Is it 'batch1_.markers.tsv' or something else?

As you might know the input for MapMaker is a csv file whose rows are markers and each column is a sample and each cell is one of 'A/B/H/C/D/-'. Therefore, we need to have information about each sample regarding each marker whether it is Parent A, Parent B, Heterozygote, ... .

Thanks Again and Merry Christmas,
Bests,
Amir

--
For more options or to unsubscribe: http://groups.google.com/group/stacks-users
Stacks website: http://creskolab.uoregon.edu/stacks/

Julian Catchen

unread,
Dec 21, 2012, 12:07:45 PM12/21/12
to stacks...@googlegroups.com
Hi Amir,

The R/QTL output for F2 populations should be in MapMaker format. We only output
the A/B/H genotypes as they are the only ones mappable from RAD data. You can
change this behavior by modifying the genotypes program if you need the C/D
genotypes or want to change any other of the genotype mappings.

Good luck with it.

julian
> denovo_map.pl <http://denovo_map.pl> -p/-r --> genotypes program
> denovo_map.pl <http://denovo_map.pl> -s --> populations program
>
> Best,
>
> julian
>
>
>
> On 12/19/12 4:16 PM, Amir wrote:
>
> My experiment is about individual samples of same population and I ran
> 'denovo_map' with '-s' to analyze them. I want to export existent
> genotypes to
> feed in linkage mapping programs for the first time and I don't have any
> experience in those programs too. When I run 'genotypes' function for
> any map
> type rather than 'GEN', and any output file type, i.e. rqtl, joinmap,
> ... I get
> nothing. The output looks like as follows:
>
> Found 2 input file(s).
> Parsing ./stacks/batch_1.catalog.tags.__tsv
> Parsing ./stacks/batch_1.catalog.snps.__tsv
> Parsing ./stacks/batch_1.catalog.__alleles.tsv
> Identified parent IDs: 1 2
> Parsing ./stacks/111_CCGGCC.matches.__tsv
> Parsing ./stacks/112_CCGGCC.matches.__tsv
> Populating observed haplotypes for 2 samples, 33252 loci.
> */Writing 0 loci to R/QTL file, './stacks/batch_1.genotypes_3.__loc'/*
>
> Writing SQL markers file to './stacks/batch_1.markers.tsv'
> Writing SQL genotypes file to './stacks/batch_1.genotypes_3.__txt'
> */Writing 0 loci to observed haplotype file,
> './stacks/batch_1.haplotypes___3.tsv'/*
>
>
> The only time I have something rather than zero there is when I use
> 'GEN' and
> put the output file type after -o as blank. Do you have any idea what is
> wrong
> here. I am not sure if I use this function in an incorrect way or it
> relates to
> this fact that I have individuals and not Parent and Progenies. Does Stacks
> generate outputs for MapMaker software too?
> Thanks in advance for helping me out.
>
>
> --
> For more options or to unsubscribe:
> http://groups.google.com/__group/stacks-users
> <http://groups.google.com/group/stacks-users>
> Stacks website: http://creskolab.uoregon.edu/__stacks/
> <http://creskolab.uoregon.edu/stacks/>
>
>
> --
> For more options or to unsubscribe: http://groups.google.com/group/stacks-users
> Stacks website: http://creskolab.uoregon.edu/stacks/

--
Julian M Catchen, Ph.D.
Institute of Ecology and Evolution
University of Oregon
--
jcat...@uoregon.edu
http://www.uoregon.edu/~jcatchen/

Amir Nikooienejad

unread,
Mar 21, 2013, 6:52:21 PM3/21/13
to stacks...@googlegroups.com
Hi Julian,

I want to make an input file for MapMaker and as you suggested previously in this thread, I use F2 cross in rqtl. Now I have some questions regarding that. I should thank you in advance and also appreciate the amount of effort you put into the Stacks project.

1. Two sets of files are generated: *_genotypes_1 and *_genotypes_3. What is the difference between these two and which should be used?
2. I use the file batch_1.genotypes_3.txt and reorder that in a csv file in which each row is a marker and columns are progenies. Is it a reasonable way to make an input for MapMaker?
3. What is the difference between lower case (a/b/h) markers and capital ones (A/B/H)?
4. Sometimes for one of progenies the depth of allele inherited from one of the parents, say b, is about 30 or more, but Stacks calls that genotype as '-', instead of 'b'. What could be the reason?
5. What are indices (4th column) of batch_1.genotypes_3.txt file? I mean what sample each index refers to? (I think that should be the order viewed in web interface).

Sorry if I asked many questions.

Regards,
Amir

Julian Catchen

unread,
Mar 22, 2013, 8:27:29 PM3/22/13
to stacks...@googlegroups.com, Amir
Hi Amir,

Answers below.

On 3/21/13 3:52 PM, Amir Nikooienejad wrote:
> I want to make an input file for MapMaker and as you suggested previously in
> this thread, I use F2 cross in rqtl. Now I have some questions regarding that.
>
> 1. Two sets of files are generated: *_genotypes_1 and *_genotypes_3. What is the
> difference between these two and which should be used?

The "_1" and "_3" correspond to the minimum number of progeny required to print
a marker. When you run the pipeline through denovo_map.pl, it will set this
value to 1, giving your first genotypes file. Then, I assume you ran genotypes
by hand and supplied "-r 3" to get the second file.

> 2. I use the file batch_1.genotypes_3.txt and reorder that in a csv file in
> which each row is a marker and columns are progenies. Is it a reasonable way to
> make an input for MapMaker?

I would specify the r/qtl output type. This should be the same as mapmaker for
F2 maps.

> 3. What is the difference between lower case (a/b/h) markers and capital ones
> (A/B/H)?

Capital markers signify that the marker has been auto-corrected by the genotypes
program.

> 4. Sometimes for one of progenies the depth of allele inherited from one of the
> parents, say b, is about 30 or more, but Stacks calls that genotype as '-',
> instead of 'b'. What could be the reason?

This is usually caused by the auto-corrector. In this case there are 30 'b'
reads, but likely one or two reads for allele 'a'. This is not enough reads to
call the locus a heterozygote, but too many 'b' alleles to know for sure it is
homozygous. So, it sets it to 'unknown'. You can adjust the auto correction
parameters on the command line or turn off auto corrections.

> 5. What are indices (4th column) of batch_1.genotypes_3.txt file? I mean what
> sample each index refers to? (I think that should be the order viewed in web
> interface).
>

The five columns in that file are:

id | batch_id | catalog_id | sample_id | genotype

Every sample input into the pipeline is given a unique ID. If you look in the
database in the samples table, or if you click on the "Samples" link on the main
page of the web interface, you will see these numbers. They are assigned by
denovo_map.pl as it reads each sample into the pipeline.


Best,

julian

Amir Nikooienejad

unread,
Mar 23, 2013, 7:07:41 PM3/23/13
to stacks...@googlegroups.com
Hi Julian,

I appreciate your responses. They are really helpful. But just two little things regarding your answers:

1. As you mentioned in answer to question 4, there are some loci in which some samples show many read for allele 'a', but one or two reads for 'b' but they are not that much to call that locus heterozygote. What is the threshold for that call? does it relate to 'chi square significance level' in your SNP model? 

2. In both ./stacks/ output files and web interface, those two reads of allele 'b' are counted as 'a' and the counts are (instead of 25 and 2) 27 for allele 'a'. It seems the genotype program is aware of those little counts to call it '-'. Is there anyway for us to extract those small counts too just like the genotype program. (I know it is possible by looking at that Stacks' read sequences, but I mean are those 25 and 2 also printed somewhere instead of 27?)  

Best,
Amir

Julian Catchen

unread,
Mar 25, 2013, 1:48:52 PM3/25/13
to stacks...@googlegroups.com, Amir
Hi Amir,

Answers below.

On 3/23/13 4:07 PM, Amir Nikooienejad wrote:
> 1. As you mentioned in answer to question 4, there are some loci in which some
> samples show many read for allele 'a', but one or two reads for 'b' but they are
> not that much to call that locus heterozygote. What is the threshold for that
> call? does it relate to 'chi square significance level' in your SNP model?
>

When originally calling sites as polymorphic or monomorphic, the SNP model uses
the chi square significance level to decide when a call is statistically
significant. This happens at each individual when processed by ustacks or
pstacks. The significance level can be made more stringent at this stage if you
like.

The automated corrections runs in the genotypes program, and it looks at calls
across the whole population. These command line options to genotypes control
these thresholds:

Automated corrections options:
--min_hom_seqs: minimum number of reads required at a stack to call a
homozygous genotype (default 5).
--min_het_seqs: below this minor allele frequency a stack is called a
homozygote, above it (but below --max_het_seqs) it is
called unknown (default 0.05).
--max_het_seqs: minimum frequency of minor allele to call a
heterozygote (default 0.1).

So, if a locus in an individual has less than 5 reads, then the automated
corrector will change its genotype to '-' instead of assuming it is a homozygote.

If there is evidence that a site is polymorphic, e.g. there is at least one read
representing the second allele at a site, then the software calculates the
frequency of the second allele. If the frequency is below 5%, the software does
not consider it to be enough evidence and calls a homozygote. If the minor
allele frequency is between 5 and 10%, it does not have enough evidence to call
a heterzygote, but it is not likely a homozygote, so it is set to unknown, '-'.
If there is the allele has a frequency of 10% or higher, it it called a het.

So, the SNP model is looking for roughly a 50% representation of the two alleles
at a polymorphic site, however, the genotypes program, because it knows the SNP
is definitely in a bunch of other individuals in the population, only requires a
10% frequency to call it.

Of course, you can adjust these as you like.

> 2. In both ./stacks/ output files and web interface, those two reads of allele
> 'b' are counted as 'a' and the counts are (instead of 25 and 2) 27 for allele
> 'a'. It seems the genotype program is aware of those little counts to call it
> '-'. Is there anyway for us to extract those small counts too just like the
> genotype program. (I know it is possible by looking at that Stacks' read
> sequences, but I mean are those 25 and 2 also printed somewhere instead of 27?)
>

Yes, you can use the export_sql.pl program to get the raw read counts. Specify
the '-d' option.

Best,

julian



> On Fri, Mar 22, 2013 at 7:27 PM, Julian Catchen <jcat...@uoregon.edu
> <mailto:jcat...@uoregon.edu>> wrote:
>
> Hi Amir,
>
> Answers below.
>
>
> On 3/21/13 3:52 PM, Amir Nikooienejad wrote:
>
> I want to make an input file for MapMaker and as you suggested previously in
> this thread, I use F2 cross in rqtl. Now I have some questions regarding
> that.
>
> 1. Two sets of files are generated: *_genotypes_1 and *_genotypes_3.
> What is the
> difference between these two and which should be used?
>
>
> The "_1" and "_3" correspond to the minimum number of progeny required to
> print a marker. When you run the pipeline through denovo_map.pl
> <http://denovo_map.pl>, it will set this value to 1, giving your first
> assigned by denovo_map.pl <http://denovo_map.pl> as it reads each sample
> into the pipeline.
>

Amir Nikooienejad

unread,
Mar 25, 2013, 10:13:34 PM3/25/13
to stacks...@googlegroups.com
Hi Julian,

Your responses are impressive. Thank you so much and I really appreciate it.
If there are further questions I post them here but as of now I don't think I have any other questions regarding genotypes.

Best,
Amir

Amir Nikooienejad

unread,
Mar 27, 2013, 11:57:21 PM3/27/13
to stacks...@googlegroups.com
Hi Julian,

Just wanted to mention that I used 'export_sql' command with -d option but I still get the count as 27 for one of the alleles, instead of 25 and 2 for two different ones. Therefore it is the same as those counts reported in files in ./stacks/ folder or web interface which are added up.
Does it mean that there is no way to retrieve that 2 or any other small counts that do not pass the heterozygocity threshold?

Best,
Amir
Reply all
Reply to author
Forward
0 new messages