The problem appears to be a duplicated locus: the same name with multiple genome mappings. Unfortunately, GLU doesn't handle this condition well and complains, as you've seen. I have a few experimental fixes, but it is not clear what the correct behavior should be. One obvious workaround would be the ability to exclude the locus before the checks for name and position uniqueness run.
Thoughts?
-Kevin
________________________________________
From: Wagner Magalhães [wcsmag...@gmail.com]
Sent: Friday, January 20, 2012 01:04 PM
To: glu-users
Subject: [glu-users] ginfo error
Dear Glu-users
Thanks,
Wagner
Materializing genotypes.
Well, this is embarrassing.
gzip: stdout: Broken pipe
What about a module that checks the genotype file for these kinds of inconsistencies and writes an output file listing the problematic loci, perhaps with a reason for flagging each one? That way the user could quickly exclude markers and complete analyses of the good loci, but would still have a reference list of the duplicates, multiple hits, etc. that could be investigated.
Nick
________________________________________
From: glu-...@googlegroups.com [glu-...@googlegroups.com] On Behalf Of Jacobs, Kevin (NIH/NCI) [C] [jaco...@mail.nih.gov]
Sent: 21 January 2012 16:00
To: glu-...@googlegroups.com
Subject: RE: [glu-users] ginfo error
Hi Wagner,
Thoughts?
-Kevin
Hi Nick,
The problem with that approach is that GLU has a very specific, perhaps overly specific, and unforgiving concept of a "genotype file". The file format parsing portion of GLU consists of "plug-ins" for many different formats, but applies essentially the same infrastructure to build data structures for use by any analysis. The advantage of this approach is that all genotype file formats get the same level of support and the same capabilities; the downside is that they all share the same limitations.
This means we have to fix this problem globally.
What should GLU do when:
1. it finds more than one SNP with the same name?
2. it finds more than one SNP at a single genomic coordinate?
3. multiple input files have SNPs with unambiguously reverse-complemented alleles?
4. multiple input files have SNPs with incompatible alleles (too many, non-reverse complement compatible, ambiguous reverse-complement, etc.)?
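To make cases 3 and 4 concrete, here is a rough Python sketch of how the allele sets reported for one SNP by two files could be compared. It is not GLU code: classify_alleles and its four labels are hypothetical names, and it assumes plain A/C/G/T alleles with at most two alleles per SNP.

```python
# Hypothetical helper, not part of GLU: compare the alleles that two
# input files report for the same SNP. Assumes plain A/C/G/T alleles.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_alleles(first, second):
    """Return 'same', 'reverse-complement', 'ambiguous' or 'incompatible'."""
    a, b = frozenset(first), frozenset(second)
    flipped = frozenset(COMPLEMENT[x] for x in a)

    # A biallelic SNP can hold at most two distinct alleles in total.
    same_strand = len(a | b) <= 2
    other_strand = len(flipped | b) <= 2

    if same_strand and other_strand:
        return "ambiguous"           # e.g. A/T vs A/T: strand cannot be told
    if same_strand:
        return "same"
    if other_strand:
        return "reverse-complement"  # case 3: unambiguous strand flip
    return "incompatible"            # case 4: too many or mismatched alleles


results = [
    classify_alleles("AG", "AG"),
    classify_alleles("AG", "CT"),
    classify_alleles("AT", "AT"),
    classify_alleles("AG", "AC"),
]
```

Under these assumptions, A/G against C/T is a clean strand flip, while A/T against A/T cannot be resolved from the alleles alone.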
Some options:
a. Is it enough to report on one error and then stop?
b. Should GLU report all errors and then stop?
c. Should GLU be able to issue warnings, but ignore errors and continue processing the remaining data (for some of the above or all of the above errors)?
Option (a) is essentially what GLU does today. I have preliminary support for option (b) in the code, but it isn’t yet user-visible. Option (c) can be implemented, but there are some tricky issues. For one, most of GLU’s code processes loci as a stream, and much of it has no idea how long the stream will be or which loci it will see next. I know this sounds silly, but it is necessary in order to work efficiently with file formats that carry no metadata on which samples or loci to expect, short of exhaustively reading the file. Thus, much of GLU cannot exclude the first instance of a problematic locus, since it cannot “look ahead” to see where conflicts may occur.
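A toy Python sketch of the streaming constraint, in the spirit of option (b): every problem is collected while the loci pass through, and the caller decides what to do once the stream ends. None of this is GLU's actual code; checked_stream, the (name, chromosome, position) tuples and the exclude set are all made up for illustration.

```python
# Illustrative only: these names are hypothetical, not GLU's API.
def checked_stream(loci, problems, exclude=frozenset()):
    """Yield loci one at a time, recording conflicts in `problems`."""
    seen_names = set()
    seen_positions = {}
    for name, chrom, pos in loci:
        if name in exclude:
            continue  # dropped before any uniqueness check runs
        if name in seen_names:
            problems.append((name, "duplicate name"))
        seen_names.add(name)
        key = (chrom, pos)
        if key in seen_positions and seen_positions[key] != name:
            problems.append((name, "same position as %s" % seen_positions[key]))
        seen_positions.setdefault(key, name)
        # By the time a conflict is detected, the first instance of the
        # locus has already been yielded downstream.
        yield name, chrom, pos


loci = [("rs1", "1", 100), ("rs2", "1", 200),
        ("rs1", "2", 300), ("rs3", "1", 200)]

problems = []
passed = [name for name, _, _ in checked_stream(iter(loci), problems)]

# Excluding the offending loci up front leaves nothing to report.
problems_after_exclude = []
kept = list(checked_stream(iter(loci), problems_after_exclude,
                           exclude={"rs1", "rs3"}))
```

In the first run, the first rs1 has already gone downstream by the time the second one turns up, which is exactly the look-ahead problem. With rs1 and rs3 excluded up front, the same checks find nothing to report.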
So is option (b) good enough, provided that I also add the ability to exclude loci prior to the error checking?
Thanks,
-Kevin