Are we losing UserSNPs?


Philipp Bayer

Aug 26, 2014, 8:10:29 AM
to snpr-dev...@googlegroups.com
Hi,

We got an email saying we should have more individuals for rs2066847, about 1,100. The page shows only 541.
https://opensnp.org/snps/rs2066847#users

grep confirms that, in public/data:

>LC_ALL=C grep rs2066847 * | wc -l
1049

>Snp.find_by_name("rs2066847").user_snps.size

says

541 (this checks the count fields instead of manually counting everything)

>Snp.find_by_name("rs2066847").user_snps.count

says

35 (this issues a COUNT(*))

.length says 35 too. (this creates a Ruby array and checks that)

If I check out the array manually, it returns 35.
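For context, `.size` prefers the cached `user_snps_count` column when a counter cache is set up, while `.count` always issues SQL. Here's a plain-Ruby sketch (hypothetical `FakeSnp` class, no Rails involved) of how a cached value can drift from the real row count:

```ruby
# Plain-Ruby illustration of a counter cache drifting from the real
# row count. ActiveRecord's #size prefers a *_count column when one
# exists; #count always issues SELECT COUNT(*).
class FakeSnp
  attr_accessor :user_snps_count # the cached counter column
  attr_reader :user_snps         # the actual rows

  def initialize
    @user_snps_count = 0
    @user_snps = []
  end
end

snp = FakeSnp.new
3.times { snp.user_snps << :row; snp.user_snps_count += 1 }

# Rows written via raw SQL (e.g. COPY) bypass ActiveRecord callbacks,
# so the counter is never touched:
2.times { snp.user_snps << :copied_row }

snp.user_snps_count  # => 3 (what .size would report)
snp.user_snps.length # => 5 (what .count / .length would report)
```

The drift can go either way: rows created or deleted outside ActiveRecord bypass the callbacks that keep the counter in sync.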

Are we losing UserSNPs, or what is going on? Is this DB corruption, or is my sleepiness making me miss something?

Helge Rausch

Aug 26, 2014, 10:00:27 AM
to snpr-dev...@googlegroups.com
The database has 36. When I ran the parser for all genotypes that didn't have any user_snps, it was surprisingly fast. So I guess we might also be swallowing errors during that run, or some rows just don't get imported.

I propose finally getting the parsing straight and then letting it run over all the genotypes again. The only thing missing, if we use the SQL COPY version, are the configurations for all formats except decodeme and 23andme.

See:
https://github.com/tsujigiri/snpr/blob/fastestest_import/app/workers/parsing.rb

One other thing I have been thinking about is counting the rows that get imported and putting that number into the mail to the user, instead of just saying something went wrong.
--
XMPP+OTR: helge....@jabber.ccc.de
Threema: TXFZ3MFV



Philipp Bayer

Aug 26, 2014, 8:23:30 PM
to snpr-dev...@googlegroups.com
Yes, let's get this straight by the weekend and then run the whole thing again for everyone. We could even drop the SNPs/UserSNPs tables and just run the fastestest import for everyone, followed by the frequency calculation. That way we won't get any dirty overhang in the DB. It should only take half a day to a full day if the parsing takes only a few minutes per genotype.

What does this line do? https://github.com/tsujigiri/snpr/blob/fastestest_import/app/workers/parsing.rb#L49
Does it remove a header line?

For the configs:

exome-vcf is a bit more complicated; we should skip indels via grep -v 'IndelType'. They look like this:

1       1275264 .       A       ACAGCCGCATGTCCCCCAGCAGCCCCCACAGACCCACCCG        852.04  PASS    AC=blablabla

Here the reference has an A, while the individual in question has a long insertion of ACAGC... These might mess up anything that depends on displaying UserSNPs.

The parsing of the line itself isn't easy, either, here's a proper SNP:

1    14907    rs79585140    A    G    2179.91    MQFilter40    AB=0.424;AC=1;AF=0.50;AN=2;BaseQRankSum=6.957;DB;DP=205;Dels=0.0;FS=5.141;HRun=1;HaplotypeScore=2.8561;MQ=23.95;MQ0=34;MQRankSum=-0.148;QD=10.63;ReadPosRankSum=0.551;SNPEFF_EFFECT=DOWNSTREAM;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=processed_transcript;SNPEFF_GENE_NAME=DDX11L1;SNPEFF_IMPACT=MODIFIER;SNPEFF_TRANSCRIPT_ID=ENST00000450305    GT:AD:DP:GQ:PL    0/1:87,118:205:99:2210,0,1321

This means the reference allele carries an A and the known alternative allele is a G. The actual person's alleles are hidden at the end:
0/1:87,118:205:99:2210,0,1321
The 0/1 is the person's genotype and translates to 'reference allele/alternative allele', so the person has an AG here. If there had been a 1/1, the person would have had GG; a 0/0 would have meant AA. Correct me if I'm wrong here. This isn't easy to parse out via bash alone.
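Sketching that GT logic in Ruby (hypothetical vcf_genotype helper, biallelic SNPs only; not the actual parser):

```ruby
# Resolve the sample's alleles from a VCF data line: 0 = REF, 1 = ALT.
# Only handles biallelic SNPs (a single ALT allele).
def vcf_genotype(line)
  fields = line.split("\t")
  ref, alt = fields[3], fields[4]
  format_keys = fields[8].split(":")   # e.g. GT:AD:DP:GQ:PL
  sample      = fields[9].split(":")
  gt = sample[format_keys.index("GT")] # e.g. "0/1"
  # GT separates alleles with "/" (unphased) or "|" (phased)
  gt.split(%r{[/|]}).map { |i| i == "0" ? ref : alt }.join
end

line = ["1", "14907", "rs79585140", "A", "G", "2179.91", "MQFilter40",
        "AC=1", "GT:AD:DP:GQ:PL", "0/1:87,118:205:99:2210,0,1321"].join("\t")
vcf_genotype(line) # => "AG"
```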

Some more configs:

# AncestryDNA looks like 23andme, EXCEPT there's a tab between the two alleles:
rs4477212	1	82154	AA #23andme type
rs4477212	1	82154	A    A #ancestry type
'ancestry' => {
separator: "\t",
snp_name: 0,
chromosome: 1,
position: 2,
local_genotype: 3 AND 4 # needs slight adjustment
},
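The "slight adjustment" could be as simple as joining the two columns back together (hypothetical helper, not the actual parser):

```ruby
# AncestryDNA keeps the two alleles in separate tab-delimited columns;
# join columns 3 and 4 to match the 23andme single-column layout.
def ancestry_genotype(line)
  cols = line.strip.split("\t")
  cols[3] + cols[4]
end

ancestry_genotype("rs4477212\t1\t82154\tA\tA") # => "AA"
```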

ftdna has extra double quotation marks (") everywhere; they need to be removed, e.g. via sed 's/"//g'

'ftdna' => {
separator: ",",
snp_name: 0,
chromosome: 1,
position: 2,
local_genotype: 3,
skip: 1, # has a header line
replace_quot: 1 # will run sed on every line first, or do a gsub
},

'exome-vcf' => {
separator: "\t",
snp_name: 2,
chromosome: 0,
position: 1,
local_genotype: ??? # extra magic, see above
skip_indel: 1 # will run grep -v 'IndelType'
}
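To make the config idea concrete, here's a rough sketch of applying the ftdna config to a raw line (assumed normalize helper; the real parser would also honor skip and handle exome-vcf's extra magic):

```ruby
FTDNA = { separator: ",", snp_name: 0, chromosome: 1,
          position: 2, local_genotype: 3 }

# Strip quotes (the replace_quot step), split on the configured
# separator, and pick out the configured columns.
def normalize(line, config)
  cols = line.gsub('"', '').strip.split(config[:separator])
  { snp_name:   cols[config[:snp_name]].downcase,
    chromosome: cols[config[:chromosome]],
    position:   cols[config[:position]],
    genotype:   cols[config[:local_genotype]] }
end

normalize(%Q{"rs3094315","1","742429","AA"}, FTDNA)
# parses to snp_name "rs3094315", chromosome "1", position "742429", genotype "AA"
```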

Hope this helps.

Helge Rausch

Aug 27, 2014, 3:01:19 AM
to snpr-dev...@googlegroups.com

On 27.08.2014 02:23, Philipp Bayer wrote:
> Yes, let's get this straight by the weekend and then run the whole thing again for everyone. We should even drop SNPs/UserSNPs table and just run the fastestestestsest for everyone, followed by frequency calculation? That way we won't get any dirty overhang in the db. It should only take half a day or a full day if the parsing takes only a few minutes per genotype.

We could import them into new tables and then swap them out for the old ones. We also need to come up with ways to monitor if everything works as intended. Sending the number of imported rows to the user might help, so they can tell us if this number seems off. Any more ideas?


> Does it remove a header line?

Yes. Any better idea how to do this without accidentally skipping data? I would like to try whether actually using the CSV library would be any slower. Using it would be much cleaner.


> For the configs:
>
> exome-vcf is a bit more complicated, we should skip Indels via grep -v 'IndelType', they look like this:
>
> 1       1275264 .       A       ACAGCCGCATGTCCCCCAGCAGCCCCCACAGACCCACCCG        852.04  PASS    AC=blablabla
>
> Here the reference has an A, while the individual in question has a long insertion of ACAGC... They might mess up anything depending on the display of UserSNPs.
>
> The parsing of the line itself isn't easy, either, here's a proper SNP:
>
> 1    14907    rs79585140    A    G    2179.91    MQFilter40    AB=0.424;AC=1;AF=0.50;AN=2;BaseQRankSum=6.957;DB;DP=205;Dels=0.0;FS=5.141;HRun=1;HaplotypeScore=2.8561;MQ=23.95;MQ0=34;MQRankSum=-0.148;QD=10.63;ReadPosRankSum=0.551;SNPEFF_EFFECT=DOWNSTREAM;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=processed_transcript;SNPEFF_GENE_NAME=DDX11L1;SNPEFF_IMPACT=MODIFIER;SNPEFF_TRANSCRIPT_ID=ENST00000450305    GT:AD:DP:GQ:PL    0/1:87,118:205:99:2210,0,1321
>
> This means that the reference allele carries an A and the alternative known allele is a G. The actual person's alleles are hidden in the end:
> 0/1:87,118:205:99:2210,0,1321
> This 0/1 is the person's genotype and translates to 'reference allele'/alternative allele', so the person has an AG here. If there would have been a 1/1 there, the person would have had GG, a 0/0 would have meant AA. Correct me if I'm wrong here. This isn't easy to parse out via bash alone.

Who comes up with something like this?

I guess we don't go the configuration way then, but have a method for each format that puts it into a unified CSV format, potentially using the CSV lib if applicable. Might also be less messy than the config thing.

It does. Thanks!

Helge Rausch

Aug 27, 2014, 4:16:23 AM
to snpr-dev...@googlegroups.com

On 27.08.2014 09:01, Helge Rausch wrote:
> I would like to try if actually using the csv library would be any
> slower. Using this would be much cleaner.

Aaand it is veeeery sloooooooow... ;) Was worth a try.

Philipp Bayer

Aug 27, 2014, 5:16:36 AM
to snpr-dev...@googlegroups.com
On Wed, Aug 27, 2014 at 5:01 PM, Helge Rausch <he...@rausch.io> wrote:

> On 27.08.2014 02:23, Philipp Bayer wrote:
>> Yes, let's get this straight by the weekend and then run the whole thing again for everyone. We should even drop SNPs/UserSNPs table and just run the fastestestestsest for everyone, followed by frequency calculation? That way we won't get any dirty overhang in the db. It should only take half a day or a full day if the parsing takes only a few minutes per genotype.
>
> We could import them into new tables and then swap them out for the old ones. We also need to come up with ways to monitor if everything works as intended. Sending the number of imported rows to the user might help, so they can tell us if this number seems off. Any more ideas?

Pseudobash:

needed = `grep -v "#" file | wc -l`
if counter_user_snps_just_parsed < needed:
  raise error to errbit

That way, we don't get any mails from confused users :P
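Since the importer is Ruby anyway, that pseudobash could look something like this (hypothetical verify_import helper; "raise" stands in for whatever errbit notifier we wire up):

```ruby
# Compare the number of imported rows against the number of
# non-comment lines in the uploaded file; raise if rows went missing.
def verify_import(path, imported_count)
  expected = File.foreach(path).count { |line| !line.start_with?("#") }
  if imported_count < expected
    raise "Import incomplete: #{imported_count}/#{expected} rows"
  end
  expected
end
```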
 


>> Does it remove a header line?
>
> Yes. Any better idea how to do this without accidentally skipping data? I would like to try if actually using the csv library would be any slower. Using this would be much cleaner.

Hm, no. I think the number of header lines is always standard per format...
 

> Who comes up with something like this?

Bioinformatics: "Invent a new weakly defined, internally redundant, ambiguous, bulky fruit salad of a data format. Again. "
https://www.biostars.org/p/7126/#7136

> I guess we don't go the configuration way then but have a method for each format that puts it in a unified csv format. Potentially using the csv lib if applicable. Might also be less messy then the config thing.

I like the parser for everything except the exome-vcf, and it seems very fast. We could just run the "old" parser for the exome-vcf? We only get those extremely rarely anyway, so it doesn't matter how long they run.

Helge Rausch

Aug 27, 2014, 5:56:42 AM
to snpr-dev...@googlegroups.com

> Pseudobash:
>
> needed = `grep -v "#" file | wc -l`
> if counter_user_snps_just_parsed < needed:
> raise error to errbit
>
> That way, we don't get any mails from confused users :P

Since the SQL COPY parser is already written in Ruby, I think we can go
on with that. ;) Processing the CSV in Ruby (without the csv lib) seems
faster than the bash version anyway. I'm also thinking validating the
individual fields would be a good idea, to avoid getting errors when
running the SQL COPY. Something like only selecting rows where the
genotype consists of some two-letter permutation of ACGT. Then we count
how many we didn't import and tell the user to get their file fixed. ;)

Philipp Bayer

Aug 27, 2014, 6:13:29 AM
to snpr-dev...@googlegroups.com
Yes, good idea -

so chromosomes would have to be 1 <= x <= 22, or X, Y, MT.
SNPs can be A, T, G, C (MT SNPs, for example, or X SNPs), or any of the
combinations AA, AT, AG, AC, etc.
Positions can be 1 <= x <= 250,000,000 or so; chromosome 1 has
249,250,621 base pairs and that's the longest, so 250 million?
SNP names can be tons of things, so I'd just leave them as-is.
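Those rules could translate to something like this (sketch with assumed names, not the actual validation code):

```ruby
CHROMOSOMES = (1..22).map(&:to_s) + %w[X Y MT]
GENOTYPE    = /\A[ACGT]{1,2}\z/   # one allele (MT/X/Y) or a pair
MAX_POS     = 250_000_000         # chr1 has 249,250,621 bp, the longest

def valid_row?(chromosome, position, genotype)
  CHROMOSOMES.include?(chromosome) &&
    position.to_i.between?(1, MAX_POS) &&
    genotype.match?(GENOTYPE)
end

valid_row?("1", "82154", "AA")  # => true
valid_row?("26", "82154", "AA") # => false
valid_row?("MT", "12345", "A")  # => true
```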