rs4477212 1 82154 AA #23andme type
rs4477212 1 82154 A A #ancestry type
'ancestry' => {
  separator: "\t",
  snp_name: 0,
  chromosome: 1,
  position: 2,
  local_genotype: 3 AND 4 # needs slight adjustment: the two alleles sit in separate columns
},

ftdna has extra double quotation marks (") everywhere; they need to be stripped first, e.g. via sed: sed 's/"//g'

'ftdna' => {
  separator: ",",
  snp_name: 0,
  chromosome: 1,
  position: 2,
  local_genotype: 3,
  skip: 1, # has a header line
  replace_quot: 1 # will run sed on every line first, or do a gsub
},
'exome-vcf' => {
  separator: "\t", # double quotes, otherwise Ruby reads a literal backslash-t
  snp_name: 2,
  chromosome: 0,
  position: 1,
  local_genotype: ???, # extra magic, see above
  skip_indel: 1 # will run grep -v 'IndelType'
}

Hope this helps.
Yes, let's get this straight by the weekend and then run the whole thing again for everyone. We should even drop SNPs/UserSNPs table and just run the fastestestestsest for everyone, followed by frequency calculation? That way we won't get any dirty overhang in the db. It should only take half a day or a full day if the parsing takes only a few minutes per genotype.
What does this line do? https://github.com/tsujigiri/snpr/blob/fastestest_import/app/workers/parsing.rb#L49
Does it remove a header line?
For the configs:
exome-vcf is a bit more complicated, we should skip Indels via grep -v 'IndelType', they look like this:
1 1275264 . A ACAGCCGCATGTCCCCCAGCAGCCCCCACAGACCCACCCG 852.04 PASS AC=blablabla
Here the reference has an A, while the individual in question has a long insertion of ACAGC... Lines like this might break anything that depends on how UserSNPs are displayed.
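If we end up doing this in Ruby instead of shelling out to grep, a rough sketch could look like the following. It assumes, like the grep does, that indel lines carry the string IndelType somewhere (the sample line above has its INFO column shortened, so I can't verify that here), and as a fallback it also treats any record whose REF or ALT is longer than one base as an indel. The method name indel? is made up, and the second sample line has its INFO column shortened for readability.

```ruby
# Returns true if a VCF data line looks like an insertion/deletion.
# Columns are CHROM, POS, ID, REF, ALT, ...; splitting on any whitespace
# works for both real tab-separated files and the pasted samples.
def indel?(line)
  return true if line.include?('IndelType')

  _chrom, _pos, _id, ref, alt = line.split
  # ALT may list several alleles separated by commas.
  alleles = [ref.to_s, *alt.to_s.split(',')]
  alleles.any? { |allele| allele.length != 1 }
end

lines = [
  '1 1275264 . A ACAGCCGCATGTCCCCCAGCAGCCCCCACAGACCCACCCG 852.04 PASS AC=blablabla',
  '1 14907 rs79585140 A G 2179.91 MQFilter40 AB=0.424 GT:AD:DP:GQ:PL 0/1:87,118:205:99:2210,0,1321'
]

# Keep only the lines that are not indels.
snps = lines.reject { |line| indel?(line) }
```

The length check is the part that would still work if the IndelType annotation turns out to be missing in some files.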
Parsing the line itself isn't easy either. Here's a proper SNP:
1 14907 rs79585140 A G 2179.91 MQFilter40 AB=0.424;AC=1;AF=0.50;AN=2;BaseQRankSum=6.957;DB;DP=205;Dels=0.0;FS=5.141;HRun=1;HaplotypeScore=2.8561;MQ=23.95;MQ0=34;MQRankSum=-0.148;QD=10.63;ReadPosRankSum=0.551;SNPEFF_EFFECT=DOWNSTREAM;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=processed_transcript;SNPEFF_GENE_NAME=DDX11L1;SNPEFF_IMPACT=MODIFIER;SNPEFF_TRANSCRIPT_ID=ENST00000450305 GT:AD:DP:GQ:PL 0/1:87,118:205:99:2210,0,1321
This means that the reference allele is an A and the known alternative allele is a G. The person's actual alleles are hidden at the end:
0/1:87,118:205:99:2210,0,1321
This 0/1 is the person's genotype and translates to 'reference allele'/'alternative allele', so the person has an AG here. If there had been a 1/1 there, the person would have had GG; a 0/0 would have meant AA. Correct me if I'm wrong here. This isn't easy to parse out via bash alone.
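In Ruby the translation from 0/1 to AG is fairly short. Here is a minimal sketch, assuming a single-sample VCF where column 8 is the FORMAT string and column 9 is the sample, as in the example line above. The method name vcf_genotype is made up, and returning nil for anything that can't be decoded is just one possible choice.

```ruby
# Decode the genotype of one VCF data line into an allele string such as "AG".
# Returns nil for no-calls ("./.") and for lines without a usable GT field.
def vcf_genotype(line)
  fields = line.split
  ref, alt = fields[3], fields[4]
  format, sample = fields[8], fields[9]
  return nil if ref.nil? || alt.nil? || format.nil? || sample.nil?

  # Find where GT sits inside the colon-separated FORMAT string.
  gt_index = format.split(':').index('GT')
  return nil if gt_index.nil?

  gt = sample.split(':')[gt_index]
  return nil if gt.nil? || gt.empty?

  # Index 0 is the reference allele, 1 and up are the alternative alleles.
  alleles = [ref, *alt.split(',')]
  # "/" separates unphased calls, "|" separates phased calls.
  calls = gt.split(%r{[/|]})
  return nil if calls.any? { |call| call == '.' }

  called = calls.map { |call| alleles[call.to_i] }
  return nil if called.any?(&:nil?)

  called.join
end
```

With the SNP line from above this gives AG for 0/1, GG for 1/1 and AA for 0/0, which matches the reading described in this mail.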
We could import them into new tables and then swap them out for the old ones. We also need to come up with ways to monitor whether everything works as intended. Sending the number of imported rows to the user might help, so they can tell us if this number seems off. Any more ideas?
On 27.08.2014 02:23, Philipp Bayer wrote:
> Yes, let's get this straight by the weekend and then run the whole thing again for everyone. We should even drop SNPs/UserSNPs table and just run the fastestestestsest for everyone, followed by frequency calculation? That way we won't get any dirty overhang in the db. It should only take half a day or a full day if the parsing takes only a few minutes per genotype.
Yes. Any better idea how to do this without accidentally skipping data? I would like to test whether using the csv library would actually be any slower. Using it would be much cleaner.
> What does this line do? https://github.com/tsujigiri/snpr/blob/fastestest_import/app/workers/parsing.rb#L49
> Does it remove a header line?
Who comes up with something like this?
I guess we don't go the configuration way then, but have a method for each format that puts it into a unified csv format, potentially using the csv lib where applicable. That might also be less messy than the config thing.
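To make the method-per-format idea a bit more concrete, here is a rough sketch for two of the formats. The method names are made up, the column positions are taken from the configs earlier in this thread, and I'm assuming the ftdna file has exactly one header line. Whether this is fast enough is exactly what would need testing.

```ruby
require 'csv'

# Each method takes the raw file content as a String and returns rows of
# [snp_name, chromosome, position, local_genotype].

# ancestry: tab-separated, the two alleles sit in columns 3 and 4.
def normalize_ancestry(raw)
  CSV.parse(raw, col_sep: "\t").map do |name, chrom, pos, allele1, allele2|
    [name, chrom, pos, "#{allele1}#{allele2}"]
  end
end

# ftdna: comma-separated with every field wrapped in double quotes.
# The csv lib strips the quotes, so no sed or gsub is needed.
def normalize_ftdna(raw)
  CSV.parse(raw).drop(1).map { |row| row.first(4) } # drop(1) skips the header
end

# Serialize the unified rows into one common csv format.
def to_unified_csv(rows)
  CSV.generate { |csv| rows.each { |row| csv << row } }
end
```

Usage would be something like to_unified_csv(normalize_ftdna(File.read(path))), with the exome-vcf method doing its own indel skipping and genotype decoding before returning the same four columns.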