Genotype parsing in 22 seconds


Helge Rausch

Aug 18, 2014, 11:42:50 AM
to snpr-dev...@googlegroups.com
I just did this:

$ time cat ~/Downloads/1.23andme.9.txt \
    | grep -v '#' \
    | cut -f 1,4 --output-delimiter=, \
    | sed "s/^/42,/" \
    | psql snpr_development -c 'copy user_snps (genotype_id, snp_name, local_genotype) from STDIN with (FORMAT CSV, HEADER FALSE, DELIMITER ",")'

And it returned this:

cat ~/Downloads/1.23andme.9.txt 0.00s user 0.02s system 0% cpu 20.688 total
grep -v '#' 0.10s user 0.02s system 0% cpu 20.875 total
cut -f 1,4 --output-delimiter=, 0.18s user 0.03s system 0% cpu 20.877 total
sed "s/^/42,/" 0.65s user 0.01s system 3% cpu 21.034 total
psql snpr_development -c 0.19s user 0.02s system 0% cpu 22.226 total

:)

This of course only covers creating the user_snps, and only for 23andMe,
but it shouldn't be that hard to extend it.

Seems worth it to me! What do you think?


--
XMPP+OTR: helge....@jabber.ccc.de
Threema: TXFZ3MFV



Samantha Clark

Aug 18, 2014, 4:09:34 PM
to openSNP Dev List
Now I'm curious about how fast you could do an exome VCF!

Philipp Bayer

Aug 18, 2014, 7:40:09 PM
to snpr-dev...@googlegroups.com
haha very nice :D A good side-effect of linking the tables via SNP names rather than SNP IDs: the names are known before the SNP entries are created, the IDs are not.

Couldn't you just run the same command to create the SNPs in a second run? PostgreSQL should skip the SNPs we already have, since the name has to be unique. You'd iterate over the table twice, but it's not particularly long anyway.



Helge Rausch

Aug 19, 2014, 3:22:27 AM
to snpr-dev...@googlegroups.com


On 19.08.2014 01:40, Philipp Bayer wrote:
> Couldn't you just run the same command to create the SNPs in a second run? PostgreSQL should skip the SNPs we already have, since the name has to be unique. You'd iterate over the table twice, but it's not particularly long anyway.

With the unique constraint, Postgres will complain and just stop the COPY if there are duplicates. So I guess we would have to remove the known SNPs the usual way: look up what's already there and remove it from the data to be imported.

I was wondering about that with the user_snps as well. Since we don't have a unique constraint there, we could even remove the duplicates with a scheduled maintenance task after importing, or something like that. Not ideal, but fast in terms of importing.

Another way that could work is to first import the data into a temporary table and then copy over everything that's not already in the actual table. That may still be faster than getting all the SNP names out of the database first. I'll just go on experimenting a bit.
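For illustration, a minimal sketch of the temp-table idea for the snps table. The staging table snps_import and the file snp_names.csv are made-up names, and it assumes snps has a unique "name" column; it is not the actual import script.

# Hypothetical sketch: stage into a temp table, then insert only new names.
psql snpr_development <<'SQL'
CREATE TEMP TABLE snps_import (name text);
\copy snps_import (name) FROM 'snp_names.csv' WITH (FORMAT CSV)
INSERT INTO snps (name)
SELECT DISTINCT i.name
FROM snps_import i
LEFT JOIN snps s ON s.name = i.name
WHERE s.name IS NULL;
SQL

The LEFT JOIN ... IS NULL filter does the duplicate check up front, so the unique constraint is never tripped and the COPY itself never aborts.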




On Tue, Aug 19, 2014 at 6:09 AM, Samantha Clark <iov...@gmail.com> wrote:
> Now I'm curious about how fast you could do an exome VCF!

I will try. :)
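For reference, a rough sketch of how the same pipeline might look for a single-sample VCF. The file name exome.vcf is made up; this assumes GT is the first subfield of the sample column (it usually is) and ignores indels, haploid calls, and missing genotypes.

# Hypothetical sketch: map GT allele indices back to REF/ALT letters.
awk -F'\t' '!/^#/ {
  n = split($5, alt, ",")            # ALT can list several alleles
  a[0] = $4                          # allele index 0 is REF
  for (i = 1; i <= n; i++) a[i] = alt[i]
  split($10, s, ":")                 # sample column; GT usually comes first
  if (split(s[1], g, /[\/|]/) == 2 && g[1] != "." && g[2] != ".")
    print $3 "," a[g[1]] a[g[2]]     # e.g. rs123,AG
}' exome.vcf \
  | sed "s/^/42,/" \
  | psql snpr_development -c 'copy user_snps (genotype_id, snp_name, local_genotype) from STDIN with (FORMAT CSV, HEADER FALSE, DELIMITER ",")'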

Helge Rausch

Aug 19, 2014, 3:27:36 AM
to snpr-dev...@googlegroups.com
So... this is the temp table variant. It took 50 seconds in the initial
run, 10 in the second, where it doesn't add any data since it's already
there from the first! I have to add, though, that I have an SSD in my
notebook. The question now is how long it takes if the user_snps table
is as big as our production one and the disks are as slow. ;) But that's
for a later time. I'm heading to work now. Maybe someone else can try it
and see how long it takes, to collect more data. :)

https://github.com/tsujigiri/snpr/blob/fastestest_import/script/genotype_import.sh

Helge Rausch

Aug 20, 2014, 3:11:18 AM
to snpr-dev...@googlegroups.com
I rewrote this in Ruby and it's even faster. I suppose that's because it
doesn't have the piping overhead. The first run is now ~40s, the second
~20s. I will add the SNP creation next.

https://github.com/tsujigiri/snpr/blob/fastestest_import/app/workers/parsing.rb
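For comparison, a minimal sketch of what a COPY-based import can look like with the pg gem. This is not the linked parsing.rb; the connection parameters, file path, and the hard-coded genotype id 42 are placeholders.

# Sketch only: stream 23andMe lines straight into a Postgres COPY.
require 'pg'

conn = PG.connect(dbname: 'snpr_development')
genotype_id = 42

sql = 'COPY user_snps (genotype_id, snp_name, local_genotype) ' \
      'FROM STDIN WITH (FORMAT CSV)'
conn.copy_data(sql) do
  File.foreach(File.expand_path('~/Downloads/1.23andme.9.txt')) do |line|
    next if line.start_with?('#')                  # 23andMe header comments
    snp_name, _chrom, _pos, genotype = line.chomp.split("\t")
    conn.put_copy_data("#{genotype_id},#{snp_name},#{genotype}\n")
  end
end
conn.close

Feeding put_copy_data directly avoids spawning the grep/cut/sed processes, which would fit the observation that the piping overhead is gone.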