[opdev] berkeley -> mysql + questions

1 view
Skip to first unread message

Pierre Lindenbaum

unread,
Oct 21, 2008, 5:12:27 AM10/21/08
to opero...@googlegroups.com, fog...@cng.fr, he...@cng.fr
Following a request from Nicolas (DBM at Inserm) I created a tool which automatically dump a BerkeleyDB to mysql.
The tool create the tables and the indexes (see below).
The 'velocity' template used to generate the C code is here:  http://code.google.com/p/polymorphism/source/browse/trunk/operon/src/velocity/XXXstruct2mysql.c.vm

As far as I saw, Mario indexes its snp on a sequence where the first base is  indexed at '1' , whereas a center such as the UCSC starts its sequences at index='0'

this is where I'm missing something e.g rs1850299 i said to be at chr3:96571922-96572422 at UCSC but on operon I found: chr3:96572156 why such a difference ?

Pierre

mysql> desc SNPRefMap36;
+----------------+--------------+------+-----+---------+-------+
| Field          | Type         | Null | Key | Default | Extra |
+----------------+--------------+------+-----+---------+-------+
| p              | int(11)      | YES  |     | NULL    |       |
| q              | int(11)      | YES  |     | NULL    |       |
| x              | int(11)      | YES  |     | NULL    |       |
| y              | int(11)      | YES  |     | NULL    |       |
| cv             | int(11)      | YES  |     | NULL    |       |
| mw             | int(11)      | YES  |     | NULL    |       |
| sc             | int(11)      | YES  |     | NULL    |       |
| ori            | int(11)      | YES  |     | NULL    |       |
| cat            | int(11)      | YES  |     | NULL    |       |
| ver            | int(11)      | YES  |     | NULL    |       |
| fid_chrom      | varchar(20)  | YES  | MUL | NULL    |       |
| fid_taxid      | int(11)      | YES  |     | NULL    |       |
| fid_chromStart | int(11)      | YES  |     | NULL    |       |
| ctg            | varchar(13)  | YES  |     | NULL    |       |
| acn            | varchar(12)  | YES  | MUL | NULL    |       |
| band           | varchar(10)  | YES  |     | NULL    |       |
| chr            | varchar(3)   | YES  |     | NULL    |       |
| pos            | varchar(1)   | YES  |     | NULL    |       |
| com            | varchar(1)   | YES  |     | NULL    |       |
| ca             | text         | YES  |     | NULL    |       |
| ra             | varchar(175) | YES  |     | NULL    |       |
| aa             | varchar(183) | YES  |     | NULL    |       |
| flags          | varchar(16)  | YES  |     | NULL    |       |
+----------------+--------------+------+-----+---------+-------+
23 rows in set (0.00 sec)

mysql> show indexes from SNPRefMap36;
+-------------+------------+------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+
| Table       | Non_unique | Key_name         | Seq_in_index | Column_name    | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-------------+------------+------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+
| SNPRefMap36 |          1 | FidChromIndex    |            1 | fid_chrom      | A         |          25 |     NULL | NULL   | YES  | BTREE      |         |
| SNPRefMap36 |          1 | FidChromPosIndex |            1 | fid_chrom      | A         |          25 |     NULL | NULL   | YES  | BTREE      |         |
| SNPRefMap36 |          1 | FidChromPosIndex |            2 | fid_chromStart | A         |    11729953 |     NULL | NULL   | YES  | BTREE      |         |
| SNPRefMap36 |          1 | acnIndex         |            1 | acn            | A         |    11729953 |     NULL | NULL   | YES  | BTREE      |         |
+-------------+------------+------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+
4 rows in set (0.00 sec)

mysql> select * from SNPRefMap36 order by fid_chrom,fid_chromStart limit 1\G
*************************** 1. row ***************************
             p: 0
             q: 0
             x: 71
             y: 71
            cv: 1
            mw: 2
            sc: 77
           ori: 0
           cat: 0
           ver: 127
     fid_chrom: 1
     fid_taxid: 9606
fid_chromStart: 71
           ctg: NT_113872.1
           acn: rs2296233
          band: 1
           chr: 1
           pos:
           com:
            ca:
            ra: C
            aa: G
         flags: m
1 row in set (0.02 sec)



Pierre Lindenbaum

unread,
Oct 21, 2008, 3:27:47 PM10/21/08
to fog...@cng.fr, opero...@googlegroups.com, he...@cng.fr
> Pierre, requests from Nicolas are outside the scope I have been
> instructed to aid the CEPH.

and I've been asked to help people from inserm :-)

> What are you doing at
> UCSC ?

USCS/goldenpath is a valuable source of data ! :-)
Their SNP track is based on dbSNP129 http://tinyurl.com/snp129


> does not finish here. On the new Operon version that you don't have yet,
> we have dbSNP data of the latest release 129 (128 did not survived long)

ok, so as far as I understand the local version at ceph is not based
on dbsnp129.
That should explain the discrepancies.

thank you for your answer.

see you on Friday
Pierre

Mario Foglio

unread,
Oct 22, 2008, 5:46:51 AM10/22/08
to opero...@googlegroups.com
On Tue, Oct 21, 2008 at 9:27 PM, Pierre Lindenbaum <plindenb...@googlemail.com> wrote:

> Pierre, requests from Nicolas are outside the scope I have been
> instructed to aid the CEPH.

and I've been asked to help people from  inserm :-)

> What are you doing at
> UCSC ?
 
I know, but NCBI dbSNP is the public source for Operon SNP data,
and the position at UCSC is different from the current one in
dbSNP 129, and the 127.

USCS/goldenpath is a valuable source of data !  :-)
Their SNP track is based on dbSNP129 http://tinyurl.com/snp129

I know, but you have not explained why there is a difference in
position for rs1850299. 
 
> does not finish here. On the new Operon version that you don't have yet,
> we have dbSNP data of the latest release 129 (128 did not survived long)

ok, so as far as I understand the local version at ceph is not based
on dbsnp129.
That should explain the discrepancies.

No. Actually you missed what I wrote in my last email. Here it
is again:

NCBI dbSNP is the public source for Operon SNP data and from the SQL output
you are showing me in this email you should know that. What are you doing at
UCSC ? Of course you will find some differences, and these differences
depend on the algorithms used to map SNP flanking sequences with ambiguous
alignments. For example:
http://regulon.cng.fr/cgi-bin/operon?app=altera&act=marker&fid=101031096572156&tax=9606&set=36&key=ref&sid=0
This record is from dbSNP release 127, and NCBI positioned the SNP at
chr3 96572156. At the moment of inserting this SNP record, the program
that inserts the data into Operon found the alignment result unsatisfactory
and for this reason you have the flag 'w'. However, the positions are not
corrected. We live the public data as it is. Now, the story for this SNP

does not finish here. On the new Operon version that you don't have yet,
we have dbSNP data of the latest release 129 (128 did not survived long)
and the postion now is at chr 3 96572172 (again different):
http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=1850299
Nevertheless, the 'w' persists on the new Operon version. The reason is
simple. Just take a good look to the dbSNP reference flanking sequences.
This is one SNP that will never be selected as a condidate for production,
and it is here the end of the story for this SNP.

Operon identifies these sort of problems while other dbs don't. Do
you know how you can identify these sort of inconsistencies from the
UCSC SNP data set ?

Mario

Pierre Lindenbaum

unread,
Oct 22, 2008, 6:02:05 AM10/22/08
to opero...@googlegroups.com


On Wed, Oct 22, 2008 at 12:01 PM, Pierre Lindenbaum <plindenb...@googlemail.com> wrote:
Hi,
first of all , I'm sorry: in a previous e-mail I said a structure was partially saved. I was really wrong (and drunk ? :-) ). Again it was my fault as I thought that sizeof(long)==8 but it is 4 on my machine. That explains now the many incorrect values I've found with the java server I wrote


> I know, but you have not explained why there is a difference in
> position for rs1850299.



I don't see any difference between ucsc and ncbi for this rs##


mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18 -e ' select name,chrom,chromStart from hg18.snp129 where name="rs1850299"'
+-----------+-------+------------+
| name      | chrom | chromStart |
+-----------+-------+------------+
| rs1850299 | chr3  |   96572171 |
+-----------+-------+------------+


and the ncbi said: http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=1850299

ncbi chromPosition = 96572172 (same value)


but as I said yesterday operon gives: http://regulon.cng.fr/cgi-bin/operon?app=operon&act=search&key=rs1850299&qrybox=Search

operon chromPosition = 96572156


mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18 -e ' select name,chrom,chromStart from hg18.snp128 where name="rs1850299"'
+-----------+-------+------------+
| name      | chrom | chromStart |
+-----------+-------+------------+
| rs1850299 | chr3  |   96572155 |
+-----------+-------+------------+

  mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18 -e ' select name,chrom,chromStart from hg18.snp126 where name="rs1850299"'
+-----------+-------+------------+
| name      | chrom | chromStart |
+-----------+-------+------------+
| rs1850299 | chr3  |   96572155 |
+-----------+-------+------------+

==> *** OK same as operon for the older dbsnp versions **: as I said yesterday I now understand that the values stored on my local are not from dbsnp129. That explains the discrepancies.



>> ok, so as far as I understand the local version at ceph is not based
>> on dbsnp129.
>> That should explain the discrepancies.
> No. Actually you missed what I wrote in my last email. Here it
> is again:

so, if I'm still missing something, let's discuss this on friday :-)

 I'll be away from my my this afternoon, I'll be at the Kremlin Bicetre Hospital with Florence's team.

A+
Pierre




Mario Foglio

unread,
Oct 22, 2008, 7:22:29 AM10/22/08
to opero...@googlegroups.com
You are right Pierre. The position for rs1850299 in dbSNP 129 and current UCSC are the same.
However, the importance in here is the fact that there are many SNPs in public databases that
have a high chance of being wrongly positioned like rs1850299. Operon detects these sort
of problems and annotates the potential error or inconsistency. From my last email, I
wanted to know from you if UCSC provides any sort of quality controls like Operon. For us,
it is very important to have them in the system in order to eliminate any risk of potential problems
that may arrive at the production stage. When collaborators select SNP markers for genotyping
on their own, we screen their entire data set systematically because we cannot afford this risk at
the scale we produce.

We will certainly discuss more about this on Friday.

Cheers,
Mario




Pierre Lindenbaum

unread,
Oct 23, 2008, 5:06:38 AM10/23/08
to opero...@googlegroups.com
> I
> wanted to know from you if UCSC provides any sort of quality controls like Operon.

Mario, just to answer this question, AFAIK, there is an "weight score" for each snp


mysql> select name,chrom,chromStart,weight   from hg18.snp129 where name in ("rs10000","rs10046606");
+------------+--------------+------------+--------+
| name       | chrom        | chromStart | weight |
+------------+--------------+------------+--------+
| rs10000    | chr7         |    5979678 |      2 |
| rs10000    | chr7         |    6757424 |      2 |
| rs10046606 | chr5         |   68982759 |      3 |
| rs10046606 | chr5         |   69236854 |      3 |
| rs10046606 | chr5         |   69544450 |      3 |
| rs10046606 | chr5         |   69562500 |      3 |
| rs10046606 | chr5         |   69837591 |      3 |
| rs10046606 | chr5         |   69858876 |      3 |
| rs10046606 | chr5         |   70111920 |      3 |
| rs10046606 | chr5         |   70544814 |      3 |
| rs10046606 | chr5_h2_hap1 |     637879 |      3 |
| rs10046606 | chr5_h2_hap1 |     658332 |      3 |
| rs10046606 | chr5_h2_hap1 |     913859 |      3 |
| rs10046606 | chr8_random  |     854163 |      3 |
+------------+--------------+------------+--------+

Pierre

Reply all
Reply to author
Forward
0 new messages