I was able to follow the tutorial for BridgeDb-Batchmapper, but all my
attempts to similarly translate from entrez id to ensembl failed.
The following is typical of my dismal failures:
java -jar dist/bridgedb-batchmapper.jar -i in.txt -c 3 -is L -o out.txt -os En -g Hs_Derby_20090509.pgdb
Missing : 22
Ambiguous : 0
Ok : 0
_______ +
Total : 22
If someone could provide some tips, that would be great!
TIA
Ken
Just a quick thing to try: what happens if you use "-os EnHs" instead of
"-os En"?
regards,
Martijn
java -jar dist/bridgedb-batchmapper.jar -i in.txt -c 3 -is L -o out.txt -os EnHs -g Hs_Derby_20090509.pgdb
Missing : 1
Ambiguous : 0
Ok : 21
_______ +
Total : 22
(The 'missing' line is the header row)
Thank you kindly. And congratulations on the latest publication.
Ken
On Jan 13, 5:21 pm, Kenneth Lai <lai_shiaw_tun_kenn...@lilly.com>
wrote:
Martijn
I have some questions which I hope we can discuss here:
The mapping from Entrez to Ensembl and batchmapper's results are
thus:
Entrez  Target   Correct mapping  Hs_Derby_20090509.pgdb  Hs_Derby_20090720.pgdb
8968    Ensembl  ENSG00000112727  ENSG00000182572         ENSG00000197153
8969    Ensembl  ENSG00000196787  ENSG00000198374         ENSG00000233224
Queries/Comments
1. It would be great if Batchmapper could mark in the result file
those entries that it thinks ambiguous, to serve as flag for us to
verify manually.
2. May I ask what are the possible sources of ambiguity?
Mapping from Entrez to RefSeq, batchmapper returns just a single
RefSeq id.
3. What is the criterion for returning the selected RefSeq id?
Entrez BatchMapper
2218 NM_001079802
4. Would it be possible to return all the mappings by repeating the
source id (something like this):
Entrez BatchMapper
2218 NG_008754
2218 NM_001079802
2218 NP_001073270
2218 NM_006731
2218 NP_006722
The webservice gives a more complete list, but omits the genomic
RefSeqGene id
http://webservice.bridgedb.org/Human/xrefs/L/2218?dataSource=Q
NP_006722 RefSeq
NP_001073270 RefSeq
NM_006731 RefSeq
NM_001079802 RefSeq
5. What is the reason for the absence of the RefSeqGene id?
Thanks!
Ken
On 1/18/10 12:47 AM, "Kenneth Lai" <lai_shiaw_...@lilly.com> wrote:
> Still getting to know the tools slowly, but results are interesting
> and impressive. e.g. batchmapper returned mappings for 90% of 17000
> inputs. That's cool!.
>
> I have some questions which I hope we can discuss here:
>
> The mapping from Entrez to Ensembl and batchmapper's results are
> thus:
>
> Entrez  Target   Correct mapping  Hs_Derby_20090509.pgdb  Hs_Derby_20090720.pgdb
> 8968    Ensembl  ENSG00000112727  ENSG00000182572         ENSG00000197153
> 8969    Ensembl  ENSG00000196787  ENSG00000198374         ENSG00000233224
>
> Queries/Comments
> 1. It would be great if Batchmapper could mark in the result file
> those entries that it thinks ambiguous, to serve as flag for us to
> verify manually.
> 2. May I ask what are the possible sources of ambiguity?
The source of the mappings is Ensembl. They perform genome assembly and
alignment to make their mapping assignments. These can change from release
to release as assemblies mature and repetitive regions and paralogs are
clarified. Your examples above reference histone cluster families that I
suspect share a lot of identical sequence.
> Mapping from Entrez to RefSeq, batchmapper returns just a single
> RefSeq id.
> 3. What is the criterion for returning the selected RefSeq id?
>
> Entrez BatchMapper
> 2218 NM_001079802
>
> 4. Would it be possible to return all the mappings by repeating the
> source id (something like this):
>
> Entrez BatchMapper
> 2218 NG_008754
> 2218 NM_001079802
> 2218 NP_001073270
> 2218 NM_006731
> 2218 NP_006722
>
> The webservice gives a more complete list, but omits the genomic
> RefSeqGene id
>
> http://webservice.bridgedb.org/Human/xrefs/L/2218?dataSource=Q
> NP_006722 RefSeq
> NP_001073270 RefSeq
> NM_006731 RefSeq
> NM_001079802 RefSeq
>
> 5. What is the reason for the absence of the RefSeqGene id?
The webservice is running on Hs_Derby_20090720, which does not contain NG
references. In general, we have tried to keep the "Derby" databases slim to
limit their size. For the webservice, we are now attaching to MySQL databases,
which can handle a bit more information, so we will likely add more
datasource types. As far as I can tell, the Hs_Derby_20090509 database does
not contain NG refs either, so I don't know how you got these from
BatchMapper.
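For anyone who wants to check this from a script, here is a minimal Python sketch that builds a URL of the same shape as the one Ken quoted and scans a response for NG_ (RefSeqGene) accessions. The function names `xrefs_url` and `parse_xrefs` are made up for illustration, and the tab-separated "identifier, data source" layout is an assumption based on the listing in this thread, not a statement about the webservice's documented format. The sample text is just the four rows shown above, so nothing is fetched over the network.

```python
from urllib.parse import quote

# Host taken from the URL quoted in this thread.
BASE_URL = "http://webservice.bridgedb.org"


def xrefs_url(organism, system_code, identifier, target_code=None):
    """Build an xrefs URL with the same shape as the one quoted in this thread."""
    url = "/".join(
        [BASE_URL, quote(organism), "xrefs", quote(system_code), quote(identifier)]
    )
    if target_code is not None:
        url += "?dataSource=" + quote(target_code)
    return url


def parse_xrefs(text):
    """Parse 'identifier<TAB>data source' lines into (identifier, source) pairs.

    Assumption: columns are tab-separated; lines without a tab are skipped.
    """
    pairs = []
    for line in text.splitlines():
        if "\t" not in line:
            continue
        identifier, source = line.split("\t", 1)
        pairs.append((identifier.strip(), source.strip()))
    return pairs


# The four rows listed in this thread, rewritten with explicit tabs.
sample = (
    "NP_006722\tRefSeq\n"
    "NP_001073270\tRefSeq\n"
    "NM_006731\tRefSeq\n"
    "NM_001079802\tRefSeq\n"
)

pairs = parse_xrefs(sample)
# True only if some returned accession is a genomic RefSeqGene (NG_) id.
has_refseqgene = any(identifier.startswith("NG_") for identifier, _ in pairs)

print(xrefs_url("Human", "L", "2218", "Q"))
print(len(pairs), has_refseqgene)
```

To run it against the live service, the text passed to `parse_xrefs` would come from fetching the printed URL, for example with `urllib.request.urlopen`.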
Kenneth Lai wrote:
> Still getting to know the tools slowly, but results are interesting
> and impressive. e.g. batchmapper returned mappings for 90% of 17000
> inputs. That's cool!.
>
> I have some questions which I hope we can discuss here:
>
> The mapping from Entrez to Ensembl and batchmapper's results are
> thus:
>
> Entrez  Target   Correct mapping  Hs_Derby_20090509.pgdb  Hs_Derby_20090720.pgdb
> 8968    Ensembl  ENSG00000112727  ENSG00000182572         ENSG00000197153
> 8969    Ensembl  ENSG00000196787  ENSG00000198374         ENSG00000233224
>
> Queries/Comments
> 1. It would be great if Batchmapper could mark in the result file
> those entries that it thinks ambiguous, to serve as flag for us to
> verify manually.
>
I agree this would be useful. I've entered your feature request in our
tracker: http://bridgedb.org/ticket/41
I'll do this when I have some spare time, but further development of the
batchmapper tool doesn't have a high priority for me at the moment. Do
you or any of your colleagues know Java programming? Some assistance
would certainly help to speed up things. Batchmapper is open source and
a collaborative project, and we welcome new contributors.
> 2. May I ask what are the possible sources of ambiguity?
>
In addition to what Alex said: Batchmapper counts something as ambiguous
if there are multiple possible mappings. As you've noted, if there are
multiple possible mappings, only one is returned.
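To make the counting rule concrete, here is a rough Python sketch of the logic as described in this thread. It is not Batchmapper's actual code (Batchmapper is written in Java), and it assumes the three counts are mutually exclusive. The names `classify`, `pick_one` and `summarize` are hypothetical, and the lookup table is toy data: the two RefSeq ids for 2218 come from this thread, while `hypothetical_id` and `hypothetical_target` are placeholders.

```python
from collections import Counter


def classify(targets):
    """Label one input id by how many target ids it maps to."""
    if not targets:
        return "Missing"
    if len(targets) > 1:
        return "Ambiguous"
    return "Ok"


def pick_one(targets):
    """Return whichever target comes first, or None; the choice is arbitrary."""
    return next(iter(targets), None)


def summarize(lookup, ids):
    """Count Missing / Ambiguous / Ok over a column of input ids."""
    counts = Counter(classify(lookup.get(source_id, ())) for source_id in ids)
    return {label: counts.get(label, 0) for label in ("Missing", "Ambiguous", "Ok")}


# Toy lookup table (assumption): only the 2218 entries are taken from this thread.
lookup = {
    "2218": {"NM_001079802", "NM_006731"},
    "hypothetical_id": {"hypothetical_target"},
}

# A header row such as "Entrez" has no mapping, so it is counted as Missing.
ids = ["Entrez", "2218", "hypothetical_id"]

summary = summarize(lookup, ids)
for label, count in summary.items():
    print(f"{label} : {count}")
```

Since `pick_one` takes whatever the set yields first, the id it returns for 2218 may differ between runs, which is the "first result it can find" behaviour in miniature.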
> Mapping from Entrez to RefSeq, batchmapper returns just a single
> RefSeq id.
> 3. What is the criterion for returning the selected RefSeq id?
>
> Entrez BatchMapper
> 2218 NM_001079802
>
>
Batchmapper returns the first result it can find, meaning that the
criterion is essentially "random". I realize this is not ideal, but that
is the way it's implemented right now.
> 4. Would it be possible to return all the mappings by repeating the
> source id (something like this):
>
> Entrez BatchMapper
> 2218 NG_008754
> 2218 NM_001079802
> 2218 NP_001073270
> 2218 NM_006731
> 2218 NP_006722
>
Not currently, but that would certainly be a useful feature to add. I've
entered a feature request for this as well:
http://bridgedb.org/ticket/42, but see also my answer to 1.
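As an illustration of what the requested output could look like, here is a small Python sketch that repeats the source id once per mapping. It is only a sketch of the idea, not an implementation of the ticket: `expand_rows` is a hypothetical name, and the lookup table is hand-built from the ids Ken listed for 2218. Sorting the targets is my own assumption, added so the output order is reproducible rather than arbitrary.

```python
def expand_rows(lookup, ids):
    """Return (source id, target id) rows, one row per mapping.

    Ids with no mapping are kept with an empty target so that no input
    row silently disappears from the output.
    """
    rows = []
    for source_id in ids:
        targets = sorted(lookup.get(source_id, ()))
        if not targets:
            rows.append((source_id, ""))
            continue
        for target in targets:
            rows.append((source_id, target))
    return rows


# Hand-built table using the ids listed for Entrez 2218 in this thread.
lookup = {
    "2218": {
        "NG_008754",
        "NM_001079802",
        "NP_001073270",
        "NM_006731",
        "NP_006722",
    },
}

rows = expand_rows(lookup, ["2218"])
for source_id, target in rows:
    print(f"{source_id}\t{target}")
```

A side effect of this layout is that ambiguous entries flag themselves: any source id that appears on more than one row has multiple mappings.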
> The webservice gives a more complete list, but omits the genomic
> RefSeqGene id
>
> http://webservice.bridgedb.org/Human/xrefs/L/2218?dataSource=Q
> NP_006722 RefSeq
> NP_001073270 RefSeq
> NM_006731 RefSeq
> NM_001079802 RefSeq
>
> 5. What is the reason for the absence of the RefSeqGene id?
>
> Thanks!
> Ken
>
>
regards,
Martijn
Haha, I welcome chances to improve my Java. When it comes to it, we
are happy to contribute back to open source projects we use (I must
apply for routine corporate clearance though).
The external collection, curation and maintenance of mappings is what
we are seeking to leverage; for this we are comparing the accuracy and
coverage of shortlisted projects.
Now that you mention it, can we download your MySQL databases, which are
more complete?
This would also provide a path to query across organisms and
historical versions; some of our users are keen on this.
Ken
We are indeed planning to make MySQL dumps of our databases available in
the future.
Alex, do you think we can implement that for the next Ensembl release?
Martijn