New GenMAPP Databases - missing IDs and relationships


Nathan Salomonis


Jun 21, 2007, 8:15:33 PM

to GenMAPP

(User Question to GenMAPP Help Desk)
Hi,
I have a question about the new mouse database that was released on
May 14, 2007. This database contains 23793 EntrezGene IDs for mouse.
The old database (20060628) contains 60682 EntrezGene IDs. What was
the strategy behind this reduction? The impact that this has on my
expression dataset is that approximately 5000 probes (with EntrezGene
ID) are no longer recognized in the database. It is hard to
understand how this will impact GO and pathway analysis.
Can you help?
Thanks
--------------------------------------------
(Response from GenMAPP Help Desk)
The new databases are the first set to be built entirely with the
Ensembl API. The older databases were augmented with gene
information and relationships from EntrezGene, Unigene and
Affymetrix. Those databases were difficult to maintain and to update
frequently because, unlike with the Ensembl API, there was no
automated method for extracting the data and building the databases.

We initially observed changes in Affymetrix probe set to EntrezGene
mapping similar to those you describe, and tried to look into them in
more detail. This drop in information is likely the result of 2 factors:
1) missing EntrezGene to Affymetrix probe set links derived from
the Affymetrix .csv annotation files
2) missing Ensembl to Affymetrix probe set links, where there was
no valid alignment of the probe set to the Ensembl gene of interest.

The second of these two factors does appear to produce some false
negative relationships (relationships excluded as a result of a flaw
in the way Ensembl matches these probe sets to Ensembl genes). We
have contacted Ensembl specifically about the issue with their mapping
algorithms and hope to hear back soon about a solution (see the e-mail
thread below). The mapping problem stems from the fact that Ensembl
largely ignores probes aligning to exon junctions in the gene
structure, and can also generate false sequence similarity scores,
because it evaluates exon-intron overlap for a probe even when that
probe overlaps an exon junction perfectly. We have identified
many valid cases where the probe set should be mapped to an Ensembl
gene and the Ensembl gene links to an EntrezGene ID in the database.
This issue will likely impact the GO relationships you obtain; we
therefore recommend using both an old and a new database and comparing
the results from MAPPFinder.
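
One way to gauge the effect on a particular dataset before comparing
MAPPFinder results is to list which of its IDs are recognized by the
old database but not by the new one. The sketch below is only an
illustration: it assumes the ID lists have already been exported to
plain Python collections, the function name is hypothetical, and the
IDs are made-up placeholders rather than real EntrezGene entries.

```python
def unrecognized_ids(dataset_ids, database_ids):
    """Return the dataset IDs that are absent from the database, sorted."""
    return sorted(set(dataset_ids) - set(database_ids))


# Placeholder IDs for illustration only (not real EntrezGene IDs).
dataset = ["1001", "1002", "1003", "1004"]
old_db = {"1001", "1002", "1003", "1004", "1005"}
new_db = {"1001", "1003"}

# IDs that the old database recognized but the new one does not.
missing_in_old = set(unrecognized_ids(dataset, old_db))
lost = [i for i in unrecognized_ids(dataset, new_db) if i not in missing_in_old]
print(lost)
```

The resulting list is the set of dataset entries whose GO and pathway
assignments can differ between the two database versions.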

Your feedback is helpful, in that it lets us gauge the potential
impact of this change in our database builds on our users. We
apologize for any inconvenience; these issues will hopefully soon be
resolved through Ensembl (BTW: you are welcome to contact Ensembl and
tell them that you are not happy with this issue). Two possible
solutions are for Affymetrix to make their databases
programmatically accessible or for Ensembl to improve its mapping
strategy. We expect one or both of these to happen in the near future
with our continued pressure.

-----------------------------------------------------------------
E-mail Thread between GenMAPP Help Staff and Ensembl Help Desk

(Response from GenMAPP Support)
Thank you for the detailed information. Upon receiving your email I
downloaded the probe sequences for the example illustrated below and
mapped them by hand back to the two Ensembl transcripts and
individual Ensembl exons. I see a few inherent issues with the
strategy you outlined, which also appear to introduce intron detection
errors.

In the case of the provided example, Msa.18713.0_g_at, all 20 probes
overlap the previous probe by one or two base pairs, so together they
form a contiguous sequence. This sequence, which is the same as
the Affymetrix target sequence, has a 100% match to both of the
assembled Ensembl transcript sequences and thus lies in a constitutive
region of the gene. Searching for each probe sequence in both
transcripts, I find that 9 out of the 20 probes can be found in a
single exon and that 11 overlap with an exon junction (the 3' exon ID
is different between the two transcripts, but the 3' splice junction
sequences are the same, with varying 3' exon sequence lengths).
None of the 11 junction-aligning probes has a 100% alignment to an
exon-intron boundary (see attached Excel file).
However, since two of the probes have a 2 bp overlap between junctions,
a 1 bp mismatch with an exon-intron boundary is possible. With
a purely exon-based approach this can be a very detrimental
limitation in obtaining reliable matches, since false negatives can
arise from small coincidental sequence matches with an exon-intron
boundary when only a few bp overlap.
Also, this method biases against probes that overlap exon
junctions and thus can eliminate many high-quality probe sets from
being annotated, even though such probe sets often lie in constitutive
regions of the gene that are common to all transcripts.

I would recommend the following as a solution:
-First check whether the probe set target sequence, or alternatively
every probe, has a 100% match in any transcript (independent of where
the probe aligns within the gene structure).

If you feel strongly about retaining the existing method, I would
suggest the following modification:
-do not count a probe as aligning to an intron if the intron overlap
is less than 4bp.
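
To make the two suggestions concrete, here is a minimal sketch. It is
not Ensembl's code: the function names are hypothetical, sequences are
treated as plain strings, and coordinates are assumed to be half-open
(start, end) positions on the same strand.

```python
def target_matches_any_transcript(target_seq, transcript_seqs):
    """First suggestion: the target sequence occurs exactly (100% match)
    in at least one transcript sequence."""
    return any(target_seq in transcript for transcript in transcript_seqs)


def intron_overlap_bp(probe_start, probe_end, intron_start, intron_end):
    """Number of bases shared by a probe and an intron (half-open ranges)."""
    return max(0, min(probe_end, intron_end) - max(probe_start, intron_start))


def counts_as_intronic(probe, intron, min_overlap_bp=4):
    """Second suggestion: only call a probe intronic when it overlaps the
    intron by at least min_overlap_bp bases (4 bp, as proposed above)."""
    return intron_overlap_bp(*probe, *intron) >= min_overlap_bp


# A 25-mer that runs 2 bp into an intron is no longer counted as intronic.
print(counts_as_intronic((100, 125), (123, 500)))
# A 25-mer that runs 10 bp into the intron still is.
print(counts_as_intronic((100, 125), (115, 500)))
```

The threshold is a parameter so that other cut-offs can be tried
against the mapper logs.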

If you would like to talk about this any further, I would be happy to
do so and even bring into the conversation our collaborators at
Affymetrix involved in chip design and annotation.

Best Regards,
GenMAPP Support Group

(Response from Ensembl Help Desk)
The apparent loss of probe sets is due to the way we generate
cross-references between the individual probe mappings to the
genome and how they overlap with a transcript. The rules for
generating affy probeset to transcript xrefs are as follows:

Mapping works by finding overlapping probesets and transcripts where
at least 50% of the probes in the set overlap the transcript's cDNA or
the 2000 bases downstream of it. A probe is considered to hit the
transcript if its 25 bases match exactly or with at most a 1 base
substitution.
Probes which cross exon boundaries are currently ignored.

A quick look at the mapper logs shows the following:

PROBESET          TRANSCRIPT          MAPPED_STATE  PROBE_SET_SIZE  NUM_EXON_HITS  NUM_INTRON_HITS  NUM_HITS_ON_REVERSE_STRAND  DESCRIPTION
Msa.18713.0_g_at  ENSMUST00000100740  0             20              9              1                0                           insufficient,intronic
Msa.18713.0_g_at  ENSMUST00000031423  0             20              9              1                0                           insufficient,intronic

So both of the Atp2a2 transcripts missed out narrowly because one
probe mapped as intronic (i.e. crossing an intron/exon boundary).
A probe set must have at least 50% of its probes in exons or within
2 kb downstream.
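
The rule and the log figures above can be checked with a small sketch.
This is an illustration under stated assumptions, not the Ensembl
mapper: the function names are hypothetical, and `target` stands for
the transcript's cDNA plus its downstream flank as a plain string.

```python
def probe_hits(probe, target, max_substitutions=1):
    """True if the probe matches some window of the target exactly or
    with at most max_substitutions single-base substitutions."""
    n = len(probe)
    for start in range(len(target) - n + 1):
        window = target[start:start + n]
        mismatches = sum(1 for a, b in zip(probe, window) if a != b)
        if mismatches <= max_substitutions:
            return True
    return False


def probeset_maps(num_hits, probe_set_size, min_fraction=0.5):
    """True if at least min_fraction of the probes in the set hit."""
    return num_hits / probe_set_size >= min_fraction


# Figures from the log: 9 exon hits out of 20 probes is 45%, below 50%.
print(probeset_maps(9, 20))
# One more counted hit would have reached the threshold.
print(probeset_maps(10, 20))
```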

Our oligo mapping pipeline is currently under review. Therefore, we'd
be interested in any comments about how people feel about this, as we
want to keep as many people happy as possible.

I hope this answers your question. Please don't hesitate to contact us
if you have other questions or problems.

With kind regards,
Bert Overduin, Ph.D.
(Ensembl Helpdesk)


wangke


Jul 10, 2007, 1:15:40 AM

to GenMAPP

Hi,
I have a similar question about the new R. norvegicus database. My
gene ID system is Unigene. When I import my own expression data with
the old gene database (Rn-Std_20060526.gdb), the import works and
produces my expression dataset. But when I choose the new gene
database (Rn-Std_20070514.gdb), the data can't be imported.
Can you help?
Thanks
Ke Wang

GenMAPP Support


Jul 10, 2007, 1:53:55 PM

to GenMAPP

Hello Ke,

The Unigene database is updated frequently, with many of the Unigene
IDs being "retired" during each update. So your problems are most
likely due to outdated Unigene IDs in your dataset. If you send your
data to gen...@gladstone.ucsf.edu we can look into it.

Kristina
