sequence retrieval discrepancy between TableBrowser and DAS server

Andrea Bagnacani

Aug 21, 2017, 12:49:05 PM

to gen...@soe.ucsc.edu

Dear staff at UCSC,

I have a question regarding a discrepancy in the results I obtain while
retrieving a gene's 3'UTR sequence from the TableBrowser vs. the same
gene sequence from the DAS.

From the TableBrowser, I select:
- Human genome assembly hg19
- the "refGene" table and "RefSeq" track
- upload the list of genes with the ID "NM_005504"
- select sequence as output format
By clicking "get output", the TableBrowser asks me to select the
sequence type for RefSeq genes, I specify "genomic", and click "Submit".
In the next page I select only:
- 3'UTR Exons
- one FASTA record per gene
- all lower case
By clicking on "get sequence", I retrieve the result, and the FASTA
header looks like this:
>hg19_refGene_NM_005504 range=chr12:24962958-24970941 5'pad=0 3'pad=0
strand=- repeatMasking=none
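
The header above records the strand next to the range, which is worth keeping when the same region is later requested from another service. A minimal Python sketch for extracting those fields, assuming every header follows the layout of this single example (the `parse_header` name and the regular expression are illustrative and not part of any UCSC tool):

```python
import re

# Pattern modelled on the one Table Browser header shown in this thread;
# other output options may produce headers this pattern does not cover.
HEADER_RE = re.compile(
    r"^>(?P<name>\S+)\s+range=(?P<chrom>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)"
    r".*?\bstrand=(?P<strand>[+-])"
)

def parse_header(header: str) -> dict:
    """Return name, chromosome, 1-start coordinates and strand from a header line."""
    m = HEADER_RE.search(header)
    if m is None:
        raise ValueError(f"unrecognised header: {header!r}")
    return {
        "name": m["name"],
        "chrom": m["chrom"],
        "start": int(m["start"]),
        "end": int(m["end"]),
        "strand": m["strand"],
    }

# The header from this message, joined onto one line.
header = (">hg19_refGene_NM_005504 range=chr12:24962958-24970941 "
          "5'pad=0 3'pad=0 strand=- repeatMasking=none")
info = parse_header(header)
print(info["chrom"], info["start"], info["end"], info["strand"])
```

A `strand` value of `-` indicates that the FASTA record is reported relative to the negative strand.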

I want to automate this process, so I tried using the DAS server (whose
indexes start from 1 instead of 0) with the previously obtained genome
region. So I made the request:
http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr12:24962957,24970940

...but the two sequences are different. What am I missing?

Thank you for your time.
Kind regards
Andrea

--
Andrea Bagnacani

Dept. of Systems Biology & Bioinformatics
University of Rostock
18051 Rostock, Germany

www.sbi.uni-rostock.de

+49 (0) 381 498-7684

Galt Barber

Aug 21, 2017, 1:58:55 PM

to Andrea Bagnacani, gen...@soe.ucsc.edu

Is he getting the reverse complement of his expected strand?

NM_005504 seems to be on the negative strand.

-Galt

--

---
You received this message because you are subscribed to the Google Groups "UCSC Genome Browser Public Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to genome+un...@soe.ucsc.edu.
To post to this group, send email to gen...@soe.ucsc.edu.
Visit this group at https://groups.google.com/a/soe.ucsc.edu/group/genome/.
To view this discussion on the web visit https://groups.google.com/a/soe.ucsc.edu/d/msgid/genome/93de9e5b-d5a6-efd9-fe7d-e72ad007f71c%40uni-rostock.de.
For more options, visit https://groups.google.com/a/soe.ucsc.edu/d/optout.

Galt Barber

Aug 21, 2017, 1:59:47 PM

to Andrea Bagnacani, gen...@soe.ucsc.edu

I do not know much at all about the das interface,

so I do not know how one specifies the strand to it.


Cath Tyner

Aug 21, 2017, 2:38:41 PM

to Andrea Bagnacani, gen...@soe.ucsc.edu

Hi Andrea,

Thank you for contacting the UCSC Genome Browser support team! Can you reply to the forum (or, if you prefer, directly to me) and let me know how you are building the DAS url?

I have reconstructed your Table Browser query, and it sounds like you want to get the same results from the DAS query for automation purposes.

One thought I have is that there may be a mistake in the 0-start/1-start conversion.

If you instead use the following DAS url:

http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr12:24962958,24970941

...and then copy the sequence and BLAT it to the hg19 genome, you will find that it aligns perfectly with the Table Browser results.

Here is a session which shows your Table Browser query, along with the blat-aligned sequence.

In the Table Browser output, you noted these results in the fasta header: chr12:24962958-24970941

Because of the colon and dash notation, you can see that these results are in the "position" format, which is 1-start.

You can read all about 1-start vs 0-start coordinates in this blog post:

http://genome.ucsc.edu/blog/the-ucsc-genome-browser-coordinate-counting-systems/
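
The conversion between the two conventions can be sketched in a few lines of Python. This is only an illustration of the arithmetic, assuming 0-start coordinates are half-open (end not included) and 1-start "position" coordinates are fully closed; the function names are made up for this example:

```python
def to_position(chrom: str, start0: int, end0: int) -> str:
    """0-start, half-open [start0, end0) -> 1-start, fully-closed 'chrom:start-end'."""
    # Only the start shifts by one; the end keeps the same number.
    return f"{chrom}:{start0 + 1}-{end0}"

def from_position(position: str) -> tuple:
    """1-start, fully-closed 'chrom:start-end' -> (chrom, start0, end0)."""
    chrom, span = position.split(":")
    start1, end1 = (int(x) for x in span.replace(",", "").split("-"))
    return chrom, start1 - 1, end1

# The range from the FASTA header discussed above.
chrom, start0, end0 = from_position("chr12:24962958-24970941")
print(chrom, start0, end0)                 # 0-start, half-open form
print(end0 - start0)                       # number of bases in the interval
print(to_position(chrom, start0, end0))    # back to the position form
```

Under these assumptions the interval length is `end0 - start0` in the 0-start form and `end - start + 1` in the 1-start form, and both give the same number of bases.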

If this does not solve the issue, please respond to this list if you have further questions.

Thank you for contacting the UCSC Genome Browser support team.

Please send new and follow-up questions to one of our UCSC Genome Browser mailing lists below:

* Post to the Public Help Forum: Email gen...@soe.ucsc.edu or search the Public Archives
* Post to the Mirror Help Forum: Email genome...@soe.ucsc.edu or search the Mirror Archives
* Confidential/private help: Email genom...@soe.ucsc.edu

UCSC Genome Browser Announcements List (email alerts for new data & software):

* Subscribe: Email genome-announce+subscribe...@soe.ucsc.edu
* Unsubscribe: Email genome-announce+unsubscri...@soe.ucsc.edu


Enjoy,

Cath

Cath Tyner

UCSC Genome Browser, Software QA & User Support

UC Santa Cruz Genomics Institute

UCSC Genome Browser


Andrea Bagnacani

Aug 22, 2017, 1:14:03 PM

to Cath Tyner, gen...@soe.ucsc.edu

Dear Cath, dear Galt,
thank you for your answers. I'll reply to each point you highlighted:

On 08/21/2017 08:38 PM, Cath Tyner wrote:
> let me know how you are building the DAS url?

I connected to UCSC's public MySQL server with the command:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19

and built the DAS URL from the data I obtain by executing the MySQL query:
select g.chrom, g.txStart, g.cdsStart from refGene g, knownToRefSeq r
where g.name = 'NM_005504' AND r.value = g.name;

which returns:
+-------+----------+----------+
| chrom | txStart | cdsStart |
+-------+----------+----------+
| chr12 | 24962957 | 24970941 |
+-------+----------+----------+

I then composed the DAS URL by assembling a string putting together:
- "http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment="
- "chr12" (from the "chrom" field)
- ":"
- "24962957" +1 (from the txStart field)
- ","
- "24970941" (from the cdsStart field)
And obtain exactly the same DAS URL you wrote:

> http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr12:24962958,24970941

Which I would indeed use to automate the retrieval of the sequence, and
avoid using the TableBrowser's web interface.
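
The string assembly above can be written as a small helper. The sketch below is one possible way to do it and rests on assumptions: that txStart/txEnd/cdsStart/cdsEnd are 0-start, half-open genomic coordinates with cdsStart <= cdsEnd regardless of strand, and that the span between the coding region and the 3' transcript end contains no introns (otherwise it would not match a "3'UTR Exons" export). The function names are hypothetical. The query above does not select `g.strand`; adding it would tell the script which end of the transcript to use.

```python
DAS_DNA = "http://genome.ucsc.edu/cgi-bin/das/hg19/dna"  # endpoint used in this thread

def utr3_span(strand, tx_start, tx_end, cds_start, cds_end):
    """Genomic 0-start, half-open span between the CDS and the 3' transcript end.

    Assumption: introns inside that span, if any, are NOT removed here.
    """
    if strand == "+":
        return cds_end, tx_end      # 3' end is at the high-coordinate side
    if strand == "-":
        return tx_start, cds_start  # 3' end is at the low-coordinate side
    raise ValueError(f"unexpected strand: {strand!r}")

def das_url(chrom, start0, end0):
    """Build a DAS dna request; DAS segments are 1-start, so only the start shifts."""
    return f"{DAS_DNA}?segment={chrom}:{start0 + 1},{end0}"

# txStart and cdsStart come from the refGene query above; txEnd and cdsEnd are
# the values of the knownGene row quoted further down in this message, used
# here only for illustration (they are not needed on the negative strand).
start0, end0 = utr3_span("-", 24962957, 25102393, 24970941, 25101866)
print(das_url("chr12", start0, end0))
```

The sequence returned for this URL is the forward-strand genomic sequence; the request itself carries no strand information.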

However, although the sequences obtained from the TableBrowser (the way
I described in my previous mail) and the DAS server are of the same
length, they are different in their content:

(length test)
wc -c sequence_from_das
7985

wc -c sequence_from_tablebrowser
7985

(content test)
cat sequence_from_das | cut -c1-50
aatcttccaactcaatttatttgtttcagctcacgaacttttcagaatcc

cat sequence_from_tablebrowser | cut -c1-50
atggaaaatagaggatacaatggaaaatagaggataccaactgtatgcta

> If you instead use the following DAS url:
> http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr12:24962958,24970941
> ...and then if you copy the sequence and blat
> <http://genome.ucsc.edu/cgi-bin/hgBlat> it to the hg19 genome, you will
> find that it aligns perfectly with the Table Browser results.

Thank you, but using BLAT would mean using a web interface, and since I
have to automate the retrieval process, I can't walk this path. Sorry :/

I also tried to automate the retrieval of the sequence without DAS,
reading it directly from UCSC's "knownGeneMrna" table. Here I connect
to UCSC's MySQL interface with the usual:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19

followed by this MySQL query:
select g.name, g.chrom, g.strand, g.txStart, g.txEnd, g.cdsStart,
g.cdsEnd, m.seq from knownGene g, knownToRefSeq r, knownGeneMrna m where
r.value = 'NM_005504' AND r.name = g.name AND r.name = m.name;

which returns (I cut the sequence for brevity) :
+------------+-------+--------+----------+----------+----------+----------+-----------------+
| name       | chrom | strand | txStart  | txEnd    | cdsStart | cdsEnd   | seq             |
+------------+-------+--------+----------+----------+----------+----------+-----------------+
| uc001rgd.4 | chr12 | -      | 24962957 | 25102393 | 24970941 | 25101866 | agtagggaggtg... |
+------------+-------+--------+----------+----------+----------+----------+-----------------+

Here the retrieved sequence is 9683 nt long, which I suspect is
because I can't directly (at MySQL query time) ask for the
3'UTR only, but...
this time, if I grep the sequence retrieved from the TableBrowser's web
interface within the sequence I just obtained from the "knownGeneMrna"
table, I find it! :)
Or better, the sequence I retrieve from the TableBrowser's web interface
(shorter sequence) is contained in the sequence I retrieve from the
"knownGeneMrna" table (longer sequence) :

- longer sequence starts, and after 1689 nucleotides
- shorter sequence starts
- shorter and longer sequences are 100% identical
- shorter sequence ends
- longer sequence continues with 10 "a" nucleotides, then ends
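
The layout described in this list can be checked programmatically. One reading of it, offered as an assumption rather than something confirmed in the thread: a spliced mRNA runs 5'UTR, coding sequence, 3'UTR, so the 1689 leading nucleotides would be the 5'UTR plus the coding sequence, and the trailing run of "a" may be a poly-A tail that is not part of the genomic range. A minimal sketch with a hypothetical `locate` helper and toy sequences:

```python
def locate(short: str, long: str):
    """Return (leading, trailing) lengths around the first exact occurrence
    of `short` inside `long`, or None if `short` does not occur."""
    offset = long.find(short)
    if offset < 0:
        return None
    return offset, len(long) - offset - len(short)

# Toy data only: 2 leading bases, a 6-base match, 3 trailing bases.
print(locate("acgtac", "gg" + "acgtac" + "aaa"))

# Consistency check on the numbers reported in this message, taking the
# 3'UTR length from the FASTA header range (24970941 - 24962958 + 1 = 7984).
print(1689 + 7984 + 10)
```

The sum matches the 9683 nt reported for the "knownGeneMrna" sequence, which is consistent with the interpretation above but does not prove it.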

To wrap up:
- I retrieve a sequence from the TableBrowser
- I want to automate the process without using any web interface, so I
try to use DAS, building a DAS URL with the coordinates I obtain
via MySQL query
- the sequences obtained from the DAS server and TableBrowser are equal
in size but different in content although the position is correct
- I try then to obtain the sequence from the "knownGeneMrna" table, and
the sequence I get back _contains_ the sequence I originally got from
the TableBrowser
- this DAS vs. TableBrowser sequence discrepancy is puzzling
- this result I get from the knownGeneMrna table contains the one I get
from the TableBrowser, but I don't get how come the two overlap after
1689 nucleotides

Since I can't use any web interface to automate the process, do you
think I should rely on the DAS server or the "knownGeneMrna" table?
If so, what would be the best way to get consistent results?
Am I missing something somewhere?

Thank you for your time.

Best regards

Christopher Lee

Aug 22, 2017, 2:31:33 PM

to Andrea Bagnacani, Cath Tyner, gen...@soe.ucsc.edu

Hi Andrea,

Thank you for further explaining your steps. The issue here is that the transcript you are interested in lies on the negative strand, and so the Table Browser returns a reverse complemented sequence compared to what you are seeing from DAS, which is a simple genomic query. If you reverse complement one of the sequences you will find that they are the same.

Please note that you can use the utility faRc from our directory of utilities to reverse complement the sequence:
http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/
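
For a script that should not depend on an external binary, the same operation can be sketched in Python. This is not the faRc implementation, just a minimal stand-in that assumes the sequence contains only a, c, g, t and n (either case) and that any line breaks have been removed beforehand:

```python
# Translation table mapping each base to its complement, preserving case.
_COMPLEMENT = str.maketrans("acgtnACGTN", "tgcanTGCAN")

def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]

def same_region(forward_genomic: str, strand_aware: str, strand: str) -> bool:
    """Compare a forward-strand sequence (e.g. from DAS) with a strand-aware
    one (e.g. from the Table Browser) for a transcript on `strand`."""
    if strand == "-":
        forward_genomic = reverse_complement(forward_genomic)
    return forward_genomic == strand_aware

# Toy sequences only, not taken from the genome.
print(reverse_complement("aacg"))
print(same_region("aacg", "cgtt", "-"))
```

For a negative-strand transcript, `same_region` reverse complements the forward-strand sequence before comparing, which is the step described in the reply above.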

Thank you again for your inquiry and using the UCSC Genome Browser. If
you have any further questions, please reply to gen...@soe.ucsc.edu.
All messages sent to that address are archived on a
publicly-accessible forum. If your question includes sensitive data,
you may send it instead to genom...@soe.ucsc.edu.

Christopher Lee
UCSC Genomics Institute


Andrea Bagnacani

Aug 23, 2017, 11:23:47 AM

to Christopher Lee, Cath Tyner, gen...@soe.ucsc.edu

Dear Christopher, dear all,

thank you for your answer.
I tried what you suggested, and obtained perfectly matching sequences
from the two services once I reverse complemented for the negative strand.
Sorry I didn't pay attention to this detail.
I saw you added your answer on Biostars as well
(https://www.biostars.org/p/267588/). Thanks for that. I marked your
answer as accepted.

Best regards
Andrea
