RE: [Genome] Protein coding genes list without isoforms

Hani Choudhry

Sep 10, 2012, 1:43:14 PM

to st...@soe.ucsc.edu, gen...@soe.ucsc.edu

Dear UCSC genome,
Thanks for your reply on how to get a canonical protein coding gene list.
However, I realised that the output file contains long non-coding RNAs,
tRNAs and miRNAs. How can I filter these out so that only protein-coding
genes remain?
Thanks for help and look forward to your reply.
Regards,
Hani

-----Original Message-----
From: Steve Heitner [mailto:st...@soe.ucsc.edu]
Sent: 31 July 2012 16:55
To: ha...@well.ox.ac.uk; gen...@soe.ucsc.edu
Subject: RE: [Genome] Protein coding genes list without isoforms

Hello, Hani.

Based on what you've described, it sounds like the UCSC Genes knownCanonical
and knownIsoforms tables contain the information you're looking for. To get
the canonical gene list, perform the following steps:

1. Navigate to http://genome.ucsc.edu/cgi-bin/hgTables

2. Select the following options:
Clade: Mammal
Genome: Human
Assembly: Feb. 2009 (GRCh37/hg19)
Group: Genes and Gene Prediction Tracks
Track: UCSC Genes
Table: knownCanonical
Region: Select "genome" for the entire genome or specify a position in the
"position" box. You can also specify multiple loci by clicking the "define
regions" button.
Output format: Select "all fields from selected table" to list every field
from the table in your output. Select "selected fields from primary and
related tables" to specify which fields should be included in your output.

3. Click the "get output" button

The knownIsoforms table combines all isoforms of a single gene into
"clusters" and lists each cluster along with the isoforms that belong to
that cluster. Note that when you select this table in the Table Browser, it
displays the contents of the entire genome by default. You do not have the
option of specifying individual genomic regions. To display the contents of
this table, simply change the table to knownIsoforms in step 2 above.
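
Once the knownIsoforms output is saved, the clusters can be rebuilt locally. The following is a minimal Python sketch, assuming the output is tab-separated with a cluster ID in the first column and a transcript ID in the second, and an optional header line starting with "#". The function name and the sample rows are made up for illustration and are not real table contents.

```python
import csv
import io
from collections import defaultdict


def load_clusters(handle):
    """Map each cluster ID to the list of transcript IDs in that cluster."""
    clusters = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and a header line such as "#clusterId<TAB>transcript".
        if not row or row[0].startswith("#"):
            continue
        cluster_id, transcript = row[0], row[1]
        clusters[cluster_id].append(transcript)
    return dict(clusters)


# Made-up rows standing in for a saved Table Browser output file.
example = io.StringIO(
    "#clusterId\ttranscript\n"
    "1\tuc_example_a.1\n"
    "1\tuc_example_b.1\n"
    "2\tuc_example_c.1\n"
)
clusters = load_clusters(example)
```

With a real file, open it with open(path, newline="") and pass the handle in place of the in-memory example.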

Please contact us again at gen...@soe.ucsc.edu if you have any further
questions.

---
Steve Heitner
UCSC Genome Bioinformatics Group

-----Original Message-----
From: genome-...@soe.ucsc.edu [mailto:genome-...@soe.ucsc.edu] On
Behalf Of ha...@well.ox.ac.uk
Sent: Tuesday, July 31, 2012 2:01 AM
To: gen...@soe.ucsc.edu
Subject: [Genome] Protein coding genes list without isoforms

Dear UCSC Genome Browser,

I would like to get a list of all RefSeq protein-coding genes (name, strand,
chromosomal location, and sequences). I have tried to get them from the Table
Browser, but the output contains multiple isoforms per gene. Is there an
option to remove the isoforms and keep one canonical transcript per gene?
Also, do you provide a meta-gene list of protein-coding genes, in which all
isoforms of a gene are combined into a single meta-gene?

Thanks for your attention and help. Look forward to your reply soon.

Regards,
Hani Choudhry

_______________________________________________
Genome maillist - Gen...@soe.ucsc.edu
https://lists.soe.ucsc.edu/mailman/listinfo/genome

Steve Heitner

Sep 19, 2012, 3:11:17 PM

to Hani Choudhry, gen...@soe.ucsc.edu

Hello, Hani.

You can further filter the results by following the previous instructions
and also creating a filter to eliminate certain results based on the
contents of the kgXref table. For example, to eliminate any results where
the kgXref.description field contains the text "non-coding", perform the
following steps to create a filter:

1. On the "filter" line of the main Table Browser screen, click the "create"
button

2. In the "hg19.kgXref based filters" section, enter the following in the
"Free-form query" box: description not like "%non-coding%"

3. Click the "submit" button

Note that if there are still results you want to filter out, you can specify
multiple terms in your free-form query. For example, to exclude results where
kgXref.description contains "non-coding" as well as results where
kgXref.description contains "piRNA", you could enter the following free-form
query: description not like "%non-coding%" and description not like "%piRNA%"

You can specify an unlimited number of filter terms in this manner. Note
that you are also not limited to just kgXref.description. If you want to
filter results based on the contents of other fields in kgXref, just be sure
to specify the appropriate field name in your free-form query.
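
The same exclusion can also be applied to a file that has already been downloaded. Below is a minimal Python sketch that mimics the "not like" terms above with substring checks. The case-insensitive matching is an assumption, and the function name and sample rows are hypothetical, not real kgXref entries.

```python
def keep_row(description, excluded_terms):
    """Return True when the description contains none of the excluded terms."""
    lowered = description.lower()
    # Assumption: match case-insensitively, like a typical SQL "like" comparison.
    return not any(term.lower() in lowered for term in excluded_terms)


# Made-up (symbol, description) pairs for illustration only.
rows = [
    ("symA", "hypothetical protein-coding gene A"),
    ("symB", "hypothetical non-coding RNA B"),
    ("symC", "hypothetical piRNA C"),
]
excluded = ["non-coding", "piRNA"]
kept = [symbol for symbol, description in rows if keep_row(description, excluded)]
```

Adding another term to the excluded list corresponds to adding another "and description not like ..." clause in the free-form query.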

ha...@well.ox.ac.uk

Jul 31, 2012, 5:00:49 AM

to gen...@soe.ucsc.edu

Steve Heitner

Jul 31, 2012, 11:55:21 AM

to ha...@well.ox.ac.uk, gen...@soe.ucsc.edu

Hani Choudhry

Jul 31, 2012, 12:39:53 PM

to gen...@soe.ucsc.edu

-----Original Message-----
From: Hani Choudhry [mailto:ha...@well.ox.ac.uk]

Sent: 31 July 2012 17:39
To: 'st...@soe.ucsc.edu'
Subject: RE: [Genome] Protein coding genes list without isoforms

Hi Steve,
Thanks for your reply. I followed the steps you suggested, but I am still
getting isoforms in the final gene list (attached). Is there any way to filter
out the isoforms and get a list of unique protein-coding genes only?
Regards,
Hani Choudhry

-----Original Message-----
From: Steve Heitner [mailto:st...@soe.ucsc.edu]
Sent: 31 July 2012 16:55
To: ha...@well.ox.ac.uk; gen...@soe.ucsc.edu

Steve Heitner

Jul 31, 2012, 2:32:07 PM

to Hani Choudhry, gen...@soe.ucsc.edu

Hello, Hani.

In your text file, items 2 and 3 have the gene symbol WASH7P. Let's use
WASH7P as an example. If you view these items in the UCSC Genes track in
the Browser, you will see that there are multiple transcript variants of
WASH7P, some that span chr1:14000-19000 and some that span chr1:14000-29000.
As a result, there are multiple entries in our canonical set with the same
gene symbol. The previously-answered mailing list question at
https://lists.soe.ucsc.edu/pipermail/genome/2005-July/008123.html describes
the process of selecting transcripts as members of our canonical set. As
you discovered, it isn't a perfect system and it's something we're currently
working on revising.

There is no simple way to filter out the isoforms as you suggest. It would
probably be easiest to devise a post-output method of scanning and removing
items with duplicate gene symbols.
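
One possible post-output approach is sketched below in Python. It assumes the output was saved as tab-separated text with a header row and that one column holds the gene symbol. The column name "geneSymbol", the function name, and the sample rows are assumptions for illustration. Keeping the first row seen for each symbol is an arbitrary choice, not a UCSC recommendation.

```python
import csv
import io


def unique_by_symbol(rows, symbol_field="geneSymbol"):
    """Keep the first row seen for each gene symbol; drop later duplicates."""
    seen = set()
    unique = []
    for row in rows:
        symbol = row[symbol_field]
        if symbol in seen:
            continue
        seen.add(symbol)
        unique.append(row)
    return unique


# Made-up rows standing in for a saved Table Browser output file.
example = io.StringIO(
    "chrom\tchromStart\tchromEnd\tgeneSymbol\n"
    "chr1\t100\t200\tGENE_A\n"
    "chr1\t100\t300\tGENE_A\n"
    "chr1\t500\t900\tGENE_B\n"
)
rows = list(csv.DictReader(example, delimiter="\t"))
deduped = unique_by_symbol(rows)
```

If the header in a real output file uses a different column name, pass it as symbol_field. A different rule, such as keeping the longest item per symbol, would need a small change to the loop.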

Bob Harris

Jul 31, 2012, 2:41:28 PM

to gen...@soe.ucsc.edu

Howdy, Tom,

I'm the author of lastz. I hadn't been on that mailing list, but a
colleague forwarded your message to me. I've now joined the list.

(a) I cannot find a way to compare multiple fasta with multiple fasta.

I'm not entirely sure what you are trying to do with multiple fasta.
One guess is you want to align a group of aligned sequences with
another group of aligned sequences.

Lastz is a pairwise aligner, single sequence to single sequence. It
does have some support for comparing profile sequences (called
'quantum DNA' in the documentation). This was intended to be useful
for aligning alignments to alignments, but it has limited practical
use as implemented. Our lab also produced multiz and tba, two
aligners that build multiple alignments from pairwise alignments.
Those should be available here:
www.bx.psu.edu/miller_lab

Another guess is that you have a bunch of sequences in one fasta file,
and you want to align them all vs each other. The simplest way to do
this is to just list that file twice on the command line, as both the
target and as the query, e.g. "lastz file.mfa[multiple] file.mfa".
For compatibility with blastz, though, lastz does not accept a file
with more than one sequence as the target (first file) unless you
attach "[multiple]" to the end of the file name. The other issue is,
if all those sequences combined are more than 2 billion bases, you
will need to build a version of lastz that supports that. I can
describe how to do that if that's what you need.
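
As a rough illustration of the self-alignment case, here is a minimal Python sketch that counts the bases in a multi-sequence fasta file and builds the command line quoted above. The function names are hypothetical, the sample sequences are made up, and the threshold is simply the "2 billion bases" figure from this message, not an exact limit taken from the lastz source.

```python
import io

# Assumption: the "2 billion bases" figure mentioned above, used as a rough threshold.
BASE_LIMIT = 2_000_000_000


def count_bases(handle):
    """Count sequence characters in a fasta stream, ignoring header lines."""
    total = 0
    for line in handle:
        line = line.strip()
        if line and not line.startswith(">"):
            total += len(line)
    return total


def self_alignment_command(fasta_path):
    """Build the argument list for aligning a multi-sequence file to itself."""
    # "[multiple]" is attached to the target (first file) only.
    return ["lastz", fasta_path + "[multiple]", fasta_path]


# Made-up sequences standing in for a real multi-fasta file.
example = io.StringIO(">seq1\nACGT\nAC\n>seq2\nGGG\n")
total = count_bases(example)
needs_large_build = total > BASE_LIMIT
command = self_alignment_command("file.mfa")
```

Passing the list to subprocess.run(command) without a shell avoids having the square brackets interpreted as a glob pattern.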

(b) Is there any way to [compare one chromosome and other
chromosomes] in a quick fashion?

To get quick answers about alignability, I usually use lastz's
ungapped alignments and restrict seeding. This can usually give you a
good picture of what aligns to what. This is described here:
www.bx.psu.edu/~rsharris/lastz/newer/README.lastz-1.03.02.html#ex_chrom

Let me know if that helps, or not,
Bob H

> ---------- Forwarded message ----------
> Date: Tue, 31 Jul 2012 17:49:21 +0800
> From: wangdp <wan...@big.ac.cn>
> Subject: [Genome] Lastz tools
> To: genome <gen...@soe.ucsc.edu>
> Message-ID: <2012073117...@big.ac.cn>
> Content-Type: text/plain; charset="gb2312"
>
> Dear UCSC team:
>
> I am now using the lastz tool to compare one species with itself.
>
> But I cannot find a way to compare multiple fasta with multiple
> fasta. Furthermore, it takes much more time to compare one chromosome
> with other chromosomes.
>
> Could you please help me with these two issues?
>
> Is there any way to do it in a quick fashion?
>
> Thanks a lot and best wishes,
>
> Sincerely,
>
> Tom
>
