[Genome] How can I get gene names out from positions of probes in a BED file?

John Cumbers

May 17, 2007, 1:12:36 AM
to gen...@soe.ucsc.edu
Hello,
I have Drosophila Affy tiling array data and two BED files.

I am having trouble pulling out gene names for the corresponding probes.
I import the BED file as a custom track, then go to the Table Browser and
click output at the bottom. I've tried many of the settings, changing the
group from custom to all tables, but I still don't get any gene names, just
an output of positions. This may be because no genes are present in my
sample (i.e. I'm only in intergenic regions), but I doubt it. Am I doing
something wrong? If it is because I don't have any genes there, how
can I find which genes are nearest to these sites?

As a second question, do you know of a computational transcription factor
binding site predictor that outputs predicted binding sites as a bed file,
or similar file that can be intersected with the tiled array data above, to
test how good the predictions are?

Any help much appreciated,
John



--
John Cumbers, Graduate Student
Biology and Medicine
Brown University, Box G-W
Providence, Rhode Island, 02912, USA
Tel USA: +1 401 523 8190, Fax: +1 401 863-2166
UK to USA: 0207 617 7824

Rachel Harte

May 17, 2007, 4:25:59 PM
to John Cumbers, gen...@soe.ucsc.edu
Hello John,

Regarding the problem that you are having with the affy tiled data, please
would you send me an example of the data that you are loading. Just a few
lines will suffice and then I will be able to help you more easily. If you
don't have gene names in the custom track then that would be the problem.
You may need to do an intersection with one of the gene tracks to obtain
gene names.

Other engineers here have suggested several transcription factor
binding site predictors that you could try:

1) There is a program in the Genome Browser source code called
dnaMotifFind (src/hg/geneBounds/dnaMotifFind).
The motif input format is described as follows:

table dnaMotif
"A gapless DNA motif"
    (
    string name;                "Motif name."
    int columnCount;            "Count of columns in motif."
    float[columnCount] aProb;   "Probability of A's in each column."
    float[columnCount] cProb;   "Probability of C's in each column."
    float[columnCount] gProb;   "Probability of G's in each column."
    float[columnCount] tProb;   "Probability of T's in each column."
    )

It requires tabs between fields, and commas between the elements of the
arrays. The source code is free for personal, academic and non-profit use
and details about obtaining it are here:

http://genome.ucsc.edu/FAQ/FAQlicense#license3
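
As a rough illustration of that layout (tabs between fields, commas between array elements), here is a minimal Python sketch that builds one dnaMotif record. The function name, the motif name and the probabilities are made up for the example. Some UCSC tab files end each array with a trailing comma; the thread does not say whether dnaMotifFind expects one, so this sketch leaves it out.

```python
def format_dna_motif(name, a_prob, c_prob, g_prob, t_prob):
    """Return one dnaMotif record: tab-separated fields, comma-separated arrays.

    Hypothetical helper for illustration; not part of the UCSC source tree.
    """
    columns = [a_prob, c_prob, g_prob, t_prob]
    column_count = len(a_prob)
    if any(len(col) != column_count for col in columns):
        raise ValueError("all four probability arrays must have the same length")
    # Field order follows the table definition: name, columnCount, aProb..tProb.
    fields = [name, str(column_count)]
    for col in columns:
        fields.append(",".join(f"{p:g}" for p in col))
    return "\t".join(fields)


# Made-up three-column motif, purely for illustration.
motif_line = format_dna_motif(
    "exampleMotif",
    [0.7, 0.1, 0.25],
    [0.1, 0.1, 0.25],
    [0.1, 0.7, 0.25],
    [0.1, 0.1, 0.25],
)
print(motif_line)
```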

2) If you want to do de novo searching (i.e. not search for matches to a
position weight matrix-type model), the program BEST was recommended:

http://www.cs.uga.edu/~che/BEST

It automatically runs AlignACE, BioProspector, CONSENSUS and MEME,
combines their output, and does some nice optimizations (i.e. merging
results, expanding motifs, etc.)

It is only available for Linux. You also can't run it from the
command line, so you need to do things manually, which can
be a problem if you are planning to look at many sets of genes.
You will also need to parse the output yourself to convert it to
BED format.

3) For searching for matches to known binding site profiles, rVista is a
good program to use:

http://rvista.dcode.org/

I hope that this helps you. Please let us know if you have further
questions.

Rachel

Rachel Harte
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

John Cumbers

May 17, 2007, 4:57:23 PM
to Rachel Harte, gen...@soe.ucsc.edu
hi Rachel,

Many thanks for the predictors, I will investigate the options.
Here are snippets of the two files that I'm importing. I think
that I do want to intersect each of them with the gene
track. This is what I was trying to do before, but it was not
working. On pasting them now, I see that the two files are slightly
different. They were created by different analysis programs, but I'm
not sure whether this affects the results.
Best,
John


chr4 522051 522720 CWO2_1 52.27
chr4 528169 528828 CWO2_2 52.83
chr4 529288 530007 CWO2_3 54.48
chr4 577460 578115 CWO2_4 52.68
chr2L 108229 109517 CWO2_5 97.94
chr2L 124682 126386 CWO2_6 105.53
chr2L 127841 129218 CWO2_7 111.71
chr2L 131814 133016 CWO2_8 73.76
chr2L 140553 142303 CWO2_9 80.55
chr2L 165150 166328 CWO2_10 77.85
chr2L 244984 245608 CWO2_11 50.66
chr2L 246004 248078 CWO2_12 300.97
chr2L 248079 248744 CWO2_13 57.47
chr2L 249032 249692 CWO2_14 54.11
chr2L 272833 273705 CWO2_15 55.62
chr2L 274055 274679 CWO2_16 50.83


chr2L 108251 110286 target 999 +
chr2L 119671 121087 target 999 +
chr2L 123464 125698 target 999 +
chr2L 125807 126206 target 999 +
chr2L 127630 129148 target 999 +
chr2L 132047 133073 target 999 +
chr2L 136302 136375 target 506 +
chr2L 136666 137077 target 973 +
chr2L 140590 142555 target 999 +
chr2L 165301 166365 target 999 +
chr2L 221975 224058 target 999 +

Rachel Harte

May 18, 2007, 12:05:01 PM
to John Cumbers, gen...@soe.ucsc.edu
Hello John,

The gene names must come from another table so you will need to do an
intersection between the regions generated from your Affy tiling array and
a gene track. Here is an example:

1) After loading your custom track, go to the Table Browser and then
select the species and assembly of interest.
2) Select the "Genes and Gene Prediction Tracks" group and a gene track
e.g. FlyBase Genes.
3) Select genome as the region.
4) Press the "create" button for intersection and set up the intersection
to be with your custom track.
5) Select the output to be BED format.

This will give you a list of genes (and their locations) whose exons
intersect with your custom track. To find out whether any of the Affy
tiling array regions intersect with introns, you will need to create a
custom track of introns for the FlyBase genes; to do this, select custom
track as the output. Alternatively, you could create a custom track of the
gene footprints, i.e. download the chromosome, name, txStart and txEnd of
the FlyBase genes, format them as a BED file and load it as a custom
track. This gives you a single block for each transcript, so you can
determine whether your Affy regions intersect with these gene regions (if
you don't need to know whether they intersect with exons or introns).

One drawback of the intersection is that you will only get the names of
the genes that intersect with an Affy region, and not the names of the
Affy regions with which they intersect. One way to get around this is to
use the Galaxy tool, at Penn State University, which is built on top of
the UCSC Table Browser.
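
For the gene footprint option mentioned above, the reformatting step could look something like the following Python sketch. It assumes the download is tab-separated with the columns in the order chromosome, name, txStart, txEnd, and that header or comment lines start with "#". The function name and the gene names and coordinates are invented for the example.

```python
def footprints_to_bed(rows):
    """Convert 'chrom<TAB>name<TAB>txStart<TAB>txEnd' lines to 4-column BED lines.

    Hypothetical helper; the column order is an assumption about the download.
    """
    bed_lines = []
    for row in rows:
        row = row.strip()
        # Skip blank lines and header/comment lines.
        if not row or row.startswith("#"):
            continue
        chrom, name, tx_start, tx_end = row.split("\t")[:4]
        # BED order is chrom, start, end, name.
        bed_lines.append(f"{chrom}\t{int(tx_start)}\t{int(tx_end)}\t{name}")
    return bed_lines


# Invented rows for illustration only.
example_rows = [
    "#chrom\tname\ttxStart\ttxEnd",
    "chr2L\tgeneA-RA\t107000\t110500",
    "chr2L\tgeneB-RA\t131000\t134000",
]
bed = footprints_to_bed(example_rows)
print("\n".join(bed))
```

The resulting lines can be saved to a file and loaded as a custom track.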

After loading your custom track, go to the Table Browser and click on
the "Galaxy" link (by the output format menu). Then click on
"Galaxy Main" to get to the Galaxy tool. Next, click on "Get Data"
in the left pane and select the "UCSC Main" table browser. You will
see that the interface looks like UCSC's Table Browser and that your custom
track is also available there. In Galaxy, you can do a join so that you retain
the identifiers from both tables. First, do a Table Browser query (in
Galaxy) to output the contents of the custom track. Then do the same for the
FlyBase genes. The query results will appear on the right of the page. Then
click on the "Operate on Genomic intervals" link on the left side and click
on the "Join" (the intervals of two queries side-by-side) link. Select the
two queries that you just ran to create a join between them. The eye icon
next to the results lets you display the data, and you will be able
to see which Affy array regions intersect with which genes.
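
To make the idea of a join that keeps both identifiers concrete, here is a small Python sketch. It is not how Galaxy implements its Join tool; it simply compares every region with every gene, which is fine for small files. The two regions are taken from the CWO2 snippet earlier in the thread, the gene names and coordinates are invented, and coordinates are treated as half-open (BED style).

```python
def join_intervals(regions, genes):
    """Return (region_name, gene_name) pairs for overlapping intervals.

    Each item is a (chrom, start, end, name) tuple with half-open coordinates.
    Brute-force comparison, written for clarity rather than speed.
    """
    pairs = []
    for r_chrom, r_start, r_end, r_name in regions:
        for g_chrom, g_start, g_end, g_name in genes:
            # Two half-open intervals overlap if each starts before the other ends.
            if r_chrom == g_chrom and r_start < g_end and g_start < r_end:
                pairs.append((r_name, g_name))
    return pairs


regions = [
    ("chr2L", 108229, 109517, "CWO2_5"),
    ("chr2L", 124682, 126386, "CWO2_6"),
]
# Hypothetical genes, for illustration only.
genes = [
    ("chr2L", 109000, 112000, "geneA"),
    ("chr2L", 200000, 201000, "geneB"),
]
pairs = join_intervals(regions, genes)
print(pairs)
```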

For intergenic regions, to find which gene the Affy array regions fall
closest to is a little more tricky. Perhaps you will need to intersect
with a custom track of upstream and/or downstream regions of genes.

I hope that this helps you. Please let us know if you have further
questions. Please direct questions about using Galaxy to the Galaxy team
at Penn State University.

Maximilian Haeussler

May 18, 2007, 12:31:21 PM
to John Cumbers, gen...@soe.ucsc.edu
> For intergenic regions, to find which gene the Affy array regions fall
> closest to is a little more tricky. Perhaps you will need to intersect
> with a custom track of upstream and/or downstream regions of genes.

I stumbled over your message. I'm not affiliated with UCSC, but
I search for closest genes from time to time and don't find it
that difficult: concatenate your regions and gene models into one file,
run bedSort <infile> <outfile>, and parse the result with a short
script that looks for the closest ATG, the flanking genes, only the
downstream ones, etc. I can send you the script if you like, or you can
get it from our CVS.

cheers,
Max
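
As a rough sketch of the closest-gene search (not the script mentioned above, which is not shown in this thread), the Python below compares each region with every gene on the same chromosome instead of sorting and scanning. It measures the gap to the whole gene interval rather than to the ATG. The region comes from the CWO2 snippet earlier in the thread; the genes are invented, and all function names are hypothetical.

```python
def interval_gap(start_a, end_a, start_b, end_b):
    """Gap in bases between two half-open intervals; 0 if they overlap or touch."""
    return max(0, start_b - end_a, start_a - end_b)


def nearest_gene(region, genes):
    """Return (gene_name, gap) for the closest gene on the region's chromosome.

    region and genes are (chrom, start, end, name) tuples.
    Returns None if no gene lies on that chromosome.
    """
    chrom, start, end, _name = region
    candidates = [g for g in genes if g[0] == chrom]
    if not candidates:
        return None
    best = min(candidates, key=lambda g: interval_gap(start, end, g[1], g[2]))
    return best[3], interval_gap(start, end, best[1], best[2])


region = ("chr2L", 244984, 245608, "CWO2_11")
# Hypothetical genes, for illustration only.
genes = [
    ("chr2L", 240000, 244000, "geneA"),
    ("chr2L", 246000, 250000, "geneB"),
    ("chr4", 245000, 246000, "geneC"),
]
closest = nearest_gene(region, genes)
print(closest)
```

For genome-wide data, sorting both sets first (as bedSort does) and scanning neighbours avoids the all-against-all comparison.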