Good morning
My name is Marco Matejcic and I’m a postdoc fellow at IARC, in France.
I’ve got a question about the RepeatMasker database.
When I download the RepeatMasker track as a BED file from the UCSC website (see screenshot attached), I get a file with only six columns:
chr1 | 16777160 | 16777470 | AluSp | 2147 | +
chr1 | 25165800 | 25166089 | AluY | 2626 | -
chr1 | 33553606 | 33554646 | L2b | 626 | +
chr1 | 50330063 | 50332153 | L1PA10 | 12545 | +
chr1 | 58720067 | 58720973 | L1PA2 | 8050 | -
chr1 | 75496180 | 75498100 | L1MB7 | 10586 | +
However, when I click on the “Describe table schema” box in the attached screenshot, I can see a more complete description for each line:
bin | swScore | milliDiv | milliDel | milliIns | genoName | genoStart | genoEnd | genoLeft | strand | repName | repClass | repFamily | repStart | repEnd | repLeft | id
585 | 1504 | 13 | 4 | 13 | chr1 | 10000 | 10468 | -249240153 | + | (CCCTAA)n | Simple_repeat | Simple_repeat | 1 | 463 | 0 | 1
585 | 3612 | 114 | 270 | 13 | chr1 | 10468 | 11447 | -249239174 | - | TAR1 | Satellite | telo | -399 | 1712 | 483 | 2
585 | 437 | 235 | 186 | 35 | chr1 | 11503 | 11675 | -249238946 | - | L1MC | LINE | L1 | -2236 | 5646 | 5449 | 3
585 | 239 | 294 | 19 | 10 | chr1 | 11677 | 11780 | -249238841 | - | MER5B | DNA | hAT-Charlie | -74 | 104 | 1 | 4
How can I download a file including all the columns listed above?
Thanks for your help
Marco
Hello Marco,
Thank you for your question about retrieving the full contents of the RepeatMasker table. Your screenshot shows that you selected the output format "BED - Browser Extensible Data". The BED output for this track from the UCSC Table Browser is a simplified representation of the data and is not intended to contain all of the fields listed by the "describe table schema" button. There are two ways to obtain all of those fields.
The first method is to use the UCSC Table Browser as in your screenshot, but change the output format to "all fields from selected table". The output of that search will contain all of the fields listed by "describe table schema". It is important to note, though, that it will contain about 450 MB of uncompressed data.
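As a rough illustration of working with that output once it has been saved to disk, here is a minimal Python sketch. It assumes the export is tab-separated and starts with a header line prefixed by "#" (for example "#bin"); the function name read_all_fields and the small in-memory sample are made up for this example and are not part of any UCSC tool.

```python
import csv
import io


def read_all_fields(handle):
    """Yield one dict per data row, keyed by the column names in the header line."""
    # Assumption: the first line is a tab-separated header prefixed with '#'.
    header = handle.readline().rstrip("\r\n").lstrip("#").split("\t")
    reader = csv.DictReader(
        handle, fieldnames=header, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        yield row


# Small in-memory stand-in for a saved export, built from the schema example rows.
columns = [
    "bin", "swScore", "milliDiv", "milliDel", "milliIns", "genoName",
    "genoStart", "genoEnd", "genoLeft", "strand", "repName", "repClass",
    "repFamily", "repStart", "repEnd", "repLeft", "id",
]
data = [
    ["585", "1504", "13", "4", "13", "chr1", "10000", "10468", "-249240153",
     "+", "(CCCTAA)n", "Simple_repeat", "Simple_repeat", "1", "463", "0", "1"],
    ["585", "3612", "114", "270", "13", "chr1", "10468", "11447", "-249239174",
     "-", "TAR1", "Satellite", "telo", "-399", "1712", "483", "2"],
]
text = "#" + "\t".join(columns) + "\n"
text += "".join("\t".join(fields) + "\n" for fields in data)

rows = list(read_all_fields(io.StringIO(text)))
for row in rows:
    print(row["genoName"], row["genoStart"], row["genoEnd"],
          row["repName"], row["repClass"], row["repFamily"])
```

For a real export, the io.StringIO object would be replaced by a file opened in text mode. All values are kept as strings here; convert them to int where needed.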
The second option is to download the file http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/rmsk.txt.gz, and the accompanying http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/rmsk.sql. The first file is a compressed version of the data that you would get from the UCSC Table Browser - it is around 137 MB instead of 450 MB. You will need a decompression program to extract the data from this file. The second file (rmsk.sql) contains a short description of the fields that appear in the data, very similar to the description that you see when you click the "describe table schema" button.
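If you prefer to read the compressed file directly rather than decompressing it first, something along these lines may work in Python. This is only a sketch: it assumes rmsk.txt.gz is tab-separated with no header line and that its columns appear in the order shown by "describe table schema" (check rmsk.sql to confirm). The names read_rmsk and to_bed_with_class are hypothetical helpers written for this example.

```python
import gzip

# Column order assumed to match the table schema; verify against rmsk.sql.
RMSK_COLUMNS = [
    "bin", "swScore", "milliDiv", "milliDel", "milliIns", "genoName",
    "genoStart", "genoEnd", "genoLeft", "strand", "repName", "repClass",
    "repFamily", "repStart", "repEnd", "repLeft", "id",
]


def read_rmsk(path):
    """Yield one dict per row of a gzip-compressed, tab-separated rmsk dump."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\r\n").split("\t")
            yield dict(zip(RMSK_COLUMNS, fields))


def to_bed_with_class(row):
    """Format a row as a BED-like line with repClass and repFamily appended."""
    return "\t".join([
        row["genoName"], row["genoStart"], row["genoEnd"],
        row["repName"], row["swScore"], row["strand"],
        row["repClass"], row["repFamily"],
    ])


# Example usage (assumes rmsk.txt.gz is in the current directory):
# for row in read_rmsk("rmsk.txt.gz"):
#     print(to_bed_with_class(row))
```

The second helper writes the same six fields that appear in the BED sample from the question, followed by the repeat class and family, which can be convenient when downstream tools expect BED-style input.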
I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those addresses will be archived in publicly-accessible forums for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.
--
Jonathan Casper
UCSC Genome Bioinformatics Group
--