Repeat Masker BED file

Marco Matejcic

Feb 17, 2015, 1:02:05 PM
to gen...@soe.ucsc.edu

Good morning,

My name is Marco Matejcic and I’m a postdoctoral fellow at IARC in France.

I have a question about the RepeatMasker database.

When I download the RepeatMasker track as a BED file from the UCSC website (see attached screenshot), I get a file with only six columns:

 

chr1  16777160  16777470  AluSp   2147   +
chr1  25165800  25166089  AluY    2626   -
chr1  33553606  33554646  L2b     626    +
chr1  50330063  50332153  L1PA10  12545  +
chr1  58720067  58720973  L1PA2   8050   -
chr1  75496180  75498100  L1MB7   10586  +

 

However, when I click on the “Describe table schema” box in the attached screenshot, I can see a more complete description for each line:

 

bin  swScore  milliDiv  milliDel  milliIns  genoName  genoStart  genoEnd  genoLeft    strand  repName    repClass       repFamily      repStart  repEnd  repLeft  id
585  1504     13        4         13        chr1      10000      10468    -249240153  +       (CCCTAA)n  Simple_repeat  Simple_repeat  1         463     0        1
585  3612     114       270       13        chr1      10468      11447    -249239174  -       TAR1       Satellite      telo           -399      1712    483      2
585  437      235       186       35        chr1      11503      11675    -249238946  -       L1MC       LINE           L1             -2236     5646    5449     3
585  239      294       19        10        chr1      11677      11780    -249238841  -       MER5B      DNA            hAT-Charlie    -74       104     1        4

 

How can I download a file including all the columns listed above?

 

Thanks for your help

 

Marco

ucsc_screenshot.png

Jonathan Casper

Feb 17, 2015, 1:39:40 PM
to Marco Matejcic, gen...@soe.ucsc.edu

Hello Marco,

Thank you for your question about retrieving the full contents of the RepeatMasker table. Your screenshot shows that you selected the output format "BED - Browser Extensible Data". The BED output for this track from the UCSC Table Browser is a simplified representation of the data, and it is not intended to contain all of the fields listed by the "describe table schema" button. There are two ways to obtain all of the data listed by "describe table schema".
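To illustrate how the two representations relate, here is a minimal Python sketch that reduces one full table row to a six-column BED line. It assumes the BED columns correspond to genoName, genoStart, genoEnd, repName, swScore, and strand. That mapping is inferred from the sample rows in this thread and is not a statement of how the Table Browser is implemented. The function name `rmsk_row_to_bed6` is hypothetical.

```python
# Column positions (0-based) in a full rmsk row, following the schema order:
# bin, swScore, milliDiv, milliDel, milliIns, genoName, genoStart, genoEnd,
# genoLeft, strand, repName, repClass, repFamily, repStart, repEnd, repLeft, id
SW_SCORE, GENO_NAME, GENO_START, GENO_END, STRAND, REP_NAME = 1, 5, 6, 7, 9, 10


def rmsk_row_to_bed6(line):
    """Reduce one tab-separated rmsk row to a BED6 line (assumed mapping)."""
    fields = line.rstrip("\n").split("\t")
    # Assumed BED6 order: chrom, start, end, name, score, strand.
    bed_order = (GENO_NAME, GENO_START, GENO_END, REP_NAME, SW_SCORE, STRAND)
    return "\t".join(fields[i] for i in bed_order)


# First row of the full table quoted in the question.
example_row = "\t".join([
    "585", "1504", "13", "4", "13", "chr1", "10000", "10468", "-249240153",
    "+", "(CCCTAA)n", "Simple_repeat", "Simple_repeat", "1", "463", "0", "1",
])
print(rmsk_row_to_bed6(example_row))
```

The remaining eleven fields (bin, milliDiv, repClass, repFamily, and so on) have no slot in this six-column layout, which is why they are missing from the BED download.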

The first method is to use the UCSC Table Browser as in your screenshot, but change the output format to "all fields from selected table". The output of that search will contain all of the fields listed by "describe table schema". It is important to note, though, that it will contain about 450 MB of uncompressed data.

The second option is to download the file http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/rmsk.txt.gz, and the accompanying http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/rmsk.sql. The first file is a compressed version of the data that you would get from the UCSC Table Browser - it is around 137 MB instead of 450 MB. You will need a decompression program to extract the data from this file. The second file (rmsk.sql) contains a short description of the fields that appear in the data, very similar to the description that you see when you click the "describe table schema" button.
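As a rough sketch of working with the downloaded file, the Python below reads a local copy of rmsk.txt.gz without decompressing it to disk first. It assumes the dump is tab-separated, has no header line, and lists its columns in the same order as "describe table schema". Please check those assumptions against rmsk.sql before relying on them. The names `RMSK_FIELDS`, `INT_FIELDS`, and `iter_rmsk` are made up for this example.

```python
import csv
import gzip

# Field names in the order shown by "describe table schema".
RMSK_FIELDS = [
    "bin", "swScore", "milliDiv", "milliDel", "milliIns",
    "genoName", "genoStart", "genoEnd", "genoLeft", "strand",
    "repName", "repClass", "repFamily", "repStart", "repEnd", "repLeft", "id",
]

# Fields that hold integers in the sample rows; converted for convenience.
INT_FIELDS = [
    "bin", "swScore", "milliDiv", "milliDel", "milliIns",
    "genoStart", "genoEnd", "genoLeft", "repStart", "repEnd", "repLeft", "id",
]


def iter_rmsk(path):
    """Yield one dict per record from a gzipped, tab-separated rmsk dump."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for values in reader:
            # Skip blank lines and any '#'-prefixed header line, if present.
            if not values or values[0].startswith("#"):
                continue
            row = dict(zip(RMSK_FIELDS, values))
            for name in INT_FIELDS:
                row[name] = int(row[name])
            yield row
```

For example, `for row in iter_rmsk("rmsk.txt.gz"): print(row["genoName"], row["repName"], row["repClass"])` would print the chromosome, repeat name, and repeat class of each record, assuming the file has been saved in the current directory.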

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those addresses will be archived in publicly-accessible forums for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group

