Repeat sequence downloaded and item count do not match

Shruti Sinha

Jun 26, 2015, 10:42:05 AM

to gen...@soe.ucsc.edu

Dear Sir/Madam,

I downloaded the repeat sequences from UCSC using the Table Browser. However, the number of repeat sequences is 2213521, whereas the item count in the summary statistics for downloading the FASTA sequence for the repeats is 5520017. I would be grateful if you could clarify the discrepancy in the numbers.

Kind Regards,

Shruti

Matthew Speir

Jun 30, 2015, 2:38:34 PM

to Shruti Sinha, gen...@soe.ucsc.edu

Hi Shruti,

Thank you for your question about getting the sequences of items in a
RepeatMasker track. The issue is likely that the download of these
repeat sequences is timing out before it can complete, leaving you with
an incomplete file. The RepeatMasker track is quite large at over 5
million items, so it is inefficient to try downloading the sequences for
these repeat items using the Table Browser.
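
One quick way to confirm that the download was cut short is to count the
header lines in the FASTA file and compare that number with the item count
the Table Browser reports. A minimal sketch, assuming a Unix-like shell with
grep; the function name and the file name in the usage comment are
placeholders, not anything provided by the Table Browser:

```shell
# Count FASTA records by counting header lines (those starting with ">").
count_fasta_records() {
    grep -c '^>' "$1"
}

# Example usage (hypothetical file name):
#   count_fasta_records myDownloadedRepeats.fa
```

If the number printed is well below the item count, the file is most likely
incomplete.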

I recommend getting a BED file of the positions of items in the
RepeatMasker track and then using our command line tool "twoBitToFa" to
get the sequence for these items. First, get a BED file of the repeat
positions from the RepeatMasker track. You can do this using the
following steps (note that I've used hg19 in this example, but you can
substitute your assembly of interest):

1. Navigate to the Table Browser, http://genome.ucsc.edu/cgi-bin/hgTables.
2. Make the following selections:
clade: Mammal
genome: Human
assembly: Feb. 2009 (GRCh37/hg19)
group: Repeats
track: RepeatMasker
table: rmsk
region: genome
output: BED - browser extensible data
output file: myRepeatPositions.bed

3. Click "get output".
4. Under "Create one BED record per", check "Whole Gene".
5. Click "get BED".

Next, download the 2bit file for your assembly from the section for that
assembly on our downloads page:
http://hgdownload.soe.ucsc.edu/downloads.html. You can find 2bit files
under the "Full data set" link for a particular assembly. Then, download
the "twoBitToFa" utility for your system here:
http://hgdownload.soe.ucsc.edu/admin/exe/. Lastly, you can run a command
like (again, using hg19 as an example):

twoBitToFa -bed=myRepeatPositions.bed hg19.2bit myRepeatSeqs.fa

This will output the sequences for all of the items in your
"myRepeatPositions.bed" file.
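
To confirm that nothing was lost this time, you can compare the number of
records in the BED file with the number of sequences written to the FASTA
file. A rough sketch, assuming a Unix-like shell, one FASTA record per BED
line, and that any "track", "browser", or "#" lines in the BED file are
headers rather than data; the function name is hypothetical:

```shell
# Compare the number of BED data lines with the number of FASTA records.
check_repeat_counts() {
    bed_file=$1
    fasta_file=$2
    # BED data lines: skip blank lines and track/browser/comment lines
    # (assumption: such lines are headers, not repeat positions).
    bed_count=$(grep -Evc '^(track|browser|#|[[:space:]]*$)' "$bed_file")
    # Each FASTA record begins with a ">" header line.
    fasta_count=$(grep -c '^>' "$fasta_file")
    echo "BED records: $bed_count, FASTA records: $fasta_count"
    # Succeed only when the two counts agree.
    [ "$bed_count" -eq "$fasta_count" ]
}

# Example usage with the file names from the command above:
#   check_repeat_counts myRepeatPositions.bed myRepeatSeqs.fa && echo "counts match"
```

If the two counts differ, it is worth checking whether the command finished
without errors before using the output.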

I hope this is helpful. If you have any further questions, please reply
to gen...@soe.ucsc.edu. All messages sent to that address are archived
on a publicly-accessible Google Groups forum. If your question includes
sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group