information about UCSC data download

Paola Orsini

Jun 1, 2016, 11:26:13 AM
to gen...@soe.ucsc.edu
Hi,
I am Paola Orsini, a biologist from the University of Bari (Italy). I am writing to ask for some additional information about downloading data from UCSC. Specifically, I would like to download Alu data (positions and sequences) and divide the Alu sequences into their families (AluS, AluY, and so on). Could you tell me how to do this? Should I use the RepeatMasker program and the genome-mysql.cse.ucsc.edu link? Is there a tutorial or manual I can use?
Sorry, I'm new to this kind of analysis.
Thank you for your kind attention.
Best regards,

Paola Orsini

Matthew Speir

Jun 2, 2016, 12:11:45 PM
to Paola Orsini, gen...@soe.ucsc.edu
Hi Paola,

Thank you for your questions about using the UCSC Genome Browser to find Alu repeats.

If you are new to using the UCSC Genome Browser, I would highly recommend that you take advantage of the training material that we provide: http://genome-euro.ucsc.edu/training/index.html. In particular, I would start with the OpenHelix videos: http://www.openhelix.com/ucsc.

For those genomes and assemblies that we host, you can obtain Alu repeat positions using the Table Browser, http://genome-euro.ucsc.edu/cgi-bin/hgTables. You can see all of the organisms that we host in the species tree on our "Gateway" page: http://genome-euro.ucsc.edu/cgi-bin/hgGateway. Using the Table Browser, you can filter the "RepeatMasker" table and extract only the Alu repeats. To get both positions and sequence for these repeats will require two different Table Browser queries. You can obtain the Alu repeat positions using the following steps:

1. Navigate to the Table Browser, http://genome-euro.ucsc.edu/cgi-bin/hgTables.
2. Select your genome and assembly. In this example, I will be using the hg38 assembly of the human genome:
    clade: Mammal
    genome: Human
    assembly: Dec. 2013 (GRCh38/hg38)

3. Make the following table selections:
    group: Genes and Gene Predictions Tracks
    track: RepeatMasker
    table: rmsk
    output: BED - browser extensible data
    output file: enter a file name to save your results to a file, or leave blank to display results in your browser

4. Next to "filter", click "create".
5. Enter "Alu" in the "repFamily" field of the "Filter on Fields from hg38.rmsk" section.
        The "repFamily" line should read: repFamily does match Alu

6. Click "Submit".
7. Click "get output".
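
The filter in steps 4 and 5 keeps only the rows whose repFamily is "Alu". As a rough sketch, the same filter can be applied offline to a tab-separated copy of the rmsk table. The column positions below are an assumption (the usual rmsk layout with "bin" as the first column), and the function name is made up for illustration; check the table schema for your assembly before relying on it.

```python
# Column positions in a tab-separated dump of the rmsk table.
# Assumption: the usual UCSC layout, with "bin" as column 0.
SW_SCORE = 1
GENO_NAME, GENO_START, GENO_END = 5, 6, 7
STRAND = 9
REP_NAME = 10
REP_FAMILY = 12


def filter_rmsk(in_handle, out_handle, family="Alu"):
    """Write 6-column BED lines for rows whose repFamily equals `family`.

    Returns the number of lines written.
    """
    written = 0
    for line in in_handle:
        row = line.rstrip("\n").split("\t")
        # Skip short or malformed rows and rows from other repeat families.
        if len(row) <= REP_FAMILY or row[REP_FAMILY] != family:
            continue
        bed = (row[GENO_NAME], row[GENO_START], row[GENO_END],
               row[REP_NAME], row[SW_SCORE], row[STRAND])
        out_handle.write("\t".join(bed) + "\n")
        written += 1
    return written
```

With a local gzipped copy of the table (the file name rmsk.txt.gz is hypothetical), this could be called as filter_rmsk(gzip.open("rmsk.txt.gz", "rt"), open("myAlu.bed", "w")). The comparison is an exact string match on repFamily.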

You should be able to sort the Alu repeats into the different families (AluS, AluY, etc.) using the name in the fourth column.
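
A minimal sketch of that sorting step, assuming the BED names look like "AluSx1", "AluYa5" or "AluJb" and that the letter right after "Alu" identifies the family. The function names are illustrative, and any name that does not fit the pattern is counted as "other".

```python
import re
from collections import Counter

# Assumption: the capital letter after "Alu" (for example J, S or Y)
# identifies the family; this is a naming convention, not a guarantee.
ALU_NAME = re.compile(r"^Alu([A-Z])")


def alu_family(rep_name):
    """Map a repeat name such as 'AluSx1' to a family label such as 'AluS'."""
    match = ALU_NAME.match(rep_name)
    return "Alu" + match.group(1) if match else "other"


def count_families(bed_lines):
    """Count BED records per Alu family using the name in the fourth column."""
    counts = Counter()
    for line in bed_lines:
        fields = line.split()
        # Skip blank lines, header lines and rows without a name column.
        if len(fields) < 4 or fields[0] in ("track", "browser") or line.startswith("#"):
            continue
        counts[alu_family(fields[3])] += 1
    return counts
```

Calling count_families(open("myAlu.bed")) returns a Counter keyed by family label. To write each family to its own file instead of counting, the same alu_family label can be used to pick the output file.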

You can use the same steps described above to obtain the sequence; the only difference is that in step 3 you will need to select "sequence" as your output type instead of BED.

If, however, you want repeat positions and sequence for a genome that we don't host, you will need to obtain the RepeatMasker software, http://www.repeatmasker.org/, and run it for the genome you're interested in. Questions about using the RepeatMasker utility should be directed to the RepeatMasker group here: http://www.repeatmasker.org/cgi-bin/form2mail?template=feedback.tmpl&title=Feedback%20Form.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group

Matthew Speir

Jun 2, 2016, 12:15:18 PM
to Paola Orsini, gen...@soe.ucsc.edu
Hello, again, Paola,

One correction to the steps I provided in my previous email:

In step 3, the group should be "Repeats", not "Genes and Gene Predictions".

I apologize for any confusion.


Matthew Speir
UCSC Genome Bioinformatics Group

Matthew Speir

Jun 10, 2016, 5:24:44 PM
to Paola Orsini, gen...@soe.ucsc.edu
Hi Paola,

It sounds like the download from the Table Browser is timing out before it can complete. This is not terribly surprising considering that your Alu query returns over 1.2 million results, at least for the human assembly hg38. You can see how many results to expect in your file by clicking the "summary/statistics" button on the Table Browser after making all of your selections.

Are you using our official European mirror, http://genome-euro.ucsc.edu/, for this query? If you are and this query is still timing out, then this query may just be too large for the Table Browser to handle.

You can use our command-line tools to obtain the sequence from an assembly for a list of regions in a BED file. First, you will need to download the "twoBitToFa" utility from http://hgdownload.soe.ucsc.edu/admin/exe/. Then, you can use this utility on the command line like so:

twoBitToFa -bed=myAlu.bed http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit myAluSequences.fa

to obtain sequences for all of the Alu elements in the BED file, myAlu.bed. This BED file should be the result of the Table Browser query I described in my original email. Additionally, you can specify any 2bit file on our download server, http://hgdownload.soe.ucsc.edu/downloads.html, instead of the hg38 example I've used here. You can run the twoBitToFa utility without any arguments to see the usage statement.
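
A truncated download shows up as a FASTA file with fewer records than the BED file has lines. Here is a small sketch for checking this after running the command above, assuming twoBitToFa writes one FASTA record per BED line. The function names are illustrative and not part of any UCSC tool.

```python
def count_bed_records(bed_lines):
    """Count data lines in a BED file, ignoring blanks and header lines."""
    total = 0
    for line in bed_lines:
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", "track", "browser")):
            continue
        total += 1
    return total


def count_fasta_records(fasta_lines):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in fasta_lines if line.startswith(">"))


def counts_match(bed_lines, fasta_lines):
    """Return True when the FASTA has exactly one record per BED line."""
    return count_bed_records(bed_lines) == count_fasta_records(fasta_lines)
```

For example, counts_match(open("myAlu.bed"), open("myAluSequences.fa")) should return True for a complete run; the expected total can also be compared with the number shown by the "summary/statistics" button.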


Matthew Speir
UCSC Genome Bioinformatics Group


On 6/8/16 4:50 AM, Paola Orsini wrote:
Dear doctor,
thank you very much for your reply and help. I have downloaded the BED file of Alu sequences. I would like to kindly ask you for some additional information about the Alu data. I would also like to download the FASTA sequences of all Alu elements. To do this, I selected "sequence" as the output format and selected "send the output to Galaxy". I have tried this several times, and the number of sequences in the output FASTA file differs each time: the first time the file contained 1,073,385 sequences, the second time 969,845, and the third time 847,461. I don't know why these files differ, and I would like to kindly ask whether it is correct to download the Alu FASTA sequences by keeping the options you indicated in your email and selecting "sequence" as the output format.
Thank you again for your time and kind attention.
Best regards,

Paola Orsini

