Annotation of Intact LINE elements in UCSC

Kostas Tsirigos

Sep 26, 2017, 11:32:45 AM

to gen...@soe.ucsc.edu

Dear UCSC,

is there a particular annotation of intact LINE elements in the browser? We know that they should be ~6,000 bp long, but this approximation is not very helpful if one wants to set a cut-off to fish them out of a text file.

E.g.

https://genome-euro.ucsc.edu/cgi-bin/hgc?hgsid=224821811_zbCISrEIYD190jxeYbVQhhr5rbDb&c=chr15&l=26277024&r=26287423&o=26280242&t=26285911&g=rmsk&i=L1Md_T

in mouse, is an intact L1 element, but still it is not 6kb.

Is it maybe so that UCSC has annotated them somehow differently?

Thank you,

Kostas

Brian Lee

Sep 28, 2017, 7:29:10 PM

to Kostas Tsirigos, gen...@soe.ucsc.edu

Dear Kostas,

Thank you for using the UCSC Genome Browser, and for your question about intact LINE elements in the mm10 assembly and your example link.

If you look closely at this example spot you will find it is actually a LINE element that did not align perfectly in this region. The session link below displays a second track, "Detailed Visualization of RepeatMasker Annotations", along with three highlights: one light blue, one darker blue, and a faint yellow one marking where this LINE element is joined:

http://genome-euro.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=brianlee&hgS_otherUserSessionName=mm10_MLQ20219

Click into the top "L1Md_T#LINE/L1" element to arrive at the details page for the rmskJoinedBaseline item, which shows that this element is fragmented: it is broken into two regions, with the alignments displayed below. Scroll down to find a graphic in the description section explaining the various graphical items in the "Detailed Visualization of RepeatMasker Annotations" track. Click "View table schema" to learn about the rmskJoinedBaseline table.

If you click into the "Repeating Elements by RepeatMasker" rmsk track below it, in the two different blue regions, you will see that the two pieces have sizes of 5669 and 996 bp, about 6.6 kb combined, but are two separate annotations. Click "View table schema" to learn about the rmsk table.
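To make the arithmetic concrete, here is a small Python sketch. The first fragment's coordinates are taken from the rmsk link in the original question (o=26280242, t=26285911). The second fragment's start is an assumption, inferred from its 996 bp size and the joined item's end coordinate of 26286907.

```python
# The two rmsk fragments of the L1Md_T example on chr15 (mm10).
# Fragment 1: coordinates from the rmsk link in the original question.
# Fragment 2: start coordinate is inferred (assumption), not read from the table.
fragments = [
    ("chr15", 26280242, 26285911),
    ("chr15", 26285911, 26286907),
]

# Size of each separately annotated piece.
sizes = [end - start for _, start, end in fragments]

# Span covered when the pieces are treated as one joined element.
joined_start = min(start for _, start, _ in fragments)
joined_end = max(end for _, _, end in fragments)
joined_span = joined_end - joined_start

print(sizes)        # [5669, 996]
print(sum(sizes))   # 6665
print(joined_span)  # 6665
```

Neither piece reaches 6,000 bp on its own, which is why a per-row length cut-off on rmsk would miss this element.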

You could use MySQL queries to extract the coordinates of regions that are above a certain size. With MySQL installed on your computer (http://genome.ucsc.edu/goldenPath/help/mysql.html) you could use a command like the following to get 100 examples where the rmsk table has entries of at least 6,000 bp (remove "limit 100" to get all):

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -NAe 'select repClass, repName, genoName, genoStart, genoEnd, (genoEnd-genoStart) as diff from rmsk where (genoEnd-genoStart) >= 6000 and repClass like "%LINE%" limit 100;' mm10

This would screen out items like the one above that are split into two sections, and would give results such as the following (where the final number is the span):

LINE L1_Mus4 chr1 23500015 23506071 6056
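The effect of this filter can be tried offline with a rough Python sketch that runs the same WHERE clause against an in-memory SQLite table. The table here is a cut-down stand-in with only the five columns used in the query, not the real rmsk schema. The three rows are the result line above plus the two fragments of the chr15 example, where the second fragment's start coordinate is an assumption inferred from its 996 bp size.

```python
import sqlite3

# Cut-down stand-in for the rmsk table (assumption: only the queried columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rmsk (repClass TEXT, repName TEXT, genoName TEXT, "
    "genoStart INTEGER, genoEnd INTEGER)"
)
rows = [
    ("LINE", "L1_Mus4", "chr1", 23500015, 23506071),  # single 6056 bp annotation
    ("LINE", "L1Md_T", "chr15", 26280242, 26285911),  # 5669 bp fragment
    ("LINE", "L1Md_T", "chr15", 26285911, 26286907),  # 996 bp fragment (start inferred)
]
conn.executemany("INSERT INTO rmsk VALUES (?, ?, ?, ?, ?)", rows)

# Same selection and filter as the mysql command, without the row limit.
query = """
    SELECT repClass, repName, genoName, genoStart, genoEnd,
           (genoEnd - genoStart) AS diff
    FROM rmsk
    WHERE (genoEnd - genoStart) >= 6000 AND repClass LIKE '%LINE%'
"""
hits = conn.execute(query).fetchall()
for hit in hits:
    print(*hit)
```

Only the L1_Mus4 row passes; both L1Md_T fragments are dropped because each is tested against the cut-off separately.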

If you are also interested in selecting fragmented items like the first example, which can span large regions, another option would be to look at the rmskJoinedBaseline table with a query like the following:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -NAe 'select name, chrom, alignStart, alignEnd, (alignEnd-alignStart) as diff from rmskJoinedBaseline where (alignEnd-alignStart) >= 6000 and name like "%LINE%" limit 100;' mm10

This would capture the item in the session above, but would also capture many other LINE items that span large regions, so I have a feeling this would not suit your needs (the final number is the span):

L1Md_T#LINE/L1 | chr15 | 26280242 | 26286907 | 6665

..
L1Md_F2#LINE/L1 | chr1 | 5920137 | 5939499 | 19362
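One possible refinement, sketched below in Python, is to add an upper bound on the span so that very long joined items are dropped. The 7,000 bp ceiling is purely a hypothetical value chosen for illustration, not a UCSC recommendation, and the sketch assumes the results are in the pipe-delimited form quoted above.

```python
# Two of the result lines quoted above, in pipe-delimited form.
output = """\
L1Md_T#LINE/L1 | chr15 | 26280242 | 26286907 | 6665
L1Md_F2#LINE/L1 | chr1 | 5920137 | 5939499 | 19362
"""

MIN_SPAN = 6000  # lower cut-off used in the query
MAX_SPAN = 7000  # hypothetical upper cut-off (assumption, for illustration only)


def parse(line):
    """Split one pipe-delimited result line into typed fields."""
    name, chrom, start, end, span = [field.strip() for field in line.split("|")]
    return name, chrom, int(start), int(end), int(span)


records = [parse(line) for line in output.splitlines() if line.strip()]

# Keep only joined items whose span falls inside the window.
kept = [rec for rec in records if MIN_SPAN <= rec[4] <= MAX_SPAN]

for rec in kept:
    print(*rec)
```

With this window the 19,362 bp L1Md_F2 item is excluded while the 6,665 bp L1Md_T item is kept. A span window alone says nothing about whether the ORFs are intact.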

Thank you again for your inquiry and using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

All the best,

Brian Lee

Kostas Tsirigos

Nov 29, 2017, 1:38:50 PM

to gen...@soe.ucsc.edu

Hello again,

many thanks for your response. I am now re-focusing on this and I have two questions:

a) When downloading all the LINEs from RepeatMasker (using the Table Browser), I end up with ~1M L1s. In the literature I have read that the L1 family comprises ~500,000 copies. Looking more closely at the data I downloaded, I saw L1s as short as 11 bp, which seems really small.

My question here is then: how and why are these sequences classified as LINEs?

E.g. here is an example:

http://genome-euro.ucsc.edu/cgi-bin/hgc?hgsid=225869563_Nzm1N3Zs8iQ1rf7jhwfOeTnLGQ20&c=chr1&l=149159897&r=149159969&o=149159910&t=149159957&g=rmsk&i=L1MB3

Is there some cutoff or something? Or some other helpful annotation field that I can use/rely on in order to keep more "confident" results?

b) Related to my previous post: is 6 kb an arbitrary size? I was mainly interested, if at all possible, in getting the intact L1s, not just the ones that are at least 6 kb in length, because there can easily be an intact L1 that is e.g. 5850 bp long and still has both ORFs intact.