[Genome] repStart, repEnd, repLeft in chrN_rmsk table


Casey Bergman

Jan 30, 2007, 11:30:36 AM
Hi -

Following an old thread posted at
<http://www.cse.ucsc.edu/pipermail/genome/2003-November/003500.html>,
I have a query about the repStart, repEnd, repLeft fields in the UCSC
RepeatMasker tables.

My concern is that the parsing of RepeatMasker coordinates on the
repeat query sequence may not be consistent for matches on positive
and negative strands of the genome. As can be seen in the sample rows
at
<http://genome.ucsc.edu/cgi-bin/hgTables?db=hg18&hgta_doSchema=describe+table+schema>
and throughout the genome browser and download files, matches on the
negative strand have a negative repStart value, which seems not to be
possible, whereas matches on the positive strand have interpretable
integer coordinates.

On investigating a few matches on both positive and negative strands,
it appears that start and end coordinates of the query repeat are
stored at UCSC as repStart & repEnd for positive strand matches, but
stored as repLeft and repEnd for negative strand matches. This
appears to be related to differences in the format of RepeatMasker
output for positive and negative strand matches (see below, + vs C
rows). If this interpretation of the situation is correct, the
meaning of repStart, repEnd, repLeft fields changes for positive and
negative strand matches. It would be great to get a second opinion
on this, and to know whether this situation might be flagged for
review, since the current format is not terribly intuitive and may not
be desired.

All the best,
Casey

*************

From <http://www.repeatmasker.org/webrepeatmaskerhelp.html>
Example:

 1306 15.6 6.2 0.0 HSU08988 6563 6781 (22462) C MER7A    DNA/MER2_type    (0)  336  103
12204 10.0 2.4 1.8 HSU08988 6782 7714 (21529) C TIGGER1  DNA/MER2_type    (0) 2418 1493
  279  3.0 0.0 0.0 HSU08988 7719 7751 (21492) + (TTTTA)n Simple_repeat      1   33  (0)
 1765 13.4 6.5 1.8 HSU08988 7752 8022 (21221) C AluSx    SINE/Alu        (23)  289    1
12204 10.0 2.4 1.8 HSU08988 8023 8694 (20549) C TIGGER1  DNA/MER2_type  (925) 1493  827
 1984 11.1 0.3 0.7 HSU08988 8695 9000 (20243) C AluSg    SINE/Alu         (5)  305    1
12204 10.0 2.4 1.8 HSU08988 9001 9695 (19548) C TIGGER1  DNA/MER2_type (1591)  827    2
  711 21.2 1.4 0.0 HSU08988 9696 9816 (19427) C MER7A    DNA/MER2_type  (224)  122    2
This is a sequence in which a Tigger1 DNA transposon has integrated
into a MER7 DNA transposon copy. Subsequently two Alus integrated in
the Tigger1 sequence. The simple repeat is derived from the poly A of
the Alu element. The first line is interpreted like this:

1306          = Smith-Waterman score of the match, usually complexity
                adjusted. The SW scores are not always directly
                comparable. Sometimes the complexity adjustment has been
                turned off, and a variety of scoring-matrices are used.
15.6          = % substitutions in matching region compared to the
                consensus
6.2           = % of bases opposite a gap in the query sequence
                (deleted bp)
0.0           = % of bases opposite a gap in the repeat consensus
                (inserted bp)
HSU08988      = name of query sequence
6563          = starting position of match in query sequence
6781          = ending position of match in query sequence
(22462)       = no. of bases in query sequence past the ending position
                of match
C             = match is with the Complement of the consensus sequence
                in the database
MER7A         = name of the matching interspersed repeat
DNA/MER2_type = the class of the repeat, in this case a DNA transposon
                fossil of the MER2 group (see below for list and
                references)
(0)           = no. of bases in (complement of) the repeat consensus
                sequence prior to beginning of the match (so 0 means
                that the match extended all the way to the end of the
                repeat consensus sequence)
336           = starting position of match in database sequence (using
                top-strand numbering)
103           = ending position of match in database sequence
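
As a rough illustration of how the last three columns change order
between + and C rows, here is a minimal Python sketch that splits one
row and labels those columns using the names from the legend above. The
function name parse_out_row and the returned keys are hypothetical, and
the sketch assumes whitespace-separated fields with the repeat class on
the same line as the rest of the row.

```python
def parse_out_row(line):
    """Label the three repeat-consensus columns of one RepeatMasker row.

    Hypothetical helper: assumes whitespace-separated fields with the
    repeat class on the same line as the rest of the row.
    """
    fields = line.split()
    strand = fields[8]           # '+' or 'C'
    a, b, c = fields[11:14]      # the three strand-dependent columns

    def to_int(text):
        # "(23)" -> 23; parentheses mark the unaligned-bases column
        return int(text.strip("()"))

    if strand == "+":
        # + rows: start, end, (left)
        start, end, left = to_int(a), to_int(b), to_int(c)
    elif strand == "C":
        # C rows: (left), start, end
        left, start, end = to_int(a), to_int(b), to_int(c)
    else:
        raise ValueError(f"unexpected strand field: {strand!r}")
    return {"strand": strand, "start": start, "end": end, "left": left}
```

Storing the three columns by position rather than by meaning would give
the strand-dependent field meanings described in this message.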

*************

Casey Bergman, Ph.D.
Faculty of Life Sciences
University of Manchester
Michael Smith Building
Oxford Road, M13 9PT
Manchester, UK

Tel: +44-(0)161-275-1713
Fax: +44-(0)161-275-5082
skype: caseymbergman

Email: casey.bergman at manchester.ac.uk
Web: http://www.bioinf.manchester.ac.uk/bergman/


Kayla Smith

Feb 6, 2007, 1:36:17 PM

Casey,

You are correct that the information in the repStart and repLeft fields
of the rmsk table is interpreted differently based on which strand is
involved. It is true that the fields could be better named. The two
fields behave as mirror images of each other. Allow me to describe how
these fields are related.

Of the three coordinates involved (repStart, repEnd, repLeft), repEnd
can be thought of as the "middle" coordinate of the three. It is always
a positive integer regardless of the strand the repeat element hits. It
represents the end coordinate of the matching part of the repeat element
in "repeat element coordinates". The "beginning" coordinate of the
matching part of a repeat element is repStart (for + oriented hits) or
repLeft (for - oriented hits).

For + oriented hits, repStart & repEnd are the proper coordinates of the
matching part of the element. In this case, repLeft is a numerical value
which may be used (via the equation repEnd-repLeft) to obtain the size
of the repeat element ("Left" in the sense of "repeat remaining" unaligned).

For - oriented hits, repLeft & repEnd are the proper coordinates of the
matching part of the element. For these neg-strand alignments, repLeft
will be upstream from (smaller than) repEnd. In this case, repStart is
a numerical value which may be used (via the equation repEnd-repStart)
to obtain the size of the repeat element.
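
The rules above can be sketched in Python as follows. This is only an
illustration under stated assumptions: the function name
rmsk_repeat_coords and the sample values are hypothetical, and the
sketch assumes that the "unaligned bases" field (repLeft for + hits,
repStart for - hits) is stored as a zero or negative integer, as the
negative repStart values mentioned earlier in this thread suggest.

```python
def rmsk_repeat_coords(strand, rep_start, rep_end, rep_left):
    """Return (start, end, size) in repeat element coordinates.

    Hypothetical helper following the description above. Assumes the
    field that counts unaligned bases is stored as zero or a negative
    integer, so that subtracting it from repEnd yields the element size.
    """
    if strand == "+":
        # repStart/repEnd are the match; repLeft holds the unaligned count
        start = rep_start
        size = rep_end - rep_left
    elif strand == "-":
        # repLeft/repEnd are the match; repStart holds the unaligned count
        start = rep_left
        size = rep_end - rep_start
    else:
        raise ValueError(f"unexpected strand: {strand!r}")
    return start, rep_end, size


# Made-up rows, not taken from any real table:
print(rmsk_repeat_coords("+", 5, 300, -12))
print(rmsk_repeat_coords("-", -23, 289, 1))
```

With this normalisation, start is never greater than end on either
strand, and size is computed with the same rule mirrored between the
two fields.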

I hope this is helpful to you. Please don't hesitate to contact us
again if you require more assistance.

Kayla Smith
UCSC Genome Bioinformatics Group
> _______________________________________________
> Genome maillist - Genome at soe.ucsc.edu
> http://www.soe.ucsc.edu/mailman/listinfo/genome
