liftover gff3


Arumilli, Meharji

Jul 7, 2016, 1:48:06 PM
to Matthew Speir, gen...@soe.ucsc.edu
Dear Matthew,

Could you please offer some quick help with lifting over a gff3 format file from
canFam2 to canFam3? The input is in the format below:

chr1 3000001 3001863 1863 0.903 1682 0 0 + chr1 3005899 3007758 1860 0.904 1682 0 0
chr1 3000004 3005485 5482 0.903 4934 29 0 + chr1 3004428 3009896 5469 0.905 4934 28 0
chr1 3004428 3009896 5469 0.905 4934 28 0 + chr1 3000004 3005485 5482 0.903 4934 29 0
chr1 3005899 3007758 1860 0.904 1682 0 0 + chr1 3000001 3001863 1863 0.903 1682 0 0
chr1 3006315 3008429 2115 0.910 1914 21 0 + chr1 3007787 3009896 2110 0.912 1914 21 0
chr1 3007787 3009896 2110 0.912 1914 21 0 + chr1 3006315 3008429 2115 0.910 1914 21 0

These represent segmental duplications, with two pairs of
coordinates in each row. Could you suggest a way to convert these from
canFam2 to canFam3 while retaining each pair of segments?

Br

Mehar

Cath Tyner

Jul 7, 2016, 7:51:05 PM
to Arumilli, Meharji, gen...@soe.ucsc.edu
Hello Mehar,

You might be interested in http://crossmap.sourceforge.net/, which does support gff.

You can also try converting your gff3 file to genePred format. You can use our gff3ToGenePred utility, found in our utilities section:

Choose your system, e.g., for macOSX.x86_64

You can run "gff3ToGenePred" on the command line with no other parameters to see the usage statement and the options.

You can then use your genePred file with liftOver, using the -genePred liftOver option.

Please respond to this list if you have further questions!

Thank you again for your inquiry and for using the UCSC Genome Browser.
Please send new and follow-up questions to one of our UCSC Genome Browser mailing lists below:

  * Post to the Public Help Forum: Email gen...@soe.ucsc.edu or search the Public Archives
  * Post to the Mirror Help Forum: Email genome...@soe.ucsc.edu or search the Mirror Archives
  * Confidential/private help: Email genom...@soe.ucsc.edu

UCSC Genome Browser Announcements List (email alerts for new data & software):
  * Subscribe: Email genome-annou...@soe.ucsc.edu 
  * Unsubscribe: Email genome-announ...@soe.ucsc.edu


​Enjoy,​
Cath
. . .
Cath Tyner
UCSC Genome Browser, Software QA & User Support
UC Santa Cruz Genomics Institute






Arumilli, Meharji

Jul 8, 2016, 12:31:17 PM
to Cath Tyner, gen...@soe.ucsc.edu

Dear Cath,

Thank you for your email.

I wrongly assumed that my input is in gff3 format, which it is not. Below are a couple of lines in the input format:


chr1 3000001 3001863 1863 0.903 1682 0 0 + chr1 3005899 3007758 1860 0.904 1682 0 0
chr1 3000004 3005485 5482 0.903 4934 29 0 + chr1 3004428 3009896 5469 0.905 4934 28 0

It is possible to prepare two BED files by splitting the input column-wise. However, this would not retain the pairing of corresponding coordinates from the input. Could you let me know how to lift the pairs of coordinates to a different genome build?

Br

Mehar

Brian Lee

Jul 8, 2016, 1:24:13 PM
to Arumilli, Meharji, gen...@soe.ucsc.edu
Dear Mehar,

Thank you for using the UCSC Genome Browser and sharing your assumption about the file format for your previous question.

In order to be of help, can you share where you obtained this file information and the meaning of each column?

Thank you again for your inquiry and using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

All the best,

Brian Lee
UCSC Genomics Institute 



Arumilli, Meharji

Jul 8, 2016, 7:38:51 PM
to Brian Lee, gen...@soe.ucsc.edu

Dear Lee,

The data corresponds to segmental duplication regions generated at the Broad Institute, where fields 1-6 correspond to:

chromosome start_coordinate stop_coordinate length percent_identical_bases number_identical_bases

We have no information about the zeros; the same fields then repeat with the coordinates for the identical segment. Let me know if I can provide more information.
Br
Mehar

Brian Lee

Jul 15, 2016, 7:08:42 PM
to Arumilli, Meharji, gen...@soe.ucsc.edu

Dear Mehar,

Thank you for using the UCSC Genome Browser and your question about lifting over data.

A solution to your question about how to lift a pair of coordinates will involve scripting, which is generally beyond the scope of our mailing list support. Please understand that we cannot help you through all the details of scripting; you will need to tackle these on your own. You may wish to contact the source of your data to enlist help with your goals. Our engineers note that lifting segmental duplications is an approach whose outcome is not guaranteed, and that they should instead be recomputed on the new assembly.

Here are some outlined ideas that could provide a scripting solution, but again, we cannot elaborate further on them. If the first three columns are coordinates like "chr start end" (BED3), the file can be lifted regardless of what the other columns mean, as long as they are not gene definitions with exons and so forth. With your data, an approach might be to use BED6, where the six columns are "chr start end name score strand". Unfortunately, your data does not have strand in the sixth column, so it would need to be put into place.

Using the bed6+ approach, here are some suggestions our engineers share:

1. For the 1st of each segment pair, add in "+" automatically as column 6 (strand).
2. LiftOver the lines
3. Swap columns around reversing the halves. This time use the original strand as column 6.
4. LiftOver the lines
5. Now we have 2 output strand columns.

To finally re-create the original format (lifted), you need to write a script that
collapses them back into a single strand column as follows:

strandFromStep2    strandFromStep4   finalStrand
-------------------------------------------------
         +               +                +
         +               -                -
         -               +                -
         -               -                +
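
The collapse table above amounts to this: the final strand is "+" when the two lifted strands agree and "-" when they differ. A minimal Python sketch of that rule follows; the function name combine_strands is made up for illustration and is not part of any UCSC tool.

```python
def combine_strands(strand_step2: str, strand_step4: str) -> str:
    """Collapse the two lifted strand columns into one final strand.

    Hypothetical helper: returns "+" when both strands agree and "-" when
    they differ, which reproduces the four rows of the table above.
    """
    for strand in (strand_step2, strand_step4):
        if strand not in ("+", "-"):
            raise ValueError(f"unexpected strand value: {strand!r}")
    return "+" if strand_step2 == strand_step4 else "-"


# Print all four combinations in the same order as the table.
for step2 in "+-":
    for step4 in "+-":
        print(step2, step4, combine_strands(step2, step4))
```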


As for using liftOver in general, here is an example. If you had a BED3 input file containing a line such as "chr1 3000001 3001863", you would download the chain file canFam2ToCanFam3.over.chain.gz, available at http://hgdownload.soe.ucsc.edu/goldenPath/canFam2/liftOver/, and run a command like:

liftOver bed3 canFam2ToCanFam3.over.chain.gz newBed3 unMapped

This would give you mapping results from canFam2 to canFam3 with results like, "chr1 101 1963".
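
If you prefer to drive the lifts from a script, the command above could be wrapped as in the following Python sketch. It assumes the liftOver executable is on your PATH; the helper names liftover_command and run_liftover are hypothetical, and the optional bed_plus argument simply adds the -bedPlus=N option used in the next example.

```python
import shutil
import subprocess


def liftover_command(in_bed, chain, out_bed, unmapped, bed_plus=None):
    """Build the argument list for one liftOver run (hypothetical helper)."""
    cmd = ["liftOver"]
    if bed_plus is not None:
        # Treat only the first N columns as BED fields; the rest ride along.
        cmd.append(f"-bedPlus={bed_plus}")
    cmd += [in_bed, chain, out_bed, unmapped]
    return cmd


def run_liftover(in_bed, chain, out_bed, unmapped, bed_plus=None):
    """Run liftOver, assuming the executable is installed and on PATH."""
    cmd = liftover_command(in_bed, chain, out_bed, unmapped, bed_plus)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("liftOver executable not found on PATH")
    subprocess.run(cmd, check=True)


# Show the command line that would be executed for the BED3 example.
print(" ".join(liftover_command(
    "bed3", "canFam2ToCanFam3.over.chain.gz", "newBed3", "unMapped")))
```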

If you had bed6+ input like "chr1 3000001 3001863 1863 0.903 + More Fields That Could Be Ignored", you could run a command like:

liftOver -bedPlus=6 bed6+ canFam2ToCanFam3.over.chain.gz newBed6+ unMapped2

This would give you mapping results like "chr1 101 1963 1863 0.903 + More Fields That Could Be Ignored" allowing you to use a command like awk to rearrange the columns as suggested in step 3 above to proceed with a second lift.
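
As a rough illustration of the column rearrangement in steps 1 and 3, here is a Python sketch. It assumes each input row has exactly 17 whitespace-separated fields (eight for the first segment, the strand, then eight for the second segment), as in the two example lines you sent; the function names are hypothetical. The whole row is carried through each lift as extra columns so that the pair stays together.

```python
def first_half_bed6plus(fields):
    """Step 1: put the first segment in BED6+ order with "+" as column 6.

    Output layout (18 columns): chr start end length identity "+",
    then the first segment's remaining 3 fields, the original strand,
    and the 8 fields of the second segment.
    """
    if len(fields) != 17:
        raise ValueError(f"expected 17 fields, got {len(fields)}")
    first, strand, second = fields[0:8], fields[8], fields[9:17]
    return first[0:5] + ["+"] + first[5:8] + [strand] + second


def swap_halves(lifted):
    """Step 3: move the second segment to the front for the second lift.

    Column 6 becomes the original strand; the strand produced by the
    first lift (strandFromStep2) is kept at index 9 for later collapsing.
    """
    if len(lifted) != 18:
        raise ValueError(f"expected 18 fields, got {len(lifted)}")
    first, strand_step2, first_rest = lifted[0:5], lifted[5], lifted[6:9]
    orig_strand, second = lifted[9], lifted[10:18]
    return second[0:5] + [orig_strand] + second[5:8] + [strand_step2] + first + first_rest


row = ("chr1 3000001 3001863 1863 0.903 1682 0 0 + "
       "chr1 3005899 3007758 1860 0.904 1682 0 0").split()
step1 = first_half_bed6plus(row)
print("\t".join(step1))
# In practice swap_halves would be applied to the rows written by the first lift.
print("\t".join(swap_halves(step1)))
```

Rows that liftOver cannot map are written to the unMapped file, so a pair in which either half fails to lift drops out of the output and would need separate handling.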

Likely the best solution is to reach out to the source and seek assistance on obtaining recomputed canFam3 data.

All the best,

Brian Lee
UCSC Genomics Institute
