Dear Cath,
Thank you for your email.
I wrongly assumed that my input was in GFF3 format, which it is not. Below are a couple of lines in the input format:
chr1 3000001 3001863 1863 0.903 1682 0 0 +
chr1 3005899 3007758 1860 0.904 1682 0 0
chr1 3000004 3005485 5482 0.903 4934 29 0 +
chr1 3004428 3009896 5469 0.905 4934 28 0
It is possible to prepare two BED files by splitting the input column-wise. However, this would not preserve the pairing between corresponding coordinates. Would you let me know how to lift the pairs of coordinates to a different genome build?
Br
Mehar
Dear Lee,
The data corresponds to segmental duplication regions generated
at the Broad Institute, where fields 1-6 correspond to:
chromosome start_coordinate stop_coordinate length percent_identical_bases number_identical_bases
We have no information about the zeros. The same fields are then repeated with the coordinates for the identical segment. Let me know if I can provide more information.
Dear Mehar,
Thank you for using the UCSC Genome Browser and your question about lifting over data.
A solution to your question about how to lift a pair of coordinates will involve scripting, which is generally beyond the scope of our mailing list support. Please understand that we cannot help you through all the details of scripting; you will need to tackle these on your own. You may wish to contact the source of your data to enlist help with your goals. Our engineers note that lifting segmental duplications is not guaranteed to give reliable results, and that the duplications should instead be recomputed.
Here are some outlined ideas that could provide a scripting solution, but again, we cannot elaborate further on these propositions. If the first three columns are coordinates like "chr start end" (BED3), the file can be lifted regardless of what the other columns mean, as long as they are not gene definitions with exons and so forth. With your data, an approach might be to use BED6, where the six columns are "chr start end name score strand". Unfortunately, your data does not have strand in the sixth column, so it would need to be put into place.
Using the bed6+ approach, here are some suggestions our engineers share:
1. For the 1st of each segment pair, add in "+" automatically as column 6 (strand).
2. LiftOver the lines
3. Swap columns around reversing the halves. This time use the original strand as column 6.
4. LiftOver the lines
5. Now we have 2 output strand columns.
To finally re-create the original format (lifted), you need to write a script that
collapses them back into a single strand column as follows:
strandFromStep2   strandFromStep4   finalStrand
-------------------------------------------------
+                 +                 +
+                 -                 -
-                 +                 -
-                 -                 +
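The collapse rule in the table amounts to this: the final strand is "+" when the two lifted strands agree and "-" when they differ. A minimal Python sketch of that rule follows. The function name combine_strands is hypothetical and is not part of any UCSC tool.

```python
def combine_strands(strand_step2: str, strand_step4: str) -> str:
    """Collapse the two lifted strand columns into a single strand.

    Hypothetical helper: "+" if both lifted strands agree, "-" otherwise,
    mirroring the four rows of the table above.
    """
    valid = {"+", "-"}
    if strand_step2 not in valid or strand_step4 not in valid:
        raise ValueError(
            f"unexpected strand values: {strand_step2!r}, {strand_step4!r}"
        )
    # Matching strands give "+", differing strands give "-".
    return "+" if strand_step2 == strand_step4 else "-"
```

A script re-creating the original format would call this once per lifted record and write the result into the single strand column.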
For example, if you had BED3 input, you could run a command like:
liftOver bed3 canFam2ToCanFam3.over.chain.gz newBed3 unMapped
This would give you mapping results from canFam2 to canFam3, with output lines like "chr1 101 1963".
If you had bed6+ input like "chr1 3000001 3001863 1863 0.903 + More Fields That Could Be Ignored", you could run a command like:
liftOver -bedPlus=6 bed6+ canFam2ToCanFam3.over.chain.gz newBed6+ unMapped2
This would give you mapping results like "chr1 101 1963 1863 0.903 + More Fields That Could Be Ignored", after which you could use a tool such as awk to rearrange the columns as suggested in step 3 above and proceed with a second lift.
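As a rough illustration of the column rearrangement in steps 1 and 3, here is a Python sketch (awk would work equally well). It rests on an assumption that is not confirmed in this thread: each record is a single whitespace-separated line of 17 fields, made up of 8 fields for the first segment, then the strand, then 8 fields for the second segment. The function names are hypothetical. Note also that the sample lengths equal end - start + 1, which suggests 1-based inclusive coordinates, whereas BED starts are 0-based; the sketch leaves the coordinates untouched.

```python
def first_pass_line(fields):
    """Step 1: build a bed6+ record for the first segment of a pair.

    ASSUMED layout (hypothetical): 17 fields = 8 for segment A,
    1 strand, 8 for segment B. A placeholder "+" becomes column 6 and
    everything else is carried along as extra columns.
    """
    if len(fields) != 17:
        raise ValueError(f"expected 17 fields, got {len(fields)}")
    seg_a, orig_strand, seg_b = fields[0:8], fields[8], fields[9:17]
    # chr start end length identity "+" | rest of A | original strand | all of B
    return seg_a[0:5] + ["+"] + seg_a[5:8] + [orig_strand] + seg_b


def second_pass_line(lifted):
    """Step 3: swap the halves of a lifted first-pass record.

    The original strand becomes column 6 for the second lift, and the
    strand produced by the first lift is kept as an extra column.
    """
    if len(lifted) != 18:
        raise ValueError(f"expected 18 fields, got {len(lifted)}")
    seg_a, strand_step2, rest_a = lifted[0:5], lifted[5], lifted[6:9]
    orig_strand, seg_b, rest_b = lifted[9], lifted[10:15], lifted[15:18]
    # B's chr start end length identity | original strand | rest of B
    # | strand from the first lift | lifted A | rest of A
    return seg_b + [orig_strand] + rest_b + [strand_step2] + seg_a + rest_a
```

With this layout, after the second lift column 6 holds the strand from step 4 and column 10 holds the strand from step 2, so both are available for the final collapse into a single strand column.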
Likely the best solution is to reach out to the source and seek assistance on obtaining recomputed canFam3 data.
Thank you again for your inquiry and for using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.
All the best,
Brian Lee
UCSC Genomics Institute