About axtChain Error

117 views
Skip to first unread message

Zeta Mui

unread,
Aug 10, 2017, 11:33:28 AM8/10/17
to gen...@soe.ucsc.edu
Hi UCSC personnel,

I am trying to liftOver the soybean genome (1190 sequence records) to the common bean genome (478 sequence records) by LASTZ --> chaining with axtChain, but got this error:

query Chr01 block 51435129-51435189 exceeds sequence length 51433939

for 201 cases out of 2882 .psl files. Below is both my LASTZ, mafToPsl and axtChain command:

LASTZ:

lastz Gmax_275_v2.0.softmasked.fa[multiple] Pvulgaris_442_v2.0.softmasked.fa[multiple] --strand=both --notransition --step=20 --nogapped --progress=1 --format=maf > gmaxv2_vs_pvulv2.maf

mafToPsl (This is a loop in a python script after traversing the .maf file collecting non-empty target-query pairs):

# Run command like `mafToPsl Chr01 Chr01 gmaxv2_vs_pvulv2.maf gmaxv2_vs_pvulv2_chr01.psl`
for target_seqname, query_seqname in target_query_pairs: 
	outfile_name = '%s_%s_vs_%s.psl' %(prefix_root, target_seqname, query_seqname)
	subprocess.run(["mafToPsl", query_seqname, target_seqname, "gmaxv2_vs_pvulv2.maf", outfile_name]) 

axtChain (with a bash loop):

for i in ../00_mafToPsl/00_psl_files/*.psl
do axtChain ${i} ../Gmax_275_v2.0.softmasked.2bit ../Pvulgaris_442_v2.0.softmasked.2bit ./${i%%.psl}.chain -linearGap=loose -psl
done

What is wrong? 
--
Best,
Zeta

Cath Tyner

unread,
Aug 10, 2017, 7:51:15 PM8/10/17
to Zeta Mui, UCSC Genome Browser Public Help Forum
Hello Zeta,

Thank you for contacting the UCSC Genome Browser support team. 

It's possible that you've come across a bug. Can you reply and let us know the lastz version you're using? 
We currently use:
lastz-distrib-1.03.66

One of our engineers has shared that he has never run lastz in the mode you have described. While this mode may be a legitimate method, he shared that this is an unusual combination with both [multiple] on the target and the query. He has only used [multiple] on target only with 2bit files. He shared that this use of [multiple] is a tricky operation for lastz to run.

Also, when he uses the [multiple] operation, he allows lastz to output
--format=axt+ format files, not maf. Then axtToPsl to get ready for the chaining.

There is concern that you might be mixing up fasta files for use with lastz, then 2bit files for use with kent commands. It might be good verify that these are equivalent files, perhaps with an faCount on the fasta, and the
'twoBitToFa file.2bit stdout | faCount stdin' on the 2bit file to verify they are identical sequences. lastz knows how to use 2bit files, there is no need to use fasta.

You can find the faCount utility for your OS-specific system here:

Please respond if you have further questions so that we can provide other troubleshooting options. 

Thank you for contacting the UCSC Genome Browser support team. 
​Please send new and follow-up questions to one of our UCSC Genome Browser mailing lists below:

  * Post to the Public Help Forum: E
mail 
gen...@soe.ucsc.edu
​ or search the Public Archives
​  * Post to the Mirror Help Forum: Email
 
genome...@soe.ucsc.edu 
or search the Mirror Archives​
​  * Confidential/private help: Email
 
genom...@soe.ucsc.edu

UCSC Genome Browser Announcements List (email alerts for new data & software):
  * Subscribe: Email genome-announce+subscribe...@soe.ucsc.edu 
  * Unsubscribe: Email genome-announce+unsubscri...@soe.ucsc.edu

Join us on Social Media! FacebookTwitter, Wordpress BlogYouTube

​Enjoy,​
Cath
. . .
Cath Tyner
UCSC Genome Browser, Software QA & User Support
UC Santa Cruz Genomics Institute


--

---
You received this message because you are subscribed to the Google Groups "UCSC Genome Browser Public Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to genome+un...@soe.ucsc.edu.
To post to this group, send email to gen...@soe.ucsc.edu.
Visit this group at https://groups.google.com/a/soe.ucsc.edu/group/genome/.
To view this discussion on the web visit https://groups.google.com/a/soe.ucsc.edu/d/msgid/genome/CAAJ6Tgp2y8G8CB%2BBseQRB1d4eKbbHJuOxryvd2y%3D7nSzaCgvEw%40mail.gmail.com.
For more options, visit https://groups.google.com/a/soe.ucsc.edu/d/optout.

Zeta Mui

unread,
Aug 11, 2017, 11:25:33 AM8/11/17
to Cath Tyner, UCSC Genome Browser Public Help Forum
Hi Cath,

Thanks for your prompt response! 

I am using lastz-distrib-1.04.00. 

I have done the faCount as adviced and the following is my command and the results:

## a bash loop doing faCount on the 2 genome .fa files
for i in *.fa
    do faCount ${i} > ${i}.countStats
done

## a bash loop doing the faCount after twoBitToFa
for i in *.2bit
    do twoBitToFa ${i} stdout | faCount stdin > ${i}.countStats
done

## Running diff to check the files
diff Gmax_275_v2.0.softmasked.2bit.countStats Gmax_275_v2.0.softmasked.fa.countStats
diff Pvulgaris_442_v2.0.softmasked.2bit.countStats Pvulgaris_442_v2.0.softmasked.fa.countStats

and these don't have any outputs before the prompt returned.

Thank you for your time for the trouble-shooting! Hopefully this can settle soon! Maybe in the meantime I will try your suggested pipeline using only [multiple] on only the target genome, and use axt+ format and axtToPsl for chaining. (by the way why not directly use axt for axtChain?)

Best,
Zeta


*P.S. Just curious, after reading the lastz online documentation, it does say that [multiple] is only used on the target and that 'there's rarely any reason to use [multiple] action on the query sequence'. I used the [multiple] action on the query sequence because I realized it has more than one sequence records and I didn't split the query genome into 478 sequences. Why would they say there isn't a reason? (One reason they kind of explained was because ''it will negatively affect memory use and run time", however, the job I ran finished pretty quickly using one cpu)

I have also used the maf output format because originally I was thinking I can do the parsing/lifting over by parsing maf alignments, but ended up coming across the standard pipeline for this purpose.

--
Best,
Zeta

Hiram Clawson

unread,
Aug 11, 2017, 1:04:11 PM8/11/17
to Zeta Mui, Cath Tyner, UCSC Genome Browser Public Help Forum
Good Morning Zeta:

Your lastz version is out of date and has this type
of known bug. Please update lastz.

--Hiram
>> * * Post to the Public Help Forum: E*
>> *mail gen...@soe.ucsc.edu <gen...@soe.ucsc.edu>​ or search the Public
>> Archives <https://groups.google.com/a/soe.ucsc.edu/forum/#!forum/genome>*
>> *​ * Post to the Mirror Help Forum: Email **genome...@soe.ucsc.edu
>> <genome...@soe.ucsc.edu> *
>> *or search the Mirror Archives​
>> <https://groups.google.com/a/soe.ucsc.edu/forum/#!forum/genome-mirror>*
>> *​ * Confidential/private help: Email **genom...@soe.ucsc.edu
>> <genom...@soe.ucsc.edu>*
>>
>> UCSC Genome Browser Announcements List
>> <https://groups.google.com/a/soe.ucsc.edu/forum/#!forum/genome-announce> (email
>> alerts for new data & software):
>>
>> * * Subscribe: Email genome-annou...@soe.ucsc.edu
>> <http://genome-announce+subs...@soe.ucsc.edu/> * Unsubscribe:
>> Email genome-announ...@soe.ucsc.edu
>> <http://genome-announce+unsub...@soe.ucsc.edu/>*
>>
>> Join us on Social Media!
>> *Facebook <https://www.facebook.com/ucscGenomeBrowser>, Twitter,
>> <http://www.twitter.com/GenomeBrowser> Wordpress Blog
>> <http://genome.ucsc.edu/blog/>, YouTube
>> <http://www.youtube.com/channel/UCQnUJepyNOw0p8s2otX4RYQ>*
>>
>> ​Enjoy,​
>> Cath
>> . . .
>> Cath Tyner
>> UCSC Genome Browser, Software QA & User Support
>> UC Santa Cruz Genomics Institute <https://genomics.soe.ucsc.edu/>
>> UCSC Genome Browser <http://genome.ucsc.edu/contacts.html>
>>> <https://groups.google.com/a/soe.ucsc.edu/d/msgid/genome/CAAJ6Tgp2y8G8CB%2BBseQRB1d4eKbbHJuOxryvd2y%3D7nSzaCgvEw%40mail.gmail.com?utm_medium=email&utm_source=footer>
>>> .

Hiram Clawson

unread,
Aug 12, 2017, 2:39:27 AM8/12/17
to Zeta Mui, Cath Tyner, UCSC Genome Browser Public Help Forum
Good Evening Zeta:

Pardon me, you are correct, 1.04 is newer that my mention of 1.03.66

Try running with [multiple] only on the query and not on the target.

What lastz does with [multiple] is to take all the sequences
in query.fa[multiple] and concatenate them all together as if they
were one single sequence without any breaks. It remembers where
the individual records begin and end for later breakup of the
results. Then it takes that single sequence and runs the alignment
between it and every individual sequence in the target. It then takes
the resulting output and breaks it up back into the individual pieces
that were in the original query.fa input.

It does this for efficiency. If, for example, query.fa had Q number
of records, and target.fa had T number of records, the number of individual
lastz operations would be: Q * T. With the [multiple] operation, this
turns Q into 1, and thus the number of individual lastz operations
are simply T = 1 * T. You can imagine the bookkeeping necessary to keep
track of everything in query.fa when it all becomes one sequence.

I usually run lastz runs between target.2bit files of no more than about
100 sequences or 20,000,000 bases whichever comes first, and query.2bit[multiple] files
that can have as many as 1,000 sequences or 10,000,000 bases whichever comes
first. Such a job will have a run time of approximately 30 to 60 minutes,
which is well suited for a cluster of 1,000 cores. I am for a count
of target(N) and query(M) pieces such that N * M < 100,000 and thus the entire
cluster run is only a couple of hours, perhaps half a day at most.

--Hiram

On 8/11/17 10:54 PM, Zeta Mui wrote:
> Hi Hiram,
>
> I thought 1.04 is the latest (according to https://www.bx.psu.edu/~rsharris/lastz/) Which version is the latest?
>
> Best,
> Zeta

Zeta Mui

unread,
Aug 14, 2017, 11:38:40 AM8/14/17
to Hiram Clawson, Cath Tyner, UCSC Genome Browser Public Help Forum
Hi Hiram,

I thought 1.04 is the latest (according to https://www.bx.psu.edu/~rsharris/lastz/) Which version is the latest?

Best,
Zeta

On 12 Aug 2017, at 1:04 AM, Hiram Clawson <hi...@soe.ucsc.edu> wrote:

Zeta Mui

unread,
Aug 14, 2017, 4:38:46 PM8/14/17
to Hiram Clawson, Cath Tyner, UCSC Genome Browser Public Help Forum
Dear Hiram/Cath,

According to the LASTZ page, they suggested using [multiple] on target but not on query, and Hiram suggested vice versa. Does that mean as long as I don’t use multiple on both, the axtChain will not give me errors? (just making sure)

I didn’t have problems running LASTZ. The error message is only seen at the axtChain step. Hopefully re-running the LASTZ step with only using [multiple] on query can solve the problem in axtChain. It seems to me that using [multiple] on both target and query is only affecting computational demands, but it still doesn’t explain why it would cause bugs later on in another program.

Best,
Zeta

Jairo Navarro Gonzalez

unread,
Aug 22, 2017, 7:56:30 PM8/22/17
to Zeta Mui, UCSC Genome Browser Public Help Forum

Hello Zeta,

Thank you for using the UCSC Genome Browser and further explaining your issues.

From the axtChain error in the original email:

query Chr01 block 51435129-51435189 exceeds sequence length 51433939

axtChain sees that the size of the query, Chr01, in the input sequence file is 51,433,939 bp, but then the PSL input has a query range with coordinates greater than this size. So either the size is wrong, or LASTZ's math is going wrong somewhere. The command lines you shared appear to have the correct argument order, i.e., mafToPsl wants the query then target, and the other tools should have the target then query. Please make sure all arguments for target and query are in the correct order for all commands.

Is 51,433,939 indeed the correct length for pvulv2 Chr01? The next thing to look at would be the output of LASTZ; in gmaxv2_vs_pvulv2.maf, is there a MAF block in which pvulv2 Chr01 has coordinates greater than 51,433,939?

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu.
All messages sent to that address are archived on a publicly-accessible Google Groups forum.
If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Jairo Navarro 
UCSC Genomics Institute


--

---
You received this message because you are subscribed to the Google Groups "UCSC Genome Browser Public Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to genome+un...@soe.ucsc.edu.
To post to this group, send email to gen...@soe.ucsc.edu.
Visit this group at https://groups.google.com/a/soe.ucsc.edu/group/genome/.

Zeta Mui

unread,
Aug 29, 2017, 11:28:54 AM8/29/17
to UCSC Genome Browser Public Help Forum
Dear Jairo,

I have checked the sequence record size of pvulv2 Chr01 and it is indeed 51,433,939 (from the faSize utility) and I have also checked the .maf file, the last MAF block with Chr01 being the query is the following:

a score=5122
s Chr19 39798530 68 + 50746916 ATGATTGCATATGAATTGCTTTGAAAGTGCTATCATGGCACCACCAAGGTGACAGATGAGCTAGTGCA
s Chr01 51433871 68 + 51433939 ATGATTACATGTGAATTGCTTTGACAGTGTTATCATGATACCTCCAAGGTGACAGATGAGCTAGTGCA

which does not exceed 51,433,939.

With Hiram's suggestion of only using [multiple] on the target, I had successfully passed through the axtChain step and is working on netting the chain files, so at least I am not stuck now. Though we can bypass the problem with this solution, it still makes sense to figure out why using [multiple] on both in the LASTZ step would cause such a problem later on in axtChain. Is it because the .psl files produced by using only [multiple] on target sequence will have the aligned blocks of one query sequence record to the target sequence together in one file for axtChain to work on?

Many thanks for the attention given to this matter!

Best,
Zeta
--
Best,
Zeta



--
Best,
Zeta

Hiram Clawson

unread,
Aug 29, 2017, 12:16:39 PM8/29/17
to Zeta Mui, UCSC Genome Browser Public Help Forum
Good Morning Zeta:

You can forward this problem to the lastz mail list to see if
there are any ideas there. We won't be working on this issue.

--Hiram
Reply all
Reply to author
Forward
0 new messages