error about axtChain


bettycatherine

Mar 23, 2018, 11:36:15 AM
to genome
Dear UCSC staff,
I am working with the axtChain program and have run into a problem for which I cannot find a solution anywhere.
My command was as follows:
axtChain -linearGap=medium procap.chohof.axt procap chohof procap.chohof.axt.chain
The error message was as follows:
Symbol count 37 != 0 inconsistent between sequences line -1371471456 and prev line of procap.chohof.axt
I suspect that the axt file is too big to analyze. It is about 1 TB because the two genomes are highly fragmented (procap has 277110 scaffolds and chohof has 454510 scaffolds), but I am not sure, and I am wondering whether there is a solution for this error.
The nib files are not the problem, because alignments of each of these two species against other species worked well.
I look forward to your response.
 
Best wishes!
Yours,
Xue Lv
2018-03-23


Jairo Navarro Gonzalez

Mar 27, 2018, 3:43:49 PM
to bettycatherine, genome

Hello Xue,

Thank you for using the UCSC Genome Browser and for your inquiry.

One of our engineers notes that the error you are seeing looks like an overflow in a counter (note the negative line number in the message), most likely caused by the immense input file. We suspect you have all of the lastz output in a single 1 TB file instead of splitting the alignment into a list of parts. Although both of these assemblies have many scaffolds, we have created alignments between similarly fragmented assemblies. For an example of such an alignment, look at the tarSyr2 vs. tupBel1 alignment.
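
To illustrate why a negative line number suggests counter overflow, here is a minimal Python sketch. It assumes the line counter is a signed 32-bit integer that wrapped around exactly once; that is an assumption for illustration, not something confirmed from the axtChain source.

```python
import ctypes

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit counter can hold

# Line number exactly as printed in the axtChain error message.
reported_line = -1371471456

# Assumption: the counter wrapped around exactly once, so adding 2**32
# gives the smallest true line count consistent with the report.
likely_line = reported_line + 2**32

# Round trip: storing the recovered value in a 32-bit signed int
# reproduces the negative number from the error message.
wrapped_again = ctypes.c_int32(likely_line).value

print(likely_line)                     # candidate true line number
print(likely_line > INT32_MAX)         # True if it exceeds the 32-bit limit
print(wrapped_again == reported_line)  # True if the round trip matches
```

Under that assumption, the file would contain at least about 2.9 billion lines, well past what a 32-bit signed counter can represent.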

You can download the scripts used to generate this alignment using the following link: http://hgwdev.cse.ucsc.edu/~jairo/MLQ/21158/tarSyr2VsTupBel1.tar.gz

-rwxrwxr-x 1 1826 Mar 27  2015 run.blastz/doPartition.bash
-rwxrwxr-x 1  606 Mar 27  2015 run.blastz/doClusterRun.csh
-rw-rw-r-- 1  150 Mar 27  2015 run.blastz/gsub
-rwxrwxr-x 1   72 Mar 27  2015 run.cat/cat.csh
-rwxrwxr-x 1  813 Mar 27  2015 run.cat/doCatRun.csh
-rw-rw-r-- 1   81 Mar 27  2015 run.cat/gsub
-rwxrwxr-x 1  341 Mar 27  2015 axtChain/run/chain.csh
-rwxrwxr-x 1  711 Mar 27  2015 axtChain/run/doChainRun.csh
-rw-rw-r-- 1   73 Mar 27  2015 axtChain/run/gsub
-rwxrwxr-x 1 1681 Mar 27  2015 axtChain/netChains.csh
-rwxrwxr-x 1 1130 Mar 27  2015 axtChain/loadUp.csh

Note the DEF file for the lastz run:

# Tarsier vs Tree shrew
BLASTZ=/cluster/bin/penn/lastz-distrib-1.04.00/bin/lastz
BLASTZ_M=50

# TARGET: Tarsier tarSyr2
SEQ1_DIR=/hive/data/genomes/tarSyr2/tarSyr2.2bit
SEQ1_LEN=/hive/data/genomes/tarSyr2/chrom.sizes
SEQ1_CHUNK=20000000
SEQ1_LIMIT=1500
SEQ1_LAP=10000

# QUERY: Tree shrew tupBel1
SEQ2_DIR=/hive/data/genomes/tupBel1/tupBel1.2bit
SEQ2_LEN=/hive/data/genomes/tupBel1/chrom.sizes
SEQ2_CHUNK=10000000
SEQ2_LIMIT=2000
SEQ2_LAP=0

BASE=/hive/data/genomes/tarSyr2/bed/lastzTupBel1.2015-03-27
TMPDIR=/dev/shm

This means: package up to 1,500 sequences into one chunk for tarSyr2 (SEQ1_LIMIT), as long as the total sequence length is less than 20,000,000 bases (SEQ1_CHUNK). Likewise, package up to 2,000 sequences into one chunk for tupBel1 (SEQ2_LIMIT), as long as the total sequence length is less than 10,000,000 bases (SEQ2_CHUNK).
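
The packing rule can be sketched in Python as a simple greedy grouping. This is only an illustration of the rule as described here, not the code in our scripts: the function name `pack_chunks` and the example scaffold names are hypothetical, and this sketch does not split sequences that are longer than the chunk size or apply the overlap (LAP) setting.

```python
def pack_chunks(sizes, chunk, limit):
    """Greedily group (name, length) pairs into parts.

    A part is closed when adding the next sequence would exceed
    `chunk` total bases, or when it already holds `limit` sequences.
    A single sequence longer than `chunk` ends up alone in its own part.
    """
    parts, current, total = [], [], 0
    for name, length in sizes:
        if current and (len(current) >= limit or total + length > chunk):
            parts.append(current)
            current, total = [], 0
        current.append(name)
        total += length
    if current:
        parts.append(current)
    return parts


# Hypothetical scaffold sizes, packed with the SEQ1 values from the DEF file.
sizes = [("s1", 12_000_000), ("s2", 9_000_000), ("s3", 5_000_000), ("s4", 4_000_000)]
print(pack_chunks(sizes, chunk=20_000_000, limit=1500))

# With many tiny scaffolds, the sequence-count limit is what closes a part.
tiny = [(name, 100) for name in "abcde"]
print(pack_chunks(tiny, chunk=20_000_000, limit=2))
```

For highly fragmented assemblies, the count limit is what keeps any one part from holding hundreds of thousands of small scaffolds.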

All of the partitioning is built into the scripts we use here. These scripts created 367 individual files for tarSyr2 and 393 files for tupBel1, for a total of 144,231 cluster jobs:

144,231 = 367 * 393
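
The job count is a product because every target part is aligned against every query part. A small Python sketch of that pairing follows; the part file names are hypothetical placeholders, not the names our scripts produce.

```python
from itertools import product

# Hypothetical part names: 367 target parts and 393 query parts.
target_parts = [f"target.part{i:03d}" for i in range(367)]
query_parts = [f"query.part{i:03d}" for i in range(393)]

# One cluster job per (target part, query part) pair.
jobs = list(product(target_parts, query_parts))

print(len(jobs))
print(jobs[0])
```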

The chaining ran on the 367 resulting alignment files, one for each part of the target, tarSyr2, and none of those files was larger than 120 MB. We also don't use the raw axt files from lastz for chaining; instead, we convert the lastz results into PSL files and run axtChain on the PSL files. PSL files are probably much smaller than the equivalent axt files, which would make a big difference here.
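
As a rough sketch of what running axtChain on PSL parts could look like, the Python below only builds one command line per part and prints it; it does not run anything. It assumes the `-psl` option listed in axtChain's usage message and reuses the `procap` and `chohof` sequence arguments from your original command. The part file names and the helper `chain_command` are hypothetical.

```python
import shlex


def chain_command(psl_part, target_seq, query_seq, out_chain, linear_gap="medium"):
    """Build (but do not run) an axtChain command for one PSL part."""
    return [
        "axtChain",
        "-psl",                        # read PSL input instead of axt
        f"-linearGap={linear_gap}",
        psl_part,
        target_seq,
        query_seq,
        out_chain,
    ]


# Hypothetical per-part PSL files; one chain file is written per part.
psl_parts = ["part000.psl", "part001.psl"]
commands = [
    chain_command(part, "procap", "chohof", part.replace(".psl", ".chain"))
    for part in psl_parts
]

for command in commands:
    print(shlex.join(command))
```

Chaining each part separately keeps every input file small, so no single run has to read anything close to 1 TB.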

Insufficient masking in each genome could also be a problem. If the repeats are not masked, lastz will produce far too much alignment output. You should mask both assemblies with RepeatMasker and WindowMasker to eliminate the extra repeats.
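
One quick way to gauge how much of an assembly is masked is to count lowercase bases in its FASTA file. The Python sketch below assumes the assembly is soft-masked (masked bases written in lowercase); the function name and the tiny example records are hypothetical.

```python
def soft_masked_fraction(fasta_lines):
    """Return the fraction of bases that are lowercase (soft-masked).

    Header lines starting with '>' and blank lines are skipped.
    N bases count toward the total but not toward the masked count.
    """
    masked = total = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return masked / total if total else 0.0


# Tiny made-up example; for a real file, pass an open file handle instead.
example = [">scaffold1", "ACGTacgtNNNN", ">scaffold2", "acgtACGT"]
print(soft_masked_fraction(example))
```

A very low fraction in either assembly would be a hint that repeat masking was skipped or incomplete.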

You may also find the following previously answered questions helpful:

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu.
All messages sent to that address are archived on a publicly-accessible Google Groups forum.
If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Jairo Navarro 
UCSC Genomics Institute

