axtChain Segmentation Fault Error

Emilio A. Alvarado Ortiz

Aug 20, 2015, 6:33:55 PM

to genome...@soe.ucsc.edu

Hello,

I am currently following the LASTZ alignment pipeline. I first aligned two whole plant genomes using LASTZ, with output in MAF format. Then I used mafToAxt to continue with the chain and network steps. However, I get a “Segmentation fault” message when I run axtChain. The command that I use to execute the program, and its output, are as follows:

$ ./axtChain -verbose=2 -linearGap=loose lettuce.vs.gerk44.1K.Sort.axt -faT renamed.gercombsk44.1000.scaffolds.fa -faQ Merged_Lsat_1_v4.fa lsat_hybrida.chain
using loose gap costs (chicken/human)
4 blocks after duplicate removal
Segmentation fault

Would you be able to assist me with this issue? I would appreciate your help.

Thank you,

-Emilio

Steve Heitner

Aug 20, 2015, 7:54:08 PM

to Emilio A. Alvarado Ortiz, genome...@soe.ucsc.edu

Hello, Emilio.

You could try running axtChain with -verbose=5 to see if it provides more helpful information to troubleshoot the problem. If this does not help, could you provide us with your data files so we can attempt to recreate the problem? If this is the case, you can contact me directly so we can address this off of the list.

Please contact us again at genome...@soe.ucsc.edu if you have any further questions. Questions sent to that address will be archived in a publicly-accessible forum for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.

---
Steve Heitner
UCSC Genome Bioinformatics Group

Hiram Clawson

Aug 21, 2015, 3:01:12 AM

to Emilio A. Alvarado Ortiz, genome...@soe.ucsc.edu

Good Evening Emilio:

There are a couple of issues to resolve to get your data to function:
1. sequence names can be cleaned up
2. correct order of arguments on the axtChain command

Due to the way different programs in the pipeline read the fasta files and
how they decide what the sequence names are, the names in your sequence fasta
files differ from the names in your .axt and .maf alignment results.
Your .axt and .maf files have clean names of the pattern:
Lsat_1_v4_lg_1
Lsat_1_v4_lg_7
Lsat_1_v4_lg_6
Lsat_1_v4_lg_9
Lsat_1_v4_lg_3
jamesonii.5_scaffold5
jamesonii.2442_scaffold2442
jamesonii.904_scaffold904
jamesonii.6059_scaffold6059
... etc ...

Your fasta sequence files have names with extra strings on them of the format:
$ grep "^>" Merged_Lsat_1_v4.fa
>Lsat_1_v4_lg_1:1..252823024
>Lsat_1_v4_lg_2:1..269152615
>Lsat_1_v4_lg_3:1..268469215
>Lsat_1_v4_lg_4
>Lsat_1_v4_lg_5
>Lsat_1_v4_lg_6:1..244784414
>Lsat_1_v4_lg_7:1..242922368
>Lsat_1_v4_lg_8
>Lsat_1_v4_lg_9:1..252881930

$ grep "^>" renamed.gjcombsk44.1000.scaffolds.fa | head
>jamesonii.1_scaffold1|size6467
>jamesonii.2_scaffold2|size6025
>jamesonii.3_scaffold3|size5833
>jamesonii.4_scaffold4|size5657
>jamesonii.5_scaffold5|size5486
>jamesonii.6_scaffold6|size5470
>jamesonii.7_scaffold7|size5165
>jamesonii.8_scaffold8|size5143
>jamesonii.9_scaffold9|size4850
>jamesonii.10_scaffold10|size4653
... etc ...
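
One way to spot this kind of mismatch before running axtChain is to compare the names in the .axt file with the names in the fasta headers. The following is a minimal Python sketch, not part of the UCSC tools. It assumes the fasta name is the first whitespace-delimited word after ">" and that each .axt summary line has the usual nine fields (number, target name, start, end, query name, start, end, strand, score). The function names and the coordinates in the example are made up for illustration.

```python
def fasta_names(lines):
    """Return the set of names: the first word after '>' on each header line."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def axt_names(lines):
    """Return (target_names, query_names) from the summary lines of an .axt file."""
    targets, queries = set(), set()
    for line in lines:
        fields = line.split()
        # Summary line: number tName tStart tEnd qName qStart qEnd strand score
        if len(fields) == 9 and fields[0].isdigit() and fields[7] in ("+", "-"):
            targets.add(fields[1])
            queries.add(fields[4])
    return targets, queries


# Hypothetical miniature inputs; the coordinates and score are invented.
fasta = [">Lsat_1_v4_lg_1:1..252823024", "ACGT", ">Lsat_1_v4_lg_4", "ACGT"]
axt = ["0 Lsat_1_v4_lg_1 100 103 jamesonii.5_scaffold5 10 13 + 300",
       "ACGT", "ACGT", ""]

targets, queries = axt_names(axt)
missing = sorted(targets - fasta_names(fasta))
print(missing)  # names used in the .axt that the fasta file does not contain
```

Any name reported here is one that the chaining step would be unable to look up in the sequence file.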

You can clean these sequence files with sed:
sed -e 's/:1..2.*//;' Merged_Lsat_1_v4.fa > Merged_Lsat_1_v4.cleanNames.fa

sed -e 's/|size.*//;' \
renamed.gjcombsk44.1000.scaffolds.fa > cleanNames.gjcombsk44.1000.scaffolds.fa
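
If sed is not convenient, the same cleanup can be sketched in Python. This is only an illustration under the assumption that the suffixes always look like the two styles listed above (":start..end" and "|size<number>"); the names `SUFFIX` and `clean_header` are hypothetical. Unlike a bare substitution, it only touches header lines.

```python
import re

# The two suffix styles seen in the headers: ":1..252823024" and "|size6467".
SUFFIX = re.compile(r"(:\d+\.\.\d+|\|size\d+)$")


def clean_header(line):
    """Strip a trailing coordinate range or size tag from a fasta header line."""
    if not line.startswith(">"):
        return line  # sequence lines pass through untouched
    body = line.rstrip("\n")
    newline = line[len(body):]  # keep whatever line ending was there
    return SUFFIX.sub("", body) + newline


print(clean_header(">Lsat_1_v4_lg_1:1..252823024"))
print(clean_header(">jamesonii.1_scaffold1|size6467"))
```

To rewrite a whole file, apply `clean_header` to each line read from the input and write the result to a new file, leaving the original in place.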

And then, in your axtChain command, you have reversed the order of your target
and query sequence files. You should now have something like:
axtChain -faQ -faT -linearGap=loose \
lettuce.vs.gjk44.1000.axt Merged_Lsat_1_v4.cleanNames.fa \
cleanNames.gjcombsk44.1000.scaffolds.fa lsat_gj.chain

Your result is only two tiny chains:

##matrix=axtChain 16 91,-114,-31,-123,-114,100,-125,-31,-31,-125,100,-114,-123,-31,-114,91
##gapPenalties=axtChain O=400 E=30
chain 1076 Lsat_1_v4_lg_5 409821257 + 54712450 54712476 jamesonii.13070_scaffold13070 1381 + 1218 1244 1
26

chain 1059 Lsat_1_v4_lg_4 412366245 + 17188707 17188732 jamesonii.17524_scaffold17524 1263 + 1014 1039 2
25
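
To see how small these chains are, the header fields can be read off programmatically. Below is a minimal Python sketch, assuming the standard chain header layout (chain, score, tName, tSize, tStrand, tStart, tEnd, qName, qSize, qStrand, qStart, qEnd, id) with each header written on a single line. `ChainHeader` and `parse_chain_header` are hypothetical names, not part of any UCSC library.

```python
from collections import namedtuple

ChainHeader = namedtuple(
    "ChainHeader",
    "score t_name t_size t_strand t_start t_end "
    "q_name q_size q_strand q_start q_end chain_id",
)


def parse_chain_header(line):
    """Split one 'chain ...' header line into named, typed fields."""
    fields = line.split()
    if len(fields) != 13 or fields[0] != "chain":
        raise ValueError("not a chain header line: %r" % line)
    (score, t_name, t_size, t_strand, t_start, t_end,
     q_name, q_size, q_strand, q_start, q_end, chain_id) = fields[1:]
    return ChainHeader(int(score), t_name, int(t_size), t_strand,
                       int(t_start), int(t_end), q_name, int(q_size),
                       q_strand, int(q_start), int(q_end), chain_id)


header = parse_chain_header(
    "chain 1076 Lsat_1_v4_lg_5 409821257 + 54712450 54712476 "
    "jamesonii.13070_scaffold13070 1381 + 1218 1244 1"
)
# Span covered on the target and on the query, in bases.
print(header.t_end - header.t_start, header.q_end - header.q_start)
```

For the first chain both spans come out to 26 bases, which matches the single ungapped block size listed under its header.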

I note your target sequence has no repeat masking, and is 27% gap sequence:

$ faSize Merged_Lsat_1_v4.cleanNames.fa
2703774318 bases (731646813 N's 1972127505 real 1972127505 upper 0 lower)
in 9 sequences in 1 files
Total size: mean 300419368.7 sd 70656810.6 min 242922367 (Lsat_1_v4_lg_7)
max 412366245 (Lsat_1_v4_lg_4) median 268469214
%0.00 masked total, %0.00 masked real

(gap sequence -> 100 * 731646813 / 2703774318 = 27.1%)
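
The arithmetic above, and what the "upper" and "lower" counts mean, can be illustrated with a short Python sketch. It assumes the usual soft-masking convention (repeats written in lower case) and is not faSize's actual implementation; `base_counts` and `percent` are hypothetical helper names.

```python
def base_counts(sequence):
    """Count N, upper-case and lower-case (soft-masked) bases in a sequence."""
    n = sum(1 for base in sequence if base in "Nn")
    lower = sum(1 for base in sequence if base.islower() and base != "n")
    real = len(sequence) - n
    return {"total": len(sequence), "n": n, "real": real,
            "upper": real - lower, "lower": lower}


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Numbers taken from the faSize output for the target sequence.
gap_pct = percent(731646813, 2703774318)
masked_pct = percent(0, 2703774318)
print(round(gap_pct, 1), round(masked_pct, 2))

# Toy sequence: 2 N's, 4 unmasked bases, 4 soft-masked bases.
print(base_counts("ACGTNNacgt"))
```

A target with zero lower-case bases, as here, has had no soft-masking applied at all.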

This could dramatically alter the quality of the results you obtain
from the lastz alignment.

Plus, your query sequence is also not repeat masked, and the contigs are
very small:
$ faSize cleanNames.gjcombsk44.1000.scaffolds.fa
49265962 bases (3197 N's 49262765 real 49262765 upper 0 lower)
in 35219 sequences in 1 files
Total size: mean 1398.8 sd 427.0 min 1000 (jamesonii.35098_scaffold35099)
max 6467 (jamesonii.1_scaffold1) median 1261
%0.00 masked total, %0.00 masked real

It is going to be difficult to obtain good results from this alignment.

--Hiram
