Good Evening Emilio:
There are two issues to resolve before your data will work through the pipeline:
1. the sequence names need to be cleaned up
2. the order of arguments on the axtChain command needs to be corrected
Different programs in the pipeline read the fasta files differently and
decide for themselves what the sequence names are. As a result, the names in
your sequence fasta files do not match the names in your .axt and .maf alignment results.
Your .axt and .maf files have clean names of the pattern:
Lsat_1_v4_lg_1
Lsat_1_v4_lg_7
Lsat_1_v4_lg_6
Lsat_1_v4_lg_9
Lsat_1_v4_lg_3
jamesonii.5_scaffold5
jamesonii.2442_scaffold2442
jamesonii.904_scaffold904
jamesonii.6059_scaffold6059
... etc ...
Your fasta sequence files have names with extra strings appended to them:
$ grep "^>" Merged_Lsat_1_v4.fa
>Lsat_1_v4_lg_1:1..252823024
>Lsat_1_v4_lg_2:1..269152615
>Lsat_1_v4_lg_3:1..268469215
>Lsat_1_v4_lg_4
>Lsat_1_v4_lg_5
>Lsat_1_v4_lg_6:1..244784414
>Lsat_1_v4_lg_7:1..242922368
>Lsat_1_v4_lg_8
>Lsat_1_v4_lg_9:1..252881930
$ grep "^>" renamed.gjcombsk44.1000.scaffolds.fa | head
>jamesonii.1_scaffold1|size6467
>jamesonii.2_scaffold2|size6025
>jamesonii.3_scaffold3|size5833
>jamesonii.4_scaffold4|size5657
>jamesonii.5_scaffold5|size5486
>jamesonii.6_scaffold6|size5470
>jamesonii.7_scaffold7|size5165
>jamesonii.8_scaffold8|size5143
>jamesonii.9_scaffold9|size4850
>jamesonii.10_scaffold10|size4653
... etc ...
You can clean these sequence files with sed:
sed -e '/^>/ s/:1\.\.[0-9]*//' Merged_Lsat_1_v4.fa > Merged_Lsat_1_v4.cleanNames.fa
sed -e 's/|size.*//;' \
renamed.gjcombsk44.1000.scaffolds.fa > cleanNames.gjcombsk44.1000.scaffolds.fa
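One way to confirm that the cleaned names now agree with the alignment files is to list any name used in the .axt file that has no matching header in the fasta file. This is only a sketch: the function name names_missing_from_fasta is made up for illustration, and it assumes the standard axt layout in which each alignment block starts with a nine-field summary line (target name in column 2, query name in column 5).

```shell
# Print names used in an .axt file that have no matching fasta header.
# Usage: names_missing_from_fasta <column> <fasta> <axt>
#   column 2 = target name, column 5 = query name (assumed axt summary layout)
names_missing_from_fasta() {
    grep '^>' "$2" | awk -v col="$1" '
        # first input (stdin): fasta header lines, strip the leading ">"
        FNR == NR { seen[substr($1, 2)] = 1; next }
        # second input: axt file, summary lines are the ones with 9 fields
        NF == 9 && !($col in seen) { print $col }
    ' - "$3" | sort -u
}
```

For example, names_missing_from_fasta 2 Merged_Lsat_1_v4.cleanNames.fa lettuce.vs.gjk44.1000.axt should print nothing once the target names match, and the same call with column 5 and the cleaned scaffold file checks the query side.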
Next, in your axtChain command, you have reversed the order of your target
and query sequence files. The corrected command should look something like:
axtChain -faQ -faT -linearGap=loose \
lettuce.vs.gjk44.1000.axt Merged_Lsat_1_v4.cleanNames.fa \
cleanNames.gjcombsk44.1000.scaffolds.fa lsat_gj.chain
Your result is only two tiny chains:
##matrix=axtChain 16 91,-114,-31,-123,-114,100,-125,-31,-31,-125,100,-114,-123,-31,-114,91
##gapPenalties=axtChain O=400 E=30
chain 1076 Lsat_1_v4_lg_5 409821257 + 54712450 54712476 jamesonii.13070_scaffold13070 1381 + 1218 1244 1
26
chain 1059 Lsat_1_v4_lg_4 412366245 + 17188707 17188732 jamesonii.17524_scaffold17524 1263 + 1014 1039 2
25
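To see at a glance how little sequence these chains cover, you can total the spans from the chain header lines. This is a rough sketch, with chain_summary being a made-up helper name. It assumes each chain header is on a single line with the usual field order (tStart and tEnd in fields 6 and 7, qStart and qEnd in fields 11 and 12).

```shell
# Count chains and total the target/query span covered by them.
# Reads a chain file given as an argument, or stdin when no argument is given.
chain_summary() {
    awk '$1 == "chain" { n++; t += $7 - $6; q += $12 - $11 }
         END { printf "chains=%d target_bases=%d query_bases=%d\n", n, t, q }' "$@"
}
```

Running chain_summary lsat_gj.chain on the two chains shown here would report 51 bases on each side, out of a 2.7 billion base target.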
I note your target sequence has no repeat masking, and is 27% gap sequence:
$ faSize Merged_Lsat_1_v4.cleanNames.fa
2703774318 bases (731646813 N's 1972127505 real 1972127505 upper 0 lower)
in 9 sequences in 1 files
Total size: mean 300419368.7 sd 70656810.6 min 242922367 (Lsat_1_v4_lg_7)
max 412366245 (Lsat_1_v4_lg_4) median 268469214
%0.00 masked total, %0.00 masked real
(gap sequence -> 100 * 731646813 / 2703774318 = 27.1%)
This could dramatically alter the quality of the results you obtain
from the lastz alignment.
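The gap percentage quoted above can be computed directly from the faSize output rather than by hand. A minimal sketch, assuming the first line of faSize output starts with the total base count followed by "bases" and then the N count in parentheses, as in the output shown; gap_percent is a made-up helper name.

```shell
# Read faSize output on stdin and print the percentage of bases that are N.
gap_percent() {
    awk 'NR == 1 {
        gsub(/\(/, "", $3)                 # third field looks like "(731646813"
        printf "%.1f\n", 100 * $3 / $1     # N count over total bases
    }'
}
```

Usage would be along the lines of: faSize Merged_Lsat_1_v4.cleanNames.fa | gap_percent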
In addition, your query sequence is also not repeat masked, and the contigs are
very small:
$ faSize cleanNames.gjcombsk44.1000.scaffolds.fa
49265962 bases (3197 N's 49262765 real 49262765 upper 0 lower)
in 35219 sequences in 1 files
Total size: mean 1398.8 sd 427.0 min 1000 (jamesonii.35098_scaffold35099)
max 6467 (jamesonii.1_scaffold1) median 1261
%0.00 masked total, %0.00 masked real
It is going to be difficult to obtain good results from this alignment.
--Hiram