ATHLATES producing empty *typing.txt output

530 views
Skip to first unread message

Andrew Oler

unread,
Oct 2, 2013, 1:33:18 PM10/2/13
to viral-to...@googlegroups.com
Hi,

I'm trying to run Athlates to get HLA-DRB types on several exome-seq datasets. 

I downloaded all HLA sequences from IMGT and aligned the reads with novoalign:
novoalign -d hla.nix -F STDFQ -f ${f}.1.fq ${f}.2.fq -t 30 -o SAM -r all -l 80 -e 100 -i PE 225 140

Then I filtered out the DRB and non_DRB sequences:
samtools view -b -L DRB.bed $bam | samtools sort -n - DRB/${bam/.bam/}
samtools view -b -L non_DRB.bed $bam | samtools sort -n - non_DRB/${bam/.bam/}

I downloaded the latest DRB msa from here:
http://hla.alleles.org/data/txt/drb_nuc.txt

And I ran Athlates like this:
typing -bam DRB/$bam -exlbam non_DRB/$bam -msa drb_nuc.txt -o output/DRB/${bam/.bam/} > output/DRB/${bam/.bam/}.log.txt

It appears that the bam files and msa files are being read properly, and the reads are assembled into contigs, but when it tries to determine the types, it always has no output. 

Here is an example output log.txt file:
--------------------------------------------------------
Running cmd:
typing     -bam DRB/HLA_CJF_32.bam -exlbam non_DRB/HLA_CJF_32.bam -msa drb_nuc.txt -hd 2 -o output/DRB/HLA_CJF_32_DRB
--------------------------------------------------------
    Obtain (un)paired-read names in non_DRB/HLA_CJF_32.bam
        #s&p names: 1294, 10929
        create file: output/DRB/HLA_CJF_32_DRB.raw.fa
    Check BAM header of DRB/HLA_CJF_32.bam
    [WARNING]: DRB/HLA_CJF_32.bam sort order: unsorted
        Input BAM file should be sorted by queryname
        101335 ref sequence(s) found; avg len = 636
        0 platform(s) found:
    Merge PE for DRB/HLA_CJF_32.bam
        create file: output/DRB/HLA_CJF_32_DRB.unpair.fa
        create file: output/DRB/HLA_CJF_32_DRB.pair.fa
        Total, merged, unmerged, singleton:
             8942, 340(181 overlap), 1, 21
        target, non_target, ratio:
        951    248    0.260778
    Discard reads containing unique 50-mers
            average freq = 6
        79 reads discarded
    Assemble 363 reads...
        iter: 0
        seed_k = 151    141    131    121    111    101    91    81    71    61    51    41    40   
        iter: 1
        seed_k = 151    141    131    121    111    101    91    81    71    61    51    41    40   
    13 contigs created
        create file: output/DRB/HLA_CJF_32_DRB.contig.fa
        create file: output/DRB/HLA_CJF_32_DRB.contig.detail.txt
    Read in cDNA MSA: drb_nuc.txt
        1454 refs in IMGT/HLA database;  MSA len = 813
        Exon starts:    1    105    381    665    776    800   
    Typing ...
        seed contigs...
        go through each cDNA...
        create file: output/DRB/HLA_CJF_32_DRB.typing.txt
Program finished successfully !


Here is the contig.fa file that was produced:

cat output/DRB/HLA_CJF_32_DRB.contig.fa
>contig 0     len = 291     cov = 5.8866
CATCTCTGACCAGCGACTGATGATGCTATTGTAATCAGATGCTGATTCGTTCTCCAACACTAGATTACCCAATCCACAAGCAAGGAAATCAGTAACTTCTTCCCTATAATTTGGAATGTGGGTGGAGAGGGGTCATAGTTCTCCCTGAGTGAGACTCACCTGCTCCTCTGGCCCCTGGTCCTGTCCTGTTCTCCAGCATGGTGTGTCTGAAGCTCCCTGGAGGCTCCTGCATGGCAGCTCTGACAGTGACACTGATGGTGCTGAGCTCCCCACTGGCTTTGGCTGGGGACA
>contig 1     len = 488     cov = 14.3443
ATGAGCTCCTGGGCTGCAGGTGGTGGGCGTTGCGGGTGGGGCCGGTTAAGGTTCCAGCGCCCGCACCCCGCTCAGGGAGCCCCGGATGGTGGCGTCGCTGTCCGTGTCTTCTCAGGAGGCCGCCCTGTGACCGGATGGTTCGTGTCCCCACAGCACGTTTCCTGTGGCAGCCTAAGAGGGAGTGTCATTTCTTCAATGGGACGGAGCGGGTGCGGTTCCTGGACAGATACTTCTATAACCAGGAGGAGTCCGTGCGCTTCGACAGCGACGTGGGGGAGTTCCGGGCGGTGACGGAGCTGGGGCGGCCTGACGCTGAGTACTGGAACAGCCAGAAGGACATCCTGGAGCAGGCGCGGGCCGCGGTGGACACCTACTGCAGACACAACTACGGGGTTGTGGAGAGCTTCACAGTGCAGCGGCGAGGTGAGCATGGCCGCGGGCGGGGCCTGAGTCCCCGTGAGCTGGGAATCTGAGTGTGTGTGTGTGTG
>contig 2     len = 660     cov = 20.9091
GAATAGATTTGCTTCCATACCTTTTAAGTGTATATCTTTTCCTCTTTTTCCCAGGGCACTCTGGACTTCACCCAACAGGTAATACCTTTTAATCCTCTTTTAGAAACAGATTCAGTTTTCCTAGAATGATGGCAGAGGTGATAAGGCATGAGACAGAAATAGCAGGAAAGACTTTGGATCCAAATTCCTGATCAGGCAATTTATACCAAAACTCCTCCTCTCCACTTAGGCCTGTGCTCTGCAGGAGTATTGGTTCAGGGAGACTTAGGAACTTGTTTTTCTTCTTCCTGCAGTGCTCTCATCTGAGTCCTTGAAAGAGAGGAAAAGAAGCTGTTAGTGGAACCAGGTCTGAAAACAACACTTTCCTCTCTCTCTGCAGGACTCGTGAGCTGAAGTGCAGATGACCACATTCAAGGGGGAACCTTCTGCCCCAGCTTTGCATGATGAAAAGCTTTCCTGCTTGGCTCTTATTCTTCCACAAGAGAGGACTTTCTCAGGCCCTGGTTGCTACCGGTTCAGCAACTCTGCAGAAAATGTCCATCCTTGTGGCTTCCTCAGCTCCTGCCCTTGGCCTGAAGTCCCAGCATTGATGGCAGTGCCTCATCTTCAACTTTAGTGCTCCCCTTTACCTAACCCTACGGCCTCCCATGCATCTGTACT
>contig 3     len = 333     cov = 14.7477
CCCACTATAAAGACTAAATTCCAGGCTGATGACACTGTGAGGCCTCATGGCCAGCTGTGCTGGAGGCCTGGTCAAGGCCAGAGCCTGGGTTTACAGAGAAGCAGACAAACAGCCAAAAAGGGAGATTCACTCTGTCTTCCTGAGTCATTCCCTCTACATTTCCTTTCTCCTAGTTGAGCCTAAGGTGACTGTGTATCCTGCAAGGACCCAGACCCTGCAGCACCACAACCTCCTGGTCTGCTCTGTGAATGGTTTCTATCCAGGCAGCATTGAAGTCAGGTGGTTCCGGAACAGCCAGGAAGAGAAGGCTGGGGTGGTGTCCACAGGCCTGAT
>contig 4     len = 181     cov = 4.31492
GGAGACTTACTCTGTCTTCATGACTCATTCCCACTACCTTTTTTTTCTCCTAGTCCAACCTAAGGTGACTGTGTATCCTTCAAAGACCCAGCCCCTGCAGCACCACAACCTCCTGGTCTGCTCTGTGAATGGTTTCTATCCAGGCAGCATTGAAGTCAGGTGGTTCCGGAACAGCCAGGAA
>contig 5     len = 430     cov = 15.486
TGGCTCAGGGGAGACTCAGGAACTTCTTTTTCTTCTTCCTGCAGTGCTCTCATCTGAGCCCTTGAAAGAGGGGAAAAGAAACTGTTAGTAGAGCCAGGTTGAAAACAACACTCTCCTCTGTCTTTTGCAGGATTCCTGAGCTGAAATGCAGATGACCACATTCAAGGAAGAACTTTCTGCCCCGGCTTTGCAGGATGAAAAGCTTTCCTGCTTGGCAGTTATTCTTCCACAAGAGAGGGCTTTCTCAGGACCTGGTTGCTACTGGTTCGGCAACTGCAGAAAATGTCCTCCCTTGTGGCTTCCTCAGCTCCTGCCCTTGGCCTGAAGTCCCAGCATTGATGGCAGCGCCTCATCTTCAACTTTTGTGCTCCCCTTTGCCTAAACCGTATGGCCTCCCGTGCATCTGTATTCACCCTGTATGACAAACACA
>contig 6     len = 248     cov = 3.93548
GAGCAGCCTTCTTCCCCATTTTCAAAGCTATGAATATTGCAGGGTCTCAATTAAAAAGGTTCAATTTGGGATAAAAATCACTAAACCTGGTTTCCTCTCCCAGGAGCACAGTCTGAATCTGCACAGAGCAAGATGCTGAGTGGAGTCGGGGGCTTTGTGCTGGGCCTGCTCTTCCTTGGGGCCGGGCTATTCATCTACTTCAAGAATCAGAAAGGTGAGGAGCCTTTGGTAGCGGCTCTCTCCATAGA
>contig 7     len = 263     cov = 10.2928
TCCCAGGGAACTCTGGACTTCACCCAACAGGTAATACCTTTTAATCCTCTTTTAGAAACAGATTCAGTTTTCCTAGAATGATGGCAGAGGTGATAAGGCATGAGACAGAAATAGCAGGAAAGACTTTGGATCCAAATTCCTGATCAGGCAATTTATACCAAAACTCCTCCTCTCCACTTAGGCCTGTGCTCTGCAGGAGTATTGGTTCAGGGAGACTTAGGAACTTGTTTTTCTTCTTCCTGCAGTGCTCTCATCTGAGTCCT
>contig 8     len = 283     cov = 8.23675
GACCAGCAACTGATGATACTATTGAACTCAGATGCTGATTGGTTCTCCAACACGAGAGTACCCAAACCAGGAGGAAGGAAATCAGTAACTTCCTCCCCATAATTTGGAATGTGGGTGGAGGGGGATCATAGTTCTCCCTGAGTGAGACTTGCCTGCTCCTCTGGCCCCTGGTCCTGTCCTGTTCTCCAGCATGGTGTGTCTGAAGCTCCCTGGAGGTTCCTACATGGCAAAGCTGACAGTGACACTGATGGTGCTGAGCTCCCCACTGGCTTTGGCTGGGGAC
>contig 9     len = 187     cov = 2
ACCAGGTCAGAAAACAACACTTTCCTCTCTCTCTGCAGGACTCGTGAGCTGAAGTGCAGATGACCACATTCAAGGGGGAACCTTCTGCCCCAGCTTTGnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnCGGTTCAGCAACTCTGC
>contig 10     len = 315     cov = 10.146
CCAGGCTGATGACACTGTAAGGTCACATGGCCAGCTGTGCTATAGGCCTGGTCAAGGCCAGAGCCTGGGTTTGCAGAGAAGCAGACACACAGCCAAACCAGGAGACTTACTCTGTCTTCCTGACTCATTCCCTCTACGTTGTTTTTCTCCTAGTCCAACCTAAGGTGACTGTATATCCTTCAAAGACCCAGCCCCTGCAGCACCACAACCTCCTGGTCTGCTCTGTGAGTGGTTTCTATCCAGGCAGCATTGAAGTCAGGTGGTTCCTGAACGGCCAGGAAGAGAAGGCTGGGATGGTGTCCACAGGCCTGATCC
>contig 11     len = 406     cov = 12.0419
GGGGTTATGGTTGGGATCAGTTAAGGTTCCAGTGCCCGCACCCCGCCCAGGGAGCCCCGGATGGCGGCGTCTCTGTCAGTATCTTCCCCGGAGGCCGCCCCTGTGACCGGATCCTTCGTGTCCCCACAGCACGTTTCTTGCAGCAGGATAAGTATGAGTGTCATTTCTTCAACGGGACGGAGCGGGTGCGGTTCCTGCACAGAGACATCTATAACCAAGAGGAGGACTTGCGCTTCGACAGCGACGTGGGGGAGTACCGGGCGGTGACGGAGCTGGGGCGGCCTGACGCTGAGTACTGGAACAGCCAGAAGGACTTCCTGGAAGACAGGCGCGCCGCGGTGGACACCTACTGCAGACACAACTACGGGGTTGGTGAGAGCTTCACAGTGCAGCGGCGAGGTGAGCA
>contig 12     len = 323     cov = 8.44272
CCTTCATACAGCATCTCTGACCAGCAACTGATGATGCTATTGAACTCAGATGCTGATTGGTTCTCCAACACGAGATTACCCAACCCAGGAGCAAGGAAATCAGTAACTTCCTCCCTATAACTTGGAATGTGGGTGGAGGGGTTCATAGTTCTCCCTGAGTGAGACTTGCCTGCTTCTCTGGCCCCTGGTCCTGTCCTGTTCTCCAGCATGGTGTGTCTGAAGCTCCCTGGAGGCTCCTGCATGACAGCGCTGACAGTGACACTGATGGTGCTGAGCTCCCCACTGGCTTTGTCTGGGGACACCCGACGTAAGTGCACATTGCG

And the typing.txt output file:

cat output/DRB/HLA_CJF_32_DRB.typing.txt
            Name        HD     Aln_len    cDNA_len     Similarity        Avg_cov               Missing_exons (ID, len); mismatches [ID, pos]



 ------------ Candidate Allelic Pairs -------------

            Name        HD     Aln_len    cDNA_len     Similarity        Avg_cov               Missing_exons (ID, len); mismatches [ID, pos]



 ------------ Inferred Allelic Pairs -------------



I also tried a hamming distance up to 10, but no difference in results.  Any suggestions?  Perhaps the coverage for the contigs is too low?  Is there any way to change that setting?  I'm running on Red Hat Linux with Athlates_07082013. 

Thanks,

Andrew


Xiao Yang

unread,
Oct 2, 2013, 1:50:21 PM10/2/13
to viral-to...@googlegroups.com
Hi Andrew,

Thanks for your email. Indeed this is expected in many of the
existing exome-seq data set, which does not provide adequate
SPAN of the target HLA cDNAs. In the example you provided,
the BLAST alignment (see attached) of all contigs against a 801bp reference
DRB1*01:01:01 allele yielded the following alignment. It appears
there exists a large gap in the 525 -- 640 region. In such case, we
do not intend to report any typing results regardless of how
well the remaining regions are covered. We had similar experiences
as presented in our paper. 

Currently, the program doesn't provide any typing results for such
cases but efforts will be made to improve sensitivity although it is
right now unclear how reliable we can predict the final typing results
with a large part of the individual exonic region missing. 

Xiao

Inline image 1


--
You received this message because you are subscribed to the Google Groups "Broad Viral Tool Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to viral-tool-use...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.



--
- Xiao
image.png

Xiao Yang

unread,
Oct 3, 2013, 2:28:35 PM10/3/13
to viral-to...@googlegroups.com
Hi Andrew,

After inspection and correspondence with other users.
I found a major problem with input BAM files, which 
requires I put some instructions in terms of further processing. 

I'm currently working on a fix for this issue. Will release 
a new version shortly when this problem is fixed. So, it is 
likely the samples you had might actually be able to be typed
by ATHLATES.

Thanks,
Xiao 

--
- Xiao
image.png

Smitha Shankar

unread,
Feb 7, 2014, 3:26:14 PM2/7/14
to viral-to...@googlegroups.com
Hello, I am facing the same issues as Andrew. Has this been fixed?

Xiao Yang

unread,
Feb 7, 2014, 3:46:22 PM2/7/14
to viral-to...@googlegroups.com
Hi Smitha,

I have released a new documentation on the website (for which you can check).
Could you check it out ? 

Thanks,
Xiao

Smitha Shankar

unread,
Feb 7, 2014, 3:58:31 PM2/7/14
to viral-to...@googlegroups.com
Hello Xiao, I followed the steps given on the new documentation, but I still get this warning in the log file:
 Input BAM file should be sorted by queryname

As suggested in the document I used this:

sort -k 1,1 -k 3,3 [A.sam] > [A.sort.sam]

samtools -view -bS [A.sort.sam] > [A.sort.bam]

The sorting actually moves the headers to the end of the file and therefore the sam cannot be converted to bam. Therefore after the sorting step i moved the headers back to the top of the file before I converted it to bam. 

Am I doing something wrong? 

Would you like to see my bam file?

Smitha Shankar

unread,
Feb 7, 2014, 4:55:13 PM2/7/14
to viral-to...@googlegroups.com
These steps work fine:

samtools view -b -L A_nucgen.bed [output.sort.bam] > [A.bam]

samtools view -h -o [A.sam] [A.bam]

Where as this step sort -k 1,1 -k 3,3 [A.sam] > [A.sort.sam] creates a sorted sam file with all the headers at the end of the file, due to which I was not able to convert it to the A.sort.bam file as suggested in the next step. 

Therefore I put the headers back on top and then converted it to a bam file which seems to give a empty bam typing.txt file. 

I have used the latest manual created on Oct 3 2013. 

Xiao Yang

unread,
Feb 9, 2014, 11:59:08 AM2/9/14
to viral-to...@googlegroups.com
On Fri, Feb 7, 2014 at 4:55 PM, Smitha Shankar <smitha...@gmail.com> wrote:
These steps work fine:

samtools view -b -L A_nucgen.bed [output.sort.bam] > [A.bam]

samtools view -h -o [A.sam] [A.bam]

Where as this step sort -k 1,1 -k 3,3 [A.sam] > [A.sort.sam] creates a sorted sam file with all the headers at the end of the file, due to which I was not able to convert it to the A.sort.bam file as suggested in the next step. 


You could use LC_ALL=C  sort -k 1,1 -k 3,3 [A.sam] > [A.sort.sam]
to avoid locales been used by sort. 



--
- Xiao

Shing Zhan

unread,
Apr 2, 2014, 4:46:04 PM4/2/14
to viral-to...@googlegroups.com
Dear Xiao,

I tried the following steps to prepare BAM files for ATHLATES as mentioned by Smitha above and with your recommendations:

samtools view -b -L A_nucgen.bed test_novoalign_sort.bam > test_a.bam
samtools view -h test_a.bam > test_a.sam
LC_ALL=C sort -k 1,1 -k 3,3 test_a.sam > test_a.sorted.sam
samtools view -bS test_a.sorted.sam > test_a.sorted.bam

However, I still get the warning:

"[WARNING]: test_a.sorted.bam sort order: unsorted"

I tested the same procedure on the 1000 Genomes exomes for HG01756 and HG01757 which were used in your paper.

And I was able to get the same HLA-A and HLA-B allelic pairs as in your paper.

Is the warning really an issue (well, it seems not)? Can I simply ignore it? I am actually using the first version of ATHLATES you released.

Any input would be much appreciated.

Thank you very much,
Shing

Xiao Yang

unread,
Apr 2, 2014, 4:52:21 PM4/2/14
to viral-to...@googlegroups.com
Hi Shing, 

I responded below. 


On Wed, Apr 2, 2014 at 4:46 PM, Shing Zhan <shing...@gmail.com> wrote:
Dear Xiao,

I tried the following steps to prepare BAM files for ATHLATES as mentioned by Smitha above and with your recommendations:

samtools view -b -L A_nucgen.bed test_novoalign_sort.bam > test_a.bam
samtools view -h test_a.bam > test_a.sam
LC_ALL=C sort -k 1,1 -k 3,3 test_a.sam > test_a.sorted.sam
samtools view -bS test_a.sorted.sam > test_a.sorted.bam

However, I still get the warning:

"[WARNING]: test_a.sorted.bam sort order: unsorted"

I tested the same procedure on the 1000 Genomes exomes for HG01756 and HG01757 which were used in your paper.

And I was able to get the same HLA-A and HLA-B allelic pairs as in your paper.

Is the warning really an issue (well, it seems not)? Can I simply ignore it?

The warning message is a result from inproper setting of the flags after sorted by Samtools. 
And yes you can simply ignore it. 
 
I am actually using the first version of ATHLATES you released.


I recommend you to use the newer version and our documentation also has a major change.
In addition, recently, we've noticed that to prepare initial bam file as ATHLATES input, if 
no mismatch is tolerated (-t 10 in novoalign), we would be able to assembly more complete
exons (hence achieving a better results). These adjustments have already been included
in the newest documentation.  
 
For more options, visit https://groups.google.com/d/optout.



--
- Xiao

Shing Zhan

unread,
Apr 2, 2014, 6:55:24 PM4/2/14
to viral-to...@googlegroups.com
Thank you very much for the quick reply, Xiao!

I still have a few more questions/concerns.

1) Now that you mentioned that using '-t 10' instead o '-t 30' yields better results, I am wondering whether I should redo all the read alignments. So far, I have used this command, based on older documentation and your paper.

novoalign -d ref.nix -f rd1.fq rd2.fq -F STDFQ -o SAM -t 30 -r all -l 60 -e 1 | samtools view -bS -h -F 4 - > output.bam

* Working with 75 bp reads. So, I set '-l 60'.

The problem is that redoing all the alignments is not feasible resource-wise for me right now. So, is there a way that you would recommend around it to improve the results? For example, extract reads already aligned using '-t 30' and then realign using '-t 10'?

2) Sorting by only read name, i.e.

samtools view -b -L A_nucgen.bed test_novoalign_sort.bam > test_a.bam
samtools sort -n test_a.bam test_a.sorted

instead of sorting by read name and then reference name (as in my previous post) resulted in different average coverage numbers ('Avg_cov' in output file), although the nominated alleles are largely identical.

Is this of concern? I don't quite get why the average coverage numbers differ.

Again thank you very much,
Shing

Xiao Yang

unread,
Apr 3, 2014, 1:01:51 PM4/3/14
to viral-to...@googlegroups.com
Hi Shing, 

I responded below. 

On Wed, Apr 2, 2014 at 6:55 PM, Shing Zhan <shing...@gmail.com> wrote:
Thank you very much for the quick reply, Xiao!

I still have a few more questions/concerns.

1) Now that you mentioned that using '-t 10' instead o '-t 30' yields better results, I am wondering whether I should redo all the read alignments. So far, I have used this command, based on older documentation and your paper.

novoalign -d ref.nix -f rd1.fq rd2.fq -F STDFQ -o SAM -t 30 -r all -l 60 -e 1 | samtools view -bS -h -F 4 - > output.bam

* Working with 75 bp reads. So, I set '-l 60'.

The problem is that redoing all the alignments is not feasible resource-wise for me right now. So, is there a way that you would recommend around it to improve the results? For example, extract reads already aligned using '-t 30' and then realign using '-t 10'?


I guess that should work. But it is very important to retain read pair information.
 
2) Sorting by only read name, i.e.

samtools view -b -L A_nucgen.bed test_novoalign_sort.bam > test_a.bam
samtools sort -n test_a.bam test_a.sorted

instead of sorting by read name and then reference name (as in my previous post) resulted in different average coverage numbers ('Avg_cov' in output file), although the nominated alleles are largely identical.

Is this of concern? I don't quite get why the average coverage numbers differ.

Yes, this is a concern. The reason to update the documentation is that I realized my earlier
assumption of intended input format was wrong. So, it is necessary to follow the newer 
documentation. 



--
- Xiao

Shing Zhan

unread,
Apr 3, 2014, 1:36:47 PM4/3/14
to viral-to...@googlegroups.com
Thanks very much again, Xiao.

Instead of redoing the novoalign step, I am considering to retain reads with 'NM:i:0' only, therefore excluding reads with mismatches (regardless of their base qualities). I believe this is approximate to doing the alignments with '-t 10' (or perhaps even more strict).

Thanks again,
Shing

Yan Li

unread,
Jun 23, 2014, 5:22:35 PM6/23/14
to viral-to...@googlegroups.com
Hi Xiao,

When I follow the steps with suggested "LC_ALL=C sort -k 1,1 -k 3,3 test_a.sam > test_a.sorted.sam" as below:

samtools view -b -L A_nucgen.bed test_novoalign_sort.bam > test_a.bam
samtools view -h test_a.bam > test_a.sam
LC_ALL=C sort -k 1,1 -k 3,3 test_a.sam > test_a.sorted.sam
I got error message:
[samopen] SAM header is present: 9415 sequences.
[sam_read1] reference 'SN:HLA:HLA08800    AS:hlall.nix    LN:270


gnMPI -d /home/cliu/HLA_SEQUENCES/hlall.nix -f HG01756_1.filt.fastq.gz HG01756_2.filt.fastq.gz -t 30 -o SAM -r all -l 80 -e 1 -i 230 140
' is recognized as '*'.
[main_samview] truncated file.

The sorted sam files cannot be coverted to the bam files. Do you by any chance know what cause this error? Or anybody else has same problem as me?

Any comments can be big help.

Thanks,
Yan

Xiao Yang

unread,
Jun 23, 2014, 6:41:30 PM6/23/14
to viral-to...@googlegroups.com
Hi Yan,

I'm a bit puzzled here. 
Could you confirm if gnMPI the proper command to use and /home/cliu/HLA_SEQUENCES/hlall.nix 
is where you built your index file ? 

Xiao

Yan Li

unread,
Jun 23, 2014, 8:14:39 PM6/23/14
to viral-to...@googlegroups.com
Hi Xiao,

I think sorted sam file has header of "gnMPI the proper command to use and /home/cliu/HLA_SEQUENCES/
hlall.nix"
***************
[yli2@cri12in02 stefaniSpranger-HLAtyping]$ head A.sort.sam
@HD    VN:1.0    SO:unsorted
@PG    ID:novoalignMPI    VN:V2.07.07    CL:novoalignMPI -d /home/cliu/HLA_SEQUENCES/hlall.nix -f HG01756_1.filt.fastq.gz HG01756_2.filt.fastq.gz -t 30 -o SAM -r all -l 80 -e 1 -i 230 140
@SQ    SN:HLA:HLA00001    AS:hlall.nix    LN:1098
@SQ    SN:HLA:HLA00001    AS:hlall.nix    LN:3503
@SQ    SN:HLA:HLA00002    AS:hlall.nix    LN:1098
@SQ    SN:HLA:HLA00003    AS:hlall.nix    LN:1098
@SQ    SN:HLA:HLA00003    AS:hlall.nix    LN:3503
*************
Basically, I use your provided novoalign file "HG01756hlall.bam" located in /demo/HG01756/ to demo HLA typing followed the section 2 from step 3 to 4.

I do not have problems until $sort -k 1,1 -k 3,3 A.sam > A.sort.sam

Then I ran $samtools view -bS A.sort.sam > A.sort.bam

I got error message
***********

[samopen] SAM header is present: 9415 sequences.
[sam_read1] reference 'SN:HLA:HLA08800    AS:hlall.nix    LN:270


gnMPI -d /home/cliu/HLA_SEQUENCES/
hlall.nix -f HG01756_1.filt.fastq.gz HG01756_2.filt.fastq.gz -t 30 -o SAM -r all -l 80 -e 1 -i 230 140
' is recognized as '*'.
[main_samview] truncated file.
**********

Do I use wrong alignment file to demo? Or anywhere I made mistake to demo HLA typing?

Thank,
Yan

Xiao Yang

unread,
Jun 23, 2014, 9:20:34 PM6/23/14
to viral-to...@googlegroups.com
Hi Yan,

I just checked the demo folder, there are no sam file in the folder only the following files. 

HG01756_a.sort.bam
HG01756_b.sort.bam
HG01756_c.sort.bam
HG01756_na.sort.bam
HG01756_nb.sort.bam
HG01756_nc.sort.bam
HG01756_nq.sort.bam
HG01756_nr.sort.bam
HG01756_q.sort.bam
HG01756_r.sort.bam
HG01756hlall.bam

So I am not sure where you get A.sam from ? If you follow section 4 the walk through step 2,
then you should be able to apply typing to the demo files.


Yan Li

unread,
Jun 24, 2014, 10:51:47 AM6/24/14
to viral-to...@googlegroups.com
Hi Xiao,

That is exactly what I do. I took novoalign result in the folder "demo/HG01756/HG01756hlall.bam", and follow the section 2 step 3 and 4.

No error for step 3, but the last part of step 4 with command "$ samtools view -bS [A.sort.sam] > [A.sort.bam]", in which sorted sam file need to be converted into bam files, I got error message as below:



On Monday, June 23, 2014 8:20:34 PM UTC-5, Xiao Yang wrote:
Hi Yan,

I just checked the demo folder, there are no sam file in the folder only the following files. 

HG01756_a.sort.bam
HG01756_b.sort.bam
HG01756_c.sort.bam
HG01756_na.sort.bam
HG01756_nb.sort.bam
HG01756_nc.sort.bam
HG01756_nq.sort.bam
HG01756_nr.sort.bam
HG01756_q.sort.bam
HG01756_r.sort.bam
HG01756hlall.bam

So I am not sure where you get A.sam from ? If you follow section 4 the walk through step 2,
then you should be able to apply typing to the demo files.
***********
[samopen] SAM header is present: 9415 sequences.
[sam_read1] reference 'SN:HLA:HLA08800    AS:hlall.nix    LN:270


gnMPI -d /home/cliu/HLA_SEQUENCES/
hlall.nix -f HG01756_1.filt.fastq.gz HG01756_2.filt.fastq.gz -t 30 -o SAM -r all -l 80 -e 1 -i 230 140
' is recognized as '*'.
[main_samview] truncated file.
**********
You suggested Shing to use "LC_ALL=C sort -k 1,1 -k 3,3 test_a.sam > test_a.sorted.sam" before command "samtools view -bS test_a.sorted.sam > test_a.sorted.bam". I did the same thing, but still got the same error.
Do you by any chance know what is going on?

Thanks,
Yan

Yan Li

unread,
Jun 24, 2014, 11:04:43 AM6/24/14
to viral-to...@googlegroups.com
Hi Xiao,

In case my previous reply is not clear, here is what I did to demo typing.

I followed the section 4 go through section 2 step 3 and 4 with novoalign bam file in “demo/HG01756/HG01756hlall.bam”.

No problem for step 2 in the section 2 - “$ samtools sort [output.bam] [output.sort]”

But when I reach step 3 in the section3 until command “$samtools view -bS [A.sort.sam] > [A.sort.bam]”, I got error message
*****************************
$ samtools view -bS A.sort.sam > A.sort.bam
[samopen] SAM header is present: 9415 sequences.
[sam_read1] reference 'SN:HLA:HLA08800    AS:hlall.nix    LN:270


gnMPI -d /home/cliu/HLA_SEQUENCES/hlall.nix -f HG01756_1.filt.fastq.gz HG01756_2.filt.fastq.gz -t 30 -o SAM -r all -l 80 -e 1 -i 230 140

' is recognized as '*'.
[main_samview] truncated file.

********************************

If I ignore this message, keep going through until typing with command "typing -bam ~/A.sort.bam -exlbam ~/non-A.sort.bam -msa /db/msa/A_nuc.txt -o ~/HGtest_a > ~/HGtest_a.log.txt”, no output files can be generated and error message in the log file is as below:

****************************
--------------------------------------------------------
Running cmd:
typing     -bam testA.sort.bam -exlbam testnonA.sort.bam -msa /db/msa/A_nuc.txt -hd 2 -o ~/HGtest-a
--------------------------------------------------------

    Obtain (un)paired-read names in testnonA.sort.bam

[EXIT]: gather_alignments: Failed to locate bam index of file testnonA.sort.bam

****************************

Do you by any chance know what is going on, where did I make mistake?

Looking forward to your reply.

Thank,
Yan

Ekaterina Esaulova

unread,
Mar 14, 2016, 3:54:14 PM3/14/16
to Broad Viral Tool Users
Hello,

I’m trying to use Athlates for typing HLA class II. I decided to test my pipeline on HLA class I typing to make sure that it runs properly, so I took Opty-type article (http://bioinformatics.oxfordjournals.org/content/30/23/3310.full.pdf) - they also run Athlates on 1000 genomes data, so they have typing results.

I run Polysolver and Athlates, but for some reason Athlates is not working well; sometimes the data output is empty, and there are a lot of homozygotes wrong results. The most confusing part is that sometimes results are correct and sometimes not - with the same pioeline. I tried to figure it out by myself or find people with same issues, but I failed. 
Could you help me out with that? 

My main questions is: 
  • Why does Athlates sometimes give correct results, but mostly wrong or no results at all with the same pipeline?
  • Can I replace the sorting resulting sam files not with samtools, but with novosort -n?

My pipeline looks like this (I tried both single and paired end options):
  1. 1000 genomes provide bam files so first I extracting fq files:
    # input for bamtofastq must be sorted by name
    novosort -n -m 24 -c 4 initial.file.bam >  file.sort.bam 

    bedtools bamtofastq -i  file.qsort.bam -fq  file.end1.tmp.fq -fq2  file.end2.tmp.fq
    # Clean up the read names from the paired end (1 or 2) indicator
    awk '{ if ( (NR - 1) % 4 == 0) gsub(/\/[12]$/,"",\$1); print \$1 } '  file.end1.tmp.fq >  file.end1.fq
    awk '{ if ( (NR - 1) % 4 == 0) gsub(/\/[12]$/,"",\$1); print \$1 } '  file.end2.tmp.fq >  file.end2.fq

  2. For novoalign I count the mean and sd for fragment size based on 9 column in bam file;
    novoalign -c 4 -d ./db/ref/ref.nix -f file end1.fq file.end2.fq -t 30 -o SAM -r all -l 40 -e 100 -i PE $MEAN,$SD | samtools view -bS -h -F 4 - > file.bam
    samtools sort -@ 4 -o file.sorted.bam file.bam
  3. Then I preparing files for athlates and run program:
    for allele in  "A" "B" "C" 
    do 
             samtools view -b -o file.\$allele.bam -L /scratch/kate/cancer_project/Athlates_2014_04_26/db/bed/hla.\$allele.bed file.sorted.bam
             novosort -n -m 24 -c 4 -t $WDIR file.\$allele.bam >  file.\$allele.sort.bam

             samtools view -b -o file.non-\$allele.bam -L /scratch/kate/cancer_project/Athlates_2014_04_26/db/bed/hla.non-\$allele.bed file.sorted.bam 
             novosort -n -m 24 -c 4 -t $WDIR $NAME.non-\$allele.bam >  $NAME.non-\$allele.sort.bam
             ./bin/typing -bam file.\$allele.sort.bam -exlbam file.non-\$allele.sort.bam -msa ./db/msa/\${allele}_nuc.txt -o file.hla-\$allele.result
    done

    I also tried tutorial way with "sort -k 1,1 -k 3,3” and LC_ALL=C, but sam header always in the bottom of file. Then I split sam file by header and body, sort them separately and then merge and get the same result.

I would greatly appreciate any clues you may have to help me solve this problem. 

Thank you for you attention,

Kate 

Reply all
Reply to author
Forward
0 new messages