ATHLATES producing empty *typing.txt output

Andrew Oler

unread,

Oct 2, 2013, 1:33:18 PM10/2/13

to viral-to...@googlegroups.com

Hi,

I'm trying to run Athlates to get HLA-DRB types on several exome-seq datasets.

I downloaded all HLA sequences from IMGT and aligned the reads with novoalign:
novoalign -d hla.nix -F STDFQ -f ${f}.1.fq ${f}.2.fq -t 30 -o SAM -r all -l 80 -e 100 -i PE 225 140

Then I filtered out the DRB and non_DRB sequences:
samtools view -b -L DRB.bed $bam | samtools sort -n - DRB/${bam/.bam/}
samtools view -b -L non_DRB.bed $bam | samtools sort -n - non_DRB/${bam/.bam/}

I downloaded the latest DRB msa from here:
http://hla.alleles.org/data/txt/drb_nuc.txt

And I ran Athlates like this:
typing -bam DRB/$bam -exlbam non_DRB/$bam -msa drb_nuc.txt -o output/DRB/${bam/.bam/} > output/DRB/${bam/.bam/}.log.txt

It appears that the bam files and msa files are being read properly, and the reads are assembled into contigs, but when it tries to determine the types, it always has no output.

Here is an example output log.txt file:
--------------------------------------------------------
Running cmd:
typing    -bam DRB/HLA_CJF_32.bam -exlbam non_DRB/HLA_CJF_32.bam -msa drb_nuc.txt -hd 2 -o output/DRB/HLA_CJF_32_DRB
--------------------------------------------------------
    Obtain (un)paired-read names in non_DRB/HLA_CJF_32.bam
        #s&p names: 1294, 10929
        create file: output/DRB/HLA_CJF_32_DRB.raw.fa
    Check BAM header of DRB/HLA_CJF_32.bam
    [WARNING]: DRB/HLA_CJF_32.bam sort order: unsorted
        Input BAM file should be sorted by queryname
        101335 ref sequence(s) found; avg len = 636
        0 platform(s) found:
    Merge PE for DRB/HLA_CJF_32.bam
        create file: output/DRB/HLA_CJF_32_DRB.unpair.fa
        create file: output/DRB/HLA_CJF_32_DRB.pair.fa
        Total, merged, unmerged, singleton:
            8942, 340(181 overlap), 1, 21
        target, non_target, ratio:
        951    248    0.260778
    Discard reads containing unique 50-mers
            average freq = 6
        79 reads discarded
    Assemble 363 reads...
        iter: 0
        seed_k = 151    141    131    121    111    101    91    81    71    61    51    41    40
        iter: 1
        seed_k = 151    141    131    121    111    101    91    81    71    61    51    41    40
    13 contigs created
        create file: output/DRB/HLA_CJF_32_DRB.contig.fa
        create file: output/DRB/HLA_CJF_32_DRB.contig.detail.txt
    Read in cDNA MSA: drb_nuc.txt
        1454 refs in IMGT/HLA database; MSA len = 813
        Exon starts:    1    105    381    665    776    800
    Typing ...
        seed contigs...
        go through each cDNA...
        create file: output/DRB/HLA_CJF_32_DRB.typing.txt
Program finished successfully !

Here is the contig.fa file that was produced:

cat output/DRB/HLA_CJF_32_DRB.contig.fa
>contig 0    len = 291    cov = 5.8866
CATCTCTGACCAGCGACTGATGATGCTATTGTAATCAGATGCTGATTCGTTCTCCAACACTAGATTACCCAATCCACAAGCAAGGAAATCAGTAACTTCTTCCCTATAATTTGGAATGTGGGTGGAGAGGGGTCATAGTTCTCCCTGAGTGAGACTCACCTGCTCCTCTGGCCCCTGGTCCTGTCCTGTTCTCCAGCATGGTGTGTCTGAAGCTCCCTGGAGGCTCCTGCATGGCAGCTCTGACAGTGACACTGATGGTGCTGAGCTCCCCACTGGCTTTGGCTGGGGACA
>contig 1    len = 488    cov = 14.3443
ATGAGCTCCTGGGCTGCAGGTGGTGGGCGTTGCGGGTGGGGCCGGTTAAGGTTCCAGCGCCCGCACCCCGCTCAGGGAGCCCCGGATGGTGGCGTCGCTGTCCGTGTCTTCTCAGGAGGCCGCCCTGTGACCGGATGGTTCGTGTCCCCACAGCACGTTTCCTGTGGCAGCCTAAGAGGGAGTGTCATTTCTTCAATGGGACGGAGCGGGTGCGGTTCCTGGACAGATACTTCTATAACCAGGAGGAGTCCGTGCGCTTCGACAGCGACGTGGGGGAGTTCCGGGCGGTGACGGAGCTGGGGCGGCCTGACGCTGAGTACTGGAACAGCCAGAAGGACATCCTGGAGCAGGCGCGGGCCGCGGTGGACACCTACTGCAGACACAACTACGGGGTTGTGGAGAGCTTCACAGTGCAGCGGCGAGGTGAGCATGGCCGCGGGCGGGGCCTGAGTCCCCGTGAGCTGGGAATCTGAGTGTGTGTGTGTGTG
>contig 2    len = 660    cov = 20.9091
GAATAGATTTGCTTCCATACCTTTTAAGTGTATATCTTTTCCTCTTTTTCCCAGGGCACTCTGGACTTCACCCAACAGGTAATACCTTTTAATCCTCTTTTAGAAACAGATTCAGTTTTCCTAGAATGATGGCAGAGGTGATAAGGCATGAGACAGAAATAGCAGGAAAGACTTTGGATCCAAATTCCTGATCAGGCAATTTATACCAAAACTCCTCCTCTCCACTTAGGCCTGTGCTCTGCAGGAGTATTGGTTCAGGGAGACTTAGGAACTTGTTTTTCTTCTTCCTGCAGTGCTCTCATCTGAGTCCTTGAAAGAGAGGAAAAGAAGCTGTTAGTGGAACCAGGTCTGAAAACAACACTTTCCTCTCTCTCTGCAGGACTCGTGAGCTGAAGTGCAGATGACCACATTCAAGGGGGAACCTTCTGCCCCAGCTTTGCATGATGAAAAGCTTTCCTGCTTGGCTCTTATTCTTCCACAAGAGAGGACTTTCTCAGGCCCTGGTTGCTACCGGTTCAGCAACTCTGCAGAAAATGTCCATCCTTGTGGCTTCCTCAGCTCCTGCCCTTGGCCTGAAGTCCCAGCATTGATGGCAGTGCCTCATCTTCAACTTTAGTGCTCCCCTTTACCTAACCCTACGGCCTCCCATGCATCTGTACT
>contig 3    len = 333    cov = 14.7477
CCCACTATAAAGACTAAATTCCAGGCTGATGACACTGTGAGGCCTCATGGCCAGCTGTGCTGGAGGCCTGGTCAAGGCCAGAGCCTGGGTTTACAGAGAAGCAGACAAACAGCCAAAAAGGGAGATTCACTCTGTCTTCCTGAGTCATTCCCTCTACATTTCCTTTCTCCTAGTTGAGCCTAAGGTGACTGTGTATCCTGCAAGGACCCAGACCCTGCAGCACCACAACCTCCTGGTCTGCTCTGTGAATGGTTTCTATCCAGGCAGCATTGAAGTCAGGTGGTTCCGGAACAGCCAGGAAGAGAAGGCTGGGGTGGTGTCCACAGGCCTGAT
>contig 4    len = 181    cov = 4.31492
GGAGACTTACTCTGTCTTCATGACTCATTCCCACTACCTTTTTTTTCTCCTAGTCCAACCTAAGGTGACTGTGTATCCTTCAAAGACCCAGCCCCTGCAGCACCACAACCTCCTGGTCTGCTCTGTGAATGGTTTCTATCCAGGCAGCATTGAAGTCAGGTGGTTCCGGAACAGCCAGGAA
>contig 5    len = 430    cov = 15.486
TGGCTCAGGGGAGACTCAGGAACTTCTTTTTCTTCTTCCTGCAGTGCTCTCATCTGAGCCCTTGAAAGAGGGGAAAAGAAACTGTTAGTAGAGCCAGGTTGAAAACAACACTCTCCTCTGTCTTTTGCAGGATTCCTGAGCTGAAATGCAGATGACCACATTCAAGGAAGAACTTTCTGCCCCGGCTTTGCAGGATGAAAAGCTTTCCTGCTTGGCAGTTATTCTTCCACAAGAGAGGGCTTTCTCAGGACCTGGTTGCTACTGGTTCGGCAACTGCAGAAAATGTCCTCCCTTGTGGCTTCCTCAGCTCCTGCCCTTGGCCTGAAGTCCCAGCATTGATGGCAGCGCCTCATCTTCAACTTTTGTGCTCCCCTTTGCCTAAACCGTATGGCCTCCCGTGCATCTGTATTCACCCTGTATGACAAACACA
>contig 6    len = 248    cov = 3.93548
GAGCAGCCTTCTTCCCCATTTTCAAAGCTATGAATATTGCAGGGTCTCAATTAAAAAGGTTCAATTTGGGATAAAAATCACTAAACCTGGTTTCCTCTCCCAGGAGCACAGTCTGAATCTGCACAGAGCAAGATGCTGAGTGGAGTCGGGGGCTTTGTGCTGGGCCTGCTCTTCCTTGGGGCCGGGCTATTCATCTACTTCAAGAATCAGAAAGGTGAGGAGCCTTTGGTAGCGGCTCTCTCCATAGA
>contig 7    len = 263    cov = 10.2928
TCCCAGGGAACTCTGGACTTCACCCAACAGGTAATACCTTTTAATCCTCTTTTAGAAACAGATTCAGTTTTCCTAGAATGATGGCAGAGGTGATAAGGCATGAGACAGAAATAGCAGGAAAGACTTTGGATCCAAATTCCTGATCAGGCAATTTATACCAAAACTCCTCCTCTCCACTTAGGCCTGTGCTCTGCAGGAGTATTGGTTCAGGGAGACTTAGGAACTTGTTTTTCTTCTTCCTGCAGTGCTCTCATCTGAGTCCT
>contig 8    len = 283    cov = 8.23675
GACCAGCAACTGATGATACTATTGAACTCAGATGCTGATTGGTTCTCCAACACGAGAGTACCCAAACCAGGAGGAAGGAAATCAGTAACTTCCTCCCCATAATTTGGAATGTGGGTGGAGGGGGATCATAGTTCTCCCTGAGTGAGACTTGCCTGCTCCTCTGGCCCCTGGTCCTGTCCTGTTCTCCAGCATGGTGTGTCTGAAGCTCCCTGGAGGTTCCTACATGGCAAAGCTGACAGTGACACTGATGGTGCTGAGCTCCCCACTGGCTTTGGCTGGGGAC
>contig 9    len = 187    cov = 2
ACCAGGTCAGAAAACAACACTTTCCTCTCTCTCTGCAGGACTCGTGAGCTGAAGTGCAGATGACCACATTCAAGGGGGAACCTTCTGCCCCAGCTTTGnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnCGGTTCAGCAACTCTGC
>contig 10    len = 315    cov = 10.146
CCAGGCTGATGACACTGTAAGGTCACATGGCCAGCTGTGCTATAGGCCTGGTCAAGGCCAGAGCCTGGGTTTGCAGAGAAGCAGACACACAGCCAAACCAGGAGACTTACTCTGTCTTCCTGACTCATTCCCTCTACGTTGTTTTTCTCCTAGTCCAACCTAAGGTGACTGTATATCCTTCAAAGACCCAGCCCCTGCAGCACCACAACCTCCTGGTCTGCTCTGTGAGTGGTTTCTATCCAGGCAGCATTGAAGTCAGGTGGTTCCTGAACGGCCAGGAAGAGAAGGCTGGGATGGTGTCCACAGGCCTGATCC
>contig 11    len = 406    cov = 12.0419
GGGGTTATGGTTGGGATCAGTTAAGGTTCCAGTGCCCGCACCCCGCCCAGGGAGCCCCGGATGGCGGCGTCTCTGTCAGTATCTTCCCCGGAGGCCGCCCCTGTGACCGGATCCTTCGTGTCCCCACAGCACGTTTCTTGCAGCAGGATAAGTATGAGTGTCATTTCTTCAACGGGACGGAGCGGGTGCGGTTCCTGCACAGAGACATCTATAACCAAGAGGAGGACTTGCGCTTCGACAGCGACGTGGGGGAGTACCGGGCGGTGACGGAGCTGGGGCGGCCTGACGCTGAGTACTGGAACAGCCAGAAGGACTTCCTGGAAGACAGGCGCGCCGCGGTGGACACCTACTGCAGACACAACTACGGGGTTGGTGAGAGCTTCACAGTGCAGCGGCGAGGTGAGCA
>contig 12    len = 323    cov = 8.44272
CCTTCATACAGCATCTCTGACCAGCAACTGATGATGCTATTGAACTCAGATGCTGATTGGTTCTCCAACACGAGATTACCCAACCCAGGAGCAAGGAAATCAGTAACTTCCTCCCTATAACTTGGAATGTGGGTGGAGGGGTTCATAGTTCTCCCTGAGTGAGACTTGCCTGCTTCTCTGGCCCCTGGTCCTGTCCTGTTCTCCAGCATGGTGTGTCTGAAGCTCCCTGGAGGCTCCTGCATGACAGCGCTGACAGTGACACTGATGGTGCTGAGCTCCCCACTGGCTTTGTCTGGGGACACCCGACGTAAGTGCACATTGCG

And the typing.txt output file:

cat output/DRB/HLA_CJF_32_DRB.typing.txt
            Name        HD     Aln_len    cDNA_len     Similarity        Avg_cov               Missing_exons (ID, len); mismatches [ID, pos]

------------ Candidate Allelic Pairs -------------

            Name        HD     Aln_len    cDNA_len     Similarity        Avg_cov               Missing_exons (ID, len); mismatches [ID, pos]

------------ Inferred Allelic Pairs -------------

I also tried a hamming distance up to 10, but no difference in results. Any suggestions? Perhaps the coverage for the contigs is too low? Is there any way to change that setting? I'm running on Red Hat Linux with Athlates_07082013.

Thanks,

Andrew

Xiao Yang

unread,

Oct 2, 2013, 1:50:21 PM10/2/13

to viral-to...@googlegroups.com

Hi Andrew,

Thanks for your email. Indeed this is expected in many of the

existing exome-seq data set, which does not provide adequate

SPAN of the target HLA cDNAs. In the example you provided,

the BLAST alignment (see attached) of all contigs against a 801bp reference

DRB1*01:01:01 allele yielded the following alignment. It appears

there exists a large gap in the 525 -- 640 region. In such case, we

do not intend to report any typing results regardless of how

well the remaining regions are covered. We had similar experiences

as presented in our paper.

Currently, the program doesn't provide any typing results for such

cases but efforts will be made to improve sensitivity although it is

right now unclear how reliable we can predict the final typing results

with a large part of the individual exonic region missing.

Xiao

--
You received this message because you are subscribed to the Google Groups "Broad Viral Tool Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to viral-tool-use...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
- Xiao

image.png

Xiao Yang

unread,

Oct 3, 2013, 2:28:35 PM10/3/13

to viral-to...@googlegroups.com

Hi Andrew,

After inspection and correspondence with other users.

I found a major problem with input BAM files, which

requires I put some instructions in terms of further processing.

I'm currently working on a fix for this issue. Will release

a new version shortly when this problem is fixed. So, it is

likely the samples you had might actually be able to be typed

by ATHLATES.

Thanks,

Xiao

--
- Xiao

image.png

Smitha Shankar

unread,

Feb 7, 2014, 3:26:14 PM2/7/14

to viral-to...@googlegroups.com

Hello, I am facing the same issues as Andrew. Has this been fixed?

Xiao Yang

unread,

Feb 7, 2014, 3:46:22 PM2/7/14

to viral-to...@googlegroups.com

Hi Smitha,

I have released a new documentation on the website (for which you can check).

Could you check it out ?

Thanks,

Xiao

Smitha Shankar

unread,

Feb 7, 2014, 3:58:31 PM2/7/14

to viral-to...@googlegroups.com

Hello Xiao, I followed the steps given on the new documentation, but I still get this warning in the log file:

Input BAM file should be sorted by queryname

As suggested in the document I used this:

sort -k 1,1 -k 3,3 [A.sam] > [A.sort.sam]

samtools -view -bS [A.sort.sam] > [A.sort.bam]

The sorting actually moves the headers to the end of the file and therefore the sam cannot be converted to bam. Therefore after the sorting step i moved the headers back to the top of the file before I converted it to bam.

Am I doing something wrong?

Would you like to see my bam file?

Smitha Shankar

unread,

Feb 7, 2014, 4:55:13 PM2/7/14

to viral-to...@googlegroups.com

These steps work fine:

samtools view -b -L A_nucgen.bed [output.sort.bam] > [A.bam]

samtools view -h -o [A.sam] [A.bam]

Where as this step sort -k 1,1 -k 3,3 [A.sam] > [A.sort.sam] creates a sorted sam file with all the headers at the end of the file, due to which I was not able to convert it to the A.sort.bam file as suggested in the next step.

Therefore I put the headers back on top and then converted it to a bam file which seems to give a empty bam typing.txt file.

I have used the latest manual created on Oct 3 2013.

Xiao Yang

unread,

Feb 9, 2014, 11:59:08 AM2/9/14

to viral-to...@googlegroups.com

On Fri, Feb 7, 2014 at 4:55 PM, Smitha Shankar <smitha...@gmail.com> wrote:

These steps work fine:

samtools view -b -L A_nucgen.bed [output.sort.bam] > [A.bam]

samtools view -h -o [A.sam] [A.bam]
Where as this step sort -k 1,1 -k 3,3 [A.sam] > [A.sort.sam] creates a sorted sam file with all the headers at the end of the file, due to which I was not able to convert it to the A.sort.bam file as suggested in the next step.

You could use LC_ALL=C sort -k 1,1 -k 3,3 [A.sam] > [A.sort.sam]

to avoid locales been used by sort.

--
- Xiao

Shing Zhan

unread,

Apr 2, 2014, 4:46:04 PM4/2/14

to viral-to...@googlegroups.com

Dear Xiao,

I tried the following steps to prepare BAM files for ATHLATES as mentioned by Smitha above and with your recommendations:

samtools view -b -L A_nucgen.bed test_novoalign_sort.bam > test_a.bam

samtools view -h test_a.bam > test_a.sam

LC_ALL=C sort -k 1,1 -k 3,3 test_a.sam > test_a.sorted.sam

samtools view -bS test_a.sorted.sam > test_a.sorted.bam

However, I still get the warning:

"[WARNING]: test_a.sorted.bam sort order: unsorted"

I tested the same procedure on the 1000 Genomes exomes for HG01756 and HG01757 which were used in your paper.

And I was able to get the same HLA-A and HLA-B allelic pairs as in your paper.

Is the warning really an issue (well, it seems not)? Can I simply ignore it? I am actually using the first version of ATHLATES you released.

Any input would be much appreciated.

Thank you very much,

Shing

Xiao Yang

unread,

Apr 2, 2014, 4:52:21 PM4/2/14

to viral-to...@googlegroups.com

Hi Shing,

I responded below.

On Wed, Apr 2, 2014 at 4:46 PM, Shing Zhan <shing...@gmail.com> wrote:

Dear Xiao,

I tried the following steps to prepare BAM files for ATHLATES as mentioned by Smitha above and with your recommendations:

samtools view -b -L A_nucgen.bed test_novoalign_sort.bam > test_a.bam
samtools view -h test_a.bam > test_a.sam
LC_ALL=C sort -k 1,1 -k 3,3 test_a.sam > test_a.sorted.sam

samtools view -bS test_a.sorted.sam > test_a.sorted.bam

However, I still get the warning:

"[WARNING]: test_a.sorted.bam sort order: unsorted"

I tested the same procedure on the 1000 Genomes exomes for HG01756 and HG01757 which were used in your paper.

And I was able to get the same HLA-A and HLA-B allelic pairs as in your paper.

Is the warning really an issue (well, it seems not)? Can I simply ignore it?

The warning message is a result from inproper setting of the flags after sorted by Samtools.

And yes you can simply ignore it.

I am actually using the first version of ATHLATES you released.

I recommend you to use the newer version and our documentation also has a major change.

In addition, recently, we've noticed that to prepare initial bam file as ATHLATES input, if

no mismatch is tolerated (-t 10 in novoalign), we would be able to assembly more complete

exons (hence achieving a better results). These adjustments have already been included

in the newest documentation.

For more options, visit https://groups.google.com/d/optout.

--
- Xiao

Shing Zhan

unread,

Apr 2, 2014, 6:55:24 PM4/2/14

to viral-to...@googlegroups.com

Thank you very much for the quick reply, Xiao!

I still have a few more questions/concerns.

1) Now that you mentioned that using '-t 10' instead o '-t 30' yields better results, I am wondering whether I should redo all the read alignments. So far, I have used this command, based on older documentation and your paper.

novoalign -d ref.nix -f rd1.fq rd2.fq -F STDFQ -o SAM -t 30 -r all -l 60 -e 1 | samtools view -bS -h -F 4 - > output.bam

* Working with 75 bp reads. So, I set '-l 60'.

The problem is that redoing all the alignments is not feasible resource-wise for me right now. So, is there a way that you would recommend around it to improve the results? For example, extract reads already aligned using '-t 30' and then realign using '-t 10'?

2) Sorting by only read name, i.e.

samtools view -b -L A_nucgen.bed test_novoalign_sort.bam > test_a.bam

samtools sort -n test_a.bam test_a.sorted

instead of sorting by read name and then reference name (as in my previous post) resulted in different average coverage numbers ('Avg_cov' in output file), although the nominated alleles are largely identical.

Is this of concern? I don't quite get why the average coverage numbers differ.

Again thank you very much,

Shing

Xiao Yang

unread,

Apr 3, 2014, 1:01:51 PM4/3/14

to viral-to...@googlegroups.com

Hi Shing,

I responded below.

On Wed, Apr 2, 2014 at 6:55 PM, Shing Zhan <shing...@gmail.com> wrote:

Thank you very much for the quick reply, Xiao!

I still have a few more questions/concerns.

1) Now that you mentioned that using '-t 10' instead o '-t 30' yields better results, I am wondering whether I should redo all the read alignments. So far, I have used this command, based on older documentation and your paper.

novoalign -d ref.nix -f rd1.fq rd2.fq -F STDFQ -o SAM -t 30 -r all -l 60 -e 1 | samtools view -bS -h -F 4 - > output.bam

* Working with 75 bp reads. So, I set '-l 60'.

The problem is that redoing all the alignments is not feasible resource-wise for me right now. So, is there a way that you would recommend around it to improve the results? For example, extract reads already aligned using '-t 30' and then realign using '-t 10'?

I guess that should work. But it is very important to retain read pair information.

2) Sorting by only read name, i.e.

samtools view -b -L A_nucgen.bed test_novoalign_sort.bam > test_a.bam
samtools sort -n test_a.bam test_a.sorted

instead of sorting by read name and then reference name (as in my previous post) resulted in different average coverage numbers ('Avg_cov' in output file), although the nominated alleles are largely identical.

Is this of concern? I don't quite get why the average coverage numbers differ.

Yes, this is a concern. The reason to update the documentation is that I realized my earlier

assumption of intended input format was wrong. So, it is necessary to follow the newer

documentation.

--
- Xiao

Shing Zhan

unread,

Apr 3, 2014, 1:36:47 PM4/3/14

to viral-to...@googlegroups.com

Thanks very much again, Xiao.

Instead of redoing the novoalign step, I am considering to retain reads with 'NM:i:0' only, therefore excluding reads with mismatches (regardless of their base qualities). I believe this is approximate to doing the alignments with '-t 10' (or perhaps even more strict).

Thanks again,

Shing

Yan Li

unread,

Jun 23, 2014, 5:22:35 PM6/23/14

to viral-to...@googlegroups.com

Hi Xiao,

When I follow the steps with suggested "LC_ALL=C sort -k 1,1 -k 3,3 test_a.sam > test_a.sorted.sam" as below:

samtools view -b -L A_nucgen.bed test_novoalign_sort.bam > test_a.bam

samtools view -h test_a.bam > test_a.sam

LC_ALL=C sort -k 1,1 -k 3,3 test_a.sam > test_a.sorted.sam

I got error message:
[samopen] SAM header is present: 9415 sequences.
[sam_read1] reference 'SN:HLA:HLA08800 AS:hlall.nix LN:270

gnMPI -d /home/cliu/HLA_SEQUENCES/hlall.nix -f HG01756_1.filt.fastq.gz HG01756_2.filt.fastq.gz -t 30 -o SAM -r all -l 80 -e 1 -i 230 140
' is recognized as '*'.
[main_samview] truncated file.

The sorted sam files cannot be coverted to the bam files. Do you by any chance know what cause this error? Or anybody else has same problem as me?

Any comments can be big help.

Thanks,
Yan

Xiao Yang

unread,

Jun 23, 2014, 6:41:30 PM6/23/14

to viral-to...@googlegroups.com

Hi Yan,

I'm a bit puzzled here.

Could you confirm if gnMPI the proper command to use and /home/cliu/HLA_SEQUENCES/hlall.nix

is where you built your index file ?

Xiao

Yan Li

unread,

Jun 23, 2014, 8:14:39 PM6/23/14

to viral-to...@googlegroups.com

Hi Xiao,

I think sorted sam file has header of "gnMPI the proper command to use and /home/cliu/HLA_SEQUENCES/

hlall.nix"
***************
[yli2@cri12in02 stefaniSpranger-HLAtyping]$ head A.sort.sam
@HD    VN:1.0    SO:unsorted
@PG    ID:novoalignMPI    VN:V2.07.07    CL:novoalignMPI -d /home/cliu/HLA_SEQUENCES/hlall.nix -f HG01756_1.filt.fastq.gz HG01756_2.filt.fastq.gz -t 30 -o SAM -r all -l 80 -e 1 -i 230 140
@SQ    SN:HLA:HLA00001    AS:hlall.nix    LN:1098
@SQ    SN:HLA:HLA00001    AS:hlall.nix    LN:3503
@SQ    SN:HLA:HLA00002    AS:hlall.nix    LN:1098
@SQ    SN:HLA:HLA00003    AS:hlall.nix    LN:1098
@SQ    SN:HLA:HLA00003    AS:hlall.nix    LN:3503
*************

Basically, I use your provided novoalign file "HG01756hlall.bam" located in /demo/HG01756/ to demo HLA typing followed the section 2 from step 3 to 4.

I do not have problems until $sort -k 1,1 -k 3,3 A.sam > A.sort.sam

Then I ran $samtools view -bS A.sort.sam > A.sort.bam

I got error message
***********

[samopen] SAM header is present: 9415 sequences.
[sam_read1] reference 'SN:HLA:HLA08800 AS:hlall.nix LN:270

gnMPI -d /home/cliu/HLA_SEQUENCES/

hlall.nix -f HG01756_1.filt.fastq.gz HG01756_2.filt.fastq.gz -t 30 -o SAM -r all -l 80 -e 1 -i 230 140
' is recognized as '*'.
[main_samview] truncated file.

**********

Do I use wrong alignment file to demo? Or anywhere I made mistake to demo HLA typing?

Thank,
Yan

Xiao Yang

unread,

Jun 23, 2014, 9:20:34 PM6/23/14

to viral-to...@googlegroups.com

Hi Yan,

I just checked the demo folder, there are no sam file in the folder only the following files.

HG01756_a.sort.bam

HG01756_b.sort.bam

HG01756_c.sort.bam

HG01756_na.sort.bam

HG01756_nb.sort.bam

HG01756_nc.sort.bam

HG01756_nq.sort.bam

HG01756_nr.sort.bam

HG01756_q.sort.bam

HG01756_r.sort.bam

HG01756hlall.bam

So I am not sure where you get A.sam from ? If you follow section 4 the walk through step 2,

then you should be able to apply typing to the demo files.

Yan Li

unread,

Jun 24, 2014, 10:51:47 AM6/24/14

to viral-to...@googlegroups.com

Hi Xiao,

That is exactly what I do. I took novoalign result in the folder "demo/HG01756/HG01756hlall.bam", and follow the section 2 step 3 and 4.

No error for step 3, but the last part of step 4 with command "$ samtools view -bS [A.sort.sam] > [A.sort.bam]", in which sorted sam file need to be converted into bam files, I got error message as below:

On Monday, June 23, 2014 8:20:34 PM UTC-5, Xiao Yang wrote:

Hi Yan,

I just checked the demo folder, there are no sam file in the folder only the following files.

HG01756_a.sort.bam
HG01756_b.sort.bam
HG01756_c.sort.bam
HG01756_na.sort.bam
HG01756_nb.sort.bam
HG01756_nc.sort.bam
HG01756_nq.sort.bam
HG01756_nr.sort.bam
HG01756_q.sort.bam
HG01756_r.sort.bam

HG01756hlall.bam

So I am not sure where you get A.sam from ? If you follow section 4 the walk through step 2,
then you should be able to apply typing to the demo files.

***********
[samopen] SAM header is present: 9415 sequences.
[sam_read1] reference 'SN:HLA:HLA08800 AS:hlall.nix LN:270

gnMPI -d /home/cliu/HLA_SEQUENCES/
hlall.nix -f HG01756_1.filt.fastq.gz HG01756_2.filt.fastq.gz -t 30 -o SAM -r all -l 80 -e 1 -i 230 140
' is recognized as '*'.
[main_samview] truncated file.
**********

You suggested Shing to use "LC_ALL=C sort -k 1,1 -k 3,3 test_a.sam > test_a.sorted.sam" before command "samtools view -bS test_a.sorted.sam > test_a.sorted.bam". I did the same thing, but still got the same error.

Do you by any chance know what is going on?

Thanks,
Yan

Yan Li

unread,

Jun 24, 2014, 11:04:43 AM6/24/14

to viral-to...@googlegroups.com

Hi Xiao,

In case my previous reply is not clear, here is what I did to demo typing.

I followed the section 4 go through section 2 step 3 and 4 with novoalign bam file in “demo/HG01756/HG01756hlall.bam”.

No problem for step 2 in the section 2 - “$ samtools sort [output.bam] [output.sort]”

But when I reach step 3 in the section3 until command “$samtools view -bS [A.sort.sam] > [A.sort.bam]”, I got error message
*****************************

$ samtools view -bS A.sort.sam > A.sort.bam

[samopen] SAM header is present: 9415 sequences.
[sam_read1] reference 'SN:HLA:HLA08800 AS:hlall.nix LN:270

gnMPI -d /home/cliu/HLA_SEQUENCES/hlall.nix -f HG01756_1.filt.fastq.gz HG01756_2.filt.fastq.gz -t 30 -o SAM -r all -l 80 -e 1 -i 230 140

' is recognized as '*'.
[main_samview] truncated file.

********************************

If I ignore this message, keep going through until typing with command "typing -bam ~/A.sort.bam -exlbam ~/non-A.sort.bam -msa /db/msa/A_nuc.txt -o ~/HGtest_a > ~/HGtest_a.log.txt”, no output files can be generated and error message in the log file is as below:

****************************
--------------------------------------------------------
Running cmd:
typing -bam testA.sort.bam -exlbam testnonA.sort.bam -msa /db/msa/A_nuc.txt -hd 2 -o ~/HGtest-a
--------------------------------------------------------

Obtain (un)paired-read names in testnonA.sort.bam

[EXIT]: gather_alignments: Failed to locate bam index of file testnonA.sort.bam
****************************

Do you by any chance know what is going on, where did I make mistake?

Looking forward to your reply.

Thank,
Yan

Ekaterina Esaulova

unread,

Mar 14, 2016, 3:54:14 PM3/14/16

to Broad Viral Tool Users

Hello,

I’m trying to use Athlates for typing HLA class II. I decided to test my pipeline on HLA class I typing to make sure that it runs properly, so I took Opty-type article (http://bioinformatics.oxfordjournals.org/content/30/23/3310.full.pdf) - they also run Athlates on 1000 genomes data, so they have typing results.

I run Polysolver and Athlates, but for some reason Athlates is not working well; sometimes the data output is empty, and there are a lot of homozygotes wrong results. The most confusing part is that sometimes results are correct and sometimes not - with the same pioeline. I tried to figure it out by myself or find people with same issues, but I failed.

Could you help me out with that?

My main questions is:

Why does Athlates sometimes give correct results, but mostly wrong or no results at all with the same pipeline?
Can I replace the sorting resulting sam files not with samtools, but with novosort -n?

My pipeline looks like this (I tried both single and paired end options):

1000 genomes provide bam files so first I extracting fq files:
# input for bamtofastq must be sorted by name
novosort -n -m 24 -c 4 initial.file.bam >  file.sort.bam

bedtools bamtofastq -i  file.qsort.bam -fq  file.end1.tmp.fq -fq2  file.end2.tmp.fq
# Clean up the read names from the paired end (1 or 2) indicator
awk '{ if ( (NR - 1) % 4 == 0) gsub(/\/[12]$/,"",\$1); print \$1 } '  file.end1.tmp.fq >  file.end1.fq
awk '{ if ( (NR - 1) % 4 == 0) gsub(/\/[12]$/,"",\$1); print \$1 } '  file.end2.tmp.fq >  file.end2.fq
For novoalign I count the mean and sd for fragment size based on 9 column in bam file;
novoalign -c 4 -d ./db/ref/ref.nix -f file end1.fq file.end2.fq -t 30 -o SAM -r all -l 40 -e 100 -i PE $MEAN,$SD | samtools view -bS -h -F 4 - > file.bam
samtools sort -@ 4 -o file.sorted.bam file.bam
Then I preparing files for athlates and run program:
for allele in "A" "B" "C"
do
samtools view -b -o file.\$allele.bam -L /scratch/kate/cancer_project/Athlates_2014_04_26/db/bed/hla.\$allele.bed file.sorted.bam
novosort -n -m 24 -c 4 -t $WDIR file.\$allele.bam > file.\$allele.sort.bam

samtools view -b -o file.non-\$allele.bam -L /scratch/kate/cancer_project/Athlates_2014_04_26/db/bed/hla.non-\$allele.bed file.sorted.bam
novosort -n -m 24 -c 4 -t $WDIR $NAME.non-\$allele.bam > $NAME.non-\$allele.sort.bam
./bin/typing -bam file.\$allele.sort.bam -exlbam file.non-\$allele.sort.bam -msa ./db/msa/\${allele}_nuc.txt -o file.hla-\$allele.result
done

I also tried tutorial way with "sort -k 1,1 -k 3,3” and LC_ALL=C, but sam header always in the bottom of file. Then I split sam file by header and body, sort them separately and then merge and get the same result.

I would greatly appreciate any clues you may have to help me solve this problem.