elprep 5.1.2 sorting issue

24 views
Skip to first unread message

geertva...@gmail.com

unread,
Feb 16, 2022, 5:17:28 AM2/16/22
to elprep
Hey, 

I ran the following commands to push multipe bam files into elprep filter. Reason : sfm supports multiple files in a folder, but "filter" only takes one file. Both input file were valid bam files, sorted by coordinate (same happens with unsorted input bams). The resulting BAM generated by elprep is no longer sorted correctly : 

# extract headers for both files:
samtools view --no-PG -H ~/elprep_streaming/bams/wesid-226998-m_S1_L001_sorted.bam > header.1.bam

samtools view --no-PG -H ~/elprep_streaming/bams/wesid-226998-m_S1_L002_sorted.bam > header.2.bam

# merge headers to have RG info from both files
samtools merge --no-PG -o full_header.bam header.1.bam header.2.bam

# use samtools cat to quickly combine both files:
samtools cat -o input.bam -h full_header.bam ~/elprep_streaming/bams/wesid-226998-m_S1_L001_sorted.bam ~/elprep_streaming/bams/wesid-226998-m_S1_L002_sorted.bam
# note : goals is to use named pipes here for the catted bam file, but used real files for testing

# run elprep
 elprep filter input.bam result.bam --mark-duplicates --remove-duplicates --mark-optical-duplicates elprep.metrics -
-sorting-order coordinate --nr-of-threads 36


=> Elprep runs fine, no errors/issues. But when I try to index the bam file afterwards, samtools (1.14) complains: 

samtools index result.bam  
[E::hts_idx_push] NO_COOR reads not in a single block at the end 0 -1
[E::sam_index] Read 'A00982:335:H2WF2DMXY:2:1201:31015:13150' with ref_name='chrM', ref_length=16571, flags=163, pos=1 cannot be indexed


Header file of the result.bam looks ok: 
@HD     VN:1.6  SO:coordinate
@SQ     SN:chrM LN:16571
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr2 LN:243199373
@SQ     SN:chr3 LN:198022430
@SQ     SN:chr4 LN:191154276
@SQ     SN:chr5 LN:180915260
@SQ     SN:chr6 LN:171115067
@SQ     SN:chr7 LN:159138663
@SQ     SN:chr8 LN:146364022
@SQ     SN:chr9 LN:141213431
@SQ     SN:chr10        LN:135534747
@SQ     SN:chr11        LN:135006516
@SQ     SN:chr12        LN:133851895
@SQ     SN:chr13        LN:115169878
@SQ     LN:107349540    SN:chr14
@SQ     SN:chr15        LN:102531392
@SQ     SN:chr16        LN:90354753
@SQ     SN:chr17        LN:81195210
@SQ     SN:chr18        LN:78077248
@SQ     SN:chr19        LN:59128983
@SQ     SN:chr20        LN:63025520
@SQ     SN:chr21        LN:48129895
@SQ     SN:chr22        LN:51304566
@SQ     SN:chrX LN:155270560
@SQ     LN:59373566     SN:chrY
@RG     PL:ILLUMINA     SM:wesid-226998-m       ID:wesid-226998-m_S1_L001       CN:CMG
@RG     ID:wesid-226998-m_S1_L002       CN:CMG  PL:ILLUMINA     SM:wesid-226998-m
@PG     PN:bwa  VN:0.7.17-r1188 CL:bwa mem -M -t 8 -R @RG\tID:wesid-226998-m_S1_L001\tCN:CMG\tPL:ILLUMINA\tSM:wesid-226998-m /tmp/scratch/uza-cloud-wes-data-001/genomes/hg19/genome/hg19_ref_genome /tmp/scratch/uza-cloud-wes-temp-001/wes-tmp/WES/
b7cb342c-be1b-4b33-9d75-d822f4d40acd/call-PREPSubworkflow/shard-4/PreprocessingWF/42ae185f-e467-4362-9041-58e9547b5670/call-cutadapt/execution/wesid-226998-m--20211209-011336--analysis-2/tmp_files/FASTQ_TRIMMED/wesid-226998-m_S1_L001_R1_trimmed.
fastq.gz /tmp/scratch/uza-cloud-wes-temp-001/wes-tmp/WES/b7cb342c-be1b-4b33-9d75-d822f4d40acd/call-PREPSubworkflow/shard-4/PreprocessingWF/42ae185f-e467-4362-9041-58e9547b5670/call-cutadapt/execution/wesid-226998-m--20211209-011336--analysis-2/t
mp_files/FASTQ_TRIMMED/wesid-226998-m_S1_L001_R2_trimmed.fastq.gz       ID:bwa
@PG     ID:samtools     PN:samtools     PP:bwa  VN:1.10 CL:samtools view -u -S -h -t /tmp/scratch/uza-cloud-wes-data-001/genomes/hg19/genome/hg19_ref_genome.fasta.fai
@PG     PP:samtools     VN:1.10 CL:samtools sort -@ 8 -m 2000M  ID:samtools.1   PN:samtools
@PG     PN:bwa  VN:0.7.17-r1188 CL:bwa mem -M -t 8 -R @RG\tID:wesid-226998-m_S1_L002\tCN:CMG\tPL:ILLUMINA\tSM:wesid-226998-m /tmp/scratch/uza-cloud-wes-data-001/genomes/hg19/genome/hg19_ref_genome /tmp/scratch/uza-cloud-wes-temp-001/wes-tmp/WES/
b7cb342c-be1b-4b33-9d75-d822f4d40acd/call-PREPSubworkflow/shard-5/PreprocessingWF/9ada6bbb-32c2-426e-bf64-e4de1ada8ac1/call-cutadapt/execution/wesid-226998-m--20211209-011336--analysis-2/tmp_files/FASTQ_TRIMMED/wesid-226998-m_S1_L002_R1_trimmed.
fastq.gz /tmp/scratch/uza-cloud-wes-temp-001/wes-tmp/WES/b7cb342c-be1b-4b33-9d75-d822f4d40acd/call-PREPSubworkflow/shard-5/PreprocessingWF/9ada6bbb-32c2-426e-bf64-e4de1ada8ac1/call-cutadapt/execution/wesid-226998-m--20211209-011336--analysis-2/t
mp_files/FASTQ_TRIMMED/wesid-226998-m_S1_L002_R2_trimmed.fastq.gz       ID:bwa-1936FB4F
@PG     PP:bwa-1936FB4F VN:1.10 CL:samtools view -u -S -h -t /tmp/scratch/uza-cloud-wes-data-001/genomes/hg19/genome/hg19_ref_genome.fasta.fai  ID:samtools-2E43D7C4    PN:samtools
@PG     ID:samtools.1-2B680488  PN:samtools     PP:samtools-2E43D7C4    VN:1.10 CL:samtools sort -@ 8 -m 2000M
@PG     PN:samtools     PP:samtools.1   VN:1.14 CL:samtools cat -o input.bam -h full_header.bam /home/gvandeweyer/elprep_streaming/bams/wesid-226998-m_S1_L001_sorted.bam /home/gvandeweyer/elprep_streaming/bams/wesid-226998-m_S1_L002_sorted.bam I
D:samtools.2
@PG     CL:samtools cat -o input.bam -h full_header.bam /home/gvandeweyer/elprep_streaming/bams/wesid-226998-m_S1_L001_sorted.bam /home/gvandeweyer/elprep_streaming/bams/wesid-226998-m_S1_L002_sorted.bam     ID:samtools.3   PN:samtools     PP:sa
mtools.1-2B680488       VN:1.14
@PG     ID:elprep 5.1.2 PN:elprep       VN:5.1.2        DS:http://github.com/exascience/elprep  CL:elprep filter input.bam result.bam --mark-duplicates --mark-optical-duplicates elprep.metrics --optical-duplicates-pixel-distance 100 --remove-dup
licates --sorting-order coordinate --nr-of-threads 36   PP:samtools.2


When I look at the order of chromosomes in the bam file, it looks like the sorting of both data-groups (RGIDs) was not combined, but kept as is. 

samtools view result.bam | cut -f 3 | uniq -c

  13483 chrM
4371566 chr1
3078755 chr2
2359612 chr3
1725225 chr4
1888285 chr5
2149297 chr6
2191086 chr7
1428463 chr8
1844162 chr9
1784199 chr10
2432695 chr11
2118172 chr12
758308 chr13
1357263 chr14
1674054 chr15
2122710 chr16
2573572 chr17
671917 chr18
2814644 chr19
1020045 chr20
474698 chr21
1077687 chr22
1978281 chrX
 45873 chrY
 25236 *
 13830 chrM
4454413 chr1
3129882 chr2
2400898 chr3
1756700 chr4
1921463 chr5
2189527 chr6
2231373 chr7
1453118 chr8
1882811 chr9
1817069 chr10
2484724 chr11
2154547 chr12
772038 chr13
1377865 chr14
1707827 chr15
2165727 chr16
2631472 chr17
684686 chr18
2871121 chr19
1035488 chr20
483627 chr21
1098829 chr22
2016104 chrX
 47216 chrY
 25592 *


Any ideas what might be the problem ?

Best, 
Geert

geertva...@gmail.com

unread,
Feb 16, 2022, 5:24:07 AM2/16/22
to elprep
EDIT : on closer investigation, this only seems to happen for PRE-SORTED BAM files, in contrast to what's stated above !  so : 
- BWA [lane 1-2] => samtools cat => elprep => samtools index : OK
- BWA [lane 1-2] => samtools sort [ lane 1-2] => samtools cat => elprep => samtools index => fail

Op woensdag 16 februari 2022 om 11:17:28 UTC+1 schreef geertva...@gmail.com:
Reply all
Reply to author
Forward
0 new messages