Hey,
I ran the following commands to push multipe bam files into elprep filter. Reason : sfm supports multiple files in a folder, but "filter" only takes one file. Both input file were valid bam files, sorted by coordinate (same happens with unsorted input bams). The resulting BAM generated by elprep is no longer sorted correctly :
# extract headers for both files:
samtools view --no-PG -H ~/elprep_streaming/bams/wesid-226998-m_S1_L001_sorted.bam > header.1.bam
samtools view --no-PG -H ~/elprep_streaming/bams/wesid-226998-m_S1_L002_sorted.bam > header.2.bam
# merge headers to have RG info from both files
samtools merge --no-PG -o full_header.bam header.1.bam header.2.bam
# use samtools cat to quickly combine both files:
samtools cat -o input.bam -h full_header.bam ~/elprep_streaming/bams/wesid-226998-m_S1_L001_sorted.bam ~/elprep_streaming/bams/wesid-226998-m_S1_L002_sorted.bam
# note : goals is to use named pipes here for the catted bam file, but used real files for testing
# run elprep
elprep filter input.bam result.bam --mark-duplicates --remove-duplicates --mark-optical-duplicates elprep.metrics -
-sorting-order coordinate --nr-of-threads 36
=> Elprep runs fine, no errors/issues. But when I try to index the bam file afterwards, samtools (1.14) complains:
samtools index result.bam
[E::hts_idx_push] NO_COOR reads not in a single block at the end 0 -1
[E::sam_index] Read 'A00982:335:H2WF2DMXY:2:1201:31015:13150' with ref_name='chrM', ref_length=16571, flags=163, pos=1 cannot be indexed
Header file of the result.bam looks ok:
@HD VN:1.6 SO:coordinate
@SQ SN:chrM LN:16571
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:171115067
@SQ SN:chr7 LN:159138663
@SQ SN:chr8 LN:146364022
@SQ SN:chr9 LN:141213431
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
@SQ SN:chr12 LN:133851895
@SQ SN:chr13 LN:115169878
@SQ LN:107349540 SN:chr14
@SQ SN:chr15 LN:102531392
@SQ SN:chr16 LN:90354753
@SQ SN:chr17 LN:81195210
@SQ SN:chr18 LN:78077248
@SQ SN:chr19 LN:59128983
@SQ SN:chr20 LN:63025520
@SQ SN:chr21 LN:48129895
@SQ SN:chr22 LN:51304566
@SQ SN:chrX LN:155270560
@SQ LN:59373566 SN:chrY
@RG PL:ILLUMINA SM:wesid-226998-m ID:wesid-226998-m_S1_L001 CN:CMG
@RG ID:wesid-226998-m_S1_L002 CN:CMG PL:ILLUMINA SM:wesid-226998-m
@PG PN:bwa VN:0.7.17-r1188 CL:bwa mem -M -t 8 -R @RG\tID:wesid-226998-m_S1_L001\tCN:CMG\tPL:ILLUMINA\tSM:wesid-226998-m /tmp/scratch/uza-cloud-wes-data-001/genomes/hg19/genome/hg19_ref_genome /tmp/scratch/uza-cloud-wes-temp-001/wes-tmp/WES/
b7cb342c-be1b-4b33-9d75-d822f4d40acd/call-PREPSubworkflow/shard-4/PreprocessingWF/42ae185f-e467-4362-9041-58e9547b5670/call-cutadapt/execution/wesid-226998-m--20211209-011336--analysis-2/tmp_files/FASTQ_TRIMMED/wesid-226998-m_S1_L001_R1_trimmed.
fastq.gz /tmp/scratch/uza-cloud-wes-temp-001/wes-tmp/WES/b7cb342c-be1b-4b33-9d75-d822f4d40acd/call-PREPSubworkflow/shard-4/PreprocessingWF/42ae185f-e467-4362-9041-58e9547b5670/call-cutadapt/execution/wesid-226998-m--20211209-011336--analysis-2/t
mp_files/FASTQ_TRIMMED/wesid-226998-m_S1_L001_R2_trimmed.fastq.gz ID:bwa
@PG ID:samtools PN:samtools PP:bwa VN:1.10 CL:samtools view -u -S -h -t /tmp/scratch/uza-cloud-wes-data-001/genomes/hg19/genome/hg19_ref_genome.fasta.fai
@PG PP:samtools VN:1.10 CL:samtools sort -@ 8 -m 2000M ID:samtools.1 PN:samtools
@PG PN:bwa VN:0.7.17-r1188 CL:bwa mem -M -t 8 -R @RG\tID:wesid-226998-m_S1_L002\tCN:CMG\tPL:ILLUMINA\tSM:wesid-226998-m /tmp/scratch/uza-cloud-wes-data-001/genomes/hg19/genome/hg19_ref_genome /tmp/scratch/uza-cloud-wes-temp-001/wes-tmp/WES/
b7cb342c-be1b-4b33-9d75-d822f4d40acd/call-PREPSubworkflow/shard-5/PreprocessingWF/9ada6bbb-32c2-426e-bf64-e4de1ada8ac1/call-cutadapt/execution/wesid-226998-m--20211209-011336--analysis-2/tmp_files/FASTQ_TRIMMED/wesid-226998-m_S1_L002_R1_trimmed.
fastq.gz /tmp/scratch/uza-cloud-wes-temp-001/wes-tmp/WES/b7cb342c-be1b-4b33-9d75-d822f4d40acd/call-PREPSubworkflow/shard-5/PreprocessingWF/9ada6bbb-32c2-426e-bf64-e4de1ada8ac1/call-cutadapt/execution/wesid-226998-m--20211209-011336--analysis-2/t
mp_files/FASTQ_TRIMMED/wesid-226998-m_S1_L002_R2_trimmed.fastq.gz ID:bwa-1936FB4F
@PG PP:bwa-1936FB4F VN:1.10 CL:samtools view -u -S -h -t /tmp/scratch/uza-cloud-wes-data-001/genomes/hg19/genome/hg19_ref_genome.fasta.fai ID:samtools-2E43D7C4 PN:samtools
@PG ID:samtools.1-2B680488 PN:samtools PP:samtools-2E43D7C4 VN:1.10 CL:samtools sort -@ 8 -m 2000M
@PG PN:samtools PP:samtools.1 VN:1.14 CL:samtools cat -o input.bam -h full_header.bam /home/gvandeweyer/elprep_streaming/bams/wesid-226998-m_S1_L001_sorted.bam /home/gvandeweyer/elprep_streaming/bams/wesid-226998-m_S1_L002_sorted.bam I
D:samtools.2
@PG CL:samtools cat -o input.bam -h full_header.bam /home/gvandeweyer/elprep_streaming/bams/wesid-226998-m_S1_L001_sorted.bam /home/gvandeweyer/elprep_streaming/bams/wesid-226998-m_S1_L002_sorted.bam ID:samtools.3 PN:samtools PP:sa
mtools.1-2B680488 VN:1.14
@PG ID:elprep 5.1.2 PN:elprep VN:5.1.2 DS:
http://github.com/exascience/elprep CL:elprep filter input.bam result.bam --mark-duplicates --mark-optical-duplicates elprep.metrics --optical-duplicates-pixel-distance 100 --remove-dup
licates --sorting-order coordinate --nr-of-threads 36 PP:samtools.2
When I look at the order of chromosomes in the bam file, it looks like the sorting of both data-groups (RGIDs) was not combined, but kept as is.
samtools view result.bam | cut -f 3 | uniq -c
13483 chrM
4371566 chr1
3078755 chr2
2359612 chr3
1725225 chr4
1888285 chr5
2149297 chr6
2191086 chr7
1428463 chr8
1844162 chr9
1784199 chr10
2432695 chr11
2118172 chr12
758308 chr13
1357263 chr14
1674054 chr15
2122710 chr16
2573572 chr17
671917 chr18
2814644 chr19
1020045 chr20
474698 chr21
1077687 chr22
1978281 chrX
45873 chrY
25236 *
13830 chrM
4454413 chr1
3129882 chr2
2400898 chr3
1756700 chr4
1921463 chr5
2189527 chr6
2231373 chr7
1453118 chr8
1882811 chr9
1817069 chr10
2484724 chr11
2154547 chr12
772038 chr13
1377865 chr14
1707827 chr15
2165727 chr16
2631472 chr17
684686 chr18
2871121 chr19
1035488 chr20
483627 chr21
1098829 chr22
2016104 chrX
47216 chrY
25592 *
Any ideas what might be the problem ?
Best,
Geert