dedup from stdin

67 views
Skip to first unread message

oabio...@gmail.com

unread,
Oct 27, 2015, 12:30:48 PM10/27/15
to bamUtils
Dear all,

I am having an issue that I cannot figure out the fix for, so I am hoping one of you may have an insight into it. I am going to summarize the issue.

I am using samtools cat to concat small bam files, pipe that to samtools merge and pipe that to bamutils dedup, but I get a empty file at the end.

my command line is
/mnt/samtools-1.2/samtools cat -h headers.sam *.bam | /mnt/samtools-1.2/samtools sort -m 1G -l 0 -O bam -T /scratch/ -@4 | /mnt/bamutil-1.0.12/bin/bam dedup --in -.ubam --out -.bam --force --oneChrom --noPhoneHome > t.bam

[bam_sort_core] merging from 40 files...
Reading ReferenceID 0

Reading ReferenceID 1

Reading ReferenceID 2

Reading ReferenceID 3

Reading ReferenceID 4

Reading ReferenceID 5

Reading ReferenceID 6

Reading ReferenceID 7

Reading ReferenceID 8

Reading ReferenceID 9

Reading ReferenceID 10

Reading ReferenceID 11

Reading ReferenceID 12

Reading ReferenceID 13

Reading ReferenceID 14

Reading ReferenceID 15

Reading ReferenceID 16

Reading ReferenceID 17

Reading ReferenceID 18

Reading ReferenceID 19

Reading ReferenceID 20

Reading ReferenceID 21

Reading ReferenceID 22

Reading ReferenceID 23

Reading ReferenceID 24

Reading ReferenceID 25

Reading ReferenceID 26

Reading ReferenceID 27

Reading ReferenceID 28

Reading ReferenceID 29

Reading ReferenceID 30

Reading ReferenceID 31

Reading ReferenceID 32

Reading ReferenceID 33

Reading ReferenceID 34

Reading ReferenceID 35

Reading ReferenceID 36

Reading ReferenceID 37

Reading ReferenceID 38

Reading ReferenceID 39

Reading ReferenceID 40

Reading ReferenceID 41

Reading ReferenceID 42

Reading ReferenceID 43

Reading ReferenceID 44

Reading ReferenceID 45

Reading ReferenceID 46

Reading ReferenceID 47

Reading ReferenceID 48

Reading ReferenceID 49

Reading ReferenceID 50

Reading ReferenceID 51

Reading ReferenceID 52

Reading ReferenceID 53

Reading ReferenceID 54

Reading ReferenceID 55

Reading ReferenceID 56

Reading ReferenceID 57

Reading ReferenceID 58

Reading ReferenceID 59

Reading ReferenceID 60

Reading ReferenceID 61

Reading ReferenceID 62

Reading ReferenceID 63

Reading ReferenceID 64

Reading ReferenceID 65

Reading ReferenceID 66

Reading ReferenceID 67

Reading ReferenceID 68

Reading ReferenceID 69

Reading ReferenceID 70

Reading ReferenceID 71

Reading ReferenceID 72

Reading ReferenceID 73

Reading ReferenceID 74

Reading ReferenceID 75

Reading ReferenceID 76

Reading ReferenceID 77

Reading ReferenceID 78

Reading ReferenceID 79

Reading ReferenceID 80

Reading ReferenceID 81

Reading ReferenceID 82

Reading ReferenceID 83

Reading ReferenceID 84

Reading ReferenceID 85

Reading ReferenceID -1

--------------------------------------------------------------------------
SUMMARY STATISTICS OF THE READS
Total number of reads: 100314134
Total number of paired-end reads: 100314134
Total number of properly paired reads: 98358061
Total number of unmapped reads: 551978
Total number of reverse strand mapped reads: 50040303
Total number of QC-failed reads: 0
Total number of secondary reads: 101472
Size of singleKeyMap (must be zero): 0
Size of pairedKeyMap (must be zero): 0
Total number of missing mates: 0
Total number of reads excluded from duplicate checking: 653450
--------------------------------------------------------------------------
Sorting the indices of 712915 duplicated records
Exiting due to ERROR:
    FAIL_IO: Failed to read the BAM header length.

It ends with this FAIL_IO,

but,

if I generate a temp file on disk before bamUtils and use that as the input, all is well. So its not the output but how its output maybe?

/mnt/samtools-1.2/samtools cat -h test3e2_bwa_5_tFBzvH_headers.sam *dedup.bam | /mnt/samtools-1.2/samtools sort -m 1G -l 0 -O bam -T /scratch/ -@4 > t.bam; /mnt/bamutil-1.0.12/bin/bam dedup --in t.bam --out -.bam --force --oneChrom --noPhoneHome > ts.bam
[bam_sort_core] merging from 40 files...
Reading ReferenceID 0

Reading ReferenceID 1

Reading ReferenceID 2

Reading ReferenceID 3

Reading ReferenceID 4

Reading ReferenceID 5

Reading ReferenceID 6

Reading ReferenceID 7

Reading ReferenceID 8

Reading ReferenceID 9

Reading ReferenceID 10

Reading ReferenceID 11

Reading ReferenceID 12

Reading ReferenceID 13

Reading ReferenceID 14

Reading ReferenceID 15

Reading ReferenceID 16

Reading ReferenceID 17

Reading ReferenceID 18

Reading ReferenceID 19

Reading ReferenceID 20

Reading ReferenceID 21

Reading ReferenceID 22

Reading ReferenceID 23

Reading ReferenceID 24

Reading ReferenceID 25

Reading ReferenceID 26

Reading ReferenceID 27

Reading ReferenceID 28

Reading ReferenceID 29

Reading ReferenceID 30

Reading ReferenceID 31

Reading ReferenceID 32

Reading ReferenceID 33

Reading ReferenceID 34

Reading ReferenceID 35

Reading ReferenceID 36

Reading ReferenceID 37

Reading ReferenceID 38

Reading ReferenceID 39

Reading ReferenceID 40

Reading ReferenceID 41

Reading ReferenceID 42

Reading ReferenceID 43

Reading ReferenceID 44

Reading ReferenceID 45

Reading ReferenceID 46

Reading ReferenceID 47

Reading ReferenceID 48

Reading ReferenceID 49

Reading ReferenceID 50

Reading ReferenceID 51

Reading ReferenceID 52

Reading ReferenceID 53

Reading ReferenceID 54

Reading ReferenceID 55

Reading ReferenceID 56

Reading ReferenceID 57

Reading ReferenceID 58

Reading ReferenceID 59

Reading ReferenceID 60

Reading ReferenceID 61

Reading ReferenceID 62

Reading ReferenceID 63

Reading ReferenceID 64

Reading ReferenceID 65

Reading ReferenceID 66

Reading ReferenceID 67

Reading ReferenceID 68

Reading ReferenceID 69

Reading ReferenceID 70

Reading ReferenceID 71

Reading ReferenceID 72

Reading ReferenceID 73

Reading ReferenceID 74

Reading ReferenceID 75

Reading ReferenceID 76

Reading ReferenceID 77

Reading ReferenceID 78

Reading ReferenceID 79

Reading ReferenceID 80

Reading ReferenceID 81

Reading ReferenceID 82

Reading ReferenceID 83

Reading ReferenceID 84

Reading ReferenceID 85

Reading ReferenceID -1

--------------------------------------------------------------------------
SUMMARY STATISTICS OF THE READS
Total number of reads: 100314134
Total number of paired-end reads: 100314134
Total number of properly paired reads: 98358061
Total number of unmapped reads: 551978
Total number of reverse strand mapped reads: 50040303
Total number of QC-failed reads: 0
Total number of secondary reads: 101472
Size of singleKeyMap (must be zero): 0
Size of pairedKeyMap (must be zero): 0
Total number of missing mates: 0
Total number of reads excluded from duplicate checking: 653450
--------------------------------------------------------------------------
Sorting the indices of 712915 duplicated records

Writing -.bam
Successfully marked 19176 unpaired and 346869 paired duplicate reads

Dedup complete!


Any thoughts anyone?

Mary Kate Wing

unread,
Oct 27, 2015, 12:46:40 PM10/27/15
to bamutils

The current version of the deduper requires two passes through the input bam. One to identify the duplicates, one to mark them.  This two pass method does not work for streamed files.

I should add validation to reject streaming input to dedup.

--
You received this message because you are subscribed to the Google Groups "bamUtils" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bamutils+u...@googlegroups.com.
To post to this group, send email to bamu...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bamutils/245166c3-c43d-4c57-9226-58e682ec85de%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

oabio...@gmail.com

unread,
Oct 27, 2015, 10:14:10 PM10/27/15
to bamUtils
Makes perfect sense now, thank you.

Mary Kate Wing

unread,
Oct 27, 2015, 10:16:07 PM10/27/15
to bamutils
Glad to help.  Let me know if you have any additional questions.

Mary Kate Wing

oabio...@gmail.com

unread,
Oct 29, 2015, 11:06:18 AM10/29/15
to bamUtils
Actually I am not sure why this stream does not work for two pass files, what if you have an internal " | tee output " logic to create a tmp file only to read from it for the second pass and write back to stdout, wouldn't that save you one write and one read step via the stream?

Mary Kate Wing

unread,
Oct 29, 2015, 11:18:16 PM10/29/15
to bamutils
I may be misunderstanding your recommendation, which I believe is for dedup to create a temporary file when reading from a stream that it will then read from in the 2nd pass.
That is a potential update for the code, but the difference is 2 file reads vs 1 stream and 1 file read.  I don't think it would save a write since the temp file has to get written at some point, it is just a matter of when.



oabio...@gmail.com

unread,
Nov 2, 2015, 12:13:25 PM11/2/15
to bamUtils
I guess you are right in that it would not save one write, but practically, it may be more efficient in some pipelines where you stream and use dedicated tmp storage on local disk rather than distributed storage.

Thank you for the discussion
Best
Ogan
Reply all
Reply to author
Forward
0 new messages