Bamtobed file not passing is.region.valid (and then segfaults on merge or intersect)

48 views
Skip to first unread message

Liana Engie

unread,
Apr 10, 2020, 1:09:38 PM4/10/20
to bedtools-discuss
Hello all,

I'm troubleshooting some problems I was having with intersect and merge; I've been using a terminal in Ubuntu as well as R. Ideally I will be trying to accomplish all of this in R. While tracing back, I realized that the bed files I have are failing is.valid.region but I can't tell why. I converted it to bed using bedtools, straight from the bam file generated by STAR alignment. The entire "chr" column uses "chr" prefixes, even if I test it on the first 6 lines it fails:

is.valid.region(head(bam2bed))
VALIDATE REGIONS
 * Checking input type... PASS
   Input seems to be in bed format but chr/start/end column names are missing
 * Check if index is a string... PASS
 * Check index pattern... FAIL
   Use check.chr = FALSE if no 'chr' prefix
[1] "c(\"chr1\", \"chr1\", \"chr1\", \"chr1\", \"chr1\", \"chr1\"):c(24, 119, 206, 264, 1023, 1094)-c(100, 194, 282, 340, 1099, 1170)"
  * Check for missing values... FAIL
                                                chr start end
1 c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1")    NA  NA
 * Check for larger start position... PASS.
 * Check if zero based... PASS
[1] FALSE

If I ignore all this and try to continue with the "invalid" files, I can't use merge or intersect even with small test files. All the advice online I've seen about segfaults are about memory size, but even using 100GB on my local cluster and trying to do an intersection between the two strands of the small 100 line file, I get a segfault. No error codes or anything, just a segfault. I've attached one of the test files here.

Thanks everyone, any advice would be appreciated.

Liana


firsthundred.bed

Aaron Quinlan

unread,
Apr 12, 2020, 10:44:52 AM4/12/20
to Liana Engie, bedtools...@googlegroups.com
Hi Liana,

Could you show the first few lines of the file that represents your “bamtobed” object in R?  I am not familiar with the is.valid.region() function, and thus don’t have much insight.

Apologies,
Aaron
--
You received this message because you are subscribed to the Google Groups "bedtools-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bedtools-discu...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bedtools-discuss/430f6b4e-dc5e-4d30-a822-bc12789bb5ba%40googlegroups.com.

Liana Engie

unread,
Apr 12, 2020, 5:40:40 PM4/12/20
to Aaron Quinlan, bedtools...@googlegroups.com
Hello all,

Here's what my bam file looks like:
> bam[1:20,]
GAlignmentPairs object with 20 pairs, strandMode=1, and 0 metadata columns:
       seqnames strand   :    ranges  --    ranges
          <Rle>  <Rle>   : <IRanges>  -- <IRanges>
   [1]     chr1      +   :    25-100  --   120-194
   [2]     chr1      +   :   207-282  --   265-340
   [3]     chr1      +   : 1024-1099  -- 1095-1170
   [4]     chr1      +   : 1482-1556  -- 1497-1572
   [5]     chr1      +   : 2536-2611  -- 2538-2613
   ...      ...    ... ...       ... ...       ...
  [16]     chr1      +   : 4095-4169  -- 4184-4257
  [17]     chr1      +   : 4181-4256  -- 4225-4300
  [18]     chr1      +   : 4681-4755  -- 4997-5071
  [19]     chr1      +   : 4687-4758  -- 4727-4802
  [20]     chr1      +   : 4751-4826  -- 4864-4939
  -------
  seqinfo: 1061 sequences from an unspecified genome


I've converted it to a bed file manually (constructing a dataframe) or with bedtools in the console.
> bam2bed <- fread("bedtools_converted_bam2bed.bed",data.table=TRUE)
> head(bam2bed)
    chr start  end                                         name score strand
1: chr1    24  100 NS500449:173:HH7L3BGXX:3:21504:14989:20106/1   255      +
2: chr1   119  194 NS500449:173:HH7L3BGXX:3:21504:14989:20106/2   255      -
3: chr1   206  282 NS500449:173:HH7L3BGXX:2:21210:13252:12427/1   255      +
4: chr1   264  340 NS500449:173:HH7L3BGXX:2:21210:13252:12427/2   255      -
5: chr1  1023 1099  NS500449:173:HH7L3BGXX:1:12210:20287:2447/1   255      +
6: chr1  1094 1170  NS500449:173:HH7L3BGXX:1:12210:20287:2447/2   255      -


This has issues with is.valid.region(), which I started to pay more attention to after the intersect began giving me problems. I had also tried simplifying it to a 6 column or 4 column bed and removing all non "chr" prefixed lines. The files are sorted and I should have plenty of memory allocated from my computer cluster. 
      V1   V2   V3 V4 V5 V6
1: chr 1   24  194  0  0  +
2: chr 1  206  340  0  0  +
3: chr 1 1023 1170  0  0  +
4: chr 1 1481 1572  0  0  +
5: chr 1 2535 3075  0  0  +
6: chr 1 3162 3441  0  0  +


             V1    V2    V3        V4
1451 chr1 11873 14409 NR_046018
1448 chr1 14361 29370 NR_024540
1443 chr1 17368 17436 NR_107063
1444 chr1 17368 17436 NR_128720
1446 chr1 17368 17436 NR_107062
1447 chr1 17368 17436 NR_106918


If I take even this 6 line bed and intersect it with another 6 line bed, I get a segfault. Either of these:
bedtools intersect -b 100linetest.bed 100linetest.bed -S -sorted > testresult.bed
bedtools intersect -b onlychr1.bed -S -sorted > onlychr1int.bed
Results in a segfault
# Segmentation fault (core dumped)

Liana

Milind Chauhan

unread,
Jan 31, 2023, 1:49:40 PM1/31/23
to bedtools-discuss
Hi, 
I am facing the same problem using bedtools. Did you find the solution to this?

Reply all
Reply to author
Forward
0 new messages