fastafromBed intervals

970 views
Skip to first unread message

Alpesh Querer

unread,
Nov 11, 2013, 4:54:52 PM11/11/13
to bedtools...@googlegroups.com
Hi Aaron,

I want to extract fasta sequence with a bed file with coordinates of interest and fasta file containing contigs.
I also want flanking sequence 100bp on either side, the problem is the contig might start or end within that 100bp uspstream or downstream
and fastaFromBed throws an error

"Feature xxx  beyond the length of xxx size (214 bp).  Skipping. "

Is there a way of getting around the problem where it considers bases only till the beginning of the contig or the end of the contig if they fall in range?

Thanks,
Al

Aaron Quinlan

unread,
Nov 12, 2013, 7:46:18 AM11/12/13
to bedtools...@googlegroups.com
Hi Al,

One solution would be to create a "genome" file representing each of your contigs.  Then supply this to the "slop" tool - if you do so, the 100bp of slop added will be truncated based on the boundaries of your contig. This should then prevent the error you observe when extracting the intervals with the getfasta tool.

- Aaron






--
You received this message because you are subscribed to the Google Groups "bedtools-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bedtools-discu...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Alpesh Querer

unread,
Nov 12, 2013, 6:21:25 PM11/12/13
to bedtools...@googlegroups.com
Hi Aaron,

We do not have any information on the contigs, just the sequences, Is there any other solution ?

Thanks,
Al

Urban, John

unread,
Nov 12, 2013, 6:40:38 PM11/12/13
to bedtools...@googlegroups.com
Hi Al,

I think Aaron means that if you have a fasta file full of contigs, presumably with names, then you can create a genome file, which is just each contigs name and length in a tab delimited file.

e.g.
contig1     10000
contig2     5000
etc

One way to generate such a file would be to use "faSize -detailed contigs.fa" (where faSize is from the UCSC/Kent tools) to generate such a file.

Hope that helps.

Best,

   John
John Urban
Ph.D. Candidate
NSF Graduate Research Fellow
Brown University
Dept. of Molecular Biology, Cell Biology, and Biochemistry
Laboratory of Susan Gerbi, Sydney Frank Hall 257
Providence, RI 02906

Aaron Quinlan

unread,
Nov 12, 2013, 6:57:48 PM11/12/13
to bedtools...@googlegroups.com
Yes, this is exactly what I meant yet explained very poorly.  Thanks John.

- Aaron




Aaron Quinlan

unread,
Nov 12, 2013, 7:28:00 PM11/12/13
to bedtools...@googlegroups.com
An alternative to faSize is bioawk, a really nice genomics-focused extension of awk maintained by Heng Li (https://github.com/lh3/bioawk).

Using bioawk, one could do the following to provide a "genome" file (i.e., contig labe and size) of your contigs:

$ cat contigs.fa
>contig1
accagagcta
accatacctg
>contig2
tttttttttt
cccccccccc
aaaaaaaaaa

# note "awk" here is bioawk
$ awk -c fastx '{print $name"\t"length($seq)}' contigs.fa
contig1 20
contig2 30

$ awk -c fastx '{print $name"\t"length($seq)}' contigs.fa > contigs.fa.genome


Now, you can use these lengths to constrain the "slop"

$ bedtools slop -b 100 -i al.bed -g  contigs.fa.genome > al.slop00.bed

Now, extract the sequences:

$ bedtools getfasta -fi contigs.fa -b al.slop00.bed -fo stdout

On Nov 12, 2013, at 6:40 PM, "Urban, John" <john_...@brown.edu> wrote:

Alpesh Querer

unread,
Nov 13, 2013, 12:22:12 PM11/13/13
to bedtools...@googlegroups.com
Hi Aaron and John,

First of all, thank you for being so helpful.

I have created the genome file as instructed, now I`m getting the following error

# cat ex1.genome
comp4035_c0_seq1_len=214_path=[192:0-213]       214

# cat ex1.bed
comp4035_c0_seq1_len=214_path=[192:0-213]       50      65

 # bedtools slop -b 100 -i ex1.bed  -g ex1.genome
bedtools: symbol lookup error: /usr/lib64/libstdc++.so.6: undefined symbol: _ZNSt8messagesIcE2idE, version GLIBCXX_3.4


Is there something else I need to install beside bedtools?

Thanks,
Al



Aaron Quinlan

unread,
Nov 14, 2013, 9:48:27 AM11/14/13
to bedtools...@googlegroups.com
Hi Alpesh,

I tried the same command with your test data and everything works fine for me. I am not sure exactly what is going on, but it looks like there is an issue with a dynamica library that bedtools needs.  Are you running this command on a new machine perhaps?

- Aaron





António Miguel de Jesus Domingues

unread,
Feb 13, 2015, 9:39:50 AM2/13/15
to bedtools...@googlegroups.com
(refers to pybedtools.BedTool.sequence, but the error message is similar)

Hi Aaron,

sorry to ressuscitate this topic, but I am having a similar problem, that is, my script is breaking with the message:

Error message was:
c
(16596 bp).  Skipping.
Feature (chrM:16589-16599) beyond the length of chrM size (16596 bp).  Skipping.

The difference to the OP, is that I am not using slop, but rather I am extracting 5' and 3' location using the (quite useful) helper functions pybedtools.featurefuncs.five_prime and three_prime. I looked in manual, and also directly in the code, but could not find an option similar to 'chromsizes', as in:

c = a.slop(g=pybedtools.chromsizes('hg19'), b=100)

for the functions five_prime/three_prime. Was this by design? Or am I missing something?  Of course I could try and recreate those function with some sort of 'slop', but maybe there is a better solution.

Cheers,
António

Ryan Dale

unread,
Feb 17, 2015, 1:10:48 PM2/17/15
to bedtools...@googlegroups.com
Hi António -

That's correct, five_prime and three_prime don't currently have an option for chromsizes to bound their respective coordinate adjustments. 

If I remember correctly, I didn't add the bounds check in those functions for performance reasons. But they probably should at least have the option. I'll add the ability to accept chromsizes (either as a filename, dict, or assembly name).

-ryan
For more options, visit https://groups.google.com/d/optout.

ryan

unread,
Feb 22, 2015, 10:20:51 AM2/22/15
to bedtools...@googlegroups.com
Hi António -

I just added this to the development version of pybedtools. You can now provide a custom dictionary or one returned by pybedtools.chromsizes() to the five_prime, three_prime, and TSS functions in pybedtools.featurefuncs.  The returned intervals will be truncated to the provided chromosome sizes, just like slop.

Side note: For cases where features fall completely outside of the chromosome, these functions set the start and stop positions to the end of the chromosome.  This is different from the behavior of bedtools slop, which adjusts only the stop position.

-ryan

António Miguel de Jesus Domingues

unread,
Feb 22, 2015, 10:42:23 AM2/22/15
to bedtools...@googlegroups.com
Dear Ryan,

once again thank you for adding another feature. I will test it sometime next week and let you know if I run into problems.

Cheers,
António
 

--
You received this message because you are subscribed to a topic in the Google Groups "bedtools-discuss" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/bedtools-discuss/mmBumZEmd4U/unsubscribe.
To unsubscribe from this group and all its topics, send an email to bedtools-discu...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--
--
António Miguel de Jesus Domingues, MSc/PhD
Postdoc in René Ketting group
Institute of Molecular Biology (IMB)
Ackermannweg 4
55128 Mainz
Germany
Phone: +49--6131-39-21472
www.imb-mainz.de/research-at-imb/ketting/rna-interference/

[The unbearable lightness of Molecular Biology]

António Miguel de Jesus Domingues

unread,
Mar 6, 2015, 1:13:39 PM3/6/15
to bedtools...@googlegroups.com
Hi Ryan,

just to report that it is working.

Cheers,
António

Reply all
Reply to author
Forward
0 new messages