retrieving paired sequences from BAM

1,558 views
Skip to first unread message

Mic

unread,
Feb 20, 2012, 1:37:59 AM2/20/12
to Biopython Mailing List, pysam-us...@googlegroups.com
Hello,
I am trying to retrieve paired end sequences from BAM file. However, I am only able to retrieve the A sequence with the following code:
''''''''''''''''''''''''''''''''''''''''
import pysam
samfile = pysam.Samfile("b.sorted.bam", "rb")
for read in samfile.fetch():
        if read.is_paired:
                print read.qname
                print read.seq
samfile.close()

$ python a.py | head -n 2
HWI-ST226_0154:4:1206:12773:170407#CTTGTA
AGTATAAAACTAAGCAAACTGTTAGAACTTTGATTACGTTTTGTTTATCAGTGATACGCAAAAGTTTAAGATCCTTGAGTACCTCTTTCGATGGCGGATT
''''''''''''''''''''''''''''''''''''''''

How is it possible to retrieve also the B sequence?
What is the difference between is_proper_pair and is_paired?

-------------
$ grep -A 1 'HWI-ST226_0154:4:1206:12773:170407#CTTGTA' X_A.fastq 
@HWI-ST226_0154:4:1206:12773:170407#CTTGTA/1
AGTATAAAACTAAGCAAACTGTTAGAACTTTGATTACGTTTTGTTTATCAGTGATACGCAAAAGTTTAAGATCCTTGAGTACCTCTTTCGATGGCGGATT

$ grep -A 1 'HWI-ST226_0154:4:1206:12773:170407#CTTGTA' X_B.fastq
@HWI-ST226_0154:4:1206:12773:170407#CTTGTA/2
GGCGCTGTGACTGAAGTCCTCAGATTTCGGTACGGTTTTGTCTATTTCTGGGTTCCTGCGGAAACACCTTCTCGATTATTTTCTAATCTCAATTAGGTTT
-------------

Thank you in advance.

Cheers 

Ben Schiller

unread,
Feb 20, 2012, 1:46:36 AM2/20/12
to pysam-us...@googlegroups.com
is_proper_pair means both reads are mapped to the reference sequence
(genome). is_paired just means there was a paired read. the .mate()
method of the Samfile object will return an AlignedRead's pair partner
but will change the file position, so you may have to seek/tell to get
back to where you were.

Ben

Peter Cock

unread,
Feb 20, 2012, 2:55:45 AM2/20/12
to pysam-us...@googlegroups.com
On Mon, Feb 20, 2012 at 6:46 AM, Ben Schiller <ben.j.s...@gmail.com> wrote:
> is_proper_pair means both reads are mapped to the reference sequence
> (genome).

Well, it is actually dependent on the aligner - it could also require
the distance be appropriate for the library preparation (e.g. 3kp
apart), and in the correct orientation (e.g. inwards or outwards),
in order to be considered "properly" paired.

Lots of improperly paired reads in one region could be a sign
that there was be something locally wrong with the assembly.

> is_paired just means there was a paired read. the .mate()
> method of the Samfile object will return an AlignedRead's pair partner
> but will change the file position, so you may have to seek/tell to get
> back to where you were.

Are there any usage guidelines on doing this?

Peter

Mic

unread,
Feb 20, 2012, 3:15:27 AM2/20/12
to pysam-us...@googlegroups.com
I need is_proper_pair, because I am interested in reads which actually aligned to the reference genome. SOAP2 has been used for the alignment, but is it possible to be aligner independent? 

The only thing what I need is to retrieve A and B reads and whether the A or B read has been reversed and/or reversed complement.

$ samtools view b.sorted.bam | grep 'HWI-ST226_0154:4:1206:12773:170407#CTTGTA'
HWI-ST226_0154:4:1206:12773:170407#CTTGTA 97 A1r 1046 30 100M = 1382 436 AGTATAAAACTAAGCAAACTGTTAGAACTTTGATTACGTTTTGTTTATCAGTGATACGCAAAAGTTTAAGATCCTTGAGTACCTCTTTCGATGGCGGATT fdfffdfffbddaec_dSc^ddd^dQc^`Udddad_`c^^ac`R_NV\\]T^c`T_Mc]aV[V\R`Xa^^EKIIVcccc`YNY]UV[U`BBBBBBBBBBB NM:i:1 MD:Z:73T26
HWI-ST226_0154:4:1206:12773:170407#CTTGTA 145 A1r 1382 30 100M = 1046 -436 AAACCTAATTGAGATTAGAAAATAATCGAGAAGGTGTTTCCGCAGGAACCCAGAAATAGACAAAACCGTACCGAAATCTGAGGACTTCAGTCACAGCGCC BBBBBBBBBBBBBBBBBBBBBBBcca^`Vb[N`WNW`Ma^`[cYc\d[^fcQ`cc]Y[aX]]]SbY``cU]`bebc]`ccYb[S]^b]`fddOddbcb`f NM:i:2 MD:Z:2C19C77

Thank you in advance.

Mic

unread,
Feb 20, 2012, 5:50:28 AM2/20/12
to pysam-us...@googlegroups.com
I modified the code in the following way:
import pysam
samfile = pysam.Samfile("B.sorted.bam", "rb")
for read in samfile.fetch():
        if read.is_paired:
                print read.qname
                if read.is_read1:
                        print 'A: ', read.seq
                else:
                        print 'B: ', read.seq
samfile.close()


Unfortunately, I am still not able to catch the B read.


$ python a.py | head -n 3
HWI-ST226_0154:4:1206:12773:170407#CTTGTA
A:  AGTATAAAACTAAGCAAACTGTTAGAACTTTGATTACGTTTTGTTTATCAGTGATACGCAAAAGTTTAAGATCCTTGAGTACCTCTTTCGATGGCGGATT
HWI-ST226_0154:4:1207:19972:10892#CTTGTA
Traceback (most recent call last):
  File "a.py", line 5, in <module>
    print read.qname
IOError: [Errno 32] Broken pipe

What do I do wrong?

Kevin Jacobs <jacobs@bioinformed.com>

unread,
Feb 20, 2012, 9:38:02 AM2/20/12
to pysam-us...@googlegroups.com
On Mon, Feb 20, 2012 at 5:50 AM, Mic <mict...@gmail.com> wrote:
Unfortunately, I am still not able to catch the B read.


Here is code that has worked for me:

def main():
  filename = sys.argv[1]

  reads = pysam.Samfile(filename)

  for read in reads:
    if read.is_proper_pair and read.is_read1:
      pos = reads.tell()

      try:
        mate = reads.mate(read)
      except ValueError:
        # Invalid mate (usually post-filtered)
        continue
      finally:
        reads.seek(pos)

      print read.qname,'read1'
      print mate.qname,'read2'
      print
 
Note that with BAMs created using CASAVA 1.8 using certain aligners, the qname for the paired ends may be identical.  This is not a problem, so long as the right flags are set to distinguish the records and your code is set up to expects this.

-Kevin
Reply all
Reply to author
Forward
0 new messages