Help with get_tag

1,247 views
Skip to first unread message

Jonathan Keats

unread,
Jul 31, 2016, 5:17:38 AM7/31/16
to Pysam User group
I just started using pysam today and have found it quite useful.  I'm building a function to extract features from a SAM file.  Everything seems to work except I keep getting errors with get_tag().  Does anyone have a good example or can point me to how I would pull the MC and MQ tags?  I'm fairly certain its user error but I can't seem to find an example that gets me past this issue.  Thanks in advance

SAM file line as example:









D1HJQQN1:440:C7CU0ACXX:7:1101:1358:62711        97      12      76403087        60      101M    =       76403423        437     TTTCTATCCCATAGGAAACACAGGGCAGATAGTCTTCACCAGACACTGCAGGAAAGCAGGGATAGAGAGGTGGTATTTGTCCTGCACAGACCTACCCCTCC   ??=>AAA?@@A@@A?AAA?A?AA@@ABAAAAAAAABAA@@BBA@A?ABBBA@AABBABBAABABAAAAAA>AA:?ABCBABBCCBCACCBABCBABB@??@   MC:Z:101M       BD:Z:IICJNPLPMLMMIMMKJCJHGHMLIMMMJKHLLLIIIKGLKMJLHGKMNNNMKJCKONNMJKLIMKLKLMLILLLIJCKJNLMNOOIJPMOPOPPROJLJK      MD:Z:101        PG:Z:MarkDuplicates     RG:Z:C7CU0ACXX_7        BI:Z:LLFLOQNPMKNOKNMMLELKJKOMKOOOMNKNNMLKKNJMMOMNKJNONOOMMMFNQPPNLNOLONNNNNPLMPPNNGOMONQQPRMNRQRQTSQSQKPKM      NM:i:0  MQ:i:60 AS:i:101        XS:i:0


Using the function below to produce a pandas data frame:
def bam_to_df(bam, chr = None, start=None, stop = None):
    seq = []
    name = []
    r1_chr = []
    r1_pos = []
    r1_isRead1 = []
    r1_isReversed = []
    r1_cigar = []
    r1_mapq = []
    r2_isReversed = []
    r2_chr = []
    r2_pos = []
    frag_length = []
    for read in bam.fetch(chr, start, stop):
        seq.append(read.query_sequence)
        name.append(read.query_name)
        r1_chr.append(read.reference_name)
        r1_pos.append(read.reference_start)
        r1_isRead1.append(read.is_read1)
        r1_isReversed.append(read.is_reverse)
        r1_cigar.append(read.cigarstring)
        r1_mapq.append(read.mapping_quality)
        r2_isReversed.append(read.mate_is_reverse)
        r2_chr.append(read.next_reference_name)
        r2_pos.append(read.next_reference_start)
        frag_length.append(read.template_length)
    return pd.DataFrame({'seq': seq, 
                         'name': name, 
                         'r1_pos': r1_pos, 
                         'r1_chr': r1_chr, 
                         'r1_isRead1': r1_isRead1, 
                         'r1_isReversed': r1_isReversed, 
                         'r1_cigar': r1_cigar, 
                         'r1_mapq': r1_mapq, 
                         'r2_isReversed': r2_isReversed, 
                         'r2_chr': r2_chr, 
                         'r2_pos': r2_pos, 
                         'frag_length': frag_length})


Andreas

unread,
Aug 1, 2016, 11:40:04 AM8/1/16
to Pysam User group
Hi Jonathan,

to obtain a tag, use:

read.get_tag("MC")

To get all tags, use:

read.get_tags()


There are some examples in the tests directory, see for example AlignedSement_test.py

Note that you might want to look at the pandas.DataFrame.from_records method to create your dataframe:

df = pd.DataFrame.from_records((x.query_name, x.reference_pos for x in bam.fetch(chr, start, stop),
                                columns
=["name", "r1_pos"])



Best wishes,
Andreas

Jonathan Keats

unread,
Aug 1, 2016, 4:56:46 PM8/1/16
to Pysam User group
Thanks Andreas,

I get the following error message, could it be if the MC tag is not on every line? 
To extract the value I used in the code block I noted in the initial post and maybe this is the issue, I'll try the from_records method but I'm trying to avoid reworking a bunch of code.
r2_mc.append(read.get_tag("MC"))

KeyError                                  Traceback (most recent call last)
<ipython-input-29-49624e58c7e0> in <module>()
----> 1 table = bam_to_df(samfile)

<ipython-input-27-6ee043f48f1c> in bam_to_df(bam, chr, start, stop)
     26         r2_pos.append(read.next_reference_start)
     27         frag_length.append(read.template_length)
---> 28         r2_mc.append(read.get_tag("MC"))
     29     return pd.DataFrame({'seq': seq, 
     30                          'name': name,

pysam/calignedsegment.pyx in pysam.calignedsegment.AlignedSegment.get_tag (pysam/calignedsegment.c:21307)()

pysam/calignedsegment.pyx in pysam.calignedsegment.AlignedSegment.get_tag (pysam/calignedsegment.c:20683)()

KeyError: "tag 'MC' not present"

Florian Finkernagel

unread,
Aug 2, 2016, 2:45:21 AM8/2/16
to Pysam User group
Dear Jonathan,


KeyError: "tag 'MC' not present"


as the documentation reads: 
Raises:

KeyError – If tag is not present, a KeyError is raised.


 You will therefore need to wrap the get_tag call into a try: except block if not all of your reads have an 'MC' tag.

More elegantly, you'd wrap the try except into a function returning a default value,
like this.
def robust_get_tag(read, tag_name, default_value):  
 
try:  
   
return read.get_tag(tag_name)
 
except KeyError:
   
return default_value


So long,
Florian

Jonathan Keats

unread,
Aug 2, 2016, 10:57:47 AM8/2/16
to Pysam User group
Thanks,

that makes sense and answers my suspicion.  Also thanks for the code suggestion I appreciate the help.

Jonathan
Reply all
Reply to author
Forward
0 new messages