0.3 release tag setting broken

24 views
Skip to first unread message

James Casbon

unread,
Sep 8, 2010, 12:08:32 PM9/8/10
to Pysam User group
pysam expects tags need to be two chars long, but the standard tags
have one char. This means you cannot update the tags in a file
without destroying the 'Z' and 'C' ones...

Example code:

for entry in inp:
print entry.tags
entry.tags = entry.tags

traceback:

[('C', 13), ('Z', '51^T13^A23^A13^AAAA8^T28^T5^A2^T19^T1')]
Traceback (most recent call last):
File "/tmp/annotate_bam.py", line 50, in <module>
sys.exit(app.main())
File "/home/james/simple/lib/python2.6/site-packages/cmdln.py", line
257, in main
return self.cmd(args)
File "/home/james/simple/lib/python2.6/site-packages/cmdln.py", line
280, in cmd
retval = self.onecmd(argv)
File "/home/james/simple/lib/python2.6/site-packages/cmdln.py", line
412, in onecmd
return self._dispatch_cmd(handler, argv)
File "/home/james/simple/lib/python2.6/site-packages/cmdln.py", line
1100, in _dispatch_cmd
return handler(argv[0], opts, *args)
File "/tmp/annotate_bam.py", line 42, in do_mark
entry.tags = entry.tags
File "csamtools.pyx", line 1599, in
csamtools.AlignedRead.tags.__set__ (pysam/csamtools.c:14442)
IndexError: string index out of range

James Casbon

unread,
Sep 9, 2010, 4:45:22 AM9/9/10
to Pysam User group
Ah, its the tags that are coming back wrong, they should be:
NM:i:13 MD:Z:51^T13^A23^A13^AAAA8^T28^T5^A2^T19^T1

James Casbon

unread,
Sep 9, 2010, 5:00:46 AM9/9/10
to Pysam User group
Fixed, here's the diff to 0.3-rel

diff -r df9522142af8 pysam/csamtools.pyx
--- a/pysam/csamtools.pyx Wed Aug 04 10:32:03 2010 +0100
+++ b/pysam/csamtools.pyx Thu Sep 09 09:00:08 2010 +0000
@@ -1511,9 +1511,6 @@
s += 2

# convert type - is there a better way?
- ctag[0] = s[0]
- ctag[1] = 0
- pytype = ctag
# get type and value
# how do I do char literal comparison in cython?
# the code below works (i.e, is C comparison)


On Sep 8, 5:08 pm, James Casbon <cas...@gmail.com> wrote:

Andreas

unread,
Sep 10, 2010, 8:11:24 AM9/10/10
to Pysam User group
Hi James,

thanks for the bug report and the fix. However, I can't seem
to be able to reproduce the problem. Could you send me
the exact line in "inp" that causes the problem?

Thanks,
Andreas

James Casbon

unread,
Sep 10, 2010, 8:21:02 AM9/10/10
to pysam-us...@googlegroups.com


Well all the lines caused the problem, but e.g:
GJP00TM04CAQ5W 0 2 38297693 60
45H51M1D13M1D12M1D9M2D5M1D7M4D2M1I6M1D28M1D5M1D2M1D18M55H *
0 0
CATGAAGAACCGCTGGGTATGGAGCACACCTCACCTGATGGACAGTTGATTATGCTCACCTTAACGCTAATTGAGAGCAGCACAAGAGGACTGGAAACTAGAATTTACTCCTCATCTCCGAAGATGTGAATATTCTAAATTCAGCTTGCCTCTTGCTTC
IID7757111/=;?///:D>777;EEGAAAEEIHHIIIIIIIIIIIIIIBBBIIIIH==<<<DDGEEE;<<<A><<<DEDDA>>>D?1112544556::03---//25.22=;DD?;;;>BDDDEEEGGGA<888<BAA888<GGGGGEB?9::DD551
NM:i:15 MD:Z:51^T13^A12^A9^AA5^A7^AAAA8^T28^T5^A2^T18 RG:Z:GJP00TM04

You should see the problem if you retrieve any optional tag using
entry.tags since the code was returning the type rather than the tag.

ie. for the tag RG:Z:GJP00TM04 you should be getting ('RG',
'GJP00TM04') but I actually got ('Z', 'GJP00TM04').

Kevin Jacobs <jacobs@bioinformed.com>

unread,
Sep 10, 2010, 8:33:02 AM9/10/10
to pysam-us...@googlegroups.com
On Fri, Sep 10, 2010 at 8:21 AM, James Casbon <cas...@gmail.com> wrote:
On 10 September 2010 13:11, Andreas <andrea...@gmail.com> wrote:
> Hi James,
>
> thanks for the bug report and the fix. However, I can't seem
> to be able to reproduce the problem. Could you send me
> the exact line in "inp" that causes the problem?


I also have time to look at this, if the solution isn't already complete.

-Kevin


Andreas

unread,
Sep 10, 2010, 8:55:45 AM9/10/10
to Pysam User group
Hi James,

still no success. Using the following code

samfile=pysam.Samfile( "ex8.bam","rb" )

for entry in samfile:
before = entry.tags
entry.tags = entry.tags
after = entry.tags
print after
self.assertEqual( after, before )

I get the following output:

['NM', 15), ('MD', '51^T13^A12^A9^AA5^A7^AAAA8^T28^T5^A2^T18'), ('RG',
'GJP00TM04')]

Looks clean to me. This is ex8.sam, from which ex8.bam is created.

@HD VN:
1.0
@SQ SN:2 LN:
48297693
GJP00TM04CAQ5W 0 2 38297693 60
45H51M1D13M1D12M1D9M2D5M1D7M4D2M1I6M1D28M1D5M1D2M1D18M55H
* 0 0
CATGAAGAACCGCTGGGTATGGAGCACACCTCACCTGATGGACAGTTGATTATGCTCACCTTAACGCTAATTGAGAGCAGCACAAGAGGACTGGAAACTAGAATTTACTCCTCATCTCCGAAGATGTGAATATTCTAAATTCAGCTTGCCTCTTGCTTC
IID7757111/
=;?///:D>777;EEGAAAEEIHHIIIIIIIIIIIIIIBBBIIIIH==<<<DDGEEE;<<<A><<<DEDDA>>>D?
1112544556::03---//25.22=;DD?;;;>BDDDEEEGGGA<888<BAA888<GGGGGEB?
9::DD551 NM:i:15 MD:Z:51^T13^A12^A9^AA5^A7^AAAA8^T28^T5^A2^T18
RG:Z:GJP00TM04


On Sep 10, 1:21 pm, James Casbon <cas...@gmail.com> wrote:

Andreas

unread,
Sep 10, 2010, 8:57:04 AM9/10/10
to Pysam User group
Hi Kevin,

thanks for the offer. I am working on it, but can't verify the problem
yet.
Might need your help if I can't get it to work.

Best wishes,
Andreas

On Sep 10, 1:33 pm, "Kevin Jacobs <jac...@bioinformed.com>"
<bioinfor...@gmail.com> wrote:
> On Fri, Sep 10, 2010 at 8:21 AM, James Casbon <cas...@gmail.com> wrote:

Kevin Jacobs <jacobs@bioinformed.com>

unread,
Sep 10, 2010, 9:09:20 AM9/10/10
to pysam-us...@googlegroups.com
On Fri, Sep 10, 2010 at 8:57 AM, Andreas <andrea...@gmail.com> wrote:
Hi Kevin,

thanks for the offer. I am working on it, but can't verify the problem
yet.
Might need your help if I can't get it to work.


I'm reworking the tags code in general, since it doesn't respect signed vs. unsigned conversions and the type conversions can be simplified (and not require aux functions at all).  Let me know if you'd like me to perform a similar reworking to the __set__ code, since it may flush out this bug and possibly others.

This rendition currently passes the test suite:

    property tags:
        def __get__(self):
            cdef bam1_t * src
            cdef char *s, *stop
            cdef char tpe

            src = self._delegate
            if src.l_aux == 0: return None

            s = <char*>bam1_aux(src)

            stop = <char*>(src.data + src.data_len)
            result = []
            while s < stop:
                tag = s[:2]
                tpe = s[2]
                s += 3

                if tpe == 'A':
                    value = s[:1]
                    s += 1
                elif tpe == 'C':
                    value = (<uint8_t*>s)[0]
                    s += 1
                elif tpe == 'c':
                    value = (<int8_t*>s)[0]
                    s += 1
                elif tpe == 'S':
                    value = (<uint16_t*>s)[0]
                    s += 2
                elif tpe == 's':
                    value = (<int16_t*>s)[0]
                    s += 2
                elif tpe == 'I':
                    value = (<uint32_t*>s)[0]
                    s += 4
                elif tpe == 'i':
                    value = (<int32_t*>s)[0]
                    s += 4
                elif tpe == 'f':
                    value = (<float*>s)[0]
                    s += 4
                elif tpe == 'd':
                    value = (<double*>s)[0]
                    s += 8
                elif tpe == 'Z' or tpe == 'H':
                    value = <char*>s
                    s += len(value)+1

                result.append( (tag, value) )

            return result

Andreas

unread,
Sep 10, 2010, 9:21:14 AM9/10/10
to Pysam User group
Hi Kevin,

thanks, you are much further than I am. Please continue and I will
stop.

The tags issue is also related to issue 40 - losing type information
for tags.
I think it is nice to have transparent type conversion between python
- samtools.
However, to fix issue 40, "tags" probably needs to be encapsulated to
preserve the original types if they are ambiguous - or "set" needs to
go
through existing tags to see which tags have not changed.

What do you think?

Thanks,
Andreas

On Sep 10, 2:09 pm, "Kevin Jacobs <jac...@bioinformed.com>"
<bioinfor...@gmail.com> wrote:

Marcel Martin

unread,
Sep 10, 2010, 9:23:22 AM9/10/10
to pysam-us...@googlegroups.com
On Friday 10. September 2010 15:09:20 Kevin Jacobs <jac...@bioinformed.com>
wrote:

> I'm reworking the tags code in general, since it doesn't respect signed
> vs. unsigned conversions and the type conversions can be simplified (and
> not require aux functions at all). Let me know if you'd like me to perform
> a similar reworking to the __set__ code, since it may flush out this bug
> and possibly others.

Hello Kevin,

the following issue may be related and you may want to take it into account
while reworking that code:
http://code.google.com/p/pysam/issues/detail?id=40

Marcel
--
Dipl.-Inform. Marcel Martin, https://ls11-www.cs.tu-dortmund.de/staff/martin/

Kevin Jacobs <jacobs@bioinformed.com>

unread,
Sep 10, 2010, 9:31:54 AM9/10/10
to pysam-us...@googlegroups.com
On Fri, Sep 10, 2010 at 9:21 AM, Andreas <andrea...@gmail.com> wrote:
thanks, you are much further than I am. Please continue and I will
stop.

The tags issue is also related to issue 40 - losing type information
for tags.
I think it is nice to have transparent type conversion between python
- samtools.
However, to fix issue 40, "tags" probably needs to be encapsulated to
preserve the original types if they are ambiguous - or "set" needs to
go
through existing tags to see which tags have not changed.

What do you think?


I'll have to think about it.  My impulse was to use the tag that results in the smallest data storage size and preserves the distinction between strings, integers and floating point values.  This does imply that overly wide storage tags may be narrowed.  e.g.: NM:i:32 -> NM:C:32.  So smaller BAM files at the expense of muddy types.  

I'll check what samtools and other interfaces do, since that defies the guidance on predefined tag types in the SAM standard.  One possibility to consider is to support optional type specifiers and use a table of the standard tag types to ensure we remain standards compatible.

-Kevin

Marcel Martin

unread,
Sep 10, 2010, 9:32:27 AM9/10/10
to pysam-us...@googlegroups.com
On Friday 10. September 2010 15:21:14 Andreas wrote:
> However, to fix issue 40, "tags" probably needs to be encapsulated to
> preserve the original types if they are ambiguous - or "set" needs to
> go
> through existing tags to see which tags have not changed.

The problem seems to be that there's not a one-to-one mapping between SAM
types and Python types, which means that some information must get lost during
conversion. Looking at the SAM spec, only type 'A' (printable character) and
type 'H' (hex string) cannot be represented as Python types.

Maybe one solution is to introduce 'char' and 'hexstring' classes, which would
have to be used if one wants to set a character or hex tag and which would be
returned when retrieving such a tag.

Kevin Jacobs <jacobs@bioinformed.com>

unread,
Sep 10, 2010, 9:49:18 AM9/10/10
to pysam-us...@googlegroups.com
On Fri, Sep 10, 2010 at 9:32 AM, Marcel Martin <marcel...@tu-dortmund.de> wrote:
The problem seems to be that there's not a one-to-one mapping between SAM
types and Python types, which means that some information must get lost during
conversion. Looking at the SAM spec, only type 'A' (printable character) and
type 'H' (hex string) cannot be represented as Python types.

Maybe one solution is to introduce 'char' and 'hexstring' classes, which would
have to be used if one wants to set a character or hex tag and which would be
returned when retrieving such a tag.

Assuming backward compatibility is not a major issue, I am pondering a two possible approaches and am open to other suggestions.  The Python maxim of "explicit is better than implicit" is guiding my thoughts.

1: Simple: Make types explicit by requiring a sequence of 3-tuples: (tag, type_code, value).  Still suffers from awkward updates.

2: Make tags into a true proxy object with a wider interface:

   print read.tags.keys()
   print read.tags.values()  # returns tuples of (type_code, Python value)
   tags = read.tags
   print read.tags['NM']        # dict-like interface returning (type_code,Python value)
   read.tags['NM'] = 1          # implicit type
   read.tags['NM'] = 'Z',1     # explicit type
   assert 'NM' in read.tags  # Simple containment checks
   for key in read.tags:         # key iteration
   for key,tag,value in read.items():   # not quite dict-like, but clear

Thoughts?

-Kevin

James Casbon

unread,
Sep 10, 2010, 9:52:23 AM9/10/10
to pysam-us...@googlegroups.com
On 10 September 2010 13:55, Andreas <andrea...@gmail.com> wrote:
> Hi James,
>
> still no success. Using the following code
>
>        samfile=pysam.Samfile( "ex8.bam","rb" )
>
>        for entry in samfile:
>            before = entry.tags
>            entry.tags = entry.tags
>            after = entry.tags
>            print after
>            self.assertEqual( after, before )
>
> I get the following output:
>
> ['NM', 15), ('MD', '51^T13^A12^A9^AA5^A7^AAAA8^T28^T5^A2^T18'), ('RG',
> 'GJP00TM04')]
>


Which version, this is using 0.3 which I just installed into a new virtualenv:

(pysam)$ pip freeze | grep pysam
pysam==0.3
(pysam)$ cat test.sam
@HD VN:1.0
@SQ SN:2 LN:30000000
GJP00TM04CLM2L 0 2 38297693 60 45H89M1D14M3D2M1I66M56H * 0 0 CATGAAGAACCGCTGGGTATGGAGCACACCTCACCTGATGGACAGTTGATTTATGCTCACCTTAAACGCTAATTGAGAAGCAGCACAAAGAGGAACTGGAAAACTAGAATTTTACTCCTCATCTCCGAAGATGTGAATATTTCTAAAATTTCAGCTTGCCTCTTGCTTCTTA IIIIIIIIIIIIHGGGHIIIIFFFGIIIFFCAADIIIIIFFFFIIIIHIDDDGCG@@>@@C7731147EGGGEEGIHHHHIIIIIF===IIHH;;;FFF@@?=AA4888=:::AA8??AAAAIIIIIIIIIIIHGFDGE@@@?23333?????ADAEGGA=788=DEEFEED NM:i:5 MD:Z:89^A14^AAA68 RG:Z:GJP00TM04
(pysam)$ samtools view -Sb test.sam -o test.bam
[samopen] SAM header is present: 1 sequences.
(pysam)$ python
Python 2.6.2 (release26-maint, Apr 19 2009, 01:58:18)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pysam
>>> sf = pysam.Samfile('test.bam', 'rb')
>>> ar = sf.next()
>>> print ar.tags
[('C', 5), ('Z', '89^A14^AAA68'), ('Z', 'GJP00TM04')]

James Casbon

unread,
Sep 10, 2010, 9:55:24 AM9/10/10
to pysam-us...@googlegroups.com
On 10 September 2010 14:49, Kevin Jacobs <jac...@bioinformed.com>

I would prefer the second type of interface to exist, saves having to
maintain the tag list when you are updating an item inside it.

Marcel Martin

unread,
Sep 10, 2010, 10:16:19 AM9/10/10
to pysam-us...@googlegroups.com
On Friday 10. September 2010 15:49:18 Kevin Jacobs <jac...@bioinformed.com>
wrote:

> 2: Make tags into a true proxy object with a wider interface:
>
> print read.tags.keys()
> print read.tags.values() # returns tuples of (type_code, Python value)
> tags = read.tags
> print read.tags['NM'] # dict-like interface returning
> (type_code,Python value)
> read.tags['NM'] = 1 # implicit type
> read.tags['NM'] = 'Z',1 # explicit type

The type is explicit, but there's an implicit type conversion (1 is converted
to "1"). An explicit conversion such as read.tags['NM'] = str(1) would be
clearer.

> assert 'NM' in read.tags # Simple containment checks
> for key in read.tags: # key iteration
> for key,tag,value in read.items(): # not quite dict-like, but clear
>
> Thoughts?

I like the idea of the proxy object, but I think the way the assignments and
iteration work are still awkward and make the standard case harder to read.

The current syntax is only problematic in two cases, as far as I can see:
* When I want to assign a single-character string to a tag of type string
(Z).
* When hex strings are involved.
The first case could be solved by specifying that assignments of Python
strings to tags always result in string-typed tags. And if one wants to assign
a char, that must be done similar to this:

read.tags['XY'] = pysam.samchar('A')

The second case could be dealt with similarly by specifying that a
pysam.hexstring class must be used. In that way, only those people who need
character tags ever need to be concerned with how to use them.

I'm not quite sure if it's actually the case, but a nice thing of the current
syntax is that one never gets in contact with those SAM-specific type codes
(AifZH). Thery are, as it should be, seamlessly converted to Python types.

Andreas

unread,
Sep 10, 2010, 10:27:02 AM9/10/10
to Pysam User group
Hi James,

thanks for your patience, it works now (i.e., I see the same as you in
a virtualenv installation) -
but it is still a mystery to me why I don't see it in my dev
directory.

Thanks,
Andreas

On Sep 10, 2:52 pm, James Casbon <cas...@gmail.com> wrote:

Kevin Jacobs <jacobs@bioinformed.com>

unread,
Sep 10, 2010, 10:28:37 AM9/10/10
to pysam-us...@googlegroups.com
On Fri, Sep 10, 2010 at 10:16 AM, Marcel Martin <marcel...@tu-dortmund.de> wrote:
I like the idea of the proxy object, but I think the way the assignments and
iteration work are still awkward and make the standard case harder to read.

The current syntax is only problematic in two cases, as far as I can see:
 * When I want to assign a single-character string to a tag of type string
(Z).
 * When hex strings are involved.

I'm worried that there are more problematic cases.  Preserving the value representation (char, str, int, float, hex string) is necessary, but from reading the SAM standard it seems necessary to also preserve the detailed type (signed/unsigned 8/16/32 bit int, float/double, etc.)  samtools supports a broader array of type codes than you're considering, including ACcSsIifdZH.  I also worry that using proxy types will interfere with the use of values, many interfaces won't know what to do with samchar('A')'s, even if they inherit from Python str.

-Kevin

Andreas

unread,
Sep 10, 2010, 10:35:30 AM9/10/10
to Pysam User group
Hi Kevin, James,

yes, the general idea was to not have pysam users exposed to the
encoding of a value within a bamfile.

I like James' suggestion with the explicit type conversions.
It can however, be combined with Kevin's syntax if pysam.samchar is
simply

def samchar(x): return "A", x

However, I believe that read.tags.values and read.tags["NM"] should
only return
python values and no type codes. We can make a concession by adding

read.tags.type("NM") for those wanting to preserve/force an encoding?

Cheers,
Andreas

On Sep 10, 3:16 pm, Marcel Martin <marcel.mar...@tu-dortmund.de>
wrote:

Kevin Jacobs <jacobs@bioinformed.com>

unread,
Sep 10, 2010, 10:48:28 AM9/10/10
to pysam-us...@googlegroups.com
On Fri, Sep 10, 2010 at 10:35 AM, Andreas <andrea...@gmail.com> wrote:
yes, the general idea was to not have pysam users exposed to the
encoding of a value within a bamfile.

I like James' suggestion with the explicit type conversions.
It can however, be combined with Kevin's syntax if pysam.samchar is
simply

def samchar(x): return "A", x

However, I believe that read.tags.values and read.tags["NM"] should
only return
python values and no type codes. We can make a concession by adding

read.tags.type("NM") for those wanting to preserve/force an encoding?


Thanks for the input, Andreas.   Your compromise is an excellent synthesis of the two approaches.  Assuming we go the route of using a proxy object, the syntax could be:

> read.tags['NM'] = 23      # untyped
> read.tags['NM'] = 'Z',23  # typed
> print read.tags['NM']
23
> print list(read.tags)
[('NM', 'Z', 23)]
> print read.tags.get_type('NM')
'Z'

We can also support more of the dict-interface, like .get, .update, .keys, .values, .items, .clear, etc.

With some helpers like samchar that adds the type specifiers, I think this looks to be viable.  If everyone is happy, I'll code this up over the weekend.

-Kevin

Andreas

unread,
Sep 10, 2010, 11:24:53 AM9/10/10
to Pysam User group
Hi James,

it seems that different versions of cython might have been
the culprit.

However, the bug should disappear with Kevin's code.

Thanks,
Andreas

Andreas

unread,
Sep 10, 2010, 11:28:36 AM9/10/10
to Pysam User group
Hi Kevin,

quite happy, however, please only let pysam encroach on your weekend
if you do enjoy it.

My personal preference would be to return

>>> print list(read.tags)
[('NM', 23)]

i.e, without the tag, preserving the pythonesque dictionary->list
type conversion.

Best wishes,
Andreas

James Casbon

unread,
Sep 10, 2010, 11:30:52 AM9/10/10
to pysam-us...@googlegroups.com
Ok thanks, but I think you should apply my patch to 0.3. The pytype
variable is not even used:
http://code.google.com/p/pysam/source/browse/pysam/csamtools.pyx?r=5966fa37911cb5b51ff2558ac2555978adf210f5#1505

Andreas

unread,
Sep 10, 2010, 11:33:01 AM9/10/10
to Pysam User group
right, will do.

Thanks,
Andreas

On Sep 10, 4:30 pm, James Casbon <cas...@gmail.com> wrote:
> Ok thanks, but I think you should apply my patch to 0.3.  The pytype
> variable is not even used:http://code.google.com/p/pysam/source/browse/pysam/csamtools.pyx?r=59...

Kevin Jacobs <jacobs@bioinformed.com>

unread,
Sep 10, 2010, 1:46:49 PM9/10/10
to pysam-us...@googlegroups.com
On Fri, Sep 10, 2010 at 11:30 AM, James Casbon <cas...@gmail.com> wrote:
Ok thanks, but I think you should apply my patch to 0.3.  The pytype
variable is not even used:
http://code.google.com/p/pysam/source/browse/pysam/csamtools.pyx?r=5966fa37911cb5b51ff2558ac2555978adf210f5#1505


I'll have that bug and several others fixed shortly.  Thanks so much for  helping to bring these problems to the forefront.

-Kevin

Kevin Jacobs <jacobs@bioinformed.com>

unread,
Sep 10, 2010, 1:49:14 PM9/10/10
to pysam-us...@googlegroups.com
On Fri, Sep 10, 2010 at 11:28 AM, Andreas <andrea...@gmail.com> wrote:
quite happy, however, please only let pysam encroach on your weekend
if you do enjoy it.


No worries-- we use pysam for so many things at work that it is my pleasure to help.

 
My personal preference would be to return

>>> print list(read.tags)
[('NM', 23)]

i.e, without the tag, preserving the pythonesque dictionary->list
type conversion.


No problem-- I'll start with that and we can explore how best to expose the type information without making it a distraction.

-Kevin

Reply all
Reply to author
Forward
0 new messages