Well all the lines caused the problem, but e.g:
GJP00TM04CAQ5W 0 2 38297693 60
45H51M1D13M1D12M1D9M2D5M1D7M4D2M1I6M1D28M1D5M1D2M1D18M55H *
0 0
CATGAAGAACCGCTGGGTATGGAGCACACCTCACCTGATGGACAGTTGATTATGCTCACCTTAACGCTAATTGAGAGCAGCACAAGAGGACTGGAAACTAGAATTTACTCCTCATCTCCGAAGATGTGAATATTCTAAATTCAGCTTGCCTCTTGCTTC
IID7757111/=;?///:D>777;EEGAAAEEIHHIIIIIIIIIIIIIIBBBIIIIH==<<<DDGEEE;<<<A><<<DEDDA>>>D?1112544556::03---//25.22=;DD?;;;>BDDDEEEGGGA<888<BAA888<GGGGGEB?9::DD551
NM:i:15 MD:Z:51^T13^A12^A9^AA5^A7^AAAA8^T28^T5^A2^T18 RG:Z:GJP00TM04
You should see the problem if you retrieve any optional tag using
entry.tags since the code was returning the type rather than the tag.
ie. for the tag RG:Z:GJP00TM04 you should be getting ('RG',
'GJP00TM04') but I actually got ('Z', 'GJP00TM04').
On 10 September 2010 13:11, Andreas <andrea...@gmail.com> wrote:
> Hi James,
>
> thanks for the bug report and the fix. However, I can't seem
> to be able to reproduce the problem. Could you send me
> the exact line in "inp" that causes the problem?
Hi Kevin,
thanks for the offer. I am working on it, but can't verify the problem
yet.
Might need your help if I can't get it to work.
Hello Kevin,
the following issue may be related and you may want to take it into account
while reworking that code:
http://code.google.com/p/pysam/issues/detail?id=40
Marcel
--
Dipl.-Inform. Marcel Martin, https://ls11-www.cs.tu-dortmund.de/staff/martin/
thanks, you are much further than I am. Please continue and I will
stop.
The tags issue is also related to issue 40 - losing type information
for tags.
I think it is nice to have transparent type conversion between python
- samtools.
However, to fix issue 40, "tags" probably needs to be encapsulated to
preserve the original types if they are ambiguous - or "set" needs to
go
through existing tags to see which tags have not changed.
What do you think?
The problem seems to be that there's not a one-to-one mapping between SAM
types and Python types, which means that some information must get lost during
conversion. Looking at the SAM spec, only type 'A' (printable character) and
type 'H' (hex string) cannot be represented as Python types.
Maybe one solution is to introduce 'char' and 'hexstring' classes, which would
have to be used if one wants to set a character or hex tag and which would be
returned when retrieving such a tag.
The problem seems to be that there's not a one-to-one mapping between SAMtypes and Python types, which means that some information must get lost during
conversion. Looking at the SAM spec, only type 'A' (printable character) and
type 'H' (hex string) cannot be represented as Python types.
Maybe one solution is to introduce 'char' and 'hexstring' classes, which would
have to be used if one wants to set a character or hex tag and which would be
returned when retrieving such a tag.
Which version, this is using 0.3 which I just installed into a new virtualenv:
(pysam)$ pip freeze | grep pysam
pysam==0.3
(pysam)$ cat test.sam
@HD VN:1.0
@SQ SN:2 LN:30000000
GJP00TM04CLM2L 0 2 38297693 60 45H89M1D14M3D2M1I66M56H * 0 0 CATGAAGAACCGCTGGGTATGGAGCACACCTCACCTGATGGACAGTTGATTTATGCTCACCTTAAACGCTAATTGAGAAGCAGCACAAAGAGGAACTGGAAAACTAGAATTTTACTCCTCATCTCCGAAGATGTGAATATTTCTAAAATTTCAGCTTGCCTCTTGCTTCTTA IIIIIIIIIIIIHGGGHIIIIFFFGIIIFFCAADIIIIIFFFFIIIIHIDDDGCG@@>@@C7731147EGGGEEGIHHHHIIIIIF===IIHH;;;FFF@@?=AA4888=:::AA8??AAAAIIIIIIIIIIIHGFDGE@@@?23333?????ADAEGGA=788=DEEFEED NM:i:5 MD:Z:89^A14^AAA68 RG:Z:GJP00TM04
(pysam)$ samtools view -Sb test.sam -o test.bam
[samopen] SAM header is present: 1 sequences.
(pysam)$ python
Python 2.6.2 (release26-maint, Apr 19 2009, 01:58:18)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pysam
>>> sf = pysam.Samfile('test.bam', 'rb')
>>> ar = sf.next()
>>> print ar.tags
[('C', 5), ('Z', '89^A14^AAA68'), ('Z', 'GJP00TM04')]
I would prefer the second type of interface to exist, saves having to
maintain the tag list when you are updating an item inside it.
The type is explicit, but there's an implicit type conversion (1 is converted
to "1"). An explicit conversion such as read.tags['NM'] = str(1) would be
clearer.
> assert 'NM' in read.tags # Simple containment checks
> for key in read.tags: # key iteration
> for key,tag,value in read.items(): # not quite dict-like, but clear
>
> Thoughts?
I like the idea of the proxy object, but I think the way the assignments and
iteration work are still awkward and make the standard case harder to read.
The current syntax is only problematic in two cases, as far as I can see:
* When I want to assign a single-character string to a tag of type string
(Z).
* When hex strings are involved.
The first case could be solved by specifying that assignments of Python
strings to tags always result in string-typed tags. And if one wants to assign
a char, that must be done similar to this:
read.tags['XY'] = pysam.samchar('A')
The second case could be dealt with similarly by specifying that a
pysam.hexstring class must be used. In that way, only those people who need
character tags ever need to be concerned with how to use them.
I'm not quite sure if it's actually the case, but a nice thing of the current
syntax is that one never gets in contact with those SAM-specific type codes
(AifZH). Thery are, as it should be, seamlessly converted to Python types.
I like the idea of the proxy object, but I think the way the assignments and
iteration work are still awkward and make the standard case harder to read.
The current syntax is only problematic in two cases, as far as I can see:
* When I want to assign a single-character string to a tag of type string
(Z).
* When hex strings are involved.
yes, the general idea was to not have pysam users exposed to the
encoding of a value within a bamfile.
I like James' suggestion with the explicit type conversions.
It can however, be combined with Kevin's syntax if pysam.samchar is
simply
def samchar(x): return "A", x
However, I believe that read.tags.values and read.tags["NM"] should
only return
python values and no type codes. We can make a concession by adding
read.tags.type("NM") for those wanting to preserve/force an encoding?
Ok thanks, but I think you should apply my patch to 0.3. The pytype
variable is not even used:
http://code.google.com/p/pysam/source/browse/pysam/csamtools.pyx?r=5966fa37911cb5b51ff2558ac2555978adf210f5#1505
quite happy, however, please only let pysam encroach on your weekend
if you do enjoy it.
My personal preference would be to return
>>> print list(read.tags)
[('NM', 23)]
i.e, without the tag, preserving the pythonesque dictionary->list
type conversion.