Is there a way to save an entire "alignment field/column" into a NumPy array/Pandas data frame?

Evan

Jul 18, 2016, 1:43:44 PM
to Pysam User group
In the SAM format, each alignment line represents the linear alignment of a segment, and each line has 11 mandatory fields, including QNAME, FLAG, RNAME, POS, and MAPQ.

Let's say I wanted a NumPy array of all QNAMEs in a given BAM file. Or one could take several columns and import them into a pandas DataFrame.

Is this functionality possible with pysam? 

One can naturally open a given BAM file with `pysam.AlignmentFile()` and then access individual segments as `pysam.AlignedSegment` objects, e.g.

    import pysam

    seg = pysam.AlignedSegment()
    print(seg.qname)

However, could you save all qnames into a NumPy array?

Florian Finkernagel

Jul 19, 2016, 3:14:37 AM
to Pysam User group
Coincidentally, I just answered this in a Reddit post about pandas DataFrames: https://www.reddit.com/r/bioinformatics/comments/4tgcbr/what_would_be_the_best_way_to_import_all_sambam/

The same thing will work for NumPy arrays:

    import numpy as np

    def bam_to_np(bam, chr=None, start=None, stop=None):
        # Collect one Python list per field, then convert each to an array.
        seq = []
        name = []
        pos = []
        for read in bam.fetch(chr, start, stop):
            seq.append(read.seq)
            name.append(read.qname)
            pos.append(read.pos)  # 0-based leftmost position
        return np.array(seq), np.array(name), np.array(pos)
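For the "several columns into a DataFrame" case from the question, the same loop can be generalised to any set of fields. The following is only a rough sketch: `reads_to_columns` and `FakeRead` are made-up names for illustration, and the stand-in reads exist so the snippet runs without a BAM file. The field names are the attribute names used by `pysam.AlignedSegment` in current releases; the file path in the comments is hypothetical.

```python
from collections import namedtuple

# Attribute names as found on pysam.AlignedSegment (QNAME, FLAG, RNAME, POS, MAPQ).
DEFAULT_FIELDS = ("query_name", "flag", "reference_name",
                  "reference_start", "mapping_quality")

def reads_to_columns(reads, fields=DEFAULT_FIELDS):
    """Collect the given attributes of each read into one list per field."""
    columns = {field: [] for field in fields}
    for read in reads:
        for field in fields:
            columns[field].append(getattr(read, field))
    return columns

# Stand-in reads (hypothetical data) so the sketch runs without pysam.
FakeRead = namedtuple("FakeRead", DEFAULT_FIELDS)
demo_reads = [
    FakeRead("read1", 0, "chr1", 99, 60),
    FakeRead("read2", 16, "chr1", 150, 30),
]
columns = reads_to_columns(demo_reads)
print(columns["query_name"])

# With a real file this would presumably look like (path is hypothetical):
#   import pandas as pd, pysam
#   with pysam.AlignmentFile("example.bam", "rb") as bam:
#       df = pd.DataFrame(reads_to_columns(bam.fetch(until_eof=True)))
```

The resulting dict of lists can be passed to `pd.DataFrame(...)` directly, or each list can be wrapped in `np.array(...)` as in the function above. Building plain lists first and converting once at the end avoids repeatedly growing an array inside the loop.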

So long,
Florian