read bam files to create feature tables

11 views

Skip to first unread message

ullo...@googlemail.com

unread,

Aug 9, 2024, 5:41:45 AM8/9/24

to NGLess

Dear all,

for a bigger project, I have saved the GMGs mapping results as bam file. I now want to skip the mapping, read in the bam files and count and collect the results with different annotations. However, I am not really sure how to accomplish this task. Can anyone share the snippets with me? This is what I've started with:

cat GMGCbin.ngl

ngless "1.5"

local import "gmgc" version "1.0"

import "parallel" version "1.1"

import "samtools" version "1.0"

samples=load_sample_list('gmgc_bam.txt')

input = samfile(samples)

current=input.name()

##GMGC

##"How counts are adjusted in the presence of multiple annotations is defined by the multiple argument. Generally, for obtaining gene abundances, distribution of multiple mappers is the best (using multiple={dist1}), while for functional annotations, you want to count them all (using multiple={all1}). This implies that the functional annotations will sum to a higher value than the number of reads. This may seem strange at first, but it is the intended behaviour."

#gmgc_mapped = map (non_human_reads, reference='gmgc:no-rare',mode_all=True,block_size_megabases=10000)

#gmgc_mapped = map (non_human_reads, reference='gmgc',mode_all=True,block_size_megabases=10000)

#gmgc_mapped = map (input, reference='gmgc',mode_all=True)

#gmgc_mapped_post = select(gmgc_mapped) using |mr|:

# mr = mr.filter(min_match_size=45, min_identity_pc=95, action={drop})

# if not mr.flag({mapped}):

# discard

#write(mapstats(gmgc_mapped_post),ofile=current+'gmgc.full.stats.txt')

#write(gmgc_mapped_post,ofile=current+'gmgc.bam')

#Predicted_taxonomic_group

gmgc_counts_new = count(gmgc_mapped_post,

features=['seqname'],

multiple={all1},

normalization={raw},

discard_zeros=True)

collect(gmgc_counts_new,

current=current,

ofile='gmgc'</>'Metacardis_gmgc_genemap.all1.raw.txt')

head gmgc_bam.txt

gmgc/M0x10MCx1418filtered_qc.fq.gzgmgc.bam

gmgc/M0x10MCx1472filtered_qc.fq.gzgmgc.bam

gmgc/M0x10MCx1857filtered_qc.fq.gzgmgc.bam

gmgc/M0x10MCx1985filtered_qc.fq.gzgmgc.bam

gmgc/M0x10MCx2066filtered_qc.fq.gzgmgc.bam

gmgc/M0x10MCx2112filtered_qc.fq.gzgmgc.bam

gmgc/M0x10MCx2152filtered_qc.fq.gzgmgc.bam

Thanks in advance!

Best wishes,

Ulrike

Luis Pedro Coelho

unread,

Aug 13, 2024, 2:06:04 AM8/13/24

to NGLess List

Hi Ulrike et al.,

I now want to skip the mapping, read in the bam files and count and collect the results with different annotations.

This should be possible and reuse the compute, yes.

However, I am not really sure how to accomplish this task. Can anyone share the snippets with me? This is what I've started with:
cat GMGCbin.ngl
ngless "1.5"
local import "gmgc" version "1.0"
import "parallel" version "1.1"
import "samtools" version "1.0"

samples=load_sample_list('gmgc_bam.txt')

I thinks this should be readlines('gmgc_bam.txt') because that file seems like just a single

input = samfile(samples)
current=input.name()

maybe

current = lock1(samples)

gmgc_mapped_post = samfile(current)

Also, using the name gmgc_mapped_post so that the following line works:

gmgc_counts_new = count(gmgc_mapped_post,
features=['seqname'],
multiple={all1},
normalization={raw},
discard_zeros=True)

collect(gmgc_counts_new,
current=current,
ofile='gmgc'</>'Metacardis_gmgc_genemap.all1.raw.txt')

This should now work, but this resulting matrix may be huge because the `collect` function will try to fill in all the missing zeros. Maybe:

write(gmgc_counts_new, ofile='gmgc_out_'+current)

HTH, Luis

Reply all

Reply to author

Forward

0 new messages