Problems with "grab" and find

dario.aka

unread,

Feb 5, 2016, 8:00:20 AM2/5/16

to biopieces

Hello, I'm a master student who is now starting to work with bioinformatics. What I want to do is read a sam file containing many reads and select only the ones that contain a specific SNP and save them in a new sam file. I found here http://seqanswers.com/forums/showthread.php?t=26713 some good info to start with and I've also read the wiki and got basic info on the grab and find_SNPs commands.

However if i try to run something like:

read_sam -i dario.sam |
find_SNPs |
grab -e 'POS == 6535' |
grab -e 'A>G'|
write_sam -o dario2.sam

or even just: read_sam -i file.sam | find_SNPs

i get this error:

/home/dario/Bioinformatic_programs/biopieces/code_ruby/lib/maasha/sam.rb:157:in `align_pair': undefined method `scan' for nil:NilClass (NoMethodError)
    from /home/dario/Bioinformatic_programs/biopieces/code_ruby/lib/maasha/sam.rb:86:in `to_bp'
    from /home/dario/Bioinformatic_programs/biopieces/bp_bin/read_sam:55:in `block (4 levels) in <main>'
    from /home/dario/Bioinformatic_programs/biopieces/code_ruby/lib/maasha/sam.rb:213:in `block in each'
    from /home/dario/Bioinformatic_programs/biopieces/code_ruby/lib/maasha/sam.rb:195:in `each_line'
    from /home/dario/Bioinformatic_programs/biopieces/code_ruby/lib/maasha/sam.rb:195:in `each'
    from /home/dario/Bioinformatic_programs/biopieces/bp_bin/read_sam:54:in `block (3 levels) in <main>'
    from /home/dario/Bioinformatic_programs/biopieces/code_ruby/lib/maasha/filesys.rb:76:in `open'
    from /home/dario/Bioinformatic_programs/biopieces/bp_bin/read_sam:53:in `block (2 levels) in <main>'
    from /home/dario/Bioinformatic_programs/biopieces/bp_bin/read_sam:52:in `each'
    from /home/dario/Bioinformatic_programs/biopieces/bp_bin/read_sam:52:in `block in <main>'
    from /home/dario/Bioinformatic_programs/biopieces/code_ruby/lib/maasha/biopieces.rb:89:in `open'
    from /home/dario/Bioinformatic_programs/biopieces/bp_bin/read_sam:46:in `<main>'

Is it just that I have to read more and learn how to write proper lines or I have some problem with ruby? if I run the bp_test everything is fine for what concern ruby (I have the last version installed and I use ubuntu 14.04)

Another thing I'm interested in is to use grab -r. On the wiki I found that i can use grab -r to grab records that contain a specific sequence. Does that mean that I can grab reads with a specific sequence from a fasta or sam or fastaq file that contains thousand of reads?

where can I find more information on how to use this grab function? For example: grab -e 'A>G' where I can find info about 'arguments' for the -e option?

thanks for the help in advanced and sorry for the many questions... I'm try to read and increase my knowledge but without a bioinformatic background it's not easy.

Martin Asser Hansen

unread,

Feb 5, 2016, 1:52:31 PM2/5/16

to biop...@googlegroups.com

Hi there,

So this is a tricky one I am afraid. According to the error message the issue is due to missing MD field in your SAM file - and the MD field is used to recreate the pairwise alignment between the query and subject sequence to call the SNPs. Can you make file.sam available to me and I will have a look.

I also note that you have a syntax error:

grab -e 'A>G'

should be

grab -e 'EVENT eq A>G'

or probably better still

grab -p "A>G" -k EVENT

The grab wiki might be missing some details. The -r option is based on the Ruby regex engine. The -e option is based on Ruby's eval method which basically evaluates a string allowing you to express a calculation as a string and obtain a result that is true or false.

While grab is pretty efficient, it is not well suited for heavy duty sequence searching. Have a look at patscan_seq

Also, have a read of the Howto.

Cheers,

Martin

--
You received this message because you are subscribed to the Google Groups "biopieces" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biopieces+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

dario.aka

unread,

Feb 7, 2016, 6:48:15 AM2/7/16

to biopieces

Hello and thank you very much for your time,

Unfortunately I can't send the file because It's an hospital file from a patience. What I can tell you is that I have the fastq file and what I did is this:

1) bwa index XXX_full.fasta (reference genome)
2)bwa mem XXX_full.fasta XXXX_reads.fastq > XXXX_reads.sam
at this point I have my sam file. Maybe I'm doing something wrong here?

then i would go on because I want to visualize the SNPs on IGV

3)samtools faidx XXX_full.fasta
4)samtools view -b -S -o XXXX_reads.bam XXXX_reads.sam
5)samtools sort XXXX_reads.bam XXXX_reads.sorted
6)samtools index XXXX_reads.sorted.bam
7)samtools mpileup -u -f XXX_full.fasta XXXX_reads.sorte.bam > XXXX_reads.bcf
8)samtools view XXXX_reads.bcf > XXXX_reads.vcf

now I can open the file XXXX_reads.sorted.bam with IGV and see all the statistic for different mutation in each positions so for example i see that in position 6535 I Have in 34 reads an A that become a G. But what I want to do now is look for two particular mutations in two different positions on the same reads. That's why I want to use biopieces to grab all the reads with a particular mutation then from this group of reads grab the ones that have also the second mutation and then visuaize the results on IGV to see if the reads I got really have both the mutations that I'm looking for at the same time.

how come the bwa mem command gives me a sam file that I can't use with biopieces? maybe the problem is the fastq flle?

Martin Asser Hansen

unread,

Feb 7, 2016, 1:27:56 PM2/7/16

to biop...@googlegroups.com

This could be a good exercise in anonymizing patient data so you can send me a snippet of that SAM file... Basically, you are free to replace all sequences with N's and any identifiers - as long as you don't mess with the format of the file.

The SAM format is a terrible abomination for many reasons. One reason is that the different versions of the SAM format are not compatible. The Biopieces parser was build to handle v1.4-r962 - April 17, 2011. What version is produced by your pipeline?

Cheers,

Martin

--

dario.aka

unread,

Feb 7, 2016, 4:20:30 PM2/7/16

to biopieces

Ok you will find the file attached. For what concern the sam version I really don't know, how can check it? I know that the bwa version is 0.7.5a-r405

Example_ccs_reads.sam

Martin Asser Hansen

unread,

Feb 8, 2016, 2:13:32 PM2/8/16

to biop...@googlegroups.com

So there is something funny with the sequence length now:

$ read_sam -i Example_ccs_reads.sam

/Users/maasha/install/src/biopieces/src/ruby/lib/maasha/seq.rb:140:in `initialize': Sequence length and score length mismatch: 629 != 628 (SeqError)

But the file is indeed missing the MD field - which normally holds information about the number of mismatches in the alignment - I suppose this is a feature of BWA-MEM which I have no experience with. Can you perhaps map your sequences with Bowtie2 instead?

Cheers,

Martin

dario.aka

unread,

Feb 8, 2016, 6:00:53 PM2/8/16

to biopieces

I think it might be this version of bwa because I found an older sam file generated by and older version of bwa and it works.
Today I also tried to use bowtie2 for alignment and the sam file that i got works to. However using bowtie2 at the end of the pipeline I found less mutations. To explain myself better, using bwa mem in position 6535 i find 43 A that became G while with bowtie2 I find 35. It was the first time I used bowtie though so I'll read more about the various command and options. Anyway now biopieces read the sam file, I run:

read_sam -i dario.sam |
find_SNPs |
grab -e 'POS == 6535' |

grab -e 'event eq A>G'|
write_sam -o dario2.sam

and the output sam file is empty. So i try to run:

read_sam -i dario.sam |
find_SNPs |
grab -e "REC_TYPE eq SNP" |
sort_records -rk SNP_COUNTn |
write_tab -cx

I get the results and what I see is that the number of SNPs for position is too low and the position is wrong. I get the same results both with the sam file that i got with the older version of bwa and with bowtie2. so i guess that why i can't grab A>G at position 6535, because find_SNPs doesn't report it. If i use:

read_fastq -i XXX.fastq|
write_fasta -O XXX.fasta|
patscan_seq -p 'GCCTGCACC'| # GCCT[G]CACC, [G] is the mutations the sequences next to the mutation are from the reference HCV genome
write_fasta -o patscan_res.fasta

Where the fastq file is the file that i use to get the sam file. I can get some of the reads with the mutations in position 6535. The reason i don't get all of them is because of mutations around the mismatch I'm looking for. one mutation and the pattern is not recognize and the read is not taken (I'm working with HCV which polymerase error rate is very high). So, looking at IGV I know the mutation is there in 43 reads (out of 18000 kind of), i can find it with patscan_seq but i can't with find_SNPs. I think I'm using find_SNPs in the wrong way, I'll read more about it.

For what concern the sequences length, it might be that i've substituted the nucleotides sequences with a different number of Ns?

dario.aka

unread,

Feb 8, 2016, 6:02:00 PM2/8/16

to biopieces

Anyway, again thank you really much for your time and support,

Dario

Martin Asser Hansen

unread,

Feb 9, 2016, 11:15:24 AM2/9/16

to biop...@googlegroups.com

I think it is very sound to be sceptical about SNP finders - that is why I originally wrote my own - which clearly tells me that the results you get are dependent on the short read mapper you use. Different SNP finders use all sorts of statistical models to repair bad read mapping, but I find that careful cleaning and filtering of reads is the way to go.

Cheers,

Martin

On Tue, Feb 9, 2016 at 12:02 AM, dario.aka <dario...@gmail.com> wrote:

Anyway, again thank you really much for your time and support,

Dario

dario.aka

unread,

Feb 10, 2016, 4:48:51 AM2/10/16

to biopieces

Don't get me wrong, the program work. If I type in the terminal: read_sam -i XXX.sam | find_SNPs | grab -e 'POS = 6535' | I will see what i've attached.
Now as you can see the program gives all the SNPs in the different read that have mutation in that position, the problem is that in the reference genome in position 6535 there is an A. So, if i read in the results pos 6535 event T>C, what i get out of it is that the read is not aligned correctly (but i think it is because if I look at the with IGV everithing is aligned correctly to the reference genome) or the program has problem in reading info from my file. Either where the read start or alignment info. I know that the read start number is there i checked the sam file and now i'm using the last version of bwa so it looks like al the info are there.
What i want to try is to have all the read start at one position and end at one position. In this way i should kind of bypass the alignment problem. Then, if it works, I'll have to figure out how to have a sam file with the reads that I want as output. the different reads have a different Q_ID but the same S_ID so i'm confuse about how biopieces can get the reads i want and write them in a sam file.

Again I'm new to bioinformatics, every day I make a new small step...

Find_SNPs out put.png

Martin Asser Hansen

unread,

Feb 12, 2016, 3:36:08 PM2/12/16

to biop...@googlegroups.com

So bioinformatics is all about taking responsibility of your own data.

Try to explicitly check the + strand only:

read_sam -i in.sam |

grab -e 'STRAND eq +' |

find_SNP | ...

Martin

Reply all

Reply to author

Forward

Problems with "grab" and find_SNPs

dario.aka

Martin Asser Hansen

dario.aka

Martin Asser Hansen

dario.aka

Martin Asser Hansen

dario.aka

dario.aka

Martin Asser Hansen

dario.aka

Martin Asser Hansen