incomplete job


yaximik

Jan 15, 2014, 2:14:00 PM
to biop...@googlegroups.com
Martin,

Here is a line of code from my cleaning script:

# remove CL73 adaptor from 3'-ends of read 2 (shown as complement
# to its 3'->5' sequence in read 2)
find_adaptor -l 3 -r GGAAGAGCGTCGTGTAGGGAAAGAGTGT |
clip_adaptor |

The script executes fine. However, when I aligned the cleaned reads to a reference, I found the following (image attached). The top two lines still show the presence of the CL73 adaptor end (shown in bold and underlined in the code), which was supposed to be removed.
Is anything wrong with my code, or what may be the reason for the incomplete job? Should I just re-run the script on the cleaned data?

Vladimir
CH3Ask2mtDNA (Reads) without duplicates.jpg

Martin Asser Hansen

Jan 15, 2014, 2:20:26 PM
to biop...@googlegroups.com
Try making a test with a single read containing a partial adaptor. Remember that -r is 3'-5' and that you should use -R for a 5'-3' adaptor.

Cheers,


Martin
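
A minimal sketch of how such a test set could be built, written in Python rather than with biopieces itself. It appends adaptor prefixes of decreasing length to a fixed insert. The function name `make_test_records`, the poly-A insert, the N tail and the quality characters are illustrative assumptions; the adaptor is the CL73 sequence quoted in the first message.

```python
def make_test_records(insert, adaptor, prefix_lengths, tail="N" * 10,
                      good_qual="h", tail_qual="B"):
    """Build four-line FASTQ records: insert + adaptor prefix + tail."""
    records = []
    for k in prefix_lengths:
        seq = insert + adaptor[:k] + tail
        # One quality character per base; the tail gets a lower score.
        qual = good_qual * (len(insert) + k) + tail_qual * len(tail)
        records.append("\n".join([f"@test {k}nt adaptor", seq, "+", qual]))
    return records


# CL73 adaptor as given in the first message of this thread.
CL73 = "GGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# Hypothetical insert of 40 A's; full adaptor first, then shorter prefixes.
records = make_test_records("A" * 40, CL73, [len(CL73), 20, 10, 7, 6, 5])
print("\n".join(records))
```

Writing the printed records to a file gives an input where the expected clipping point of every read is known in advance.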



yaximik

Feb 3, 2014, 10:19:39 PM
to biop...@googlegroups.com, ma...@maasha.dk
Martin,

I am at a complete loss. I created a test file and tried to clean it. I observed the same behavior, but I cannot explain why my script removes only part of some adaptors. Here is the script:
#!/bin/sh

perl -pne 'if (($. % 4) == 0) {tr/[_-~]/_/}' | # change all scores in the range "_"(95) to "~"(126) to "_"(95)
read_fastq -e base_33 -i - |         # Read in raw FASTQ data from stdin
mask_seq |                           # soft mask quality < 20 in lowercase
trim_seq -m 30 -l 4 |                # trim Q<30 until min good stretch > 4
grab -e 'SEQ_LEN >= 35' |            # Filter out reads < 35 nt
# remove CL78 adaptor from read 1
find_adaptor -L 6 -R AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC |
clip_adaptor |
grab -e 'SEQ_LEN >= 35' |            # Filter out reads < 35 nt
# remove CL73 adaptor from read 2
find_adaptor -L 6 -R GGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATC |
clip_adaptor |
grab -e 'SEQ_LEN >= 35' |            # Filter out reads < 35 nt
mean_scores |                        # Calculate mean quality score
grab -e "SCORES_MEAN >= 30" |        # Filter reads with score mean < 30
mean_scores -l |                     # Calculate local mean quality score
grab -e "SCORES_MEAN_LOCAL >= 20" |  # Filter reads with local score mean < 20
write_fastq -e base_33 -x            # Write raw FASTQ data to stdout
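
For readers unfamiliar with the perl one-liner at the top of the script, the quality-capping step can be sketched in Python as follows. This is an illustration of the intent stated in the comment, not a drop-in replacement: the function names are hypothetical, four-line FASTQ records are assumed, and the example record is shortened. One difference is worth knowing: Perl's `tr` treats square brackets as literal characters, so the one-liner also maps `[` and `]` to `_`, whereas the sketch only touches characters above `_`.

```python
def cap_quality_line(qual, cap="_"):
    """Replace every quality character above `cap` (ASCII order) with `cap`."""
    return "".join(cap if ch > cap else ch for ch in qual)


def cap_fastq(lines, cap="_"):
    """Cap the quality line (every 4th line) of four-line FASTQ records."""
    out = []
    for number, line in enumerate(lines, start=1):
        # Lines 4, 8, 12, ... hold the quality strings.
        out.append(cap_quality_line(line, cap) if number % 4 == 0 else line)
    return out


# Shortened, hypothetical record in the style of the test file.
record = ["@read 1 5nt CL78", "TTTTAGATCNN", "+", "hhhhhhhhhBB"]
capped = cap_fastq(record)
print(capped[3])  # 'h' (104) is above '_' (95) and is capped; 'B' (66) is kept
```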

Here is the test file:

@read 1 full CL78
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNNN
+
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhBBBBBBBBBB
@read 1 half CL78
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAGATCGGAAGAGCACACGTCNNNNNNNNNN
+
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhBBBBBBBBBB
@read 1 10nt CL78
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAGATCGGAAGNNNNNNNNNN
+
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhBBBBBBBBBB
@read 1 8nt CL78
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAGATCGGANNNNNNNNNN
+
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhBBBBBBBBBB
@read 1 7nt CL78
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAGATCGGNNNNNNNNNN
+
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhBBBBBBBBBB
@read 1 6nt CL78
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAGATCGNNNNNNNNNN
+
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhBBBBBBBBBB
@read 1 5nt CL78
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAGATCNNNNNNNNNN
+
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhBBBBBBBBBB
@read 2 33nt CL73
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCNNNNNNNNNNNN
+
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhBBBBBBBBBBBB
@read 2 20 nt CL73
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGAGCGTCGTGTAGGGANNNNNNNNNNNN
+
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhBBBBBBBBBBBB
@read 2 10 nt CL73
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGAGCGTNNNNNNNNNNNN
+
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhBBBBBBBBBBBB
@read 2 7 nt CL73
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGAGNNNNNNNNNNNN
+
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhBBBBBBBBBBBB
@read 2 6 nt CL73
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGANNNNNNNNNNNN
+
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhBBBBBBBBBBBB
@read 2 5 nt CL73
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGNNNNNNNNNNNN
+
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhBBBBBBBBBBBB


Here is the cleaning result

@read 1 full CL78
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAGATCGGAAGAGCACACGTCTGAACTCC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read 1 half CL78
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAGATCGGAAGAGCAC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read 1 10nt CL78
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAGATC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read 1 8nt CL78
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read 1 7nt CL78
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read 1 6nt CL78
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read 1 5nt CL78
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read 2 33nt CL73
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGAGCGTCGTGTAGGGAAAGAGT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read 2 20 nt CL73
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGAGCGTC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read 2 10 nt CL73
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read 2 7 nt CL73
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read 2 6 nt CL73
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read 2 5 nt CL73
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII


As you can see, it completely removes only the shorter (partial) adaptors; for the longer adaptors it trims just 6 nt from their 3' ends, presumably along with the NNNN tails.
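
The leftover adaptor can also be measured rather than read off by eye. The following Python sketch reports the longest adaptor prefix that a cleaned read still ends with. The function name `residual_adaptor_len` is hypothetical, and the inserts are shortened to 10 T's for brevity; the adaptor remnants are copied from the read 1 results above.

```python
def residual_adaptor_len(read, adaptor):
    """Length of the longest adaptor prefix that the read still ends with."""
    for k in range(min(len(read), len(adaptor)), 0, -1):
        if read.endswith(adaptor[:k]):
            return k
    return 0


CL78 = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"

# Cleaned read 1 sequences from the result above, with the poly-T insert
# shortened to 10 nt (an assumption made only to keep the example short).
cleaned = {
    "read 1 full CL78": "TTTTTTTTTT" + "AGATCGGAAGAGCACACGTCTGAACTCC",
    "read 1 half CL78": "TTTTTTTTTT" + "AGATCGGAAGAGCAC",
    "read 1 7nt CL78": "TTTTTTTTTT",
}
residuals = {name: residual_adaptor_len(seq, CL78) for name, seq in cleaned.items()}
for name, left in residuals.items():
    print(f"{name}: {left} of {len(CL78)} adaptor nt remain")
```

A non-zero value for a read means the clipping step left adaptor sequence behind.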

Please advise - what could be wrong?

Martin Asser Hansen

Feb 4, 2014, 11:09:51 AM
to biop...@googlegroups.com
Check this out. I saved your test data to test.fq and did this:

maasha@mel:~$ read_fastq -n 1 -e base_33 -i test.fq | find_adaptor -L 6 -R AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC  | clip_adaptor
SEQ_NAME: read 1 full CL78
SEQ: TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAGATCGGAAGAGCACACGTCTGAACTCC
SEQ_LEN: 67
SCORES: IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
ADAPTOR_POS_RIGHT: 67
ADAPTOR_LEN_RIGHT: 16
ADAPTOR_PAT_RIGHT: AGTCACNNNNNNNNNN
---
maasha@mel:~$ read_fastq -n 1 -e base_33 -i test.fq | find_adaptor -L 6 -r AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC  | clip_adaptor
SEQ_NAME: read 1 full CL78
SEQ: TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
SEQ_LEN: 37
SCORES: IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
ADAPTOR_POS_RIGHT: 37
ADAPTOR_LEN_RIGHT: 36
ADAPTOR_PAT_RIGHT: TTAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
---

Notice the difference between -r and -R
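
To put numbers on the two runs, the record fields can be compared against the known layout of the test read. The Python sketch below is only an illustration: `parse_records` is a hypothetical helper that reads the "KEY: value" lines and "---" separators shown above, and the insert length of 39 is inferred from the second run (37 + 36 = 73 = 39 + 34, the adaptor being 34 nt long).

```python
def parse_records(text):
    """Parse 'KEY: value' lines into dicts; a '---' line ends each record."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if line == "---":
            if current:
                records.append(current)
            current = {}
        elif ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return records


# Fields copied from the two runs above (sequence and score lines omitted).
output = """\
SEQ_NAME: read 1 full CL78
SEQ_LEN: 67
ADAPTOR_POS_RIGHT: 67
ADAPTOR_LEN_RIGHT: 16
---
SEQ_NAME: read 1 full CL78
SEQ_LEN: 37
ADAPTOR_POS_RIGHT: 37
ADAPTOR_LEN_RIGHT: 36
---
"""

insert_len = 39  # inferred number of T's before the adaptor in the test read
recs = parse_records(output)
for option, rec in zip(("-R", "-r"), recs):
    # Positive: adaptor bases left behind; negative: clipped into the insert.
    leftover = int(rec["SEQ_LEN"]) - insert_len
    print(option, "clipped at", rec["ADAPTOR_POS_RIGHT"], "leftover:", leftover)
```

With -R, 28 adaptor bases survive the clip; with -r, the whole adaptor is gone and the clip starts 2 nt inside the insert.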

And do make sure the different steps in your cleaning script work before putting it all together :o)



Cheers,



Martin