regex to match pattern allowing errors?

k.kim

Sep 12, 2013, 7:49:57 AM

to NGS...@googlegroups.com

Does anyone know of a regex that matches patterns while allowing errors?
I'm trying to find adapter sequences within sequence reads, and there might be sequencing errors in the adapter sequence. It would also be useful to find adapter sequences at the ends of reads that contain only the first part of the adapter.
cheers,

Claudius Kerth

Sep 12, 2013, 5:05:00 PM

Hi Kang-Wook,

my suggestions for fuzzy (regex) matching are:

==============

[1] agrep/TRE:

==============

On Ubuntu:

$ sudo apt-get install tre-agrep

$ man tre-agrep

$ tre-agrep --help

Like every good Unix command-line tool, 'tre-agrep' works line by line. So the first task is to get the four lines of each fastq record onto one line.

This can be done with a simple perl command line, for instance:

$ perl -ne ' $seq=<>; $q_head=<>; $q_str=<>; chomp($_, $seq, $q_head, $q_str); print "$_ $seq $q_head $q_str\n"; ' input_file.fq | less -S

This reads in four lines (i.e. one fastq record) at a time, removes the trailing newline characters, then prints them on one line separated by spaces.
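For anyone who prefers to do this step outside Perl, here is a minimal Python sketch of the same idea. The function names `read_fastq` and `linearize` and the example reads are made up for illustration, and it assumes strict four-line fastq records (no wrapped sequence lines). It joins the fields with a tab rather than a space, on the assumption that headers may themselves contain spaces.

```python
import io

def read_fastq(handle):
    """Yield (header, sequence, plus_line, quality) from a 4-line fastq stream."""
    while True:
        header = handle.readline()
        if not header:  # end of input
            return
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        # strip only the trailing newline of each of the four lines
        yield tuple(s.rstrip("\n") for s in (header, seq, plus, qual))

def linearize(record, sep="\t"):
    """Put the four fields of one record on a single line (hypothetical helper)."""
    return sep.join(record)

# Made-up example data: two records, the first containing the adapter start.
example = io.StringIO(
    "@read1 1:N:0\nACGTAGATCGGAAG\n+\nIIIIIIIIIIIIII\n"
    "@read2 1:N:0\nACGTACGTACGTAC\n+\nIIIIIIIIIIIIII\n"
)
records = list(read_fastq(example))
lines = [linearize(r) for r in records]
for line in lines:
    print(line)
```

Splitting each output line on the tab gives back the original four fields, so the transformation is reversible even when the header contains a space.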

This can then be piped into tre-agrep:

$ perl -ne ' $seq=<>; $q_head=<>; $q_str=<>; chomp($_, $seq, $q_head, $q_str); print "$_ $seq $q_head $q_str\n"; ' input_file.fq | tre-agrep -2 "AGATCGGAAG" | less -S

This will extract all lines in which the quoted search pattern is found with up to 2 mismatches (actually "edits", i.e. substitutions, insertions and deletions). The search string contains the first 10 bp common to the Illumina single-end and paired-end adapters. Invert the match with '-v' to get the reads without adapter instead. To print the match cost and highlight the matched part, also check out:
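To make the meaning of "edits" concrete, here is a small Python sketch of approximate substring matching by edit distance. It is not how tre-agrep is implemented; it is the textbook dynamic-programming approach, and the name `min_edits` and the example reads are made up for illustration.

```python
def min_edits(pattern, text):
    """Smallest edit distance between pattern and any substring of text."""
    # prev[i] = cost of matching pattern[:i] against a substring of the text
    # that ends at the previous text position. With no text consumed yet,
    # pattern[:i] can only be matched by dropping all i characters.
    prev = list(range(len(pattern) + 1))
    best = prev[-1]
    for ch in text:
        curr = [0]  # a match may start anywhere in the text at no cost
        for i, p in enumerate(pattern, start=1):
            curr.append(min(
                prev[i - 1] + (p != ch),  # match or substitution
                prev[i] + 1,              # extra character in the text
                curr[i - 1] + 1,          # pattern character missing from the text
            ))
        best = min(best, curr[-1])
        prev = curr
    return best

adapter = "AGATCGGAAG"
exact = min_edits(adapter, "TTTT" + "AGATCGGAAG" + "TTT")
one_substitution = min_edits(adapter, "TTTT" + "AGATCGCAAG" + "TTT")
one_deletion = min_edits(adapter, "TTTT" + "AGATGGAAG" + "TTT")
print(exact, one_substitution, one_deletion)
```

A read would then be reported when `min_edits(adapter, read) <= 2`, which corresponds to the '-2' option above.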

$ perl -ne ' $seq=<>; $q_head=<>; $q_str=<>; chomp($_, $seq, $q_head, $q_str); print "$_ $seq $q_head $q_str\n"; ' input_file.fq | tre-agrep -s --color -2 "AGATCGGAAG"

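Then transform back into fastq format. Headers from newer Illumina pipelines contain a space, so use a tab as the separator here; otherwise 'tr' would split the header as well:

$ perl -ne '$seq=<>; $q_head=<>; $q_str=<>; chomp($_, $seq, $q_head, $q_str); print "$_\t$seq\t$q_head\t$q_str\n";' input_file.fq | tre-agrep -2 "AGATCGGAAG" | tr '\t' '\n' | less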

more info: http://laurikari.net/tre/about/

===============================================================

[2] this Perl module: http://search.cpan.org/~jhi/String-Approx-3.26/Approx.pm

===============================================================

One disadvantage is that this module does not recognize regular expression syntax:

"Notice that the pattern is a string. Not a regular expression. None of the regular expression notations (^, ., *, and so on) work."

But for your purpose it should work fine.

I have attached a file (hope that works, tell me if not) that either slides a window over each read sequence in a fastq file and compares it to an adapter sequence, or uses the module String::Approx.

The 'sliding_window' mode determines the number of mismatches between the search string and the sequence from each window and does not print the current record to STDOUT if the number of mismatches is below a specified number. Type:

$ ./remove_adapter_reads.pl -h

for more detail, and also have a look inside the script. Since indel artifacts are very rare on Illumina sequencers, counting mismatches alone should catch almost all adapter occurrences; how sensitive the programme is depends on the fuzziness you specify.
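The sliding-window idea can be sketched in a few lines of Python. This is not the code from 'remove_adapter_reads.pl'; the names `count_mismatches` and `find_adapter`, the `min_overlap` parameter and the way the mismatch budget is scaled down for short windows are all assumptions made for illustration. The sketch also lets the window run off the 3' end of the read, so that a read ending in only the first part of the adapter is found too.

```python
def count_mismatches(a, b):
    """Number of differing positions, compared up to the shorter length."""
    return sum(x != y for x, y in zip(a, b))

def find_adapter(read, adapter, max_mismatches=2, min_overlap=5):
    """Return the 0-based start of the first adapter hit in read, or -1.

    Windows that run off the 3' end of the read are compared against the
    start of the adapter only, with a proportionally smaller mismatch budget.
    """
    m = len(adapter)
    for start in range(len(read) - min_overlap + 1):
        window = read[start:start + m]  # shorter than m near the 3' end
        allowed = max_mismatches * len(window) // m
        if count_mismatches(window, adapter) <= allowed:
            return start
    return -1

adapter = "AGATCGGAAG"
full_hit = find_adapter("ACGTACGT" + "AGATCGGAAG" + "TT", adapter)
fuzzy_hit = find_adapter("ACGTACGT" + "AGATCGCAAG" + "TT", adapter)
partial_hit = find_adapter("ACGTACGTACGT" + "AGATCGG", adapter)
no_hit = find_adapter("ACGTACGTACGTACGTACGT", adapter)
print(full_hit, fuzzy_hit, partial_hit, no_hit)
```

To filter, a record would be dropped (or the read trimmed at the returned position) whenever the result is not -1. Lowering `min_overlap` finds shorter adapter remnants at the cost of more chance matches.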

Feel free to experiment with this file and improve it. The following will download my git repo "scripts_for_RAD" and check out the *experimental* branch, which contains the "experimental" version of 'remove_adapter_reads.pl':

$ git clone https://github.com/claudiuskerth/scripts_for_RAD.git

$ cd scripts_for_RAD

$ git fetch origin experimental

$ git checkout experimental

============================

[3] vmatch: http://www.vmatch.de/

============================

It seems like this programme can do almost anything related to string matching, but I haven't had time yet to try it out.

hope that helps,

claudius

remove_adapter_reads.pl

Kang-Wook Kim

Sep 13, 2013, 5:06:43 AM

to NGS...@googlegroups.com

Good stuff Claudius, thanks!!
I also made a script that deals with adapter sequences at the end of reads as well as in the middle, in both fasta and fastq format. I should be able to extend it to fuzzy matching using your code. I will upload it when it's done.
cheers,

Kang-Wook

--
You received this message because you are subscribed to the Google Groups "NGS Group APS Sheffield" group.

-- 
Dr Kang-Wook Kim
Postdoctoral Research Associate
Department of Animal and Plant Sciences
University of Sheffield
Western Bank
Sheffield
S10 2TN
United Kingdom

Phone: +44 (0)114 222 0112
e-mail: k....@sheffield.ac.uk
