perl regular expression

vivien

Sep 12, 2013, 10:34:02 PM
to unix-and-perl-...@googlegroups.com
Hello. I am new to Perl programming but working on it. Currently I am working with a gene family in a particular plant. I wish to identify sequences with the following motifs: QLRDLAYDAEDCLDEFxYxx and KxWLxELRELAYDAEDCIDEF, where x is any nucleotide from a plant genome. Is it possible to design a regular expression using those patterns and then use it to extract other sequences with the same motifs from a genome database? Can somebody help formulate a regular expression using those motifs? Thanks.

Steve Mount

Sep 13, 2013, 6:17:14 AM
to unix-and-perl-...@googlegroups.com
Vivien,

Those are amino acid sequences. You can indeed design a Perl script to find such occurrences in protein FASTA sequences, and I think that what you learn in the course (http://korflab.ucdavis.edu/unix_and_Perl/) is probably sufficient for that task.
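
As a rough illustration of the kind of script described above, here is a minimal sketch that turns each motif into a regular expression and scans protein FASTA records for exact matches. It is written in Python rather than Perl, but the pattern text itself is ordinary character-class syntax that Perl's regex engine also accepts. The function names and the choice to let x match only the 20 standard amino acids are assumptions made for this example, not something stated in the thread.

```python
import re

# Motifs from the question; lowercase x marks a position where any residue is allowed.
MOTIFS = ["QLRDLAYDAEDCLDEFxYxx", "KxWLxELRELAYDAEDCIDEF"]

# Assumption: x matches the 20 standard amino acids only (no B, Z, X or stop symbols).
ANY_RESIDUE = "[ACDEFGHIKLMNPQRSTVWY]"


def motif_to_regex(motif):
    """Compile a motif into a regex, replacing each x with a residue class."""
    return re.compile(motif.replace("x", ANY_RESIDUE))


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def find_motifs(lines):
    """Return (header, motif, 1-based start, matched text) for every exact hit."""
    patterns = [(motif, motif_to_regex(motif)) for motif in MOTIFS]
    hits = []
    for header, seq in read_fasta(lines):
        for motif, pattern in patterns:
            for match in pattern.finditer(seq):
                hits.append((header, motif, match.start() + 1, match.group()))
    return hits


if __name__ == "__main__":
    # Made-up records, only to show the call; real input would be an open FASTA file.
    demo = [">seq1 made-up example", "MAAQLRDLAYDAEDCLDEFAYGG", "TT", ">seq2", "MKKK"]
    for hit in find_motifs(demo):
        print(*hit, sep="\t")
```

Because the sequence lines of each record are joined before matching, a motif that spans a line break in the FASTA file is still found. A search like this reports exact matches only, so a single substitution at a fixed position is missed.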

However, you really want to use more appropriate methods for this (you don't want to miss a case because of a single conservative amino acid change).
My recommendations would be blastp first, then tblastn against plant genomes, and HMMER (http://hmmer.janelia.org/). You will find many genes with imperfect matches that I suspect are of interest to you, and you can easily scan the BLAST output for the few perfect matches that the regular expression search would have yielded.

Steve Mount





--
Stephen M. Mount
Center for Bioinformatics and Computational Biology
Dept. of Cell Biology and Molecular Genetics
2208 H. J. Patterson Hall
University of Maryland
College Park, MD    20742-5815

Phone      301-405-6934
alt. ph.     301-405-9904
URL        www.SteveMount.org

RENATO AUGUSTO CORREA DOS SANTOS

Sep 14, 2013, 8:33:43 AM
to unix-and-perl-...@googlegroups.com
Dear both,

With HMMER, one usually builds an HMM from an alignment of proteins and then uses it to search a protein database. However, other tools are being developed that aim to search nucleotide sequences with such protein alignments, using similar approaches.

Best,

Renato A. C. Santos
Renato Augusto Corrêa dos Santos (Undergraduate Researcher)
Bioethanol Science and Technology Laboratory (ABTLuS-CNPEM)
+55 19 9722 5665

chebii vivien

Sep 14, 2013, 12:16:18 PM
to unix-and-perl-...@googlegroups.com
Thanks for the suggestion

Vivien Jepchirchir Chebii,
Msc Bioinformatics,
Fujian Agriculture and Forestry University,
China.
Skype:Vivienne c chichi2
Phone:+8615705900460

Steve Mount

Sep 14, 2013, 9:42:47 PM
to unix-and-perl-...@googlegroups.com
Renato,

You're right that being able to search nucleotide sequences with an hmm would be useful.  
This is what is on the hmmer site (the last sentence deals with searching nucleotide sequences):

The following HMMER algorithms/programs are supported by this server:
  • phmmer - used to search one or more query protein sequences against a protein sequence database.
  • hmmscan - search protein sequences against collections of profiles, e.g. Pfam. In HMMER2 this was called hmmpfam.
  • hmmsearch - used to search one or more profiles against a protein sequence database.
  • jackhmmer - iteratively search a query protein sequence, multiple sequence alignment or profile HMM against the target protein sequence database.
Longer term, we will support nhmmer that will enable searches with nucleotide sequences. This software has been released as part of the HMMER software package (version 3.1).

Vivien,

You will note that phmmer does allow searching with a single protein sequence. Also, tblastn allows you to search nucleotide sequences with a protein query (using blast rather than hmmer). 
Getting back to your original question, whatever you use, you would be able to use regular expressions to look for exact matches in the output.
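
One way to pick the perfect matches out of the search output is sketched below. It assumes the search was run with BLAST+ tabular output (-outfmt 6) and its default column order, and it treats a hit as perfect when the alignment covers the whole query with no mismatches and no gap openings. The function name and the query_length parameter are made up for this example.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def perfect_hits(handle, query_length):
    """Yield rows whose alignment spans the full query with no mismatches or gaps."""
    reader = csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t")
    for row in reader:
        if row["qseqid"].startswith("#"):  # comment lines, as written by -outfmt 7
            continue
        full_length = int(row["qstart"]) == 1 and int(row["qend"]) == query_length
        if full_length and int(row["mismatch"]) == 0 and int(row["gapopen"]) == 0:
            yield row


if __name__ == "__main__":
    # Made-up rows, only to show the call; real input would be an open output file.
    demo = io.StringIO(
        "q1\tsubjA\t100.00\t20\t0\t0\t1\t20\t5\t24\t1e-10\t45.1\n"
        "q1\tsubjB\t95.00\t20\t1\t0\t1\t20\t7\t26\t1e-08\t40.0\n"
    )
    for row in perfect_hits(demo, query_length=20):
        print(row["sseqid"], row["sstart"], row["send"])
```

Filtering on the numeric columns is used here instead of a regular expression over the text because the tabular format already exposes mismatch and gap counts directly.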

Steve



chebii vivien

Sep 14, 2013, 10:07:37 PM
to unix-and-perl-...@googlegroups.com
Thanks a lot for your contributions. For the problem in question,
though, some people have already done this using BLAST and also HMM
profiles obtained from Pfam (they used Pfam HMM profiles to scan
various protein databases to identify this gene family). What they
got tends to contain many false positives, hence I am looking for a
different, more stringent approach. I was thinking I could build HMM
profiles with the HMMER package from the manually curated sequences
of the gene family in question, of which there are about 12, and then
validate by RBH (reciprocal best hit). I am not sure it will actually
make sense, though, because I understand that a quality multiple
sequence alignment is fundamental for generating quality profiles,
and with 12 closely related genes I am afraid I may not get good
results.
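
For the RBH validation step, the logic can be sketched as follows, assuming RBH means reciprocal best hit: a candidate is kept only if its best hit among the curated reference sequences points back to it when the search is run in the other direction. The input format, (query, subject, bit score) tuples, and the function names are assumptions for this example; the tuples could be taken from any search tool's output.

```python
def best_hits(rows):
    """Map each query to its highest-scoring subject; rows are (query, subject, score)."""
    best = {}
    for query, subject, score in rows:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward_rows, reverse_rows):
    """Return sorted (a, b) pairs where b is a's best hit and a is b's best hit."""
    forward = best_hits(forward_rows)
    reverse = best_hits(reverse_rows)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


if __name__ == "__main__":
    # Made-up identifiers and scores, only to show the call.
    candidates_vs_reference = [("cand1", "ref1", 200.0), ("cand1", "ref2", 150.0),
                               ("cand2", "ref1", 180.0)]
    reference_vs_candidates = [("ref1", "cand1", 210.0), ("ref2", "cand3", 90.0)]
    print(reciprocal_best_hits(candidates_vs_reference, reference_vs_candidates))
```

Ties are resolved in favour of the first row seen, which is a simplification; with closely related family members it may be worth inspecting near-equal scores by hand.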


RENATO AUGUSTO CORREA DOS SANTOS

Sep 15, 2013, 9:38:34 AM
to unix-and-perl-...@googlegroups.com
Dear Steve,
Thank you so much.
I have not yet worked with nucleotide searches (only with protein alignments against a protein database).
I actually want to understand the process. Is it possible to search a nucleotide database with a protein alignment?
I also agree with Vivien: the better the alignment, the better the results! :-)
Best regards, thank you so much for your reply.
Renato.

Steve Mount

Sep 15, 2013, 11:12:02 AM
to unix-and-perl-...@googlegroups.com
Renato,
I don't think that there is any way to search a nucleotide database with a protein "alignment" (actually, a PSSM or HMM derived from an alignment). However, you can search a nucleotide database with a single protein sequence (tblastn) and you can, of course, search a protein database with a profile (using PSI-BLAST or HMMER).
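
To make the idea of searching nucleotide sequence with a protein-level query concrete, here is a minimal sketch that translates a DNA sequence in all six reading frames and applies a motif regular expression to each translation. This is not what tblastn does internally (tblastn scores inexact alignments); it is an exact-match stand-in that uses the standard genetic code, and all function names are invented for this example.

```python
import itertools
import re

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement; characters other than ACGT are left unchanged."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; codons with ambiguous bases become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Yield (strand, frame number 1-3, protein) for all six reading frames."""
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            yield strand, offset + 1, translate(seq[offset:])


def scan(dna, pattern):
    """Return (strand, frame, 1-based protein position, matched text) for each hit."""
    return [(strand, frame, match.start() + 1, match.group())
            for strand, frame, protein in six_frames(dna)
            for match in pattern.finditer(protein)]


if __name__ == "__main__":
    # Short made-up pattern and sequence; a full motif would be compiled the same way.
    pattern = re.compile("DEF[ACDEFGHIKLMNPQRSTVWY]Y")
    print(scan("AGATGAATTTGCTTAT", pattern))
```

Positions are reported in protein coordinates within each frame. A motif interrupted by an intron in genomic DNA will not be found this way, which is one more reason an alignment-based search is the safer first pass.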
Steve Mount

RENATO AUGUSTO CORREA DOS SANTOS

Sep 15, 2013, 2:17:28 PM
to unix-and-perl-...@googlegroups.com
Sure, thanks.
I would say "profile", not alignment.
Best,
Renato.