FGENESH+ New GeneFinder with Similarity usage

Victor Solovyev

Aug 2, 1999, 3:00:00 AM
We have installed FGENESH+, a new HMM-based gene-finding program
for predicting multiple genes in genomic DNA using information
from a similar protein.
The Web version of FGENESH+ is set up to analyse Human, Drosophila,
Nematode and Plant sequences (and genes of closely related organisms).

The program can be used if you know a protein sequence similar to
the protein encoded by the gene in your sequence.
First run any ab initio gene-finding program, such as FGENES or FGENESH.
Then run a BLASTP database search with each predicted exon. Any correctly
predicted exon can lead you to a known similar protein
(if such a protein exists in the database). Take this protein and run FGENESH+.
The accuracy of gene prediction can reach up to 100%, depending on how similar
the predicted protein and the database protein are.

Ab initio gene prediction programs usually predict a significant
portion of the exons in a gene correctly, but they often fail to predict
the whole gene structure correctly: they merge several genes into one,
split one gene into several, or miss or overpredict exons.
Using similarity information provided by one or several correctly predicted
exons, we can significantly improve the accuracy of gene finding.

You should provide a similarity value (known from the BLAST search).
It affects the prediction, because a very low similarity value
permits the predicted gene to encode a protein that deviates more from
the known similar protein.
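
The step of choosing a similar protein and a similarity value from the BLAST
search can be sketched in Python. This is only an illustration, not part of
FGENESH+. It assumes the BLASTP results for the predicted exons have been
saved as 12-column tab-separated text (the layout BLAST+ writes with
-outfmt 6), that the hit with the highest bit score is the one to use, and
that its percent identity is a reasonable similarity value. The query and
subject names in the example are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    query: str               # name of the predicted exon used as query
    subject: str             # database protein identifier
    percent_identity: float  # column 3 of the tabular output
    bit_score: float         # column 12 of the tabular output


def parse_tabular_hits(text: str) -> List[Hit]:
    """Parse 12-column tab-separated BLAST hits, skipping comments and short lines."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            continue
        hits.append(Hit(fields[0], fields[1], float(fields[2]), float(fields[11])))
    return hits


def best_hit(hits: List[Hit]) -> Optional[Hit]:
    """Return the hit with the highest bit score, or None if there are no hits."""
    return max(hits, key=lambda h: h.bit_score, default=None)


# Hypothetical results for two predicted exons (all values are made up).
rows = [
    ["exon1", "protA", "90.0", "35", "3", "0", "1", "35", "1", "35", "1e-15", "70.1"],
    ["exon2", "protA", "88.1", "59", "7", "0", "1", "59", "37", "95", "1e-30", "118.0"],
    ["exon2", "protB", "41.0", "50", "29", "1", "3", "52", "10", "59", "0.002", "31.2"],
]
sample = "\n".join("\t".join(r) for r in rows)

best = best_hit(parse_tabular_hits(sample))
if best is not None:
    print(best.subject, round(best.percent_identity))
```

The protein named by the best hit is the one to paste into the protein
window, and the rounded percent identity is one possible choice for the
similarity value.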

To use the Human-specific version, select the Human button and the fgenesh button.
To use another organism-specific version, select Drosophila, Nematode or Plant, plus fgenesh.

Paste your sequence into the first window, or load your file with the
nucleotide sequence in FASTA format.

Paste your protein sequence into the second window.
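
Before pasting, it can help to check that the nucleotide input really is a
single FASTA record. The following Python sketch is an illustration only; the
function name and the accepted alphabet (A, C, G, T and N) are assumptions,
and the server may accept other symbols.

```python
from typing import Tuple


def read_single_fasta(text: str, alphabet: str = "ACGTN") -> Tuple[str, str]:
    """Return (header, sequence) for one FASTA record, or raise ValueError."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA input must start with a '>' header line")
    if any(ln.startswith(">") for ln in lines[1:]):
        raise ValueError("expected exactly one FASTA record")
    header = lines[0][1:].strip()
    sequence = "".join(lines[1:]).upper()
    if not sequence:
        raise ValueError("FASTA record has no sequence")
    unexpected = sorted(set(sequence) - set(alphabet))
    if unexpected:
        raise ValueError("unexpected symbols: " + "".join(unexpected))
    return header, sequence


header, sequence = read_single_fasta(">seq1 test\nacgt\nNNAC\n")
print(header, len(sequence))
```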

References: Salamov A.A., Solovyev V.V. (1999), unpublished data.
Please reference: CGG WEB server:

Fgenesh+ output:

G - the number of the predicted gene (counted from the sequence start)
Str - DNA strand (+, or - for the complementary strand)
Feature - type of coding sequence:
  CDSf - first coding segment (starting with the start codon);
  CDSi - internal coding segment (internal exon);
  CDSl - last coding segment (ending with the stop codon)
TSS - position of transcription start (TATA-box position and score)

Start and End - positions of the feature
Weight - log likelihood*10 score for the feature (the Score column)
ORF-start/end - positions where the complete codons start and end
The last 3 values: length of exon, positions in protein, % similarity
with the target protein

FGENESH+ Prediction of potential genes in Human genomic DNA
Time: Mon Jul 26 21:38:41 1999
Seq name: Adh_and_cact.1 (2919020 bases) 848501 853000 Protein -
gi|2313041|gnl|PID|d1022564 Length 215 Sim: 90
Length of sequence: 4500 GC content: 40 Zone: 1
Number of predicted genes 1 in +chain 1 in -chain 0
Number of predicted exons 4 in +chain 4 in -chain 0
Positions of predicted genes and exons:
 G Str   Feature   Start        End    Score           ORF           Len

 1 +   1 CDSi       2577 -     2690   197.66      2579 -     2689    111     1 -  35  100
 1 +   2 CDSi       2756 -     2936   312.35      2758 -     2934    177    37 -  95  100
 1 +   3 CDSi       2991 -     3173   307.82      2992 -     3171    180    97 - 156  100
 1 +   4 CDSl       3242 -     3419   301.90      3243 -     3419    177   158 - 215  100
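
The exon records above can be read programmatically. The Python sketch below
is an illustration based only on this sample listing, not an official
FGENESH+ parser; the field names are our own, and the pattern assumes every
record carries all thirteen values described in the legend. Because it
matches across whitespace, it works whether or not the last three values have
wrapped onto a separate line.

```python
import re
from typing import List, NamedTuple


class Exon(NamedTuple):
    gene: int
    strand: str
    number: int
    feature: str
    start: int
    end: int
    score: float
    orf_start: int
    orf_end: int
    length: int
    prot_start: int
    prot_end: int
    similarity: int


# gene, strand, exon number, feature, start - end, score,
# ORF start - end, length, protein start - end, % similarity
EXON_RE = re.compile(
    r"(\d+)\s+([+-])\s+(\d+)\s+(CDS[fil])\s+"
    r"(\d+)\s+-\s+(\d+)\s+(-?\d+\.\d+)\s+"
    r"(\d+)\s+-\s+(\d+)\s+(\d+)\s+"
    r"(\d+)\s+-\s+(\d+)\s+(\d+)"
)


def parse_exons(text: str) -> List[Exon]:
    """Extract every exon record found in a block of output text."""
    exons = []
    for m in EXON_RE.finditer(text):
        g = m.groups()
        exons.append(Exon(int(g[0]), g[1], int(g[2]), g[3], int(g[4]),
                          int(g[5]), float(g[6]), *map(int, g[7:])))
    return exons


# The four records from the sample run, with the wrapped line layout.
sample_output = """
1 + 1 CDSi 2577 - 2690 197.66 2579 - 2689 111
1 - 35 100
1 + 2 CDSi 2756 - 2936 312.35 2758 - 2934 177
37 - 95 100
1 + 3 CDSi 2991 - 3173 307.82 2992 - 3171 180
97 - 156 100
1 + 4 CDSl 3242 - 3419 301.90 3243 - 3419 177
158 - 215 100
"""

exons = parse_exons(sample_output)
total_length = sum(e.length for e in exons)
print(len(exons), total_length, total_length // 3)
```

In this sample each Len value equals ORF end minus ORF start plus 1, and the
four Len values sum to 645 bases, that is 215 codons, which matches the
length of the target protein (215).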

Predicted protein(s):
>FGENESH 1 4 exon (s) 2577 - 3419 217 aa, chain +

Victor Solovyev
The Sanger Centre, Hinxton, Cambridge CB10 1SA, UK
Email: solo...@sanger.ac.uk http://genomic.sanger.ac.uk
Phone: 44-1223-494799 FAX: 44-1223-494919

