Database parsing problems


Matt Padula

Jun 18, 2009, 7:34:30 PM
to PEAKS_forum
Hello. I am trying to use Peaks Studio 5.0 to search data with a
custom database comprising UniProt (sp, sp_vs and tr), PlasmoDB and
Ensembl entries (approx 38 species) plus a few "user" sequences.
However when I start the protein ID search, Peaks seems to stall and
the progress percentage does not increase. I've tried numerous parsing
rules that I know work for this database in Scaffold and the progress
percentage varies depending on the rules, but never goes to
completion. I have even tried the parsing rules for NCBInr without
success. In addition, Peaks does not tell me why it has stopped, which
is extremely annoying.

The parsing rules I have tried are:
Accession number: >[^\|]*\|([^\|]*)
Description: >(.*)

Accession number: >(?:(?:(?:sp)|(?:tr)|(?:user)|(?:sludge))\|)?([^ |]*)
Description: >[^ ]* (.*)

Lynette

Jun 19, 2009, 10:26:19 AM
to PEAKS_forum
Dear Matt,

If you would like to email me your database (or just a couple of
header lines from your database from all the different entries) I can
try to help you out with the parsing rules in Peaks 5.0.

Cheers,
Lynette

Matt Padula

Jun 21, 2009, 7:18:05 PM
to PEAKS_forum
The database was prepared by APCF (http://www.apcf.edu.au/) and is
3.5GB, so Word and other programs on my desktop won't open it. The
first set of parsing rules in my first post was written by APCF for
Scaffold, so I would have thought they would be compatible. I find it
strange that one set of rules results in progress to 66% and the
second set of rules only gets to 30%.

Could I upload the database to you somehow?
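
Since the file is too large to open in an editor, a few header lines per database source could be pulled out by streaming it line by line. The following is a minimal sketch, not a tested tool: it assumes each header starts with `>` followed by a source tag and a `|`, as the first parsing rule implies. The function name `sample_headers` and the demo headers are made up for illustration.

```python
from collections import defaultdict


def sample_headers(lines, per_tag=2):
    """Collect up to `per_tag` FASTA header lines for each source tag.

    The tag is taken to be the text between '>' and the first '|'
    (an assumption about this database's header layout).
    """
    samples = defaultdict(list)
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        if "|" in line:
            tag = line[1:].split("|", 1)[0].strip()
        else:
            tag = "(no tag)"
        if len(samples[tag]) < per_tag:
            samples[tag].append(line.rstrip("\n"))
    return dict(samples)


if __name__ == "__main__":
    # Hypothetical in-memory stand-in for the real file.
    demo = [">sp|P00001|EXAMPLE_ONE desc\n", "MKVL\n", ">tr|Q00001|EXAMPLE_TWO desc\n"]
    for tag, headers in sample_headers(demo).items():
        print(tag, headers)
```

For the real database, an open file handle can be passed in place of the list (for example `open(path, encoding="latin-1")`, where `path` is wherever the FASTA file lives), so the 3.5GB file is never loaded into memory at once.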

Matt Padula

Jun 22, 2009, 12:15:26 AM
to PEAKS_forum
Hi Lynette. I found some more info that may be relevant.

"The procedure to create a suitable FASTA file for database searching
involves the following:

The nr_prot.tar.gz file is downloaded from the Swiss Institute of Bioinformatics by public FTP (ftp://ftp.ch.embnet.org/pub/databases/nr_prot/). This file is produced by the Computational Genomics Group of the Ludwig Institute for Cancer Research. The compressed collection includes a readme file (see Appendix 1) as well as STATS file (see Appendix 2).

1) The individual .seq files are reformatted ensuring consistent header lines for each protein sequence entry as well as unique accession numbers for all entries. The NCBI Tax_Id number as well as organism name (for backward compatibility) is included in the header line.
2) The individual *.seq files are then grouped according to database source (e.g. ENSEMBL or plasmodb).
3) The LudwigNR database is then created by concatenating the grouped databases in the following order: Swiss-Prot, Swiss-Prot variant splice isoforms, Trembl, worm, yeast, ensembl, plasmo, gemmata, sludge and user-specific.
4) Final sanity checks are performed to ensure that all accession numbers are unique across all sequence entries.

Appendix 1: (Readme file)

The non-identical protein database is constructed from UniProt (SwissProt, SwissProt splice variants, TrEMBL, TrEMBL splice variants), yeastpep, wormpep, EnsEMBL peptides (http://www.ensembl.org/), SLUDGE, Gemmata, and the PlasmoDB (http://plasmodb.org/) database.

Please note that duplicate proteins are removed only within the same taxa, based on NCBI's TaxID number.

The method is to produce non-redundant subsets of each member of the nr set, using a CRC method. The CRC method means we assign a checksum value to each protein sequence, then keep only one protein sequence per unique checksum value and per taxonomic ID (as assigned by the NCBI).

The probability of two different protein sequences having the same checksum is extremely small (but not null). We start with SwissPROT, from which we remove all the duplicates, and this produces swiss_nr.seq. We then take TrEMBL and remove all duplicates plus the sequences already found in swiss_nr.seq; this produces trembl_nr.seq. And the process is repeated for all the following members. Below is an example output of this method, which also gives the order in which the databases are processed, i.e., we keep all (almost) of SwissProt, then take entries from TrEMBL, etc."
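
The checksum-based de-duplication described in the readme can be sketched roughly as follows. This is only an illustration of the idea: the readme does not say which CRC variant is used, so `zlib.crc32` is an assumption, and the function name `dedupe_by_checksum` and the tuple layout are made up for the example.

```python
import zlib


def dedupe_by_checksum(entries, seen=None):
    """Keep the first entry for each (tax_id, checksum) pair.

    entries: iterable of (header, tax_id, sequence) tuples.
    seen:    set of keys kept from earlier databases, so later
             databases drop sequences already present (optional).
    Returns (kept_entries, updated_seen).
    """
    if seen is None:
        seen = set()
    kept = []
    for header, tax_id, sequence in entries:
        # CRC-32 is an assumed stand-in for the unspecified "CRC method".
        key = (tax_id, zlib.crc32(sequence.encode("ascii")))
        if key in seen:
            continue  # same checksum within the same taxon: treated as duplicate
        seen.add(key)
        kept.append((header, tax_id, sequence))
    return kept, seen
```

Passing the `seen` set returned for SwissProt into the call for TrEMBL mirrors the ordering in the readme, where each later database loses the sequences already kept from earlier ones. Because the key includes the taxonomic ID, identical sequences from different taxa are all retained.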

Matt Padula

Jun 22, 2009, 12:16:55 AM
to PEAKS_forum
And a little more info:

"The resulting files are in FASTA format, with a header of the form:
>DBtag|AccessionNb|OtherId (GeneName)description[species]
But there are some exceptions:
- for TrEMBL, OtherId is the EMBL accession number.
- most SwissProt and many TrEMBL entries have an associated gene name, but not all. The other databases don't have it, and sometimes their description line starts with something in ()...
- yeastpep entries do not have an OtherId entry at all.
- the splice variants entries are a special case: the OtherId is the accession number of the original entry, and the AccessionNb is made unique for each splice variant."
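
To see what the two rule sets from the first post would capture on headers of this form, they can be tried on a couple of sample lines. This is a rough check under stated assumptions: both headers below are invented to fit the documented layout (the `yeast` tag and all identifiers are hypothetical), and Python's `re.match` is used, which may not behave exactly like the matching PEAKS or Scaffold apply internally.

```python
import re

# (accession pattern, description pattern) for each rule set from the first post.
RULES = {
    "rule set 1": (r">[^\|]*\|([^\|]*)", r">(.*)"),
    "rule set 2": (
        r">(?:(?:(?:sp)|(?:tr)|(?:user)|(?:sludge))\|)?([^ |]*)",
        r">[^ ]* (.*)",
    ),
}

# Invented headers following >DBtag|AccessionNb|OtherId (GeneName)description[species]
HEADER_SP = ">sp|P12345|ABC_HUMAN (GENE1)Example protein[Homo sapiens]"
# Invented yeastpep-style header with no OtherId field.
HEADER_YEAST = ">yeast|YAL001C (TFC3)Example factor[Saccharomyces cerevisiae]"


def apply_rules(header, rules):
    """Return (accession, description) captured by one rule set, or None if no match."""
    acc_pattern, desc_pattern = rules
    acc = re.match(acc_pattern, header)
    desc = re.match(desc_pattern, header)
    return (acc.group(1) if acc else None, desc.group(1) if desc else None)


if __name__ == "__main__":
    for name, rules in RULES.items():
        for header in (HEADER_SP, HEADER_YEAST):
            print(name, apply_rules(header, rules))
```

On these invented headers both rule sets return `P12345` for the SwissProt-style line. For the header without an OtherId, rule set 1 captures everything after the first `|` as the accession, while rule set 2 captures only the tag `yeast`, because its optional prefix covers just `sp`, `tr`, `user` and `sludge`. Whether either behaviour is related to the stalled search is not established here; running the same check on real header lines from the database would show what each rule actually extracts.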