Hi Lynette. I found some more info that may be relevant.
"The procedure to create a suitable FASTA file for database searching
involves the following:
The nr_prot.tar.gz file is downloaded from the Swiss Institute of
Bioinformatics by public FTP
(ftp://ftp.ch.embnet.org/pub/databases/nr_prot/). This file is produced
by the Computational Genomics Group of the Ludwig Institute for Cancer
Research. The compressed collection includes a readme file (see
Appendix 1) as well as a STATS file (see Appendix 2).
1) The individual .seq files are reformatted, ensuring consistent header
   lines for each protein sequence entry as well as unique accession
   numbers for all entries. The NCBI Tax_Id number as well as organism
   name (for backward compatibility) is included in the header line.
2) The individual *.seq files are then grouped according to database
   source (e.g. ENSEMBL or plasmodb).
3) The LudwigNR database is then created by concatenating the grouped
   databases in the following order: Swiss-Prot, Swiss-Prot variant
   splice isoforms, Trembl, worm, yeast, ensembl, plasmo, gemmata,
   sludge and user-specific.
4) Final sanity checks are performed to ensure that all accession
   numbers are unique across all sequence entries.
Appendix 1: (Readme file)
The non-identical protein database is constructed from UniProt
(SwissProt, SwissProt splice variants, TrEMBL, TrEMBL splice variants),
yeastpep, wormpep, EnsEMBL peptides (http://www.ensembl.org/), SLUDGE,
Gemmata, and the PlasmoDB (http://plasmodb.org/) database.
Please note that duplicate proteins are removed only within the same
taxon, based on NCBI's TaxID number.
The method is to produce non-redundant subsets of each member of the nr
set, using a CRC method. The CRC method means we assign a checksum value
to each protein sequence, then keep only one protein sequence per unique
checksum value and per taxonomic ID (as assigned by the NCBI).
The probability of two different protein sequences having the same
checksum is extremely small (but not zero). We start with SwissProt,
from which we remove all the duplicates, and this produces swiss_nr.seq.
We then take TrEMBL and remove all duplicates plus the sequences already
found in swiss_nr.seq; this produces trembl_nr.seq. The process is
repeated for all the following members. Below is an example output of
this method, which also gives the order in which the databases are
processed, i.e., we keep (almost) all of SwissProt, then take entries
from TrEMBL, etc."
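
In case it helps to see the readme's CRC method concretely, here is a rough Python sketch of how I read it. This is only my interpretation, not the Ludwig group's actual code: the function names (crc_key, build_nr), the tuple layout for entries, and the use of zlib.crc32 as the checksum are all my own assumptions, since the readme does not say which CRC variant is used.

```python
import zlib


def crc_key(tax_id, sequence):
    # Pair the NCBI taxonomic ID with a checksum of the sequence.
    # zlib.crc32 is an assumed stand-in; the readme does not name
    # the CRC variant that is actually used.
    return (tax_id, zlib.crc32(sequence.encode("ascii")))


def build_nr(databases):
    """Build non-redundant subsets in priority order.

    databases: list of (name, entries) pairs, highest priority first.
    entries:   list of (accession, tax_id, sequence) tuples
               (a hypothetical layout chosen for this sketch).
    Returns a dict mapping each database name to the entries it keeps.
    """
    seen = set()   # (tax_id, checksum) keys kept so far, across all databases
    result = {}
    for name, entries in databases:
        kept = []
        for accession, tax_id, sequence in entries:
            key = crc_key(tax_id, sequence)
            if key in seen:
                # Same checksum within the same taxon: treat as a duplicate.
                continue
            seen.add(key)
            kept.append((accession, tax_id, sequence))
        result[name] = kept
    return result
```

Because the "seen" set is shared across databases, an entry survives only if no earlier database (or earlier entry in the same database) already contributed the same checksum for the same taxon. That matches the readme's description of keeping almost all of SwissProt first and then taking only the new entries from TrEMBL, and so on. Identical sequences from different taxa are both kept, since the taxonomic ID is part of the key.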