Holly,
We could use a few more details about the software and data steps you used. It sounds like you may be using Evigene's sra2genes pipeline, which has several steps. The simpler tr2aacds pipeline only reduces your input transcript set to a smaller, relevant coding gene set. tr2aacds handles details such as cd-hit and blastn for you, so you don't need to worry about them. But it doesn't deal with external reference data, such as reference proteins and gene annotations, which sra2genes does.
See here for tr2aacds only:
http://arthropods.eugenes.org/EvidentialGene/about/EvidentialGene_trassembly_pipe.html
which takes your gene assembly fasta sequences as input, and outputs a folder okayset/ of the reduced gene set (transcripts and proteins):
$evigene/scripts/prot/tr2aacds.pl -NCPU $ncpu -MAXMEM $maxmem -log -cdna $trset
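As a minimal sketch of how that command line might be filled in: the install path, the transcript file name, and the CPU and memory values below are all assumptions to replace with your own, and the memory value is assumed to be in megabytes, in line with maxmem=16000 in the sra2genes example. Only the options already shown above are used. The sketch prints the command as a dry run unless the script and the input file are both present.

```shell
# Hypothetical settings; adjust all of these for your system.
evigene="${evigene:-$HOME/bio/evigene}"   # assumed Evigene install folder
ncpu=8                                    # number of CPU cores to use
maxmem=16000                              # memory limit, assumed to be megabytes
trset="mytranscripts.tr"                  # hypothetical transcript fasta file

# Build the argument list with only the options shown in the post.
set -- "$evigene/scripts/prot/tr2aacds.pl" -NCPU "$ncpu" -MAXMEM "$maxmem" -log -cdna "$trset"
cmdline="$*"

# Run only when the script and a non-empty input file exist; otherwise print it.
if [ -x "$1" ] && [ -s "$trset" ]; then
  "$@"
else
  echo "dry run: $cmdline"
fi
```

Once a real run finishes, the reduced gene set should be in the okayset/ folder next to your input.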
See here for sra2genes:
http://arthropods.eugenes.org/EvidentialGene/other/sra2genes_testdrive/sra2genes4v_testdrive/
SRA2Genes steps are:
data selection and RNA assemblies
1. get RNA data (from NCBI SRA, or other sources)
2. reformat sra to fasta
3. subset(s) of data, digital normalize/reduce;
4. run several assemblers, with kmer size options, other opts
5. post process assembly sets (trformat.pl)
6. quick assessment: aastats per assembly, report
reduction to best draft gene set, self-referential (no external gene evidence)
7. run evg over-assembly reduction, tr2aacds.pl
refinement with external gene evidence
8. ref protein blastp x evg okayset
9. annotate and name genes, vector/contam screen, conserved domains, transposons
10. make annotated public gene set
11. make NCBI TSA submission file set
So step 7 is the reduction with tr2aacds.pl. Then, to proceed to steps 8 and on, you need to add reference proteins for your plant.
Pick a set of good-quality proteins from related species; I suggest 2 to 5 species. For Douglas fir you can pick from several related species. I suggest you look at either NCBI RefSeq for plants or Phytozome if you don't already have a set of reference species genes.
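Before going on to step 8, a quick check of what you have collected can help. The sketch below assumes one protein fasta file per reference species, placed in a refset/ folder, with a .aa, .fa or .fasta extension. That file layout and the count_fasta helper are my assumptions for illustration, not something sra2genes requires.

```shell
# Count fasta records (header lines starting with ">") in one file.
count_fasta() {
  grep -c '^>' "$1"
}

refdir="${refdir:-refset}"   # assumed folder holding the reference protein files
nspecies=0
for f in "$refdir"/*.aa "$refdir"/*.fa "$refdir"/*.fasta; do
  [ -f "$f" ] || continue    # skip glob patterns that matched nothing
  nspecies=$((nspecies + 1))
  echo "$f: $(count_fasta "$f") proteins"
done
echo "reference protein files found: $nspecies (suggested: 2 to 5 species)"
```

A file with an unexpectedly low count is often a truncated download and worth fetching again.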
Use sra2genes this way, for its several steps:
1. Create folders of your input data and reference data:
trsets/ folder of your transcripts
refset/ folder has reference data for your species
genome/ folder has chromosome assembly, if available (not required)
2. Run this setup script with configured applications:
env name=arath16test species=Arabidopsis_thaliana runsteps=start7 ncpu=8 maxmem=16000 datad=`pwd` ./run_evgsra2genes4v.sh
3. Run the cluster scripts it creates to do the computation
4. Rerun this run_evgsra2genes4v.sh script, which then advances to the next pipeline steps, depending on the outcome of (3).
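Steps 1 and 2 can be sketched as follows. The project folder name, run name and species label are hypothetical values for a Douglas fir project; the remaining settings simply repeat the Arabidopsis example above. The sketch only prints the setup command, on the assumption that run_evgsra2genes4v.sh still has to be copied into the project folder and configured for your applications.

```shell
# Step 1: create the input folders; the project folder name is hypothetical.
projdir="${projdir:-douglasfir_evg}"
mkdir -p "$projdir/trsets" "$projdir/refset" "$projdir/genome"
datad=$(cd "$projdir" && pwd)   # absolute path, as datad=`pwd` gives in the example

# Hypothetical run name and species label for this project.
name=dougfir1
species=Pseudotsuga_menziesii

# Step 2: print the setup command instead of running it here.
runcmd="env name=$name species=$species runsteps=start7 ncpu=8 maxmem=16000 datad=$datad ./run_evgsra2genes4v.sh"
echo "cd $datad && $runcmd"
```

Put your transcripts in trsets/ and your reference proteins in refset/ before running the printed command; genome/ can stay empty if you have no chromosome assembly.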