blastn input

Holly Williams

Sep 12, 2022, 9:24:19 PM

to EvidentialGene

Hello Don Gilbert and Evigene users,

I am running Evigene for the first time, and am unclear on what query file(s) and db file I should use.

So far I have run 12 assemblies (Trinity and MEGAHIT, each run on 6 different Douglas fir families) through trformat.pl, then ran the resulting fasta file through tr2aacds.pl. I then removed "identicals" using fastanrdb on the .cds file. I ran the reduced .cds file through cd-hit-est (-c 1.0) to remove shorter cds matches, and ran the .aa file through cd-hit (-c 0.9) to cluster amino acid sequences. (Question: should I first run the .aa file through fastanrdb before running cd-hit, to get rid of identical amino acid sequences?)

For the next step, I am not sure which files to run through blastn, or whether I should use the public nr database or my own reduced files as the db to query against.

I appreciate your advice (and also any example command code people have used to run blastn). Thank you very much.

Holly Williams

Don Gilbert

Sep 19, 2022, 1:49:23 PM

to Holly Williams, EvidentialGene

Holly,

We could use a few more details of the software and data steps you used. It sounds like you may be using Evigene's sra2genes pipeline, which has several steps. The simpler tr2aacds pipeline only reduces your input transcript set to a smaller, relevant coding gene set. tr2aacds handles details like cd-hit, blastn and such, so you don't need to worry about them. But it doesn't deal with external reference data, like reference proteins and gene annotations, which sra2genes does.

See here for tr2aacds only:
http://arthropods.eugenes.org/EvidentialGene/about/EvidentialGene_trassembly_pipe.html

which takes your gene assembly fasta sequences as input, and outputs a folder okayset/ of the reduced gene set (transcripts and proteins):

$evigene/scripts/prot/tr2aacds.pl -NCPU $ncpu -MAXMEM $maxmem -log -cdna $trset
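
A minimal sketch of how the variables in that command might be set before running it. The install path, the transcript file name (dougfir_all.tr) and the resource values are assumptions for illustration, and -MAXMEM is taken here to be in megabytes. The script only prints the command unless tr2aacds.pl and the input file are actually found:

```shell
# Assumed values; change these to match your own install and data
evigene="${evigene:-$HOME/bio/evigene}"   # hypothetical Evigene install path
ncpu="${ncpu:-8}"                         # CPU cores to use
maxmem="${maxmem:-16000}"                 # memory limit, assumed to be in MB
trset="${trset:-dougfir_all.tr}"          # hypothetical trformat.pl output, all assemblies combined

# Same command as above, with the variables filled in
cmd="$evigene/scripts/prot/tr2aacds.pl -NCPU $ncpu -MAXMEM $maxmem -log -cdna $trset"
echo "$cmd"

# Run only when the script and a non-empty input file are present
if [ -x "$evigene/scripts/prot/tr2aacds.pl" ] && [ -s "$trset" ]; then
  $cmd
else
  echo "dry run: tr2aacds.pl or $trset not found" >&2
fi
```

Printing the command first makes it easy to check the paths before committing a long run to a cluster queue.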

See here for sra2genes:
http://arthropods.eugenes.org/EvidentialGene/other/sra2genes_testdrive/sra2genes4v_testdrive/

SRA2Genes steps are
data selection and RNA assemblies
1. get RNA data (from NCBI SRA, or other sources)
2. reformat sra to fasta
3. subset(s) of data, digital normalize/reduce;
4. run several assemblers, with kmer size options, other opts
5. post process assembly sets (trformat.pl)
6. quick assessment: aastats per assembly, report

reduction to best draft gene set, self-referential (no external gene evidence)
7. run evg over-assembly reduction, tr2aacds.pl

refinement with external gene evidence
8. ref protein blastp x evg okayset
9. annotate and name genes, vector/contam screen, conserved domains, transposons
10. make annotated public gene set
11. make NCBI TSA submission file set

So step 7 is the reduction with tr2aacds.pl.
To proceed to steps 8 and on, you need to add reference proteins for your plant: a set of good-quality proteins from other species, I suggest 2 to 5 species. For Douglas fir you can pick from several related species. If you don't already have a set of reference species genes, I suggest you look at either NCBI RefSeq for plants or Phytozome.
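
For orientation only, here is a rough sketch of what a reference protein comparison like step 8 can look like with NCBI BLAST+. sra2genes creates and runs its own scripts for this step, so this is not a replacement for the pipeline. The file names, the query/database direction and the e-value cutoff are all assumptions, and the commands are only printed unless BLAST+ and both input files are present:

```shell
# Hypothetical file names; adjust to your own data
refaa="${refaa:-refset/refplants.aa}"          # combined reference proteins, 2 to 5 species
evgaa="${evgaa:-okayset/dougfir_all.okay.aa}"  # assumed name of the tr2aacds protein output
ncpu="${ncpu:-8}"

# Build a protein database from the reference set, then search it with the Evigene proteins
mkdb="makeblastdb -in $refaa -dbtype prot -out refset/refplants"
blast="blastp -query $evgaa -db refset/refplants -evalue 1e-5 -outfmt 6 -num_threads $ncpu -out evg_x_ref.blastp.tsv"

if command -v blastp >/dev/null 2>&1 && [ -s "$refaa" ] && [ -s "$evgaa" ]; then
  $mkdb && $blast
else
  echo "dry run, would execute:"
  echo "$mkdb"
  echo "$blast"
fi
```

The -outfmt 6 option writes a tab-separated table, one line per hit, which is convenient for later filtering.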

Use sra2genes this way, for its several steps:

1. Create folders of your input data and reference data:
trsets/ folder of your transcripts
refset/ folder has reference data for your species
genome/ folder has chromosome assembly, if available (not required)

2. Run this setup script with configured applications:

env name=arath16test species=Arabidopsis_thaliana runsteps=start7 ncpu=8 maxmem=16000 datad=`pwd` ./run_evgsra2genes4v.sh

3. Run the cluster scripts it creates to do the computations.

4. Rerun this run_evgsra2genes4v.sh script, which then moves on to the next pipeline steps, depending on the outcome of (3).
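
Steps 1 to 4 above might be sketched as follows for a Douglas fir project. The name= and species= values and the example file names in the comments are assumptions swapped in for the Arabidopsis test values; the other settings are copied from the command in step 2. The sketch skips the pipeline call if run_evgsra2genes4v.sh is not in the current folder:

```shell
# Step 1: folders for input and reference data, created in the current directory
mkdir -p trsets refset genome

# Hypothetical file names; copy or link your own data into place
# cp dougfir_all.tr trsets/
# cp refplants.aa refset/

# Steps 2 and 4: the same command serves for the first run and for each rerun
run_sra2genes() {
  if [ -x ./run_evgsra2genes4v.sh ]; then
    env name=dougfir species=Pseudotsuga_menziesii runsteps=start7 \
        ncpu=8 maxmem=16000 datad="$(pwd)" ./run_evgsra2genes4v.sh
  else
    echo "run_evgsra2genes4v.sh not found in $(pwd); skipping" >&2
  fi
}

run_sra2genes

# Step 3: run the cluster scripts created by the call above,
# then call run_sra2genes again to move to the next pipeline steps.
```

Wrapping the call in a function keeps the settings identical between the first run and later reruns.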

-- Don Gilbert

--

don gilbert - www.bio.net - bioinformatics - indiana.u.

Holly Williams

Oct 22, 2022, 8:07:01 AM

to EvidentialGene

Thank you for your thorough reply, Don!

I am quite new to coding and completely new to Evigene. As you surmised, I thought I needed to run fastanrdb, cd-hit and blast after running tr2aacds, but I now understand that tr2aacds runs these for you. I appreciate you clearly typing out all the steps for me.