tr2ncrna.pl FAILED at step: make

James Ord

unread,

Nov 9, 2022, 9:13:51 AM11/9/22

to EvidentialGene

Dear Don

Firstly thank you very much for making EvidentialGene.

I am using tr2aacds for collapsing multi-assemblies, and would like to also recover putative ncrna using the tr2ncrna.pl script.

The script runs until the blastn step is finished, however it then fails with an error. See the output below:

#ncrna: EvidentialGene tr2ncrna, VERSION 2020.02.25
#ncrna: tr2ncrna -ncpu 16 -trset dropset/all_transcripts_Estonia.drop.tr -mrna okayset/all_transcripts_Estonia.okay.mrna
consensus transcripts n=1941 tabled in all_transcripts_Estonia.drop.nomrna.tr.consensus
#ncrna: FAILED at step: make_allevdtab nevd=0/234035 in all_transcripts_Estonia.drop.longnok.allevd.tab

The folder ncrnaset is generated and the file ending in .longnok.allevd.tab does appear in there, but includes only a header:

$ head all_transcripts_Estonia.drop.longnok.allevd.tab
IDtrasm trlen aalen pcds cdpot agree dglocus dgclass aaqual

I executed the tr2ncrna.pl script within the same folder in which I executed tr2aacds.pl VERSION 2022.01.20. The command for tr2aacds was:

tr2aacds.pl -species=Salmo_trutta -NCPU=16 -MAXMEM=160000 -cdnaseq all_transcripts_Estonia.fasta

Therefore the okayset and dropset folders are already in the directory in which I run tr2ncrna. I'll include the output summary at the end of this message.

Any idea what could have caused tr2ncrna to stop with the error regarding generating the allevd.tab file?

Cheers

James

tr2aacds summary:

#t2ac: EvidentialGene tr2aacds.pl VERSION 2022.01.20
#t2ac: CMD: tr2aacds.pl -species=Salmo_trutta -NCPU=16 -MAXMEM=160000 -cdnaseq all_transcripts_Estonia.fasta
#t2ac: NOTE: evd_smallclassfilter not enough evidence nev=155815 for ncds=1595040
# Class Table for all_transcripts_Estonia.trclass
class %okay %drop okay drop
althi 4.1 2.6 90650 56813
althi1 13.2 5.1 286232 111957
althinc 1.3 0 29006 0
altmfrag 0.9 2.4 20673 52084
altmid 1 1.8 23677 40938
main 2.6 0.7 57553 16344
mainnc 1.7 0 38583 0
noclass 4 12.9 88383 279073
noclassnc 3.7 0 79988 0
parthi 0 6.3 0 138325
parthi1 0 1 0 21807
perfdupl 0 25.9 0 560463
perffrag 0 7.8 0 169279
smallorf 0 0 0 0
---------------------------------------------
total 33 66.9 714745 1447083
=============================================
# AA-quality for okay set of all_transcripts_Estonia.aa.qual (no okalt): all and longest 1000 summary
okay.top n=1000; average=1968; median=1728; min,max=1376,6755; nfull=933; sum=1968990; gaps=0,0
okay.all n=264507; average=111; median=64; min,max=30,6755; nfull=186383; sum=29573280; gaps=0,0
#ta2c: ERR:output skip dup id TRINITY_GG_87350_c0_g1_i4
#readPubidTab(publicset/all_transcripts_Estonia.pubids)= 714745
# nin=24029379, nok=5594532, nfrag=11171, nskipnotloc=15188387, nskipdupfrag=446218, nskipdiffloc=2789071
#insertUniqExons= 27969
#collectExonChains= 700418 of 705768 ids
#assignChainLoci
#n_class: ichain=361577 icalt=295547 icsub=43294 icdup=5350
#n_alts : t1=361577 t2=43324 t3=23608 t4=17652 t5=14383 t6=12216 t7=10680 t8=9409 t9=8322 t10=7495 t11=6754 t12=6127 t13=5585 t14=5144 t15=4727 t16=4331 t17=3994 t18=3681 t19=3414 t20=3145

Don Gilbert

unread,

Nov 9, 2022, 2:36:05 PM11/9/22

to James Ord, EvidentialGene

James,

THanks much for this report on tr2ncrna failing; I've another such report like yours, so it is likely a software bug (or data+soft), and I'll take a look as soon as I can. It may be simple to fix, or not, as this is one of the more complex parts of evigene code, using both data + software that needs to mesh properly to work. It is complex in part because non-coding genes are ill defined; evigene tr2ncrna tries to define them as a transcript set minus all likely coding transcripts, minus likely artifacts. This pipeline has several checking steps, and one or more of those is finding fault, but not reporting it well enough to decipher.

- Don Gilbert

--
You received this message because you are subscribed to the Google Groups "EvidentialGene" group.
To unsubscribe from this group and stop receiving emails from it, send an email to evidentialgen...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/evidentialgene/fc4bc68f-35b0-4ab4-8e41-0cfddee346e0n%40googlegroups.com.

--

don gilbert - www.bio.net - bioinformatics - indiana.u.

Don Gilbert

unread,

Nov 9, 2022, 3:55:51 PM11/9/22

to James Ord, EvidentialGene

James (& others),

I spot a discrepancy in your tr2ncrna.pl failed runs, versus a successful run. Can you rerun using your full transcript set as input parameter, not the dropset/ subset?

This " -trset input/name.tr " needs to be the same input set you used for tr2aacds, in the same directory with outputs of that pipeline.

Let me know if it works or fails. There may be a bug in documentation somewhere.

Also, if this test fails for you, would you try with the tiny weed data set I use to see if that also fails for you?

tiny test data: http://arthropods.eugenes.org/EvidentialGene/plants/arabidopsis/evigene_tr2aacds_test2021/arath_TAIR10_20101214up.cdna.gz

- Don

http://arthropods.eugenes.org/EvidentialGene/evigene/docs/EvigeneR/tr2ncrna_about.txt
Usage: tr2ncrna.pl -trset input/name.tr -mrna okayset/name.okay.mrna

Evigene tr2ncrna.pl fail-report#1
$evigene/scripts/genes/tr2ncrna.pl -ncpu $ncpu -debug -log -trset dropset/XXX.drop.tr -mrna okayset/XXX.okay.mrna

Evigene tr2ncrna.pl fail-report#2
$evigene/scripts/genes/tr2ncrna -ncpu 16 -trset dropset/XXX.drop.tr -mrna okayset/XXX.okay.mrna

DGG tr2ncrna test run succeeds:
in current directory for input set at18tair_mrna.fa, with tr2aacds.pl outputs,
$evigene/scripts/genes/tr2ncrna.pl -ncpu 2 -log -debug -trset at18tair_mrna.fa -mrna okayset/at18tair_mrna.okay.mrna

On Wed, Nov 9, 2022 at 9:13 AM James Ord <jms....@gmail.com> wrote:

--

You received this message because you are subscribed to the Google Groups "EvidentialGene" group.
To unsubscribe from this group and stop receiving emails from it, send an email to evidentialgen...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/evidentialgene/fc4bc68f-35b0-4ab4-8e41-0cfddee346e0n%40googlegroups.com.

James Ord

unread,

Nov 11, 2022, 4:29:37 AM11/11/22

to EvidentialGene

Hi Don

Thank you very much for the quick response.

I tried as you suggested with using the full transcript set (tr2aacds input) as the input for tr2ncrna and it now runs without the error, generating a non-empty *.allevd.tab file.

It works both with the tiny weed transcript set (arath_TAIR10_20101214up.cdna) and with my larger trout transcript set. I include the stdout for the former at the end of the message.

In the ncrnaset folder there is a fasta called arath_TAIR10_20101214up.ncrna_pub.fa containing 2208 sequences. Would this be the correct file with nonredundant putative ncrnas?

Cheers

James

-----------

#t2ac: EvidentialGene tr2aacds.pl VERSION 2022.01.20

#t2ac: CMD: tr2aacds.pl -species=Arabidopsis_thaliana -NCPU=16 -MAXMEM=80000 -cdnaseq arath_TAIR10_20101214up.cdna
#t2ac: NOTE: evd_smallclassfilter not enough evidence nev=31 for naa=43235
# Class Table for arath_TAIR10_20101214up.trclass

class %okay %drop okay drop

althi 4.2 0.07 1848 34
althi1 10.6 0.16 4615 70
althinc 0.09 0 42 0
altmfrag 0.42 0.04 184 19
altmid 0.8 0.04 357 19
main 12.6 0.03 5482 15
mainnc 0.09 0 40 0
noclass 56.8 1.4 24598 618
noclassnc 3.2 0 1405 0
parthi 0 1.1 0 477
parthi1 0 0.43 0 187
perfdupl 0 6.1 0 2660
perffrag 0 1.3 0 565

smallorf 0 0 0 0
---------------------------------------------

total 89.2 10.7 38571 4664
=============================================
# AA-quality for okay set of arath_TAIR10_20101214up.aa.qual (no okalt): all and longest 1000 summary
okay.top n=1000; average=1416; median=1265; min,max=1052,5393; nfull=996; sum=1416673; gaps=2,0
okay.all n=31525; average=383; median=322; min,max=30,5393; nfull=30445; sum=12105077; gaps=19,0
#ta2c: ERR:output skip dup id ATMG00130.1
#readPubidTab(publicset/arath_TAIR10_20101214up.pubids)= 38571
# nin=57234, nok=53212, nfrag=3, nskipnotloc=2993, nskipdupfrag=10, nskipdiffloc=1016
#insertUniqExons= 1544
#collectExonChains= 37685 of 38568 ids
#assignChainLoci
#n_class: ichain=32654 icalt=3915 icsub=1116 icdup=883
#n_alts : t1=32654 t2=3240 t3=488 t4=91 t5=32 t6=17 t7=12 t8=10 t9=8 t10=6 t11=3 t12=3 t13=3 t14=1 t15=1 t16=0 t17=0 t18=0 t19=0 t20=0

#ncrna: EvidentialGene tr2ncrna, VERSION 2020.02.25

#ncrna: tr2ncrna -ncpu 16 -input arath_TAIR10_20101214up.cdna -mrna okayset/arath_TAIR10_20101214up.okay.mrna
consensus transcripts n=11 tabled in arath_TAIR10_20101214up.nodropbigcds.tr.consensus
#ncrna: data=REFAA, exists=0, path=
#ncrna: data=refset/ref*.aa, exists=0, path=

----------------

in 'ncrnaset' folder:

$ grep '>' arath_TAIR10_20101214up.ncrna_pub.fa | wc -l
2208

$ head arath_TAIR10_20101214up.ncrna_pub.fa -n 4
>ArathaEVm032729t1 type=ncRNA; aalen=85,30%,na; clen=839; evgclass=mainuni; notes=aaref:85/30%/complete-utrbad,tscore:89.5; oid=AT1G72645.1
AAAAATGTGTGCATGCTAGTTTCTAGTTTTGTTCTCTCTTTTTTTGTGTTATCTTCTATGAGTCTCTTTA
TTCATAGATGTTTTCGTTTGATTATATATACCTTAGTCATGTTATCTATTTATGATGTAGCTCGTTCTTA
CTTGAGATTGCACTAGTAGTCTTGCTTAATGACATATAAATCTTGATTTAAACTCCCTCTGGATATTCTT

Don Gilbert

unread,

Nov 11, 2022, 2:58:25 PM11/11/22

to James Ord, EvidentialGene

James,

Thank you much for helping resolve this. Some mistake caused two people to use this wrong invocation "-trset dropset/XXX.drop.tr", which may be my error in documents. Do you recall your reason for this?

-- Don Gilbert

The documents for tr2ncrna need improvement, a worked example would help. Here are some notes

* Needs more explanation in tr2ncrna_about.txt

-- Only calls long ncRNA (> 300 nt default). Use other tools to find short ncRNA (eg. tRNA, snoRNA, microRNA, etc), maybe recommend some

-- The final ncRNA sequence set is in ncrnaset/Name.ncrna_pub.fa, which if you wish can be added to the final mRNA set, okayset/Name.mrna.fa, for a single RNA sequence set. They both have consistent, different gene IDs, and alternate form numbers (ncRNA ids start after mRNA), the ncRNA headers have 'type=ncRNA;' versus 'type=mRNA;' in the mrna.fa, and other annotations, 'oid=AT1G72645.1' is your original transcript ID.

-- This is an approximate classification of long ncRNA, it will include some that are coding with long UTR transcripts. It should also contain a full non-redundant set of long ncRNA genes in your transcript set. Using this on mRNA set of the model weed (AT) calls 1000 as ncRNA, or 2200 as you found using their full RNA gene set. These are mostly short-coding, long UTR transcripts, and those models may include joined coding + ncRNA genes, or may be long-UTR coding genes, that are hard to distinguish.

-- Consensus or agreement across transcript assembler methods is a valuable piece of evidence for ncRNA calls. Many users rely on one assembler, ie Trinity, but even if that is *best* (I disagree), use of 3 or more RNA assemblers provides a consensus measure that works, as Adam Voshall pointed out to me and others.

-- Uses step-wise subtraction of mRNA, too short or likely artifacts
-- tr2ncrna.log file lists steps with retained transcripts;
The *.trids files list original IDs for each step.

S1 9354 at18tair_mrna.nomrna.trids : not in okayset/at18tair_mrna.okay.mrna
S2 3487 at18tair_mrna.nodropbigcds.trids : remove drops with large CDS
S3 1658 at18tair_mrna.notokmrna.trids : no aligns with mrna (putative ncrna)
S4 1247 at18tair_mrna.longnok.trids : long enough to avoid artifacts
S5 1067 at18tair_mrna.ncrna_pub.trids : final calls, after drop likely duplicate ncrna models

at18tair_mrna.tr2ncrna.log
#ncrna: BEGIN with input= at18tair_mrna.fa date= Wed Nov 9 15:29:17 EST 2022
#ncrna: tr2ncrna( at18tair_mrna.fa, okayset/at18tair_mrna.okay.mrna)

S1 #ncrna: remove_mrna_oids kept=9354/48147, at18tair_mrna.nomrna.tr
S2 #ncrna: remove_bigcdsdrops kept=3505/48147, at18tair_mrna.nodropbigcds.tr
S3 #ncrna: remove_mrna_aligned kept=1658/48147, at18tair_mrna.notokmrna.tr
S4 #ncrna: long_seqs(>=300) kept=1247/48147, at18tair_mrna.longnok.tr

S5 #ncrna: ncrna_selfalign count=1247/48147, at18tair_mrna.longnok.self97.blastn
S5 #ncrna: altparclass loci=1039 for 1247 tr, in at18tair_mrna.longnok.dgclass.tab
S5 #ncrna: nagree=0 in at18tair_mrna.nodropbigcds.tr.consensus ** No multiple-assembler consensus here

S5 #ncrna: make_allevdtab nevd=1240/1247 in at18tair_mrna.longnok.allevd.tab
S5 #ncrna: ncrna_classgenes ngene=1032,ok=1067,drop=173/1247 in at18tair_mrna.longnok.allevd.ncrna_class
S5 #ncrna: ncrna_pubseqset kept=1067/48147, in at18tair_mrna.ncrna_pub.fa

To view this discussion on the web visit https://groups.google.com/d/msgid/evidentialgene/1a812043-6bc1-4995-bd31-e473f7b11af8n%40googlegroups.com.

Message has been deleted

James Ord

unread,

Nov 14, 2022, 8:22:43 AM11/14/22

to EvidentialGene

Hi Don

By wrong invocation, do you mean specifying the dropset instead of the full transcript set as the input?

In which case, possibly the confusion stems from an earlier post in this group in which you mentioned that the dropset folder was needed by tr2ncrna. This may have led myself and others to think that the dropset was supposed to be passed to the -input / -trset argument.

Also I have noticed that '-trset' and '-input' parameters seem to be interchangeable. Does it matter which one is used? I realised this morning that I had run the A.thal test with -input and my larger trout run with -trset and both ran without error.

Thanks for giving more attention to this and developing it further. Re. small RNAs: I am aware of some tools for de novo miRNA discovery (e.g. mirnovo, brumir), but I have not personally used them.
Cheers
James

Reply all

Reply to author

Forward

tr2ncrna.pl FAILED at step: make_allevdtab

James Ord

Don Gilbert

Don Gilbert

James Ord

Don Gilbert

James Ord