Cleaning up the transcriptome results from tr2aacds.pl?

146 views
Skip to first unread message

charles...@gmail.com

unread,
Nov 16, 2023, 2:42:57 AM11/16/23
to EvidentialGene
Dear Don,

After a couple of years' hiatus I'm jumping back into transcriptomics temporarily. I've successfully run 'tr2aacds.pl' on my "over-assembly", resulting in the expected 'okayset', 'inputset', 'dropset' etc.

In the past, my next step was to clean up the names of the okayset files using your script:

perl $evigene/evgmrna2tsa.pl  -onlypubset -idprefix My_species -class evigene_formatted_assembly.trclass

When I went to do this using the most recent version of Evigene, I got a deprecation notice. I was informed that:

  Usage now is
    (a) prot/tr2aacds4.pl, which calls subprogram trclass2pubset.pl,
        output to okayset/ with public ids (pubids) and reformated sequences,
        replacing evgmrna2tsa.pl -onlypubset

However, tr2aacds.pl (the symlink of tr2aacds4.pl) did not call the subprogram trclass2pubset.pl when I ran it as far as I can tell: there were no "publicset" files. Instead I next manually ran trclass2pubset.pl, but I'm a bit unsure of what to provide for its options (specifically those I've bolded):

$evigene/genes/trclass2pubset.pl
EvidentialGene trclass2pubset -trclass myspp.trclass [-idprefix EVGm ]
  makes tables of public ids and main-alt linkage, from results of tr2aacds
  opts: -idprefix Thecc1EG  -mrna myspp.mrna  -names mrna.names
     -keepdrop keep_drop_ids.table -preserveOldIds=old.pubids
     -nosizesort -[no]pubsortseq -debug
  version 2020.03.15

In lieu of knowing the correct options, I ran it from within the directory I first ran tr2aacds.pl as:

$evigene/genes/trclass2pubset.pl -trclass evigene_formatted_assembly.trclass -idprefix Rtaylori

In the resulting publicset/ directory there are the nicely renamed transcriptome files as desired. However, the number of sequences do not appear to match those in the corresponding okayset/ files like I would hope. For example:

$ grep -c ">" publicset/evigene_formatted_assembly.cds_pub.fa
149053

$ grep -c ">" okayset/evigene_formatted_assembly.okay.cds
96931

Questions:
1. Am I running these correctly?
2. Why don't the number of sequences match?

Thanks,
Charles

Don Gilbert

unread,
Nov 16, 2023, 3:14:02 PM11/16/23
to charles...@gmail.com, EvidentialGene
Charles,

This updated tr2aacds program has changed the folder name 'publicset' to 'okayset', so that your final outputs are there as a publicly usable transcript set, which should include the -IDprefix you specify.   In tr2aacds version 4, there is now a preliminary folder from the first step, okayset1st.    This version adds several further measures of transcript quality, primarily reducing spurious alternate transcripts (i.e. partial proteins and slight variants of primary transcripts), so that the final okayset (or publicset) has more reductions from okayset1st.  This all followed from extensive testing and I believe it produces an improved transcript set, with less spurious redundancies but no loss of primary gene transcripts.

If you do want to run the $evigene/genes/trclass2pubset.pl program (which is automatically run as part of tr2aacds4, second step), you will need to use inputs of okayset1st/ folder (by removing updated okayset/ and renaming okayset1st/ to that).  Then you want to add two options to the way you tried it, -noaltdrop and -cullothers, to preserve the excess alternate transcripts :
    $evigene/genes/trclass2pubset.pl -trclass evigene_formatted_assembly.trclass -idprefix Rtaylori  -noaltdrop -cullothers -log

Do use the -log option to produce an output file of program steps that you can show me if you run into further problems.  But I think the output of trclass2pubset, starting from okayset1st/ folder will be near to what the tr2aacds4 okayset has (possibly missing some measures).

There remains for me to finish a part of EvidentialGene that measures true duplicated genes, that have near identical transcripts, and requires DNA samples to measure copy number of duplicate genes.   A partial program of this is genes/tr2dupgeneclass1a.pl but I can't recommend its use yet.  It requires inputs of both the Gnodes portion for measuring DNA copy numbers as well as tr2aacds for transcript assemblies.

- Don Gilbert

--
You received this message because you are subscribed to the Google Groups "EvidentialGene" group.
To unsubscribe from this group and stop receiving emails from it, send an email to evidentialgen...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/evidentialgene/e795b0a0-084d-4c36-9c62-76571f0deb41n%40googlegroups.com.


--
don gilbert - www.bio.net - bioinformatics - indiana.u.
Reply all
Reply to author
Forward
0 new messages