How to get differentially expressed gene sequences (by ids) from Trinity.fasta file.


clo...@csumb.edu

Apr 7, 2015, 4:22:01 PM
to trinityrn...@googlegroups.com
Hello,

I am trying to use grep and awk to extract a subset of DE genes from my Trinity.fasta file. I have the command below, but it prints only the first line of each sequence. How do I modify it to include the entire sequence? Is there a way to do this in Trinity?

head infile
comp1
comp2
comp3
...

head Trinity.fasta

>comp1
ACTCCCTT
>comp2
ACACTGGGT
>compN


cat infile | awk '{print $1}' | grep -A1 -f - Trinity.fasta > good.genes.fa

Thank you!
Cheryl

Will Holtz

Apr 7, 2015, 4:25:37 PM
to clo...@csumb.edu, trinityrn...@googlegroups.com
Consider converting your fasta file to a tabular format with something like fasta_formatter from the FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/). Then do your grepping on the tabular file, and convert back to fasta afterwards if needed.
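
A minimal Python sketch of the same idea, for readers without the FASTX-Toolkit installed. It is not fasta_formatter itself; the function names are hypothetical. Each record is joined onto a single "id<TAB>sequence" line, so a sequence that spans several lines in the fasta file can be selected by matching one line, and the result can be turned back into fasta afterwards.

```python
def fasta_to_tabular(lines):
    """Yield one 'id<TAB>sequence' string per FASTA record."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record.
            if header is not None:
                yield header + "\t" + "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are collected and joined later.
            chunks.append(line)
    if header is not None:
        yield header + "\t" + "".join(chunks)


def tabular_to_fasta(rows):
    """Yield FASTA lines (header, then sequence) from 'id<TAB>sequence' rows."""
    for row in rows:
        header, seq = row.rstrip("\n").split("\t", 1)
        yield ">" + header
        yield seq


# Toy input: comp1 is wrapped over two sequence lines.
example = [">comp1", "ACTC", "CCTT", ">comp2", "ACACTGGGT"]
table = list(fasta_to_tabular(example))
round_trip = list(tabular_to_fasta(table))
```

Here `table` holds one line per record, which is why a line-based tool such as grep no longer truncates the sequence.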

-Will


--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at http://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.




Dan Browne

Apr 7, 2015, 4:50:02 PM
to trinityrn...@googlegroups.com
Hi Cheryl,

I wrote a script the other day to do exactly this. It is attached. To run it, you will need Python 2.7 and pyfaidx (https://github.com/mdshw5/pyfaidx). The easiest way to install pyfaidx is with pip.

The script takes a list of gene identifiers (TR#|c#_g#) and writes the sequences of all the isoforms (TR#|c#_g#_i#) to a fasta file.

Usage is:

$ python extract_unigene_isoforms.py -q gene_list.txt -f Trinity_assembly.fa -o output_file.fa

Let me know if you have any questions!

Dan
extract_unigene_isoforms.py

Dan Browne

Apr 7, 2015, 4:51:25 PM
to trinityrn...@googlegroups.com
A note: if you want specific isoforms rather than all isoforms for a given gene, provide a list of the desired isoforms (TR#|c#_g#_i#). It should still work.
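
For readers who cannot open the attachment, here is a minimal sketch of the matching logic described above. It is not the attached script and does not use pyfaidx; the function names and the toy records are hypothetical. It assumes transcript names end in an `_i<number>` suffix, so stripping that suffix gives the gene identifier.

```python
import re

# Assumption: isoform names look like TR#|c#_g#_i#, gene names like TR#|c#_g#.
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def gene_id(transcript_id):
    """Return the gene part of a transcript identifier."""
    return ISOFORM_SUFFIX.sub("", transcript_id)


def select_records(records, wanted):
    """Yield (name, seq) pairs whose name, or whose gene, is in `wanted`."""
    wanted = set(wanted)
    for name, seq in records:
        if name in wanted or gene_id(name) in wanted:
            yield name, seq


# Toy records standing in for a parsed Trinity assembly.
records = [
    ("TR1|c0_g1_i1", "ACTCCCTT"),
    ("TR1|c0_g1_i2", "ACTCCC"),
    ("TR2|c0_g1_i1", "ACACTGGGT"),
]
by_gene = list(select_records(records, ["TR1|c0_g1"]))
by_isoform = list(select_records(records, ["TR1|c0_g1_i2"]))
```

A gene identifier selects every isoform of that gene, while an isoform identifier selects only that one record, so the same function covers both kinds of list.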

Cheryl Logan

Apr 7, 2015, 6:27:13 PM
to Will Holtz, trinityrn...@googlegroups.com
This solution worked great. I converted my Trinity.fasta file to tabular format using fasta_formatter from the FASTX-Toolkit. 

fasta_formatter -i Trinity.fasta -o Trinity.txt -t

I just had to make one modification to the grep command, so that it prints only 1 line:

cat infile | awk '{print $1}' | grep -A0 -f - Trinity.txt > good.genes.fa
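
One thing to watch with pattern-based matching: a pattern such as comp1 matches anywhere in a line, so it also hits comp10, comp11 and so on. A minimal Python sketch that matches the identifier column exactly and writes fasta rather than tabular output is below. The function name and the toy rows are hypothetical, and it assumes each row is "id<TAB>sequence".

```python
def filter_tabular(lines, ids):
    """Yield FASTA lines for rows whose identifier is exactly in `ids`."""
    wanted = set(ids)
    for line in lines:
        fields = line.rstrip("\n").split("\t", 1)
        if len(fields) != 2:
            continue  # skip rows that are not 'id<TAB>sequence'
        header, seq = fields
        tokens = header.split()
        # Compare only the first word of the header, and compare it exactly.
        if tokens and tokens[0] in wanted:
            yield ">" + header
            yield seq


rows = ["comp1\tACTCCCTT", "comp10\tGGGTTT", "comp2\tACACTGGGT"]

# Substring matching picks up comp10 as well as comp1.
substring_hits = [r for r in rows if "comp1" in r]

# Exact matching on the identifier column returns comp1 only.
exact_hits = list(filter_tabular(rows, ["comp1"]))
```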

Many Thanks,
Cheryl

John Craft

Apr 20, 2016, 7:34:46 AM
to trinityrnaseq-users
Cheryl
I cannot get your command to work. I have a DE gene list extracted from Trinity_genes.counts.matrix.FCF_vs_FEF.edgeR.DE_results, which I saved as a txt file. I replaced the pipe symbol (|) in the entries of this file to avoid any character misidentification, producing Trinity_FCF_FEF_DEgenes_mod1.txt. I ran Trinity.fasta through fasta_formatter to produce the tabular format and then replaced the pipe there as well, producing Trinity_modif4.txt. The formats of all the files look OK. I then ran:

cat Trinity_FCF_FEF_DEgenes_mod1.txt | awk '{print $1}' | grep -A0 -f - Trinity_modif4.txt >Trinity_FCF_FEF_DEgenes_list

but it does not extract any sequences.

I have tested the individual steps, and just running

cat Trinity_FCF_FEF_DEgenes_mod1.txt | awk '{print $1}'

produces the DE gene list, so I suspect the problem lies in the grep step. Running single gene entries with grep -A0 "entry" Trinity_modif4.txt works fine.
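
The thread does not establish why the pattern file fails while single entries work. One common cause, offered here only as an assumption, is invisible characters in the ID list, for example Windows line endings (a trailing carriage return) left by a spreadsheet or text editor. A minimal Python sketch for normalising such a list before matching is below; the function name and the example IDs are hypothetical.

```python
def clean_ids(raw_lines):
    """Return identifiers with line endings, padding and blank lines removed."""
    ids = []
    for raw in raw_lines:
        token = raw.strip()  # removes \r, \n, spaces and tabs at both ends
        if token:
            # Keep only the first column, like awk '{print $1}'.
            ids.append(token.split()[0])
    return ids


# Hypothetical list saved with Windows line endings and stray spaces.
raw = ["TR1_c0_g1\r\n", "\r\n", "  TR2_c0_g1  \n"]
cleaned = clean_ids(raw)

row = "TR1_c0_g1_i1\tACTCCCTT"
dirty_match = "TR1_c0_g1\r" in row   # carriage return prevents the match
clean_match = cleaned[0] in row      # cleaned identifier is found
```

If the cleaned list matches where the original did not, the ID file rather than the grep command is the likely culprit.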

 

Can anyone see where I am going wrong?

Thanks

John

 

 

Adriana Fróes

Apr 20, 2016, 10:42:43 AM
to John Craft, trinityrnaseq-users
You can use the fastacmd script to do that. You just need to format your Trinity.fasta (makeblastdb) and extract the gene IDs from the *.subset file (cat file.subset | awk '{print $1}' > outfile.txt).


Adriana M. Froes
Laboratório de Microbiologia, Instituto de Biologia, Depto de Biologia Marinha
Universidade Federal do Rio de Janeiro   
Av. Carlos Chagas Filho 373, Sala A3-202, Bloco A (Anexo) do CCS
21941-599, Ilha do Fundão, Rio de Janeiro, RJ


Pavel Eduardo Galindo Torres

Apr 20, 2016, 7:06:11 PM
to trinityrnaseq-users
Hi 

This could work 

bioawk -cfastx 'BEGIN{while((getline k <"id_file.txt")>0)i[k]=1}{if(i[$name])print ">"$name, "length="length($seq)"\n"$seq}' Trinity.fasta > final_file.fasta
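
For anyone without bioawk, a rough Python equivalent of the one-liner is sketched below. It follows the same steps: load the IDs, read each record, and print the matching ones with a length annotation. The function name is hypothetical, and the handles are in-memory stand-ins for id_file.txt, Trinity.fasta and final_file.fasta.

```python
import io


def extract_with_length(id_handle, fasta_handle, out_handle):
    """Write records whose name is listed in id_handle, adding 'length=N'."""
    wanted = {line.strip() for line in id_handle if line.strip()}

    def flush(name, chunks):
        # Emit the finished record only if its name was requested.
        if name in wanted:
            seq = "".join(chunks)
            out_handle.write(">%s length=%d\n%s\n" % (name, len(seq), seq))

    name, chunks = None, []
    for line in fasta_handle:
        line = line.strip()
        if line.startswith(">"):
            flush(name, chunks)
            # Use the first word of the header as the record name.
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif line:
            chunks.append(line)
    flush(name, chunks)


ids = io.StringIO("comp2\n")
fasta = io.StringIO(">comp1 len=8\nACTC\nCCTT\n>comp2 len=9\nACACT\nGGGT\n")
out = io.StringIO()
extract_with_length(ids, fasta, out)
result = out.getvalue()
```

With real files, the three `io.StringIO` objects would be replaced by handles from `open(...)`.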


regards
Pavel 

Brian Haas

Apr 20, 2016, 7:11:50 PM
to Pavel Eduardo Galindo Torres, trinityrnaseq-users
Lots of ways to do this, of course.

Note, there's a script in the latest Trinity release:

  trinityrnaseq/util/misc/acc_list_to_fasta_entries.pl  

       usage: acc_list_to_fasta_entries.pl   acc.list.txt    file.fasta

that'll extract the list of transcripts from a fasta file based on a list of accessions.

best,

~b

hex...@gmail.com

Sep 28, 2016, 2:56:15 PM
to trinityrnaseq-users, odrau...@gmail.com
Hi Brian,

Is it possible to use acc_list_to_fasta_entries.pl to extract isoform (or gene) sequences from trinity.fasta for differentially expressed genes? I tried acc_list_to_fasta_entries.pl, and it did not work.

I have a list of DE genes (e.g. TR#|c#_g#), but the sequence names in trinity.fasta are TR#|c#_g#_i#. Dan Browne provided a way to do that, but I am having trouble installing pyfaidx on the server I use.

Thanks


On Wednesday, April 20, 2016 at 7:11:50 PM UTC-4, Brian Haas wrote:

Mark Chapman

Sep 28, 2016, 4:59:31 PM
to hex...@gmail.com, trinityrn...@googlegroups.com, odrau...@gmail.com

Are you trying to extract gene sequences from the trinity.fasta? Genes don't exist as sequences in the assembly; a gene's expression is a combination of the expression values of its transcripts. The fasta file contains only transcript sequences.
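
To make the gene/transcript relationship concrete, here is a minimal Python sketch that groups transcript names under their gene. It assumes the TR#|c#_g#_i# naming mentioned earlier in the thread; the function name and the example identifiers are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: everything before the trailing _i<number> is the gene name.
TRANSCRIPT_ID = re.compile(r"^(?P<gene>.+)_i\d+$")


def group_by_gene(transcript_ids):
    """Map each gene name to the list of its transcript names."""
    genes = defaultdict(list)
    for tid in transcript_ids:
        match = TRANSCRIPT_ID.match(tid)
        # Names without an isoform suffix are kept under their own name.
        gene = match.group("gene") if match else tid
        genes[gene].append(tid)
    return dict(genes)


groups = group_by_gene(["TR1|c0_g1_i1", "TR1|c0_g1_i2", "TR2|c0_g1_i1"])
```

Each key in `groups` is a gene, and its value lists the transcript sequences that actually exist in the fasta file for that gene.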
Best wishes, Mark



hex...@gmail.com

Sep 28, 2016, 7:38:23 PM
to trinityrnaseq-users, hex...@gmail.com, odrau...@gmail.com
Is it possible to extract all transcripts of selected genes?


Brian Haas

Sep 28, 2016, 9:32:13 PM
to hex...@gmail.com, trinityrnaseq-users, Pavel Eduardo Galindo Torres
Attached is a version of the script that should work for lists of either Trinity genes or transcripts.

It'll go in the upcoming release.

best,

~brian


--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

 
acc_list_to_fasta_entries.pl

Farbod Emami

Sep 29, 2016, 4:35:17 AM
to trinityrnaseq-users
Hi Cheryl,

You can have a look at this, too.