How to get differentially expressed gene sequences (by ids) from Trinity.fasta file.

clo...@csumb.edu

Apr 7, 2015, 4:22:01 PM

to trinityrn...@googlegroups.com

Hello,

I am trying to use grep and awk to extract a subset of DE genes from my Trinity.fasta file. I have the command below, but it only prints the first line of each sequence. How do I modify it to include the entire sequence? Is there a way to do this in Trinity?

$ head infile
comp1
comp2
comp3
...

$ head Trinity.fasta

>comp1
ACTCCCTT
>comp2
ACACTGGGT
>compN


cat infile | awk '{print $1}' | grep -A1 -f - Trinity.fasta > good.genes.fa

Thank you!
Cheryl

Will Holtz

Apr 7, 2015, 4:25:37 PM

to clo...@csumb.edu, trinityrn...@googlegroups.com

Consider converting your fasta file to a tabular format with something like fasta_formatter from the FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/). Then do your grepping on the tabular file, and convert back to fasta afterwards if needed.
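
For anyone who wants to see what the tabular step amounts to, here is a minimal Python sketch. It is not fasta_formatter's implementation; the function name and the example records are made up for illustration. It joins a sequence that wraps over several lines into a single line per record, which is the reason a one-line-of-context grep then returns the whole sequence.

```python
def fasta_to_tabular(lines):
    """Yield 'header<TAB>sequence' strings, one per FASTA record.

    Hypothetical helper: wrapped sequence lines are joined so that each
    record fits on a single line and can be matched with grep.
    """
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header + "\t" + "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line.strip())
    # Emit the last record in the input.
    if header is not None:
        yield header + "\t" + "".join(chunks)


# Made-up records; comp1 has a sequence wrapped over two lines.
example = [">comp1 len=16", "ACTCCCTT", "ACACTGGG", ">comp2 len=9", "ACACTGGGT"]
rows = list(fasta_to_tabular(example))
```

With a real file you would pass the open file handle instead of the list, e.g. fasta_to_tabular(open("Trinity.fasta")).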

-Will

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at http://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.


Dan Browne

Apr 7, 2015, 4:50:02 PM

to trinityrn...@googlegroups.com

Hi Cheryl,

I wrote a script the other day to do exactly this. It is attached. In order to run it, you will need Python 2.7 and pyfaidx (https://github.com/mdshw5/pyfaidx). The easiest way to install pyfaidx is with pip.

The script takes a list of gene identifiers (TR#|c#_g#) and writes the sequences of all the isoforms (TR#|c#_g#_i#) to a fasta file.

Usage is:

$ python extract_unigene_isoforms.py -q gene_list.txt -f Trinity_assembly.fa -o output_file.fa

Let me know if you have any questions!

Dan

Attachment: extract_unigene_isoforms.py

Dan Browne

Apr 7, 2015, 4:51:25 PM

to trinityrn...@googlegroups.com

A note: if you want specific isoforms rather than all isoforms of a given gene, provide a list of the desired isoforms (TR#|c#_g#_i#). It should still work.
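
The attached script is not reproduced in this thread, so the following is only a sketch of how a list could accept either gene IDs or isoform IDs. It assumes the gene ID is the transcript ID with its trailing _i# removed; the function names and example IDs are hypothetical.

```python
import re

# Assumption: an isoform ID is the gene ID followed by "_i" and a number.
_ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def gene_of(transcript_id):
    """Return the gene-level ID for a Trinity-style transcript ID."""
    return _ISOFORM_SUFFIX.sub("", transcript_id)


def is_wanted(transcript_id, wanted_ids):
    """True if the list names this isoform directly or names its gene."""
    return transcript_id in wanted_ids or gene_of(transcript_id) in wanted_ids


# Made-up IDs: one gene-level entry and one isoform-level entry.
wanted = {"TR1|c0_g1", "TR2|c0_g1_i2"}
transcripts = ["TR1|c0_g1_i1", "TR1|c0_g1_i2", "TR2|c0_g1_i1", "TR2|c0_g1_i2"]
hits = [t for t in transcripts if is_wanted(t, wanted)]
```

Comparing whole IDs in a set, rather than searching for substrings, keeps TR1|c0_g1 from also picking up an ID such as TR1|c0_g10_i1.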

Cheryl Logan

Apr 7, 2015, 6:27:13 PM

to Will Holtz, trinityrn...@googlegroups.com

This solution worked great. I converted my Trinity.fasta file to tabular format using fasta_formatter from the FASTX-Toolkit.

fasta_formatter -i Trinity.fasta -o Trinity.txt -t

I just had to make one modification to the grep command, so that it prints only 1 line:

cat infile | awk '{print $1}' | grep -A0 -f - Trinity.txt > good.genes.fa
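
Since grep ran against the tabular file, the matched lines are presumably still in tabular form despite the .fa name. If FASTA is needed afterwards, a small Python sketch of the conversion back could look like this. It assumes each line is a header and a sequence separated by a tab; the function name and the wrap width are illustrative choices, not part of the FASTX-Toolkit.

```python
def tabular_to_fasta(rows, width=60):
    """Yield FASTA lines from 'header<TAB>sequence' rows.

    Hypothetical helper; width is the number of bases per output line.
    """
    for row in rows:
        row = row.rstrip("\r\n")
        if not row:
            continue  # skip blank lines
        header, _, seq = row.partition("\t")
        yield ">" + header
        # Wrap the sequence into fixed-width pieces.
        for start in range(0, len(seq), width):
            yield seq[start:start + width]


# Made-up rows, wrapped at 5 bases to keep the example short.
rows = ["comp1\tACTCCCTT", "comp2\tACACTGGGT"]
fasta_lines = list(tabular_to_fasta(rows, width=5))
```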

Many Thanks,

Cheryl

John Craft

Apr 20, 2016, 7:34:46 AM

to trinityrnaseq-users

Cheryl

I cannot get your command to work. I have a file with the DE gene list extracted from Trinity_genes.counts.matrix.FCF_vs_FEF.edgeR.DE_results and used it to create a txt file. I have replaced the pipe symbol (|) in the entries of this file and produced Trinity_FCF_FEF_DEgenes_mod1.txt to avoid any character misidentification. I have run Trinity.fasta through fasta_formatter to produce the tabular format and then replaced the pipe in the Trinity.fasta to produce Trinity_modif4.txt. All of the file formats look OK. I then ran:

cat Trinity_FCF_FEF_DEgenes_mod1.txt | awk '{print $1}' | grep -A0 -f - Trinity_modif4.txt >Trinity_FCF_FEF_DEgenes_list

but it does not extract any sequences.

I have tested steps along the road and just running

cat Trinity_FCF_FEF_DEgenes_mod1.txt | awk '{print $1}'

produces the DE genes list so I suspect the problem lies in the grep. Running single gene entries with grep -A0 "entry" Trinity_modif4.txt works fine.

Can anyone see where I am going wrong?
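
One possible cause worth ruling out (an assumption; nothing in this thread confirms it) is invisible characters in the ID file, such as carriage returns from a Windows editor or trailing spaces. With grep -f, such characters become part of each pattern, so a list that looks correct on screen can match nothing while a hand-typed entry works. A minimal Python sketch of a check, with a hypothetical function name and made-up IDs:

```python
def clean_ids(lines):
    """Return (cleaned IDs, 1-based numbers of lines that needed cleaning).

    Hypothetical helper: flags lines that carry leading or trailing
    whitespace (including carriage returns) or that are empty.
    """
    cleaned, suspicious = [], []
    for number, raw in enumerate(lines, start=1):
        text = raw.rstrip("\n")
        stripped = text.strip()
        if text != stripped or not stripped:
            suspicious.append(number)
        if stripped:
            cleaned.append(stripped)
    return cleaned, suspicious


# Made-up list: line 1 ends in CRLF, line 3 is empty, line 4 has a trailing space.
ids, flagged = clean_ids(["TR1_c0_g1\r\n", "TR2_c0_g1\n", "\n", "TR3_c0_g1 \n"])
```

If any line numbers are flagged, writing the cleaned IDs to a new file and using that file with grep -f would show whether this was the problem.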

Thanks

John

Adriana Fróes

Apr 20, 2016, 10:42:43 AM

to John Craft, trinityrnaseq-users

You can use the fastacmd script to do that. You just need to format (makeblastdb) your Trinity.fasta and extract the gene IDs from the *.subset file (cat file.subset | awk '{print $1}' > outfile.txt).

Adriana M. Froes

Laboratório de Microbiologia, Instituto de Biologia, Depto de Biologia Marinha
Universidade Federal do Rio de Janeiro
Av. Carlos Chagas Filho 373, Sala A3-202, Bloco A (Anexo) do CCS
21941-599, Ilha do Fundão, Rio de Janeiro, RJ


Pavel Eduardo Galindo Torres

Apr 20, 2016, 7:06:11 PM

to trinityrnaseq-users

Hi

This could work

bioawk -cfastx 'BEGIN{while((getline k <"id_file.txt")>0)i[k]=1}{if(i[$name])print ">"$name, "length="length($seq)"\n"$seq}' Trinity.fasta > final_file.fasta
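
If bioawk is not available, the same idea can be sketched in plain Python. This is an approximation of the one-liner above, not a replacement for it: it takes the record name as the header text up to the first whitespace, keeps records whose name is in the ID list, and prints the name with a length= annotation. The function names and example records are hypothetical.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs; name is the header up to the first space."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None and line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def extract(fasta_lines, wanted_ids):
    """Yield FASTA records whose name is in wanted_ids, annotated with length."""
    wanted = set(wanted_ids)  # set membership keeps each lookup fast
    for name, seq in read_fasta(fasta_lines):
        if name in wanted:
            yield ">%s length=%d\n%s" % (name, len(seq), seq)


# Made-up records; comp1 is wrapped over two lines.
fasta = [">comp1 len=8", "ACTC", "CCTT", ">comp2", "ACACTGGGT", ">comp3", "GGG"]
selected = list(extract(fasta, ["comp1", "comp3"]))
```

For real files, pass open("Trinity.fasta") as fasta_lines and the stripped lines of id_file.txt as wanted_ids.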

regards

Pavel

Brian Haas

Apr 20, 2016, 7:11:50 PM

to Pavel Eduardo Galindo Torres, trinityrnaseq-users

Lots of ways to do this, of course.

Note, there's a script in the latest Trinity release:

trinityrnaseq/util/misc/acc_list_to_fasta_entries.pl

usage: acc_list_to_fasta_entries.pl acc.list.txt file.fasta

that'll extract the list of transcripts from a fasta file based on a list of accessions.

best,

~b

hex...@gmail.com

Sep 28, 2016, 2:56:15 PM

to trinityrnaseq-users, odrau...@gmail.com

Hi Brian,

Is it possible to use acc_list_to_fasta_entries.pl to extract isoform (or gene) sequences from trinity.fasta for differentially expressed genes? I tried acc_list_to_fasta_entries.pl, and it did not work.

I have a list of DE genes (e.g. TR#|c#_g#), and the names of the sequences in trinity.fasta are TR#|c#_g#_i#. Dan Browne provided a way to do that, but I am having trouble installing pyfaidx on the server I use.

Thanks

(In reply to Brian Haas, Wednesday, April 20, 2016, 7:11:50 PM UTC-4.)

Mark Chapman

Sep 28, 2016, 4:59:31 PM

to hex...@gmail.com, trinityrn...@googlegroups.com, odrau...@gmail.com

Are you trying to extract gene sequences from the trinity.fasta? Genes don't exist as sequences in that file; they are a combination of expression values from the transcripts. All that's in the fasta file is transcript sequences.
Best wishes, Mark
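
To illustrate the point, here is a small Python sketch of how a "gene" can be read as a label shared by a group of transcript IDs. It assumes the gene ID is the transcript ID with its trailing _i# removed; the function name and example IDs are made up.

```python
import re
from collections import defaultdict


def group_by_gene(transcript_ids):
    """Map each gene ID to the list of its transcript (isoform) IDs."""
    genes = defaultdict(list)
    for tid in transcript_ids:
        # Assumption: dropping the "_i<number>" suffix gives the gene ID.
        genes[re.sub(r"_i\d+$", "", tid)].append(tid)
    return dict(genes)


# Made-up transcript IDs from two genes.
groups = group_by_gene(["TR1|c0_g1_i1", "TR1|c0_g1_i2", "TR2|c0_g1_i1"])
```

Selecting a gene then means selecting every transcript in its group.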


hex...@gmail.com

Sep 28, 2016, 7:38:23 PM

to trinityrnaseq-users, hex...@gmail.com, odrau...@gmail.com

Is it possible to extract all transcripts of selected genes?


Brian Haas

Sep 28, 2016, 9:32:13 PM

to hex...@gmail.com, trinityrnaseq-users, Pavel Eduardo Galindo Torres

Attached is a version of the script that should work for either a list of Trinity genes or a list of transcripts.

It'll go in the upcoming release.

best,

~brian

--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

Attachment: acc_list_to_fasta_entries.pl

Farbod Emami

Sep 29, 2016, 4:35:17 AM

to trinityrnaseq-users

Hi Cheryl,

You can have a look at this, too.

https://www.biostars.org/p/209872/#211950
