Adding annotation to Corset results

phala...@gmail.com

unread,

Mar 3, 2015, 9:03:24 PM3/3/15

to corset-...@googlegroups.com

Hi Nadia,

I'm using corset to analyse differentially expressed genes between two experimental conditions. After running Trinity, Bowtie, Corset and edgeR, I ended up with a list of 30,000 "analysable" clusters (after performing normalisations in edgeR), from which around 5,000 are differentially expressed. I would now like to annotate these clusters using Blast2Go. However, after all these steps, I have information spread over many files. Briefly:

File 1: Contig sequences from Trinity (fasta file) - around 250,000 contigs
File 2: Clusters from Corset (text file) - around 150,000 clusters
File 3: List of 30,000 "analysable" clusters from edgeR (text file)

My question is, how can I combine all this information?
I want to produce two output files, one would be the annotated clusters and the second would be the annotated subset of the clusters from edgeR.

How can I do that?

Thank you very much,
R

Nadia Davidson

unread,

Mar 4, 2015, 2:58:24 AM3/4/15

to corset-...@googlegroups.com

Hi,

You will probably find the command packaged with corset called corset_fasta_ID_changer useful for annotation. It's a convenient way of combining the information from files 1 and 2 above. Basically it just adds the cluster ID to the contig ID in the fasta file.

If you're comfortable using command line tools and bash, you could then filter the fasta file for the clusters listed in File3 using grep. The recipe I would use is as follows (although you might have different tools or way to achieve the same result).

1. Run corset_fasta_ID_changer on the corset cluster.txt file and the fasta file the reads were mapped to.

1. Format the output fasta file so you have one line per entry using fasta_formatter

2. Make a file listing all the clusters of interest. One line per cluster and nothing else. e.g.

Cluster-0.0

Cluster-2.0

etc.

3. Add some stuff around the cluster ID to make greping easier:

cat my_clusters.txt | sed 's/^/^>/g' | sed 's/$/--/g' > my_clusters2.txt

4. Run grep (this part might be slow):

grep -A1 -f my_clusters2.txt formatted_fasta.fasta | grep -v "^\-\-" > final.fasta

5. Run Blast2GO on final.fasta

However, you will still have the issue that there are multiple contigs per cluster, which complicates the annotation process because it makes it slower and you'd need to do some post-processing to decide on the cluster to annotation mapping. We are working on software to make a single gene sequence per cluster, but this is probably some time away. It may be that you or others know a simple way to get the longest or other representative contig for each cluster.

Cheers,

Nadia.

phala...@gmail.com

unread,

Mar 4, 2015, 7:21:42 PM3/4/15

to corset-...@googlegroups.com

Hi Nadia,

Thank you very much for your reply. It was really helpful.
However, I couldn't get the last bit to run. Basically my two files look like this:

File 1: Formatted fasta file containing Clusters ID, contigs ID and sequences in a single line: (formatted using fasta_formatter [-t])
Cluster-38823.0--c0_g1_i1 len=1474 path=[47:0-404 451:405-405 46:406-1473] GAGAAATTGGCCGCGGTGTGCCGATTACTGAAAATATATTGACCGAGTTTTAGGCATACTGCAGGCATAAATTATACGACTAGAACAACTGAATGGCACCATGTTAACTCAGTTTTGAATTAAAGAAGAAATCGACAGATTATTAATGGTAC..{continue}
Cluster-44451.0--c1_g1_i1 len=667 path=[53:0-666] CAAAATTAATGCAATTCTGTTTCGATATGTGACTACTACTAGTGATAAAAATGTATTTTTATTTGCAAACTCTTTTTGAACTTTTTTCCCCGTGAATGCTCTTATTCCGGAACATTTATATATATGTATAAACCGAATTCTAGAAGTGTTAA..{continue}

File 2: Edited list of clusters of interest (text file):
^>Cluster-79384.0
--
^>Cluster-4271.1
--
^>Cluster-78496.0
--
^>Cluster-3673.0
--

So, basically the command

grep -A1 -f my_clusters2.txt formatted_fasta.fasta | grep -v "^\-\-" > final.fasta

is not producing any output. Did I did something wrong editing the fasta and text files?

Once again, thank you very much for your help.

phala...@gmail.com

unread,

Mar 4, 2015, 7:45:05 PM3/4/15

to corset-...@googlegroups.com

Hi again,

For some clusters, I have more than one designed sequence in File 1. For instance, Cluster-3673.0 contains contigs c3545_g1_i2, c3545_g1_i3, c43521_g1_i1 and c43521_g1_i3. Would that be an issue for this last command to run?

Cheers,
Regina

Nadia Davidson

unread,

Mar 4, 2015, 7:52:00 PM3/4/15

to corset-...@googlegroups.com

Hi Regina,

Here are a couple of suggestions which might help:

- Run fasta_formatter without the -t option. You still want a line break between the contig ID and sequence, just not line breaks within the sequence. You also still need the ">" character at the start of the ID.

- In your file 2, the "--" characters should be just after the cluster ID. e.g. like "^>Cluster-79384.0--", instead of on a line of their own. I'm not sure why the " sed 's/$/--/g' " part of the command I posted doesn't work for you. It might be some platform specific thing. You could try to get the same result with awk instead (at step 3):

cat my_clusters.txt | sed 's/^/^>/g' | awk '{print $0"--"}' > my_clusters2.txt

Let us know how you go.

Cheers,

Nadia.

Nadia Davidson

unread,

Mar 4, 2015, 7:54:40 PM3/4/15

to corset-...@googlegroups.com

It should pick out any contig with a cluster ID that matches your list. So you should get all of them reported.

Cheers,

Nadia.

phala...@gmail.com

unread,

Mar 4, 2015, 10:41:50 PM3/4/15

to corset-...@googlegroups.com

Hi Nadia,

Thank you very much for your quick response. I've re edited the files 1 and 2 as you suggested, and they now looked like this:

File 1: Formatted fasta file containing Clusters ID, contigs ID and sequences (formatted using fasta_formatter):
>Cluster-11688.0--c2_g1_i1 len=442 path=[53:0-441]
TGACTACACCTGTGTCTGTATTCTGAAAATAAAGACAATTATAACTAGCAATAGAATGAACTGAAAATAAGAGGAGGTATAAGAAATTCAATT...[continue]
>Cluster-94817.0--c3_g1_i1 len=430 path=[66:0-78 24:79-429]
CTCAATAGTGAGTAGACGGTTGCACGTTAACCTTAGGAAGCAAAGCGTAACATGTATAAAGCCATAAGTATATAAATCTTCTACCTTT...[continue]
>Cluster-24211.0--c4_g1_i1 len=419 path=[45:0-296 342:297-340 44:341-418]
TGTGACAAAGACTTCTCCGACATAAATTCTACCTTGAAGTCAAAAACAAACTGTCACAACATTCATCTACACTTAATCTTAAAGTCAAAAACACA...[continue]

File 2: Edited list of clusters of interest (text file):
^>Cluster-79384.0--
^>Cluster-4271.1--
^>Cluster-78496.0--
^>Cluster-3673.0--

^>Cluster-57202.2--

However, the fourth step is still not working. It runs but generates an empty file as output. Do you have any idea of what might be happening here? Am I doing something wrong?

Cheers,
R

phala...@gmail.com

unread,

Mar 4, 2015, 11:00:31 PM3/4/15

to corset-...@googlegroups.com

Hi again,

Command is running. It's taking ages though. Only 12kb produced in 5 min. I will check the results once the run is completed and let you guys now.

Thanks!

phala...@gmail.com

unread,

Mar 6, 2015, 8:39:39 PM3/6/15

to corset-...@googlegroups.com

Hi Nadia,

The command for the fourth step works, but only with small datasets. I've done a test with 1/10 of my clusters of interest and it worked fine. Running my full list (around 30,000 clusters of interest), the run ends up crashing (process is killed) after 20-24h.

I was trying to use GNU parallel to be able to run this command using multiple cores, but it didn't seem to work. Do you have any suggestion on how to solve it?

Thanks you very much.

Cheers,
R

Nadia Davidson

unread,

Mar 16, 2015, 7:04:06 PM3/16/15

to corset-...@googlegroups.com

Hi Regina,

Yeah grep can be a bit slow with large lists. There's probably a more elegant solution out there, but I'm not really a bash expert.

My only thought is to split the cluster list file up into 10 or so smaller files and run them independently.

Cheers,

Nadia.

Adam Taranto

unread,

Apr 29, 2015, 3:40:05 AM4/29/15

to corset-...@googlegroups.com

Hi Nadia / Regina,

I am also trying to map transcript annotations back up to the Corset transcript clusters.

I've written a python script that should deal with your problem. The program will take your trinity transcripts fasta, the clusters.txt file from Corset, and a list of clusters of interest. Transcripts belonging to your clusters of interest will have the clusterID appended to their name and be written, along with their sequence data, to a new fasta file for further analysis.

Any target clusters that do not occur in the corset map, as well as any transcripts that do not occur in your reference fasta, will be written to an error log.

You can run the test data provided like this:

./fetchClusterSeqs.py -c testData/clusters.txt -t testData/significantClusters.txt -i testData/transcripts.fa -o foundTranscripts.fa

This should be much faster than grepping out clusters. I'd be interested to hear how it goes on your large dataset.

This and future tools will be available from: https://github.com/Adamtaranto/Corset-tools

Cheers,

Adam

Nadia Davidson

unread,

Apr 29, 2015, 8:36:16 PM4/29/15

to corset-...@googlegroups.com

Hi Adam,

I haven't had a chance to try out your python script out yet, but it sounds really great. Do you mind if I link to your code on the corset webpage? I think a lot of people might find it useful.

Cheers,

Nadia.

phala...@gmail.com

unread,

Nov 22, 2015, 5:18:06 PM11/22/15

to corset-project

Hi Nadia and Adam,

Thank you very much for your help.
Have any of you written a script for selecting the longest contig within a cluster? As I mentioned before, most of my clusters comprise multiple contigs and I would like to select the cluster with the longest contig for annotation purposes. I've tried writing one in R, but it's taking ages to run and I'm getting some warning messages too. Do you have something in python or shell?

Regards,
Regina

lent...@gmail.com

unread,

Mar 10, 2016, 6:45:46 AM3/10/16

to corset-project

Hi!!

I'm trying to run the fetchClusterSeqs.py script, but I get the following:

"Traceback (most recent call last):
File "./fetchClusterSeqs.py", line 127, in <module>
main(inFasta, targetClust, outFasta, clustMap)
File "./fetchClusterSeqs.py", line 26, in main
clustMem = clustDict(clustMap)
File "./fetchClusterSeqs.py", line 86, in clustDict
clustID=row[1]
IndexError: list index out of range"

My command line was: ./fetchClusterSeqs.py -c ./clusters.txt -t ./select_clusters.csv -i ~/proton/trinity_kmer3/Trinity.fasta -o ./Trinity_kmer3_fetchcorset.fa

Any orientation will be very appreciated!!

Adam Taranto

unread,

Mar 10, 2016, 8:04:23 PM3/10/16

to corset-project

Could you please post the results of 'head -n 10' on each of you input files (clusters.txt , select_clusters.csv, Trinity.fasta).

Adam Taranto

unread,

Mar 10, 2016, 10:29:11 PM3/10/16

to corset-project

Hi Regina,

I've updated fetchClusterSeqs to optionally return only the longest contig per cluster using the flag '--longest'

./fetchClusterSeqs.py --inFasta transcripts.fa --targetClust significantClusters.txt --outFasta longestTranscriptsByCluster.fa --clustMap clusters.txt --longest

Also, you can now leave out the '--targetClust' flag if you want to report transcripts belonging to ALL clusters in clusters.txt

Let me know if you find any bugs!

Cheers,

Adam

Adam Taranto

unread,

Mar 15, 2016, 12:20:53 AM3/15/16

to corset-project, lent...@gmail.com

Should be fixed. Please 'git pull' from the Corset-tools repo and test.

A

M. Olalla Lorenzo-Carballa

unread,

Apr 27, 2016, 7:39:12 AM4/27/16

to corset-project, lent...@gmail.com

Great script!

Worked perfectly in my case to report the transcripts from all clusters :)

Thanks a lot

Olalla

maevem...@gmail.com

unread,

Apr 19, 2018, 9:52:54 PM4/19/18

to corset-project

Hi Adam,

I am very interested in using your script to annotate the results of my corset output. However, I am unfamiliar with Python and was wondering if you could shed some light on what tools I need to download in order to run your script. Can it be run directly in R or does it need to be run in terminal? Additionally, is the command line on the Corset GitHub example sufficient, or does the entire code that you have listed on your GitHub page need to be run and changed to show my file names? Any advice for getting started would be much appreciated. I apologize for asking such a basic question!

Best,
Maeve

Reply all

Reply to author

Forward