bacterial genome friendliness

Oleg Moskvin

unread,

Feb 8, 2013, 11:21:14 AM2/8/13

to rsem-...@googlegroups.com

Hi Colin and Bo,

I am testing RSEM with a couple of bacterial genomes, and while everyhting was smooth in one case (yes, that genome was downloaded from Ensemble, and Colin has already mentioned the error-free, refined style of Ensemble genomes in another thread), second genome was a custom in-house one, and we have issues right away here: 1) absence of gtf file and need to convert gff to gtf, 2) the gtf file had no "exon" record in the first place, which resulted in "less than 1 transcript detected" error (thank Bo for the suggestion to rename "gene" to "exon"), then 3) "a transcript has exons from different orientations"error...

So, why I am writing here instead of just fixing the files and go on?

The reality, part 1: If we look at bacterial genomes in GenBank, for example, we may find sets with up to 11 file formats per replicon there (.asn, .faa, .ffn, .fna, .frn, .gbk, .gff, .ptt, .rnt, .rpt,
.val). Example: ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Rhodobacter_sphaeroides_2_4_1_uid56/
What is curious here? - There is no gtf file!!! Yes, the end-user can convert General Feature Format to gtf with some reservations. GFF by itself is rejected by RSEM.

The reality, part 2: In several places (including GLBRC) people won't get away with running prepare-reference once and get away with it. With endless bacterial genomes / strains and cheaper in-house sequencing, situations of constant inflow of new locally sequenced reference genomes will soon be normal (in fact, we are expecting this quite soon). Sure, we may expect people involved in sequencing / upstream bioinformatics to provide the strain annotation in the perfectly-suitable format but this may or may not be real in 100% cases.

Do you think that extending the flexibility of RSEM in accepting feature annotation formats would be a reasonable idea?

While alternative splicing is obviously not relevant to prokaryotes, bacterial RNA-Seq analysis may benefit form other great features of RSEM. I really enjoyed reading th RSEM paper and am excited about what are you doing.

Thanks a lot!

Oleg

Colin Dewey

unread,

Feb 8, 2013, 2:47:54 PM2/8/13

to rsem-...@googlegroups.com

Hi Oleg,

We can definitely consider supporting another format, but it will take some time to have this implemented. Would supporting the GenBank flatfile format (the .gbk files) be the most useful to you? For the locally-sequenced genomes, which format would be preferred for these?

Thanks,

Colin

--
You received this message because you are subscribed to the Google Groups "RSEM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rsem-users+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Oleg Moskvin

unread,

Feb 8, 2013, 3:06:36 PM2/8/13

to rsem-...@googlegroups.com

Hi Colin,

I was going to ask about GBK files because they are so self-containing and their support by RSEM will be highly user-friendly; plus, they are widely popular / standard for sure.

And for many visualization packages, loading a single gbk file substitutes for loading of genome and feature files separately and this also means less trembling / worry about possible occasional (I've seen that) sequence / IDs mismatches between the two even when they are distributed as parts of a single package etc.

Thanks!

Oleg

Colin Dewey

unread,

Feb 10, 2013, 2:08:23 PM2/10/13

to rsem-...@googlegroups.com

Hi Oleg,

Thanks for your thoughts on this. I think we'll pursue adding GBK support in the near future.

Are the locally-sequenced genomes put into GBK format? Or is there some other (more basic) format that is used?

Thanks,

Colin

Oleg Moskvin

unread,

Feb 11, 2013, 1:56:37 PM2/11/13

to rsem-...@googlegroups.com

Hi Colin,

Yes, we definitely have .gbk, along with .gff3 and .sbd

Thanks!

Oleg

lucmo...@gmail.com

unread,

May 19, 2015, 12:58:41 AM5/19/15

to rsem-...@googlegroups.com

Dear Colin,

I was wondering if the gff and the gbk file format were integrated in the rsem-1.2.21. I am receiving the following error while running the rsem-prepare-reference scritpt in a bacterial genome.

rsem-extract-reference-transcripts nitro 0 nitroatt1.gff 0 my_bacteria.fna
The reference contains no transcripts!
"rsem-extract-reference-transcripts nitro 0 nitroatt1.gff 0 my_bacteria.fna" failed! Plase check if you provide correct parameters/options for the pipeline!

The original file was downloaded as gbk from IMG website and converted to gff with Geneious.
The file looks like:

my_bacteria_1126    Geneious    source    1    506    .    +    .    Name=my_bacteria;organism="my_bacteria";mol_type="genomic DNA";db_xref=taxon:203693;NCBI Feature Key=source
my_bacteria_1126    Geneious    gene    244    504    .    +    .    locus_tag=my_bacteria_11261
my_bacteria_1126    Geneious    CDS    244    504    .    +    .    Name=hypothetical protein CDS;locus_tag=my_bacteria_11261;product="hypothetical protein";translation=XXXXXXXXX;NCBI Feature Key=CDS

Thanks for any insight.

All the best,
Lucas

Colin Dewey

unread,

May 20, 2015, 12:11:02 PM5/20/15

to rsem-...@googlegroups.com

Dear Lucas,

Currently, RSEM only accepts GTF formatted annotation files, which are admittedly more targeted towards eukaryotic gene annotations in that they require “transcript” and “exon” lines. You could either add those lines in (for each gene, simply add a transcript and exon line with identical start and end coordinates), making sure you adhere to the GTF standard:

http://mblab.wustl.edu/GTF22.html

or you could simply extract the sequences of the genes in a multi-fasta file and pass that directly to RSEM, in lieu of a genome+annotation.

Best,

Colin

--
RSEM website: http://deweylab.biostat.wisc.edu/rsem/
---

You received this message because you are subscribed to the Google Groups "RSEM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rsem-users+...@googlegroups.com.

To post to this group, send email to rsem-...@googlegroups.com.
Visit this group at http://groups.google.com/group/rsem-users.

Lucas M. e Silva

unread,

May 24, 2015, 10:05:50 PM5/24/15

to rsem-...@googlegroups.com

HI Colin,

Thanks for the reply. Good tip. I extracted the CDSs and RSEM worked super fine with them.

This is my script if anyone have the same problem as I had. It extracts CDSs in multi-fasta format from a genebank file.It should work fine with bacteria genomes.

Cheers,

Lucas

#A python script to extract CDSs in multi-fasta format from a genebank file of a bacterial genome

from Bio import SeqIO

in_file = "rhizo.gbk"
out_file = "rhizo_CDS.fa"
in_handle = open(in_file)
out_handle = open(out_file, "w")

for record in SeqIO.parse(in_handle, "genbank"):
for i in record.features:
    if i.type == "CDS":
      out_handle.write(">"+ i.qualifiers["locus_tag"][0] + "\n")
      out_handle.write(str(i.location.extract(record).seq) +"\n")

in_handle.close()
out_handle.close()

You received this message because you are subscribed to a topic in the Google Groups "RSEM Users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/rsem-users/GoWPrFzMHMI/unsubscribe.
To unsubscribe from this group and all its topics, send an email to rsem-users+...@googlegroups.com.

Message has been deleted

Alexander Predeus

unread,

Jul 31, 2018, 3:31:26 PM7/31/18

to RSEM Users

There's an interesting problem that I discovered with using RSEM and similar methods.

For my bacterial RNA-seq, I have done a regular read-count summation with featureCounts, which in the default mode does not use any multimappers. I've also ran RSEM and kallisto.

Now, in case of normal eukaryotic RNA-seq, you would expect the sum of estimated raw read counts to be 10-30% higher for RSEM and kallisto. This is simply because you "rescue" the multimappers.

However, most bacterial annotations DO NOT have the UTRs included! The "gene" feature is usually the same as "CDS", and if you extract "transcripts" using genome fasta and annotation, your features quite often end up being too short to map/pseudomap the reads.

So, what happens is you actually lose reads with RSEM or kallisto.

I'm trying to figure out what to do here.

I would like to use an EM-based quantification method, but it seems like genomic alignment is a must for RNA-seq. So far I've used featureCounts with multimapping option, but that one simply divides the read counts equally between the equivalent positions - not exactly ideal.

Do you think there's a way one can use genomic alignment with RSEM's EM quantification? If you could define a read as belonging to a feature if it overlaps it even by 1 bp, that would include most of the reads I believe.