Building index of human genome from .fna file

morien

unread,

Nov 26, 2015, 6:03:07 PM11/26/15

to Sailfish Users Group

I think I may have a (not enough) memory issue, but I wanted to check with the google group before deciding that that is the real problem. I'm trying to build an index using GRCh38. I'd like to think this should be an easy task, but I've actually run into problems on each of three machines I've tried it on. First I had a problem with a homebrew install on a mac desktop. The error I had didn't tell me much, it would just quit partway through the index command. I left that alone since there are numerous ways that could go wrong and I didn't have the hours to figure out what the issue was.

I have access to a server running CentOS 7. I tried to build sailfish there from source, but there was a problem with sailfish not being able to find a specific gcc library (it's installed and up to date, and the dev library too. in fact, every package that provides the library in question is installed...). Another thing that's often time consuming to fix, so I left it for later.

Last resort: I tried using the current Debian binary on an up-to-date Ubuntu desktop. These systems have a lot in common and it's often the case that programs built for one will work in the other. Worked fine until the index command was nearly done running. I came back to my screen today to find this:

dune:~/circ_RNA/data/db$ sailfish index -t GCA_000001405.15_GRCh38_full_analysis_set.fna -o sailfish_index/

...

[info] Building 64-bit suffix array (length of generalized text is 3209458384 )

Building suffix array . . . success

saving to disk . . . done

Elapsed time: 3390.84s

done

Elapsed time: 9389.65s

processed 305000000 positionssailfish: /home/vagrant/sailfish/external/install/include/sparsehash/internal/densehashtable.h:782: void google::dense_hashtable<Value, Key, HashFcn, ExtractKey, SetKey, EqualKey, Alloc>::clear_to_size(google::dense_hashtable<Value, Key, HashFcn, ExtractKey, SetKey, EqualKey, Alloc>::size_type) [with Value = std::pair<const long unsigned int, rapmap::utils::SAInterval<long int> >; Key = long unsigned int; HashFcn = rapmap::utils::KmerKeyHasher; ExtractKey = google::dense_hash_map<long unsigned int, rapmap::utils::SAInterval<long int>, rapmap::utils::KmerKeyHasher, std::equal_to<long unsigned int>, google::libc_allocator_with_realloc<std::pair<const long unsigned int, rapmap::utils::SAInterval<long int> > > >::SelectKey; SetKey = google::dense_hash_map<long unsigned int, rapmap::utils::SAInterval<long int>, rapmap::utils::KmerKeyHasher, std::equal_to<long unsigned int>, google::libc_allocator_with_realloc<std::pair<const long unsigned int, rapmap::utils::SAInterval<long int> > > >::SetKey; EqualKey = std::equal_to<long unsigned int>; Alloc = google::libc_allocator_with_realloc<std::pair<const long unsigned int, rapmap::utils::SAInterval<long int> > >; google::dense_hashtable<Value, Key, HashFcn, ExtractKey, SetKey, EqualKey, Alloc>::size_type = long unsigned int]: Assertion `table' failed.

Aborted (core dumped)

My clue that it's a problem with lack of memory is that there is some kind of allocation going on right before the program aborts. I tried searching the archives here for a similar issue but I didn't find anything, so I'm guessing this is specific to my machine. Hopefully the devs or a more experienced user can help?

Rob

unread,

Nov 26, 2015, 7:10:24 PM11/26/15

to Sailfish Users Group

Hi Morien,

I agree that this seems to be a memory-related issue. One reason that you might not have found many related posts is because your use-case seems somewhat particular. Specifically, you mention that you are building a Sailfish index on the human genome, but the Sailfish index is typically built on the transcriptome. The transcriptome obviously tends to be much smaller than the genome. Of course, the index can be built on any set of "target" sequences in a fasta file, but one must consider if the file being provided makes sense in terms of quantification. Sailfish treats each entry in the fasta file on which it builds its index as a separate target for quantification (i.e. it doesn't take a genome + an annotation, but the set of potential transcripts directly). From this perspective, providing the genome probably is not what you want to do, as Sailfish will then provide a quantification estimate for the relative abundance of each chromosome and alt contig (which is probably not meaningful for any analysis). Is there a reason you're attempting to index the genome itself? If not, you should prepare a transcriptome with the targets you wish to quantify and then index that.

Best,

Rob

morien

unread,

Nov 26, 2015, 7:22:39 PM11/26/15

to Sailfish Users Group

Ah, okay. Thank you Rob. I hadn't tried this before, but now that you've pointed that issue out it seems obvious. I was using the human genome because it is possible some of the RNA seq data I am working with may not correspond to previously recorded transcripts.

For some reason I had it in my head that providing a .gff file with annotations during the quant step would get me what I wanted, but again, now that you've pointed out the issue, it seems obvious that this won't work.

Rob

unread,

Nov 26, 2015, 7:42:06 PM11/26/15

to Sailfish Users Group

Hi Morien,

No problem! Please let me know if the issue goes away when you build the index with the transcriptome. I should note that the purpose of providing the gff to sailfish during the quant phase is to provide it with a transcript <-> gene mapping, so that it may provide gene-level as well as transcript-level abundance estimates.