Any possibility to build an index on normal alphabet?

56 views
Skip to first unread message

EunCheon Lim

unread,
Jan 24, 2015, 8:38:57 AM1/24/15
to nvbio...@googlegroups.com
Hi,
  I am mainly working on genomics and proteomics. Since I do not know much detail about nvBio, my question could be off the topic.

  I would like to build an index for an approximate string search, which can deal with {"word", occurrence of the word}.
  Therefore, I thought about FM-index.

  The problem is that, in my application, sometimes I need to change the encoding scheme for protein or DNA sequences of interests
  to a general character sets such as [a-zA-Z0-9] rather than [ACGTN] or IUPAC codes.

  Thus pseudo code of using this index is as follows:
  
an_index_1 = build_index("./data/sequence_A-Z_encoded.fa")
an_index_2 = build_index("./data/sequence_IUPAC_encoded.fa")
an_index_3 = build_index("./data/sequence_DNA_encoded.fa") 
occurrence_1 = approximate_search(an_index_1, "ABCDEFG", 2)
occurrence_2 = approximate_search(an_index_2, "ACDKMPR", 2)
occurrence_3 = approximate_search(an_index_3, "ACGTNAC", 2)

  Let me know how to achieve such functionality with nvBio if it is possible.

Regards,
Euncheon

Jacopo Pantaleoni

unread,
Jan 24, 2015, 12:59:54 PM1/24/15
to EunCheon Lim, nvbio...@googlegroups.com
Hello Euncheon,

at the moment we only support FM-index construction and lookups for 2-bit alphabets - though it would certainly
be possible to extend the algorithms to larger alphabets.

In fact, all the suffix sorting algorithms (in the sufsort module), necessary for the BWT construction, are templated
over the symbol size, but the final applications like nvBWT are hardwired to work with DNA.

We could likely generalize them in a relatively short amount of time, though the effort would have to be properly
prioritized. Alternatively, you could try to do it yourself, perhaps with our guidance.

For lookup, one would have to extend the rank_dictionary class, which is also only implemented for the 2-bit case.

best,
-jacopo

--
You received this message because you are subscribed to the Google Groups "nvbio-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nvbio-users...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

EunCheon Lim

unread,
Jan 24, 2015, 7:52:24 PM1/24/15
to nvbio...@googlegroups.com, abyss...@gmail.com
Hi, Jacopo,

  I understood as the nvBWT is an explicit template class specialization on top of generic template class definitions. I am going to check codes on next Monday.
  Before that, could you tell me exact names of codes that I should read if I would like to try implementing such feature by myself?
  Regarding the priority, this data structure serves as a basis for the higher-order concepts in our projects, I would say it is the first thing to do.

Euncheon

Jacopo Pantaleoni

unread,
Jan 27, 2015, 1:15:51 PM1/27/15
to EunCheon Lim, nvbio...@googlegroups.com
Hi Euncheon,

I have just made the following changes to NVBIO which might be generally useful:

 * generalized rank_dictionary to support larger alphabets; note that while it is now possible to use this class with alphabets \Sigma with s=16 and
   256 symbols (i.e. log(s) = 4 or 8) storage is _linear_ in the alphabet size (and so exponential in the number of bits per symbol s).
   So for large strings and large alphabets, this is likely unusable

* introduced a new WaveletTree class, which allows to perform rank queries in O(log(s)) time and with O(log(s)) storage.
  The idea is that this class could be used as an interchangeable replacement for rank_dictionary.

I have not (yet) tested the combination of the fm_index class with any of the new dictionaries, but I think it should work... modulo minor changes.
I will have a look in the next days.

As to the exact names of codes you should read, well - basically, the code for nvBWT, and then the fm_index, rank_dictionary,
and WaveletTree classe.

-jacopo

Jacopo Pantaleoni

unread,
Jan 29, 2015, 3:22:26 AM1/29/15
to nvbio...@googlegroups.com, abyss...@gmail.com
Hi again,

I have now built an actual test showing protein search using wavelet trees and an fm-index.
You can find it here:

http://nvlabs.github.io/nvbio/waveletfm_page.html
https://github.com/NVlabs/nvbio/blob/master/examples/waveletfm/waveletfm.cu

Hope this helps!
-jacopo
To unsubscribe from this group and stop receiving emails from it, send an email to nvbio-users+unsubscribe@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages