Optimization for limited alphabets (such as DNA's A, C, G, T)?

Frédéric Mahé

unread,

Aug 10, 2014, 4:50:44 AM8/10/14

to cityhash...@googlegroups.com

Hi,

I use CityHash64 to map and to lookup for short DNA sequences (1 to 500 letters). I have a very naive question. Do you think there are possible speed optimizations for CityHash64 when the alphabet is limited to only 4 different letters?

Best,

Geoff Pike

unread,

Aug 12, 2014, 2:30:08 PM8/12/14

to cityhash...@googlegroups.com

Yes, it should be possible. For example, you could apply a fast
lossless compression first. A general purpose lossless compression
algorithm might not be fast enough, but since you know that you want
2-bits of value per 8-bit input, you might be able to compress 4 words
to 1 with, say,
w0 = <first word>; w1 = <second word>; etc.
c0 = f(w0, f(w1, f(w2, f(w3, 0))));

where f(a, b) is a + 5*b. Warning: this is untested!

Geoff

Frédéric Mahé

unread,

Aug 14, 2014, 8:56:22 AM8/14/14

to cityhash...@googlegroups.com, geof...@alum.berkeley.edu

On Tuesday, August 12, 2014 8:30:08 PM UTC+2, Geoff Pike wrote:

Yes, it should be possible. For example, you could apply a fast
lossless compression first.

Thank you Geoff,

As you mentioned, our best option is probably to first compress our messages by encoding them using 2-bits per symbol. It will divide message length by 4 and speed up hashing.

Reply all

Reply to author

Forward