AutoComplete Data Structure

81 views
Skip to first unread message

ankul

unread,
Jul 29, 2009, 5:03:22 PM7/29/09
to Allahabad FOSS Shuudan
The following articles look informative for the selection of
autocomplete data structure:
http://rmandvikar.blogspot.com/2008/10/trie-examples.html
http://sujitpal.blogspot.com/2007/02/three-autocomplete-implementations.html

The first article is more useful for us because by now I have
implemented AutoComplete using trie and traversed it simply by a DFS
search, but the article states some disadvantages of doing so and
suggests a better way for the completion of strings.

Another interesting article is on ternary tree which can make trie
even space effiecient. Guess it will be a good idea to implement
ternary tree for the autocomplete.
http://en.wikipedia.org/wiki/Ternary_search_trees

What say Shalin Sir and Sanjeev Sir??? Will be soon coming up with
benchmarking of lucene search library :)

Shalin Shekhar Mangar

unread,
Jul 30, 2009, 5:23:11 AM7/30/09
to allahabad-f...@googlegroups.com
Some people (including me) believe that a well-implemented data structure for auto-complete will beat Lucene's prefix search. The benchmarking has two objectives:
  1. Figure out the performance characteristics of Lucene for auto-complete
  2. Compare the performance of Lucene with our custom implementation
Interest in auto-complete has been rekindled in Solr land. You must finish this quickly and start participating in the discussions.

See this thread on solr-user -- http://markmail.org/thread/chdvcwhrqnvh7vaw

Solr issues:

https://issues.apache.org/jira/browse/SOLR-1316
https://issues.apache.org/jira/browse/SOLR-706

Ishan has recently worked on TSTs at AOL so he may be able to share some thoughts as well.

--
Regards,
Shalin Shekhar Mangar.

Sanjeev B.S.

unread,
Jul 30, 2009, 7:12:15 AM7/30/09
to allahabad-f...@googlegroups.com
From the benchmarks trie seems to be the best, compared to 'in memory
relational databases' and Java sets. But I would still like to check
the implementation with a very good hash function, if not for
anything, just out of curiosity. If there is someone else who can work
parallely on this along with Ankul in his group, it will be a very
good idea. I have been waiting for the results from the default hash
maps of Java. Any progress in that direction Ankul?

Good luck for your efforts.

-S

ankul

unread,
Jul 30, 2009, 8:11:06 AM7/30/09
to Allahabad FOSS Shuudan
Not yet Sir. Once I would be over with benchmarking of Lucene auto-
complete, will surely implement auto-complete using hash maps of Java
too. But right now, ternary search tree seems to be the best option
for auto-complete, both space and time efficient. Hope to see lots of
results soon!!!

Ankul Garg

unread,
Aug 1, 2009, 3:41:10 PM8/1/09
to allahabad-f...@googlegroups.com
Lucene and our AutoComplete were benchmarked using real life strings contained in "words.txt" file (around 11000 lines) and prefix queries contained in "prefixQueries.txt" (26 queries of size 1) .
The search results for AutoComplete are stored in "AutoCompleteResults.txt" and for Lucene in "LuceneResults.txt"
Both Benchmark.java (for benchmarking lucene) and AutoComplete.java (our custom implementation) were executed on max heap size : 1024m
AutoComplete performs better in both indexing and auto-completion.
Lucene 2.4.1 has been used for benchmarking.


words.txt
prefixQueries.txt
AutoCompleteResults.txt
LuceneResults.txt
Benchmark.java
AutoComplete.java

Sanjeev B.S.

unread,
Aug 2, 2009, 3:25:21 AM8/2/09
to allahabad-f...@googlegroups.com
Hi Ankul et al,
Congrats for benchmarking. Can you drop in after lunch?
-S

2009/8/2 Ankul Garg <ankul...@gmail.com>:
--
"While I usually came back from meeting Gandhiji elated and inspired
but always a bit sceptical, and from talks with Jawaharlal, fired with
emotional zeal but often confused and unconvinced, meetings with
Vallabhbhai were a joy from which I returned with renewed confidence
in the future of our country."
- JRD Tata

Ankul Garg

unread,
Aug 6, 2009, 10:08:25 AM8/6/09
to allahabad-f...@googlegroups.com
A modified trie was implemented by us using HashMap at each node of Trie, thus storing just the required pointers at each node. The results for both Lucene and Trie (on Linux platform) have been attached along with the java codes and words and prefixQueries files. The results are much better for Trie using HashMap as compared to Lucene. On increasing the Heap Size to 1024m, the indexing time for Trie decreases drastically to just 80-90ms and total search time to around 20-25ms for 26 queries.
Hope to get your response soon!!!

LuceneResults.txt
words.txt
TrieResults.txt
codes.rar
prefixQueries.txt

Sanjeev B.S.

unread,
Aug 6, 2009, 10:15:37 AM8/6/09
to allahabad-f...@googlegroups.com
Calls for party! :)
Hope to see you guys soon. Keep up good work. Going to RGIIT tomm so
would meet you the day after. Congratulations again. :)
Ready for the next step?

2009/8/6 Ankul Garg <ankul...@gmail.com>:
--
If a system doesn't have to be reliable, it can do anything else.
- HH Williams

Ishan Chattopadhyaya

unread,
Aug 6, 2009, 10:47:34 AM8/6/09
to allahabad-f...@googlegroups.com
putting a hashmap (of size=|alphabet|) at every node is overkill, imho.

Ishan Chattopadhyaya

unread,
Aug 6, 2009, 10:48:57 AM8/6/09
to allahabad-f...@googlegroups.com
correction, i meant:
hashmap of size=O(k), where k=|alphabet|

Ankul Garg

unread,
Aug 6, 2009, 11:30:55 AM8/6/09
to allahabad-f...@googlegroups.com
Hi Ishan sir,
confused with what u wrote ;) In our new implementation using hashMap, we are inserting just the required no of pointers at each node. Thus, if we insert "abc" and "abd", then at root node, the hashmap will contain only a, then at node pointed by a, the hashmap will contain only b, and so on. Yup its roughly a TST only, rather faster than it, because here hashmap has been used instead of a BST. Hashmap takes constant time for insertion and searching while BST takes log(n).
Guess I make sense ;)

Sanjeev B.S.

unread,
Aug 6, 2009, 1:20:53 PM8/6/09
to allahabad-f...@googlegroups.com
Ankul, can you compare the results in a tabular form? That will be
easier to understand...

Ankul Garg

unread,
Aug 6, 2009, 1:22:52 PM8/6/09
to allahabad-f...@googlegroups.com
Sure Sir, was thinking of doing the same. And will get the benchmarking done on more testcases and on variable heap sizes and variable platforms. Will come up with the results soon :)

Sanjeev B.S.

unread,
Aug 6, 2009, 1:23:23 PM8/6/09
to allahabad-f...@googlegroups.com
Thanks. So by what factor or % did you see the improvement?

Ankul Garg

unread,
Aug 6, 2009, 1:27:03 PM8/6/09
to allahabad-f...@googlegroups.com
As compared to simple tries, trie using hashmaps at each node is around 60-70% more space efficient and decreases the query time by about 30-40%. The implementation by all means is much much better than lucene's prefix search.

Sanjeev B.S.

unread,
Aug 6, 2009, 1:28:57 PM8/6/09
to allahabad-f...@googlegroups.com
And using simple hashes should see a better performance? ;)

Ankul Garg

unread,
Aug 6, 2009, 1:29:32 PM8/6/09
to allahabad-f...@googlegroups.com
Also, when it is executed on increased heap size, results are close to ideal results. Inserting 11000 strings in just 60-70ms and searching takes place in avg 3-4ms per query.

On Thu, Aug 6, 2009 at 10:57 PM, Ankul Garg <ankul...@gmail.com> wrote:

Sanjeev B.S.

unread,
Aug 6, 2009, 1:30:06 PM8/6/09
to allahabad-f...@googlegroups.com
Wow! That's a lot of improvement!

Ankul Garg

unread,
Aug 6, 2009, 1:38:59 PM8/6/09
to allahabad-f...@googlegroups.com
Yeah!! Will be benchmarking lucene's Ternary Tree implementation too, and the other two ppl in my grp are working upon how to integrate it with Solr :) Hope we will be able to do that!! And simple hashes left to be implemented too ;) will do that sir. Hope to meet you on 8th :)

Ankul Garg

unread,
Aug 10, 2009, 10:05:15 AM8/10/09
to allahabad-f...@googlegroups.com
Hi all,
So here I am with the tabulated benchmarking results of 4 different implementations of autocomplete including lucene's prefix search. A modified form of trie using hash map has also been implemented which stores suffixes at each node to prevent the increase in query time due to DFS traversal of trie(as done earlier). In this implementation, suppose "abc" is to be inserted in trie, then the node pointed by 'a' will store "bc" in its suffix list, similarly, the node pointed by 'b' will store "c".
This reduces the search time to almost negligible, though the large indexing time and large space consumption is a pretty considerable overhead.
Lucene's Ternary Tree implementation is still to be benchmarked. Am sorry for not including it in these results. Will come with it as soon as possible.
Result3
Result4

Sanjeev B.S.

unread,
Aug 10, 2009, 10:08:18 AM8/10/09
to allahabad-f...@googlegroups.com
Can't open. How about .txt extension? :)

Ishan Chattopadhyaya

unread,
Aug 10, 2009, 10:09:26 AM8/10/09
to allahabad-f...@googlegroups.com
Thanks Ankul,
CSV will suit me best.
I too got broken text in gedit.

Ankul Garg

unread,
Aug 10, 2009, 10:10:46 AM8/10/09
to allahabad-f...@googlegroups.com
Result3.txt
Result4.txt

Ishan Chattopadhyaya

unread,
Aug 10, 2009, 10:13:14 AM8/10/09
to allahabad-f...@googlegroups.com
kroy, please change the list name to
solr-autocomplete-datastructure-discussion-shuudan, instead of FOSS
shuudan. The sets of discussions are not FOSS related by any means.
Saurabh, what happened to the IRC bot, mate? Lets kick it up and restore
some non "cool project to show some superiority over other
non-intelligent humans" kind discussions somewhere, which could be more
informative and relaxing to go thru and participate in.
-- Thoughts of a random idiot, whose mom named him Ishan

Ankul Garg

unread,
Aug 10, 2009, 10:26:57 AM8/10/09
to allahabad-f...@googlegroups.com
Am attaching the file containing strings also.
words.txt

ankul

unread,
Aug 10, 2009, 10:30:44 AM8/10/09
to Allahabad FOSS Shuudan
Make a slight change, the no of words in file is 39419. Hope to see
your responses soon!!!

On Aug 10, 7:13 pm, Ishan Chattopadhyaya <ichattopadhy...@gmail.com>
wrote:
> kroy, please change the list name to
> solr-autocomplete-datastructure-discussion-shuudan, instead of FOSS
> shuudan. The sets of discussions are not FOSS related by any means.
> Saurabh, what happened to the IRC bot, mate? Lets kick it up and restore
> some non "cool project to show some superiority over other
> non-intelligent humans" kind discussions somewhere, which could be more
> informative and relaxing to go thru and participate in.
> -- Thoughts of a random idiot, whose mom named him Ishan
>
> On Mon, 2009-08-10 at 19:39 +0530, Ishan Chattopadhyaya wrote:
> > Thanks Ankul,
> > CSV will suit me best.
> > I too got broken text in gedit.
>
> > On Mon, 2009-08-10 at 19:38 +0530, Sanjeev B.S. wrote:
> > > Can't open. How about .txt extension? :)
>
> > > On Mon, Aug 10, 2009 at 7:35 PM, Ankul Garg<ankul.n...@gmail.com> wrote:
> > > > Hi all,
> > > > So here I am with the tabulated benchmarking results of 4 different
> > > > implementations of autocomplete including lucene's prefix search. A modified
> > > > form of trie using hash map has also been implemented which stores suffixes
> > > > at each node to prevent the increase in query time due to DFS traversal of
> > > > trie(as done earlier). In this implementation, suppose "abc" is to be
> > > > inserted in trie, then the node pointed by 'a' will store "bc" in its suffix
> > > > list, similarly, the node pointed by 'b' will store "c".
> > > > This reduces the search time to almost negligible, though the large indexing
> > > > time and large space consumption is a pretty considerable overhead.
> > > > Lucene's Ternary Tree implementation is still to be benchmarked. Am sorry
> > > > for not including it in these results. Will come with it as soon as
> > > > possible.
>

k roy

unread,
Aug 10, 2009, 2:35:33 PM8/10/09
to allahabad-f...@googlegroups.com
Hey Ishan,

kroy, please change the list name to
solr-autocomplete-datastructure-discussion-shuudan, instead of FOSS
shuudan. The sets of discussions are not FOSS related by any means.

So why not kick start discussion _related_ to FOSS? I know things have become pretty monotonous, solr-centric and irritating but then I partly blame us for not actually bringing up anything FOSS related. So lets start shall we? It'll be great if you can discuss your GSoC project, for starters.

  -- Thoughts of a random idiot, whose mom named him Ishan
 
Lately I am seeing all your mails signed as "idiot". Any special affinity with the word, eh?


--
~ Once you have flown, you will walk the earth with your eyes turned skyward, for there you have been, there you long to return --Da Vinci

ankul

unread,
Aug 10, 2009, 3:07:16 PM8/10/09
to Allahabad FOSS Shuudan
Discussions shall be done under the right heading, atleast not spoil
the thread!!!

Abhishek Mishra

unread,
Aug 11, 2009, 5:41:21 AM8/11/09
to allahabad-f...@googlegroups.com
The US patent office has granted a Microsoft patent application,
7,571,169: "Word-processing document stored in a single XML file that
may be manipulated by applications that understand XML"

"On the face of it, the patent would appear to cover all usage of XML
and XSDs in word processing document, which would effectively leave
all other modern word processors - and other software that used their
documents - liable to licensing by the company".

For more info:
http://news.zdnet.com/2100-9595_22-329645.html?tag=nl.e539

...
So what about a common format and interoperability?
Seems like M$ is really against interoperability, would OpenOffice have
to pay licensing fee to M$ to be able to access docx files ?


k roy

unread,
Aug 12, 2009, 9:51:18 AM8/12/09
to allahabad-f...@googlegroups.com
Microsoft cannot sell one of its flagship products, Word, in the United States because of patent infringement. More can be read here http://blog.seattlepi.com/microsoft/archives/176223.asp
Reply all
Reply to author
Forward
0 new messages