AutoComplete Data Structure

ankul

unread,

Jul 29, 2009, 5:03:22 PM7/29/09

to Allahabad FOSS Shuudan

The following articles look informative for the selection of
autocomplete data structure:
http://rmandvikar.blogspot.com/2008/10/trie-examples.html
http://sujitpal.blogspot.com/2007/02/three-autocomplete-implementations.html

The first article is more useful for us because by now I have
implemented AutoComplete using trie and traversed it simply by a DFS
search, but the article states some disadvantages of doing so and
suggests a better way for the completion of strings.

Another interesting article is on ternary tree which can make trie
even space effiecient. Guess it will be a good idea to implement
ternary tree for the autocomplete.
http://en.wikipedia.org/wiki/Ternary_search_trees

What say Shalin Sir and Sanjeev Sir??? Will be soon coming up with
benchmarking of lucene search library :)

Shalin Shekhar Mangar

unread,

Jul 30, 2009, 5:23:11 AM7/30/09

to allahabad-f...@googlegroups.com

Some people (including me) believe that a well-implemented data structure for auto-complete will beat Lucene's prefix search. The benchmarking has two objectives:

Figure out the performance characteristics of Lucene for auto-complete
Compare the performance of Lucene with our custom implementation

Interest in auto-complete has been rekindled in Solr land. You must finish this quickly and start participating in the discussions.

See this thread on solr-user -- http://markmail.org/thread/chdvcwhrqnvh7vaw

Solr issues:

https://issues.apache.org/jira/browse/SOLR-1316
https://issues.apache.org/jira/browse/SOLR-706

Ishan has recently worked on TSTs at AOL so he may be able to share some thoughts as well.

--
Regards,
Shalin Shekhar Mangar.

Sanjeev B.S.

unread,

Jul 30, 2009, 7:12:15 AM7/30/09

to allahabad-f...@googlegroups.com

From the benchmarks trie seems to be the best, compared to 'in memory
relational databases' and Java sets. But I would still like to check
the implementation with a very good hash function, if not for
anything, just out of curiosity. If there is someone else who can work
parallely on this along with Ankul in his group, it will be a very
good idea. I have been waiting for the results from the default hash
maps of Java. Any progress in that direction Ankul?

Good luck for your efforts.

-S

ankul

unread,

Jul 30, 2009, 8:11:06 AM7/30/09

to Allahabad FOSS Shuudan

Not yet Sir. Once I would be over with benchmarking of Lucene auto-
complete, will surely implement auto-complete using hash maps of Java
too. But right now, ternary search tree seems to be the best option
for auto-complete, both space and time efficient. Hope to see lots of
results soon!!!

Ankul Garg

unread,

Aug 1, 2009, 3:41:10 PM8/1/09

to allahabad-f...@googlegroups.com

Lucene and our AutoComplete were benchmarked using real life strings contained in "words.txt" file (around 11000 lines) and prefix queries contained in "prefixQueries.txt" (26 queries of size 1) .

The search results for AutoComplete are stored in "AutoCompleteResults.txt" and for Lucene in "LuceneResults.txt"

Both Benchmark.java (for benchmarking lucene) and AutoComplete.java (our custom implementation) were executed on max heap size : 1024m

AutoComplete performs better in both indexing and auto-completion.

Lucene 2.4.1 has been used for benchmarking.

words.txt

prefixQueries.txt

AutoCompleteResults.txt

LuceneResults.txt

Benchmark.java

AutoComplete.java

Sanjeev B.S.

unread,

Aug 2, 2009, 3:25:21 AM8/2/09

to allahabad-f...@googlegroups.com

Hi Ankul et al,
Congrats for benchmarking. Can you drop in after lunch?
-S

2009/8/2 Ankul Garg <ankul...@gmail.com>:

--
"While I usually came back from meeting Gandhiji elated and inspired
but always a bit sceptical, and from talks with Jawaharlal, fired with
emotional zeal but often confused and unconvinced, meetings with
Vallabhbhai were a joy from which I returned with renewed confidence
in the future of our country."
- JRD Tata

Ankul Garg

unread,

Aug 6, 2009, 10:08:25 AM8/6/09

to allahabad-f...@googlegroups.com

A modified trie was implemented by us using HashMap at each node of Trie, thus storing just the required pointers at each node. The results for both Lucene and Trie (on Linux platform) have been attached along with the java codes and words and prefixQueries files. The results are much better for Trie using HashMap as compared to Lucene. On increasing the Heap Size to 1024m, the indexing time for Trie decreases drastically to just 80-90ms and total search time to around 20-25ms for 26 queries.

Hope to get your response soon!!!

LuceneResults.txt

words.txt

TrieResults.txt

codes.rar

prefixQueries.txt

Sanjeev B.S.

unread,

Aug 6, 2009, 10:15:37 AM8/6/09

to allahabad-f...@googlegroups.com

Calls for party! :)
Hope to see you guys soon. Keep up good work. Going to RGIIT tomm so
would meet you the day after. Congratulations again. :)
Ready for the next step?

2009/8/6 Ankul Garg <ankul...@gmail.com>:

--
If a system doesn't have to be reliable, it can do anything else.
- HH Williams

Ishan Chattopadhyaya

unread,

Aug 6, 2009, 10:47:34 AM8/6/09

to allahabad-f...@googlegroups.com

putting a hashmap (of size=|alphabet|) at every node is overkill, imho.

Ishan Chattopadhyaya

unread,

Aug 6, 2009, 10:48:57 AM8/6/09

to allahabad-f...@googlegroups.com

correction, i meant:
hashmap of size=O(k), where k=|alphabet|

Ankul Garg

unread,

Aug 6, 2009, 11:30:55 AM8/6/09

to allahabad-f...@googlegroups.com

Hi Ishan sir,
confused with what u wrote ;) In our new implementation using hashMap, we are inserting just the required no of pointers at each node. Thus, if we insert "abc" and "abd", then at root node, the hashmap will contain only a, then at node pointed by a, the hashmap will contain only b, and so on. Yup its roughly a TST only, rather faster than it, because here hashmap has been used instead of a BST. Hashmap takes constant time for insertion and searching while BST takes log(n).
Guess I make sense ;)

Sanjeev B.S.

unread,

Aug 6, 2009, 1:20:53 PM8/6/09

to allahabad-f...@googlegroups.com

Ankul, can you compare the results in a tabular form? That will be
easier to understand...

Ankul Garg

unread,

Aug 6, 2009, 1:22:52 PM8/6/09

to allahabad-f...@googlegroups.com

Sure Sir, was thinking of doing the same. And will get the benchmarking done on more testcases and on variable heap sizes and variable platforms. Will come up with the results soon :)

Sanjeev B.S.

unread,

Aug 6, 2009, 1:23:23 PM8/6/09

to allahabad-f...@googlegroups.com

Thanks. So by what factor or % did you see the improvement?

Ankul Garg

unread,

Aug 6, 2009, 1:27:03 PM8/6/09

to allahabad-f...@googlegroups.com

As compared to simple tries, trie using hashmaps at each node is around 60-70% more space efficient and decreases the query time by about 30-40%. The implementation by all means is much much better than lucene's prefix search.

Sanjeev B.S.

unread,

Aug 6, 2009, 1:28:57 PM8/6/09

to allahabad-f...@googlegroups.com

And using simple hashes should see a better performance? ;)

Ankul Garg

unread,

Aug 6, 2009, 1:29:32 PM8/6/09

to allahabad-f...@googlegroups.com

Also, when it is executed on increased heap size, results are close to ideal results. Inserting 11000 strings in just 60-70ms and searching takes place in avg 3-4ms per query.

On Thu, Aug 6, 2009 at 10:57 PM, Ankul Garg <ankul...@gmail.com> wrote:

Sanjeev B.S.

unread,

Aug 6, 2009, 1:30:06 PM8/6/09

to allahabad-f...@googlegroups.com

Wow! That's a lot of improvement!

Ankul Garg

unread,

Aug 6, 2009, 1:38:59 PM8/6/09

to allahabad-f...@googlegroups.com

Yeah!! Will be benchmarking lucene's Ternary Tree implementation too, and the other two ppl in my grp are working upon how to integrate it with Solr :) Hope we will be able to do that!! And simple hashes left to be implemented too ;) will do that sir. Hope to meet you on 8th :)

Ankul Garg

unread,

Aug 10, 2009, 10:05:15 AM8/10/09

to allahabad-f...@googlegroups.com

Hi all,
So here I am with the tabulated benchmarking results of 4 different implementations of autocomplete including lucene's prefix search. A modified form of trie using hash map has also been implemented which stores suffixes at each node to prevent the increase in query time due to DFS traversal of trie(as done earlier). In this implementation, suppose "abc" is to be inserted in trie, then the node pointed by 'a' will store "bc" in its suffix list, similarly, the node pointed by 'b' will store "c".
This reduces the search time to almost negligible, though the large indexing time and large space consumption is a pretty considerable overhead.
Lucene's Ternary Tree implementation is still to be benchmarked. Am sorry for not including it in these results. Will come with it as soon as possible.

Result3

Result4

Sanjeev B.S.

unread,

Aug 10, 2009, 10:08:18 AM8/10/09

to allahabad-f...@googlegroups.com

Can't open. How about .txt extension? :)

Ishan Chattopadhyaya

unread,

Aug 10, 2009, 10:09:26 AM8/10/09

to allahabad-f...@googlegroups.com

Thanks Ankul,
CSV will suit me best.
I too got broken text in gedit.

Ankul Garg

unread,

Aug 10, 2009, 10:10:46 AM8/10/09

to allahabad-f...@googlegroups.com

Result3.txt

Result4.txt

Ishan Chattopadhyaya

unread,

Aug 10, 2009, 10:13:14 AM8/10/09

to allahabad-f...@googlegroups.com

kroy, please change the list name to
solr-autocomplete-datastructure-discussion-shuudan, instead of FOSS
shuudan. The sets of discussions are not FOSS related by any means.
Saurabh, what happened to the IRC bot, mate? Lets kick it up and restore
some non "cool project to show some superiority over other
non-intelligent humans" kind discussions somewhere, which could be more
informative and relaxing to go thru and participate in.
-- Thoughts of a random idiot, whose mom named him Ishan

Ankul Garg

unread,

Aug 10, 2009, 10:26:57 AM8/10/09

to allahabad-f...@googlegroups.com

Am attaching the file containing strings also.

words.txt

ankul

unread,

Aug 10, 2009, 10:30:44 AM8/10/09

to Allahabad FOSS Shuudan

Make a slight change, the no of words in file is 39419. Hope to see
your responses soon!!!

On Aug 10, 7:13 pm, Ishan Chattopadhyaya <ichattopadhy...@gmail.com>
wrote:

> kroy, please change the list name to
> solr-autocomplete-datastructure-discussion-shuudan, instead of FOSS
> shuudan. The sets of discussions are not FOSS related by any means.
> Saurabh, what happened to the IRC bot, mate? Lets kick it up and restore
> some non "cool project to show some superiority over other
> non-intelligent humans" kind discussions somewhere, which could be more
> informative and relaxing to go thru and participate in.
> -- Thoughts of a random idiot, whose mom named him Ishan
>
> On Mon, 2009-08-10 at 19:39 +0530, Ishan Chattopadhyaya wrote:
> > Thanks Ankul,
> > CSV will suit me best.
> > I too got broken text in gedit.
>
> > On Mon, 2009-08-10 at 19:38 +0530, Sanjeev B.S. wrote:
> > > Can't open. How about .txt extension? :)
>

> > > On Mon, Aug 10, 2009 at 7:35 PM, Ankul Garg<ankul.n...@gmail.com> wrote:
> > > > Hi all,
> > > > So here I am with the tabulated benchmarking results of 4 different
> > > > implementations of autocomplete including lucene's prefix search. A modified
> > > > form of trie using hash map has also been implemented which stores suffixes
> > > > at each node to prevent the increase in query time due to DFS traversal of
> > > > trie(as done earlier). In this implementation, suppose "abc" is to be
> > > > inserted in trie, then the node pointed by 'a' will store "bc" in its suffix
> > > > list, similarly, the node pointed by 'b' will store "c".
> > > > This reduces the search time to almost negligible, though the large indexing
> > > > time and large space consumption is a pretty considerable overhead.
> > > > Lucene's Ternary Tree implementation is still to be benchmarked. Am sorry
> > > > for not including it in these results. Will come with it as soon as
> > > > possible.
>

k roy

unread,

Aug 10, 2009, 2:35:33 PM8/10/09

to allahabad-f...@googlegroups.com

Hey Ishan,

kroy, please change the list name to
solr-autocomplete-datastructure-discussion-shuudan, instead of FOSS
shuudan. The sets of discussions are not FOSS related by any means.

So why not kick start discussion _related_ to FOSS? I know things have become pretty monotonous, solr-centric and irritating but then I partly blame us for not actually bringing up anything FOSS related. So lets start shall we? It'll be great if you can discuss your GSoC project, for starters.

-- Thoughts of a random idiot, whose mom named him Ishan

Lately I am seeing all your mails signed as "idiot". Any special affinity with the word, eh?

--
~ Once you have flown, you will walk the earth with your eyes turned skyward, for there you have been, there you long to return --Da Vinci

ankul

unread,

Aug 10, 2009, 3:07:16 PM8/10/09

to Allahabad FOSS Shuudan

Discussions shall be done under the right heading, atleast not spoil
the thread!!!

Abhishek Mishra

unread,

Aug 11, 2009, 5:41:21 AM8/11/09

to allahabad-f...@googlegroups.com

The US patent office has granted a Microsoft patent application,
7,571,169: "Word-processing document stored in a single XML file that
may be manipulated by applications that understand XML"

"On the face of it, the patent would appear to cover all usage of XML
and XSDs in word processing document, which would effectively leave
all other modern word processors - and other software that used their
documents - liable to licensing by the company".

For more info:
http://news.zdnet.com/2100-9595_22-329645.html?tag=nl.e539

...
So what about a common format and interoperability?
Seems like M$ is really against interoperability, would OpenOffice have
to pay licensing fee to M$ to be able to access docx files ?

k roy

unread,

Aug 12, 2009, 9:51:18 AM8/12/09

to allahabad-f...@googlegroups.com

Microsoft cannot sell one of its flagship products, Word, in the United States because of patent infringement. More can be read here http://blog.seattlepi.com/microsoft/archives/176223.asp

Reply all

Reply to author

Forward