Need more information.

16 views
Skip to first unread message

bllackeye

unread,
Jul 19, 2009, 3:21:20 PM7/19/09
to OBSearch-Users
Hi,

I am building up a 3d shape search engine(Open source) for searching
similarity shapes. I have already done the feature extraction part,
now I need to have a good similarity search engine.

I have already tried xxl, but it seems to me that xxl package is not a
full implementation of similarity search. I tried multiple space
filling curve scheme too, it is faster, but it could only give
approximate result.

Eventually, I came to your project, it seems that a lot of work has
been done here. But I still have some doubt about your projects.
Firstly, it seems that somehow you managed to map a multi-dimensional
vector into a long number( I suppose that SMAP is for the mapping),
can you please tell me how you make that possible. Secondly, I am
wondering whether you could publish some performance test data, which
will be very helpful to users like me.

Thanks you very much.

LIU

Arnoldo Muller

unread,
Jul 20, 2009, 1:45:20 AM7/20/09
to obsearc...@googlegroups.com
Dear Liu:


Thank you for your interest in OBSearch.

OBSearch contains several indexes like the P+Tree (exact) and the
GH-sketch (inexact).
Regarding SMAP or pivot embeddings take a look at:

Pivot selection techniques for proximity searching in metric spaces

By Bustos and Navarro. It will give you a good introduction on the
topic. Regarding the long mapping, the technique is very new and it
will be presented at the upcoming SISAP 2009 conference (Czech
Republic).

I will find out if I can publish the paper now with them and I will
get back to you. In the paper you will find the details of the
technique and some performance numbers. Thank you for your suggestion,
I think we should have some perf. numbers in the webpage.

Thanks again and I will get back to you soon!

Arnoldo

Arnoldo Muller

unread,
Jul 20, 2009, 11:22:00 AM7/20/09
to obsearc...@googlegroups.com
Dear Liu:

So here it is the paper that explains about the X -> long mapping.
Please let me know if you have further questions. The current
implementation will benefit if you use a x64 machine.

How big is the dataset that you are planning to use? How many objects
are you planning to insert?

Thanks,

Arnoldo
paper.pdf

chaoyang liu

unread,
Jul 20, 2009, 11:48:12 AM7/20/09
to obsearc...@googlegroups.com
Hi, Arnoldo

Thank you so much for your information.

My application is targeting middle size organization for the time being. So the size of dataset ranges from a few thousands to 100 thousands for now. But I have a plan to make it support more than 1 million of objects. As you know, for  a search solution,  it only becomes valuable when there is a big dataset available.

I tried xxl package with a test dataset with 2,000 objects,  the search performance was not to my satisfaction. It takes 40 milliseconds to retrieve 10 most similar objects.  This is why I spent time looking around and eventually find your project. 

I think matrix search will play a very important role in the near future, especially in the multi-media search applications which just get started in recent years. And there is no commerical products available from those big database players yet.  So I think you are doing a great thing,  I am wondering how far you want to persure with Obsearch, and if you need someone to help you out with the development, please let me know.

Thank you.

LIU

Arnoldo Muller

unread,
Jul 28, 2009, 1:17:21 AM7/28/09
to obsearc...@googlegroups.com
Dear Liu:

Sorry for the delay. I was very busy last week...

The size of your app is perfect for the Tokyo Cabinet backed.

About OBSearch, well this project is my dearest project, I want to
take it very far from here.
You are most welcome to contribute to the project. If you have patches
or if you find bugs please contact me.

I hope OBSearch fullfills your expectations. I highly recommend to use
the Sketch based indexing technique, it is on SVN, you can browse the
*.examples* package to find out how to use the sketch index.

Let me know how it goes :)

You can run the demos with the command:

ant -buildfile approxGeneral.xml example

(after doing an mvn compile)

Cheers,

Arnoldo Muller

bllackeye

unread,
Aug 11, 2009, 9:59:44 PM8/11/09
to OBSearch-Users
Hi, Arnoldo

Thanks for your paper. I am impressed by the results, and I think that
is something I am looking for. So I got the source code from svn. But
I could not
compile it, because there are two java file missing, both of them are
under net.obsearch.index.sorter. Can you kindly pass me those files.

Meanwhile, I noticed that you are using Berkley DB java edition and
native, and Tokyocabinet for your storage engine. Can I only use
Berkley DB java edition?

Thank you very much

LIU

On Jul 20, 11:22 pm, Arnoldo Muller <arnoldomul...@gmail.com> wrote:
> Dear Liu:
>
> So here it is the paper that explains about the X -> long mapping.
> Please let me know if you have further questions. The current
> implementation will benefit if you use a x64 machine.
>
> How big is the dataset that you are planning to use? How many objects
> are you planning to insert?
>
> Thanks,
>
> Arnoldo
>
> On Mon, Jul 20, 2009 at 2:45 PM, Arnoldo Muller<arnoldomul...@gmail.com> wrote:
> > Dear Liu:
>
> > Thank you for your interest in OBSearch.
>
> > OBSearch contains several indexes like the P+Tree (exact) and the
> > GH-sketch (inexact).
> > Regarding SMAP or pivot embeddings take a look at:
>
> > Pivot selection techniques for proximity searching in metric spaces
>
> > By Bustos and Navarro. It will give you a good introduction on the
> > topic. Regarding the long mapping, the technique is very new and it
> > will be presented at the upcoming SISAP 2009 conference (Czech
> > Republic).
>
> > I will find out if I can publish the paper now with them and I will
> > get back to you. In the paper you will find the details of the
> > technique and some performance numbers. Thank you for your suggestion,
> > I think we should have some perf. numbers in the webpage.
>
> > Thanks again and I will get back to you soon!
>
> > Arnoldo
>
> > On Mon, Jul 20, 2009 at 4:21 AM, bllackeye<bllack...@gmail.com> wrote:
>
> >> Hi,
>
> >> I am building up a 3d shape search engine(Open source) for searching
> >> similarity shapes. I have already done the feature extraction part,
> >> now I need to have a good similarity search engine.
>
> >> I have already tried xxl, but it seems to me that xxl package is not a
> >> full implementation of similarity search. I tried multiple space
> >> filling curve scheme too, it is faster, but it could only give
> >> approximate result.
>
> >> Eventually, I came to your project, it seems that a lot of work has
> >> been done here.  But I still have some doubt about your projects.
> >> Firstly, it seems that somehow you managed to map a multi-dimensional
> >> vector into a long number( I suppose that SMAP is for the mapping),
> >> can you please tell me how you make that possible. Secondly, I am
> >> wondering whether you could publish some performance test data, which
> >> will be very helpful to users like me.
>
> >> Thanks you very much.
>
> >> LIU
>
>
>
>  paper.pdf
> 333KViewDownload

Arnoldo Muller

unread,
Aug 11, 2009, 11:33:20 PM8/11/09
to obsearc...@googlegroups.com
Hello Liu:

I am glad that you liked the paper, there are still many things to work on.
I have been recently reading about cover trees:
http://hunch.net/~jl/projects/cover_tree/cover_tree.html

And there are exciting techniques like RanWalk
http://www.sisap.org/2009/index.php?src=Invited
that I also want to study and implement.

I have made a commit that includes the files you need.

Regarding the storage devices, there are three options right now:
* TC
* Berkeley DB Java
* Cuckoo hash table (developed in-house, very alpha code)

You can choose any of the storage devices.

Specifically note the example VectorsDemoGHSSISAP. I have been working
on it recently and it displays how the new GHS
sketch implementation should be used. Some datasets may be better
served with the DistPerm technique (Chavez, Navarro) so you may use
the WikipediaDemo as a guide.

Let me know how it goes!

Arnoldo

chaoyang liu

unread,
Aug 12, 2009, 4:37:48 AM8/12/09
to obsearc...@googlegroups.com
Hi, Arnoldo

Thanks for your quick replay. I am very interested in this topic
too, so thank for sharing this info with me. And meanwhile, I need
your help again. I found out that there are two packages
net.obsearch.index.perm.impl, net.obsearch.pivots.rf03 being missing.
And definitely I will let you know the results of applying your
package in 3d search domain while they are ready.

Thanks

LIU

Arnoldo Muller

unread,
Aug 12, 2009, 1:44:29 PM8/12/09
to obsearc...@googlegroups.com
Hello Liu:

This topic is indeed interesting!
Sorry for the missing files. I just added the packages you reported.
Let me know if all the files are there.

Arnoldo

chaoyang liu

unread,
Aug 12, 2009, 10:15:23 PM8/12/09
to obsearc...@googlegroups.com
Hi, Arnoldo

Thanks so much for your help. Now I can run VectorsDemoGHS. I will go
through your code to understand the basic thing. Then I'll run my 3d
search test case and let you know the result.

Thank you so much for your kindness

LIU

chaoyang liu

unread,
Aug 16, 2009, 9:56:36 PM8/16/09
to obsearc...@googlegroups.com
Hi, Arnoldo

I made the VectorsDemoGHS work and went though part of your code.

For the search process,
1) the system convert the query object into a SketchProjection, here
the sketchProjection is the Sketch mentioned in your paper. To some
extend, the sketch here is sort of hash value but the distance between
2 sketches reflects distance between the 2 corresponding objects.
2) the system finds the buckets which is indexed with
SketchProjection, the system will get many buckets based upon the
statistics measuring how well the sketch can reflect the original
object in terms of similarity.
3) upon finding candidate buckets, the system now check every objects
contained in each bucket, and pick up those most similar one.

The process of search sketch is linear search, in other word, brutal
force search. Since searching sketch is very cheap, so it actually
outperforms many other method.

I have a couple of questions too.

1) upon index freezing, the system will computer the sketch and store
it some where. So after freezing the index, is it possible to insert
more objects?
2) Suppose I want to search k most similar objects, saying k=10, where
can I set the value to k?
3) is it possible to put some indexing structure to the sketch to
speed up the search process?

Thanks.


LIU

Arnoldo Muller

unread,
Aug 17, 2009, 5:36:53 PM8/17/09
to obsearc...@googlegroups.com
Hello Liu:

Your insights are correct. The sketch distance will try to resemble
the real distance.
The statistics try to decide how many candidate sketches (closest to
the query) must be
read in order to find a result with low error.

1) yes it is possible to insert objects after freeze.
2) You must replace the line:
index.setMaxK(new int[] { 1 });
for the line:
index.setMaxK(new int[] { 1, 10 });

This adds the k=10 to the estimation during freeze. If you wanted k=3
you can add 3 to the array and that is it!
3) Yes, there are hamming filters developed during the early 90's and
the S-tree during the 80's. Locality sensitive hashing for hamming
spaces could also be used. I am not so convinced they perform so
well, and I think it is better to compress the sketches.
I will be attending an important similarity search conference in Aug
31 (sisap 2009) and I will try to find more information about this.
I also want an index, but the curse of dimensionality kicks in even
for hamming spaces.


Cheers!

Arnoldo
Reply all
Reply to author
Forward
0 new messages