Postgres full text search, Apache Lucene?

136 views
Skip to first unread message

Chris McCann

unread,
Apr 11, 2014, 3:59:00 PM4/11/14
to sdr...@googlegroups.com
SD Ruby,

I'm looking for a solution to a search problem and want to survey the community to see if anyone else has dealt with this type of search.

The application I'm building supports an image processing system.  We have a mathematical way of uniquely representing any particular image as a vector of 16 values, each ranging between 0 and 255.

I need to implement a search mechanism that finds the closest matches to a given image, also represented as a 16 element vector.  This is usually called a "vector space model" search, and it's implemented for full text search in Postgres as well as Lucene, and probably many other full text search systems.

The problem I'm wrestling with is I'm not searching on text, I'm searching on integers.  I basically need to search for the closest match like this:

Say my search image has a vector with elements q(1) to q(16),  [q(1) = 122, q(2) = 7, q(3) = 89,, ..., q(16) = 224].  

To compare that vector against the image vectors in the database I need to calculate the "distance" between the query vector (q) and each of the database vectors (d):

distance = square_root( (q(1) - d(1))^2 + (q(2) - d(2))^2 + ... (q(16) - d(16))^2) 

The lower the distance the closer the match, with dist == 0 being an exact match.

My research hasn't led me to a direct implementation of this in Postgres or Lucene since they are designed for text searching, though the underlying principles are the exact same.  Anyone ever tackle this type of search with numerical values?

Thanks in advance,

Chris

Guyren Howe

unread,
Apr 11, 2014, 4:18:59 PM4/11/14
to sdr...@googlegroups.com
This isn’t my area of expertise, but I believe Postgres is the go-to database of choice for spatial work because of its advanced indexing options. I’m 75% sure you can do something that will give you an index across your 16 values that will let you do a fast nearest-neighbor or nearest-k.

Stack Overflow appears to agree:


Regards,

Guyren G Howe
Relevant Logic LLC

guyren-at-relevantlogic.comhttp://relevantlogic.com ~ +1 512 784 3178

Ruby/Rails,  Xojo, PHP programming
PostgreSQL, MySQL database design and consulting
Technical writing and training

Read my book, Real OOP with REALbasic: <http://relevantlogic.com/oop-book/about-the-oop-book.php>

Dan Simpson

unread,
Apr 11, 2014, 4:37:39 PM4/11/14
to sdr...@googlegroups.com
If you have a table with 16 columns, q1, q2 ..., q16, could you do the following?

SELECT id, (abs(q1 - input1) + abs(q2 - input2) ... + abs(q16 - input16)) AS v FROM table ORDER BY v DESC LIMIT 0,10;

This way you can limit the size with a WHERE v > threshold, etc.

--Dan


--
--
SD Ruby mailing list
sdr...@googlegroups.com
http://groups.google.com/group/sdruby
---
You received this message because you are subscribed to the Google Groups "SD Ruby" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sdruby+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Dan Simpson

unread,
Apr 11, 2014, 4:41:38 PM4/11/14
to sdr...@googlegroups.com
I guess it would be order by v ASC, since 0 is a perfect match.  order by v DESC gives you the least similar matches.

Rob Kaufman

unread,
Apr 11, 2014, 7:56:40 PM4/11/14
to Chris McCann, sdr...@googlegroups.com
Solr and Sunspot (for Rails) has pretty powerful numeric search.  Worth taking a look at if what you can get out of Postgres doesn't get you there or if you're queries get too complex.  Solr wraps Lucene, making it much easier to use.

Best,
Rob

Alex Boster

unread,
Apr 11, 2014, 9:04:31 PM4/11/14
to sdr...@googlegroups.com
Solr/sunspot is good. Not sure if it’s a great fit here. If it’s going to work here, you maybe looking at the order_by_function, with the indexing merely storing the vector values of the image in question.

In other words, unless there are other search parameters in play here (image name, for example), you aren’t going to get a ton out of it, relative to more traditional text search, since it really shines where you can precompute the search indexes.

AB

Guyren Howe

unread,
Apr 11, 2014, 9:05:18 PM4/11/14
to sdr...@googlegroups.com
On Apr 11, 2014, at 4:56 PM, Rob Kaufman <rgka...@gmail.com> wrote:

Solr and Sunspot (for Rails) has pretty powerful numeric search.  Worth taking a look at if what you can get out of Postgres doesn't get you there or if you're queries get too complex.  Solr wraps Lucene, making it much easier to use.

Doesn’t sound like a text search problem though, or did I misunderstand? If you’re wanting to do clever things with 16-dimensional numeric vectors, text is surely the last way to deal with it.

Chris McCann

unread,
Apr 12, 2014, 8:27:11 PM4/12/14
to sdr...@googlegroups.com
Guyren,

No, I don't think you've misunderstood at all.  It's not a text search problem.  I'm trying to find the "closest" match to 16 integer values, close being defined as the overall distance between the vector components or Euclidean L^2 Norm.

Dan's suggestion for the query was what I was initially trying to do, and I hacked that together pretty quickly.  It seems pretty fast to compute, say, the 5 closest matches, but it feels wrong to not be using a precomputed index somehow.  

I'm also concerned how that would scale if we had millions of images against which to match combined with thousands of simultaneous queries.  Yeah, yeah, that will be a good problem to have but I'm trying to not start out hamstrung.


But Postgres does not appear to support the <-> distance operator between text values or integer arrays.  The code example in that SO post does not work, though the author doesn't claim that it does.

I also tried a 16-dimension cube, but the current Cube support also doesn't include <->.  There is an experimental patch floating around that adds that capability but that would require me to custom build Postgres and I'm not sure I want to go down that path just yet.

I scoured the Solr and Sunspot documentation looking for an example of searching on "closeness" to numeric values but didn't see anything.  I'm wondering if I simply declare the 16 vector components as "text" fields in the Rails model's searchable block if I'll get the results I'm after.  An experiment to try, no doubt.

Thanks again for all the great input, and especially that SO post!

Cheers,

Chris

Guyren Howe

unread,
Apr 12, 2014, 10:54:07 PM4/12/14
to sdr...@googlegroups.com
On Apr 12, 2014, at 5:27 PM, Chris McCann <testf...@gmail.com> wrote:

But Postgres does not appear to support the <-> distance operator between text values or integer arrays.  The code example in that SO post does not work, though the author doesn’t claim that it does.

Read the section on Postgres indexing. I believe you can use a custom distance operator you define yourself.

Postgres has two custom index options; one is faster on insert and one is faster on lookup.

Guyren Howe

unread,
Apr 13, 2014, 2:50:00 AM4/13/14
to sdr...@googlegroups.com
On Apr 11, 2014, at 12:59 PM, Chris McCann <testf...@gmail.com> wrote:

I'm looking for a solution to a search problem and want to survey the community to see if anyone else has dealt with this type of search.

The application I'm building supports an image processing system.  We have a mathematical way of uniquely representing any particular image as a vector of 16 values, each ranging between 0 and 255.

I need to implement a search mechanism that finds the closest matches to a given image, also represented as a 16 element vector.  This is usually called a “vector space model" search, and it's implemented for full text search in Postgres as well as Lucene, and probably many other full text search systems.

This might be *exactly* what you’re looking for:

Chris McCann

unread,
Apr 13, 2014, 3:18:27 AM4/13/14
to sdr...@googlegroups.com
Thanks, Guy, that is the Cube patch I referenced in my reply earlier today.  At this point I'd rather not have to build a custom copy of Postgres to get this functionality, especially if I plan to deploy this app to a hosting provider like Heroku.  

That assumes I correctly understand what a "patch" is, so if I'm off the mark please enlighten me.


You received this message because you are subscribed to a topic in the Google Groups "SD Ruby" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/sdruby/eIfin-w6eD4/unsubscribe.
To unsubscribe from this group and all its topics, send an email to sdruby+un...@googlegroups.com.

Guyren Howe

unread,
Apr 13, 2014, 3:44:05 AM4/13/14
to sdr...@googlegroups.com
On Apr 13, 2014, at 12:18 AM, Chris McCann <testf...@gmail.com> wrote:

Thanks, Guy, that is the Cube patch I referenced in my reply earlier today.  At this point I'd rather not have to build a custom copy of Postgres to get this functionality, especially if I plan to deploy this app to a hosting provider like Heroku.  

That assumes I correctly understand what a “patch" is, so if I'm off the mark please enlighten me.

That’s a standard contrib module, so it’s probably already there if you’re using a pretty standard build. There’s a good chance you can just do:

create extension cube;

Then do something like:

select cube_union('(0,5,2),(2,3,1)', '0');

Chris McCann

unread,
Apr 13, 2014, 11:58:00 AM4/13/14
to sdr...@googlegroups.com
Guy,

It doesn't look to me like this patch has been added to the cube extension, which I already loaded into Postgres hoping it would provide a solution.  There's a discussion on the Postgres hacker board about maybe adding it to version 9.5 assuming all the tests and benchmarks can be properly added.

Again, I'm not familiar at all with the PG development ecosystem, so if there is a way to add this patch to my PG installation I'm more than willing to give it a test drive.

Cheers,

Chris


Dan Simpson

unread,
Apr 13, 2014, 12:00:42 PM4/13/14
to sdr...@googlegroups.com
I'd be curious to see how far the naive approach would get you.  The high dimensional nature of your problem makes exact k-NN less effective: http://en.wikipedia.org/wiki/Curse_of_dimensionality

You could try using LSH (http://en.wikipedia.org/wiki/Locality-sensitive_hashing) to cluster your data, then use a use the hash for matching.  I've never used this before, but it seems cool.

Here is an interesting SO question similar to your problem: http://stackoverflow.com/questions/5751114/nearest-neighbors-in-high-dimensional-data

I'm probably not helping, but the problem is interesting to me :)

--Dan




On Sun, Apr 13, 2014 at 12:44 AM, Guyren Howe <guy...@gmail.com> wrote:
You received this message because you are subscribed to the Google Groups "SD Ruby" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sdruby+un...@googlegroups.com.

Chris McCann

unread,
Apr 13, 2014, 12:05:11 PM4/13/14
to sdr...@googlegroups.com
Dan,

You're helping a tremendous amount by feeding me new "terms of art" on this problem.  Not being a mathematician I've fumbled around for the right terms to Google, so this is very helpful.  I'll take a look at what you provided.

Cheers,

Chris 


You received this message because you are subscribed to a topic in the Google Groups "SD Ruby" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/sdruby/eIfin-w6eD4/unsubscribe.
To unsubscribe from this group and all its topics, send an email to sdruby+un...@googlegroups.com.

Chris McCann

unread,
Apr 13, 2014, 12:14:33 PM4/13/14
to sdr...@googlegroups.com
This link seems to describe Google's approach to the issue:  http://en.wikipedia.org/wiki/VisualRank

There is lots of good info here relating directly to the problem I'm solving.




On Sun, Apr 13, 2014 at 9:00 AM, Dan Simpson <dan.s...@gmail.com> wrote:
You received this message because you are subscribed to a topic in the Google Groups "SD Ruby" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/sdruby/eIfin-w6eD4/unsubscribe.
To unsubscribe from this group and all its topics, send an email to sdruby+un...@googlegroups.com.

John Lynch

unread,
Apr 13, 2014, 12:35:09 PM4/13/14
to sdr...@googlegroups.com
Chris,

In the field of Content-Based Image Recognition (CIBR), I think they frequently use a Bag-of-Words approach, to classify image features into words (these can be actual words like "dog" or just sequences of letters that represent a numeric value). By doing this, it turns the image similarity search problem into a document similarity search problem, for which there are loads of (mostly Lucene-based) tools available.





- john



Rob Kaufman

unread,
Apr 13, 2014, 2:31:41 PM4/13/14
to sdr...@googlegroups.com, sdr...@googlegroups.com
Solr does numerical range matching as well as text, though I'm starting to think without more info on use case we may not be best serving Chris when it comes to best practice.

Rob

Guyren Howe

unread,
Apr 14, 2014, 12:39:25 AM4/14/14
to sdr...@googlegroups.com
On Apr 13, 2014, at 8:58 AM, Chris McCann <testf...@gmail.com> wrote:

It doesn't look to me like this patch has been added to the cube extension, which I already loaded into Postgres hoping it would provide a solution.  There's a discussion on the Postgres hacker board about maybe adding it to version 9.5 assuming all the tests and benchmarks can be properly added.

And the standard cube extension doesn’t work for you? It implements a distance function, so I figured it would meet your needs out of the box.

Chris McCann

unread,
Apr 14, 2014, 1:46:47 AM4/14/14
to sdr...@googlegroups.com
I didn't see that there was a cube_distance function. The documentation is super handy, though: 

cube_distance(cube, cube) returns doubleReturns the distance between two cubes. If both cubes are points, this is the normal distance function.
I'll give it a whirl tomorrow. 

Chris

Chris McCann

unread,
Apr 14, 2014, 8:03:19 PM4/14/14
to sdr...@googlegroups.com
cube_distance does in fact work -- thanks, Guyren!

Here's the actual query I'm using, in case anyone else needs to do this.  The second cube(ARRAY[ ]) call would take the 16 elements of the image vector that we want to match:

select id, 
cube_distance(
cube(ARRAY[c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13, c14, c15]), 
cube(ARRAY[211, 64, 120, 24, 139, 246, 11, 161, 29, 88, 252, 112, 218, 147, 66, 111])
) as distance
from images
order by 2 ASC
limit 5

Cheers,

Chris
Reply all
Reply to author
Forward
0 new messages