Searching for ranges

Roedy Green

unread,

Nov 21, 2011, 11:56:45 AM11/21/11

to

I was wondering about techniques for using SQL to find records with a
low-high associated range, especially when the ranges are contiguous,
or when the range sizes are a multiple of some integer.

I am familiar with in-RAM technique such as binary search and dividing
by the range size/atomicity to get an index.

I suspect just asking
for table.low <= wanted < table.high
won't be that clever, even if low and high are indexed.

I wondered if there were standard solutions to the problem.

The pattern of cases defined by bounds, particularly when the bounds
are contiguous comes up quite frequently. I am surprised there are
not more built in features in languages, Collections and databases to
efficiently handle it.

Some places where this band-searching problem comes up:

converting a random number 0 <= r < 1 to select a weighted quotation.

deciding which unit of measure to use depending on its size.

deciding which rule to use for inserting dashes in a ISBN depending on
is band.

naming colours

clipregion bands

converting frequency to RGB

UTF-8 encoding

converting weight, height etc to evaluation words.

calculations with coins.

displaying a wide range of numbers or currency values.

--
Roedy Green Canadian Mind Products
http://mindprod.com
I can't come to bed just yet. Somebody is wrong on the Internet.

Martin Gregorie

unread,

Nov 21, 2011, 4:18:53 PM11/21/11

to

On Mon, 21 Nov 2011 08:56:45 -0800, Roedy Green wrote:

> I was wondering about techniques for using SQL to find records with a
> low-high associated range, especially when the ranges are contiguous, or
> when the range sizes are a multiple of some integer.
>
> I am familiar with in-RAM technique such as binary search and dividing
> by the range size/atomicity to get an index.
>
> I suspect just asking for table.low <= wanted < table.high won't be
> that clever, even if low and high are indexed.
>

Au contraire, that should be pretty good given a decent RDBMS, provided
that:

- a frequency analysis is done to ensure that the costs of maintaining
the index don't damage the performance of other queries.

- the query using the new index isn't so complex that the effect of the
index becomes irrelevant or even harmful. EXPLAIN is your friend here
for PostgreSQL and some other DBMS. Those that don't implement
EXPLAIN should provide a similar tool. If the DBMS you're using
doesn't, consider carefully if you should be using it for anything.

--
martin@ | Martin Gregorie
gregorie. | Essex, UK
org |

David Lee Lambert

unread,

Jan 17, 2012, 2:59:58 PM1/17/12

to

On 21 nov 2011, 11:56, Roedy Green <see_webs...@mindprod.com.invalid>
wrote:

> I was wondering about techniques for using SQL to find records with a

> low-high associated range, especially when the ranges are contiguous, [...]

>
> I am familiar with in-RAM technique such as binary search and dividing
> by the range size/atomicity to get an index.
>
> I suspect just asking
> for table.low <= wanted < table.high
> won't be that clever, even if low and high are indexed.

As long as low OR high is indexed, and assuming the ranges are small
enough that it is actually better than a full-table scan, any good
DBMS will make use of the index to speed up the query without much
trickery. In fact, if the set of ranges is small enough, the whole
index will probably be in one B-tree page, with the keys in order in
memory, and the DBMS will actually use a binary search to look values
up within the page.

If you have a sequence of contiguous non-overlapping ranges, you
don't need two columns; you just need to "SELECT ... WHERE table.low
<= wanted ORDER BY table.low ASC LIMIT 1" or "SELECT ... WHERE
table.low <= wanted AND table.id NOT IN (SELECT id FROM table t2 WHERE
t2.low >= wanted)" or any other way of expressing it in SQL.

However, I wonder if putting such a table out in the database is
really the most robust, maintainable or high-performance way to deal
with the applications you list lower down...

> The pattern of cases defined by bounds, particularly when the bounds
> are contiguous comes up quite frequently. I am surprised there are
> not more built in features in languages, Collections and databases to
> efficiently handle it.
>
> Some places where this band-searching problem comes up:
>
> converting a random number 0 <= r < 1 to select a weighted quotation.

Yes.

> deciding which unit of measure to use depending on its size.

For this I think an array lookup by "Math.floor(Math.log(x)/
LOG_OF_1000)" is simpler and faster than a network round-trip (or
local context switch) to the database. Of course, if you're writing
database stored procedures and you expect the unit names to actually
change anytime, that might make a database table more attractive...

> deciding which rule to use for inserting dashes in a ISBN depending on

> is [?] band.

The rules for this fit on a single page of human-readable text, I
don't see how anything involving SQL parsing and a database lookup
could possibly be more efficient than just implementing them in a
procedural language.

See for example http://waltshiel.com/2008/06/22/isbn-how-to-decode-it/
.

> UTF-8 encoding

The rules for this are even simpler, and we hope already solved as
part of Java NIO. The only thing to tweak is whether one-byte, two-
byte, three-byte, or four-byte sequences are expected to be most
common, which might affect what optimization hints you pass to a
highly optimizing C compiler.

> calculations with coins.

There I'll grant you it could change at runtime for a sufficiently
long-running program; but I still suspect that the most practically
efficient algorithms will start by loading all the coin-values for a
particular currency into memory at the beginning of a calculation.

> naming colours

If you're thinking of loading X's "rgb.txt" and computing the closest
match to a color, etc... Several databases support 2-D geometric
types with corresponding indexes. However, I can't find documentation
of any mainstream database with out-of-the-box support for 3-D
indexes. See the following comment about PostGIS...

http://www.mail-archive.com/postgi...@postgis.refractions.net/msg03821.html

--
DLL

Roedy Green

unread,

Jan 19, 2012, 12:32:02 PM1/19/12

to

On Tue, 17 Jan 2012 11:59:58 -0800 (PST), David Lee Lambert
<dav...@lmert.com> wrote, quoted or indirectly quoted someone who said
:

>
>> deciding which rule to use for inserting dashes in a ISBN depending on
>> is [?] band.
>
>The rules for this fit on a single page of human-readable text, I
>don't see how anything involving SQL parsing and a database lookup
>could possibly be more efficient than just implementing them in a
>procedural language.

I was not suggesting that you would use SQL for such code. I have
implemented in it Java. See http://mindprod.com/products1.html#ISBN

I was just pointing out that slicing up a range into a number of
abutting bands of variable width is a pattern than shows up in many
different problems where you want to rapidly identify the band. I was
suggesting there ought to be canned classes or language features in
Java/SQL etc . for dealing with it that hide the housekeeping.

--
Roedy Green Canadian Mind Products
http://mindprod.com

One of the most useful comments you can put in a program is
"If you change this, remember to change ?XXX? too".