Linear Spatial Data for repeated querying

6 views
Skip to first unread message

BB

unread,
Nov 4, 2009, 11:33:49 PM11/4/09
to H2 Database
Hi all, I have a data set that I'm considering using an embeded H2
database for processing but I don't know if this is the best approach.

The primary data is essentially a series of line segments (several
hundred thousand) along a very long line (it's actually a genome)
I want to be able to rapidly process another very large set of data
(in the millions of records) finding the intersection between the two
data sets for any overlapping segments.

The primary data set will be ordered, but the secondary data set
that's being processed would be unordered.

My first thought was import each data set into a table using the
MultiDimension tool and then loop through all of the several hundred
thousand regions in the primary data set one by one selecting the
items from the larger (millions of records) data set that falls within
that particular range.

Where I have really run into issue is figuring out how to use the
multidimension tool to accomplish the intersection of the two ranges.
It's pretty straightforward to identify where a single point lies
within a range, but I'm having a hard time adapting the intersection
query for two ranges. The getResult() function seems to only allow
for a single range where I want to find when two ranges overlap.

Am I just missing something simple here?

Thomas Mueller

unread,
Nov 9, 2009, 3:03:03 PM11/9/09
to h2-da...@googlegroups.com
Hi,

It sounds like an interesting problem. However I'm not sure if the
multi-dimension tool will help. Could you provide a link or explain
what exactly you want to do? Could you describe what exactly the
problem is? I understood it as:

String dna (data actually type CLOB, length: 6 GB)
dna = "TGATAGGTGATAGATAGATTGATAGATGATAGAAGATTGATAGATGATAG...."
You want to quickly search for a sequence of length 400-500 characters, as in:
String sequence = "TGATAGATGAT...";
String idx = dna.indexOf(sequence);

Is this about what you want to do? I guess there are already very good
solutions. I did a quick Google search for "genome fast search" and
got http://genome.cshlp.org/content/11/10/1725.full

It does sound similar to rsync: http://en.wikipedia.org/wiki/Rolling_hash

If it's not that, could you explain what exactly you want to do?

Regards,
Thomas

BB

unread,
Nov 11, 2009, 11:28:08 PM11/11/09
to H2 Database
Hi Thomas,

Thanks for the reply. I'm not really trying to do string matching,
that part of the problem is addressed by a number of software packages
tuned for this data type and the SAHA package you referenced is one of
those. I'm really trying to process data after you have done those
alignments and essentially have a series of ranges that indicate where
all those matches occur.

For example, if I use software to line up the DNA sequencing reads and
each has a start and stop position on the genome defining the range of
alignment. Next, I have a small range on the genome that I want to
identify all of the reads that overlap that range. My query would all
be numerical data.

Here is a really simple diagram that will hopefully show up okay.
[--------] denotes the range of interest maybe a single gene for
example
*********** denotes the reads and they are lined up where they match
the genome.

1
n
|<----------------------------[-----------]---------------------------
>| Genome
1 ************** 2
****************
3 ********************
4 *******************


I want to be able to do a query where I find all reads that overlap
the indicated range. If the reads were single points then it's an
easy query, but since it's intersecting ranges I'm having a hard time
with it. In the diagram above I would do a query defining the max and
min of the bracketed range to return reads 1, 3, and 4 but not 2.

Ultimately, I guess this is a question about how to intersect sets of
ranges. I would probably represent the data as rectangles in a
traditional spatial database and then call the intersects function,
but since H2 doesn't have that defined I'm looking for a work around.

Let me know if this still isn't clear, it's a bit tough to describe in
this format.







On Nov 9, 3:03 pm, Thomas Mueller <thomas.tom.muel...@gmail.com>
wrote:
> Hi,
>
> It sounds like an interesting problem. However I'm not sure if the
> multi-dimension tool will help. Could you provide a link or explain
> what exactly you want to do? Could you describe what exactly the
> problem is? I understood it as:
>
> String dna (data actually type CLOB, length: 6 GB)
> dna = "TGATAGGTGATAGATAGATTGATAGATGATAGAAGATTGATAGATGATAG...."
> You want to quickly search for a sequence of length 400-500 characters, as in:
> String sequence = "TGATAGATGAT...";
> String idx = dna.indexOf(sequence);
>
> Is this about what you want to do? I guess there are already very good
> solutions. I did a quick Google search for "genome fast search" and
> gothttp://genome.cshlp.org/content/11/10/1725.full
>
> It does sound similar to rsync:http://en.wikipedia.org/wiki/Rolling_hash
>
> If it's not that, could you explain what exactly you want to do?
>
> Regards,
> Thomas
>

Thomas Mueller

unread,
Nov 12, 2009, 6:24:24 AM11/12/09
to h2-da...@googlegroups.com
Hi,

> a series of ranges that indicate where all those matches occur.

> reads ... each has a start and stop position on the genome defining the range of
> alignment.

This is an interesting problem. So let's say you have the following
table and query:

create table reads(id int, start int, stop int);
select id from reads where start<=max and stop>=min;

Is that the (simple, but slow) query you want to optimize? I
understand now why a spatial index would help. However, if the average
of stop-start is relatively small, there might be an easier ways. One
idea is to additionally store the middle:

create table reads(id int, start int, stop int, middle int as
(stop+start)/2);

Then you also need to know M=max(stop-start). If max is a lot higher
than average(stop-start), you would need to split large reads into
sub-reads that are smaller. The query would be:

select id from reads where (middle between min-M/2 and max+M/2)
and start<=max and stop>=min;

Would that help?

Regards,
Thomas
Reply all
Reply to author
Forward
0 new messages