Hi Thomas,
Thanks for the reply. I'm not really trying to do string matching;
that part of the problem is already handled by a number of software
packages tuned for this data type, and the SSAHA package you referenced
is one of them. What I'm really trying to do is process the data after
those alignments are done, when you essentially have a series of ranges
that indicate where all the matches occur.
For example, suppose I use software to line up the DNA sequencing reads,
so that each read has a start and stop position on the genome defining
its range of alignment. Next, I have a small range on the genome, and I
want to identify all of the reads that overlap that range. My query
would consist entirely of numerical data.
Here is a really simple diagram that will hopefully show up okay.
[--------] denotes the range of interest, maybe a single gene for
example.
*********** denotes the reads, lined up where they match the genome.
1                                                                      n
|<----------------------------[-----------]--------------------------->| Genome
                  1 **************              2 ****************
                          3 ********************
                                    4 *******************
I want to be able to do a query where I find all reads that overlap
the indicated range. If the reads were single points it would be an
easy query, but since the reads are ranges that have to intersect
another range, I'm having a hard time with it. In the diagram above I
would do a query defining the max and
min of the bracketed range to return reads 1, 3, and 4 but not 2.
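
For illustration, the overlap test behind such a query can be sketched
in Java as below. This is only a sketch under assumptions: the ranges
are treated as closed (inclusive) intervals, the class, field, and
method names are hypothetical, and the coordinates are made up so that
reads 1, 3, and 4 overlap the range of interest and read 2 does not.
The idea is that two ranges intersect exactly when each one starts no
later than the other one ends.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; coordinates are invented for illustration.
final class ReadOverlap {

    // One aligned read: inclusive start and stop positions on the genome.
    static final class Read {
        final int id;
        final long start;
        final long stop;

        Read(int id, long start, long stop) {
            this.id = id;
            this.start = start;
            this.stop = stop;
        }
    }

    // Two closed ranges overlap when each starts no later than the other ends.
    static boolean overlaps(Read read, long rangeMin, long rangeMax) {
        return read.start <= rangeMax && read.stop >= rangeMin;
    }

    // Linear scan returning the ids of all reads overlapping [rangeMin, rangeMax].
    static List<Integer> findOverlapping(List<Read> reads, long rangeMin, long rangeMax) {
        List<Integer> ids = new ArrayList<>();
        for (Read read : reads) {
            if (overlaps(read, rangeMin, rangeMax)) {
                ids.add(read.id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Read> reads = new ArrayList<>();
        reads.add(new Read(1, 20, 33)); // overlaps the left edge of the range
        reads.add(new Read(2, 50, 65)); // lies entirely to the right
        reads.add(new Read(3, 28, 47)); // spans the whole range
        reads.add(new Read(4, 38, 56)); // overlaps the right edge of the range
        // Assumed range of interest: [30, 42].
        System.out.println(findOverlapping(reads, 30, 42)); // prints [1, 3, 4]
    }
}
```

The same two comparisons could presumably be written as a plain WHERE
clause over numeric start and stop columns (column names would be up to
the schema), which is what makes this look doable without a spatial
intersects function.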
Ultimately, I guess this is a question about how to intersect sets of
ranges. I would probably represent the data as rectangles in a
traditional spatial database and then call the intersects function,
but since H2 doesn't have that defined I'm looking for a workaround.
Let me know if this still isn't clear; it's a bit tough to describe in
this format.
On Nov 9, 3:03 pm, Thomas Mueller <thomas.tom.muel...@gmail.com> wrote:
> Hi,
>
> It sounds like an interesting problem. However I'm not sure if the
> multi-dimension tool will help. Could you provide a link or explain
> what exactly you want to do? Could you describe what exactly the
> problem is? I understood it as:
>
> String dna (data type actually CLOB, length: 6 GB)
> dna = "TGATAGGTGATAGATAGATTGATAGATGATAGAAGATTGATAGATGATAG...."
> You want to quickly search for a sequence of length 400-500 characters, as in:
> String sequence = "TGATAGATGAT...";
> int idx = dna.indexOf(sequence);
>
> Is this about what you want to do? I guess there are already very good
> solutions. I did a quick Google search for "genome fast search" and
> got http://genome.cshlp.org/content/11/10/1725.full