recursive definition of text

29 views
Skip to first unread message

Ingvar Bogdahn

unread,
Feb 29, 2012, 7:06:31 AM2/29/12
to hyperg...@googlegroups.com
hi all,

Procrastination can be a very fruitful thing. I want to share some ideas of a night session and hope for some feedback. 
Had that idea first when studying sequence alignment algorithms (needleman & wunsch) and afterwards trying to deeper understand the HGDB model and trying to imagine ways of saving nucleotide sequences in a way that reveals recurring motifs of nucleotides, some way to merge storage and sequence alignment. But the main idea is applicable to normal Text of course. 

There is one easy part and an (optional) hard part where I'd hope for ideas, probably this kind of thing has been done 50 years ago. Anyway it would give more interesting test cases than randomstrings :-)

The easy part of the idea:
given a string s: instead of storing s as an byte array, split s into words first, save those and, then save s as sequence of (refs to) its composing words.

The harder part: 
Define a type "Text", where Text subtypes to either Word or Expression, the latter being a Sequence of Text (not Sequence of Word). Hence a Text can be either a sequence of Words or sequnce of words or expressions (composed of expressions (composed of expressions)) and so forth.

The motivation to do this particularly in HGDB:
By chaining refs to words (or to expressions), save each word (or expression) only once, save space (at least for expressions) but particularly at the same time automatically build up interesting interconnections between separate texts or within a single text, using custom indices.


Open questions of the harder part:

1) the best way of defining this recursive type on the java/JVM-side?
  a) java abstract class + class hierachy? Limitations? Pattern match? Better understand the limitations of JVM regarding recursive types. Maybe this would need tricks like visitor pattern, which could render interop with HGDB more difficult.
  b) scala first try: https://gist.github.com/1940229 
  c) keep recursive type outside of JVM, replace HGDB-StringType with recursive version goto 2b

2) the best way of defining this recursive type on the HGDB-side? 
  a) just create a JVM-type and store it? ( 1a/b). 
  b) combination of HGDB-type and JVM-type ( in combo with 1a/b) 
  c) everything only in HGDB -still have to think about that :-)

3) identify recurring text motifs of higher abstraction level? 
   a) further differentiate between subordinate clauses and/or also whole sentences. First easy way: separate at one of ".;?!". This is too rigid.
   b) Some way of "contiguous intersections" that keep only contiguous atoms including the order? some form of (multiple) sequence alignment built into a custom index. Very expensive, should be distributed.

4) if recurring motif(s) are identified after storing the flat representation of a string s, how to organize HGDB such that the representation of s can (continously) shrink to representations with fewer refs to higher-level Text lego pieces? 

5) How to deal with overlapping / alternative representations? => Have several different representations of the same string s => increase linking. 



Dave Dolan

unread,
Feb 29, 2012, 8:27:06 AM2/29/12
to hyperg...@googlegroups.com
First of all, needleman & wuensch, and the BLAST approximation of it
find alignment WITH elisions, (and position shifts) and thus the
required 'dynamic programming' iteration means that even rote
similarity of words/sentences/paragraphs etc will not serve as a
basis, nor even necessarily a helpful tool for this kind of
calculation. It's unfortunate but you have to actually do the
iteration, each is necessary to calculate the next step in the
iteration. While each step is a markov model for the next step, any
one state doesn't give any concrete clues as to any but the
immediately following state.

"AACGT" is broken first into wordsized chunks of every possible
combination... say it's wordsize 2, you have AA, AC, CG, GT. Now
"ACT" will have a very high similarity score (2 deletions) with N&W
and indeed BLAST, but breaking this into words does not reveal this
unless you run the entire algorithm.

I do believe the algorithm/type suggested by the rest of your
discussion is useful for a lot of things but not necessarily in
performing this type of calculation.

The hypergraph model though could be instrumental in representing the
results after some other parallel computing device has computed said
similarity scores though and save a lot of trouble making sense and
identity of the computed data afterward. This would be useful in much
the same way as Bio4j (http://bio4j.com)

I'm not trying to piss all over your idea, but rather color it,
because I have been thinking about this very issue for a very long
time myself...

> --
> You received this message because you are subscribed to the Google Groups
> "HyperGraphDB" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/hypergraphdb/-/hu2kPFa9ys4J.
> To post to this group, send email to hyperg...@googlegroups.com.
> To unsubscribe from this group, send email to
> hypergraphdb...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/hypergraphdb?hl=en.

--
---------------------------------------------------------------
Dave Dolan
http://davedolan.com/blog

Ingvar Bogdahn

unread,
Feb 29, 2012, 6:52:38 PM2/29/12
to hyperg...@googlegroups.com
Hi Dave,

Thanks for the honest answer. Nice to hear of somebody interested in
bioinformatics and graph stuff. Didn't know about bio4j.

The post was not supposed to adress sequence alignment. It was only
inspired by a thought experiment in a larger context, and was the
trigger for this thought experiment. Nothing more at the moment.
I think the main challenge is: get the head around & get the most out
of the HGDB paradigma.


Blast does the speed up mainly by limiting expensive
smith-waterman(-equivalent) to a small corridor where it's predicted
to be relevant. It determines this window with heuristics, i.e. it
maintains a lookup table of words you mention over each given sequence
in the database before a quest, indepedent of the frame, i.e. with
overlapping words, and later, when it found high-scoring matches, it
also evaluates the alignments referring to a substitution matrix.
My idea of a recursive hgdb text type could - theoretically!- also do
most parts of that, each time, having very specialized links, that are
successively index by specialized indices. I.e. a hgdb link
representing a text (~a sequence) would present a sequence of words,
whereas the lockup table could be done this way:
link(WordAtomID, sequenceAtomID, pos1, pos2, pos3...)
You would then filter out set of sequences that have similar arities
for each word. Maybe this would be more practical:
link(WordAtomID, sequenceAtomID, posOfFirstOccurence, distanceToFirst,
distanceToSecond...)
Realizing substitution matrices could be also a pain, not sure if
practically feasible, but at least conceivble:
link(substititionType(AAG->AAC), wrappedMainLink, pos1, pos2).
Gaps are indeed another beast, especially as soon as the frame is
shifted, but it all depends how the recursive text type would be
implemented, and if it could be avoided to access the letter based
text/sequence without flattening it out in the application (hence make
HGDB itself see "ABCDE" instead of link("ABC", "DE") . I guess all
this is not good, gaps and frame shifts and gap penalties and all that
would need to be handled somehow in the application... it would
probably take some phd thesis. Anyway, this was not the intention :-)
The best thing would be to not try to imitate the existing approaches
in HGDB, but to find radically different approaches. With HGDB data
can be linked in a great variety of different ways, which in turn
means, the data can be treated in very different ways.
For example, it would be useful -also outside of the
bioinformatics-sphere- to have some kind of an "simulated
hybridization". Define types, that expose their semantic in some way
that is suitable to be picked up by links&indices, and have some kind
of specific affinity according to some pattern (aka simulate
H-Bonds..). Next thought experiment :-).

ingvar


2012/2/29 Dave Dolan <dave....@gmail.com>:

Borislav Iordanov

unread,
Mar 1, 2012, 12:52:59 AM3/1/12
to hyperg...@googlegroups.com
Hi Invgar,

It's been a while since I was involved in bioinformatics and my
understanding of BLAST is pretty superficial. But representing such
recursive structures in HGDB is certainly possible and I hope
relatively easy. You would just create Java classes representing
Words and the different types of expressions, where each type of
expression is simply a link, with no bean properties or anything of
the sort and that's it. You shouldn't be thinking in terms of Java
type vs. HGDB types - they should be two sides of the same coin. Every
atom (including HGDB types) has a corresponding JVM runtime
representation in the form of a Java object that naturally has a Java
type. When you write your algorithms, you usually write in terms of
the Java types that are mapped somehow to HGDB types.

I can't say more about what the right representation would be since it
very much depends on the type of operations/queries that you'd be
doing.

Boris

On Wed, Feb 29, 2012 at 7:06 AM, Ingvar Bogdahn
<ingvar....@googlemail.com> wrote:

> --
> You received this message because you are subscribed to the Google Groups
> "HyperGraphDB" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/hypergraphdb/-/hu2kPFa9ys4J.
> To post to this group, send email to hyperg...@googlegroups.com.
> To unsubscribe from this group, send email to
> hypergraphdb...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/hypergraphdb?hl=en.

--
http://www.kobrix.com - HGDB graph database, Java Scripting IDE, NLP
http://kobrix.blogspot.com - news and rants

"Frozen brains tell no tales."

-- Buckethead

Ingvar Bogdahn

unread,
Mar 11, 2012, 2:31:53 PM3/11/12
to hyperg...@googlegroups.com
Here is first working attempt: https://gist.github.com/2016844#gistcomment-90524
it's flat, no recursive types, just HGPlainLink on "Word", a Bean that has a String field.
Using assertAtom (for Word) and addUnique (for the phrase aka link of words), it makes sure that for a given datum, only one handle is used.
I didn't manage to just use String instead of Word, because I didn't manage to get a ByPartIndexer for String primitive type.
Is there a trick to assertAtom / addUnique a simple String?
Anyway, now it works that two "phrases" share the same word atoms, and it's possible to query give all phrases that contain that word atom(s) xy.

When I have time again, there is much to improve, such as the mentioned recursive type. I haven't figured out yet how to do it...

I was also thinking, that maybe it would be good in that case to reimplement HGPersistantHandle using the scala.util.Murmurhash, which has convenient and very fast hash functions, also for Arrays. They are also specialized, meaning compiled for each primitive type, generic but evading type erasure - penalty could be low, and it would reduce the overhead to do expensive queries or maintain expensive indices. Any arguments I'm missing that are against that?

Any hints would be always welcome :-)

ingvar

> hypergraphdb+unsubscribe@googlegroups.com.


> For more options, visit this group at
> http://groups.google.com/group/hypergraphdb?hl=en.

Reply all
Reply to author
Forward
0 new messages