"AACGT" is broken first into wordsized chunks of every possible
combination... say the word size is 2, you have AA, AC, CG, GT. Now
"ACT" will have a very high similarity score (2 deletions) with N&W
and indeed BLAST, but breaking the sequences into words does not
reveal this unless you run the entire algorithm.
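To make that concrete, here is a minimal Python sketch. The helper names
`words` and `edit_distance` are made up for illustration, and plain
edit distance stands in for a real N&W score with a scoring scheme.

```python
def words(seq, k):
    """Return every overlapping word (k-mer) of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def edit_distance(a, b):
    """Levenshtein distance: unit cost per insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


long_words = words("AACGT", 2)    # AA, AC, CG, GT
short_words = words("ACT", 2)     # AC, CT
shared = set(long_words) & set(short_words)
distance = edit_distance("AACGT", "ACT")
print(long_words, short_words, shared, distance)
```

Only one word (AC) is shared, although the two strings are just two
deletions apart.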
I do believe the algorithm/type suggested by the rest of your
discussion is useful for a lot of things but not necessarily in
performing this type of calculation.
The hypergraph model could, however, be instrumental in representing
the results after some other parallel computing device has computed
said similarity scores, and could save a lot of trouble in making
sense of and identifying the computed data afterward. This would be
useful in much the same way as Bio4j (http://bio4j.com).
I'm not trying to piss all over your idea, but rather color it,
because I have been thinking about this very issue for a very long
time myself...
> --
> You received this message because you are subscribed to the Google Groups
> "HyperGraphDB" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/hypergraphdb/-/hu2kPFa9ys4J.
> To post to this group, send email to hyperg...@googlegroups.com.
> To unsubscribe from this group, send email to
> hypergraphdb...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/hypergraphdb?hl=en.
--
---------------------------------------------------------------
Dave Dolan
http://davedolan.com/blog
Thanks for the honest answer. Nice to hear of somebody interested in
bioinformatics and graph stuff. Didn't know about bio4j.
The post was not supposed to address sequence alignment. It was only
inspired by a thought experiment in a larger context, and was the
trigger for this thought experiment. Nothing more at the moment.
I think the main challenge is to get one's head around, and get the
most out of, the HGDB paradigm.
BLAST gets its speed-up mainly by limiting the expensive
Smith-Waterman(-equivalent) step to a small corridor where it's
predicted to be relevant. It determines this window with heuristics,
i.e. it maintains a lookup table of the words you mention over each
given sequence in the database before a query, independent of the
frame, i.e. with overlapping words, and later, when it has found
high-scoring matches, it also evaluates the alignments by referring to
a substitution matrix.
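A rough Python sketch of the word-lookup idea as described above, not
of BLAST's actual implementation. The function names and the two
database sequences are made up for illustration.

```python
from collections import defaultdict


def build_word_index(database, k):
    """Map each overlapping word to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for offset in range(len(seq) - k + 1):
            index[seq[offset:offset + k]].append((seq_id, offset))
    return index


def seed_hits(query, index, k):
    """Group exact word matches by (sequence id, diagonal).

    The diagonal is database offset minus query offset. Hits sharing a
    diagonal line up without gaps, so they mark the corridor where the
    expensive alignment is worth running.
    """
    hits = defaultdict(list)
    for q_off in range(len(query) - k + 1):
        for seq_id, d_off in index.get(query[q_off:q_off + k], []):
            hits[(seq_id, d_off - q_off)].append(q_off)
    return dict(hits)


database = {"s1": "TTACGTAA", "s2": "GGGGCCCC"}  # made-up sequences
index = build_word_index(database, 3)
hits = seed_hits("ACGT", index, 3)
print(hits)
```

Here both query words land on the same diagonal of s1, and s2 is never
touched by the expensive step.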
My idea of a recursive HGDB text type could (theoretically!) also do
most parts of that, each time with very specialized links that are in
turn indexed by specialized indices. I.e. an HGDB link representing a
text (~a sequence) would present a sequence of words, whereas the
lookup table could be done this way:
link(WordAtomID, sequenceAtomID, pos1, pos2, pos3...)
You would then filter out the set of sequences that have similar
arities for each word. Maybe this would be more practical:
link(WordAtomID, sequenceAtomID, posOfFirstOccurence, distanceToFirst,
distanceToSecond...)
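Both encodings can be mocked up with plain Python tuples. This is only
a sketch: nothing here is HGDB API, the function names are invented,
and "distanceToFirst" is read as the distance from the first
occurrence to each later one, which is an assumption.

```python
from collections import defaultdict


def word_links(seq_id, seq, k):
    """One tuple per distinct word: (word, seq_id, pos1, pos2, ...)."""
    positions = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        positions[seq[pos:pos + k]].append(pos)
    return [(word, seq_id, *pos_list) for word, pos_list in positions.items()]


def to_relative(link):
    """Re-encode as (word, seq_id, first_pos, distance_from_first, ...)."""
    word, seq_id, first, *rest = link
    return (word, seq_id, first, *(p - first for p in rest))


def similar_arity(links_a, links_b, tolerance=0):
    """Words whose links in both sequences differ in arity by at most `tolerance`."""
    arity_a = {link[0]: len(link) for link in links_a}
    arity_b = {link[0]: len(link) for link in links_b}
    return {w for w in arity_a.keys() & arity_b.keys()
            if abs(arity_a[w] - arity_b[w]) <= tolerance}


links_1 = word_links("seq1", "TACACG", 2)
links_2 = word_links("seq2", "ACGTAC", 2)
print(links_1)
print(to_relative(("AC", "seq1", 1, 3)))
print(similar_arity(links_1, links_2))
```

The arity of a link is then just the number of occurrences plus two,
so comparing arities compares word counts without looking at positions.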
Realizing substitution matrices could also be a pain; not sure if it
is practically feasible, but it is at least conceivable:
link(substitutionType(AAG->AAC), wrappedMainLink, pos1, pos2).
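As a toy illustration of such substitution links, again with plain
Python tuples and invented names. It only compares words at the same
offset (no gaps, no frame shift) and attaches no matrix scores, so it
is far from a real substitution matrix.

```python
from collections import defaultdict


def substitution_links(main_link, seq_a, seq_b, k):
    """Return tuples ((from_word, to_word), main_link, pos1, pos2, ...).

    The word at each offset of seq_a is compared with the word at the
    same offset of seq_b; every differing pair is recorded with the
    offsets where it occurs.
    """
    positions = defaultdict(list)
    for pos in range(min(len(seq_a), len(seq_b)) - k + 1):
        a, b = seq_a[pos:pos + k], seq_b[pos:pos + k]
        if a != b:
            positions[(a, b)].append(pos)
    return [(sub, main_link, *pos) for sub, pos in positions.items()]


# "seqA~seqB" is a placeholder for the wrapped main link.
subs = substitution_links("seqA~seqB", "AAGTAAG", "AACTAAC", 3)
print(subs)
```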
Gaps are indeed another beast, especially as soon as the frame is
shifted, but it all depends on how the recursive text type would be
implemented, and on whether the letter-based text/sequence could be
accessed without flattening it out in the application (i.e. have
HGDB itself see "ABCDE" instead of link("ABC", "DE")). I guess all
this is not good; gaps and frame shifts and gap penalties and all that
would need to be handled somehow in the application... it would
probably take a PhD thesis. Anyway, this was not the intention :-)
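For what the flattening question amounts to, a small Python sketch
with nested tuples standing in for links (an assumption, not how HGDB
stores anything): `flatten` builds the full string, while `letter_at`
walks the structure without building it.

```python
def flatten(node):
    """Concatenate the leaves of a nested text structure, left to right.

    A leaf is a plain string; a link is a tuple of leaves and/or links.
    """
    if isinstance(node, str):
        return node
    return "".join(flatten(child) for child in node)


def length(node):
    """Number of letters below this node."""
    return len(node) if isinstance(node, str) else sum(length(c) for c in node)


def letter_at(node, index):
    """Return the letter at `index` without building the flat string."""
    if isinstance(node, str):
        return node[index]
    for child in node:
        size = length(child)
        if index < size:
            return letter_at(child, index)
        index -= size
    raise IndexError("index out of range")


nested = ("ABC", "DE")                   # link("ABC", "DE")
deeper = (("AB", "C"), ("D", ("E",)))    # same text, different nesting
print(flatten(nested), flatten(deeper), letter_at(nested, 3))
```

Two different nestings flatten to the same text, which is exactly why
word-level links alone do not see a shifted frame.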
The best thing would be not to try to imitate the existing approaches
in HGDB, but to find radically different ones. With HGDB, data can be
linked in a great variety of ways, which in turn means the data can be
treated in very different ways.
For example, it would be useful, also outside of the bioinformatics
sphere, to have some kind of "simulated hybridization": define types
that expose their semantics in some way that is suitable to be picked
up by links and indices, and that have some kind of specific affinity
according to some pattern (i.e. simulate H-bonds...). Next thought
experiment :-).
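In the DNA case, the simplest such affinity pattern might look like
the Python sketch below. It is only one possible reading of "simulated
hybridization": the function name is invented, and it tries neither
offsets nor gaps.

```python
# Hydrogen bonds per Watson-Crick base pair.
H_BONDS = {("A", "T"): 2, ("T", "A"): 2, ("G", "C"): 3, ("C", "G"): 3}


def hybridization_score(strand, probe):
    """Sum hydrogen bonds when `probe` pairs antiparallel with `strand`.

    Both strands are written 5'->3', so the probe is read in reverse.
    Mismatched positions contribute nothing.
    """
    return sum(H_BONDS.get(pair, 0) for pair in zip(strand, reversed(probe)))


perfect = hybridization_score("AAGC", "GCTT")   # full reverse complement
mismatch = hybridization_score("AAGC", "GCTA")  # one mismatched base
print(perfect, mismatch)
```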
ingvar
2012/2/29 Dave Dolan <dave....@gmail.com>:
It's been a while since I was involved in bioinformatics and my
understanding of BLAST is pretty superficial. But representing such
recursive structures in HGDB is certainly possible and I hope
relatively easy. You would just create Java classes representing
Words and the different types of expressions, where each type of
expression is simply a link, with no bean properties or anything of
the sort and that's it. You shouldn't be thinking in terms of Java
types vs. HGDB types - they should be two sides of the same coin. Every
atom (including HGDB types) has a corresponding JVM runtime
representation in the form of a Java object that naturally has a Java
type. When you write your algorithms, you usually write in terms of
the Java types that are mapped somehow to HGDB types.
I can't say more about what the right representation would be since it
very much depends on the type of operations/queries that you'd be
doing.
Boris
On Wed, Feb 29, 2012 at 7:06 AM, Ingvar Bogdahn
<ingvar....@googlemail.com> wrote:
--
http://www.kobrix.com - HGDB graph database, Java Scripting IDE, NLP
http://kobrix.blogspot.com - news and rants
"Frozen brains tell no tales."
-- Buckethead