indexing strategy

iain.d...@gsk.com

Mar 2, 2005, 4:27:18 AM
to cp...@googlegroups.com

I'm looking into the possibility of altering the search capability of cPath so that I can search across other specific fields. For example, I am adding new xrefs to interactors and would like to search across only that entry.

I get the impression that this can be done with Lucene, but I'm not certain. Can you tell me if this is possible with Lucene and cPath? If so, could you point me in the right direction?

Thanks for the help

Iain K

Ethan Cerami

Mar 3, 2005, 10:50:51 AM
to cp...@googlegroups.com
Hello Iain,

Lucene makes field-specific searching trivially easy. If you want to
index additional fields beyond those already indexed, first check out
the cPath ItemToIndex class. This is a generic interface that I wrote
that is used to automatically index any number of fields. As we add new
data formats to cPath, we implement a new ItemToIndex for each new format.

Currently, we only index PSI interaction records. So, check out
PsiInteractionToIndex. This implements the ItemToIndex interface, and
encapsulates one PSI Interaction record. You would simply need to
modify this class to extract the xrefs, and index them in a new field
(although keep in mind that xrefs are already indexed as part of the
"interactor" field; see the FAQ:
http://www.cbio.mskcc.org/cpath/faq.do#fields).
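
As a rough illustration of the idea (a toy Python sketch, not cPath or Lucene code), the snippet below indexes one record under two fields: a broad "interactor" field and a narrower "xref" field. The names FieldIndex and interaction_to_fields, the record layout, and the sample accessions are all assumptions made up for this example; the real ItemToIndex interface may look quite different.

```python
from collections import defaultdict


class FieldIndex:
    """Toy per-field inverted index; a stand-in for a Lucene index."""

    def __init__(self):
        self._postings = defaultdict(set)  # (field, term) -> set of doc ids

    def add(self, doc_id, fields):
        # `fields` maps a field name to the list of values indexed under it.
        for field, values in fields.items():
            for value in values:
                self._postings[(field, value.lower())].add(doc_id)

    def search(self, field, term):
        # Only the named field is consulted, so the search is field-specific.
        return sorted(self._postings.get((field, term.lower()), set()))


def interaction_to_fields(interaction):
    """Hypothetical analogue of an ItemToIndex implementation for one record."""
    names = [i["name"] for i in interaction["interactors"]]
    xrefs = [x for i in interaction["interactors"] for x in i["xrefs"]]
    return {
        "interactor": names + xrefs,  # broad field that already includes xrefs
        "xref": xrefs,                # new, narrower field holding xrefs only
    }


# Illustrative sample record (assumed layout).
record = {
    "interactors": [
        {"name": "TP53", "xrefs": ["P04637"]},
        {"name": "MDM2", "xrefs": ["Q00987"]},
    ]
}
index = FieldIndex()
index.add(1, interaction_to_fields(record))

print(index.search("xref", "P04637"))      # the accession is in the xref field
print(index.search("xref", "TP53"))        # names are not in the xref field
print(index.search("interactor", "TP53"))  # the broad field holds names too
```

The point of the separate "xref" field is only that a query can be restricted to it; the broad field keeps working as before.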

Let me know if you have any follow-up questions.

Ethan
--
Ethan Cerami
Computational Biology Center
Memorial Sloan-Kettering Cancer Center
http://cbio.mskcc.org
Email: cer...@cbio.mskcc.org
Direct phone: (646) 735-8082

iain.d...@gsk.com

Mar 7, 2005, 4:13:52 AM
to cp...@googlegroups.com

Great. We're making use of the PSI-MI interaction attribute list for some in-house data, so I may specifically index across these attributes.

I'll have a look at the classes soon, hopefully.

Many thanks

Iain K

iain.d...@gsk.com

Mar 11, 2005, 8:15:57 AM
to cp...@googlegroups.com

Hi Guys,

I have a follow-up question about querying index terms. Please correct any flawed assumptions I have made.

I get the impression that the indexing process handles one interaction at a time; it collects terms from the interactors involved in the interaction as well as terms from the interaction itself. These terms, in addition to general content, then make up a Lucene record which can be queried against.

My problem lies in querying on distinct interactor information. For example, querying on species will bring back records which match that species, so if an interaction contains 3 interactors from 3 different species and one of them matches, the interaction will be returned.
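
To make that behaviour concrete, here is a small Python sketch of the model described above. It is a toy illustration only: the record layout and the function names are assumptions, and no real Lucene or cPath calls are involved.

```python
def flatten_species(interaction):
    """Collect the species of every interactor into one multi-valued field."""
    return {i["species"] for i in interaction["interactors"]}


def matches_species(interaction, species):
    # A field query matches as soon as ANY value in the field matches.
    return species in flatten_species(interaction)


mixed = {
    "id": "int-1",
    "interactors": [
        {"species": "homo sapiens"},
        {"species": "mus musculus"},
        {"species": "rattus norvegicus"},
    ],
}
human_only = {
    "id": "int-2",
    "interactors": [
        {"species": "homo sapiens"},
        {"species": "homo sapiens"},
    ],
}

# Both records come back, although only int-2 is exclusively human.
hits = [i["id"] for i in (mixed, human_only) if matches_species(i, "homo sapiens")]
print(hits)
```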

We currently have data from a number of species and will want to query for interactions which involve exclusively one species. I get the impression that this currently can't be done?

A possible solution would be if Lucene allowed searching for records which contain only one value for a term; in my example, records which have only one species. Do you know if this is possible using Lucene, or should I pose it to a Lucene message board?

Thanks for the continued support.

Iain K

Ethan Cerami

Mar 11, 2005, 10:47:36 AM
to cp...@googlegroups.com
Hello Iain,

I don't think this is directly possible in Lucene, but Lucene
experts may provide some tips that I am not aware of. Of course, if
there are specific cases that you want to filter out, you could use a
NOT filter. For example: "homo sapiens" NOT "XXX", but I doubt this is
what you are after.

I can also think of a few (not so great) solutions:

1) it's possible to create separate indexes. For example, you could
create a separate text index for each species.
2) you could create a post-lucene filter. For example, query lucene
for matches, and then run these matches through a non-lucene species
specific filter.

The problem with 1 is lots of indexes, and the number grows as we add
more species. The problem with 2 is performance.
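
Option 2 could look roughly like the following Python sketch. It assumes the text search has already returned a list of hits and that each hit exposes the species of its interactors; the record layout and the function name single_species_only are hypothetical.

```python
def single_species_only(hits, species):
    """Post-search filter: keep hits whose interactors ALL belong to `species`."""
    return [
        hit
        for hit in hits
        if hit["interactors"]
        and all(i["species"] == species for i in hit["interactors"])
    ]


# Pretend these came back from a text search for "homo sapiens".
search_hits = [
    {"id": "int-1", "interactors": [{"species": "homo sapiens"},
                                    {"species": "mus musculus"}]},
    {"id": "int-2", "interactors": [{"species": "homo sapiens"},
                                    {"species": "homo sapiens"}]},
    {"id": "int-3", "interactors": [{"species": "homo sapiens"}]},
]

kept = single_species_only(search_hits, "homo sapiens")
print([hit["id"] for hit in kept])
```

Every hit has to be inspected after the search, which is where the performance cost comes from.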

Do you need this for batch processing or general purpose web viewing?

Ethan


iain.d...@gsk.com

Mar 11, 2005, 11:25:15 AM
to cp...@googlegroups.com

Hi Ethan,

Thanks for the quick response.

We are planning on a handful of semi-ad-hoc queries. We're trying to integrate a tool which needs to use a couple of interactor filters on the queries. I am pretty sure that the filters will need to be applied to all interactors involved.

Separate indexes sound interesting. Is it possible to introduce a second "interactor only" index, so that queries can be carried out across both indexes? I have no idea how feasible the mechanics of this are, though. It doesn't seem too straightforward.

Regards

Iain K

Ethan Cerami

Mar 11, 2005, 12:02:39 PM
to cp...@googlegroups.com
Iain,

It's definitely possible, and relatively straightforward, to create an
"interactor only" index. You could just iterate through all
PHYSICAL_ENTITIES (a.k.a. "interactors") in the cpath table, and index
them one at a time. For a while, I thought about having one
"interaction" index, and one "interactor" index. But, in the end, I
found it much simpler to just index the interactions along with
interactors as one unit, and just maintain one index.

I suppose if you had an "interactor" index, you could say something
like: "give me all interactors with the keyword XXX and of type 'homo
sapiens'". Then, using this list, you could query cpath directly for a
list of all interactions associated with those interactors. However,
even then, you would still need to write a filter for removing
interactions between two species.
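
The two-step lookup described above might be sketched like this in Python. Everything here is a toy assumption: the dictionaries stand in for an "interactor" index and for the cpath tables, and the identifiers, keywords, and function names are invented for illustration.

```python
# Stand-in for an "interactor only" index (assumed layout).
interactors = {
    "A": {"species": "homo sapiens", "keywords": {"kinase"}},
    "B": {"species": "homo sapiens", "keywords": {"kinase"}},
    "C": {"species": "mus musculus", "keywords": {"kinase"}},
    "D": {"species": "homo sapiens", "keywords": {"receptor"}},
}
# Stand-in for the interaction records: interaction id -> interactor ids.
interactions = {
    "int-1": ["A", "D"],
    "int-2": ["B", "C"],
    "int-3": ["C", "D"],
}


def find_interactors(keyword, species):
    """Step 1: query the interactor index by keyword and species."""
    return sorted(
        key for key, rec in interactors.items()
        if keyword in rec["keywords"] and rec["species"] == species
    )


def interactions_for(interactor_ids):
    """Step 2: look up every interaction that involves any of those interactors."""
    wanted = set(interactor_ids)
    return sorted(k for k, members in interactions.items() if wanted & set(members))


def within_one_species(interaction_id, species):
    """Step 3: the extra filter that removes cross-species interactions."""
    return all(interactors[m]["species"] == species for m in interactions[interaction_id])


matched = find_interactors("kinase", "homo sapiens")
candidates = interactions_for(matched)
result = [c for c in candidates if within_one_species(c, "homo sapiens")]
print(matched, candidates, result)
```

Note that step 2 still returns int-2, which pairs a human and a mouse interactor, so the final filter remains necessary even with a separate interactor index.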

What kind of ad-hoc queries are you thinking of?

Ethan


Gary Bader

Mar 11, 2005, 1:05:30 PM
to cp...@googlegroups.com
Hi Iain,
It seems that a Lucene query, followed by a filter of the result set
would work better for dealing with specific query types that are
important for a user, but are not possible or easy with the Lucene query
language. The case you mention sounds like a good example of this,
because it is dependent on some knowledge of the record structure above
the simple 'field' level. Custom queries are always difficult to deal
with in bioinformatics in a general manner because they can get very
complex. Do you think it would be useful for cPath to support custom
queries programmatically using a 'complex query filter' interface? That
way, anything very complicated can be clearly defined as such and
limited to specific types of classes. (The downside is that it might be
inefficient in the worst case because you return a large set and then
filter it instead of just returning what you want up front.)
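
One possible shape for such a 'complex query filter' interface, sketched in Python purely as a discussion aid. Nothing like this exists in cPath as far as this thread shows; the type alias QueryFilter, the helper names, and the record layout are all assumptions.

```python
from typing import Callable, Dict, Iterable, Iterator, Sequence

Record = Dict[str, object]
QueryFilter = Callable[[Record], bool]  # a filter is any predicate on a record


def apply_filters(records: Iterable[Record], filters: Sequence[QueryFilter]) -> Iterator[Record]:
    """Yield only the records accepted by every registered filter."""
    for record in records:
        if all(accept(record) for accept in filters):
            yield record


def exclusively(species: str) -> QueryFilter:
    """Filter factory: accept records whose interactors are all of one species."""
    return lambda record: set(record["species"]) == {species}


def min_interactors(n: int) -> QueryFilter:
    """Filter factory: accept records with at least `n` interactors."""
    return lambda record: len(record["species"]) >= n


# Pretend result set from the text search; "species" lists one entry per interactor.
result_set = [
    {"id": "int-1", "species": ["homo sapiens", "mus musculus"]},
    {"id": "int-2", "species": ["homo sapiens", "homo sapiens"]},
    {"id": "int-3", "species": ["homo sapiens"]},
]

filtered = list(apply_filters(result_set, [exclusively("homo sapiens"), min_interactors(2)]))
print([r["id"] for r in filtered])
```

Keeping each filter as a small, separate predicate is one way to keep complicated logic clearly marked as such and out of the core search routine.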

Thanks,
Gary

iain.d...@gsk.com

Mar 14, 2005, 4:11:39 AM
to cp...@googlegroups.com

Ethan, Gary,

Creating two separate indexes seems simple but, as you said, querying across both may be inefficient. I wondered if Lucene supported any concept of multiple indexes effectively combining results? That is, querying one index with certain criteria and another index with different criteria, then combining the results (like a relational join). This is more of a relational database concept than a text index one.

A concrete example is where we are looking for all interactions that are exclusively "homo sapiens", so no interactors from other species can be involved. I get the impression that we won't need too many interactor filters, but I'm pushing to get most of them defined now, so I have a better idea of how to tackle the problem.

It looks like a post-query filter may be the only option. Short term, I could use the web service that I have sitting above cPath to filter the results, so I intercept the PSI-MI that cPath returns and filter it further. This is far from ideal as it's inefficient, and paging the results will be awkward. Gary's suggestion may be useful here. If there were a programmatic "plugin-filter" mechanism, then I could catch and filter the results at a much better place. I hope this wouldn't be too inefficient. I should carry out some load testing if it gets implemented, as one of our reasons for changing one of our current systems is efficiency problems.
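
To show why paging gets awkward with a post-query filter, here is a minimal Python sketch. The hit layout and the function name filtered_page are assumptions for illustration; the point is only that page boundaries have to be computed on the filtered stream, not on the raw hits.

```python
from itertools import islice


def filtered_page(hits, keep, page, page_size):
    """Return one page (0-based) of the hits that pass the `keep` predicate."""
    start = page * page_size
    matching = (hit for hit in hits if keep(hit))  # lazy: stops once the page is full
    return list(islice(matching, start, start + page_size))


# Ten pretend hits; odd ids are human-only, even ids are mixed-species.
raw_hits = [
    {"id": n, "species": {"homo sapiens"} if n % 2 else {"homo sapiens", "mus musculus"}}
    for n in range(1, 11)
]


def human_only(hit):
    return hit["species"] == {"homo sapiens"}


for page in range(3):
    print(page, [h["id"] for h in filtered_page(raw_hits, human_only, page, 2)])
```

Each page request re-scans the raw hits from the start, so later pages cost more than earlier ones; that is the kind of overhead load testing would need to measure.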

Thanks for the continued help.

Gary Bader

Mar 14, 2005, 12:18:05 PM
to cp...@googlegroups.com
Hi Iain,
    Lucene supports Boolean logic in queries, which is usually good enough for most searches. The problem with this specific search is that field-specific indexing does not differentiate between different interactors in the same interaction. It would be easy to do if the PSI-MI data model used binary interactions: then you could say "give me all records where A=human AND B=human" (you can do this in BIND). But a PSI-MI interaction is a set of 1..n interactors, so the binary approach doesn't work. I think this would be quite a complex query in SQL as well, unless the tables were designed with it in mind. If you want to do it in Lucene, you would have to use NOT and list all of the other organisms in the database in the query. This would be a long query, but might work out well if there are only a few organisms in there (not a good design, but might be useful as a temporary workaround).
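
A minimal Python sketch of building such a NOT query string. The field name "organism" is a placeholder assumption (check the cPath FAQ for the actual field names), the function name is invented, and the list of organisms would have to come from the database.

```python
def exclusive_species_query(target, all_species, field="organism"):
    """Build a Lucene-style query: match `target`, exclude every other species."""
    others = sorted(s for s in all_species if s != target)
    clauses = ['{}:"{}"'.format(field, target)]
    clauses += ['NOT {}:"{}"'.format(field, s) for s in others]
    return " ".join(clauses)


# Assumed list of organisms currently in the database.
organisms = {"homo sapiens", "mus musculus", "saccharomyces cerevisiae"}

query = exclusive_species_query("homo sapiens", organisms)
print(query)
```

The query grows by one NOT clause per organism and has to be rebuilt whenever a new organism is loaded, which is why it is only a temporary workaround.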
    We likely want to use the filtering mechanism to keep the core search routines simple (unless we can think of something better), but it would also be interesting to know what kinds of queries are important to users so that we can keep them in mind in the design.

Thanks,
Gary