> Hi,
Hi Abhishek! I'd be happy to help you figure out how to integrate your code with Whoosh.
> The change I want to incorporate:
> I will talk in terms of Vector Space Model.
> I have made some changes to the scheme for matching a query term against the labels of the inverted index. My matching schemes use LCS (longest common subsequence) and LD (Levenshtein distance) instead of simple direct matching. So I want to replace the base matching scheme with my own schemes.
Do you want to be able to say "find all documents containing this term" but do term matching using a custom function? If so, that's easy to do with a custom Query object.
Whoosh already has a FuzzyTerm query that finds documents containing terms within N Damerau–Levenshtein distance of a given search term.
FuzzyTerm uses an FSA (finite-state automaton) built from all the terms in a field for speed (the same structure also supports spell checking), and that might also help with LCS.
If you can explain exactly what you want to do in relation to search, I can better explain how to do it with Whoosh.
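In the meantime, here is a rough pure-Python sketch of what "term matching using a custom function" could look like conceptually. It does not use Whoosh at all: `levenshtein`, `lcs_length`, and `expand_terms` are hypothetical helper names made up for illustration, the term list is invented, and the brute-force scan over every term is a stand-in for the automaton FuzzyTerm uses. It also computes plain Levenshtein distance, not the Damerau–Levenshtein variant.

```python
def levenshtein(a, b):
    """Plain Levenshtein edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def expand_terms(query_term, index_terms, is_match):
    """Return every index term that the custom predicate accepts.

    Hypothetical helper: a linear scan over the terms of one field.
    """
    return [term for term in index_terms if is_match(query_term, term)]


# Invented example terms, standing in for the terms of one field.
index_terms = ["render", "renders", "reader", "shading"]

# Accept terms within one edit of the query term.
by_distance = expand_terms(
    "rendar", index_terms, lambda q, t: levenshtein(q, t) <= 1)

# Accept terms sharing a subsequence of at least 5 characters with the query.
by_lcs = expand_terms(
    "rendar", index_terms, lambda q, t: lcs_length(q, t) >= 5)
```

The idea is that a custom query would expand the search term into the set of index terms the predicate accepts, and then match documents containing any of those terms.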
> Also it would be really helpful if you could help me understand the basics of the code.
Whoosh's architecture is quite similar to Lucene's:
(I need something like this in the docs ;)
* An Index is based on a Schema, which contains a list of all the Fields. (A schema can have "dynamic fields", so you can say "any time the user tries to index a field that matches *_int, use this Field object".) -- see src/whoosh/fields.py
* A Field defines how input in a certain document field (e.g. "title") is indexed and looked up. The basic function of a Field object is to translate some input into sortable bytestring "terms". For example, the NUMERIC field converts numeric input (ints, floats, Decimal) into sortable bytestrings. -- see src/whoosh/fields.py
* A Format is wrapped by a Field and controls how a posting is represented as bytes on disk. This allows custom per-posting information similar to Lucene's "payload" feature. -- see src/whoosh/formats.py
* A Codec object provides a high-level interface (e.g. "Add this term to the index") to lower-level reading and writing subsystems (e.g. an on-disk hash file). This is not documented yet. -- see src/whoosh/codec/*.py
* A Writer object wraps a Codec and provides a user-level API for adding to the index (e.g. "add_document"). -- see src/whoosh/writing.py
* The current writing architecture uses an append-only, log-structured merge (LSM) approach, where new documents are written to a new sub-index (called a segment). Over time, smaller segments are merged into larger segments.
* A Reader object wraps a Codec and provides a user-level API for reading from the index. -- see src/whoosh/reading.py
* A Searcher object wraps a Reader and adds extra search-related methods based on the lower-level Reader methods. -- see src/whoosh/searching.py
* A Query object represents an abstract query, such as "find all documents containing the term 'foo' in the field 'bar'". It is independent of any particular Index. A complex query is represented as a "tree" of Query objects, e.g. And([Term("x", "y"), Term("w", "z")]). Whoosh has many built-in query types, such as Phrase, Range, FuzzyTerm, Near, And, Or, Nested, etc. -- see src/whoosh/query/*.py
* At search time, a call to the matcher() method on a Query object translates the Query tree into an equivalent Matcher tree specific to a given Searcher object. For example, a FuzzyTerm query object will be translated into a Union matcher of term matchers for all the index terms within N edits of the search term. Matchers do the actual work of examining term information in the index to find matching documents. -- see src/whoosh/matching/*.py
* A Collector object does the work of "running" the Matcher tree and collecting the results. The Collector returns a Results object, which lets you access Hit objects. Whoosh has several collector objects, such as one that only returns the top N results, and collectors that wrap other collectors, such as one that remembers which terms matched in which documents. -- see src/whoosh/collectors.py
Cheers,
Matt