Using whoosh to make search engine over OCRed data


Abhishek Gupta

Sep 8, 2013, 6:55:27 AM
to who...@googlegroups.com
Hi,
I am currently involved in a project to develop a search engine that will search over erroneous (OCRed) data. It's a GSoC project; you can see the proposal here (a more readable version here). I have designed some schemes for this, and now I have to incorporate them into a search engine for testing. I initially tried Lucene (PyLucene, actually), but even after a month of swimming through the Lucene source code I am unable to find all the places I would have to modify, and so far the count is around 8000. That is a huge task for a single person who doesn't know Java very well, and the Lucene source code is too complex.

So I decided to use a pure Python solution. Given the change I want to incorporate in Whoosh, what I want to know is whether it is going to be easy and doable by a single person or not. It would also be really helpful if you could help me understand the basics of the code. (For Lucene I read the book Lucene in Action.)

The change I want to incorporate:
I will talk in terms of the Vector Space Model.
I have designed schemes for matching a query term against the labels of the inverted index. My matching schemes use LCS (longest common subsequence) and LD (Levenshtein distance) instead of simple exact matching. So I want to replace the matching scheme at the base with my own schemes.

Please help me out!!

Thanking You,
Abhishek Gupta

Matt

Sep 8, 2013, 2:45:02 PM
to who...@googlegroups.com
> Hi,

Hi Abhishek! I'd be happy to help you figure out how to integrate your code with Whoosh.

> The change I want to incorporate:
> I will talk in terms of the Vector Space Model.
> I have designed schemes for matching a query term against the labels of the inverted index. My matching schemes use LCS (longest common subsequence) and LD (Levenshtein distance) instead of simple exact matching. So I want to replace the matching scheme at the base with my own schemes.

Do you want to be able to say "find all documents containing this term" but do term matching using a custom function? If so, that's easy to do with a custom Query object.

Whoosh already has a FuzzyTerm query that finds documents containing terms within N Damerau–Levenshtein distance of a given search term.

FuzzyTerm uses an FSA of all the terms in a field (that also supports spell checking) for speed, which could possibly also help with LCS.
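
For example, a minimal FuzzyTerm query might look like this (assuming you already have a searcher open; the exact defaults for maxdist and prefixlength may vary between versions):

    from whoosh.query import FuzzyTerm

    # Find documents whose "content" field contains a term within
    # two edits of "twitter", requiring the first letter to match
    q = FuzzyTerm("content", u"twitter", maxdist=2, prefixlength=1)
    results = searcher.search(q)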

If you can explain exactly what you want to do in relation to search, I can better explain how to do it with Whoosh.

> It would also be really helpful if you could help me understand the basics of the code.


Whoosh's architecture is quite similar to Lucene's:

(I need something like this in the docs ;)

* An Index is based on a Schema, which contains a list of all the Fields. (A schema can have "dynamic fields", so you can say "any time the user tries to index a field that matches *_int, use this Field object".) -- see src/whoosh/fields.py

* A Field defines how input in a certain document field (e.g. "title") is indexed and looked up. The basic function of a Field object is to translate some input into sortable bytestring "terms". For example, the NUMERIC field converts numeric input (ints, floats, Decimal) into sortable bytestrings. -- see src/whoosh/fields.py

* A Format is wrapped by a Field and controls how a posting is represented as bytes on disk. This allows custom per-posting information similar to Lucene's "payload" feature. -- see src/whoosh/formats.py

* A Codec object provides a high-level interface (e.g. "Add this term to the index") to lower-level reading and writing subsystems (e.g. an on-disk hash file). This is not documented yet. -- see src/whoosh/codec/*.py

* A Writer object wraps a Codec and provides a user-level API for adding to the index (e.g. "add_document"). -- see src/whoosh/writing.py

* The current writing architecture uses an append-only log structured merge (LSM) approach, where new documents are written to a new sub-index (called a segment). Over time smaller segments are merged into larger segments.

* A Reader object wraps a Codec and provides a user-level API for reading from the index. -- see src/whoosh/reading.py

* A Searcher object wraps a Reader and adds extra search-related methods based on the lower-level Reader methods. -- see src/whoosh/searching.py

* A Query object represents an abstract query, such as "find all documents containing the term 'foo' in the field 'bar'". It is independent of any particular Index. A complex query is represented as a "tree" of Query objects, e.g. And([Term("x", "y"), Term("w", "z")]). Whoosh has many built-in query types, such as Phrase, Range, FuzzyTerm, Near, And, Or, Nested, etc. -- see src/whoosh/query/*.py

* At search time, a call to the matcher() method on a Query object translates the Query tree into an equivalent Matcher tree specific to a given Searcher object. For example, a FuzzyTerm query object will be translated into a Union matcher of term matchers for all the index terms within N edits of the search term. Matchers do the actual work of examining term information in the index to find matching documents. -- see src/whoosh/matching/*.py

* A Collector object does the work of "running" the Matcher tree and collecting the results. The Collector returns a Results object, which lets you access Hit objects. Whoosh has several collector objects, such as one that only returns the top N results, and collectors that wrap other collectors, such as one that remembers which terms matched in which documents. -- see src/whoosh/collectors.py
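
To make that concrete, here's a minimal end-to-end sketch of how those pieces fit together (field and directory names are just for illustration; "indexdir" must already exist):

    from whoosh import index
    from whoosh.fields import Schema, TEXT, ID
    from whoosh.qparser import QueryParser

    # Schema -> Index -> Writer, then Searcher -> Query -> Results
    schema = Schema(path=ID(stored=True), content=TEXT)
    ix = index.create_in("indexdir", schema)

    writer = ix.writer()
    writer.add_document(path=u"/a", content=u"the quick brown fox")
    writer.commit()

    with ix.searcher() as searcher:
        query = QueryParser("content", ix.schema).parse(u"fox")
        results = searcher.search(query, limit=10)
        for hit in results:
            print(hit["path"])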

Cheers,

Matt


Matt Chaput

Sep 8, 2013, 9:23:56 PM
to who...@googlegroups.com
> * At search time, a call to the matcher() method on a Query object translates the Query tree into an equivalent Matcher tree specific to a given Searcher object. For example, a FuzzyTerm query object will be translated into a Union matcher of term matchers for all the index terms within N edits of the search term. Matchers do the actual work of examining term information in the index to find matching documents. -- see src/whoosh/matching/*.py

I forgot:

* A WeightingModel object represents a document scoring algorithm, e.g. BM25F. At search time, a WeightingModel is used to create a Scorer object, which is passed to the leaves of the matcher tree to actually score matching documents. However, it is not REQUIRED for a Matcher's score() function to use the Scorer to calculate the score... a matcher may simply compute its own score value where that makes sense. -- see src/whoosh/scoring.py
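
A bare-bones custom model might look something like this (a rough sketch: it scores each posting by its raw stored weight, ignoring IDF and length normalization):

    from whoosh import scoring

    class RawWeighting(scoring.WeightingModel):
        # Called at search time to make a Scorer for a (field, term) pair
        def scorer(self, searcher, fieldname, text, qf=1):
            return RawScorer()

    class RawScorer(scoring.BaseScorer):
        # Score a matched posting by its stored weight (term frequency)
        def score(self, matcher):
            return matcher.weight()

    # usage: searcher = ix.searcher(weighting=RawWeighting())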

Cheers,

Matt

Abhishek Gupta

Sep 9, 2013, 5:54:06 AM
to who...@googlegroups.com
Hey Matt, thanks for replying so fast and for explaining everything (though I had already gotten the gist of the Whoosh architecture, since I have been diving into the Whoosh source code and documentation for the last two days).

Now first I will explain in detail what I want to do:

What I have to do:
I am building a search system that will search over erroneous data (because I am searching OCRed data). For example, suppose I am searching for abhishek and there is no abhishek in my whole corpus, but there are bishek, ahiek, .... I must return those as well.

My Schemes:
As I already told you, my schemes are based on LCS and LD. But there is a twist: I am not using them directly, because doing so would also match twitter with platter, the same way abhishek matches bishek. The errors in the data come from wrong predictions by the OCR, but in the twitter vs. platter case (suppose twitter is my search query and platter is a word appearing in many documents), there is very little chance that platter appeared in the data through a misreading of twitter, because there is very little chance of any letter of twitter being replaced by any letter of platter by the OCR, except for i -> l. There are also other kinds of errors introduced by OCR, such as deletion of a letter or insertion of a new letter.

To get rid of these kinds of errors, I have developed a character error model that lists the probability of each letter being converted into every other letter by the OCR, and I have made my LCS and LD schemes probabilistic (PLD and PLCS). For example, take the two words twitter and platter: plain LCS would return tter, but PLCS may return itter if the probability of misreading i as l passes the pre-specified threshold. So I return all the possible matches, each with a probability of how correct the match could be.
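
To illustrate the idea, here is a rough sketch of PLD (conf_prob is a stand-in for my character error model; my real scheme returns match probabilities rather than a plain distance):

    # Probability-weighted Levenshtein distance (sketch).
    # conf_prob[(a, b)] = probability that OCR misreads a as b.
    def pld(query, term, conf_prob, default=0.001):
        m, n = len(query), len(term)
        d = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            d[i][0] = float(i)              # deletions
        for j in range(1, n + 1):
            d[0][j] = float(j)              # insertions
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                a, b = query[i - 1], term[j - 1]
                # substitution is cheap when OCR confusion is likely
                sub = 0.0 if a == b else 1.0 - conf_prob.get((a, b), default)
                d[i][j] = min(d[i - 1][j] + 1.0,      # delete a
                              d[i][j - 1] + 1.0,      # insert b
                              d[i - 1][j - 1] + sub)  # substitute a -> b
        return d[m][n]

    # e.g. pld(u"twitter", u"twltter", {(u"i", u"l"): 0.7}) -> 0.3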

How it differs from the schemes already present in Whoosh:
You suggested FuzzyTerm, but this is different. In my case there is no problem with the query (the user's need is conveyed properly in the query); the errors are in the data. That is, I don't have to consider different versions of the query terms; I have to consider different versions of every term in the data. And I can't simply take an N-gram approach, because the word may be totally different.
Let me put it in very simple terms:
I am going to give a partial yes for the matching of all the following pairs: (i,l), (e,c), (d,l), (b,l), (v,u), (P,R), but your schemes won't consider that.

What I have to do with Whoosh:
  1. I have to modify the collector code to use my matching schemes.
  2. My matching schemes return a probability, which I then have to use in scoring.

What I want from you:
I already tried modifying Lucene's collector, but I had no success, because I couldn't fully understand the collector code; it pulls in everything related to UTF encoding and how the index is stored.

Also, I was unable to find all the places in Lucene where I would have to make this change, because there are multiple collectors and I would have to modify them all. Here it will be easier for me, since I can use a debugger to follow the flow of the code; I was unable to use any debugger with PyLucene.

Just help me understand the collector and all the points I have to modify.

Thanking you,
Abhishek Gupta




Matt Chaput

Sep 9, 2013, 12:06:44 PM
to who...@googlegroups.com
Sorry, I wasn't clear. I'd like to know how you want it to work with the search engine at a high level. For example, here's a guess:

"I want to be able to search for 'twitter' and have it also find,
for example, documents containing 'twltter' or 'twlHer', based on
my algorithm. I'd like to score each document based on which
generated terms matched."

If that is correct, here's an outline of what you'll need:

* A custom Query type representing "search for this term using Abhishek Gupta's algorithm". The query would be very similar to the "ExpandingTerm" subclasses in Whoosh (FuzzyTerm and Variations). You would need to override the query's expanded_terms() method to "expand" the user's search term into the "error" words to look for in the inverted index.

* A custom Matcher type generated by the Query type. The matcher would wrap a normal term matcher, but return a custom score based on your algorithm (using, I guess, some combination of the sub-matcher's score and the amount of "error" in the term).

* If your scoring algorithm needs to know which terms matched in a document, then you need a custom Collector that will look into the matcher tree at each matched document, find the matching terms, and use them as inputs to the algorithm. (The custom Matcher may need to have properties on it indicating which Query it was generated from, if you need to know which matched terms came from each original search term.)

Unfortunately, these three areas (subclassing ExpandingTerm, custom Matchers, and custom Collectors) are under-documented because they're a bit low level.
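
To give a rough idea of the first two pieces, here's a sketch (hedged: I'm following the pattern of the Variations query; ocr_variants() and term_prob are hypothetical stand-ins for your error model, and the exact method names vary a bit between Whoosh versions):

    from whoosh.matching import WrappingMatcher
    from whoosh.query import ExpandingTerm

    class OCRTerm(ExpandingTerm):
        # Hypothetical query: matches OCR-error variants of a term
        def __init__(self, fieldname, text, boost=1.0):
            self.fieldname = fieldname
            self.text = text
            self.boost = boost

        def _btexts(self, ixreader):
            # Expand the user's term with the error model, keeping
            # only the variants that actually occur in the index
            to_bytes = ixreader.schema[self.fieldname].to_bytes
            for word in ocr_variants(self.text):  # hypothetical
                btext = to_bytes(word)
                if (self.fieldname, btext) in ixreader:
                    yield btext

    class OCRScoreMatcher(WrappingMatcher):
        # Hypothetical matcher: scales the wrapped term matcher's
        # score by the error model's probability for the variant
        def __init__(self, child, term_prob):
            WrappingMatcher.__init__(self, child)
            self.term_prob = term_prob

        def score(self):
            return self.child.score() * self.term_prob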

If you don't mind, I'll try implementing a quick skeleton of how you would do this in Whoosh (with a fake function plugged in where your algorithm would go). I'm interested to see for myself how easy or hard it will be.

Cheers,

Matt

Abhishek Gupta

Sep 9, 2013, 3:01:57 PM
to who...@googlegroups.com
Yes, you are right, and thanks for all the explanation you have provided!! That is exactly what I have to do.

But there is one error: I don't think I should use ExpandingTerm, because that would make my search system slow; according to my error model, a single word can expand into many words. So what I have planned is, instead of using the normal term matcher, to write my own matcher that does the matching in terms of my scheme. I am going this way because, say, I can expand the word twitter into 10 different words based on my scheme (error model), but only two of those ten words are actually present in the corpus. This way I don't have to consider all the cases: there are many possible expansions of a word, but only some of them can appear in the corpus.

So please explain anything else you want to tell me about this scheme (writing my own matcher instead of using the normal term matcher).

I really appreciate your help, but for the sake of learning I first want to do it on my own, or I won't be satisfied with my work (or at least my effort). I will keep disturbing you whenever I have a doubt :P

Abhishek




Matt Chaput

Sep 9, 2013, 3:33:43 PM
to who...@googlegroups.com
> But there is one error: I don't think I should use ExpandingTerm, because that would make my search system slow; according to my error model, a single word can expand into many words. So what I have planned is, instead of using the normal term matcher, to write my own matcher that does the matching in terms of my scheme. I am going this way because, say, I can expand the word twitter into 10 different words based on my scheme (error model), but only two of those ten words are actually present in the corpus. This way I don't have to consider all the cases: there are many possible expansions of a word, but only some of them can appear in the corpus.

You can quickly tell which terms are in the index when you construct the Matcher object in Query.matcher(), e.g.:

    class MyQuery(Query):
        ...

        def matcher(self, searcher, context=None):
            fieldname = self.field()

            # Usually a query is created with unicode text
            text = self.text

            # Ask the field to convert the term text to bytes
            fieldobj = searcher.schema[fieldname]
            bytestring = fieldobj.to_bytes(text)

            if (fieldname, bytestring) in searcher.reader():
                # Searcher.postings() should maybe be called
                # Searcher.term_matcher() or something...
                return searcher.postings(fieldname, bytestring)
            else:
                return matching.NullMatcher()

The collector also prunes empty/finished matchers as it goes (in the Matcher.replace() method).

Since the index maps terms to lists of the documents containing them, I'm not sure how your matcher will work other than by generating all the error variations and doing a big Union of them (like the Variations query does). You could set the field to have a forward index (vector=True) and then examine the words in every document, but I think that would be slower. But maybe I just don't understand, so I'll let you work it out :)
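
For reference, the "big Union" approach looks roughly like this (a sketch; variants is whatever list of words your error model produces for the search term):

    from whoosh import matching

    def union_matcher(searcher, fieldname, variants):
        # Build a term matcher for each variant present in the index
        fieldobj = searcher.schema[fieldname]
        reader = searcher.reader()
        ms = []
        for word in variants:
            btext = fieldobj.to_bytes(word)
            if (fieldname, btext) in reader:
                ms.append(searcher.postings(fieldname, btext))
        if not ms:
            return matching.NullMatcher()
        # Fold the matchers into a chain of binary unions
        m = ms[0]
        for other in ms[1:]:
            m = matching.UnionMatcher(m, other)
        return m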

Let me know whenever you need more info :)

Cheers,

Matt

Abhishek Gupta

Oct 16, 2013, 8:01:59 AM
to who...@googlegroups.com
Hi,

I was working on implementing the idea and following the flow of the code for a fuzzy query. But the prefix length I supplied to the query is not working correctly.

You can see the sample code for this query here. Since the prefix length is 3, the second document should not be returned, but all documents are coming back as matches:

[screenshot: sample output showing every document returned as a match]

The query object created for the fuzzy term contains both `prefixlength` and `prefix`, and I am unable to understand what `prefixlength` is. See this:

[screenshot: the FuzzyTerm query object, showing both `prefix` and `prefixlength`]

I think there is some problem, so I dug into it a little. I think I have found where prefix (3) is replaced with prefixlength (1):
[screenshot: call trace showing prefix (3) being replaced by prefixlength (1)]


According to me this is a bug. But if I am wrong, please help me figure out why this is happening.




Matt Chaput

Oct 16, 2013, 11:58:17 AM
to who...@googlegroups.com
> According to me this is a bug.

According to me too :) The plugin isn't grabbing the prefix number for
some reason. I'll try to figure it out when I have a minute.

Thanks,

Matt


Abhishek Gupta

Oct 16, 2013, 12:57:30 PM
to who...@googlegroups.com
I have also pointed out the place where you are passing prefixlength (1) instead of prefix (3); you should pass prefix there instead. I also don't know what the prefixlength variable is for, given that the prefix variable is already there!





Matt Chaput

Oct 17, 2013, 2:51:27 PM
to who...@googlegroups.com
> According to me too :) The plugin isn't grabbing the prefix number
> for some reason. I'll try to figure it out when I have a minute.

Fixed in 1ea83e32ee34.

Matt

