> On Jan 19, 2015, at 11:27 PM, elber kam <elbe...@gmail.com> wrote:
>
> Hi guys. I am new to IR systems and looking to learn about them. I have used the Whoosh samples and would like to know how it works under the hood. Which data structures and algorithms does it employ? Is it a B-tree? What is the format of the .seg file? Is there any link or document I can read to find out?
- It uses different file formats for terms, stored documents, per-doc values, and postings.
- For terms, it currently uses the cdb algorithm (for hash-based lookups), with the keys written in order and an index of term positions added on to the end of the file for range lookups. In the next major version this will be replaced with something more like a read-only two-level B-tree.
- Whoosh works by writing the different types of information (terms, stored docs, per-doc values, postings) to different files. By default, these are then concatenated into a single .seg file (with an index), and the whole thing is just mmap-ed for reading.
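To make the "concatenate into one file with an index, then mmap" idea concrete, here is a minimal sketch in Python. It is not Whoosh's actual .seg layout or code: the byte layout (blobs, then a directory, then an 8-byte pointer to the directory) and the names `write_compound`, `CompoundReader` and `roundtrip` are all made up for illustration.

```python
import mmap
import os
import struct
import tempfile


def write_compound(path, parts):
    """Concatenate named byte blobs into one file and append a directory.

    Hypothetical layout (NOT Whoosh's real format):
      [blob 0][blob 1]...[directory][8-byte offset of the directory]
    """
    directory = []
    with open(path, "wb") as f:
        for name, data in parts.items():
            # Remember where each sub-file starts and how long it is.
            directory.append((name, f.tell(), len(data)))
            f.write(data)
        dir_offset = f.tell()
        f.write(struct.pack("<I", len(directory)))
        for name, offset, length in directory:
            encoded = name.encode("utf-8")
            f.write(struct.pack("<H", len(encoded)))
            f.write(encoded)
            f.write(struct.pack("<QQ", offset, length))
        # The last 8 bytes tell a reader where the directory begins.
        f.write(struct.pack("<Q", dir_offset))


class CompoundReader:
    """Read sub-files out of the compound file through a read-only mmap."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        (dir_offset,) = struct.unpack_from("<Q", self._map, len(self._map) - 8)
        (count,) = struct.unpack_from("<I", self._map, dir_offset)
        pos = dir_offset + 4
        self.directory = {}
        for _ in range(count):
            (name_len,) = struct.unpack_from("<H", self._map, pos)
            pos += 2
            name = self._map[pos:pos + name_len].decode("utf-8")
            pos += name_len
            offset, length = struct.unpack_from("<QQ", self._map, pos)
            pos += 16
            self.directory[name] = (offset, length)

    def read(self, name):
        # Slicing an mmap copies just this range out as bytes.
        offset, length = self.directory[name]
        return self._map[offset:offset + length]

    def close(self):
        self._map.close()
        self._file.close()


def roundtrip(parts):
    """Write the parts to a temporary compound file and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.seg")
        write_compound(path, parts)
        reader = CompoundReader(path)
        try:
            return {name: reader.read(name) for name in parts}
        finally:
            reader.close()
```

Calling `roundtrip({"terms": b"...", "stored": b"...", "postings": b"..."})` should hand back the same bytes for each name. The point of the single mapped file is that a reader opens one file handle and lets the OS page in only the ranges it actually touches.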
One of the things on the to-do list is a big clean-up to make the code more understandable, but there will always be a tension between keeping the code readable and doing things in complex ways for performance, since Python is slow to begin with.
Cheers,
Matt