"Schmidt" <n...@vbRichClient.com> escribi� en el mensaje
news:kj75aq$3cg$1...@dont-email.me...
Hello Olaf:
What I'm trying to achieve is some "intelligent" text search.
The search term may be a single word, in that case it's not so much complex,
but it also can be a search phrase with several words at the same time.
I have a database with "documents", that have several fields.
One field is the Title, another field is the summary (those are the two more
important ones), and I have also some other text fields where to seach the
phrase.
In total I have 11 fields, one is the Title, another one the Summary, and
for the rest they can be more than one per document (so I have them is
another related table).
I have the search routine half done. I build a list if Documents_IDs (it's
an array) with all the documents IDs that have relevance for the query, also
I have another array with RevelanceValues where I save a number according to
the relevance of that document for the phrase.
First, I search in the title, it's a normal text search, I add the ones that
math and assign a relevance value in RevelanceValues, it's a high value
number because the relevance is strong if the phrase appears in the Title
exactly as written by the user.
After that I perform a normal search in the Summary, also assigning a
relevance value.
I do the same for the other 9 fields.
After that, I go more fine and I want to perform a search by Lemma.
I already have stored in the database all the Spanish words with their
lemmas.
Each word can have more than one lemma.
A lemma is like the entry in a dictionary, the main word without any
inflection.
It's similar to a Stem (I know you have experience with stemming), but it's
not the same thing.
The stem is how the word starts, the "root" (althought "root" is not a
proper word to use because it means something else) or common characters at
the beginning of words that are the same word but with different
inflections.
For example in English "wait", "waited", "waiting" have the stem "wait".
Here there is a more extensive explanation:
http://en.wikipedia.org/wiki/Word_stem
A lemma takes into account the meaning and not only how the word starts.
For example, in English "seen" and "see" has the lemma "see" that in that
case is the same for the stem, also "see", but "saw" has also the lemma
"see", but it has another stem.
In other words, lemmas takes into account that the words can be irregular
and to have another declination or morphology.
Lemma explained in wikipedia:
http://en.wikipedia.org/wiki/Lemma_%28morphology%29
In English this kind of search it's a lesser complex problem, because the
verbs has only three tenses (and only the third person has a difference
adding an "s" or "es" at the end) and the words are very much regular, but
the Spanish have a lot of verbs conjugations, and also suffixes can be added
to the verbs like "informing to us" ("informando a nosostros") can become
"informingus" (inform�ndonos") or "remind it to me" ("recuerda esto a m�")
can be written as "remindmeit" ("recu�rdamelo") , and there are many words
that are not regular so this kind of search becomes more important.
So, first I have all the Spanish words with their lemmas already stored in a
database.
And I've made three tables that are intermediate tables and must be updated
every time that there are editions to the documents or new documents are
added before performing queries. But it's not a problem because it's not
that they will enter new information and to want to perform queries at once.
They will enter all the documents information, prepare the database for the
final user (including updating these intermediate tables) and ship the
database to the final users that will only be able to perfor queries but not
to enter new documents or edit the one it has. So the update of these tables
takes some seconds (before being able to perform queries after editing data)
but it's not a problem.
So the program has two states of working: "data entry" or "final user" -or
"query"-.
The complete list of Spanish words is in a separate database file. This will
not be shipped to the final user and is used only in the data entry
location.
About the three intermediate tables, one of the tables is a list of all the
Spanish words but only the ones used in the documents (I decided not to
include all the Spanish words in the final user database because it would
add 100 Mb to the database file).
So these three tables, one has the Spanish words used by the documents and
with their corresponding lemmas. Each word can have more than one lemma
(adding this to the complexity). For example in English, the word "saw" has
(at least) two lemmas: "to see" (a verb) and "saw" -a tool to cut wood- (a
noun). Most of the words have only one lemma, some has two, to have more
than two is more rare, but there are words that have up to 5 lemmas. So in
this table I have 5 fields for lemmas: Lemma1, Lemma2, Lemma3, Lemma4 and
Lemma5.
Each word in this table is an unique entry, and it has a Word_ID. This is
the primary key.
Also there is another field to point whether the word is a "stop word":
http://en.wikipedia.org/wiki/Stop_words
In the second table I have three fields: Document_ID, Words_ID and a third
field to know how many times that word appears in the document.
The primary key is Document_ID + Words_ID
In the third table I have:
Document_ID
Field_Number
Position
Word_ID
Document_ID Identifies a document. Field_Number is a number from 1 to 11, 1
meaning Title, 2 meaning Summary, and so on. This is were the words were
taken from.
Position is a number with the word position in the text.
Word_ID is the ID of the word, the ID that the word has in the first table.
So, I have all this design (that I'm not so sure any more that it's a good
design), and I'm trying to search for a phrase but taking into account that
it can be with other words inflections. Also, to ignore (or not) stop words,
to threat all the pronouns as same (or not), to be able to find if the words
are, but not exactly in the same order as the user wrote it in the query,
also if they are nearby or far away from each other... and to return a
different relevance number for each case.
Also, to consider the times each word appear in the text, if they are the
exact word or another one with the same lemma.
But I got stuck trying to build a recordset with the list of Document_ID's
that has the three words (but only if they have the three words in the same
Field_Number -in a three words query phrase example: not one word found in
the Title and the other two in the Summary-).
Also, I wanted to return for each phrase all the documents that has not the
exact words but words that have the same lemmas. I already has built the
list of words with the same lemmas for each word (of the query phrase) in an
array. So that's why I said in the example I posted = 100 OR = 102.
I know it's somewhat complex, but I still don't figure how I should have
stored the words in a better proper way to accomplish this.
I still can do what I want not with a full recordset, and with many
FindFirst's searches, but that will be very slow I think.