There are assumed to be 2 input streams S1 and S2, which are modified
versions of the original tag-stripped input stream. The
transformations we apply are:
[1] Strip all punctuation marks so that words like she'll and they'll
are converted to shell and theyll. Also, "and, they went to" is
transformed to "and they went to".
[2] Convert all series of whitespaces to a single space character.
Sentences like "hello world she shall" are transformed to "hello
world she shall".
Next, we choose one of the streams(S2), and create a hash table of all
pairs of words.
Suppose S1 and S2 are as follows:
S1: A B D A C B E R W A S D Q W D R A D B C A B
S2: T Y A B D G C B E R Q A S D Q D R A D C A B H
The hash table H is:
TY, YA, AB, BD, DG, GC, CB, BE, ER, RQ, QA, AS, SD, DQ, QD, DR, RA,
AD, DC, CA, AB, BH
next, we start processing stream S1 pair wise.
The first one is AB. Search for AB in H and if found create a new
region, and add it to an 'active region list'.
In this case, it is added to a list of active regions. The pair AB was
found at location '2' in the stream S2. Now, we move on to the next
pair, BD. BD is found in H, so we first create a new region with BD as
the first element, and since it is also in the previous region, we
extend the previous region. Now, there are 2 active regions. We always
add pairs found in the hash table(H) as a new region in anticipation
that the actual final region will be starting at this pair.
The rest of the algorithm is explained below:
* If we
* find a match to a pair in a region that is not currently in the
* list of regions, we create a new region, and add that to the list
* of regions to be included to considering in the final output. If a
* match occurs in a position which is just after the last offset
* added in a particular region, we increment the last offset counter
* for that region, and increment the cont_pairs variable. Even if the
* new pair is not in the next position, but is within num_mismatches,
* we still set the last_offset variable in the region_struct, but
* don't increment the cont_pairs variable. We don't set nm_mism to
* zero in this case, but we do in the previous case. Again, if in a
* run the region's pairs don't match, the num_mism variable is
* incremented. If num_mism exceeds the num_mismatches variable, then
* the region is discarded. However, if the region's cont_pairs
* variable is sufficiently high(2*continuous_pairs), then the region
* is shifted into the stable stale regions list instead of being
* discarded. This list is also considered when the final choosing
* algorithm for the lists is executed.
HTH,
-Dhruv.
--
-Dhruv Matani.
http://www.geocities.com/dhruvbird/
"Be sure brain is in gear before engaging mouth"
-- Anonymous