How does it work?

Josh Tauberer

Sep 14, 2011, 6:57:50 PM
to SuperFastMatch
Hey guys. Thanks Donovan for going open, and to Tom and the Sunlight
team for helping.

Could you explain a little about what SFM actually does or how it
works? The presentation link in an earlier post is 404ing now --- feel
free to direct me somewhere to read up first. :)

I've been applying text comparison tools to legislative language for a
few years. Finding out when a bill is incorporated into another bill
in Congress has been one of the most requested features on GovTrack.us
(my site), and my sense from reading the Sunlight Labs post on SFM is
that SFM might be able to help with that.

Looking forward to more,

Josh

Donovan Hide

Sep 14, 2011, 7:06:19 PM
to superfa...@googlegroups.com
Hi Josh,

Thanks for joining the group, and apologies for the lack of documentation so far! A brief overview of the system and future plans can be seen here:


Superfastmatch is well suited to indexing and searching document sets such as legislation and journalism. It is best at finding exact duplicated passages of text, rather than just documents sharing common terms, which is what packages such as Lucene or Xapian are aimed at.

Following is a description of the algorithms involved. I plan to write an introductory tutorial with some screenshots, hopefully next week. 

Imagine three short strings/documents normalised to be lower case with a short hash window of 3:

String A: "the theory"
String B: "the theorising"
String C: "the theorizing"

These can be tokenised as follows:

Tokens for A: ["the","he ","e t"," th" ,"the","heo","eor","ory"]
Tokens for B: ["the","he ","e t"," th" ,"the","heo","eor","ori","ris","isi","sin","ing"]
Tokens for C: ["the","he ","e t"," th" ,"the","heo","eor","ori","riz","izi","zin","ing"]

Hashing these tokens yields the following (made-up) hash sequences:

Hashes for A: [099,089,001,023,099,456,032,198]
Hashes for B: [099,089,001,023,099,456,032,020,876,672,101,004]
Hashes for C: [099,089,001,023,099,456,032,020,432,092,014,004]
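
To make this concrete, here is a rough Python sketch of the tokenising and hashing step. It is only an illustration: the real system is C++ and uses its own hash function, so the values it produces won't match the made-up numbers above.

import hashlib

WINDOW = 3  # the hash window size used in the example above

def tokenise(text, window=WINDOW):
    # Slide a window of `window` characters over the lower-cased text.
    text = text.lower()
    return [text[i:i + window] for i in range(len(text) - window + 1)]

def hash_token(token):
    # Any hash will do for illustration; kept small so the values are readable.
    return int(hashlib.md5(token.encode()).hexdigest(), 16) % 1000

docs = {
    "A": "the theory",
    "B": "the theorising",
    "C": "the theorizing",
}
hashes = {name: [hash_token(t) for t in tokenise(text)]
          for name, text in docs.items()}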

We can see that the hash for "the" (099) is the most common. If we invert this structure we get:

099: A,B,C
089: A,B,C
001: A,B,C 
023: A,B,C
456: A,B,C
032: A,B,C
198: A
020: B,C
876: B
672: B
101: B
004: B,C
432: C
092: C
014: C

Note that the repetition of "the" in all strings is not double counted. Using this structure it is possible to find the documents that share the most substrings. E.g., for string C it is clear that string B shares the most 3-character grams with it.
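
In rough Python, continuing the sketch above (again just an illustration, not the actual implementation):

from collections import defaultdict

# Invert: hash -> set of documents containing it. Using a set means a hash
# repeated within one document (like "the" above) is not double counted.
index = defaultdict(set)
for name, hash_seq in hashes.items():
    for h in hash_seq:
        index[h].add(name)

def common_gram_counts(query):
    # Credit every other document that shares a distinct gram with the query.
    counts = defaultdict(int)
    for h in set(hashes[query]):
        for other in index[h]:
            if other != query:
                counts[other] += 1
    return dict(counts)

print(common_gram_counts("C"))  # B should come out on top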

Using this information, and assuming that collisions are infrequent for a wide enough hash width, we can rank a massive document set by number of common grams and then select the top 10, 20, ... documents for the next stage. An extra algorithm for dealing with the higher collision rates of narrow hash widths is to walk the hashes in document order, record the last seen position of a match for each document, and only increment that document's count if the distance from its previous match is below a threshold. This acts as a low-pass filter that removes collisions, which is nice!
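
Sketched in the same illustrative Python, with max_gap standing in for the distance threshold (the name and value are made up for the sketch):

def filtered_common_gram_counts(query, max_gap=10):
    # Walk the query's hashes in document order. A hit against another
    # document only counts if it falls within max_gap positions of the
    # previous hit against that same document, which damps isolated collisions.
    counts = defaultdict(int)
    last_seen = {}  # other document -> position of its previous hit
    for pos, h in enumerate(hashes[query]):
        for other in index.get(h, ()):
            if other == query:
                continue
            if other in last_seen and pos - last_seen[other] <= max_gap:
                counts[other] += 1
            last_seen[other] = pos
    return counts

# Rank the whole set and keep the top candidates for the next stage.
ranked = sorted(filtered_common_gram_counts("C").items(),
                key=lambda kv: kv[1], reverse=True)
candidates = [name for name, _ in ranked[:10]]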

With these ranked matching candidates we can then do the next stage, which is to build up the full set of common substrings between two documents (let's call them X and Y). This is covered in:


This is the key discovery of superfastmatch and is also the hardest to explain! The first stage reuses the same hash sequences as above to build up a set of the hashes in Y, and at the same time constructs a Bloom filter, which is cheaper to query for membership than the set. Next, we loop over the hashes of X and, for each hash that is in the Y set, record the positions of the match in a vector against that hash. The next stage filters out any bad matches and creates a new structure which is guaranteed to contain all common hashes, crucially, in document order for both sides. That ordering is the vital, happy side effect that lets the next stage of the algorithm work efficiently.
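
A simplified Python version of that structure (the Bloom filter, which only serves as a cheap pre-check before the set lookup, is left out):

def common_hash_positions(x_hashes, y_hashes):
    # Remember where each hash occurs in Y, then walk X in order and pair
    # each matching X position with the list of Y positions for that hash.
    # The result is ordered by X position, and each Y list is ordered too.
    y_positions = defaultdict(list)
    for pos, h in enumerate(y_hashes):
        y_positions[h].append(pos)

    matches = []  # list of (x_position, [y_positions]) pairs
    for pos, h in enumerate(x_hashes):
        if h in y_positions:
            matches.append((pos, y_positions[h]))
    return matches

matches = common_hash_positions(hashes["B"], hashes["C"])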

The next stage is a finite state machine which eats the previous data structure, greedily chomping away to form the longest possible chain of common hashes. I need to reinstate a debug output for this; it's too hard to explain in detail, but it's essentially two autonomous cursors which go as far forward as they can through X and Y and then retreat to start on the next common string. This is powerful because it can deal with common chunks that are far apart in document position in X and Y. That is a problem diff algorithms struggle with, since they can't recognise that something has moved from the bottom of one document to the top of another.
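
In the same illustrative Python, a much-simplified stand-in for that state machine, which just greedily extends runs of consecutive common hashes (the real cursor logic is more involved):

def greedy_chains(matches, window=WINDOW):
    # A chain of n consecutive matching window-hashes corresponds to a
    # common substring of n + window - 1 characters.
    pairs = {(x, y) for x, ys in matches for y in ys}
    consumed = set()
    chains = []
    for x, y in sorted(pairs):
        if (x, y) in consumed:
            continue
        length = 0
        while (x + length, y + length) in pairs:
            consumed.add((x + length, y + length))
            length += 1
        chains.append((x, y, length + window - 1))  # (pos in X, pos in Y, chars)
    return chains

chains = greedy_chains(matches)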

After this there is a whitespace-trimming pass, which strips whitespace from each common string and checks that the string is still above the chosen window size. Finally the common strings are sorted by length and Hey Presto!
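
Roughly, continuing the sketch (the real code's position bookkeeping may differ):

def finalise(chains, x_text, min_length=WINDOW):
    # Trim whitespace from each common string, drop any that have fallen
    # below the window size after trimming, and sort the survivors longest first.
    results = []
    for x, y, length in chains:
        raw = x_text[x:x + length]
        lead = len(raw) - len(raw.lstrip())  # leading whitespace trimmed off
        trimmed = raw.strip()
        if len(trimmed) >= min_length:
            results.append((trimmed, x + lead, y + lead, len(trimmed)))
    return sorted(results, key=lambda r: r[3], reverse=True)

print(finalise(chains, docs["B"]))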

So to go back to strings B and C, we would get this output:

"the theori" 0,0,10
"the" 0,4,3
"ing" 12,12,3

where the numbers are the position in B, the position in C, and the length of the common string.

Hope this helps.

Donovan.

Josh Tauberer

Oct 1, 2011, 1:48:49 PM
to SuperFastMatch
Thanks for the background!

I intend to dig in deeper when I get a chance, hopefully soonish.

Josh
> https://github.com/mediastandardstrust/superfastmatch/blob/master/src...