Crude search implemented


Donovan Hide

Jul 20, 2011, 7:06:49 PM
to superfa...@googlegroups.com
Hi,

have just done a quick commit that includes a crude search to let you
start doing some analysis of the likely results. Have a look at:

https://github.com/mediastandardstrust/superfastmatch/blob/master/TODO

to see what is coming next! The key missing bits are the association
task and the associations page which will mean that you don't have to
do any manual searches and can review all the results in aggregate
form.

Let me know how you get on!

Cheers,
Donny.

James Turk

Jul 21, 2011, 12:55:31 PM
to superfa...@googlegroups.com
Excellent, I'm currently working from home so I don't have direct
access to the SFM instance I've set up but I'm eager to try out the
search. I have enough model bills now that I should be able to test
for some known collisions as early as tomorrow.

Also, this TODO list looks great, I was actually going to ask about
JSON responses as I have a rudimentary Python client in the works that
I'm finding useful for test purposes and that'll be nice to have.

-James

Tom Lee

Jul 21, 2011, 1:54:03 PM
to superfa...@googlegroups.com
I feel a bit embarrassed to ask this, but could one of you distinguish the search and association tasks for me?

Donovan Hide

Jul 21, 2011, 2:40:03 PM
to superfa...@googlegroups.com
Hi!

Need to do a glossary! A search is just a phrase for an arbitrary set of associations where an association is a set of text fragments common to two documents. The association task will generate the associations between two sets of documents, typically split by document type.

Basically, at the moment you can search using some text pasted into a form field. Soon you will be able to compare every document in a doc type set with every document in another doc type set.
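
To make the distinction concrete, here is a minimal Python sketch. It is not superfastmatch's actual implementation: it assumes an association can be approximated by indexing fixed-size character windows and extending each hit into a maximal shared run. The names `windows`, `associate`, `search` and `association_task`, and the default window size of 15, are hypothetical.

```python
def windows(text, window_size):
    """Map each window_size-character substring to the positions where it starts."""
    index = {}
    for i in range(len(text) - window_size + 1):
        index.setdefault(text[i:i + window_size], []).append(i)
    return index


def associate(doc_a, doc_b, window_size=15):
    """Return (start_a, start_b, fragment) for each maximal run shared by both texts."""
    index_b = windows(doc_b, window_size)
    seen = set()
    fragments = []
    for i in range(len(doc_a) - window_size + 1):
        for j in index_b.get(doc_a[i:i + window_size], []):
            if (i, j) in seen:
                continue  # already covered by a longer run found earlier
            # Extend the match to the right for as long as both texts agree.
            length = window_size
            while (i + length < len(doc_a) and j + length < len(doc_b)
                   and doc_a[i + length] == doc_b[j + length]):
                length += 1
            # Mark every window start inside this run as handled.
            for k in range(length - window_size + 1):
                seen.add((i + k, j + k))
            fragments.append((i, j, doc_a[i:i + length]))
    return fragments


def search(text, documents, window_size=15):
    """A search: associate one pasted text with every stored document."""
    return {doc_id: associate(text, body, window_size)
            for doc_id, body in documents.items()}


def association_task(docs_a, docs_b, window_size=15):
    """The association task: compare every document in one set with every document in another."""
    return {(a, b): associate(text_a, text_b, window_size)
            for a, text_a in docs_a.items()
            for b, text_b in docs_b.items()}
```

In this sketch a search is just the one-against-many case, while the association task runs the same comparison for every pair drawn from two doc type sets.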

Hope that didn't read like gobbledegook!

Cheers,
Donny.

Sent from my iPhone

Tom Lee

Jul 21, 2011, 3:08:15 PM
to superfa...@googlegroups.com
That did the trick! Thanks, Donny.

Tom Lee

Aug 1, 2011, 11:25:00 AM
to superfa...@googlegroups.com
Hi Donny -- how're things going?  I see some exciting-looking commits, but have to confess I don't know precisely what they mean for us and our ability to run the association task.

Donovan Hide

Aug 1, 2011, 11:31:05 AM
to superfa...@googlegroups.com
Hi Tom,

there will be something that does associations automatically by the start of your working day tomorrow. Sorry for the continual postponements. Had to deal with the case where doc1 associates with doc2 and the converse where doc2 associates with doc1. Got a bit complicated :)

Cheers,
Donny.

Tom Lee

Aug 1, 2011, 11:53:18 AM
to superfa...@googlegroups.com
Exciting! Okay, thanks for letting me know.

Donovan Hide

Aug 1, 2011, 11:42:01 PM
to superfa...@googlegroups.com
Hi,

have committed something to have a look at. You can kick off the association task with this command:


but it's a bit buggy when displaying the results on a document page, it could be faster, and the filtering of short results needs to be tweaked.

In other words, as usual, I probably need a bit more time to get it working well...

I feel a bit up against it and am rushing the code, rather than write all the tests that make good software, so it might be worth having a conference call at some point soon just to set out what specifically you actually need to test and try and set a realistic deadline!

Cheers,
Donny.

Tom Lee

Aug 2, 2011, 9:54:19 AM
to superfa...@googlegroups.com
Thanks for this, Donny!  A call sounds like a good idea.  What's your availability like the rest of the week?  I imagine we'd want to do a call during our morning -- maybe tomorrow morning at 11 AM ET?

Donovan Hide

Aug 2, 2011, 9:59:38 AM
to superfa...@googlegroups.com
11AM ET == 4PM BST

which sounds good to me. Not sure if Martin can make it, he's in the remotest part of Ireland in a world without the Internet (lucky!).

Speak tomorrow.

Cheers,
Donny.

Tom Lee

Aug 2, 2011, 10:17:07 AM
to superfa...@googlegroups.com
Great. Talk to you then.

James Turk

Aug 2, 2011, 11:34:26 AM
to superfa...@googlegroups.com
One of the things I'm hoping we can look at is what we can do to get
better results out of our current contrived test case.

I'm loading all of the AZ bills that I have, and then running a search
with a model bill that I know exists. The results I'm getting are
short strings like "pursuant to the subsection", but I'm not getting
the entire paragraphs that clearly appear in both documents. I've
attached AZ SB 1070 and the relevant model bill for you to look at. As
you can see, entire paragraphs of the model bill (7K5..) appear in
1070. If you could look at why the match algorithm might be missing
these large overlaps that'd be helpful so we can tune our approach
accordingly.

-James

SB 1070_Introduced Version_1.htm.txt
7K5-No_Sanctuary_Cities_for_Illegal_Immigrants

Donovan Hide

Aug 2, 2011, 11:39:28 AM
to superfa...@googlegroups.com
Ace, perfect test case! The association part of the document view is currently buggy, will get this working with the supplied docs ASAP!

Donovan Hide

Aug 2, 2011, 5:17:55 PM
to superfa...@googlegroups.com
That example was really useful. It is amazing how much language is re-used in law!

Have grouped all identical fragments into a single row in the association table for both search results and the document view. It seems like the interesting stuff is usually more than 100 characters long, so I'd recommend running superfastmatch with a larger window size:

superfastmatch -window_size 100

What was really interesting was seeing how the boilerplate phrases have high frequencies even in a single document (indicated by comma-separated positions in the table with more entries). This will be very helpful with filtering. I can put the jquery tablesorter on the table view if that helps, and maybe colour-code the rows depending on frequency?
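
The grouping described above could be sketched roughly as follows in Python. This is an illustration under assumptions, not the code that was committed: the function name `group_fragments`, the `(position, text)` input shape and the `min_length` filter are all hypothetical, and `min_length` only mimics the effect of a larger window size by dropping short fragments.

```python
from collections import defaultdict


def group_fragments(fragments, min_length=100):
    """Collapse identical fragments into one row each, most frequent first.

    fragments: iterable of (position, text) pairs (hypothetical input shape).
    min_length: drop fragments shorter than this many characters.
    """
    positions = defaultdict(list)
    for position, text in fragments:
        if len(text) >= min_length:
            positions[text].append(position)
    rows = [
        {
            "text": text,
            # Comma-separated positions, as in the table described above.
            "positions": ",".join(str(p) for p in sorted(found)),
            "count": len(found),
        }
        for text, found in positions.items()
    ]
    # High counts float to the top, which makes boilerplate easy to spot.
    rows.sort(key=lambda row: (-row["count"], row["text"]))
    return rows
```

A row with many positions is a candidate for boilerplate, so the `count` field is the kind of value a sortable or colour-coded table could key on.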

It's starting to look useful, and is definitely more usable than it was this morning :)

Cheers,
Donny.

James Turk

Aug 3, 2011, 10:21:06 AM
to superfa...@googlegroups.com
Hi,

I pulled the changes and did a make clean and then tried rebuilding
and rerunning with the same test data I had been using.

Unfortunately I'm now getting segfaults during document load,
generally 500-1000 documents in. I hadn't seen this happen before and
I'd loaded tens of thousands of documents. I'm pretty reliably
getting a segfault now, but not on any particular document.

I'm afraid there isn't that much info I can give besides that right
now, let me know if there is a way I can help debug if you can't
reproduce it.

2011-08-03T09:18:46.545504-05:00: [INFO]: Queued document: Document(1,744) for indexing queue id:744 Response Time: 0.0000 secs
2011-08-03T09:18:46.545521-05:00: [INFO]: (127.0.0.1:41350): PUT /document/1/744/ HTTP/1.1: 202
Segmentation fault
make: *** [run] Error 139

-James

Donovan Hide

Aug 3, 2011, 10:26:32 AM
to superfa...@googlegroups.com
Hi James,

just pushed the fix, there was a big nasty memory leak in the document loading code where the bloom filter was being overwritten and the original was never deleted. Out of interest, are you PUT-ting or POST-ing the documents when you add them? A PUT will associate immediately, and a POST will defer until a POST to /association/ occurs.
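
For a Python client, the PUT versus POST choice might look something like the sketch below. Only the `/document/<doctype>/<docid>/` and `/association/` paths come from this thread; the base URL, the form-encoded `text` field and the helper names `document_request` and `association_request` are assumptions made for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: host and port of a local superfastmatch instance.
BASE_URL = "http://127.0.0.1:8080"


def document_request(doc_type, doc_id, text, associate_now=True):
    """Build the request that adds a document.

    PUT associates immediately; POST defers until /association/ is POSTed.
    """
    method = "PUT" if associate_now else "POST"
    # Assumption: the body is a form-encoded "text" field.
    body = urlencode({"text": text}).encode("utf-8")
    url = f"{BASE_URL}/document/{doc_type}/{doc_id}/"
    return Request(url, data=body, method=method)


def association_request():
    """Build the request that triggers the deferred associations."""
    return Request(f"{BASE_URL}/association/", data=b"", method="POST")
```

Each request would then be sent with `urllib.request.urlopen(request)`. Bulk loading with POST followed by a single POST to `/association/` avoids associating on every insert.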

Anyway, will speak in 30 minutes!

Cheers,
Donny.