New Release

Donovan Hide

Oct 7, 2011, 7:43:06 PM

to superfa...@googlegroups.com

Hi all,

I have pushed a fairly large set of commits with the following features:

* Much more stable when dealing with large numbers of documents (James,
I think this will fix your problem from the other day).
* Improved performance through using a rolling hash and other optimisations.
* A query interface for documents which paves the way for powerful
selective associating. Currently it is only used on the documents list
page, but it will shortly roll out to the search and association API
calls. Some examples:

http://127.0.0.1:8080/document/1;3/
http://127.0.0.1:8080/document/1-2;4-5/
http://127.0.0.1:8080/document/?order_by=characters
http://127.0.0.1:8080/document/?order_by=-title&limit=10
http://127.0.0.1:8080/document/?limit=10&order_by=title&cursor=merchant_of_venice.txt:3:3

* Better search results. Previously the associations tended towards
longer documents. Now they are based on the total number of matches.
* A simple load.sh helper script that loads all text files in a
folder structure, with one doctype per directory and added metadata
based on each document's parent folder.
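
For anyone unfamiliar with the rolling hash idea mentioned in the list above, here is a minimal sketch in Python. It is not the superfastmatch code: the base, modulus, window size and function names are all assumptions chosen for illustration. The point is only that each new window hash is derived from the previous one in constant time instead of rehashing the whole window.

```python
from typing import List, Set

# Illustrative constants (assumptions, not superfastmatch's real parameters).
BASE = 257
MOD = (1 << 61) - 1


def direct_hash(window: bytes) -> int:
    """Hash one window from scratch in O(len(window))."""
    h = 0
    for byte in window:
        h = (h * BASE + byte) % MOD
    return h


def rolling_hashes(data: bytes, size: int) -> List[int]:
    """Return the hash of every size-byte window, updating in O(1) per step."""
    if size <= 0 or len(data) < size:
        return []
    top = pow(BASE, size - 1, MOD)  # weight of the byte that leaves the window
    h = direct_hash(data[:size])
    hashes = [h]
    for i in range(size, len(data)):
        h = (h - data[i - size] * top) % MOD  # drop the oldest byte
        h = (h * BASE + data[i]) % MOD        # shift and add the newest byte
        hashes.append(h)
    return hashes


def common_windows(a: bytes, b: bytes, size: int) -> Set[int]:
    """Window hashes present in both inputs (candidate matching passages)."""
    return set(rolling_hashes(a, size)) & set(rolling_hashes(b, size))
```

Calling `common_windows(doc_a, doc_b, 5)` on two byte strings gives the hashes of 5-byte windows they share. A real matcher would still need to confirm candidates against the text, since different windows can collide on the same hash.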

The next two things which I aim to implement very shortly are
multi-threaded associations, which should scale linearly with cores,
as we are CPU-bound, and JSON templates.
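
The reason this should parallelise well is that each document's associations can be computed without touching any other document's results. A rough Python sketch of that partitioning follows. The toy corpus, `shared_words` and `associate_all` are hypothetical names, and the word-overlap count merely stands in for the real association step. Note that CPython threads will not give a real speed-up for CPU-bound pure-Python work because of the global interpreter lock, so this only shows how the work splits, not the expected scaling of native threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict

# Hypothetical toy corpus: doc id -> text.
DOCS = {
    1: "to be or not to be",
    2: "not to be outdone",
    3: "something else",
}


def shared_words(doc_id: int, docs: Dict[int, str]) -> Dict[int, int]:
    """Toy 'association': count distinct words shared with each other document."""
    words = set(docs[doc_id].split())
    return {
        other: len(words & set(text.split()))
        for other, text in docs.items()
        if other != doc_id
    }


def associate_all(docs: Dict[int, str], workers: int = 4) -> Dict[int, Dict[int, int]]:
    """Associate every document independently, one task per document."""
    doc_ids = sorted(docs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each task only reads `docs`, so the tasks need no locking.
        results = pool.map(lambda doc_id: shared_words(doc_id, docs), doc_ids)
        return dict(zip(doc_ids, results))
```

Because no task writes shared state, the only serial part is collecting the results at the end.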

Any questions or bugs, let me know!

Cheers,
Donny.

Tom Lee

Oct 7, 2011, 8:13:54 PM

to superfa...@googlegroups.com

This all sounds great, Donny -- thanks as always. I will admit to vague ambitions of tackling the JSON templating myself, but given how many years it's been since I wrote a line of C++, you should certainly not wait on me!

Donovan Hide

Oct 7, 2011, 8:31:17 PM

to superfa...@googlegroups.com

Don't be put off :)

It's mostly just template stuff documented here:

http://google-ctemplate.googlecode.com/svn/trunk/doc/guide.html

It's quite a traditional templating system in the sense that it
doesn't allow any logic in the template whatsoever!! But that's quite
reassuring in a way. I'm fairly certain it is the code that delivers
the Google search results, especially when you look at the examples!
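
To show what "no logic in the template" means in practice, here is a small Python sketch that uses the standard library's `string.Template` as a stand-in. It is not ctemplate and does not use its syntax. The `title` and `characters` fields are borrowed from the `order_by` examples earlier in the thread, but the JSON shape itself is an assumption. The template only has placeholders, and all escaping and decisions happen in code before substitution.

```python
import json
from string import Template

# A logic-free template: placeholders only, no loops or conditionals.
# The JSON shape here is hypothetical, not superfastmatch's actual output.
DOCUMENT_JSON = Template('{"title": $title, "characters": $characters}')


def render_document(title: str, characters: int) -> str:
    """Fill the template; json.dumps handles quoting and escaping of values."""
    return DOCUMENT_JSON.substitute(
        title=json.dumps(title),
        characters=json.dumps(characters),
    )
```

For example, `render_document('The "Merchant" of Venice', 120000)` returns a string that `json.loads` can parse back into the same two fields, with the embedded quotes escaped by the code rather than by the template.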

Have been thinking about how to make interesting use of the JSON
output. I'm sure you'll have come across d3.js. I really like this
example:

http://mbostock.github.com/d3/ex/splom.html

Especially the multiple selection when you drag across an individual
graph. Have thought about a set of concentric rings where each ring
represents a document, and common sections of text are highlighted as
a sectioned arc. Hovering over an arc highlights all the other
matching arcs. Would make it very quick and easy to see clusters of
matches between multiple documents at once. Need to do a demo...

A bit like this, but not quite:

http://mbostock.github.com/d3/ex/sunburst.html

Cheers,
Donny.

Tom Lee

Oct 7, 2011, 8:50:48 PM

to superfa...@googlegroups.com

Very nice! Yes, I agree there are some great possibilities here. And I've dug far enough into the code that I think I could implement the JSON template stuff; I just need to find some time to give it a shot and see if I'm fooling myself.
