Thanks for the feedback! Can I ask three questions?
* Can you paste the output of "git show" into a reply? Just want to
check what commit you have checked out and built.
* Do the documents you have loaded contain Unicode characters?
* Are you seeing the error in the User Interface or in the JSON output
(assuming you have a recent build with JSON output)?
I'm guessing you might have an old version which doesn't include this fix:
There was an error where offsets were incorrect if documents
contained Unicode characters, and this tended to worsen the further
down the document you went, because each multi-byte rune added
an offset error of between 1 and 3 characters. You should be
able to correct this with a simple "git pull" and "make".
In the long run I'd prefer to use this library:
for dealing with the situation a bit more cleanly.
Can I be nosy and ask what your use case is? Very interested in
people's applications of SFM :)
Cheers.
Donny.
> I don't think we installed it using git. Is there some other way I
> can check the version? One of our sys admins is installing the
> latest, which you pointed to. Our documents are all plain text with
> no Unicode characters. We're just examining the html output, which
> shows the To and From columns. The numbers there don't correspond to
> the character counts of the positions.
I think if you're still seeing HTML you probably have quite an old
version :) Give the latest a try, you might have to reload the
documents along with a -reset flag on the command line. It would be
really useful to see a screenshot of the error if it occurs with the
latest build. The current UI highlights the text as you scroll over
each fragment.
Also, after associating your corpus, have a look at the JSON output
from http://127.0.0.1:8080/document/12/99/, substituting 12 with
your doctype and 99 with your docid. If the fragments are
incorrect, I'd be very interested to see the two texts that are
associated, if possible. Are you using a public domain corpus?
> Actually, SFM is going to be extremely useful for us. I'm an English
> grad student at UVA. We're incorporating SFM into Juxta, which is a
> text collation tool suite. The problem with Juxta is that it can't
> find far-distance differences between texts. With SFM we can find the
> far-distance similarities and, by inversion, the differences. In the
> future we might ask you guys to help us completely integrate SFM with
> Juxta. At the moment, I'm just taking the results of SFM and
> uploading them to Juxta through an API.
Sounds very interesting! Hopefully the JSON output will help you speed
up the Juxta import process. If you want to specify an SFM export
format that works well with Juxta I'd be more than happy to bolt it on
to the API :) All the JSON output is created using Google ctemplate:
http://code.google.com/p/ctemplate/
so it's relatively easy to add alternative rendering in forms like
XML/SQL/GraphViz ...
Must admit I'm very curious about your corpus and research! Have you
tried searching long texts against themselves? It's really interesting
to pick out the repetitions in fiction, for instance "The gentleman in
the white waistcoat" in Oliver Twist. Another idea is comparing the
works of a single author. With Dickens, you can definitely see turns
of phrase being reused between books. In some cases, you can almost
tell which of his other books he might have been re-reading as he
wrote a chapter in another, the clusters are that apparent!
Anyway, let me know how you get on.
Cheers,
Donny.
> Is there any way you could work with that? Can I see the JSON output
> somewhere other than the browser?
I've posted an example of the output from a corpus I'm working with:
https://gist.github.com/1671797
This can be viewed in a browser by visiting:
http://127.0.0.1:8080/document/5/1
or viewed on the command line using a tool like curl:
curl http://127.0.0.1:8080/document/5/1
or accessed using any HTTP client such as urllib in Python or
equivalent in your favourite programming language.
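For instance, a minimal Python 3 sketch using urllib.request (the helper names are my own; the address and document path are the defaults mentioned above):

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # default SFM address used in this thread

def document_url(doctype, docid):
    # Build the JSON endpoint URL for a given doctype and docid
    return "%s/document/%d/%d" % (BASE_URL, doctype, docid)

def fetch_document(doctype, docid):
    # Fetch and decode the JSON output for one document
    with urllib.request.urlopen(document_url(doctype, docid)) as response:
        return json.load(response)

# e.g. fetch_document(5, 1) returns the same JSON that curl prints
print(document_url(5, 1))
```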
> We really just need an array of From and To positions, and lengths.
The interesting data for you is the fragments key for each matching
document which contains an array of arrays, eg:
"fragments" : [[410,2732,26,984972577],[423,2745,39,2859269951],[3154,6843,34,3835548155]],
Each entry is a match. Each match is an array of [from position,to
position,length,hash] where the hash is an aid to grouping identical
matches without doing a substring operation.
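As a quick Python sketch, here's how you might unpack the example above into named fields and group by hash (the field names are just my own labels):

```python
import json
from collections import defaultdict

# The "fragments" value from the example above
payload = ('{"fragments" : [[410,2732,26,984972577],'
           '[423,2745,39,2859269951],[3154,6843,34,3835548155]]}')

doc = json.loads(payload)

# Each entry is [from position, to position, length, hash]
matches = [
    {"from": f, "to": t, "length": length, "hash": h}
    for f, t, length, h in doc["fragments"]
]

# Group identical matches by hash, with no substring work needed
by_hash = defaultdict(list)
for m in matches:
    by_hash[m["hash"]].append(m)

for m in matches:
    print("from=%d to=%d length=%d" % (m["from"], m["to"], m["length"]))
```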
You might have to rewrite your importer to cope with this format, but
it is likely to be fairly stable now. The previous HTML output was
just a stopgap prior to this.
> Also, could we correspond via email? I forget to check the forum here
> sometimes.
Sure, my email is donov...@gmail.com, but it probably makes sense to
CC the mailing list on stuff that is of general interest.
Cheers,
Donny.
I've updated your subscription, so you should get this in your inbox :)
> Have a look at our sample here: http://chandra.village.virginia.edu:8000/
Great to see the UI load from the other side of the Atlantic!
> You'll notice that SFM breaks up the prose at the beginning of each
> document. It should include more text in each match, breaking only at
> the hyphens that break up words. See what I mean?
The Fragments window orders by from position by default, i.e. in
document order. To get the long matches first click the "Length" grid
header twice. The long matches are truncated and have an ellipsis mark
at the end, hover over them and an orange highlight should show the
full match.
The shorter matches seem to be a bit too short because the whitespace
threshold is set too low. With a setting of 0.5, if half of a
window is non-alphanumeric, then the match will stop. Try a higher
setting up to a limit of 1.0 to see if that helps. It's useful as a
filter when documents have lots of useless formatting like "* * * *"
chapter breaks.
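Here's a rough Python sketch of the idea (not SFM's actual code, just an illustration of how the threshold filters a window):

```python
def non_alnum_fraction(window):
    # Fraction of characters in a window that are non-alphanumeric
    return sum(1 for c in window if not c.isalnum()) / len(window)

def window_passes(window, threshold=0.5):
    # A match stops once the non-alphanumeric fraction of the
    # window reaches the threshold
    return non_alnum_fraction(window) < threshold

print(window_passes("The gentleman in"))  # mostly letters: passes
print(window_passes("* * * * * * * * "))  # all punctuation/spaces: rejected
```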
Let me know if this works!
Cheers,
Donny.
https://github.com/mediastandardstrust/superfastmatch/blob/master/src/registry.cc#L31-32
The command line parameter is "white_space_threshold", e.g.:
superfastmatch -white_space_threshold 0.8
Differing linebreaks between the two editions do seem to be the
problem! If it's OK, can you send me the two untouched text files
as attachments? Did you load them using the load.sh example script?
A carriage return should be counted as a single whitespace and
therefore be equivalent to a straightforward space. I'm a bit
confused, so the original documents will help me solve it.
Cheers,
Donny.
The latest commits (I hope!) contain a fix for the line ending problem
you were having. Line breaks were only being normalised for the
purposes of counting the amount of whitespace in a window and not for
calculating the actual hash. I've fixed this by clamping all ASCII
codes below 47 to a single value of 47. This is UTF-8 compatible and
seems to have improved things a lot in the sense that fewer, but
longer fragments are found.
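In Python terms, the normalisation is roughly this (a sketch of the idea, not the actual C++ — 47 is the code for '/'):

```python
def normalise(text):
    # Clamp every byte below 47 to 47, so spaces, tabs, CR and LF
    # all hash identically; UTF-8 multi-byte sequences use bytes
    # >= 0x80, so they pass through untouched
    return bytes(max(b, 47) for b in text.encode("utf-8"))

a = normalise("word one\nword two")
b = normalise("word one word two")
print(a == b)  # the line break and the space now hash the same
```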
Not sure I can easily deal with the added hyphenation that your two
examples have as this adds an extra character, but maybe the layout
decisions are interesting in themselves :)
Let me know if it works ok for you.
Cheers,
Donny.
> Search for: "ton cri d'oiseau surpris et de fascine" in the right-hand
> text. And click on the selection to align the left-hand text properly.
> Notice how the two matching lines following this line are broken up.
It looks like Juxta might well be trimming trailing and leading whitespace
before processing. You can do something similar using sed:
sed 's/^[ \t]*//;s/[ \t]*$//' chiens_2.txt > chiens_2.stripped.txt
or see:
http://www.cyberciti.biz/tips/delete-leading-spaces-from-front-of-each-word.html
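If sed isn't convenient, an equivalent Python sketch (mirroring the sed expression above):

```python
def strip_lines(text):
    # Remove leading and trailing spaces/tabs from every line,
    # the same effect as: sed 's/^[ \t]*//;s/[ \t]*$//'
    return "\n".join(line.strip(" \t") for line in text.split("\n"))

sample = "  ton cri \t\n\td'oiseau  "
print(strip_lines(sample))
```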
I could do this in SFM itself, but it would be hard to calculate the
positions for the untouched original document.
> Is there a way to display the SFM results on a scrolling page, so that I can search through them easily?
I'm currently working on a side-by-side view, much like Juxta's,
which will greatly assist in seeing matched fragments in the context
of both documents. Will let you know when it's ready for a test!
Hope that helps,
Donny.
I didn't realise that Juxta was showing SFM matches! It is a really
nice side-by-side view. Have you got a contact at Performant for the
developer behind the jQuery code?
I think the splits in the match are due to the fact that the leading
whitespace lengths are different between the two documents. If you
normalise both documents before loading them into SFM, by running them
through the sed snippet first, then the match should be longer and
correct. In other words, the error is in the source document which is
then reflected in both SFM and Juxta.
Cheers,
Donny.